WORK IN PROGRESS

March 21, 2026 (UTC)

Odoo PostgreSQL OOM and Memory Pressure Incident Runbook

Production-first runbook to triage, contain, and recover when PostgreSQL or Odoo processes are OOM-killed, including safe memory tuning, query containment, rollback, and verification.

When database memory pressure spikes, Linux can OOM-kill PostgreSQL backends or Odoo workers. Symptoms look random at first (worker restarts, canceled queries, 500s), but recovery needs strict ordering to avoid making it worse.

This runbook focuses on live containment first, then safe recovery and hardening.

Scope: memory pressure incidents on Odoo + PostgreSQL hosts (or shared node pools), including bad work_mem usage, query fan-out, and connection storms.

Incident signals

Common production signals:

  • Odoo logs: intermittent 500s, worker restarts, long request latency.
  • PostgreSQL logs: server process ... was terminated by signal 9, out of memory, or terminating connection because of crash of another server process.
  • Kernel logs: Out of memory: Killed process ... (postgres) or (python3).
  • p95/p99 request latency spikes during write-heavy paths.
odoocli logs tail --service postgres --since 20m --grep "out of memory|signal 9|terminated|FATAL"
odoocli logs tail --service odoo --since 20m --grep "Worker|restarting|MemoryError|500|timeout"
# On DB host
sudo dmesg -T | grep -Ei "out of memory|killed process|oom"

Step 0 - Stabilize blast radius

  1. Freeze deploys and schema changes.
  2. Pause non-essential high-memory workloads (large exports, BI cron, heavy queue jobs).
  3. Keep one operator handling DB parameter changes.
  4. Avoid broad restart loops; they hide the culprit query/workload.
# Example containment
odoocli scale --service odoo-worker --replicas 1
odoocli scale --service odoo-cron --replicas 0

Step 1 - Confirm where memory is being consumed

1.1 Host-level memory pressure

free -h
vmstat 1 10
ps -eo pid,comm,%mem,%cpu,rss --sort=-rss | head -20

If swap is thrashing hard and OOM kills continue, prioritize immediate load reduction before tuning.
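
As a quick triage aid, the swap-in/swap-out columns from `vmstat 1 10` can be averaged with an awk filter. This is a sketch; the helper name is ours, and the column positions assume the standard procps `vmstat` layout:

```shell
# Average swap-in (si) and swap-out (so) pages/s from vmstat samples.
# Columns 7 and 8 assume the standard procps vmstat layout.
avg_swap() {
  awk 'NR > 2 { si += $7; so += $8; n++ } END { if (n) printf "avg si=%d so=%d\n", si/n, so/n }'
}
# Production usage:
#   vmstat 1 10 | avg_swap
```

Sustained non-zero averages here, combined with continuing OOM kills, mean load shedding comes before any parameter tuning.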

1.2 PostgreSQL session/query pressure

-- Active sessions by state and age
select state, count(*)
from pg_stat_activity
where datname = current_database()
group by state
order by count(*) desc;
-- Long-running active queries
select pid,
       usename,
       now() - query_start as runtime,
       wait_event_type,
       wait_event,
       left(query, 200) as query
from pg_stat_activity
where state = 'active'
  and now() - query_start > interval '30 seconds'
order by runtime desc
limit 20;
-- Temp-file heavy databases often indicate sort/hash memory pressure
-- (these counters are cumulative since the last stats reset)
select datname,
       temp_files,
       pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
order by temp_bytes desc;

1.3 Current memory-related settings

show shared_buffers;
show work_mem;
show maintenance_work_mem;
show max_connections;
show effective_cache_size;

Rule of thumb during an incident: work_mem is a per-sort/per-hash cap, not a per-connection one, so a single complex query can allocate several multiples of it. High max_connections + high work_mem + many concurrent active queries is the classic OOM pattern.
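
A back-of-the-envelope check makes this concrete. The helper and all input numbers below are illustrative; it is an upper-bound sanity check, not a sizing formula:

```shell
# Rough worst-case sort/hash memory: connections x ops-per-query x work_mem.
# All inputs are illustrative examples, not recommendations.
worst_case_mb() {
  local conns=$1 ops_per_query=$2 work_mem_mb=$3
  echo $(( conns * ops_per_query * work_mem_mb ))
}
# 200 connections, 2 sort/hash ops each, work_mem=64MB:
worst_case_mb 200 2 64   # -> 25600 (MB)
```

If the result dwarfs host RAM minus shared_buffers, the settings cannot all be satisfied at peak concurrency.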

Step 2 - Contain safely (production order)

2.1 Stop query fan-out before changing DB memory knobs

  • Pause reporting/batch workloads first.
  • Keep customer-facing essentials online if possible.
  • Limit app concurrency temporarily.
odoocli scale --service odoo-worker --replicas 1
odoocli scale --service odoo-web --replicas 1

2.2 Cancel worst offenders (before terminate)

-- Ask query to stop cleanly first
select pg_cancel_backend(<pid>);
-- Use terminate only if cancel fails and blast radius is growing
select pg_terminate_backend(<pid>);

When you must terminate, target analytical/report sessions first, not critical OLTP transactions.

2.3 Apply temporary guardrails

Use conservative, reversible limits during incident response:

-- Example temporary caps (adjust to your environment)
alter role odoo set statement_timeout = '60s';
alter role odoo set idle_in_transaction_session_timeout = '120s';

If work_mem is clearly too high for the concurrency level, reduce it carefully. Note that role-level settings take effect on new sessions only; existing backends keep their current value until they reconnect:

-- Safer to scope by role/database than global if possible
alter role odoo in database <odoo_db_name> set work_mem = '16MB';

2.4 Restart only when pressure is controlled

If the postgres postmaster or key Odoo services are unstable, restart them only after fan-out is reduced:

# Unit name varies by distro (e.g. postgresql@16-main on Debian/Ubuntu)
sudo systemctl restart postgresql
sudo systemctl is-active postgresql
odoocli restart --service odoo-web

Step 3 - Recovery verification

3.1 No ongoing OOM kills

sudo dmesg -T | tail -n 100 | grep -Ei "out of memory|killed process|oom"

3.2 PostgreSQL health

select now(), pg_is_in_recovery();
select count(*) filter (where state = 'active') as active,
       count(*) filter (where state = 'idle in transaction') as idle_in_txn
from pg_stat_activity
where datname = current_database();

3.3 Odoo path checks

odoocli logs tail --service odoo --since 15m --grep "Traceback|MemoryError|500|timeout"
  • Login works
  • Create/update flows succeed
  • Cron backlog trends down after controlled re-enable

3.4 Gradual scale-up

odoocli scale --service odoo-cron --replicas 1
odoocli scale --service odoo-worker --replicas 2

Increase in steps and watch DB memory after each increment.
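
A simple headroom gate makes the step-ups mechanical. The function name, the 2048 MB threshold, and the `odoocli` invocation below are all illustrative:

```shell
# Only scale up when available memory exceeds a headroom threshold (MB).
# The 2048 MB default is illustrative; tune per host.
safe_to_scale() {
  local avail_mb=$1 threshold_mb=${2:-2048}
  [ "$avail_mb" -ge "$threshold_mb" ]
}
# Production usage (illustrative):
#   avail=$(free -m | awk '/^Mem:/ {print $7}')
#   safe_to_scale "$avail" && odoocli scale --service odoo-worker --replicas 2
```

Run the check between every increment, and stop scaling the moment the gate fails.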

Rollback / backout plan

If temporary tuning causes regressions (timeouts too aggressive, job starvation):

alter role odoo reset statement_timeout;
alter role odoo reset idle_in_transaction_session_timeout;
alter role odoo in database <odoo_db_name> reset work_mem;

Then:

  1. Keep background workload paused.
  2. Return to last known-safe replica counts.
  3. Reintroduce load gradually with narrower guardrails.

Hardening and prevention checklist

  • Put alerts on host memory, swap-in/out rate, and OOM-kill events.
  • Track pg_stat_database.temp_bytes and top temp-file-producing queries.
  • Keep max_connections realistic; use pooling where appropriate.
  • Set role-specific statement_timeout and idle_in_transaction_session_timeout.
  • Benchmark work_mem changes under real concurrency before production rollout.
  • Split heavy reporting/ETL from OLTP where possible.
  • Capacity-test Odoo worker counts against DB memory ceilings quarterly.
  • Document safe emergency profile (reduced workers + conservative query limits).
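
For the OOM-kill alerting item above, even a trivial counter over a kernel log stream is a workable start. The helper is ours and the grep pattern is a sketch; exact kernel messages vary by kernel version:

```shell
# Count OOM-related lines in a kernel log stream for a threshold alert.
# Pattern is a sketch; exact kernel messages vary by version.
oom_count() {
  grep -Eic 'out of memory|oom-killer|killed process' || true
}
# Production usage:
#   sudo dmesg -T | oom_count
```

A non-zero count over a short window is a high-signal page; there is no benign rate of OOM kills on a database host.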

References

  • PostgreSQL Documentation: Resource Consumption (shared_buffers, work_mem, maintenance_work_mem).
  • PostgreSQL Documentation: Monitoring (pg_stat_activity, pg_stat_database).
  • PostgreSQL Documentation: Server Signaling Functions (pg_cancel_backend, pg_terminate_backend).
  • Odoo Documentation: Deployment and worker process sizing guidance.

Principle: reduce fan-out first, then apply reversible memory guardrails, verify stability, and scale back up in controlled steps.
