WORK IN PROGRESS

March 21, 2026

Odoo PostgreSQL Disk-Full (ENOSPC) Emergency Runbook

Production-safe incident runbook for Odoo operators to triage, contain, and recover from PostgreSQL disk-full failures without corrupting data or deleting critical WAL files.

When PostgreSQL hits No space left on device (ENOSPC), Odoo can fail in ugly ways: login errors, transaction rollbacks, stuck workers, and cascading queue/cron failures.

This runbook is for live production containment first, then safe recovery.

Scope: broad PostgreSQL data-volume exhaustion incidents for Odoo (base/, pg_wal/, temp files, logs), not only WAL-archiver backlog.

Incident signals

Typical signals during this incident:

  • Odoo errors show OperationalError, could not extend file, or No space left on device.
  • PostgreSQL logs include PANIC, FATAL, or repeated could not write to file messages.
  • Spike in HTTP 500s on write-heavy Odoo endpoints.
  • Database node disk usage > 95% and rising.
odoocli logs tail --service odoo --since 20m --grep "OperationalError|No space left on device|could not extend file|500"
odoocli logs tail --service postgres --since 20m --grep "No space left on device|could not write|PANIC|FATAL"

Step 0 - Stabilize blast radius

  1. Declare incident and freeze non-essential deploys/migrations.
  2. Temporarily stop high-write background load (bulk imports, heavy cron batches, queue workers).
  3. Keep one operator responsible for DB-level actions.
  4. Do not manually delete files from PGDATA/pg_wal.

If storage runway is minutes (not hours), prioritize emergency free space first, then deeper diagnosis.
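To decide whether runway is minutes or hours, you can estimate time-to-full from two samples of free space. A minimal sketch, assuming you sample with df against the data mount point (path and interval are illustrative):

```shell
#!/bin/sh
# Estimate minutes until the volume is full from two samples of available KB
# taken interval_s seconds apart. Pure arithmetic, so it works with df -P,
# a monitoring API, or any other sampling source.
runway_minutes() {
  avail_t0_kb=$1; avail_t1_kb=$2; interval_s=$3
  consumed_kb=$((avail_t0_kb - avail_t1_kb))
  if [ "$consumed_kb" -le 0 ]; then
    echo "stable"   # volume did not shrink during this sample window
    return 0
  fi
  # minutes = remaining KB / (KB consumed per second) / 60
  echo $(( avail_t1_kb * interval_s / consumed_kb / 60 ))
}

# Example wiring (mount point is an assumption; adjust to your host):
# a0=$(df -P /var/lib/postgresql/data | awk 'NR==2 {print $4}')
# sleep 60
# a1=$(df -P /var/lib/postgresql/data | awk 'NR==2 {print $4}')
# runway_minutes "$a0" "$a1" 60
```

A short window can misread bursty writes; if the first sample says "stable", re-sample over a longer interval before relaxing.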

Step 1 - Confirm where space is exhausted

Run on the PostgreSQL host:

# Filesystem headroom
sudo df -h
sudo df -i
# Top consumers inside PGDATA (adjust path if needed)
sudo du -xhd1 /var/lib/postgresql/data | sort -h
sudo du -xhd1 /var/lib/postgresql/data/pg_wal | sort -h
-- DB-level temporary file pressure (pg_stat_database tracks cumulative temp usage; pg_stat_statements adds per-query detail)
select datname, temp_files, pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
order by temp_bytes desc;
-- Check archiver and replication slot retention signals
select archived_count, failed_count, last_archived_wal, last_failed_wal, stats_reset
from pg_stat_archiver;
select
  slot_name,
  slot_type,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc nulls last;

Quick interpretation:

  • pg_wal dominant -> retention/archiving/replication problem.
  • base/ dominant -> table/index growth or bloat pressure.
  • pgsql_tmp dominant -> sort/hash spills from expensive queries.
  • Inodes exhausted (df -i) with moderate bytes -> too many small files/log artifacts.
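The interpretation table above can be encoded as a small helper that maps the dominant du entry to a likely root cause. A sketch assuming the default PGDATA directory layout (note that pgsql_tmp lives under base/, so it only shows up when you du below depth 1):

```shell
# Map the top space consumer inside PGDATA to a likely root cause,
# mirroring the quick-interpretation table above. More specific
# patterns (pg_wal, pgsql_tmp) are matched before the generic base/ case.
classify_hotspot() {
  case "$1" in
    *pg_wal*)    echo "WAL retention: check archiver, slots, replicas" ;;
    *pgsql_tmp*) echo "temp spills: find expensive sorts/hashes" ;;
    *base*)      echo "table/index growth or bloat" ;;
    *)           echo "inspect manually: $1" ;;
  esac
}

# Example wiring (path is an assumption):
# top=$(sudo du -xd1 /var/lib/postgresql/data | sort -n | tail -1 | awk '{print $2}')
# classify_hotspot "$top"
```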

Step 2 - Emergency containment actions (safe ordering)

2.1 Reduce write pressure immediately

  • Pause queue workers/cron jobs causing heavy writes.
  • Disable large import/export jobs.
  • If necessary, put Odoo in temporary maintenance mode for write endpoints.
# Example: scale down async load first
odoocli scale --service odoo-worker --replicas 0
odoocli scale --service odoo-cron --replicas 0

2.2 Create short-term free space without touching WAL manually

Preferred order:

  1. Remove/rotate oversized PostgreSQL or system logs on the same volume.
  2. Clear safe transient artifacts (old crash dumps, old package caches) outside PGDATA.
  3. Extend volume if cloud/storage platform supports fast expansion.
# Example log cleanup (adapt to distro/logging policy)
sudo journalctl --disk-usage
sudo journalctl --vacuum-time=7d
# If data volume supports online grow, expand filesystem after volume resize
sudo growpart /dev/nvme0n1 1 || true
sudo xfs_growfs /var/lib/postgresql || sudo resize2fs /dev/nvme0n1p1

2.3 If pg_wal is the hotspot, fix root retention cause

  • Recover broken WAL archiving destination/credentials/network.
  • Address stuck replication slots (inactive logical/physical consumers).
  • Reconnect/repair lagged replicas that pin WAL.
-- Drop an obsolete inactive slot ONLY if confirmed unused by any consumer
select slot_name, active, restart_lsn from pg_replication_slots;
-- select pg_drop_replication_slot('obsolete_slot_name');

Never run rm inside pg_wal; it can make the cluster unrecoverable.
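Under incident pressure, "confirmed unused" should be an explicit guard, not a judgment call. A sketch that encodes the rule, assuming you can answer both inputs from the pg_replication_slots query above and your own consumer inventory (the inventory itself is an assumption about your environment):

```shell
# Decide whether a replication slot is safe to drop.
#   $1 = active flag from pg_replication_slots ("t" or "f")
#   $2 = whether any downstream consumer still claims the slot
#        ("yes"/"no"), confirmed out-of-band with the owning team
slot_drop_verdict() {
  if [ "$1" = "t" ]; then
    echo "KEEP: slot is active"
  elif [ "$2" = "yes" ]; then
    echo "KEEP: consumer still registered"
  else
    echo "DROP: inactive and unclaimed"
  fi
}
```

Only a "DROP" verdict should ever lead to pg_drop_replication_slot, and even then the operator from Step 0 executes it, not an automated job.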

Step 3 - Recover PostgreSQL and Odoo safely

If PostgreSQL was crash-looping due to ENOSPC, after restoring headroom:

sudo systemctl status postgresql
sudo systemctl restart postgresql
sudo systemctl is-active postgresql

Then validate DB write/read health:

select now(), pg_is_in_recovery();
select count(*) from ir_module_module;
create temporary table odoocli_enospc_probe(id int);
drop table odoocli_enospc_probe;

Then restore Odoo traffic gradually:

  1. Bring up web + essential workers first.
  2. Re-enable cron/queue in controlled batches.
  3. Watch error rate and DB saturation metrics between each step.
odoocli scale --service odoo-worker --replicas 2
odoocli scale --service odoo-cron --replicas 1

Step 4 - Verification checklist

4.1 Storage safety margin is restored

sudo df -h /var/lib/postgresql/data

Target: at least your defined incident floor (commonly 20-30% free after recovery).
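The floor check can be scripted so this verification step is a pass/fail gate rather than an eyeball of df output. A sketch (the 20% floor below is this runbook's example value, not a universal recommendation):

```shell
# Gate on a minimum free-space floor.
#   $1 = percent used (integer, no '%' sign)
#   $2 = required free percent (your incident floor)
floor_ok() {
  free=$((100 - $1))
  if [ "$free" -ge "$2" ]; then
    echo "PASS: ${free}% free"
  else
    echo "FAIL: ${free}% free, floor is ${2}%"
  fi
}

# Example wiring (mount point is an assumption):
# used=$(df -P /var/lib/postgresql/data | awk 'NR==2 {gsub("%","",$5); print $5}')
# floor_ok "$used" 20
```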

4.2 PostgreSQL error stream is clean

odoocli logs tail --service postgres --since 15m --grep "No space left on device|PANIC|FATAL|could not extend file"

4.3 Odoo app path is healthy

odoocli logs tail --service odoo --since 15m --grep "Traceback|OperationalError|500"
  • Login works
  • Create/write flows succeed
  • Cron backlog drains instead of growing

4.4 WAL retention normalizes

select now(), archived_count, failed_count, last_archived_wal, last_failed_wal
from pg_stat_archiver;
select slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc nulls last;

Rollback / backout plan

If changes introduce instability:

  1. Revert aggressive scale-up of workers/cron.
  2. Keep non-essential jobs paused.
  3. Reapply maintenance mode for write-heavy routes.
  4. If disk pressure returns rapidly, repeat containment and escalate to storage expansion before full traffic restore.

If a replication slot was dropped in error, recreate downstream consumer and resynchronize from a known-good snapshot/base backup path.

Hardening and prevention checklist

  • Set alerts for both disk percent used and growth rate on DB volume.
  • Alert on inode exhaustion (df -i), not just bytes.
  • Track pg_stat_archiver.failed_count and replication slot retained WAL.
  • Enforce temp_file_limit and sensible work_mem per role for reporting users.
  • Keep PostgreSQL logs off the same constrained volume as data where possible.
  • Capacity-plan with explicit free-space runway SLO (for example, >= 24h runway at p95 growth).
  • Test volume expansion runbook and restore runbook quarterly.
  • Audit and remove stale logical replication slots.
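The first two checklist items combine naturally into one alerting rule: page on either high utilization or fast growth, not just a static threshold. A sketch with illustrative thresholds (tune them to your own runway SLO):

```shell
# Alert level from percent used and growth rate (percent-points per hour).
# Thresholds are illustrative placeholders, not recommendations.
disk_alert_level() {
  used=$1; growth_pph=$2
  if [ "$used" -ge 90 ] || [ "$growth_pph" -ge 5 ]; then
    echo "page"      # near full, or filling fast even from a low base
  elif [ "$used" -ge 80 ] || [ "$growth_pph" -ge 2 ]; then
    echo "warn"
  else
    echo "ok"
  fi
}
```

Alerting on growth rate is what catches a runaway import or WAL pile-up while the volume is still at 50%, hours before a percent-used alert would fire.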

References

  • PostgreSQL Documentation: Routine Database Maintenance and Monitoring Database Activity.
  • PostgreSQL Documentation: Continuous Archiving and PITR (pg_stat_archiver, WAL lifecycle).
  • PostgreSQL Documentation: Replication slots and WAL retention behavior.
  • Odoo Documentation: Deployment/operations guidance and worker/cron management considerations.

Principle: restore headroom safely, fix WAL/retention root cause, then ramp traffic gradually with verification gates.
