Odoo PostgreSQL Disk-Full (ENOSPC) Emergency Runbook
Production-safe incident runbook for Odoo operators to triage, contain, and recover from PostgreSQL disk-full failures without corrupting data or deleting critical WAL files.
When PostgreSQL hits `No space left on device` (ENOSPC), Odoo can fail in ugly ways: login errors, transaction rollbacks, stuck workers, and cascading queue/cron failures.
This runbook is for live production containment first, then safe recovery.
Scope: broad PostgreSQL data-volume exhaustion incidents for Odoo (`base/`, `pg_wal/`, temp files, logs), not only WAL-archiver backlog.
Incident signals
Typical signals during this incident:
- Odoo errors show `OperationalError`, `could not extend file`, or `No space left on device`.
- PostgreSQL logs include `PANIC`, `FATAL`, or repeated `could not write to file` messages.
- Spike in HTTP 500s on write-heavy Odoo endpoints.
- Database node disk usage > 95% and rising.
odoocli logs tail --service odoo --since 20m --grep "OperationalError|No space left on device|could not extend file|500"
odoocli logs tail --service postgres --since 20m --grep "No space left on device|could not write|PANIC|FATAL"
Step 0 - Stabilize blast radius
- Declare incident and freeze non-essential deploys/migrations.
- Temporarily stop high-write background load (bulk imports, heavy cron batches, queue workers).
- Keep one operator responsible for DB-level actions.
- Do not manually delete files from `PGDATA/pg_wal`.
If storage runway is minutes (not hours), prioritize emergency free space first, then deeper diagnosis.
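To decide whether runway is minutes or hours, two `df` samples a few minutes apart are enough. The helper below is a sketch; the sample values, interval, and mount path are illustrative assumptions, not measurements from a real incident.

```shell
# Sketch: estimate minutes of runway from two free-space samples (in KiB)
# taken `interval` seconds apart. Sample free space with e.g.:
#   df -k --output=avail /var/lib/postgresql/data | tail -1
runway_minutes() {
  free_then=$1; free_now=$2; interval=$3
  burn=$(( free_then - free_now ))      # KiB consumed during the interval
  if [ "$burn" -le 0 ]; then
    echo "no measurable growth"
    return 0
  fi
  # Minutes until the current free space is consumed at the observed rate
  echo $(( free_now * interval / burn / 60 ))
}

# Example: free space dropped from 10 GiB to 9.5 GiB over 5 minutes
runway_minutes 10485760 9961472 300   # prints 95
```

A linear projection like this understates risk if growth is accelerating, so treat the result as an upper bound on your runway.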
Step 1 - Confirm where space is exhausted
Run on the PostgreSQL host:
# Filesystem headroom
sudo df -h
sudo df -i
# Top consumers inside PGDATA (adjust path if needed)
sudo du -xhd1 /var/lib/postgresql/data | sort -h
sudo du -xhd1 /var/lib/postgresql/data/pg_wal | sort -h
-- DB-level temporary file pressure (cumulative counters in pg_stat_database since stats_reset)
select datname, temp_files, pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
order by temp_bytes desc;
-- Check archiver and replication slot retention signals
select archived_count, failed_count, last_archived_wal, last_failed_wal, stats_reset
from pg_stat_archiver;
select
slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc nulls last;
Quick interpretation:
- `pg_wal` dominant -> retention/archiving/replication problem.
- `base/` dominant -> table/index growth or bloat pressure.
- `pgsql_tmp` dominant -> sort/hash spills from expensive queries.
- Inodes exhausted (`df -i`) with moderate byte usage -> too many small files/log artifacts.
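That decision table can be captured as a small helper that takes the name of the largest subdirectory from the `du` output above. This is a sketch; the label-to-cause mapping simply mirrors the interpretation list.

```shell
# Sketch: map the dominant PGDATA consumer to a likely root cause.
classify_hotspot() {
  case "$1" in
    pg_wal)    echo "WAL retention: check archiving, replication slots, replica lag" ;;
    base)      echo "relation growth or bloat: check largest tables and indexes" ;;
    pgsql_tmp) echo "query spills: check work_mem and expensive sorts/hashes" ;;
    *)         echo "inspect manually" ;;
  esac
}

# Example: feed in the name of the largest subdirectory
classify_hotspot pg_wal
```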
Step 2 - Emergency containment actions (safe ordering)
2.1 Reduce write pressure immediately
- Pause queue workers/cron jobs causing heavy writes.
- Disable large import/export jobs.
- If necessary, put Odoo in temporary maintenance mode for write endpoints.
# Example: scale down async load first
odoocli scale --service odoo-worker --replicas 0
odoocli scale --service odoo-cron --replicas 0
2.2 Create short-term free space without touching WAL manually
Preferred order:
- Remove/rotate oversized PostgreSQL or system logs on same volume.
- Clear safe transient artifacts (old crash dumps, old package caches) outside PGDATA.
- Extend volume if cloud/storage platform supports fast expansion.
# Example log cleanup (adapt to distro/logging policy)
sudo journalctl --disk-usage
sudo journalctl --vacuum-time=7d
# If the platform supports fast expansion, grow the partition, then the filesystem
sudo growpart /dev/nvme0n1 1
# Use the command matching the filesystem (check with: findmnt -no FSTYPE /var/lib/postgresql)
sudo xfs_growfs /var/lib/postgresql    # xfs: grow by mount point
# sudo resize2fs /dev/nvme0n1p1        # ext4: grow by device
2.3 If pg_wal is the hotspot, fix root retention cause
- Recover broken WAL archiving destination/credentials/network.
- Address stuck replication slots (inactive logical/physical consumers).
- Reconnect/repair lagged replicas that pin WAL.
-- Drop an obsolete inactive slot ONLY if confirmed unused by any consumer
select slot_name, active, restart_lsn from pg_replication_slots;
-- select pg_drop_replication_slot('obsolete_slot_name');
Never run `rm` inside `pg_wal`; deleting WAL segments by hand can make the cluster unrecoverable.
Step 3 - Recover PostgreSQL and Odoo safely
If PostgreSQL was crash-looping due to ENOSPC, after restoring headroom:
sudo systemctl status postgresql
sudo systemctl restart postgresql
sudo systemctl is-active postgresql
Then validate DB write/read health:
select now(), pg_is_in_recovery();
select count(*) from ir_module_module;
create temporary table odoocli_enospc_probe(id int);
drop table odoocli_enospc_probe;
Then restore Odoo traffic gradually:
- Bring web + essential workers first.
- Re-enable cron/queue in controlled batches.
- Watch error rate and DB saturation metrics between each step.
odoocli scale --service odoo-worker --replicas 2
odoocli scale --service odoo-cron --replicas 1
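The gated ramp can be scripted so each scale step only proceeds after a clean soak. This is a sketch built from the `odoocli` commands used elsewhere in this runbook; the 5-minute soak, replica steps, and zero-error gate are assumptions to tune against your own error budget.

```shell
# Sketch: scale workers up step by step, holding if errors appear between steps.
ramp_workers() {
  for replicas in 1 2 4; do
    odoocli scale --service odoo-worker --replicas "$replicas"
    sleep 300   # soak: let load settle before checking the error stream
    errors=$(odoocli logs tail --service odoo --since 5m \
      --grep "OperationalError|No space left on device|500" | wc -l)
    if [ "$errors" -gt 0 ]; then
      echo "errors at $replicas replicas; holding ramp for investigation"
      return 1
    fi
  done
  echo "ramp complete"
}
```

Keep a human watching DB saturation metrics during the soak windows; the script only gates on the application error stream.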
Step 4 - Verification checklist
4.1 Storage safety margin is restored
sudo df -h /var/lib/postgresql/data
Target: at least your defined incident floor (commonly 20-30% free after recovery).
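The floor check can be scripted as a pass/fail gate for the incident channel. A sketch, assuming GNU `df`; the mount point and 20% floor are placeholders for your own SLO.

```shell
# Sketch: fail when free space on the data mount is below the incident floor.
check_floor() {
  mount=$1; floor_pct=$2
  # df --output=pcent prints a header line, then e.g. " 82%"
  used_pct=$(df --output=pcent "$mount" | tail -1 | tr -dc '0-9')
  free_pct=$(( 100 - used_pct ))
  if [ "$free_pct" -lt "$floor_pct" ]; then
    echo "FAIL: ${free_pct}% free on $mount is below the ${floor_pct}% floor"
    return 1
  fi
  echo "OK: ${free_pct}% free on $mount"
}

# check_floor /var/lib/postgresql/data 20
```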
4.2 PostgreSQL error stream is clean
odoocli logs tail --service postgres --since 15m --grep "No space left on device|PANIC|FATAL|could not extend file"
4.3 Odoo app path is healthy
odoocli logs tail --service odoo --since 15m --grep "Traceback|OperationalError|500"
- Login works
- Create/write flows succeed
- Cron backlog drains instead of growing
4.4 WAL retention normalizes
select now(), archived_count, failed_count, last_archived_wal, last_failed_wal
from pg_stat_archiver;
select slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc nulls last;
Rollback / backout plan
If changes introduce instability:
- Revert aggressive scale-up of workers/cron.
- Keep non-essential jobs paused.
- Reapply maintenance mode for write-heavy routes.
- If disk pressure returns rapidly, repeat containment and escalate to storage expansion before full traffic restore.
If a replication slot was dropped in error, recreate downstream consumer and resynchronize from a known-good snapshot/base backup path.
Hardening and prevention checklist
- Set alerts for both disk percent used and growth rate on DB volume.
- Alert on inode exhaustion (`df -i`), not just bytes.
- Track `pg_stat_archiver.failed_count` and replication-slot retained WAL.
- Enforce `temp_file_limit` and a sensible `work_mem` per role for reporting users.
- Keep PostgreSQL logs off the same constrained volume as data where possible.
- Capacity-plan with explicit free-space runway SLO (for example, >= 24h runway at p95 growth).
- Test volume expansion runbook and restore runbook quarterly.
- Audit and remove stale logical replication slots.
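The runway SLO above can be turned into a concrete alert threshold. A sketch; the 24-hour horizon and the 3 GiB/h p95 growth figure are example inputs, not measurements.

```shell
# Sketch: minimum free space (GiB) needed to honor an N-hour runway SLO
# at an observed p95 growth rate (GiB/hour).
required_free_gib() {
  runway_hours=$1; p95_growth_gib_per_hour=$2
  echo $(( runway_hours * p95_growth_gib_per_hour ))
}

# Example: 24h runway at 3 GiB/h p95 growth -> alert when free space < 72 GiB
required_free_gib 24 3   # prints 72
```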
References
- PostgreSQL Documentation: Routine Database Maintenance and Monitoring Database Activity.
- PostgreSQL Documentation: Continuous Archiving and Point-in-Time Recovery (`pg_stat_archiver`, WAL lifecycle).
- PostgreSQL Documentation: Replication slots and WAL retention behavior.
- Odoo Documentation: Deployment/operations guidance and worker/cron management considerations.
Principle: restore headroom safely, fix WAL/retention root cause, then ramp traffic gradually with verification gates.