Odoo PostgreSQL Disk-Full (ENOSPC) Emergency Runbook
Production-safe incident runbook for Odoo operators to triage, contain, and recover from PostgreSQL disk-full failures without corrupting data or deleting critical WAL files.
When PostgreSQL hits `No space left on device` (ENOSPC), Odoo can fail in ugly ways: login errors, transaction rollbacks, stuck workers, and cascading queue/cron failures.
This runbook is for live production containment first, then safe recovery.
Scope: broad PostgreSQL data-volume exhaustion incidents for Odoo (`base/`, `pg_wal/`, temp files, logs), not only WAL-archiver backlog.
Incident signals
Typical signals during this incident:
- Odoo errors show `OperationalError`, `could not extend file`, or `No space left on device`.
- PostgreSQL logs include `PANIC`, `FATAL`, or repeated `could not write to file` messages.
- Spike in HTTP 500s on write-heavy Odoo endpoints.
- Database node disk usage > 95% and rising.
odoocli logs tail --service odoo --since 20m --grep "OperationalError|No space left on device|could not extend file|500"
odoocli logs tail --service postgres --since 20m --grep "No space left on device|could not write|PANIC|FATAL"
Step 0 - Stabilize blast radius
- Declare incident and freeze non-essential deploys/migrations.
- Temporarily stop high-write background load (bulk imports, heavy cron batches, queue workers).
- Keep one operator responsible for DB-level actions.
- Do not manually delete files from `PGDATA/pg_wal`.
If storage runway is minutes (not hours), prioritize emergency free space first, then deeper diagnosis.
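To decide whether runway is minutes or hours, two `df` samples a few minutes apart are enough. The helper below is a sketch; the sample values, interval, and mount path are illustrative assumptions, not measurements from a real incident.

```shell
# Sketch: estimate minutes of runway from two free-space samples (in KiB)
# taken `interval` seconds apart. Sample free space with e.g.:
#   df -k --output=avail /var/lib/postgresql/data | tail -1
runway_minutes() {
  free_then=$1; free_now=$2; interval=$3
  burn=$(( free_then - free_now ))      # KiB consumed during the interval
  if [ "$burn" -le 0 ]; then
    echo "no measurable growth"
    return 0
  fi
  # Minutes until the current free space is consumed at the observed rate
  echo $(( free_now * interval / burn / 60 ))
}

# Example: free space dropped from 10 GiB to 9.5 GiB over 5 minutes
runway_minutes 10485760 9961472 300   # prints 95
```

A linear projection like this understates risk if growth is accelerating, so treat the result as an upper bound on your runway.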
Step 1 - Confirm where space is exhausted
Run on the PostgreSQL host:
# Filesystem headroom
sudo df -h
sudo df -i
# Top consumers inside PGDATA (adjust path if needed)
sudo du -xhd1 /var/lib/postgresql/data | sort -h
sudo du -xhd1 /var/lib/postgresql/data/pg_wal | sort -h
-- DB-level temporary file pressure (cumulative counters in pg_stat_database since stats_reset)
select datname, temp_files, pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
order by temp_bytes desc;
-- Check archiver and replication slot retention signals
select archived_count, failed_count, last_archived_wal, last_failed_wal, stats_reset
from pg_stat_archiver;
select
slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc nulls last;
Quick interpretation:
- `pg_wal` dominant -> retention/archiving/replication problem.
- `base/` dominant -> table/index growth or bloat pressure.
- `pgsql_tmp` dominant -> sort/hash spills from expensive queries.
- Inodes exhausted (`df -i`) with moderate byte usage -> too many small files/log artifacts.
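That decision table can be captured as a small helper that takes the name of the largest subdirectory from the `du` output above. This is a sketch; the label-to-cause mapping simply mirrors the interpretation list.

```shell
# Sketch: map the dominant PGDATA consumer to a likely root cause.
classify_hotspot() {
  case "$1" in
    pg_wal)    echo "WAL retention: check archiving, replication slots, replica lag" ;;
    base)      echo "relation growth or bloat: check largest tables and indexes" ;;
    pgsql_tmp) echo "query spills: check work_mem and expensive sorts/hashes" ;;
    *)         echo "inspect manually" ;;
  esac
}

# Example: feed in the name of the largest subdirectory
classify_hotspot pg_wal
```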
Step 2 - Emergency containment actions (safe ordering)
2.1 Reduce write pressure immediately
- Pause queue workers/cron jobs causing heavy writes.
- Disable large import/export jobs.
- If necessary, put Odoo in temporary maintenance mode for write endpoints.
# Example: scale down async load first
odoocli scale --service odoo-worker --replicas 0
odoocli scale --service odoo-cron --replicas 0
2.2 Create short-term free space without touching WAL manually
Preferred order:
- Remove/rotate oversized PostgreSQL or system logs on same volume.
- Clear safe transient artifacts (old crash dumps, old package caches) outside PGDATA.
- Extend volume if cloud/storage platform supports fast expansion.
# Example log cleanup (adapt to distro/logging policy)
sudo journalctl --disk-usage
sudo journalctl --vacuum-time=7d
# If the platform supports fast expansion, grow the partition, then the filesystem
sudo growpart /dev/nvme0n1 1
# Use the command matching the filesystem (check with: findmnt -no FSTYPE /var/lib/postgresql)
sudo xfs_growfs /var/lib/postgresql    # xfs: grow by mount point
# sudo resize2fs /dev/nvme0n1p1        # ext4: grow by device
2.3 If pg_wal is the hotspot, fix root retention cause
- Recover broken WAL archiving destination/credentials/network.
- Address stuck replication slots (inactive logical/physical consumers).
- Reconnect/repair lagged replicas that pin WAL.
-- Drop an obsolete inactive slot ONLY if confirmed unused by any consumer
select slot_name, active, restart_lsn from pg_replication_slots;
-- select pg_drop_replication_slot('obsolete_slot_name');
Never run `rm` inside `pg_wal`; deleting WAL segments by hand can make the cluster unrecoverable.
Step 3 - Recover PostgreSQL and Odoo safely
If PostgreSQL was crash-looping due to ENOSPC, after restoring headroom:
sudo systemctl status postgresql
sudo systemctl restart postgresql
sudo systemctl is-active postgresql
Then validate DB write/read health:
select now(), pg_is_in_recovery();
select count(*) from ir_module_module;
create temporary table odoocli_enospc_probe(id int);
drop table odoocli_enospc_probe;
Then restore Odoo traffic gradually:
- Bring web + essential workers first.
- Re-enable cron/queue in controlled batches.
- Watch error rate and DB saturation metrics between each step.
odoocli scale --service odoo-worker --replicas 2
odoocli scale --service odoo-cron --replicas 1
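The gated ramp can be scripted so each scale step only proceeds after a clean soak. This is a sketch built from the `odoocli` commands used elsewhere in this runbook; the 5-minute soak, replica steps, and zero-error gate are assumptions to tune against your own error budget.

```shell
# Sketch: scale workers up step by step, holding if errors appear between steps.
ramp_workers() {
  for replicas in 1 2 4; do
    odoocli scale --service odoo-worker --replicas "$replicas"
    sleep 300   # soak: let load settle before checking the error stream
    errors=$(odoocli logs tail --service odoo --since 5m \
      --grep "OperationalError|No space left on device|500" | wc -l)
    if [ "$errors" -gt 0 ]; then
      echo "errors at $replicas replicas; holding ramp for investigation"
      return 1
    fi
  done
  echo "ramp complete"
}
```

Keep a human watching DB saturation metrics during the soak windows; the script only gates on the application error stream.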
Step 4 - Verification checklist
4.1 Storage safety margin is restored
sudo df -h /var/lib/postgresql/data
Target: at least your defined incident floor (commonly 20-30% free after recovery).
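The floor check can be scripted as a pass/fail gate for the incident channel. A sketch, assuming GNU `df`; the mount point and 20% floor are placeholders for your own SLO.

```shell
# Sketch: fail when free space on the data mount is below the incident floor.
check_floor() {
  mount=$1; floor_pct=$2
  # df --output=pcent prints a header line, then e.g. " 82%"
  used_pct=$(df --output=pcent "$mount" | tail -1 | tr -dc '0-9')
  free_pct=$(( 100 - used_pct ))
  if [ "$free_pct" -lt "$floor_pct" ]; then
    echo "FAIL: ${free_pct}% free on $mount is below the ${floor_pct}% floor"
    return 1
  fi
  echo "OK: ${free_pct}% free on $mount"
}

# check_floor /var/lib/postgresql/data 20
```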
4.2 PostgreSQL error stream is clean
odoocli logs tail --service postgres --since 15m --grep "No space left on device|PANIC|FATAL|could not extend file"
4.3 Odoo app path is healthy
odoocli logs tail --service odoo --since 15m --grep "Traceback|OperationalError|500"
- Login works
- Create/write flows succeed
- Cron backlog drains instead of growing
4.4 WAL retention normalizes
select now(), archived_count, failed_count, last_archived_wal, last_failed_wal
from pg_stat_archiver;
select slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc nulls last;
Rollback / backout plan
If changes introduce instability:
- Revert aggressive scale-up of workers/cron.
- Keep non-essential jobs paused.
- Reapply maintenance mode for write-heavy routes.
- If disk pressure returns rapidly, repeat containment and escalate to storage expansion before full traffic restore.
If a replication slot was dropped in error, recreate downstream consumer and resynchronize from a known-good snapshot/base backup path.
Hardening and prevention checklist
- Set alerts for both disk percent used and growth rate on DB volume.
- Alert on inode exhaustion (`df -i`), not just bytes.
- Track `pg_stat_archiver.failed_count` and replication-slot retained WAL.
- Enforce `temp_file_limit` and a sensible `work_mem` per role for reporting users.
- Keep PostgreSQL logs off the same constrained volume as data where possible.
- Capacity-plan with explicit free-space runway SLO (for example, >= 24h runway at p95 growth).
- Test volume expansion runbook and restore runbook quarterly.
- Audit and remove stale logical replication slots.
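The runway SLO above can be turned into a concrete alert threshold. A sketch; the 24-hour horizon and the 3 GiB/h p95 growth figure are example inputs, not measurements.

```shell
# Sketch: minimum free space (GiB) needed to honor an N-hour runway SLO
# at an observed p95 growth rate (GiB/hour).
required_free_gib() {
  runway_hours=$1; p95_growth_gib_per_hour=$2
  echo $(( runway_hours * p95_growth_gib_per_hour ))
}

# Example: 24h runway at 3 GiB/h p95 growth -> alert when free space < 72 GiB
required_free_gib 24 3   # prints 72
```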
References
- PostgreSQL Documentation: Routine Database Maintenance and Monitoring Database Activity.
- PostgreSQL Documentation: Continuous Archiving and Point-in-Time Recovery (`pg_stat_archiver`, WAL lifecycle).
- PostgreSQL Documentation: Replication slots and WAL retention behavior.
- Odoo Documentation: Deployment/operations guidance and worker/cron management considerations.
Principle: restore headroom safely, fix WAL/retention root cause, then ramp traffic gradually with verification gates.