Odoo PostgreSQL WAL Archiver Backlog and Disk Pressure Runbook
Production incident runbook to detect, contain, and remediate PostgreSQL WAL archiving backlog before Odoo is impacted by pg_wal disk exhaustion.
When PostgreSQL cannot archive WAL segments, `pg_wal` grows until disk pressure degrades write latency and checkpoint behavior, and eventually threatens Odoo availability. This is one of those incidents where "wait and see" turns into emergency downtime.
This runbook gives a production-safe order of operations: verify archiver failure, estimate time to exhaustion, reduce WAL generation safely, restore archiving, and confirm backlog drain.
Incident signals (page-worthy)
Treat this as an incident when one or more signals persist for 5–10 minutes:
- PostgreSQL logs repeatedly show `archive command failed` or `could not write file`.
- `pg_wal` disk usage climbs continuously.
- Odoo users report growing write latency during posting/validation flows.
- Checkpoint warnings increase (`checkpoints are occurring too frequently`).
- Monitoring shows WAL generation outpacing archive throughput.
Triage checklist (first 10 minutes)
- Freeze deploys and risky maintenance while triage runs.
- Confirm whether failure is archiver path/permission/object-store/network related.
- Estimate `pg_wal` free-space runway.
- Reduce non-critical write amplification (bulk imports/recompute jobs).
- Capture baseline metrics before remediation (`failed_count`, WAL bytes/min, disk free).
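The baseline-capture step can be sketched as a small helper; the log path, field names, and sample values below are illustrative, not a standard tool — the operator supplies the numbers from `pg_stat_archiver`, `pg_wal_lsn_diff`, and `df`:

```shell
# Hypothetical baseline recorder: append one timestamped metrics line per call
# so pre- and post-remediation numbers can be compared during the incident.
record_baseline() {
  local logfile="$1" failed_count="$2" wal_bytes_min="$3" disk_free_bytes="$4"
  printf '%s failed_count=%s wal_bytes_min=%s disk_free_bytes=%s\n' \
    "$(date -u +%FT%TZ)" "$failed_count" "$wal_bytes_min" "$disk_free_bytes" \
    >> "$logfile"
}

# Example: 12 failed archives so far, ~56 MiB/min WAL, ~30 GiB free (hypothetical).
record_baseline /tmp/wal-incident-baseline.log 12 58736640 32212254720
```

Appending one line per capture keeps the evidence trail greppable when writing the post-incident report.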
Step 1 — Confirm archiver failure and blast radius
1.1 Check PostgreSQL archiver stats
psql "$ODOO_DB_URI" -c "
select
archived_count,
failed_count,
last_archived_wal,
last_archived_time,
last_failed_wal,
last_failed_time,
stats_reset
from pg_stat_archiver;
"
If `failed_count` keeps increasing and `last_archived_time` is stale, archiving is actively failing.
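That decision rule can be encoded as a tiny triage helper. The 600-second staleness threshold is an assumption for illustration, not official PostgreSQL guidance; tune it to your archive_timeout:

```shell
# Hypothetical triage helper: compare two failed_count samples taken a few
# minutes apart, plus the age in seconds of last_archived_time.
archiver_failing() {
  local failed_before="$1" failed_after="$2" archived_age_s="$3"
  if [ "$failed_after" -gt "$failed_before" ] && [ "$archived_age_s" -gt 600 ]; then
    echo "FAILING"
  else
    echo "ok"
  fi
}

archiver_failing 120 135 900   # failures rising, nothing archived for 15 min
```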
1.2 Validate WAL volume pressure on disk
df -h /var/lib/postgresql
sudo du -sh /var/lib/postgresql/data/pg_wal
If your data directory differs, use the real mount path from your PostgreSQL service config (`show data_directory;` in psql reports the active path).
1.3 Estimate WAL generation rate (bytes/min)
psql "$ODOO_DB_URI" -Atc "select pg_current_wal_lsn();"
sleep 60
psql "$ODOO_DB_URI" -Atc "select pg_current_wal_lsn();"
Convert to delta:
psql "$ODOO_DB_URI" -c "
select pg_wal_lsn_diff('<LSN_2>', '<LSN_1>') as wal_bytes_per_minute;
"
Use this with free disk space to estimate time-to-exhaustion.
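The same arithmetic can be done client-side if you only have the raw LSN strings. This sketch reimplements the `pg_wal_lsn_diff` calculation in shell; the two LSNs and the free-disk figure are hypothetical samples, not values from a real system:

```shell
# Convert a PostgreSQL LSN like '3/4F2C8D10' to an absolute byte position
# (high 32 bits before the slash, low 32 bits after it).
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( (0x$hi << 32) + 0x$lo ))
}

lsn_1='3/4F2C8D10'               # hypothetical sample at t=0
lsn_2='3/52ACCD10'               # hypothetical sample at t=60s
wal_bytes_per_min=$(( $(lsn_to_bytes "$lsn_2") - $(lsn_to_bytes "$lsn_1") ))

disk_free_bytes=32212254720      # hypothetical: ~30 GiB free on the pg_wal mount
runway_min=$(( disk_free_bytes / wal_bytes_per_min ))
echo "WAL rate: ${wal_bytes_per_min} B/min; runway: ~${runway_min} min"
```

Runway in minutes is simply free bytes divided by WAL bytes per minute; re-sample every few minutes, since bulk jobs can change the rate sharply.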
1.4 Pull log evidence quickly
odoocli logs tail --service postgres --since 20m --grep "archive command failed|archiver|could not write file|No space left on device"
Step 2 — Contain growth while preserving core business traffic
- Pause non-critical high-write lanes first (imports, heavy recomputes, mass mail batches).
- Keep customer-critical transactions available.
- Avoid immediate PostgreSQL restart; restart does not fix broken archive destinations.
Example controls:
odoocli cron pause --tag heavy-write
odoocli cron pause --tag bulk-sync
If backlog is severe and runway is short, temporarily reduce write pressure from background workers before touching customer flows.
Step 3 — Fix archive destination path safely
Most incidents are one of four classes: credential expiration, permission/path breakage, remote object-store outage, or full archive destination.
3.1 Confirm effective archiving config
psql "$ODOO_DB_URI" -c "
show archive_mode;
show archive_command;
show archive_timeout;
"
3.2 Test destination with same service identity
If archiving to local/NFS path:
sudo -u postgres test -w /var/backups/postgresql/wal && echo "archive path writable"
If archiving to object storage, validate credentials/network from DB host with your standard tooling (AWS CLI, gsutil, etc.) before changing PostgreSQL settings.
3.3 Repair root cause, then force an archive attempt
After fixing destination access, force WAL switch:
psql "$ODOO_DB_URI" -c "select pg_switch_wal();"
Recheck `pg_stat_archiver`; `archived_count` should advance and `failed_count` should stop increasing.
Step 4 — Emergency relief if disk runway is critically low
Only do this under incident command and with explicit backup/recovery owner approval.
- Prefer adding disk capacity or expanding the volume first.
- Do not delete files manually from `pg_wal`.
- Do not disable archiving as a quick fix if your PITR/recovery posture depends on it.
If you must buy time quickly, reduce non-critical writes and scale down non-essential Odoo worker lanes until archiving is healthy.
Step 5 — Verification loop (every 5 minutes)
5.1 Archiver counters should recover
watch -n 60 "psql \"$ODOO_DB_URI\" -x -c \"
select archived_count, failed_count, last_archived_wal, last_archived_time, last_failed_time
from pg_stat_archiver;
\""
Expected trend:
- `archived_count` increases steadily.
- `failed_count` plateaus.
- `last_archived_time` updates continuously.
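One way to quantify "last_archived_time updates continuously" is to track archive lag in seconds. This sketch assumes GNU `date -d` for timestamp parsing and uses a hardcoded sample value; in production you would fetch the timestamp with `psql "$ODOO_DB_URI" -Atc "select last_archived_time from pg_stat_archiver;"`:

```shell
# Hypothetical lag computation: seconds elapsed since last_archived_time.
archive_lag_seconds() {
  local last_archived="$1"
  echo $(( $(date -u +%s) - $(date -u -d "$last_archived" +%s) ))
}

lag=$(archive_lag_seconds '2024-05-01 10:00:00+00')   # hypothetical sample
echo "archive lag: ${lag}s"
```

A lag that keeps growing while `archived_count` is flat means the fix has not taken; a lag that resets to near zero on each check means the backlog is draining.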
5.2 pg_wal size should stop growing and begin draining
watch -n 60 "sudo du -sh /var/lib/postgresql/data/pg_wal"
5.3 Business-path checks
- Sales order confirmation succeeds within normal latency.
- Invoice posting remains stable.
- No fresh PostgreSQL archiver errors in logs.
Rollback / normalization plan
After stabilization:
- Resume paused cron lanes one group at a time.
- Watch WAL bytes/min and archiver success ratio after each resume.
- If temporary worker reductions were made, restore gradually.
odoocli cron resume --tag heavy-write
odoocli cron resume --tag bulk-sync
If any lane causes WAL generation to exceed archive throughput again, re-pause and tune before full restore.
Hardening checklist (post-incident)
- Alert on `pg_stat_archiver.failed_count` rate-of-change, not only absolute count.
- Track archive lag (time since `last_archived_time`) with paging thresholds.
- Add disk runway alerting for the PostgreSQL data mount and `pg_wal` growth slope.
- Validate archive destination credentials/permissions with synthetic probes.
- Document RTO/RPO impact if archiving is unavailable for N minutes.
- Load-test expected WAL generation during Odoo peak jobs and verify archive throughput margin.
- Rehearse PITR restore from recent base backup + archived WAL in staging.
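The synthetic-probe item above can be sketched as a simple writability check for a local/NFS archive destination; the probe-file naming is an assumption, and it should run under the same identity as `archive_command` (e.g. via `sudo -u postgres`):

```shell
# Hypothetical synthetic probe: create and remove a throwaway file in the
# archive directory, exercising the same write+unlink path archiving needs.
probe_archive_writable() {
  local dir="$1" f
  f="$dir/.archive-probe.$$"
  if touch "$f" 2>/dev/null && rm -f "$f"; then
    echo "writable"
  else
    echo "not-writable"
  fi
}

probe_archive_writable /tmp   # substitute your real archive destination
```

Wired into monitoring on a short interval, a "not-writable" result catches credential or permission breakage before `failed_count` starts climbing.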
Practical references
- PostgreSQL continuous archiving and PITR: https://www.postgresql.org/docs/current/continuous-archiving.html
- PostgreSQL archiver statistics (`pg_stat_archiver`): https://www.postgresql.org/docs/current/monitoring-stats.html
- PostgreSQL WAL configuration parameters: https://www.postgresql.org/docs/current/runtime-config-wal.html
- Odoo deployment/operations guidance: https://www.odoo.com/documentation/17.0/administration/on_premise/deploy.html
Operational rule: when WAL archiving fails, optimize for runway preservation + archive path recovery, not restart-first reactions.