WORK IN PROGRESS

Sat Mar 21 2026

Odoo PostgreSQL WAL Archiver Backlog and Disk Pressure Runbook

Production incident runbook to detect, contain, and remediate PostgreSQL WAL archiving backlog before Odoo is impacted by pg_wal disk exhaustion.

When PostgreSQL cannot archive WAL segments, pg_wal grows until disk pressure degrades write latency, disrupts checkpoint behavior, and eventually threatens Odoo availability. This is one of those incidents where "wait and see" turns into emergency downtime.

This runbook gives a production-safe order of operations: verify archiver failure, estimate time to exhaustion, reduce WAL generation safely, restore archiving, and confirm backlog drain.

Incident signals (page-worthy)

Treat this as an incident when one or more signals persist for 5–10 minutes:

  • PostgreSQL logs repeatedly show archive command failed or could not write file.
  • pg_wal disk usage climbs continuously.
  • Odoo users report growing write latency during posting/validation flows.
  • Checkpoint warnings increase (checkpoints are occurring too frequently).
  • Monitoring shows WAL generation outpacing archive throughput.

Triage checklist (first 10 minutes)

  • Freeze deploys and risky maintenance while triage runs.
  • Confirm whether failure is archiver path/permission/object-store/network related.
  • Estimate pg_wal free-space runway.
  • Reduce non-critical write amplification (bulk imports/recompute jobs).
  • Capture baseline metrics before remediation (failed_count, WAL bytes/min, disk free).
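The last bullet is easier under pressure if baseline capture is a one-shot script prepared in advance. A minimal sketch (the script name and output paths are assumptions; it reuses the same queries shown in Step 1, and assumes $ODOO_DB_URI is set and the data directory matches your host):

```shell
# Write the capture script once; run it at incident start, before remediation.
# Assumption: /tmp paths and /var/lib/postgresql layout match your install.
cat > /tmp/wal_baseline.sh <<'EOF'
#!/bin/sh
ts=$(date -u +%Y%m%dT%H%M%SZ)
out="/tmp/wal_baseline_${ts}.txt"
{
  echo "== pg_stat_archiver =="
  psql "$ODOO_DB_URI" -x -c "select * from pg_stat_archiver;"
  echo "== disk free =="
  df -h /var/lib/postgresql
  echo "== pg_wal size =="
  du -sh /var/lib/postgresql/data/pg_wal
} > "$out" 2>&1
echo "baseline written to $out"
EOF
chmod +x /tmp/wal_baseline.sh
```

Run it once before touching anything so post-fix numbers have something to compare against.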

Step 1 — Confirm archiver failure and blast radius

1.1 Check PostgreSQL archiver stats

psql "$ODOO_DB_URI" -c "
select
  archived_count,
  failed_count,
  last_archived_wal,
  last_archived_time,
  last_failed_wal,
  last_failed_time,
  stats_reset
from pg_stat_archiver;
"

If failed_count keeps increasing and last_archived_time is stale, archiving is actively failing.

1.2 Validate WAL volume pressure on disk

df -h /var/lib/postgresql
sudo du -sh /var/lib/postgresql/data/pg_wal

If your data directory differs, use the real mount path from your PostgreSQL service config.

1.3 Estimate WAL generation rate (bytes/min)

psql "$ODOO_DB_URI" -Atc "select pg_current_wal_lsn();"
sleep 60
psql "$ODOO_DB_URI" -Atc "select pg_current_wal_lsn();"

Convert to delta:

psql "$ODOO_DB_URI" -c "
select pg_wal_lsn_diff('<LSN_2>', '<LSN_1>') as wal_bytes_per_minute;
"

Use this with free disk space to estimate time-to-exhaustion.
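Turning those two numbers into a runway figure is simple division, but worth scripting so nobody does mental math mid-incident. A sketch (the helper name is an assumption; feed it free bytes on the pg_wal mount, e.g. from `df --output=avail -B1 /var/lib/postgresql`, and the bytes/min figure from above):

```shell
# runway_minutes FREE_BYTES WAL_BYTES_PER_MIN
# Prints the estimated minutes until the pg_wal mount fills at the current rate.
runway_minutes() {
  free_bytes=$1
  wal_bytes_per_min=$2
  if [ "$wal_bytes_per_min" -le 0 ]; then
    echo "WAL generation is zero or negative; no exhaustion projected"
    return 0
  fi
  awk -v f="$free_bytes" -v r="$wal_bytes_per_min" \
    'BEGIN { printf "~%.0f minutes of runway\n", f / r }'
}

# Example: 20 GiB free, 256 MiB of WAL per minute.
runway_minutes $((20 * 1024 * 1024 * 1024)) $((256 * 1024 * 1024))
# prints "~80 minutes of runway"
```

Treat the estimate as optimistic: WAL generation often accelerates as retries and queued jobs pile up.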

1.4 Pull log evidence quickly

odoocli logs tail --service postgres --since 20m --grep "archive command failed|archiver|could not write file|No space left on device"

Step 2 — Contain growth while preserving core business traffic

  1. Pause non-critical high-write lanes first (imports, heavy recomputes, mass mail batches).
  2. Keep customer-critical transactions available.
  3. Avoid immediate PostgreSQL restart; restart does not fix broken archive destinations.

Example controls:

odoocli cron pause --tag heavy-write
odoocli cron pause --tag bulk-sync

If backlog is severe and runway is short, temporarily reduce write pressure from background workers before touching customer flows.

Step 3 — Fix the archive destination safely

Most incidents are one of four classes: credential expiration, permission/path breakage, remote object-store outage, or full archive destination.

3.1 Confirm effective archiving config

psql "$ODOO_DB_URI" -c "
show archive_mode;
show archive_command;
show archive_timeout;
"

3.2 Test destination with same service identity

If archiving to local/NFS path:

sudo -u postgres test -w /var/backups/postgresql/wal && echo "archive path writable"

If archiving to object storage, validate credentials/network from DB host with your standard tooling (AWS CLI, gsutil, etc.) before changing PostgreSQL settings.
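Whatever the destination, you can exercise the configured archive_command itself with a scratch file before trusting it with real WAL. PostgreSQL substitutes %p with the full path of the file to archive and %f with its file name; the sketch below (the helper name is an assumption) does the same substitution against a throwaway file. In production, run it via sudo -u postgres so the identity matches the server's:

```shell
# probe_archive_command 'COMMAND_TEMPLATE'
# Substitutes %p (full path) and %f (file name) the way PostgreSQL does,
# then runs the command against a throwaway file to verify the destination.
probe_archive_command() {
  template=$1
  scratch=$(mktemp)
  fname=$(basename "$scratch")
  cmd=$(printf '%s' "$template" | sed -e "s|%p|$scratch|g" -e "s|%f|$fname|g")
  if sh -c "$cmd"; then
    echo "archive probe OK"
  else
    echo "archive probe FAILED"
  fi
  rm -f "$scratch"
}

# Example against a local-path destination (directory is an assumption):
mkdir -p /tmp/wal_probe_dest
probe_archive_command 'cp %p /tmp/wal_probe_dest/%f'
# prints "archive probe OK"
```

For object-store destinations, pass the same uploader invocation your archive_command uses; a probe that fails here will fail for the archiver too.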

3.3 Repair root cause, then force an archive attempt

After fixing destination access, force WAL switch:

psql "$ODOO_DB_URI" -c "select pg_switch_wal();"

Recheck pg_stat_archiver; archived_count should move and failed_count should stop accelerating.

Step 4 — Emergency relief if disk runway is critically low

Only do this under incident command and with explicit backup/recovery owner approval.

  • Prefer adding disk capacity or expanding the volume first.
  • Do not delete files manually from pg_wal.
  • Do not disable archiving as a quick fix if your PITR/recovery posture depends on it.

If you must buy time quickly, reduce non-critical writes and scale down non-essential Odoo worker lanes until archiving is healthy.

Step 5 — Verification loop (every 1–5 minutes)

5.1 Archiver counters should recover

watch -n 60 "psql \"$ODOO_DB_URI\" -x -c \"
select archived_count, failed_count, last_archived_wal, last_archived_time, last_failed_time
from pg_stat_archiver;
\""

Expected trend:

  • archived_count increases steadily.
  • failed_count plateaus.
  • last_archived_time updates continuously.
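The trend check can be mechanized by comparing two samples of pg_stat_archiver taken a few minutes apart. A sketch (the helper and its four numeric arguments are assumptions; pass archived_count and failed_count from the earlier and later sample):

```shell
# archiver_trend OLD_ARCHIVED OLD_FAILED NEW_ARCHIVED NEW_FAILED
# Prints "recovering" when archives advance while failures stay flat,
# otherwise "still failing".
archiver_trend() {
  if [ "$3" -gt "$1" ] && [ "$4" -eq "$2" ]; then
    echo "recovering"
  else
    echo "still failing"
  fi
}

archiver_trend 1200 45 1210 45   # archives moving, failures flat -> recovering
archiver_trend 1200 45 1200 52   # no new archives, failures climbing -> still failing
```

Two consecutive "recovering" results are a reasonable bar before declaring the archiver healthy.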

5.2 pg_wal size should stop growing and begin draining

watch -n 60 "sudo du -sh /var/lib/postgresql/data/pg_wal"

5.3 Business-path checks

  • Sales order confirmation succeeds within normal latency.
  • Invoice posting remains stable.
  • No fresh PostgreSQL archiver errors in logs.

Rollback / normalization plan

After stabilization:

  1. Resume paused cron lanes one group at a time.
  2. Watch WAL bytes/min and archiver success ratio after each resume.
  3. If temporary worker reductions were made, restore gradually.

odoocli cron resume --tag heavy-write
odoocli cron resume --tag bulk-sync

If any lane causes WAL generation to exceed archive throughput again, re-pause and tune before full restore.
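That re-pause decision can be expressed as a simple margin check after each resume. A sketch (the helper name and the 20% headroom threshold are assumptions; compare measured WAL bytes/min against measured archive throughput over the same window):

```shell
# should_repause WAL_BYTES_PER_MIN ARCHIVE_BYTES_PER_MIN
# Prints "re-pause" when WAL generation exceeds 80% of archive throughput,
# i.e. when less than 20% headroom remains; otherwise "ok to continue".
should_repause() {
  awk -v gen="$1" -v arc="$2" 'BEGIN {
    if (gen > arc * 0.8) print "re-pause"; else print "ok to continue"
  }'
}

should_repause 90 100   # 90 > 80% of 100 -> re-pause
should_repause 50 100   # ample headroom -> ok to continue
```

Checking after every single lane resume, rather than after the whole batch, makes it obvious which lane is the heavy writer.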

Hardening checklist (post-incident)

  • Alert on pg_stat_archiver.failed_count rate-of-change, not only absolute count.
  • Track archive lag (time since last_archived_time) with paging thresholds.
  • Add disk runway alerting for PostgreSQL data mount and pg_wal growth slope.
  • Validate archive destination credentials/permissions with synthetic probes.
  • Document RTO/RPO impact if archiving is unavailable for N minutes.
  • Load-test expected WAL generation during Odoo peak jobs and verify archive throughput margin.
  • Rehearse PITR restore from recent base backup + archived WAL in staging.

Practical references

Operational rule: when WAL archiving fails, optimize for runway preservation + archive path recovery, not restart-first reactions.
