Odoo PostgreSQL "Requested WAL Segment Has Already Been Removed" Reseed Runbook
Practical incident response for Odoo environments when a PostgreSQL replica cannot catch up because required WAL files are gone, including safe containment, re-seed, verification, and prevention.
When a replica fails with `requested WAL segment ... has already been removed`, it has fallen behind the primary's available WAL retention and can no longer resume streaming from its current timeline position.
In Odoo production, this is not just a replica health warning. It reduces failover safety, can overload the primary if read traffic shifts abruptly, and often indicates retention/slot/config gaps that can recur.
This runbook prioritizes restoring a safe HA posture without risking primary write availability.
Incident signals
Treat this as an active incident if one or more of the following are true:
- Replica logs show `requested WAL segment ... has already been removed`.
- `pg_stat_replication` is missing an expected standby.
- Replication lag jumps from minutes to an unrecoverable state.
- Odoo read-only/reporting workloads lose replica capacity and start pressuring primary.
Step 0 — Stabilize before repair
- Keep all Odoo writes pinned to the current primary.
- Remove the failed replica from load balancing/read routing.
- Pause non-critical read-heavy tasks (large exports, BI refreshes, ad-hoc analytics).
- Freeze failover experiments/topology changes until at least one healthy standby is confirmed.
Step 1 — Confirm diagnosis and blast radius
Run on primary:
select application_name,
client_addr,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes
from pg_stat_replication
order by application_name;
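The lag_bytes column above comes from pg_wal_lsn_diff, which is plain 64-bit arithmetic over the two LSNs. If you only have textual LSNs (from logs or a monitoring export), the same diff can be reproduced off-line; a minimal sketch, with illustrative sample values:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte
    position: the part before '/' is the high 32 bits, the part after
    is the low 32 bits."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(current_lsn: str, replay_lsn: str) -> int:
    """Same arithmetic as pg_wal_lsn_diff(current, replay)."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(replay_lsn)

# Sample values are illustrative, not from a real incident.
print(lag_bytes("16/B374D848", "16/B3000000"))  # 7657544
```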
Check slot state (if slots are used):
select slot_name,
slot_type,
active,
restart_lsn,
wal_status,
safe_wal_size
from pg_replication_slots
order by slot_name;
Confirm retention settings relevant to this failure mode:
show wal_keep_size;
show max_slot_wal_keep_size;
show archive_mode;
show archive_command;
Replica-side (failed node) log check:
odoocli logs tail --service postgres-replica --since 30m --grep "requested WAL segment|removed|replication"
Triage checklist
- Which Odoo flows currently depend on replica reads (reports, portals, integrations)?
- Is this replica required for failover SLA right now?
- Is WAL archival healthy and complete for this window?
- Are replication slots configured correctly (or missing entirely)?
- Did a network outage, maintenance pause, or disk pressure cause the extended lag?
Step 2 — Decide recovery path (archive catch-up vs full reseed)
2.1 Attempt archive-based catch-up only if prerequisites are met
Use this path only when:
- WAL archive is verified complete for missing segment range.
- `restore_command` is configured and tested on the replica.
- Recovery target timeline/LSN continuity is clear.
If these are uncertain, choose full reseed immediately. Partial recovery attempts can waste critical incident time.
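The path decision reduces to a strict AND over the prerequisites, with any uncertainty defaulting to the deterministic reseed. A sketch of that gate for an automation or checklist hook (function and return labels are hypothetical):

```python
def recovery_path(archive_complete: bool,
                  restore_cmd_tested: bool,
                  timeline_clear: bool) -> str:
    """Choose archive catch-up only when every prerequisite is firmly
    confirmed; anything uncertain falls through to full reseed."""
    if archive_complete and restore_cmd_tested and timeline_clear:
        return "archive-catchup"
    return "full-reseed"
```

Encoding the default this way keeps incident responders from arguing the marginal case under pressure: unknown means reseed.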
2.2 Full reseed (default safe path)
For most production incidents, rebuild the replica from a fresh base backup.
- Stop replica PostgreSQL service.
- Preserve old data directory for short rollback window.
- Create new base backup from primary.
- Reconfigure replication settings.
- Start replica and verify streaming.
Example (adjust paths/service names):
# On replica host
sudo systemctl stop postgresql
sudo mv /var/lib/pgsql/data /var/lib/pgsql/data.pre-reseed.$(date +%F-%H%M)
pg_basebackup \
--host=<primary-host> \
--port=5432 \
--username=replicator \
--pgdata=/var/lib/pgsql/data \
--write-recovery-conf \
--checkpoint=fast \
--wal-method=stream \
--progress --verbose
sudo chown -R postgres:postgres /var/lib/pgsql/data
sudo systemctl start postgresql
If you use physical slots, bind this replica to the intended slot explicitly in primary_conninfo / primary_slot_name per your PostgreSQL version and deployment method.
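On PostgreSQL 12+, `--write-recovery-conf` appends `primary_conninfo` (and `primary_slot_name` when a slot is specified with `--slot`) to `postgresql.auto.conf`. If you manage those settings yourself instead, a small template helper keeps them consistent across replicas; the host, slot, and user values below are placeholders:

```python
def standby_settings(primary_host: str,
                     slot_name: str,
                     user: str = "replicator",
                     port: int = 5432) -> str:
    """Render the two standby settings a physical replica needs in
    postgresql.auto.conf (PostgreSQL 12+). Values are placeholders;
    real deployments usually add sslmode/application_name too."""
    return (
        f"primary_conninfo = 'host={primary_host} port={port} user={user}'\n"
        f"primary_slot_name = '{slot_name}'\n"
    )

print(standby_settings("pg-primary.internal", "odoo_replica_1"))
```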
Step 3 — Verify replica health before reintroducing read traffic
On replica:
select now(),
pg_is_in_recovery() as in_recovery,
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn(),
now() - pg_last_xact_replay_timestamp() as replay_delay;
On primary:
select application_name,
state,
sync_state,
write_lag,
flush_lag,
replay_lag
from pg_stat_replication
order by application_name;
Odoo-side smoke check before routing reads back:
odoocli status
odoocli logs tail --service odoo --since 15m --grep "psycopg2|OperationalError|could not receive data|read-only"
Reintroduce replica traffic gradually (for example: reporting first, then lower-risk read paths, then full pool).
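Gradual reintroduction is easier to operate with an explicit gate over recent replay-delay samples than with a one-off glance at the monitoring view. A minimal sketch; the 30-second threshold is an assumed SLO, tune it per deployment:

```python
def ready_for_reads(replay_delay_s: list[float],
                    max_delay_s: float = 30.0) -> bool:
    """Gate read reintroduction: require a non-empty sample set and
    every recent replay-delay sample under the threshold.
    max_delay_s is an assumed SLO, not a PostgreSQL default."""
    return bool(replay_delay_s) and all(d <= max_delay_s for d in replay_delay_s)
```

Feed it the `replay_delay` values collected during Step 3 verification; an empty sample set deliberately fails the gate.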
Safe rollback plan
If the new replica shows instability after reseed:
- Keep replica out of read routing.
- Continue single-primary mode for writes.
- Stop the replica and inspect errors (authentication, timeline, slot, network).
- If needed, rebuild again from a known-good primary backup rather than forcing partial state reuse.
Do not promote this replica during the incident unless failover is unavoidable and data freshness is explicitly validated.
Verification gates for incident closure
Repeat these checks over at least 20–30 minutes:
-- Primary: replica remains connected and streaming
select application_name, client_addr, state, sync_state
from pg_stat_replication
order by application_name;
-- Primary: lag stays bounded and stable
select application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes
from pg_stat_replication
order by lag_bytes desc;
# Odoo symptoms should stay quiet after read traffic restoration
odoocli logs tail --service odoo --since 30m --grep "timeout|could not serialize|OperationalError|connection"
Exit criteria:
- Replica is consistently streaming with bounded lag.
- Odoo error rate remains normal after read-path reintroduction.
- On-call has confidence failover posture is restored.
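The "bounded lag" exit criterion can be made mechanical: sample lag_bytes across the observation window and require every sample under a cap with no sustained growth. A sketch; both the cap and the trend heuristic are assumptions, not PostgreSQL-defined thresholds:

```python
def lag_is_bounded(samples: list[int], cap_bytes: int) -> bool:
    """Closure gate: all lag samples stay under cap_bytes and the last
    three samples do not form a strictly rising tail (a simple, assumed
    heuristic for 'no sustained growth')."""
    under_cap = all(s <= cap_bytes for s in samples)
    trending_up = (
        len(samples) >= 3
        and samples[-1] > samples[0]
        and samples[-1] > samples[-2] > samples[-3]
    )
    return under_cap and not trending_up
```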
Hardening and prevention checklist
- Configure WAL retention to match real outage tolerance (`wal_keep_size`, archival policy, storage budget).
- Use replication slots intentionally; remove orphan slots and cap runaway retention with `max_slot_wal_keep_size` where appropriate.
- Alert on replication lag growth and replica disconnect duration, not just binary up/down.
- Validate WAL archive restore path in drills (don’t wait for incident day).
- Document replica reseed SOP with host-specific commands and IAM/network prerequisites.
- After planned maintenance, verify replicas rejoin and catch up before closing change windows.
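Retention sizing can be grounded in a simple budget: the longest replica outage you intend to survive, times peak WAL generation rate, plus headroom. A back-of-envelope sketch; all rates, tolerances, and the headroom multiplier are illustrative assumptions:

```python
def wal_retention_budget_mb(outage_tolerance_min: float,
                            peak_wal_mb_per_min: float,
                            headroom: float = 1.5) -> int:
    """Minimum WAL retention (MB) needed to survive a replica outage of
    the given length at peak write rate, with a safety multiplier.
    Inputs are deployment-specific; measure your own WAL rate first."""
    return int(outage_tolerance_min * peak_wal_mb_per_min * headroom)

# e.g. survive a 60-minute replica outage at 40 MB/min of WAL:
print(wal_retention_budget_mb(60, 40))  # 3600
```

The result informs both `wal_keep_size` and a sensible `max_slot_wal_keep_size` ceiling against the disk budget.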
Practical references
- PostgreSQL replication settings (`wal_keep_size`, slots, sender/receiver behavior): https://www.postgresql.org/docs/current/runtime-config-replication.html
- PostgreSQL monitoring views (`pg_stat_replication`, `pg_replication_slots`): https://www.postgresql.org/docs/current/monitoring-stats.html
- PostgreSQL backup and `pg_basebackup`: https://www.postgresql.org/docs/current/app-pgbasebackup.html
- PostgreSQL warm standby and log shipping/streaming concepts: https://www.postgresql.org/docs/current/warm-standby.html
- Odoo deployment/operations baseline: https://www.odoo.com/documentation/17.0/administration/on_premise/deploy.html
Operational rule: if required WAL is gone, prioritize deterministic reseed over clever partial recovery attempts. Fast, clean replica recovery is safer than prolonged uncertainty under production load.