WORK IN PROGRESS

Sat Mar 21 2026 (UTC)

Odoo PostgreSQL max_wal_senders Exhaustion Runbook

Practical incident response for PostgreSQL WAL sender exhaustion in Odoo environments, with safe triage, containment, remediation, and hardening steps.

When max_wal_senders is exhausted, new replication sender processes cannot start. In Odoo production stacks, this can degrade read replicas, block fresh standbys from attaching, and reduce failover safety margins during peak traffic.

This runbook focuses on restoring healthy replication capacity without introducing write-path risk.

Incident signals

Treat this as an active incident if one or more of the following are true:

  • PostgreSQL or standby logs include errors like "number of requested standby connections exceeds max_wal_senders" (the FATAL a standby receives when no sender slot is free).
  • A replica that was healthy cannot reconnect after restart/network flap.
  • pg_stat_replication has missing expected replicas while Odoo read traffic shows staleness.
  • HA checks show replication degraded and failover readiness drops.

Step 0 — Stabilize first

  1. Keep Odoo writes on the known-good primary.
  2. Pause non-critical read-heavy jobs if stale replica data could hurt operations (large reporting exports, BI refreshes).
  3. Freeze topology changes (no ad-hoc failover/reparent actions) until sender capacity is restored.

Step 1 — Confirm WAL sender exhaustion and scope

Run on primary:

select now(),
       current_setting('max_wal_senders')::int as max_wal_senders,
       count(*) filter (where backend_type = 'walsender') as walsenders_in_use
from pg_stat_activity;

-- Per-sender streaming vs. catchup state lives in pg_stat_replication (next query);
-- pg_stat_activity only shows how many sender slots are occupied.

List replication clients:

select pid,
       usename,
       application_name,
       client_addr,
       state,
       sync_state,
       backend_start,
       state_change,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn
from pg_stat_replication
order by backend_start;

Check slot pressure and inactive slots (common hidden cause):

select slot_name,
       slot_type,
       active,
       restart_lsn,
       confirmed_flush_lsn
from pg_replication_slots
order by slot_name;
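Slot pressure can be quantified by comparing each slot's restart_lsn to the current WAL position; a forgotten slot shows up as a large retained-WAL figure. A sketch (the wal_status column requires PostgreSQL 13+):

```sql
-- How much WAL each slot forces the primary to retain (PostgreSQL 13+ for wal_status)
select slot_name,
       slot_type,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal
from pg_replication_slots
order by pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) desc nulls last;
```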

Triage checklist

  • Which replicas are business-critical for Odoo reads/failover SLA?
  • Any unexpected replication clients (old host, backup tooling, abandoned CDC connector)?
  • Any duplicate standby identities (application_name) creating confusion?
  • Any inactive logical/physical slots that should not exist?
  • Is this a transient reconnect storm or sustained capacity mismatch?
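The duplicate-identity check above can be automated with a grouping query (a sketch; it assumes each standby sets a distinct application_name in its primary_conninfo):

```sql
-- Flag standby identities that appear more than once
select application_name,
       count(*) as sessions,
       array_agg(client_addr) as client_addrs
from pg_stat_replication
group by application_name
having count(*) > 1;
```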

Step 2 — Contain impact while keeping primary safe

2.1 Identify and stop rogue/obsolete replication clients

Examples:

  • Decommissioned replica still retrying
  • Misconfigured backup/clone job repeatedly opening replication sessions
  • CDC process using slot/replication path unexpectedly

From the app/infra side, disable the offending service first, then verify that its sessions stop reappearing.
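One way to verify the offending client is really gone is to watch for freshly started sender sessions on the primary; any walsender with a recent backend_start after you disabled the service suggests it is still retrying (the 5-minute window is illustrative):

```sql
-- Senders that (re)connected in the last 5 minutes
select pid, application_name, client_addr, backend_start
from pg_stat_replication
where backend_start > now() - interval '5 minutes'
order by backend_start desc;
```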

2.2 Terminate stale WAL sender sessions (targeted, not broad kill)

-- Terminate only clearly stale/unwanted replication senders
select pg_terminate_backend(pid)
from pg_stat_replication
where application_name in ('old-replica-1', 'stale-backup-agent')
  and state in ('startup', 'catchup', 'backup');

Do not terminate healthy senders for your only HA replica unless you have confirmed redundancy.
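Before any pg_terminate_backend call, it is worth running the same predicate as a plain SELECT so you can eyeball exactly which sessions will be hit (application names here are illustrative, matching the example above):

```sql
-- Dry run: list what the termination statement would hit
select pid, application_name, client_addr, state, backend_start
from pg_stat_replication
where application_name in ('old-replica-1', 'stale-backup-agent')
  and state in ('startup', 'catchup', 'backup');
```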

2.3 If a replica is non-critical, keep it detached temporarily

  • Remove it from read routing.
  • Prevent reconnect loops until incident is resolved.
  • Prioritize at least one healthy replica for failover posture.
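One way to stop a detached replica's reconnect loop is a temporary reject rule in pg_hba.conf on the primary (address and user are placeholders; pg_hba.conf is matched top-down, so the reject line must appear before the permissive replication rule, and takes effect on reload, not restart):

```
# pg_hba.conf -- temporary, remove after the incident
# TYPE  DATABASE     USER        ADDRESS          METHOD
host    replication  replicator  203.0.113.45/32  reject
```

Apply with `select pg_reload_conf();` or `pg_ctl reload`.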

Step 3 — Recover expected replication topology

After freeing sender slots, restart/connect one replica at a time and verify before adding the next.

Replica-side quick check:

psql "$REPLICA_DB_URI" -c "select pg_is_in_recovery(), now(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
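The byte-position check can be complemented with a time-based lag estimate on the standby (a sketch; pg_last_xact_replay_timestamp() is null until the standby has replayed a commit, and the delta only shrinks while the primary is generating WAL):

```sql
-- Approximate replay delay on the standby
select now() - pg_last_xact_replay_timestamp() as replay_delay;
```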

Primary-side lag verification:

select application_name,
       client_addr,
       state,
       sync_state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) as lag_bytes
from pg_stat_replication
order by application_name;

If replicas reconnect but lag keeps growing, treat as a separate throughput incident (network/disk/IO/WAL generation).
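If lag does keep growing, the per-stage LSN differences help localize the bottleneck: a sent→write gap points at the network, write→flush at standby disk, and flush→replay at standby apply throughput. A sketch to run on the primary:

```sql
select application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_backlog_bytes,  -- WAL not yet sent
       pg_wal_lsn_diff(sent_lsn,  write_lsn)  as network_lag_bytes,
       pg_wal_lsn_diff(write_lsn, flush_lsn)  as standby_flush_lag_bytes,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) as standby_replay_lag_bytes
from pg_stat_replication;
```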

Step 4 — Capacity correction (post-containment)

If normal topology legitimately needs more sender capacity:

  1. Update max_wal_senders in PostgreSQL configuration to a value covering:
    • steady-state replicas
    • one maintenance/reseed path
    • short-lived overlap during failover/rejoin
  2. Apply through your change process.
  3. Restart PostgreSQL during an approved window (this setting is postmaster-level).
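A worked sizing example under the rules above (numbers are illustrative for a two-replica Odoo stack): 2 steady-state replicas + 1 reseed/maintenance path + 2 of failover/rejoin overlap, rounded up for headroom, gives:

```
# postgresql.conf (restart required)
max_wal_senders = 10        # 2 replicas + 1 reseed + 2 overlap, plus headroom
max_replication_slots = 10  # keep slot capacity in step with sender capacity
```

On PostgreSQL 12+, WAL sender slots are reserved separately from max_connections; on older versions, max_connections must be sized to accommodate them.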

Example inspection command:

show max_wal_senders;

After restart, re-check:

select current_setting('max_wal_senders') as max_wal_senders;

Rollback / safety plan

If remediation causes instability:

  1. Keep Odoo writes on primary, and keep non-essential read paths throttled.
  2. Re-enable only previously known-good replica connections.
  3. Revert recent replication-client config changes that introduced reconnect storms.
  4. If needed, roll back PostgreSQL config change via standard config management and restart window.

Verification before incident closure

Monitor for 15–30 minutes before closing:

-- Primary: expected replicas present and streaming
select application_name, state, sync_state, client_addr
from pg_stat_replication
order by application_name;
-- Primary: sender usage below hard limit with headroom
select current_setting('max_wal_senders')::int as max_wal_senders,
       count(*) filter (where backend_type = 'walsender') as active_walsenders
from pg_stat_activity;
# Odoo log scan for replica/read-path symptoms
odoocli logs tail --service odoo --since 20m --grep "replica|readonly|could not receive data|timeout"

Exit criteria:

  • Expected replicas are connected and stable.
  • WAL sender usage has headroom (not pinned at limit).
  • Odoo read paths are consistent and error rate is normal.

Hardening and prevention checklist

  • Capacity-plan max_wal_senders with failover/reseed overlap, not just steady state.
  • Alert when active WAL senders exceed 70–80% of configured limit.
  • Inventory and owner-tag every replication client (application_name, host, purpose).
  • Remove unused replication slots and retire decommissioned replica configs promptly.
  • Add runbook drill: replica reconnect after failover under peak write load.
  • Keep one reserved operational path for emergency re-seed/maintenance.
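The 70–80% alert threshold above is easy to encode in monitoring glue; a minimal sketch (function name and default threshold are my own, not from any specific tool):

```python
def walsender_usage(active: int, max_wal_senders: int,
                    threshold: float = 0.75) -> tuple[float, bool]:
    """Return (usage ratio, should_alert) for WAL sender capacity.

    Feed it the two numbers from the verification queries:
    active walsenders and the configured max_wal_senders.
    """
    if max_wal_senders <= 0:
        raise ValueError("max_wal_senders must be positive")
    ratio = active / max_wal_senders
    return ratio, ratio >= threshold
```

For example, 8 active senders against a limit of 10 yields a 0.8 usage ratio, which crosses the default 0.75 threshold and should fire the alert.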

Closing note

Operational rule: do not treat sender exhaustion as “just a replica issue.” In Odoo production, it is an HA readiness incident and should be handled with the same discipline as write-path degradation.
