Odoo PostgreSQL Synchronous Replication Commit Stall Runbook
A production-safe runbook for when Odoo writes freeze because PostgreSQL commits are waiting on synchronous standbys.
When Odoo starts timing out on writes (create, write, checkout, invoice posting) but PostgreSQL CPU and locks look normal, a common root cause is synchronous replication commit wait.
If a required standby is slow or unavailable, primary commits can block on SyncRep and user-facing writes stall.
This runbook prioritizes safe containment: confirm synchronous commit stall, reduce blast radius, apply reversible mitigation, verify recovery, then restore strict durability policy.
Incident signals
Treat this as an active incident when multiple signals persist:
- Odoo requests that perform writes time out, while read-only pages may still work.
- PostgreSQL sessions show wait_event = 'SyncRep'.
- pg_stat_replication shows sync standbys disconnected, lagging hard, or not in sync_state = 'sync'.
- Queue workers and cron jobs pile up behind database commits.
Step 0 — Stabilize application pressure
Before changing replication settings:
- Pause non-critical cron/import/bulk-write jobs.
- Keep only revenue/fulfillment-critical write paths active.
- Announce a temporary degraded-write incident state to operators.
This prevents unbounded backlog growth while you triage.
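Pausing non-critical scheduled actions can be done directly in the database. A minimal sketch, assuming a recent Odoo schema where the `ir_cron` table carries a stored `cron_name` column; the allowlist value is hypothetical and must be matched to your instance:

```shell
# Sketch: pause all Odoo scheduled actions except an allowlist of critical ones.
# The allowlist names are hypothetical examples; adjust them to your instance.
ALLOWLIST="'Mail: Email Queue Manager'"

PAUSE_SQL="update ir_cron set active = false
  where active
    and cron_name not in (${ALLOWLIST});"

# Review the statement (and record which jobs you pause) before applying.
echo "$PAUSE_SQL"
# psql "$ODOO_DB_URI" -c "$PAUSE_SQL"   # uncomment to apply
```

Capture the list of paused cron IDs so the same set can be re-enabled after the incident closes.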
Step 1 — Confirm commit wait is the bottleneck
# Sessions currently blocked on synchronous replication commit wait
psql "$ODOO_DB_URI" -c "
select
pid,
usename,
application_name,
state,
wait_event_type,
wait_event,
now() - xact_start as txn_age,
left(query, 140) as query
from pg_stat_activity
where datname = current_database()
and wait_event = 'SyncRep'
order by xact_start asc;
"
# Replication health from primary perspective
psql "$ODOO_DB_URI" -c "
select
application_name,
client_addr,
state,
sync_state,
sync_priority,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as sent_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) as write_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) as flush_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
from pg_stat_replication
order by sync_priority nulls last, application_name;
"
# Current sync settings (capture before any mitigation)
psql "$ODOO_DB_URI" -c "
show synchronous_commit;
show synchronous_standby_names;
"
Triage checklist
- Do blocked sessions consistently show wait_event = 'SyncRep'?
- Is a required synchronous standby absent or far behind?
- Did this start right after standby maintenance, network change, or AZ issue?
- Is synchronous_standby_names stricter than currently available standbys?
If yes, move to controlled mitigation.
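Whether synchronous_standby_names is stricter than the standbys actually available can be checked mechanically. A rough sketch: the `required_quorum` helper below is an illustration that only handles the common `ANY n (...)` / `FIRST n (...)` forms, not a full parser of the setting's grammar:

```shell
# Extract the required standby count from common synchronous_standby_names forms.
# Handles 'ANY n (...)' and 'FIRST n (...)'; a bare name list defaults to 1.
required_quorum() {
  case "$1" in
    "ANY "*|"FIRST "*) echo "$1" | awk '{print $2}' ;;
    *) echo 1 ;;
  esac
}

# Usage against a live primary (assumes $ODOO_DB_URI as in the steps above):
# NAMES=$(psql "$ODOO_DB_URI" -Atc "show synchronous_standby_names;")
# HAVE=$(psql "$ODOO_DB_URI" -Atc "select count(*) from pg_stat_replication where sync_state in ('sync','quorum');")
# NEED=$(required_quorum "$NAMES")
# [ "$HAVE" -ge "$NEED" ] || echo "ALERT: only $HAVE of $NEED required sync standbys connected"
```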
Step 2 — Validate standby side quickly
On each expected synchronous standby:
# Check receiver/apply health on standby
# (PostgreSQL 13+; on 12 and older, written_lsn/flushed_lsn are a single received_lsn column)
psql "$STANDBY_DB_URI" -c "
select status, conninfo, written_lsn, flushed_lsn, latest_end_lsn, latest_end_time
from pg_stat_wal_receiver;
"
psql "$STANDBY_DB_URI" -c "
select now() - pg_last_xact_replay_timestamp() as replay_delay;
"
If the standby is unhealthy (network split, full disk, crashed postgres process), fix it first, provided recovery is expected within your write SLA. If it is not quickly recoverable, apply temporary primary-side mitigation instead.
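When automating the replay-delay check above, it helps to reduce it to a pass/fail signal. A hedged sketch; the 30-second default threshold is illustrative and should be tuned to your write SLA:

```shell
# Classify a replay delay (in whole seconds) against an incident threshold.
# The 30s default is an illustrative choice, not a recommendation for every fleet.
replay_delay_ok() {
  delay_s=$1
  threshold_s=${2:-30}
  [ "$delay_s" -le "$threshold_s" ]
}

# Usage (assumes $STANDBY_DB_URI as above):
# DELAY=$(psql "$STANDBY_DB_URI" -Atc \
#   "select coalesce(extract(epoch from now() - pg_last_xact_replay_timestamp()), 0)::int;")
# replay_delay_ok "$DELAY" || echo "standby replay lag ${DELAY}s exceeds threshold"
```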
Step 3 — Mitigate in safest reversible order
Option A (preferred short-term): relax commit strictness per session/app tier
If your deployment lets you scope database session options (for example per worker role or per connection string), temporarily set a less strict commit mode for non-critical workers first.
-- Example (run in controlled session/user scope where possible)
set synchronous_commit = local;
Use this approach when you can scope blast radius to lower-criticality workloads.
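One practical way to scope this is PostgreSQL's role-level settings: if queue workers connect as a dedicated role, relaxing only that role leaves interactive and revenue-critical writes fully synchronous. A sketch assuming a hypothetical `odoo_queue` worker role (new sessions pick up the setting; existing sessions keep theirs):

```shell
# Relax commit durability only for a dedicated worker role.
# 'odoo_queue' is a hypothetical role name; substitute your actual worker role.
ROLE=${WORKER_ROLE:-odoo_queue}
RELAX_SQL="alter role ${ROLE} set synchronous_commit = local;"
RESET_SQL="alter role ${ROLE} reset synchronous_commit;"

echo "$RELAX_SQL"
# psql "$ODOO_DB_URI" -c "$RELAX_SQL"   # apply during the incident
# psql "$ODOO_DB_URI" -c "$RESET_SQL"   # roll back once standbys are healthy
```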
Option B (incident-wide): temporarily reduce sync standby requirement
If writes are broadly down and business impact is severe, make a time-boxed change on primary:
# Example: require any 1 available sync standby instead of an unavailable named set
psql "$ODOO_DB_URI" -c "alter system set synchronous_standby_names = 'ANY 1 (standby_a,standby_b,standby_c)';"
psql "$ODOO_DB_URI" -c "select pg_reload_conf();"
If no synchronous standby is currently healthy and you must restore writes immediately, use a temporary emergency setting:
psql "$ODOO_DB_URI" -c "alter system set synchronous_commit = local;"
psql "$ODOO_DB_URI" -c "select pg_reload_conf();"
Safety note: this reduces durability guarantees; if the primary fails before replicas receive WAL, recently acknowledged commits can be lost. Record exact start/end timestamps for this reduced-protection window.
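Recording that window can be as simple as appending UTC markers to an incident log. A minimal sketch; the log path is a placeholder for your actual incident tooling:

```shell
# Append start/end markers of the reduced-durability window to an incident log.
# The log path is a placeholder; wire this into your incident tooling in practice.
LOG=${INCIDENT_LOG:-/tmp/syncrep-incident.log}

mark_window() {
  # $1 is 'start' or 'end'
  printf '%s durability-reduced %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" >> "$LOG"
}

mark_window start
# ... mitigation active ...
mark_window end
```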
Step 4 — Verify mitigation actually restored service
# SyncRep waits should drop quickly after effective mitigation
psql "$ODOO_DB_URI" -c "
select count(*) as sync_rep_waiters
from pg_stat_activity
where datname = current_database()
and wait_event = 'SyncRep';
"
# Odoo-side timeout/error signal check
odoocli logs tail --service odoo --since 15m --grep "timeout|could not serialize|cursor already closed|Connection refused"
# Backlog trend check (example for queue_job installations)
psql "$ODOO_DB_URI" -c "
select state, count(*)
from queue_job
group by state
order by count(*) desc;
"
Incident stabilization criteria:
- SyncRep waiters trend to near zero.
- Odoo write paths recover within normal latency/SLA.
- Queue/cron backlog starts draining instead of growing.
Step 5 — Roll back emergency settings safely
After standby health is confirmed stable for a sustained window (for example 30–60 minutes):
# Restore the baseline durability policy captured in Step 1 (values below are examples)
psql "$ODOO_DB_URI" -c "alter system set synchronous_commit = 'on';"
psql "$ODOO_DB_URI" -c "alter system set synchronous_standby_names = 'FIRST 1 (standby_a,standby_b)';"
psql "$ODOO_DB_URI" -c "select pg_reload_conf();"
Then verify:
psql "$ODOO_DB_URI" -c "show synchronous_commit; show synchronous_standby_names;"
psql "$ODOO_DB_URI" -c "select application_name, state, sync_state from pg_stat_replication;"
If write latency regresses after rollback, revert to the temporary mitigation and re-open the incident (do not force full policy under unstable replica conditions).
Hardening and prevention checklist
- Monitor the pg_stat_activity count of wait_event = 'SyncRep' and alert on sustained non-zero values.
- Alert on replica disconnects and lag bytes for standbys listed in synchronous_standby_names.
- Keep synchronous_standby_names aligned with real topology changes (hostnames, priorities, decommissioned nodes).
- Run quarterly failover drills that include synchronous replica loss scenarios.
- Define a pre-approved incident policy: when to switch to ANY 1 (...) vs temporary synchronous_commit = local.
- Ensure Odoo cron/import workloads can be quickly paused to reduce write pressure during replication incidents.
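The first checklist item can be implemented as a tiny probe wired into existing monitoring. A sketch, assuming the same $ODOO_DB_URI convention used throughout this runbook; the probe's exit status drives the alert:

```shell
# Alert probe: non-zero exit when any session is waiting on SyncRep.
# A threshold of zero is intentionally strict; page on sustained non-zero counts.
syncrep_waiters_ok() {
  waiters=$1
  [ "$waiters" -eq 0 ]
}

# Usage:
# COUNT=$(psql "$ODOO_DB_URI" -Atc \
#   "select count(*) from pg_stat_activity where wait_event = 'SyncRep';")
# syncrep_waiters_ok "$COUNT" || echo "ALERT: $COUNT sessions blocked on SyncRep"
```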
Operator references
- PostgreSQL documentation: synchronous replication (synchronous_commit, synchronous_standby_names, pg_stat_replication).
- PostgreSQL monitoring views: pg_stat_activity, pg_stat_replication, pg_stat_wal_receiver.
The key rule: restore service first with explicit, time-boxed durability trade-offs, then return to strict synchronous policy as soon as replica health is proven.