Odoo PostgreSQL Hot Standby Conflict Incident Runbook
A production-safe runbook for Odoo teams to triage and mitigate PostgreSQL hot standby query cancellations on read replicas without blindly trading availability for bloat risk.
When Odoo reporting or BI workloads hit a PostgreSQL read replica, you can see sudden query failures like:
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
This usually means WAL replay on the standby needed to remove row versions (or take locks) that a running query still depended on, so PostgreSQL canceled the query to let recovery proceed. If left untreated, teams often overcorrect (hot_standby_feedback=on everywhere), which can reduce cancellations but creates table/index bloat pressure on the primary.
This runbook focuses on safe triage first, then controlled mitigation.
Incident signals
Treat this as active when one or more conditions are true:
- Odoo dashboards/reports intermittently fail only when routed to replica.
- PostgreSQL standby logs show repeated recovery-conflict cancellations.
- Replica query success drops during write-heavy windows on primary.
- pg_stat_database_conflicts counters rise quickly on the standby.
# Application-side signal check
odoocli logs tail --service odoo --since 30m --grep "conflict with recovery|read replica|canceling statement|OperationalError"
# Standby PostgreSQL log signal check
odoocli logs tail --service postgres-replica --since 30m --grep "conflict with recovery|canceling statement"
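The grep filters above can be mirrored in monitoring code. A minimal sketch (the function names are illustrative, not part of any Odoo or PostgreSQL API) that classifies log lines as recovery-conflict signals and computes a rough conflict rate:

```python
import re

# Patterns mirror the grep filters used in the signal checks above.
CONFLICT_PATTERNS = [
    re.compile(r"canceling statement due to conflict with recovery"),
    re.compile(r"User query might have needed to see row versions"),
    re.compile(r"OperationalError"),
]

def is_recovery_conflict_signal(line: str) -> bool:
    """Return True if the log line looks like a hot-standby conflict event."""
    return any(p.search(line) for p in CONFLICT_PATTERNS)

def conflict_rate(lines: list[str]) -> float:
    """Fraction of log lines matching conflict signals (0.0 to 1.0)."""
    if not lines:
        return 0.0
    return sum(is_recovery_conflict_signal(l) for l in lines) / len(lines)
```

Feeding a recent log window into `conflict_rate` gives a single number to alert on instead of eyeballing tail output.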
Step 0 - Stabilize user impact
- Freeze non-essential analytics/report deploys.
- Route business-critical read paths temporarily to primary (carefully rate-limited) if replica cancellations are widespread.
- Pause ad-hoc long-running report jobs on replica.
- Keep one incident operator for DB parameter changes; avoid parallel tuning.
If primary is already near CPU/IO saturation, do not mass-reroute all reads to primary. Prioritize only critical user paths.
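The rerouting rule above can be made explicit in routing code. A hedged sketch, with hypothetical names and an assumed 85% saturation threshold:

```python
def may_reroute_to_primary(path_critical: bool,
                           primary_cpu_pct: float,
                           primary_io_util_pct: float,
                           saturation_threshold: float = 85.0) -> bool:
    """Decide whether a replica-failing read path may fail over to the primary.

    Encodes the Step 0 rule: when the primary is near CPU/IO saturation,
    only business-critical paths get rerouted; with headroom, any path may.
    """
    primary_saturated = (primary_cpu_pct >= saturation_threshold
                         or primary_io_util_pct >= saturation_threshold)
    if primary_saturated:
        return path_critical
    return True
```

In practice this gate would sit in the read-routing layer, fed by the same CPU/IO metrics your dashboards already collect.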
Step 1 - Confirm conflict type and blast radius
Run on the standby:
-- High-level conflict counters by database
select
datname,
confl_tablespace,
confl_lock,
confl_snapshot,
confl_bufferpin,
confl_deadlock,
(confl_tablespace + confl_lock + confl_snapshot + confl_bufferpin + confl_deadlock) as total_conflicts
from pg_stat_database_conflicts
where datname = current_database();
-- Replay delay and WAL receive/replay state
select
now() as observed_at,
pg_last_wal_receive_lsn() as receive_lsn,
pg_last_wal_replay_lsn() as replay_lsn,
now() - pg_last_xact_replay_timestamp() as replay_delay;
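The receive/replay LSNs returned above can also be expressed as a byte lag. A small helper (an assumption for illustration, not part of PostgreSQL; on the server you can use pg_wal_lsn_diff() instead) that parses the textual `X/X` LSN format:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN like '16/B374D848' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replay_lag_bytes(receive_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL received but not yet replayed on the standby."""
    return lsn_to_bytes(receive_lsn) - lsn_to_bytes(replay_lsn)
```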
-- Identify long-running queries on standby that are likely cancellation candidates
select
pid,
usename,
application_name,
client_addr,
now() - query_start as query_age,
wait_event_type,
wait_event,
left(query, 160) as query
from pg_stat_activity
where state = 'active'
and backend_type = 'client backend'
order by query_start asc
limit 30;
Interpretation guide:
- Rising confl_snapshot during heavy vacuum cleanup windows -> readers are holding snapshots too long.
- Rising confl_lock -> replay blocked by lock conflicts from WAL-applied changes.
- Large replay delay + frequent cancellations -> standby cannot satisfy both fast replay and long reads with current settings/workload.
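The interpretation above is easy to automate by sampling pg_stat_database_conflicts twice a few minutes apart. A sketch (field names match the view's columns; the sampling interval and dict shape are assumptions):

```python
def dominant_conflict(before: dict, after: dict) -> str:
    """Return the conflict counter with the largest increase between two
    samples of pg_stat_database_conflicts, or 'none' if nothing rose."""
    keys = ("confl_snapshot", "confl_lock", "confl_bufferpin",
            "confl_deadlock", "confl_tablespace")
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in keys}
    top = max(deltas, key=deltas.get)
    return top if deltas[top] > 0 else "none"
```

A result of `confl_snapshot` points at long-held reader snapshots (Step 2.1), while `confl_lock` points at replay lock contention.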
Step 2 - Apply safe immediate mitigations
2.1 Cap runaway read duration in Odoo/reporting sessions
Use role/database-level timeouts for replica-facing report users so single queries cannot occupy snapshot windows indefinitely.
-- Example: reporting role on standby
alter role odoo_reporting set statement_timeout = '90s';
alter role odoo_reporting set idle_in_transaction_session_timeout = '30s';
If you use a dedicated replica database entry:
alter database odoo_replica set statement_timeout = '90s';
2.2 Tune standby delay budget (short-term)
On standby, increase cancellation tolerance in a controlled way:
-- Requires superuser; use ALTER SYSTEM carefully, then reload
alter system set max_standby_streaming_delay = '60s';
alter system set max_standby_archive_delay = '60s';
select pg_reload_conf();
This allows longer query grace before cancellation, but can increase replica staleness. Track replay delay closely.
2.3 Use hot_standby_feedback surgically (not blindly)
show hot_standby_feedback;
If currently off and cancellations are severe, you can temporarily enable it on the standby:
alter system set hot_standby_feedback = 'on';
select pg_reload_conf();
Important: this can increase dead tuple retention on primary. Treat as temporary mitigation unless you validate bloat impact.
Step 3 - Primary-side safety checks after mitigation
Run on primary after changing standby behavior:
-- Watch dead tuples and autovacuum effectiveness on Odoo-heavy tables
select
schemaname,
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
last_autovacuum,
autovacuum_count
from pg_stat_user_tables
order by n_dead_tup desc
limit 25;
-- Table/index size growth hotspots
select
n.nspname,
c.relname,
pg_size_pretty(pg_total_relation_size(c.oid)) as total_size
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind = 'r'
and n.nspname not in ('pg_catalog', 'information_schema')
order by pg_total_relation_size(c.oid) desc
limit 25;
If dead tuples or relation size growth accelerate after enabling hot_standby_feedback, plan to disable it and move reporting jobs to shorter chunks/materialized extracts.
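That go/no-go decision can be scripted against the pg_stat_user_tables output above. An illustrative guardrail (the 20% dead-tuple budget and minimum table size are assumptions to tune per environment):

```python
def bloat_offenders(rows, dead_pct_budget=20.0, min_rows=10_000):
    """rows: iterable of (relname, n_live_tup, n_dead_tup) tuples.

    Returns tables whose dead-tuple percentage exceeds the budget,
    worst first, ignoring tiny tables where the ratio is mostly noise.
    """
    offenders = []
    for relname, live, dead in rows:
        total = live + dead
        if total < min_rows:
            continue  # tiny tables: skip, percentage is unstable
        dead_pct = 100.0 * dead / total
        if dead_pct >= dead_pct_budget:
            offenders.append((relname, round(dead_pct, 2)))
    return sorted(offenders, key=lambda r: r[1], reverse=True)
```

If this list keeps growing while hot_standby_feedback is on, that is the signal to disable it and restructure the reporting workload instead.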
Step 4 - Verification and service recovery
4.1 Confirm cancellations have dropped
-- Re-check conflict counters after 15-30 minutes
select
datname,
confl_lock,
confl_snapshot,
confl_bufferpin,
confl_deadlock
from pg_stat_database_conflicts
where datname = current_database();
odoocli logs tail --service odoo --since 15m --grep "conflict with recovery|canceling statement|ERROR|Traceback"
4.2 Confirm replica freshness is acceptable
select now() - pg_last_xact_replay_timestamp() as replay_delay;
Define an SLO (example: < 30s for operational dashboards). If delay breaches SLO persistently, reduce standby delay settings or offload heavy queries differently.
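"Persistently" is worth defining precisely so that a single large-transaction replay spike does not page anyone. A sketch of the SLO check, assuming periodic replay-delay samples and requiring several consecutive breaches (all thresholds are examples):

```python
def slo_breached(delay_samples_s, slo_s=30.0, consecutive=3):
    """delay_samples_s: recent replay-delay measurements in seconds,
    oldest first. Breach only counts when the SLO is exceeded for
    `consecutive` samples in a row."""
    streak = 0
    for d in delay_samples_s:
        streak = streak + 1 if d > slo_s else 0
        if streak >= consecutive:
            return True
    return False
```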
4.3 Functional checks
- Critical Odoo list/report pages load successfully against replica.
- No sustained error spike from replica-routed read endpoints.
- Primary CPU/IO and autovacuum activity remain within normal incident bounds.
Rollback strategy
If mitigation causes new risk (primary bloat or unacceptable replica lag):
- Revert temporary standby settings:
alter system set hot_standby_feedback = 'off';
alter system set max_standby_streaming_delay = '30s';
alter system set max_standby_archive_delay = '30s';
select pg_reload_conf();
- Keep strict statement timeouts for replica report users.
- Route only essential low-latency reads to replica; run heavy reports on controlled windows or separate analytics stack.
Hardening and prevention checklist
- Enforce replica-specific statement_timeout and idle_in_transaction_session_timeout for all reporting roles.
- Track pg_stat_database_conflicts as first-class alert metrics.
- Alert on replica replay delay and define per-workload freshness SLOs.
- Avoid long snapshot readers in app code (paginate, chunk, pre-aggregate).
- Review autovacuum thresholds/cost settings on primary to avoid cleanup bursts colliding with massive report windows.
- If hot_standby_feedback is enabled long-term, add bloat budget alerts and periodic table/index bloat audits.
- Separate operational reads from heavy BI/reporting workloads where possible.
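The "paginate, chunk, pre-aggregate" item above is the single most effective prevention. A keyset-pagination sketch: instead of one multi-minute scan, the report reads fixed-size chunks keyed by id, so each statement holds its snapshot only briefly. The query text and the `execute()` callable are illustrative assumptions, not Odoo APIs:

```python
# Assumed table/columns for illustration only.
CHUNK_SQL = """
    select id, amount_total
    from sale_order
    where id > %(last_id)s
    order by id
    limit %(chunk)s
"""

def chunked_report(execute, chunk_size=5000):
    """execute(sql, params) -> list of (id, amount) rows.
    Yields every row, one short-lived statement per chunk."""
    last_id = 0
    while True:
        rows = execute(CHUNK_SQL, {"last_id": last_id, "chunk": chunk_size})
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # keyset cursor: resume after the last seen id
```

Each chunk is an independent, fast query, so the standby never has to choose between replay progress and a long-lived reader snapshot.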
References
- PostgreSQL Documentation: Hot Standby conflict handling and recovery conflict behavior.
- PostgreSQL Documentation: pg_stat_database_conflicts and monitoring statistics views.
- PostgreSQL Documentation: runtime replication settings (max_standby_streaming_delay, max_standby_archive_delay, hot_standby_feedback).
- Odoo Documentation: PostgreSQL deployment/operations guidance and worker/query tuning considerations.
Core principle: first constrain runaway read behavior, then tune standby conflict tolerance with explicit lag and bloat guardrails.