WORK IN PROGRESS

Sat Mar 21 2026

Odoo PostgreSQL NOTIFY Queue Saturation and Long-Polling Bus Runbook

A production-safe runbook for Odoo incidents where PostgreSQL LISTEN/NOTIFY queue pressure degrades long-polling and real-time features.

When Odoo real-time features degrade (Discuss updates delayed, POS sync lag, chat presence stale) while database load is otherwise moderate, one common root cause is PostgreSQL notification queue pressure behind Odoo's bus long-polling flow.

This runbook focuses on safe recovery order: confirm queue saturation, stop notification floods, clear long transactions that block queue cleanup, restore long-polling health, then harden to prevent recurrence.

Incident signals

Treat this as active when multiple signals appear together:

  • Odoo users report delayed or missing real-time updates.
  • /longpolling/poll becomes slow or intermittently fails.
  • PostgreSQL logs warn about NOTIFY queue usage or queue cleanup delays.
  • pg_notification_queue_usage() remains elevated instead of returning to a low baseline.

Step 0 — Stabilize first (reduce non-essential publish traffic)

Before changing DB processes:

  1. Pause non-critical high-chatter automations (mass mail campaigns, webhook fan-out, noisy custom bus publishers).
  2. Defer bulk imports that trigger many message notifications.
  3. Keep business-critical paths only.
  4. Assign one owner for DB actions and one for Odoo app-level traffic controls.
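If the UI is unresponsive and you must inventory scheduled actions at the database level, a query along these lines can help. Column names vary across Odoo versions (the cron's display name in particular lives on a related table in older releases), so verify against your schema first, and prefer the Scheduled Actions UI when it is usable.

```sql
-- Illustrative inventory of active scheduled actions; verify column
-- names against your Odoo version before running any updates.
select id, active, interval_number, interval_type, nextcall
from ir_cron
where active
order by nextcall;
```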

Step 1 — Confirm queue pressure and blast radius

# Check Odoo logs for long-polling and bus errors (adapt command to your deployment)
odoocli logs tail --service odoo --since 30m --grep "longpoll|bus|notify|timeout|503"
# Queue usage ratio (0.0 to 1.0). Sustained growth is the danger signal.
psql "$ODOO_DB_URI" -c "select pg_notification_queue_usage() as notify_queue_usage;"
# Sample queue usage repeatedly to see trend (every 5s for 1 minute)
for i in {1..12}; do
  psql "$ODOO_DB_URI" -Atc "select now(), round(pg_notification_queue_usage()::numeric, 4);"
  sleep 5
done
# Find oldest open transactions in the Odoo DB (often block NOTIFY queue cleanup)
psql "$ODOO_DB_URI" -c "
select
  pid,
  usename,
  application_name,
  client_addr,
  state,
  now() - xact_start as xact_age,
  wait_event_type,
  wait_event,
  left(query, 180) as query
from pg_stat_activity
where datname = current_database()
  and xact_start is not null
order by xact_start asc
limit 25;
"
# Check whether long-polling endpoint is degraded
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" https://<odoo-host>/longpolling/poll

Triage checklist

  • Is pg_notification_queue_usage() climbing or stuck high? In most environments, a ratio above 0.2 that keeps rising is already a warning.
  • Did a custom module/job start publishing unusually high bus volume?
  • Are there long-running transactions (especially idle in transaction) that can block cleanup?
  • Is degradation isolated to real-time paths while standard read/write flows remain mostly normal?

If yes, continue with controlled remediation.
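The first triage question can be encoded as a small heuristic. A minimal sketch, assuming you have collected chronological samples of pg_notification_queue_usage() (for example from the sampling loop above); the 0.2 warning level and 1.5x rise factor are assumptions to tune per environment:

```python
def queue_trend(samples, warn_level=0.2, rise_factor=1.5):
    """Classify a series of notification queue usage ratios (0.0-1.0).

    Illustrative heuristic, not an official metric: thresholds should be
    tuned to your own baseline before wiring this into alerting.
    """
    if len(samples) < 2:
        return "insufficient-data"
    first, last = samples[0], samples[-1]
    if last >= warn_level and last >= first * rise_factor:
        return "rising"      # sustained growth above the warning level
    if last >= warn_level:
        return "elevated"    # high but not clearly climbing
    return "normal"
```

A "rising" result argues for continuing with the controlled remediation below; "elevated" alone may simply mean cleanup has not yet caught up.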

Step 2 — Controlled remediation (safe order)

2.1 Stop new notification flood sources

  • Pause or throttle noisy cron jobs and custom publishers.
  • Temporarily disable non-essential integrations that post high-frequency updates.
  • Keep critical transaction paths active.

Do this first so cleanup can catch up.

2.2 Recycle long-polling workers (app-side pressure relief)

Recycle Odoo workers serving long-polling/bus to drop stale listeners and re-establish clean sessions.

# Systemd example
sudo systemctl restart odoo

# Container example
docker restart <odoo_container>

If you run separate long-polling workers/processes, restart only that tier first to reduce blast radius.

2.3 Clear unsafe long transactions in PostgreSQL

Prefer pg_cancel_backend, which cancels the session's current query, before pg_terminate_backend, which ends the whole session and rolls back its open transaction.

-- First attempt: graceful cancel
select pg_cancel_backend(<pid>);
-- If still stuck and approved by incident owner: terminate
select pg_terminate_backend(<pid>);

Use backend termination only for non-critical sessions that are clearly blocking recovery.
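To shortlist candidates, filtering on idle-in-transaction age works well. The 10-minute threshold below is an assumption to tune per environment; review each row with the incident owner before cancelling anything.

```sql
-- Illustrative: sessions idle in transaction for more than 10 minutes,
-- the usual blockers of NOTIFY queue cleanup. Review before acting.
select pid, usename, application_name,
       now() - xact_start as xact_age,
       left(query, 120) as last_query
from pg_stat_activity
where datname = current_database()
  and state = 'idle in transaction'
  and now() - xact_start > interval '10 minutes'
order by xact_start;
```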

2.4 Re-measure queue trend after each change

psql "$ODOO_DB_URI" -c "select pg_notification_queue_usage() as notify_queue_usage;"

If usage keeps rising after source throttling + worker recycle + targeted backend cleanup, escalate to a controlled failover/restart plan.

Step 3 — Verification before incident close

# Queue should return near baseline and remain stable
psql "$ODOO_DB_URI" -Atc "select now(), round(pg_notification_queue_usage()::numeric, 4);"
# Ensure no runaway old transactions remain
psql "$ODOO_DB_URI" -c "
select pid, state, now() - xact_start as xact_age, left(query, 140) as query
from pg_stat_activity
where datname = current_database()
  and xact_start is not null
order by xact_start asc
limit 10;
"
# Endpoint and application validation
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" https://<odoo-host>/longpolling/poll
odoocli logs tail --service odoo --since 20m --grep "longpoll|bus|notify|timeout|error"

Exit incident mode only when:

  • Queue usage is low and not trending upward.
  • Long-polling response times are back within normal SLO.
  • User-visible real-time features recover (Discuss/POS presence/live updates).
  • No recurring surge appears after re-enabling paused jobs in phases.

Rollback / safety path

If remediation increases impact:

  1. Stop further backend terminations immediately.
  2. Keep only critical business traffic enabled.
  3. Revert aggressive worker-count changes made during incident response.
  4. Restore previous app config and proceed with controlled DB failover/restart window.
  5. Keep noisy publishers disabled until root cause is documented.

Hardening and prevention checklist

  • Add alerting on pg_notification_queue_usage() trend (not just single threshold).
  • Enforce idle_in_transaction_session_timeout to prevent old sessions from lingering.
  • Set sane statement_timeout and lock_timeout per role to reduce stuck transactions.
  • Add rate limits/batching in custom modules that publish to Odoo bus channels.
  • Separate long-polling workers from heavy HTTP workers so queue spikes do not cascade.
  • Instrument long-polling endpoint latency and error rate with paging thresholds.
  • Load-test real-time event bursts before releasing modules with chat/POS/live dashboards.
  • During upgrades, temporarily reduce non-essential publishers until steady state is confirmed.
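The rate-limit/batching bullet can be sketched as a small coalescing publisher. This is a minimal sketch, not Odoo API code: `send` stands in for whatever your module calls to publish on the bus, the injectable clock exists only to make the behavior testable, and the flush interval is an assumption to tune per channel.

```python
import time


class BatchingPublisher:
    """Coalesce bus payloads and flush at most once per min_interval seconds.

    Illustrative sketch: `send` is a stand-in for your module's actual
    publish call; this class does not touch any real Odoo API.
    """

    def __init__(self, send, min_interval=1.0, now=time.monotonic):
        self.send = send                # callable taking a list of payloads
        self.min_interval = min_interval
        self.now = now                  # injectable clock for testing
        self.pending = []
        self.last_flush = None

    def publish(self, payload):
        """Buffer a payload; flush if the interval has elapsed."""
        self.pending.append(payload)
        t = self.now()
        if self.last_flush is None or t - self.last_flush >= self.min_interval:
            self.flush(t)

    def flush(self, t=None):
        """Send all buffered payloads as one batch."""
        if not self.pending:
            return
        self.send(self.pending)
        self.pending = []
        self.last_flush = self.now() if t is None else t
```

Callers that previously emitted one notification per record now produce one batched publish per interval, which is exactly the pressure reduction the NOTIFY queue needs during bursts.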

Practical rule: reduce publish rate first, clean old transactions second, and only then escalate to restart/failover.
