Odoo PostgreSQL Page Corruption and Checksum Incident Runbook
A production-safe runbook to triage PostgreSQL page corruption signals behind Odoo outages, contain blast radius, recover data safely, and prevent repeat incidents.
When PostgreSQL reports errors like invalid page in block, could not read block, or checksum failures, this is a data-integrity incident, not just a performance issue.
For Odoo, corruption can surface as random 500s on specific models, failed cron jobs, broken exports, or replica replay failures.
This runbook is operator-first: preserve evidence, isolate affected objects, recover with the smallest safe change, and only then reopen write traffic.
Incident signals
Treat this as a Sev-1/Sev-2 data-integrity incident when one or more appear:
- PostgreSQL logs contain `invalid page in block`, `checksum mismatch`, `could not read block`, or `PANIC: could not locate a valid checkpoint record`.
- Odoo endpoints fail consistently for one model/table while others still work.
- Read replicas stop replay with page-level read/checksum errors.
- `pg_dump` fails on specific tables with block read errors.
- Query behavior becomes non-deterministic (the same query intermittently fails).
Step 0 — Contain first, then diagnose
- Pause non-critical Odoo writes (imports, bulk cron, connector sync jobs).
- Keep one operator on DB actions and one on timeline/decision logging.
- Preserve current logs before rotation.
- Do not run broad `VACUUM FULL`, `REINDEX DATABASE`, or ad-hoc filesystem edits.
# Snapshot recent PostgreSQL + Odoo errors for incident notes
odoocli logs tail --service postgres --since 30m --grep "invalid page|checksum|could not read block|PANIC"
odoocli logs tail --service odoo --since 30m --grep "psycopg2|ERROR|Internal Server Error"
Step 1 — Identify exact blast radius (relation, fork, block)
Extract relation and block identifiers from logs first.
# Example direct database log pull if odoocli logs is unavailable
sudo journalctl -u postgresql --since "30 min ago" | grep -E "invalid page|checksum|could not read block"
A typical log line includes `relation base/<db_oid>/<relfilenode>, block <N>`.
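The relation path and block number can be pulled out of a captured error line with standard text tools. A minimal sketch, assuming bash; the sample line and all identifiers in it are hypothetical, and the exact wording varies by PostgreSQL version:

```shell
# Hypothetical error line captured from the logs; real wording varies.
line='ERROR:  invalid page in block 1184 of relation base/16384/24601'

# Extract the block number and the base/<db_oid>/<relfilenode> path.
block=$(printf '%s\n' "$line" | sed -n 's/.*invalid page in block \([0-9]\{1,\}\) of relation.*/\1/p')
path=$(printf '%s\n' "$line" | sed -n 's|.*relation \(base/[0-9]\{1,\}/[0-9]\{1,\}\).*|\1|p')
db_oid=$(printf '%s' "$path" | cut -d/ -f2)
relfilenode=$(printf '%s' "$path" | cut -d/ -f3)

echo "db_oid=$db_oid relfilenode=$relfilenode block=$block"
```

The extracted `relfilenode` feeds directly into the catalog lookup below.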
Map that to a table/index:
select
c.oid,
n.nspname as schema,
c.relname,
c.relkind,
pg_relation_filenode(c.oid) as filenode
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where pg_relation_filenode(c.oid) = <RELFILE_NODE>;
If the corrupted object is an index (relkind = 'i'), recovery is usually faster/safer than heap-table corruption.
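Once the mapping query returns a `relkind`, routing to the smallest-scope remediation can be sketched as a simple decision table (the `relkind` value below is a placeholder, not output from a real incident):

```shell
relkind='i'   # placeholder: substitute the relkind returned by the catalog query

# Route to the smallest safe remediation based on object type.
case "$relkind" in
  i) action="rebuild index (Step 3.1)";;
  r) action="targeted heap restore from clone (Step 3.2)";;
  t) action="treat TOAST corruption as heap corruption on the owning table";;
  *) action="unusual relkind '$relkind': escalate before changing anything";;
esac
echo "$action"
```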
Step 2 — Triage checklist before remediation
- Is corruption isolated to one index/table, or multiple unrelated objects?
- Are fresh backups and WAL archive continuity confirmed?
- Is a clean replica/clone available for targeted data extraction?
- Have you frozen risky write-heavy jobs in Odoo?
- Did you capture exact error strings, relfilenode, and first-seen timestamp?
If backups or WAL continuity are uncertain, escalate immediately before making destructive changes.
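WAL archive continuity can be spot-checked from filenames alone, since segment names increase monotonically. A sketch assuming bash and the default 16 MB `wal_segment_size` (0x100 segments per "log" part of the name); the listing and archive path are fabricated for illustration:

```shell
# Hypothetical archive listing; in practice: ls /path/to/wal_archive
wal_files='000000010000000A000000FD
000000010000000A000000FE
000000010000000A000000FF
000000010000000B00000000
000000010000000B00000002'

prev=-1
gaps=0
for f in $(printf '%s\n' "$wal_files" | sort); do
  log=$((16#${f:8:8}))    # middle 8 hex digits of the segment name
  seg=$((16#${f:16:8}))   # last 8 hex digits
  idx=$((log * 256 + seg))
  if [ "$prev" -ge 0 ] && [ "$idx" -ne $((prev + 1)) ]; then
    echo "gap before $f"
    gaps=$((gaps + 1))
  fi
  prev=$idx
done
echo "gaps=$gaps"
```

Any nonzero gap count means PITR past that point cannot be trusted; escalate before destructive changes.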
Step 3 — Safe remediation path (smallest scope first)
3.1 If corruption is index-only: rebuild index concurrently
Index corruption can often be remediated without touching heap (table) data at all.
-- Confirm index identity first
select schemaname, tablename, indexname
from pg_indexes
where indexname = '<suspect_index_name>';
-- Rebuild with minimal write blocking
reindex index concurrently <schema>.<suspect_index_name>;
If concurrent rebuild is not possible in your version/constraints, schedule a maintenance window and use non-concurrent REINDEX INDEX.
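In a review-then-run workflow, it helps to generate the rebuild statement plus a post-rebuild sanity query as text first. A sketch; the index name is a placeholder, and the check exploits the fact that a failed `REINDEX CONCURRENTLY` can leave an invalid index behind:

```shell
schema='public'
index='stock_move_pkey'   # placeholder: substitute the suspect index

# Statement to run during the fix.
rebuild_sql="REINDEX INDEX CONCURRENTLY ${schema}.${index};"
# A failed CONCURRENTLY run can leave an invalid index; list any afterwards.
check_sql="SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;"

echo "$rebuild_sql"
echo "$check_sql"
```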
3.2 If heap/table page corruption is suspected: isolate + restore targeted data
- Create a recovery clone from latest valid backup + WAL (PITR sidecar).
- Compare row counts/checksums around affected business keys.
- Restore only affected rows/partitions where possible.
# On recovery clone: export clean subset
pg_dump "$RECOVERY_DB_URI" \
--data-only \
--table=public.stock_move \
--column-inserts \
--file=/tmp/stock_move_repair.sql
# Review SQL, then apply in production during controlled window
psql "$ODOO_DB_URI" -f /tmp/stock_move_repair.sql
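Before the production apply, a quick sanity check on the exported repair file catches an empty or surprising export. A sketch; the sample rows are fabricated stand-ins for real `pg_dump --column-inserts` output:

```shell
repair_sql=/tmp/stock_move_repair.sql
# Fabricated sample content standing in for the pg_dump export above.
cat > "$repair_sql" <<'EOF'
INSERT INTO public.stock_move (id, name) VALUES (1, 'MOVE/001');
INSERT INTO public.stock_move (id, name) VALUES (2, 'MOVE/002');
EOF

# --column-inserts emits one INSERT per row, so counting them approximates
# the number of rows the apply step will write.
rows=$(grep -c '^INSERT INTO public\.stock_move' "$repair_sql")
echo "rows to apply: $rows"
[ "$rows" -gt 0 ] || { echo "empty repair file, aborting apply"; exit 1; }
```

An empty file usually means the clone-side query matched nothing, so the apply should not run at all.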
If corruption is broad across critical tables, execute full PITR cutback/cutover runbook instead of piecemeal repair.
3.3 Validate physical integrity after immediate fix
Use amcheck for B-Tree indexes and checksum tooling where supported.
create extension if not exists amcheck;
-- Check one critical index
select bt_index_check('public.stock_move_pkey'::regclass, true);
# PostgreSQL cluster checksum validation (offline requirement on many setups)
# Run only with PostgreSQL stopped, or on detached volume/snapshot clone.
pg_checksums --check -D "$PGDATA"
Step 4 — Verification and rollback criteria
4.1 Verification (must pass before incident close)
-- Confirm no queries are piling up on IO waits (a proxy for repeated read failures)
select now(), state, wait_event_type, wait_event, query
from pg_stat_activity
where state <> 'idle'
order by query_start asc
limit 20;
odoocli logs tail --service postgres --since 20m --grep "invalid page|checksum|could not read block|PANIC"
odoocli logs tail --service odoo --since 20m --grep "psycopg2|ERROR|Traceback"
Close incident only after:
- No new corruption/checksum log lines during a sustained observation window (for example 30+ minutes).
- Affected Odoo workflows complete successfully (UI + API + cron path).
- Replica replay is healthy (if replicas exist).
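The log-quiet criterion above can be scripted so the observation-window check is repeatable rather than eyeballed. A sketch over a captured log slice; the sample lines are fabricated healthy output:

```shell
LOG_SLICE=/tmp/pg_recent.log
# Fabricated healthy sample; in practice, capture the last 30+ minutes of logs.
cat > "$LOG_SLICE" <<'EOF'
LOG:  checkpoint complete: wrote 412 buffers
LOG:  automatic vacuum of table "odoo.public.stock_move"
EOF

# Same signatures as the alerting patterns used elsewhere in this runbook.
if grep -Eq 'invalid page|checksum|could not read block|PANIC' "$LOG_SLICE"; then
  verdict='fail'
else
  verdict='pass'
fi
echo "observation window: $verdict"
```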
4.2 Rollback / escalation logic
- If errors persist after index rebuild, stop and treat as table/filesystem/storage corruption.
- If new objects start failing, escalate to full node replacement + PITR cutover.
- If storage layer is suspected (EBS/local NVMe faults), migrate PostgreSQL to healthy volume/instance before reopening writes.
Do not repeatedly retry writes against known-corrupted pages; that increases user-visible failures and can complicate recovery.
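When the storage layer is suspected, kernel logs usually show block-device errors before PostgreSQL does. A sketch of the pattern check; the dmesg lines below are fabricated, and the device names are placeholders:

```shell
DMESG_SLICE=/tmp/dmesg_recent.txt
# Fabricated sample; in practice: dmesg --ctime | tail -n 500 > "$DMESG_SLICE"
cat > "$DMESG_SLICE" <<'EOF'
blk_update_request: I/O error, dev nvme0n1, sector 81920
EXT4-fs error (device nvme0n1p1): ext4_find_entry:1455: inode #524291
EOF

io_errors=$(grep -Ec 'I/O error|Buffer I/O error|EXT4-fs error' "$DMESG_SLICE")
echo "storage-layer error lines: $io_errors"
```

Any hits here shift the incident from database repair to volume/instance replacement before writes reopen.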
Step 5 — Hardening and prevention checklist
- Enable data checksums for new clusters (or during planned migration).
- Run periodic `amcheck` on critical Odoo indexes during low-traffic windows.
- Enforce regular restore drills (table-level and full PITR) with timing SLAs.
- Alert on PostgreSQL log patterns: `invalid page`, `checksum`, `could not read block`, `PANIC`.
- Keep firmware/kernel/storage driver patching policy documented.
- Validate backup chain integrity and WAL archive completeness daily.
- Maintain runbook for fast promotion to known-good replica.
Example lightweight weekly index integrity sweep:
create extension if not exists amcheck;
select bt_index_check(i.indexrelid, true)
from pg_index i
join pg_class c on c.oid = i.indexrelid
join pg_am am on am.oid = c.relam
join pg_namespace n on n.oid = c.relnamespace
where am.amname = 'btree'   -- bt_index_check supports only B-Tree indexes
  and n.nspname = 'public'
  and c.relname like 'stock_%';
Practical references
- PostgreSQL routine database maintenance (`REINDEX`, `VACUUM`, ops context): https://www.postgresql.org/docs/current/routine-maintenance.html
- PostgreSQL `REINDEX`: https://www.postgresql.org/docs/current/sql-reindex.html
- PostgreSQL `amcheck` extension: https://www.postgresql.org/docs/current/amcheck.html
- PostgreSQL checksums (`pg_checksums`): https://www.postgresql.org/docs/current/app-pgchecksums.html
- PostgreSQL backup + PITR fundamentals: https://www.postgresql.org/docs/current/continuous-archiving.html
- Odoo on-prem deployment and operations: https://www.odoo.com/documentation/17.0/administration/on_premise/deploy.html
Core principle: corruption incidents are integrity incidents. Contain writes, prove blast radius, recover from known-good data, and verify before normal traffic resumes.