Postmortem: Database failover failure — 4h customer-impacting outage
Date: Tuesday, [date]
Duration: 14:02 – 18:14 PT (4h 12m)
Severity: P1 (customer-facing, revenue-impacting)
Author: [SRE on-call] · Reviewed by: [Eng director, Customer team]
TL;DR
Our primary Postgres instance experienced a hardware failure at 14:02 PT. The configured automatic failover to the standby did not trigger because the health check we relied on had been silently misconfigured during a deploy three weeks earlier. Manual promotion was then held up by an unrelated security-group rule. We promoted the standby at 18:09, at which point writes resumed, and the application was fully healthy by 18:14. Approximately 87% of write requests during the window failed; reads degraded but mostly succeeded due to read-replica caching. No customer data was lost. Estimated revenue impact: $42K in failed transactions, all of which were retried successfully by the upstream Stripe flow.
Timeline (PT)
- 14:02 — Primary Postgres EC2 instance becomes unresponsive (hardware fault, confirmed by AWS event log)
- 14:03 — RDS proxy starts returning connection errors. Datadog fires db-primary-down alert.
- 14:04 — Expected failover does not occur (we discover later why)
- 14:07 — On-call SRE acknowledges, begins triage. Sees primary is down, expects failover-in-progress.
- 14:14 — On-call confirms standby is NOT being promoted. Pages secondary on-call + database lead.
- 14:18 — Status page updated to 'investigating'. Customer-facing error message updated.
- 14:31 — Team identifies that the health check controlling failover (/internal/db-health) is misconfigured (calls the wrong endpoint, never returns unhealthy)
- 14:42 — Decision made to manually promote standby. Database lead pulled into incident.
- 14:55 — Decision delayed pending review of latest read-replica lag (need to ensure no data loss). Read replica is 0 seconds behind. Cleared to promote.
- 16:32 — Manual promotion blocked by an unrelated security-group rule preventing the standby from advertising itself. Team works through SG change.
- 17:48 — Security group updated; standby promotion can proceed.
- 18:09 — Standby promoted to primary. Writes resume.
- 18:14 — Application fully healthy. Status page updated to 'monitoring'.
- 19:01 — All systems normal. Incident closed.
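The 14:55 lag review can be expressed as a simple gate. This is a minimal sketch, not the procedure we ran: the function name `cleared_to_promote` and the default threshold of zero seconds are assumptions chosen to match the "0 seconds behind" reading above. On a Postgres standby, one common way to estimate replay lag is `now() - pg_last_xact_replay_timestamp()`; how the number is obtained is left outside the sketch.

```python
from typing import Optional


def cleared_to_promote(
    replay_lag_seconds: Optional[float],
    max_lag_seconds: float = 0.0,  # assumed threshold: promote only when fully caught up
) -> bool:
    """Decide whether the standby is caught up enough to promote.

    `replay_lag_seconds` is None when the lag could not be measured.
    An unknown or negative reading (for example, from clock skew) is
    treated as "not cleared" rather than guessed at.
    """
    if replay_lag_seconds is None:
        return False
    return 0 <= replay_lag_seconds <= max_lag_seconds
```

Treating an unmeasurable lag as "not cleared" keeps the gate conservative: a human has to make the call when the measurement itself is in doubt.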
Customer impact
- Affected customers: approximately 4,200 (all writers on the platform between 14:02-18:14)
- Failed requests: 87% of write requests in window (read traffic degraded 30% but mostly succeeded via read-replica)
- Revenue impact: $42K in failed Stripe transactions; all subsequently retried successfully
- Data loss: None — read-replica was 0 sec behind at the time of promotion
- CSAT impact: 14 customer-support tickets filed; 6 high-value customers contacted by CSM proactively
Root cause
The automatic failover health check that triggers RDS standby promotion was misconfigured during a deploy on [date 3 weeks prior]. A refactor of the health-check endpoint caused the deployed /internal/db-health to always return 200 OK regardless of database state. Because the check never returned 'unhealthy,' the failover automation never triggered.
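For contrast with the broken behavior, here is a minimal sketch of a health check whose status depends on the database's actual state. It is not our production handler: `health_status` and `probe` are hypothetical names, and `probe` stands in for whatever performs the real connectivity check (for example, running `SELECT 1` with a short timeout).

```python
from typing import Callable, Tuple


def health_status(probe: Callable[[], None]) -> Tuple[int, str]:
    """Return (HTTP status, body) for a database health endpoint.

    `probe` is a hypothetical callable that returns normally when the
    database answers and raises on any failure (refused connection,
    timeout, query error).
    """
    try:
        probe()
    except Exception as exc:
        # Any probe failure must surface as "unhealthy"; swallowing it
        # here is what would make the endpoint always return 200.
        return 503, f"unhealthy: {type(exc).__name__}"
    return 200, "ok"
```

The property that matters for failover is that the 503 branch is reachable. In the regression described above, no database state could produce anything other than 200.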
5-whys
- Why did writes fail? Primary DB unresponsive.
- Why didn't we fail over? Standby promotion didn't trigger.
- Why didn't promotion trigger? Health check never returned 'unhealthy.'
- Why was the health check broken? A refactor changed its behavior; no test caught it.
- Why did no test catch it? We had no integration test that verified the health-check endpoint returned the right value for a bad DB state, only for a good one.
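The missing test can be sketched as a two-direction contract. This is an illustration under assumptions: it supposes the health handler can be called with an injectable probe, and every name here (`contract_violations`, `always_ok_handler`, the stub probes) is hypothetical. `always_ok_handler` mimics the regression so the sketch shows the failure-direction assertion catching it.

```python
from typing import Callable, List, Tuple

Probe = Callable[[], None]
Handler = Callable[[Probe], Tuple[int, str]]


def healthy_probe() -> None:
    """Stub for a database that answers."""


def failing_probe() -> None:
    """Stub for a database that is unreachable."""
    raise ConnectionError("simulated: database unreachable")


def contract_violations(handler: Handler) -> List[str]:
    """Exercise a health handler in both directions and list what failed."""
    problems = []
    status, _ = handler(healthy_probe)
    if status != 200:
        problems.append(f"healthy DB returned {status}, expected 200")
    status, _ = handler(failing_probe)
    if status != 503:
        problems.append(f"unreachable DB returned {status}, expected 503")
    return problems


def always_ok_handler(probe: Probe) -> Tuple[int, str]:
    """Mimics the regression: the probe is never consulted."""
    return 200, "ok"


if __name__ == "__main__":
    # The happy-path half passes; only the failure-direction half objects.
    print(contract_violations(always_ok_handler))
```

A suite containing only the first assertion would have passed against the broken handler, which matches what happened to us.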
Contributing factors
- The SG rule that delayed manual promotion was a separate latent issue we discovered only during incident response.
- Our runbook for manual standby promotion was 18 months out of date.
- The secondary on-call took 7 minutes to acknowledge — within SLA but at the high end.
What went well
- Detection at 14:03 (within 1 min of failure)
- Read replica had no lag, so promotion was data-safe
- Communication via status page was honest and frequent (10 updates total)
Action items
| # | Action | Owner | Due |
| 1 | Add integration test: health endpoint must return 503 when DB is unreachable | [Eng lead] | 7 days |
| 2 | Update standby-promotion runbook + dry-run once a quarter | [SRE lead] | 14 days |
| 3 | Add monitoring on the health check itself ('check-the-checker') | [Platform] | 14 days |
| 4 | Audit all health-check endpoints for the same flaw | [Eng lead] | 30 days |
| 5 | Fix the SG rule that delayed promotion + add to disaster-recovery drill | [Platform] | 30 days |
| 6 | Add a chaos test that kills the primary DB monthly | [SRE] | 60 days |
| 7 | Customer follow-up: send $42K transaction list to finance; ensure all retried successfully | [Customer ops] | 7 days |
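One possible shape for action item 3 is to compare the health check against an independent signal and alert when they disagree. The sketch below assumes such a signal exists (for example, a direct connection attempt from the monitoring host); the function name, the sample format, and the threshold of three consecutive disagreements are all illustrative assumptions, not a decided design.

```python
from typing import Iterable, Tuple


def checker_is_suspect(
    samples: Iterable[Tuple[int, bool]],
    threshold: int = 3,  # assumed: consecutive disagreements before alerting
) -> bool:
    """Flag a health check that keeps reporting 200 while the DB is down.

    Each sample is (health_check_status, db_reachable), where
    `db_reachable` comes from an independent probe. Returns True once
    `threshold` consecutive samples show the check claiming healthy
    while the independent probe says the database is unreachable.
    """
    streak = 0
    for status, db_reachable in samples:
        if status == 200 and not db_reachable:
            streak += 1
            if streak >= threshold:
                return True
        else:
            # Agreement (or an honest 503) resets the streak.
            streak = 0
    return False
```

Requiring consecutive disagreements is a trade-off: it tolerates a single probe blip at the cost of a slightly later alert.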
Lessons
For SREs: any health check that drives critical automation must be tested in the failure direction, not just the happy path. We tested 'returns 200 when healthy.' We did not test 'returns 503 when unhealthy.'
For engineering: any refactor of a controlling-automation endpoint should require manual review by the SRE on-call rotation before merge. We're adding a CODEOWNERS rule.
For incident commanders: when an expected automation doesn't fire, treat that as a separate sub-incident. Investigate the automation in parallel with the original symptom. We lost ~15 minutes assuming failover was 'just slow.'
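The "automation didn't fire" condition can itself be made an alert rather than something the on-call has to notice. A minimal sketch, assuming we can record when the primary went down and when (if ever) a failover started; the function name and the two-minute grace period are hypothetical and would need tuning against how long a normal failover takes.

```python
from datetime import datetime, timedelta
from typing import Optional


def failover_overdue(
    primary_down_at: datetime,
    failover_started_at: Optional[datetime],
    now: datetime,
    grace: timedelta = timedelta(minutes=2),  # assumed grace period
) -> bool:
    """True when the primary has been down longer than `grace`
    and no failover has been observed to start."""
    if failover_started_at is not None:
        return False
    return now - primary_down_at > grace
```

With a rule like this, the question "is failover just slow?" gets an explicit deadline, after which the missing failover is paged as its own problem.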