Real Incident Report Examples

See how ProdRescue AI turns production chaos into structured postmortems. Real logs, real analysis.

Kubernetes Crash Loop Postmortem

P1
MTTR 18 min · Peak error 92%

This Kubernetes CrashLoopBackOff postmortem documents a production incident in which checkout-api pods entered a crash loop after the v2.15.0 deploy. The root cause was a nil pointer dereference in PaymentService.Process() when the Redis session cache was unavailable. The incident lasted 18 minutes and affected approximately 8,000 users. Timeline: 09:00:02 PagerDuty alert, 09:00:20 checkout-api connection refused, 09:00:35 rollback initiated, 09:01:15 all pods healthy. Detection, response, and rollback completed in under 90 seconds. This postmortem includes 5 Whys analysis, a prevention checklist, and log evidence.

View Report
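The root cause above, a nil pointer dereference when the session cache is unavailable, maps to a simple defensive pattern: guard the cache handle and fall back instead of crashing the pod. A minimal Python sketch (the real service and its PaymentService.Process() signature are unspecified, so the names and fallback shape here are illustrative assumptions):

```python
class PaymentService:
    """Illustrative sketch: tolerate a missing or unreachable
    session cache instead of dereferencing it unconditionally."""

    def __init__(self, cache=None):
        # cache may be None (or an unreachable client) during a Redis outage
        self.cache = cache

    def process(self, session_id: str) -> str:
        session = None
        if self.cache is not None:
            session = self.cache.get(session_id)
        if session is None:
            # Cold path: rebuild a minimal session rather than crashing.
            # (Hypothetical fallback shape, not the real service's schema.)
            session = {"id": session_id, "source": "fallback"}
        return session["source"]
```

With the guard in place, `PaymentService(cache=None).process("s1")` takes the fallback path instead of raising, which is exactly the crash-loop trigger the postmortem describes.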

Redis Cluster Failure RCA

P1
MTTR 2.5 h · Peak error

This Redis cluster failure root cause analysis covers a 2.5-hour degraded-performance incident. The primary node crashed with an OOM caused by memory fragmentation, and replica failover was delayed by 8 minutes, losing roughly 180,000 sessions. The resulting cache stampede exhausted the database connection pool. Root cause: unbounded cache key growth (40M keys, no TTL). Timeline: 11:38 memory warning, 11:42 primary crash, 11:52 replica ready, 14:12 resolved. Includes 5 Whys analysis, a prevention checklist, and log evidence.

View Report
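The fix implied by this RCA's root cause, unbounded key growth with no TTL, is to attach an expiry to every cache write so the key population is bounded by write rate times TTL. With real Redis that is SETEX or EXPIRE; the in-memory sketch below (class name and lazy-eviction strategy are assumptions, not the incident's actual fix) shows the same idea with an injectable clock:

```python
import time

class TTLCache:
    """Illustrative in-memory cache where every key carries a TTL,
    so expired entries are evicted instead of accumulating forever.
    (In Redis the equivalent is SETEX / EXPIRE on every write.)"""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self._store = {}            # key -> (value, expires_at)

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]    # lazily evict the expired key
            return default
        return value
```

A bounded key count also softens the cache-stampede risk described above: a mass eviction of stale keys is cheaper than carrying 40M live-forever entries until the node OOMs.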

Stripe API Timeout Incident

P1
MTTR 47 min · Peak error 100%

This Stripe API timeout incident report documents a 47-minute payment outage caused by a circuit breaker misconfiguration combined with connection pool exhaustion. Roughly 12,000 users were affected, with an estimated $340K in lost revenue. Timeline: 14:23 first alert, 14:32 pool exhausted, 14:48 rollback decision, 15:10 resolved. Root cause: the circuit breaker was too tolerant (10 failures per 30 s) after the Stripe timeout was raised from 5 s to 15 s. Includes 5 Whys analysis, a prevention checklist, and payment gateway failure recovery steps.

View Report
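The report's root cause, a breaker that tolerated 10 failures per 30 s while each failure now held a connection for up to 15 s, comes down to tuning a sliding-window failure threshold. A minimal counting-breaker sketch (the tighter 3-failures-per-10 s threshold below is an assumed illustration, not the incident's actual remediation values):

```python
import time

class CircuitBreaker:
    """Illustrative sliding-window circuit breaker: once max_failures
    occur inside the window, further requests are rejected until old
    failures age out. Tighter thresholds trip the breaker before slow
    upstream calls can drain the connection pool."""

    def __init__(self, max_failures=3, window_seconds=10.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window_seconds
        self.clock = clock          # injectable for testing
        self.failures = []          # timestamps of recent failures

    def allow_request(self) -> bool:
        cutoff = self.clock() - self.window
        # Drop failures that have aged out of the window.
        self.failures = [t for t in self.failures if t > cutoff]
        return len(self.failures) < self.max_failures

    def record_failure(self):
        self.failures.append(self.clock())
```

The design point the incident illustrates: breaker thresholds and upstream timeouts are coupled. Raising the Stripe timeout from 5 s to 15 s tripled the worst-case time each failing call held a connection, so the failure budget needed to shrink accordingly.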

Database Connection Pool Exhaustion

P2
MTTR 45 min · Peak error 15%

This database connection pool postmortem covers PostgreSQL connection exhaustion during a traffic spike. A slow analytical query (running 8+ minutes) combined with a connection leak in order-sync-worker v1.2.0 left all 200 connections held; P99 latency reached 30 s and 15% of requests failed. Timeline: 16:15 traffic spike, 16:22 pool exhausted, 16:28 slow query killed and worker scaled, 17:07 resolved. Root cause: no statement_timeout and a connection leak in an error path. Includes 5 Whys analysis and a prevention checklist.

View Report
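The leak described above, a connection released on the happy path but not on the error path, is the classic case for a try/finally (or context-manager) checkout. A small pool sketch (the pool class and string stand-ins for connections are illustrative, not the worker's actual code):

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    """Illustrative bounded pool whose checkout always returns the
    connection, even when the caller raises. The postmortem's leak
    was an error path that skipped the release call."""

    def __init__(self, size: int):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")   # stand-ins for real connections

    @contextmanager
    def connection(self, timeout: float = 1.0):
        conn = self._free.get(timeout=timeout)  # blocks if pool exhausted
        try:
            yield conn
        finally:
            self._free.put(conn)          # released on success AND on error

    def available(self) -> int:
        return self._free.qsize()
```

This addresses only the leak half of the root cause; the other half is server-side. Setting a PostgreSQL statement_timeout ensures a runaway 8-minute analytical query is cancelled rather than holding one of the 200 connections indefinitely.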

Your next incident deserves the same analysis. Generate your report in 2 minutes.

Try Free