Redis Cluster Failure: Incident Report
May 5, 2026 · Prepared for: [Your Organization]
Severity: P1
Service outage: 2.5h
Peak error rate: n/a
Users impacted: ~180,000 sessions
Status: Resolved
On March 3, 2025, at 11:42 UTC, the primary node of our production Redis cluster failed, leading to 2.5 hours of degraded performance. The primary crashed due to severe memory fragmentation following a large key eviction event. Failover was delayed because the replica was still syncing a large RDB snapshot. During the failover window, all cached sessions were lost, forcing ~180,000 users to re-authenticate. After failover completed, a cache stampede drove enough load to the primary PostgreSQL database to exhaust its connection pool. The root cause was insufficient Redis memory headroom combined with a cache key design that allowed unbounded growth.
The primary root cause was memory fragmentation in the Redis primary. A poorly designed cache key pattern (user activity feed) had grown to ~40M keys with no TTL. When memory pressure triggered eviction, Redis began evicting large hash objects. Eviction itself added fragmentation because the allocator could not coalesce the freed memory, so resident memory kept growing even as keys were removed. With usage already at the configured maxmemory limit, the process was then killed by the OOM killer.
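The fragmentation and headroom conditions described above can be watched from the memory section of Redis INFO output. The sketch below is illustrative only: it parses INFO-style text and compares used_memory_rss with used_memory, both of which are real INFO fields. The 1.5 threshold and the sample numbers are assumptions chosen for the example, not values taken from this incident.

```python
def parse_info(info_text: str) -> dict[str, str]:
    """Parse the 'key:value' lines of Redis INFO output into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_report(info_text: str, frag_threshold: float = 1.5) -> dict:
    """Summarise fragmentation and headroom from INFO memory fields.

    frag_threshold is an assumed rule of thumb, not a Redis default.
    """
    fields = parse_info(info_text)
    used = int(fields["used_memory"])        # bytes allocated by Redis
    rss = int(fields["used_memory_rss"])     # bytes resident as seen by the OS
    maxmemory = int(fields.get("maxmemory", "0"))  # 0 means no limit configured
    ratio = rss / used
    return {
        "fragmentation_ratio": round(ratio, 2),
        "fragmented": ratio > frag_threshold,
        "headroom": None if maxmemory == 0 else round(1 - used / maxmemory, 2),
    }


# Illustrative sample, not captured from the incident.
SAMPLE_INFO = """\
# Memory
used_memory:8000000000
used_memory_rss:14000000000
maxmemory:8589934592
"""

report = memory_report(SAMPLE_INFO)
```

In this sample the process holds far more resident memory than Redis accounts for, and less than 10% of maxmemory remains free, which is the kind of state the 30% headroom action item is meant to prevent.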
A contributing factor was the failover delay. The replica was 8 minutes behind the primary due to a large RDB snapshot transfer. During this window, Redis was unavailable for writes, and all in-memory session data was lost. We had not configured Redis to use AOF (append-only file) for session data.
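A replica that is mid-sync or far behind is a poor failover target, so it helps to alert on that state while the primary is still healthy. The following is a minimal sketch of such a check, assuming the replica's INFO replication fields have already been parsed into a dict. The field names (role, master_link_status, master_sync_in_progress, master_last_io_seconds_ago) are real INFO fields. The function name, the sample values, and the use of master_last_io_seconds_ago as a rough proxy for lag are assumptions. The 60-second default mirrors the replica lag threshold in the action items.

```python
def replica_lag_alerts(fields: dict[str, str], max_lag_seconds: int = 60) -> list[str]:
    """Return human-readable alerts for a replica's INFO replication fields."""
    if fields.get("role") != "slave":
        return ["node does not report role:slave; run this check against a replica"]

    alerts = []
    if fields.get("master_sync_in_progress") == "1":
        alerts.append("full sync in progress: replica is not a safe failover target yet")
    if fields.get("master_link_status") != "up":
        alerts.append("replication link to primary is down")

    last_io = int(fields.get("master_last_io_seconds_ago", "-1"))
    if last_io < 0:
        alerts.append("no replication I/O recorded")
    elif last_io > max_lag_seconds:
        alerts.append(f"last replication I/O was {last_io}s ago (limit {max_lag_seconds}s)")
    return alerts


# Hypothetical samples for illustration.
HEALTHY = {
    "role": "slave",
    "master_link_status": "up",
    "master_sync_in_progress": "0",
    "master_last_io_seconds_ago": "1",
}
SYNCING = {
    "role": "slave",
    "master_link_status": "up",
    "master_sync_in_progress": "1",
    "master_last_io_seconds_ago": "492",  # 8m 12s expressed in seconds
}

healthy_alerts = replica_lag_alerts(HEALTHY)
syncing_alerts = replica_lag_alerts(SYNCING)
```

A check like this would have flagged the replica as unsafe to promote before the primary crashed, rather than during the failover itself.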
The cache stampede was a secondary effect. When Redis came back online, every request was a cache miss. Thousands of requests simultaneously queried PostgreSQL for session and user data, exhausting the connection pool.
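Request coalescing is one way to blunt a stampede like this: concurrent misses for the same key share a single database load instead of each opening their own query. The sketch below shows the idea with threads. The SingleFlight class, the key name, and load_session_from_db are hypothetical stand-ins, not the services' actual code, and a real implementation would also write the loaded value back to Redis before releasing the waiters.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class SingleFlight:
    """Coalesce concurrent loads for the same key into one call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: dict[str, Future] = {}

    def do(self, key: str, loader):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                # First caller for this key becomes the leader.
                fut = Future()
                self._inflight[key] = fut
        if leader:
            try:
                fut.set_result(loader())
            except Exception as exc:
                fut.set_exception(exc)  # followers see the same failure
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
        # Leader returns immediately; followers block until the leader finishes.
        return fut.result()


calls = 0
calls_lock = threading.Lock()


def load_session_from_db():
    """Stand-in for a slow PostgreSQL query (hypothetical)."""
    global calls
    with calls_lock:
        calls += 1
    time.sleep(0.3)
    return {"user_id": 42}


flight = SingleFlight()
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(flight.do, "session:42", load_session_from_db) for _ in range(20)]
    results = [f.result() for f in futures]
```

Here twenty concurrent requests for the same session are served by the work of a single loader call, so the number of database queries scales with distinct keys rather than with request volume.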
| Priority | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| P1 | Fix cache key design: Add TTL to all cache keys | @backend | 2024-03-10 | Open |
| P1 | Enable AOF for Redis session persistence | @platform | 2024-03-08 | Open |
| P2 | Increase Redis memory: 30% headroom | @platform | 2024-03-06 | Open |
| P2 | Cache stampede protection: request coalescing | @backend | 2024-03-15 | Open |
| P3 | Failover monitoring: alert on replica lag > 60s | @sre | 2024-03-09 | Open |
Detection (11:38–11:42 UTC): Redis memory crossed 90% at 11:38. Primary OOM crash at 11:42. Alert fired within 30 seconds. MTTD: ~4 minutes from first warning to full outage.
Response (11:42–11:52 UTC): Platform team engaged. Replica promoted but 8 minutes behind. Manual failover decision. Session cache invalidated: all users logged out. Auth service overwhelmed by DB fallback.
Resolution (11:52–14:12 UTC): Replica finished sync at 11:52. Cache stampede began: DB connection pool exhausted. Read replicas added at 12:15. Memory fragmentation addressed by 12:48. All caches warm by 14:12. MTTR: 2.5 hours.
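The durations quoted in the timeline can be cross-checked against its timestamps. This is a small arithmetic sketch, assuming all times fall on the same UTC day. The helper name is illustrative.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> float:
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


warning_to_crash_min = minutes_between("11:38", "11:42")    # first warning to OOM crash
crash_to_sync_min = minutes_between("11:42", "11:52")       # crash to replica sync complete
crash_to_recovery_h = minutes_between("11:42", "14:12") / 60  # crash to all caches warm
```

The first and last values correspond to the reported MTTD of about 4 minutes and MTTR of 2.5 hours.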
[ERROR] Redis primary OOM killed. Replica lag: 8m 12s. Promoting replica...
[WARN] auth-service Session validation falling back to DB - cache miss rate: 94%
[ERROR] PostgreSQL connection pool exhausted (200/200). P99 latency: 5200ms