Redis Cluster Failure Incident Report — Feb 25, 2026
Generated on: Feb 25, 2026, 11:41 PM · Prepared for: [Your Organization]
Severity: P1
Service Outage Duration: 2.5h
Peak error rate: —
Users impacted: ~180,000 sessions
Service Status: Resolved
On March 3, 2025, at 11:42 UTC, our production Redis cluster experienced a failure of its primary node that led to a 2.5-hour period of degraded performance. The primary crashed due to severe memory fragmentation following a large key eviction event. Failover to the replica was delayed because the replica was still syncing a large RDB snapshot. During the failover window, all cached sessions were invalidated, forcing ~180,000 users to re-authenticate. After failover completed, a cache stampede drove a surge of load to the primary PostgreSQL database. The root cause was insufficient Redis memory headroom combined with a cache key design that allowed unbounded growth.
The primary root cause was memory fragmentation in the Redis primary. A poorly designed cache key pattern (user activity feed) had grown to ~40M keys without TTLs. When memory pressure triggered eviction, Redis attempted to evict large hash objects. The eviction process itself caused additional fragmentation, and the allocator could not coalesce freed memory. As a result, the process's resident set size grew well beyond the configured maxmemory limit, exhausting system memory, and the kernel OOM killer terminated the process.
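One remediation is to give every cache key a TTL with jitter so that millions of keys never expire at the same instant. A minimal sketch follows; the helper names (`feed_cache_key`, `ttl_with_jitter`) and the 6-hour base TTL are illustrative assumptions, not values from our codebase:

```python
import random

# Assumed base TTL for activity-feed entries; tune to actual access patterns.
FEED_TTL_SECONDS = 6 * 3600

def feed_cache_key(user_id: int) -> str:
    """Namespace feed keys (hypothetical scheme) so they can be
    scanned, counted, or bulk-expired as a group."""
    return f"feed:v2:{user_id}"

def ttl_with_jitter(base: int = FEED_TTL_SECONDS, spread: float = 0.1) -> int:
    """Return the base TTL +/- 10% jitter to avoid synchronized expiry."""
    return int(base * random.uniform(1 - spread, 1 + spread))

# With redis-py, a write would then look roughly like:
#   r.setex(feed_cache_key(uid), ttl_with_jitter(), payload)
```

Pairing a bounded TTL with an explicit eviction policy (e.g. `allkeys-lru`) keeps the keyspace from growing without limit in the first place, so eviction storms like this one never start.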
A contributing factor was the failover delay. The replica was 8 minutes behind the primary due to a large RDB snapshot transfer. During this window, Redis was unavailable for writes, and all in-memory session data was lost. We had not configured Redis to use AOF (append-only file) for session data.
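As a sketch of the persistence fix, AOF can be enabled for the session instance with standard redis.conf directives; the `everysec` fsync policy trades at most one second of acknowledged writes for throughput:

```ini
# redis.conf fragment (illustrative; directives are standard Redis options)
appendonly yes              # enable the append-only file
appendfsync everysec        # fsync once per second: bounded loss, low overhead
aof-use-rdb-preamble yes    # faster rewrites/restarts via RDB-prefixed AOF
```

With AOF enabled, a restarted primary can replay recent writes instead of losing all in-memory session data, which would have substantially reduced the re-authentication storm described above.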
The cache stampede was a secondary effect. When Redis came back online, every request was a cache miss. Thousands of requests simultaneously queried PostgreSQL for session and user data, exhausting the connection pool.
| Priority | Action | Owner | Due Date | Status |
|---|---|---|---|---|
| [ ] P1 | Fix cache key design — Add TTL to all cache keys | @backend | 2025-03-10 | Open |
| [ ] P1 | Enable AOF for Redis session persistence | @platform | 2025-03-08 | Open |
| [ ] P2 | Increase Redis memory — 30% headroom | @platform | 2025-03-06 | Open |
| [ ] P2 | Cache stampede protection — request coalescing | @backend | 2025-03-15 | Open |
| [ ] P3 | Failover monitoring — alert on replica lag > 60s | @sre | 2025-03-09 | Open |
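The P3 replica-lag alert could be expressed as a Prometheus rule along these lines; the metric name assumes the commonly used oliver006/redis_exporter, so adjust to whatever exporter is actually deployed:

```yaml
groups:
  - name: redis-replication
    rules:
      - alert: RedisReplicaLagHigh
        # Metric name from redis_exporter (verify against the deployed exporter).
        expr: redis_connected_slave_lag_seconds > 60
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Redis replica lag above 60s for 1m"
          description: "Failover from this primary would lose recent writes."
```

Alerting on lag before a failover is needed gives the on-call a chance to pause heavy writes or delay promotion, rather than discovering an 8-minute-stale replica mid-incident.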
Detection (11:38–11:42 UTC): Redis memory crossed 90% at 11:38. Primary OOM crash at 11:42. Alert fired within 30 seconds. MTTD: ~4 minutes from first warning to full outage.
Response (11:42–11:52 UTC): Platform team engaged. Replica promoted but 8 minutes behind. Manual failover decision. Session cache invalidated — all users logged out. Auth service overwhelmed by DB fallback.
Resolution (11:52–14:12 UTC): Replica finished sync at 11:52. Cache stampede began — DB connection pool exhausted. Read replicas added at 12:15. Memory fragmentation addressed by 12:48. All caches warm by 14:12. MTTR: 2.5 hours.
```
[ERROR] Redis primary OOM killed. Replica lag: 8m 12s. Promoting replica...
[WARN] auth-service Session validation falling back to DB - cache miss rate: 94%
[ERROR] PostgreSQL connection pool exhausted (200/200). P99 latency: 5200ms
```