ProdRescue AI · Incident Report

Redis Cluster Failure Incident Report — March 3, 2025

Generated on: Feb 25, 2026, 11:41 PM · Prepared for: [Your Organization]

  • Severity: P1
  • Service Outage Duration: 2.5h
  • Peak error rate
  • Users impacted: ~180,000 sessions
  • Service Status: Resolved

Confidential · Prepared by ProdRescue AI · prodrescueai.com · Generated Feb 25, 2026, 11:41 PM

Executive Summary

On March 3, 2025, at 11:42 UTC, our production Redis cluster suffered a primary node failure that caused 2.5 hours of degraded performance. The primary crashed after severe memory fragmentation triggered by a large key eviction event. Failover was delayed because the replica was still syncing a large RDB snapshot. During the failover window, all cached sessions were invalidated, forcing ~180,000 users to re-authenticate. Once failover completed, the resulting cache stampede exhausted the primary PostgreSQL database's connection pool. The root cause was insufficient Redis memory headroom combined with a cache key design that allowed unbounded growth.


Timeline

  • 11:38 UTC — Redis memory usage crosses 90%. Eviction policy begins removing keys.
  • 11:42 UTC — Primary Redis node OOM crash. Replica promoted but still catching up.
  • 11:43 UTC — All session cache misses. Users logged out. Auth service overwhelmed.
  • 11:45 UTC — Incident declared. On-call and platform team engaged.
  • 11:52 UTC — Replica finishes sync. Accepting writes. Session cache repopulating.
  • 11:55 UTC — Cache stampede begins. DB connection pool exhausted.
  • 12:15 UTC — DB read replicas added. Query load distributed.
  • 12:48 UTC — Redis memory fragmentation addressed. New primary stable.
  • 14:12 UTC — All caches warm. Latency normalized. Incident resolved.

Root Cause Analysis

The primary root cause was memory fragmentation in the Redis primary. A poorly designed cache key pattern (the user activity feed) had grown to ~40M keys without TTLs. When memory pressure triggered eviction, Redis attempted to evict large hash objects. The eviction process itself caused additional fragmentation, and the allocator could not coalesce the freed memory, so the process's resident set kept growing past the configured maxmemory limit until the kernel's OOM killer terminated it.
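The immediate fix for the unbounded keyspace is a TTL on every cache write. A minimal sketch (key names, the redis-py-style call, and the base TTL are illustrative, not taken from our codebase) that also jitters the TTL so keys written in the same burst do not all expire and refill at the same instant:

```python
import random

def jittered_ttl(base_seconds: int, jitter_frac: float = 0.1) -> int:
    """Return the base TTL with up to +/-jitter_frac of random spread, so keys
    created together do not all expire (and get recomputed) simultaneously."""
    jitter = int(base_seconds * jitter_frac)
    return base_seconds + random.randint(-jitter, jitter)

# Hypothetical write path for the activity-feed cache (redis-py style):
# r.set(f"feed:{user_id}", payload, ex=jittered_ttl(3600))
```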

A contributing factor was the failover delay. The replica was 8 minutes behind the primary due to a large RDB snapshot transfer. During this window, Redis was unavailable for writes, and all in-memory session data was lost. We had not configured Redis to use AOF (append-only file) for session data.
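Had AOF been enabled, a restarted node could have replayed recent writes instead of losing the session keyspace outright. A sketch of the relevant redis.conf directives (values shown are illustrative, not our production settings):

```conf
appendonly yes
appendfsync everysec            # fsync once per second: bounds loss to ~1s of writes
auto-aof-rewrite-percentage 100 # rewrite the AOF when it doubles in size
auto-aof-rewrite-min-size 64mb
```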

The cache stampede was a secondary effect. When Redis came back online, every request was a cache miss. Thousands of requests simultaneously queried PostgreSQL for session and user data, exhausting the connection pool.


Impact

  • Duration: 2.5 hours (degraded), 13 minutes (full outage)
  • Sessions lost: ~180,000 users logged out
  • Database impact: Connection pool exhaustion, P99 latency > 5s for 1 hour
  • Revenue impact: Checkout abandonment rate 3x normal during window
  • Customer support: 2,100 tickets related to "logged out" or "session expired"

Action Items

  • [ ] P1: Fix cache key design: add TTL to all cache keys (@backend, due 2025-03-10, Open)
  • [ ] P1: Enable AOF for Redis session persistence (@platform, due 2025-03-08, Open)
  • [ ] P2: Increase Redis memory to restore 30% headroom (@platform, due 2025-03-06, Open)
  • [ ] P2: Cache stampede protection via request coalescing (@backend, due 2025-03-15, Open)
  • [ ] P3: Failover monitoring: alert on replica lag > 60s (@sre, due 2025-03-09, Open)
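The replica-lag alert in the P3 item can be driven by data Redis already exposes: `INFO replication` on the primary reports `master_repl_offset` plus each replica's acknowledged offset (and a `lag` field in seconds). A sketch (thresholds and polling are assumed; only the parsing is shown) that computes how many bytes the slowest replica trails by:

```python
def replica_lag_bytes(info_replication: str) -> int:
    """Given the text of `INFO replication` from the primary, return how many
    bytes of replication stream the furthest-behind replica still has to apply."""
    master_offset = 0
    replica_offsets = []
    for line in info_replication.splitlines():
        line = line.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
        elif line.startswith("slave") and "offset=" in line:
            # e.g. slave0:ip=10.0.0.2,port=6379,state=online,offset=8300,lag=0
            fields = dict(kv.split("=", 1) for kv in line.split(":", 1)[1].split(","))
            replica_offsets.append(int(fields["offset"]))
    if not replica_offsets:
        return 0  # no replicas attached: nothing to lag behind
    return master_offset - min(replica_offsets)
```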

Detection, Response & Resolution

Detection (11:38–11:42 UTC): Redis memory crossed 90% at 11:38; the primary OOM-crashed at 11:42, and the alert fired within 30 seconds of the crash. The window from first memory warning to full outage was ~4 minutes.

Response (11:42–11:52 UTC): Platform team engaged. Replica promoted but 8 minutes behind. Manual failover decision. Session cache invalidated — all users logged out. Auth service overwhelmed by DB fallback.

Resolution (11:52–14:12 UTC): Replica finished sync at 11:52. Cache stampede began — DB connection pool exhausted. Read replicas added at 12:15. Memory fragmentation addressed by 12:48. All caches warm by 14:12. MTTR: 2.5 hours.


5 Whys Analysis

  1. Why did users get logged out? → Redis primary crashed, all session data lost.
  2. Why did Redis crash? → OOM killer terminated process due to memory exhaustion.
  3. Why was memory exhausted? → Unbounded cache key growth (40M keys, no TTL) + eviction caused fragmentation.
  4. Why was cache unbounded? → Activity feed cache design had no TTL or size limit.
  5. Why wasn't this caught earlier? → ROOT CAUSE: No memory headroom monitoring, no cache key audits, and no alerting on replica lag.
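The missing headroom monitoring is cheap to close: `used_memory` and `maxmemory` both come back from `INFO memory`. A minimal sketch of the check (the 30% threshold mirrors the P2 action item; the alerting hook is hypothetical):

```python
def memory_headroom_frac(info_memory: str) -> float:
    """Given the text of `INFO memory`, return the fraction of maxmemory still
    free (0.0 means at the limit). Raises if maxmemory is unset."""
    fields = {}
    for line in info_memory.splitlines():
        if ":" in line and not line.startswith("#"):
            key, value = line.split(":", 1)
            fields[key] = value.strip()
    used = int(fields["used_memory"])
    limit = int(fields["maxmemory"])
    if limit == 0:
        raise ValueError("maxmemory is not configured; headroom is undefined")
    return 1.0 - used / limit

# Alert well before the 90% mark that preceded this incident, e.g.:
# if memory_headroom_frac(info) < 0.30: page_oncall()  # page_oncall is hypothetical
```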

Prevention Checklist

  • Add TTL to all Redis cache keys (max 24h for session, 1h for feed)
  • Enable AOF for session persistence
  • 30% memory headroom for Redis
  • Cache stampede protection (request coalescing, probabilistic early expiration)
  • Alert on replica lag > 60 seconds
  • Regular cache key audits and size limits
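The "probabilistic early expiration" item above refers to the XFetch technique: each reader occasionally recomputes a hot key slightly before it expires, with the probability rising as expiry approaches, so refreshes spread out instead of stampeding at the expiry instant. A sketch under stated assumptions (beta and the recompute-cost estimate are tuning parameters, not measured values):

```python
import math
import random

def should_refresh_early(ttl_remaining: float, recompute_cost: float,
                         beta: float = 1.0) -> bool:
    """XFetch-style check: return True when this reader should recompute the
    value now, even though the key has not expired yet. The exponential sample
    makes an early refresh more likely as ttl_remaining shrinks toward zero."""
    # -ln(1 - U) is an Exp(1) sample; scale it by how long a recompute takes.
    sample = -math.log(1.0 - random.random())
    return ttl_remaining <= recompute_cost * beta * sample
```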

Evidence & Log Samples

[ERROR] Redis primary OOM killed. Replica lag: 8m 12s. Promoting replica...
[WARN] auth-service Session validation falling back to DB - cache miss rate: 94%
[ERROR] PostgreSQL connection pool exhausted (200/200). P99 latency: 5200ms

Lessons Learned

  • Unbounded cache growth is a ticking time bomb. The activity feed cache had no TTL and no size limit.
  • Memory fragmentation can cause sudden OOM. Regular memory profiling and headroom are essential.
  • Failover is not instant. Replica sync lag meant 13 minutes of downtime. AOF would have reduced data loss.
  • Cache stampede is a real risk. We need stampede protection for critical caches.
