Kafka Consumer Lag Incident: Root Cause Analysis (Example Incident Report)
May 2, 2026 · Prepared for: [Your Organization]
Severity: P2
Service outage: 35 min
Peak lag: +2.1M msgs
Users impacted: ~120K (event delay window)
Status: Resolved
Context
Incident verdict: failure chain reconstructed from production logs, with every claim tied to cited evidence lines.
Generated from real production signals: logs, Slack context, and monitoring traces are correlated before RCA and fix guidance are produced.
On August 19, 2025, the analytics-ingest consumer group on Kafka cluster prod-east fell more than 2.1 million messages behind. Producer throughput held steady at ~85k msg/s while aggregate consumer fetch rate dropped after deploy v2.8.0 introduced synchronous GeoIP enrichment in the hot path. The topic's 48 partitions could not absorb the burst replay that followed a rolling broker restart. Resolution combined partition reassignment, temporary consumer scale-out (12 → 28 instances), and rollback of enrichment to async side-channel processing. MTTR from incident declaration to lag back under SLA: 35 minutes.
Alert fired: analytics-ingest lag > 500k behind (threshold 200k).
Primary root cause: v2.8.0 added synchronous third-party GeoIP calls inside the poll loop, increasing per-record latency ~15× under load. With a fixed partition count and mounting max.poll.interval.ms pressure, consumers could not keep pace with producers after the replay window that followed the broker restarts.
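The remediation that held was moving enrichment off the poll loop. Below is a minimal, dependency-free sketch of that async side-channel shape; the record type, geoipLookup, queue size, and worker count are hypothetical stand-ins, not the production code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// record stands in for a consumed Kafka message (hypothetical type).
type record struct {
	offset int64
	ip     string
}

// geoipLookup stands in for the third-party GeoIP call that v2.8.0
// placed in the hot path; the sleep simulates its network latency.
func geoipLookup(ip string) string {
	time.Sleep(10 * time.Millisecond)
	return "US"
}

func main() {
	enrichQ := make(chan record, 1024) // bounded side channel
	var wg sync.WaitGroup

	// Enrichment workers run outside the poll loop, so a slow GeoIP
	// call can no longer stall fetches or trip max.poll.interval.ms.
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range enrichQ {
				_ = geoipLookup(r.ip) // enrich, then emit downstream
			}
		}()
	}

	// Poll loop: hand each record to the side channel and keep fetching.
	for i := int64(0); i < 100; i++ {
		enrichQ <- record{offset: i, ip: "203.0.113.7"}
	}
	close(enrichQ)
	wg.Wait()
	fmt.Println("enriched 100 records off the hot path")
}
```

The bounded channel is the point of the design: when enrichment falls behind, the poll loop blocks on the send, surfacing measurable back-pressure instead of unbounded consumer lag.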
Contributing factors: (1) partition count under-provisioned for peak (48 partitions against a sustained 85k msg/s), (2) no back-pressure to slow producers on this topic (see the sketch after the action-item table below), (3) no load test of v2.8.0 against the full production traffic shape.
| Priority | Action | Owner | Due | Status |
|---|---|---|---|---|
| P1 | Increase partitions to 96 + rebalance plan | @streaming | 2025-08-22 | Open |
| P1 | GeoIP only async path; contract tests on poll latency | @backend | 2025-08-21 | Open |
| P2 | Producer back-pressure when lag > 100k | @platform | 2025-08-28 | Open |
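For the P2 back-pressure item, one minimal shape is a produce-side gate that polls group lag and pauses publishing while it exceeds the 100k threshold. This sketch assumes a hypothetical consumerLag helper with simulated numbers:

```go
package main

import (
	"fmt"
	"time"
)

// simulated stands in for real group lag; consumerLag would normally
// read the analytics-ingest lag from a lag monitor or admin API.
var simulated = int64(260_000)

func consumerLag() int64 {
	simulated -= 90_000 // simulate consumers draining the backlog
	return simulated
}

// waitForHeadroom gates the producer: while lag exceeds the threshold
// from the P2 action item, it pauses and re-checks on an interval.
func waitForHeadroom(threshold int64, poll time.Duration) {
	for lag := consumerLag(); lag > threshold; lag = consumerLag() {
		fmt.Printf("lag %d > %d; pausing producer\n", lag, threshold)
		time.Sleep(poll)
	}
}

func main() {
	waitForHeadroom(100_000, 200*time.Millisecond)
	fmt.Println("headroom restored; resuming produce")
}
```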
Detection via lag monitor (12:04). Response: ruled out broker failure, profiled consumer, tied to deploy. Resolution: feature rollback + scale-out + rebalance.
2025-08-19T12:06:02.441Z WARN analytics-ingest-7b2kf lag=1247831 partitions=[0-47] max.poll.interval.ms approaching
{"topic":"events.raw","consumer":"analytics-ingest","partition":12,"lag_offets":482910,"process_ms_p99":176}
[kafka-coordinator] Heartbeat failed: consumer processing exceeded max.poll.interval.ms (300000ms) — rebalance triggered
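The rebalance in the last evidence line is the documented consequence of a consumer exceeding max.poll.interval.ms between polls. Two standard consumer settings interact here; the values shown are illustrative, not this cluster's actual config:

```properties
# Fewer records per poll() bounds the processing time between polls.
max.poll.records=100
# Default 300000 (5 min). Raising it buys headroom during replay
# bursts, but masks slow processing rather than fixing it.
max.poll.interval.ms=300000
```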
Hot-path network I/O in Kafka consumers starves fetch loops. Treat partition count and consumer processing SLA as one capacity model.
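As a rough back-of-envelope for that capacity model, assuming one processing thread per partition: 85,000 msg/s across 48 partitions is roughly 1,770 msg/s per partition, a per-record budget of about 0.56 ms. The P1 move to 96 partitions only relaxes that to about 1.1 ms. Against the 176 ms p99 in the evidence above, no plausible partition count closes the gap, which is why enrichment had to leave the hot path entirely.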
This page is a real-format example so teams can evaluate the full flow before logging in: input signals, evidence-backed RCA, Ask ProdRescue follow-up, and optional GitHub actions by plan.
1) Inputs & context
All plans can paste logs directly. With Slack connected (Pro / Team), pull threads or channels from war rooms and keep that context in one evidence-backed report.
2) Evidence-backed RCA
Timeline, root cause, impact, and action items are generated with citations tied to real log lines (e.g. [1], [6], [8]) so teams can verify every claim.
3) Ask ProdRescue
On report pages, users can ask follow-up questions like "why this happened", "show the evidence", or "suggest a fix" and get incident-context answers grounded in report data.
4) GitHub actions (plan-aware)
Team: connect repo, import commits, run manual deploy analysis, add webhook automation, and submit suggested fixes for review on GitHub (no auto-merge).
Get answers. Find the fix.
Suggested Fix (preview)
- amount := payment.Amount
+ if payment == nil {
+     return ErrInvalidPayment
+ }
+ amount := payment.Amount
Change preview: fix(incident-kafka-co): apply suggested remediation
Team plan can publish the change for review on GitHub. No auto-merge.
Had a similar incident?
Paste your logs in the workspace — ProdRescue cites every claim to an evidence line. First analysis free; no credit card required.
Paste your logs