Elasticsearch Split Brain — Cluster Partition — Incident Report
May 2, 2026 · Prepared for: [Your Organization]
Severity: P1
Service outage: 3h
Peak health state: cluster_red
Users impacted: Global search + internal BI
Status: Resolved
Context
Incident verdict: failure chain detected from production logs and cited evidence lines.
Generated from real production signals: logs, Slack context, and monitoring traces are correlated before the RCA and fix guidance are produced.
On January 21, 2026, a transient network partition isolated AZ-a from AZ-b/c for roughly 90 seconds. In the Elasticsearch 7.17 cluster, master voting split: two nodes independently elected themselves master during overlapping windows. The resulting inconsistent cluster metadata produced duplicate indexing targets and search skew. Recovery required a full cluster restart with forced allocation awareness and restoration of correct quorum settings; the cluster was later migrated to voting-only master nodes and the Elasticsearch 8 discovery APIs. MTTR was 3 hours.
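A practical detection signal during a window like this is two nodes answering with different master IDs. Below is a minimal watchdog sketch, not ProdRescue output: the node addresses (es-a, es-b, es-c) are hypothetical, while GET /_cat/master is Elasticsearch's real cat API.

// Minimal split-brain watchdog sketch. Node addresses are hypothetical;
// /_cat/master?format=json is a real Elasticsearch cat endpoint.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// catMaster mirrors one row of GET /_cat/master?format=json.
type catMaster struct {
	ID   string `json:"id"`
	Node string `json:"node"`
}

// masterSeenBy asks a single node which master it currently follows.
func masterSeenBy(nodeURL string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(nodeURL + "/_cat/master?format=json")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var rows []catMaster
	if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
		return "", err
	}
	if len(rows) == 0 {
		return "", fmt.Errorf("no master row from %s", nodeURL)
	}
	return rows[0].ID, nil
}

func main() {
	// Hypothetical node addresses, one per AZ.
	nodes := []string{"http://es-a:9200", "http://es-b:9200", "http://es-c:9200"}
	seen := map[string][]string{} // master node ID -> nodes reporting it
	for _, n := range nodes {
		id, err := masterSeenBy(n)
		if err != nil {
			fmt.Printf("%s: unreachable or masterless: %v\n", n, err)
			continue
		}
		seen[id] = append(seen[id], n)
	}
	// More than one distinct master ID means a dual election is in progress.
	if len(seen) > 1 {
		fmt.Println("ALERT: nodes disagree on the elected master:", seen)
	}
}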
Primary: insufficient quorum constraint on master elections during the partition. An asymmetric node-loss pattern plus stale zen2 configuration allowed both sides to form what each treated as an independent majority (see the quorum sketch below).
Contributing: the migration to a dedicated-master topology was incomplete, and JVM heap pressure delayed leader responses, exacerbating election churn.
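As a reasoning aid rather than Elasticsearch's actual election code, a minimal sketch of the strict-majority rule shows why a correctly sized voting configuration makes a dual election impossible, and why a stale or undersized one does not:

// Quorum sketch: with a strict majority over a fixed voting configuration,
// at most one side of a partition can ever win an election.
package main

import "fmt"

// hasQuorum reports whether votes form a strict majority of the
// master-eligible voting configuration.
func hasQuorum(votes, votingConfigSize int) bool {
	return votes >= votingConfigSize/2+1
}

func main() {
	const masters = 3 // e.g. one master-eligible node per AZ
	// 90-second partition: AZ-a alone vs AZ-b + AZ-c.
	fmt.Println(hasQuorum(1, masters)) // false: AZ-a side cannot elect
	fmt.Println(hasQuorum(2, masters)) // true: AZ-b/c side elects one master
	// A stale, undersized voting configuration is what lets both sides pass.
	fmt.Println(hasQuorum(1, 1)) // true: degenerate config, split brain
}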
Impact: P1, MTTR 3h; global search degraded and financial analytics dashboards went stale or contradictory.
[2026-01-21T03:19:12,441][WARN ][o.e.c.c.Coordinator ] master not discovered yet: have discovered possible quorum [...]
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master]];
Distributed-consensus incidents carry long MTTRs; invest in partition simulation tests (a sketch follows below).
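A minimal sketch of such a test, under loud assumptions: Linux with iptables, root privileges, a hypothetical peer IP (10.0.1.12), and a local node on port 9200; GET /_cluster/health is Elasticsearch's real health API. The expectation is that the isolated minority side reports no master instead of electing a second one.

// Hypothetical partition-simulation smoke test. Assumes Linux, root, and
// iptables on the node under test; a sketch, not a chaos framework.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// setPartition drops (or restores) all inbound traffic from the peer,
// simulating a network partition toward this node.
func setPartition(peerIP string, enable bool) error {
	op := "-A"
	if !enable {
		op = "-D"
	}
	return exec.Command("iptables", op, "INPUT", "-s", peerIP, "-j", "DROP").Run()
}

// clusterStatus polls the local node's view of cluster health.
func clusterStatus(nodeURL string) (string, error) {
	resp, err := http.Get(nodeURL + "/_cluster/health")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var h struct {
		Status string `json:"status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		return "", err
	}
	return h.Status, nil
}

func main() {
	const peer = "10.0.1.12"              // hypothetical peer node IP in AZ-b
	const local = "http://localhost:9200" // node on the isolated side

	if err := setPartition(peer, true); err != nil {
		panic(err)
	}
	defer setPartition(peer, false) // always heal the partition

	// Expectation: the minority side loses its master and becomes
	// unavailable rather than electing a second master.
	for i := 0; i < 9; i++ {
		time.Sleep(10 * time.Second)
		status, err := clusterStatus(local)
		fmt.Printf("t=%ds status=%q err=%v\n", (i+1)*10, status, err)
	}
}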
This page is a real-format example so teams can evaluate the full flow before logging in: input signals, evidence-backed RCA, Ask ProdRescue follow-up, and optional GitHub actions by plan.
1) Inputs & context
All plans can paste logs directly. With Slack connected (Pro / Team), pull threads or channels from war rooms and keep that context in one evidence-backed report.
2) Evidence-backed RCA
Timeline, root cause, impact, and action items are generated with citations tied to real log lines (e.g. [1], [6], [8]) so teams can verify every claim.
3) Ask ProdRescue
On report pages, users can ask follow-up questions like "why this happened", "show the evidence", or "suggest a fix" and get incident-context answers grounded in report data.
4) GitHub actions (plan-aware)
Team: connect repo, import commits, run manual deploy analysis, add webhook automation, and submit suggested fixes for review on GitHub (no auto-merge).
Get answers. Find the fix.
Suggested Fix (preview)
- amount := payment.Amount
+ if payment == nil {
+     return ErrInvalidPayment
+ }
+ amount := payment.Amount
Change preview
fix(incident-elastics): apply suggested remediation
Team plan can publish the change for review on GitHub. No auto-merge.
Had a similar incident?
Paste your logs in the workspace; ProdRescue cites every claim to an evidence line. First analysis free; no credit card required.
Paste your logs