ProdRescue AI

Context

Repo: checkout-api · Branch: v2.15.0 · Deploy: #ed. · Incident: #elastics · Severity: P1 · Status: Resolved

Incident verdict

Failure chain detected from production logs and cited evidence lines.

Generated from real production signals: logs, Slack context, and monitoring traces are correlated before RCA and fix guidance are produced.


Elasticsearch Split Brain — Cluster Partition — Incident Report

May 2, 2026 · Prepared for: [Your Organization]

  • Severity: P1
  • Service outage: 3h
  • Peak error rate: cluster_red
  • Users impacted: Global search + internal BI
  • Status: Resolved


Executive Summary

On January 21, 2026, a transient network partition isolated AZ-a from AZ-b/c for roughly 90 seconds. Master voting in the Elasticsearch 7.17 cluster split: two nodes each elected themselves master for overlapping windows. The resulting inconsistent cluster metadata produced duplicate indexing targets and skewed search results. Recovery required a full cluster restart with forced allocation awareness and corrected quorum settings (the cluster was later migrated to voting-only master nodes and the Elasticsearch 8 discovery APIs). MTTR was 3 hours.
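The first confirmation step, two simultaneous masters, can be checked directly from the nodes. Below is a minimal Go sketch (not part of the incident tooling; the node addresses are hypothetical) that asks each node for its local view of the elected master via GET /_cluster/state?local=true and flags disagreement, the signature of a split brain:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// localState holds the fields we keep from each node's *local* view of
// cluster state: GET /_cluster/state?local=true&filter_path=master_node,version
type localState struct {
	MasterNode string `json:"master_node"` // empty if the node sees no master
	Version    int64  `json:"version"`
}

func localView(addr string) (localState, error) {
	var s localState
	resp, err := http.Get(addr + "/_cluster/state?local=true&filter_path=master_node,version")
	if err != nil {
		return s, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&s)
	return s, err
}

func main() {
	// Hypothetical node addresses; substitute your master-eligible nodes.
	nodes := []string{"http://es-a1:9200", "http://es-b1:9200", "http://es-c1:9200"}

	// Group nodes by the master they believe in.
	masters := map[string][]string{}
	for _, n := range nodes {
		s, err := localView(n)
		if err != nil {
			fmt.Printf("%s unreachable: %v\n", n, err)
			continue
		}
		masters[s.MasterNode] = append(masters[s.MasterNode], n)
	}
	if len(masters) > 1 {
		// Two or more distinct masters across reachable nodes: split brain.
		fmt.Println("SPLIT BRAIN: nodes disagree on the elected master:", masters)
	} else {
		fmt.Println("all reachable nodes agree on a single master")
	}
}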


Timeline

  • 03:18 UTC — Elastic Cloud monitoring: cluster health RED
  • 03:21 UTC — Split-brain suspected; both masters publishing different cluster UUIDs/state versions
  • 03:35 UTC — Ingest stopped; hourly snapshots verified on the S3 repository
  • 04:40 UTC — Controlled rolling restart with min_master_nodes corrected to N/2 + 1 (a health-gating sketch follows this timeline)
  • 05:55 UTC — Shard replication green; search A/B hash-parity tests pass
  • 06:18 UTC — Incident closed with observability follow-ups
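
Rolling restarts of this kind are typically gated on cluster health between node restarts. A minimal Go sketch, assuming a coordinating endpoint at es.internal:9200 (hypothetical), that blocks until the cluster reports green using the wait_for_status parameter of the _cluster/health API:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// health holds the two fields we need from GET /_cluster/health.
type health struct {
	Status   string `json:"status"`    // "red", "yellow", or "green"
	TimedOut bool   `json:"timed_out"` // true if wait_for_status expired
}

func main() {
	// wait_for_status makes Elasticsearch hold the request until the
	// cluster reaches green or the 60s server-side timeout elapses.
	url := "http://es.internal:9200/_cluster/health?wait_for_status=green&timeout=60s"

	client := &http.Client{Timeout: 90 * time.Second}
	for {
		resp, err := client.Get(url)
		if err != nil {
			log.Fatalf("health check failed: %v", err)
		}
		var h health
		if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
			resp.Body.Close()
			log.Fatalf("decode health response: %v", err)
		}
		resp.Body.Close()
		if h.Status == "green" && !h.TimedOut {
			fmt.Println("cluster green; safe to restart the next node")
			return
		}
		fmt.Printf("status=%s timed_out=%v, waiting again...\n", h.Status, h.TimedOut)
	}
}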

Root Cause Analysis

Primary: Insufficient quorum constraint for master elections during the partition — the configuration allowed the two sides to each form what it believed was a majority, the result of an asymmetric node-loss pattern combined with a stale Zen2 discovery configuration.

Contributing: The migration to a dedicated-master topology was incomplete, and JVM heap pressure delayed leader responses, exacerbating election churn.
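
The quorum arithmetic behind the primary cause fits in a few lines. A minimal, illustrative Go sketch (the topology numbers are hypothetical) of the majority rule floor(N/2) + 1 over master-eligible nodes, showing how a stale, too-low minimum lets both sides of an asymmetric partition elect a master:

package main

import "fmt"

// quorum returns the minimum number of master-eligible nodes that must
// agree before a master may be elected: floor(n/2) + 1.
func quorum(masterEligible int) int {
	return masterEligible/2 + 1
}

func main() {
	const masterEligible = 6 // hypothetical count after the stalled migration

	correct := quorum(masterEligible) // 4: no two disjoint groups can both reach it
	stale := 2                        // illustrative stale minimum left in config

	// An asymmetric partition splits the electorate 4 / 2 (AZ-a vs AZ-b/c).
	sideA, sideB := 4, 2

	fmt.Printf("correct quorum: sideA elects=%v sideB elects=%v\n",
		sideA >= correct, sideB >= correct) // true, false: exactly one master
	fmt.Printf("stale quorum:   sideA elects=%v sideB elects=%v\n",
		sideA >= stale, sideB >= stale) // true, true: split brain
}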


Impact

P1, MTTR 3h. Global search returned skewed results, and financial analytics dashboards were stale or contradictory.

Confidential05 / 09
ProdRescue AIIncident Report
06 / 09

5 Whys Analysis

  1. Why inconsistent search? → Divergent shard routing maps.
  2. Why divergent? → Two masters accepted conflicting updates.
  3. Why two masters? → Partition split electorate incorrectly.
  4. Why incorrect electorate? → minimum_master_nodes / voting config wrong for topology.
  5. ROOT CAUSE: The cluster upgrade checklist did not validate quorum math after coordinator-only nodes were added (a quorum-validation sketch follows this list).
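
The missing checklist step is mechanical: derive the quorum from master-eligible nodes only, ignoring coordinator-only nodes. A minimal Go sketch, assuming a reachable endpoint (es.internal:9200 is hypothetical), that counts master-eligible nodes from GET /_cat/nodes?format=json&h=name,node.role (master-eligible roles contain "m"; coordinator-only nodes report "-"):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// nodeRow mirrors the columns requested from _cat/nodes; the JSON keys
// match the requested column names.
type nodeRow struct {
	Name string `json:"name"`
	Role string `json:"node.role"` // e.g. "dilm"; "-" for coordinator-only
}

func main() {
	// Any node can answer _cat/nodes; the endpoint is hypothetical.
	resp, err := http.Get("http://es.internal:9200/_cat/nodes?format=json&h=name,node.role")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var rows []nodeRow
	if err := json.NewDecoder(resp.Body).Decode(&rows); err != nil {
		log.Fatal(err)
	}

	masterEligible := 0
	for _, r := range rows {
		if strings.Contains(r.Role, "m") { // count only master-eligible nodes
			masterEligible++
		}
	}
	fmt.Printf("master-eligible=%d quorum=%d (total nodes=%d)\n",
		masterEligible, masterEligible/2+1, len(rows))
}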

Prevention Checklist

  • Dedicated master nodes + voting-only configuration, peer-reviewed
  • Automated Zen discovery smoke test in staging under a simulated partition
  • Snapshot SLAs + quarterly restore drills (see the restore sketch below)
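
For the quarterly restore drill above, a minimal Go sketch, assuming a registered snapshot repository named s3-hourly and the snapshot and index names shown (all hypothetical), that restores a single index under a drill_ prefix via the _restore API so production indices are untouched:

package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Hypothetical repository and snapshot names for the drill.
	url := "http://es.internal:9200/_snapshot/s3-hourly/snap-2026.01.21-03/_restore"

	// rename_pattern/rename_replacement restore the index under a new
	// name, so the live index with the original name is left alone.
	body := []byte(`{
	  "indices": "search-products",
	  "rename_pattern": "(.+)",
	  "rename_replacement": "drill_$1"
	}`)

	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Printf("restore drill: HTTP %d %s\n", resp.StatusCode, out)
}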

Evidence & Log Samples

[2026-01-21T03:19:12,441][WARN ][o.e.c.c.Coordinator ] master not discovered yet: have discovered possible quorum [...]
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master]];

Lessons Learned

Distributed-consensus incidents carry long MTTRs — invest in partition simulation tests before the next one.


How this report maps to the real product workflow

This page is a real-format example so teams can evaluate the full flow before signing in: input signals, evidence-backed RCA, Ask ProdRescue follow-ups, and optional GitHub actions by plan.

1) Inputs & context

All plans can paste logs directly. With Slack connected (Pro / Team), pull threads or channels from war rooms and keep that context in one evidence-backed report.

2) Evidence-backed RCA

Timeline, root cause, impact, and action items are generated with citations tied to real log lines (e.g. [1], [6], [8]) so teams can verify every claim.

3) Ask ProdRescue

On report pages, users can ask follow-up questions like "why this happened", "show the evidence", or "suggest a fix" and get incident-context answers grounded in report data.

4) GitHub actions (plan-aware)

Team: connect repo, import commits, run manual deploy analysis, add webhook automation, and submit suggested fixes for review on GitHub (no auto-merge).

Free / PAYG / Pro: incident clarity · Team: automation + suggested fixes

Ask ProdRescue

Get answers. Find the fix.

Incident-aware
What is the root cause? · What changed before the incident? · What evidence supports this? · Suggest a fix · What should we do next?

Suggested Fix (preview)

- amount := payment.Amount
+ if payment == nil {
+   return ErrInvalidPayment
+ }
+ amount := payment.Amount

Change preview

fix(incident-elastics): apply suggested remediation

Team plan can publish the change for review on GitHub. No auto-merge.

Similar Incident Reports

Had a similar incident?

Paste your logs in the workspace — ProdRescue cites every claim to an evidence line. First analysis free; no credit card required.

Paste your logs