ProdRescue AI

Context

Repo: checkout-api · Branch: v2.15.0 · Deploy: #ed. · Incident: #java-mem · P1 · Resolved

Incident verdict

Failure chain detected from production logs and cited evidence lines.

Generated from real production signals. Logs, Slack context, and monitoring traces are correlated before RCA and fix guidance.

ProdRescue AI Incident Report

Java Memory Leak Postmortem — Heap Exhaustion in Production — Incident Report

May 2, 2026 · Prepared for: [Your Organization]

Severity: P1
Service outage: 2h 10min
Peak error rate: 38%
Users impacted: ~22K failed API calls
Revenue impact: Checkout retries elevated
Status: Resolved

Confidential · May 2, 2026

Executive Summary

On September 2–3, 2025, billing-api (Java 17 / Spring Boot 3.2) pods showed escalating heap usage that culminated in OOMKilled events every 4–11 minutes after reaching steady-state load. Root cause: a PreparedStatement opened inside a retry loop was not closed when an SQLException triggered the retry path, so each failed attempt leaked native and heap references until GC could no longer reclaim enough space.


Timeline

  • Sep 2 14:18 UTC — Heap warning: Old Gen > 82% on pod billing-api-58d9x
  • Sep 2 18:40 UTC — First OOMKilled; Kubernetes restarted pod; traffic shifted
  • Sep 3 02:05 UTC — Error budget burn; incident declared; JVM heap dumps captured
  • Sep 3 02:28 UTC — MAT analysis: 1.2M unreachable PreparedStatement wrappers retained via retry handler map
  • Sep 3 03:12 UTC — Code path identified: RetryTemplate + JDBC without try-with-resources on exception branch
  • Sep 3 03:45 UTC — Hotfix v1.6.4: try-with-resources + cap retries + statement cache limits
  • Sep 3 04:28 UTC — Rollout complete; heap stable <60% old gen over 2h observation

Root Cause Analysis

Primary: Resource leak. PreparedStatements were not closed on the SQLException retry branch in InvoiceJdbcRepository.saveBatch.

Contributing: the JDBC statement cache was enabled without leak detection, and no JVM heap dumps were captured on OOM during prior crashes, so evidence from earlier crash loops was lost.
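
To make the primary cause concrete, here is a minimal sketch of the leaky retry shape next to the try-with-resources shape. It is not the actual InvoiceJdbcRepository.saveBatch code, which this report does not show. `TrackedStatement`, `leakyRetry`, and `scopedRetry` are hypothetical names, and `TrackedStatement` is a stand-in for a PreparedStatement that only counts how many instances are still open.

```java
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryLeakSketch {
    /** Hypothetical stand-in for a PreparedStatement; counts instances that are still open. */
    static final class TrackedStatement implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();
        private final boolean failOnExecute;

        TrackedStatement(boolean failOnExecute) {
            this.failOnExecute = failOnExecute;
            OPEN.incrementAndGet();
        }

        void executeBatch() throws SQLException {
            if (failOnExecute) {
                throw new SQLException("Communications link failure");
            }
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Leaky shape: close() is only reached on the success path. */
    static int leakyRetry(int maxAttempts, int failingAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            TrackedStatement stmt = new TrackedStatement(attempt <= failingAttempts);
            try {
                stmt.executeBatch();
                stmt.close();
                return attempt;
            } catch (SQLException e) {
                // Retry: the statement from this attempt is never closed.
            }
        }
        return -1;
    }

    /** Fixed shape: try-with-resources closes the statement on every attempt. */
    static int scopedRetry(int maxAttempts, int failingAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (TrackedStatement stmt = new TrackedStatement(attempt <= failingAttempts)) {
                stmt.executeBatch();
                return attempt;
            } catch (SQLException e) {
                // Retry: the statement was closed before this handler ran.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        leakyRetry(5, 3);
        System.out.println("open after leaky retry: " + TrackedStatement.OPEN.get());  // 3
        TrackedStatement.OPEN.set(0);
        scopedRetry(5, 3);
        System.out.println("open after scoped retry: " + TrackedStatement.OPEN.get()); // 0
    }
}
```

In the leaky variant every failed attempt leaves one statement open, so the leak grows with the retry rate rather than with traffic alone. The scoped variant keeps the open count at zero however many attempts fail.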


Impact

P1 partial outage of billing; MTTR was 2h 10m from incident declaration to verified fix.


Action Items

Priority | Action | Owner | Due | Status
P1 | Mandatory try-with-resources audit of the JDBC module | @backend | 2025-09-06 | Open
P2 | OOM heap dump sidecar + S3 upload on SIGKILL | @platform | 2025-09-10 | Open

5 Whys Analysis

  1. Why OOM? → Heap exhausted.
  2. Why heap growth? → JDBC objects retained.
  3. Why retained? → Statements not closed on exception path.
  4. Why exception path? → Transient DB failover injected retries.
  5. ROOT CAUSE: Code review did not enforce try-with-resources; static analysis skipped JDBC module.
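
The exception path in steps 3 and 4 can be reproduced without a database. The sketch below uses a `java.lang.reflect.Proxy` to build a `PreparedStatement` whose `executeBatch()` always throws, then checks that a retrying caller closes one statement per attempt. `FailoverFaultInjection`, `failingStatement`, `StatementSupplier`, and `saveBatchWithRetry` are hypothetical names for illustration, not the service's real code.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicInteger;

public class FailoverFaultInjection {
    /** Hypothetical supplier so each attempt can obtain a fresh statement. */
    @FunctionalInterface
    interface StatementSupplier {
        PreparedStatement get() throws SQLException;
    }

    /** Builds a PreparedStatement proxy whose executeBatch() always throws; close() calls are counted. */
    static PreparedStatement failingStatement(AtomicInteger closeCount) {
        InvocationHandler handler = (proxy, method, args) -> {
            switch (method.getName()) {
                case "executeBatch":
                    throw new SQLException("Communications link failure");
                case "close":
                    closeCount.incrementAndGet();
                    return null;
                default:
                    throw new UnsupportedOperationException(method.getName());
            }
        };
        return (PreparedStatement) Proxy.newProxyInstance(
                FailoverFaultInjection.class.getClassLoader(),
                new Class<?>[] {PreparedStatement.class},
                handler);
    }

    /** Hypothetical method under test: one statement per attempt, closed by try-with-resources. */
    static int saveBatchWithRetry(StatementSupplier supplier, int maxAttempts) {
        int failures = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (PreparedStatement stmt = supplier.get()) {
                stmt.executeBatch();
                return failures;
            } catch (SQLException e) {
                failures++; // the statement is already closed at this point
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        AtomicInteger closeCount = new AtomicInteger();
        int failures = saveBatchWithRetry(() -> failingStatement(closeCount), 5);
        System.out.println("failed attempts=" + failures + ", statements closed=" + closeCount.get());
    }
}
```

A test built this way fails as soon as the number of closed statements drops below the number of attempts. That is the condition code review missed in this incident.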

Prevention Checklist

  • SpotBugs + Error Prone JDBC leak rules in CI
  • Integration tests force SQLException during batch save
  • Heap baseline alerts at 75% old gen for billing-api
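
As a rough illustration of the 75% old gen alert, the sketch below reads old-generation occupancy from inside the JVM through the standard `java.lang.management` API. Matching pools by the names "Old Gen" or "Tenured" is an assumption that holds for common collectors such as G1 and Parallel but not necessarily for all of them. `OldGenGauge`, `oldGenPercent`, and `breaches` are hypothetical names; a production alert would more likely live in the monitoring stack than in application code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.OptionalDouble;

public class OldGenGauge {
    /** Returns old-gen occupancy in percent, or empty if no matching pool with a defined max exists. */
    static OptionalDouble oldGenPercent() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: the old generation pool is named "... Old Gen" or "Tenured Gen".
            boolean looksLikeOldGen = name.contains("Old Gen") || name.contains("Tenured");
            if (pool.getType() != MemoryType.HEAP || !looksLikeOldGen) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null || usage.getMax() <= 0) {
                continue; // max is undefined (-1) for some pools
            }
            return OptionalDouble.of(100.0 * usage.getUsed() / usage.getMax());
        }
        return OptionalDouble.empty();
    }

    /** True when a reading is present and above the alert threshold. */
    static boolean breaches(OptionalDouble percent, double thresholdPercent) {
        return percent.isPresent() && percent.getAsDouble() > thresholdPercent;
    }

    public static void main(String[] args) {
        OptionalDouble current = oldGenPercent();
        System.out.println("old gen %: " + current + ", above 75%: " + breaches(current, 75.0));
    }
}
```

With a 75% threshold, the 82% reading from the timeline would have alerted, while the post-fix level below 60% would not.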

Evidence & Log Samples

OpenJDK 64-Bit Server VM warning: Exception java.lang.OutOfMemoryError thrown from UncaughtExceptionHandler
java.lang.OutOfMemoryError: Java heap space
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:912)
[billing-api] Retry attempt 4/5 for saveBatch — SQLException: Communications link failure during rollback

Lessons Learned

Retries amplify leaks if resources are not scoped per attempt.


How this report maps to the real product workflow

This page is a real-format example so teams can evaluate the full flow before login: input signals, evidence-backed RCA, Ask ProdRescue follow-up, and optional GitHub actions by plan.

1) Inputs & context

All plans can paste logs directly. With Slack connected (Pro / Team), pull threads or channels from war rooms and keep that context in one evidence-backed report.

2) Evidence-backed RCA

Timeline, root cause, impact, and action items are generated with citations tied to real log lines (e.g. [1], [6], [8]) so teams can verify every claim.

3) Ask ProdRescue

On report pages, users can ask follow-up questions like "why this happened", "show the evidence", or "suggest a fix" and get incident-context answers grounded in report data.

4) GitHub actions (plan-aware)

Team: connect repo, import commits, run manual deploy analysis, add webhook automation, and submit suggested fixes for review on GitHub (no auto-merge).

Free/PAYG/Pro: incident clarity
Team: automation + suggested fixes

Ask ProdRescue

Get answers. Find the fix.

Incident-aware
  • What is the root cause?
  • What changed before the incident?
  • What evidence supports this?
  • Suggest a fix
  • What should we do next?

Suggested Fix (preview)

- payment.Amount
+ if payment == nil {
+   return ErrInvalidPayment
+ }
+ amount := payment.Amount

Change preview

fix(incident-java-mem): apply suggested remediation

Team plan can publish the change for review on GitHub. No auto-merge.
