ProdRescue AI
Context

Repo: checkout-api · Branch: v2.15.0 · Deploy: #ed. · Incident: #nginx-50 · P1 · Resolved

Incident verdict

Failure chain detected from production logs and cited evidence lines.

Generated from real production signals. Logs, Slack context, and monitoring traces are correlated before RCA and fix guidance.

ProdRescue AI Incident Report

Nginx 502 Bad Gateway — Upstream Failure — Incident Report

May 2, 2026 · Prepared for: [Your Organization]

  • Severity: P1
  • Service outage: 22 min
  • Peak error rate: 73%
  • Users impacted: ~41K requests
  • Status: Resolved

Confidential · May 2, 2026

Executive Summary

On October 14, 2025, edge Nginx returned 502 Bad Gateway for 73% of requests to api.example.com during a deployment window. The upstreams pointed to Kubernetes pods whose Node.js process exited immediately: startup validation threw because STRIPE_WEBHOOK_SECRET was absent from the mounted secret api-gateway-secrets v3 (the key had been renamed in Vault, but the rename was not synced to the cluster). Rolling back to the prior ReplicaSet restored traffic in 22 minutes.

Timeline

  • 09:12:08 UTC — Error budget alert: 502 rate > 15%
  • 09:13:21 UTC — Nginx error_log shows connect() failed (111: Connection refused) upstream 10.24.x.x:3000
  • 09:14:02 UTC — Pods CrashLoopBackOff; logs: Error: STRIPE_WEBHOOK_SECRET required
  • 09:16:40 UTC — Change correlation: Helm chart 4.2.0 references new secret key name
  • 09:18:55 UTC — Rollback deployment to chart 4.1.8; pods healthy
  • 09:34:10 UTC — Vault → K8s sync corrected; forward fix redeployed off-hours

Root Cause Analysis

Primary: Misconfigured secret — required environment variable missing at container start, causing process exit before listen().

Contributing: Health check probed TCP socket opened by sidecar, not application readiness; deploy marked Ready prematurely.
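
The contributing factor above can be illustrated with a minimal sketch of readiness logic that depends on the application itself rather than on a sidecar's open socket. All names here (`evaluateReadiness`, `DependencyCheck`, `app-listener`) are hypothetical and not taken from the incident's codebase; the sketch only assumes that a `/health` handler on the app port would return the computed status.

```typescript
// Hypothetical readiness logic; names are illustrative, not from the affected service.
type DependencyCheck = { name: string; ok: boolean };

interface ReadinessResult {
  status: 200 | 503;
  body: { ready: boolean; failing: string[] };
}

// Report ready only when the application has started listening
// and every dependency check passes.
function evaluateReadiness(appListening: boolean, checks: DependencyCheck[]): ReadinessResult {
  const failing = checks.filter((c) => !c.ok).map((c) => c.name);
  if (!appListening) {
    // The sidecar's socket being open says nothing about this flag.
    failing.unshift("app-listener");
  }
  const ready = failing.length === 0;
  return { status: ready ? 200 : 503, body: { ready, failing } };
}
```

With a handler like this behind the readiness probe, a pod whose Node.js process never reached listen() would fail the probe outright (nothing answers on the app port), and a pod that started with a failing dependency would answer 503 instead of being marked Ready.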

Impact

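Severity P1. Mean time to recovery (MTTR) was 22 minutes, the 502 rate peaked at 73% of requests, and roughly 41K requests were affected.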

5 Whys Analysis

  1. Why 502? → Nginx could not reach upstream.
  2. Why unreachable? → Node process not listening.
  3. Why not listening? → Process exited at startup because a required environment variable was missing.
  4. Why missing? → Secret key rename not propagated to cluster.
  5. ROOT CAUSE: No schema validation for required env vars at deploy; readiness probe did not hit HTTP /health on app port.
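
The missing validation named in the root cause could look something like the sketch below: a check that lists every required variable that is unset or empty, usable both in a pre-deploy step and at process start. This is an illustration under assumptions, not the service's actual code; `findMissingEnv` and `assertRequiredEnv` are hypothetical helpers, and only STRIPE_WEBHOOK_SECRET comes from the report.

```typescript
type Env = Record<string, string | undefined>;

// Return every required variable that is unset or blank,
// so a single failed check reports all gaps at once.
function findMissingEnv(required: readonly string[], env: Env): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Throw one descriptive error naming all missing variables.
function assertRequiredEnv(required: readonly string[], env: Env): void {
  const missing = findMissingEnv(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}
```

In a Node.js service this might be called as `assertRequiredEnv(["STRIPE_WEBHOOK_SECRET"], process.env)` before the server starts. Running the same check in the deploy pipeline against the rendered secret is what would move the failure from production to the gate.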

Prevention Checklist

  • Readiness probe hits application /health with dependency checks
  • Pre-deploy manifest diff gates the release on required secret keys being present
  • Synthetic canary hits API route post-deploy before traffic shift
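
As one possible shape for the manifest gate in the checklist above, the sketch below compares the secret keys a release references with the keys actually present in the cluster secret and blocks the deploy when any are absent. `gateOnSecretKeys` and `GateResult` are hypothetical names, and how the two key lists are obtained (for example from rendered chart output and the synced secret) is left as an assumption.

```typescript
interface GateResult {
  allowed: boolean;
  missing: string[];
}

// Block the release if the chart references any key the secret does not contain.
function gateOnSecretKeys(referenced: readonly string[], available: readonly string[]): GateResult {
  const missing = referenced
    .filter((key, i) => referenced.indexOf(key) === i) // drop duplicate references
    .filter((key) => available.indexOf(key) === -1) // keep keys absent from the secret
    .sort();
  return { allowed: missing.length === 0, missing };
}
```

A gate of this kind would flag a renamed key before rollout, since the new name appears in the referenced list but not yet in the synced secret.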

Evidence & Log Samples

2025/10/14 09:13:18 [error] 1244#1244: *992831 connect() failed (111: Connection refused) while connecting to upstream
Error: STRIPE_WEBHOOK_SECRET is required at startup (config/stripe.ts:42)
[npm] Lifecycle script `start` failed with error code 1
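
To show how an evidence line like the first sample above can be turned into structured fields, here is a small parsing sketch. It assumes the common Nginx error_log layout (timestamp, level, pid#tid, optional connection id, message) and is not ProdRescue's actual parser; `parseNginxErrorLine` and `NginxErrorEntry` are hypothetical names.

```typescript
interface NginxErrorEntry {
  timestamp: string;
  level: string;
  pid: number;
  connection: number | null;
  message: string;
  errno: number | null;
}

// Parse one Nginx error_log line; returns null for lines in any other format.
function parseNginxErrorLine(line: string): NginxErrorEntry | null {
  const m = /^(\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\d+)#(\d+): (?:\*(\d+) )?(.*)$/.exec(line);
  if (m === null) return null;
  // OS error codes appear as "(111: Connection refused)" inside the message.
  const errnoMatch = /\((\d+): [^)]+\)/.exec(m[6]);
  return {
    timestamp: m[1],
    level: m[2],
    pid: Number(m[3]),
    connection: m[5] === undefined ? null : Number(m[5]),
    message: m[6],
    errno: errnoMatch === null ? null : Number(errnoMatch[1]),
  };
}
```

On Linux, errno 111 is ECONNREFUSED, which matches the finding that nothing was listening on the upstream port. The Node.js and npm lines in the sample do not follow this layout, so the function returns null for them.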

Lessons Learned

Never trust sidecar-only readiness for application availability.

How this report maps to the real product workflow

This page is a real-format example so teams can evaluate the full flow before login: input signals, evidence-backed RCA, Ask ProdRescue follow-up, and optional GitHub actions by plan.

1) Inputs & context

All plans can paste logs directly. With Slack connected (Pro / Team), pull threads or channels from war rooms and keep that context in one evidence-backed report.

2) Evidence-backed RCA

Timeline, root cause, impact, and action items are generated with citations tied to real log lines (e.g. [1], [6], [8]) so teams can verify every claim.

3) Ask ProdRescue

On report pages, users can ask follow-up questions like "why this happened", "show the evidence", or "suggest a fix" and get incident-context answers grounded in report data.

4) GitHub actions (plan-aware)

Team: connect repo, import commits, run manual deploy analysis, add webhook automation, and submit suggested fixes for review on GitHub (no auto-merge).

Free/PAYG/Pro: incident clarity
Team: automation + suggested fixes

Ask ProdRescue

Get answers. Find the fix.

Incident-aware
  • What is the root cause?
  • What changed before the incident?
  • What evidence supports this?
  • Suggest a fix
  • What should we do next?

Suggested Fix (preview)

- payment.Amount
+ if payment == nil {
+   return ErrInvalidPayment
+ }
+ amount := payment.Amount

Change preview

fix(incident-nginx-50): apply suggested remediation

Team plan can publish the change for review on GitHub. No auto-merge.

Had a similar incident?

Paste your logs in the workspace — ProdRescue cites every claim to an evidence line. First analysis free; no credit card required.