Nginx 502 Bad Gateway — Upstream Failure — Incident Report
May 2, 2026 · Prepared for: [Your Organization]
Severity: P1
Service outage: 22 min
Peak error rate: 73%
Users impacted: ~41K requests
Status: Resolved
Context
Incident verdict
Failure chain detected from production logs and cited evidence lines.
Generated from real production signals. Logs, Slack context, and monitoring traces are correlated before RCA and fix guidance.
On October 14, 2025, edge Nginx returned 502 Bad Gateway for 73% of requests to api.example.com during a deployment window. The upstreams pointed to Kubernetes pods whose Node.js process exited immediately: startup validation threw because STRIPE_WEBHOOK_SECRET was absent from the mounted secret api-gateway-secrets v3 (the key had been renamed in Vault, but the rename was not synced). Rolling back to the prior ReplicaSet restored traffic in 22 minutes.
Primary: Misconfigured secret — required environment variable missing at container start, causing process exit before listen().
Contributing: Health check probed TCP socket opened by sidecar, not application readiness; deploy marked Ready prematurely.
P1, MTTR 22 min, peak 502 rate 73%.
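The primary cause above is the kind of failure a pre-deploy check can catch before any pod starts. The following is a minimal sketch, not the service's actual validation code: `findMissingKeys` and `assertRequiredEnv` are hypothetical helper names, and only STRIPE_WEBHOOK_SECRET comes from the incident. The environment is passed in as a plain record so the same function can run against a rendered secret in CI as well as at container start.

```typescript
// Hypothetical helpers for illustration; only STRIPE_WEBHOOK_SECRET is taken from the incident.
type Env = Record<string, string | undefined>;

// Return every required key that is unset or blank, so one run reports all gaps at once.
function findMissingKeys(env: Env, required: readonly string[]): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Throw a single error that names all missing keys instead of failing on the first one.
function assertRequiredEnv(env: Env, required: readonly string[]): void {
  const missing = findMissingKeys(env, required);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}
```

In a Node.js service this could be called as `assertRequiredEnv(process.env, ["STRIPE_WEBHOOK_SECRET"])` before the server starts. Running the same check in the deploy pipeline against the synced secret would surface a renamed key before traffic is shifted.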
2025/10/14 09:13:18 [error] 1244#1244: *992831 connect() failed (111: Connection refused) while connecting to upstream
Error: STRIPE_WEBHOOK_SECRET is required at startup (config/stripe.ts:42)
[npm] Lifecycle script `start` failed with error code 1
Never trust sidecar-only readiness for application availability.
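One way to act on this is to have the application answer its own readiness probe, so that a process that exits before listen() can never be reported Ready. The sketch below is an assumption-laden illustration: `ReadinessState`, `ProbeResponse`, and `readinessResponse` are hypothetical names, and the decision logic is kept separate from any HTTP framework so it can be tested on its own.

```typescript
// Hypothetical readiness logic; none of these names come from the incident's codebase.
interface ReadinessState {
  configValidated: boolean; // set once startup validation has passed
  listening: boolean;       // set once the application server has bound its port
}

interface ProbeResponse {
  status: number;
  body: string;
}

// Report ready only when the application itself has validated config and is listening.
function readinessResponse(state: ReadinessState): ProbeResponse {
  const pending: string[] = [];
  if (!state.configValidated) pending.push("config");
  if (!state.listening) pending.push("listen");
  return pending.length === 0
    ? { status: 200, body: "ready" }
    : { status: 503, body: `not ready: ${pending.join(", ")}` };
}
```

The orchestrator's readiness probe would then be an HTTP check against the application's own port and a path served by this handler, not a TCP check that a sidecar's open socket can satisfy.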
This page is a real-format example so teams can evaluate the full flow before login: input signals, evidence-backed RCA, Ask ProdRescue follow-up, and optional GitHub actions by plan.
1) Inputs & context
All plans can paste logs directly. With Slack connected (Pro / Team), pull threads or channels from war rooms and keep that context in one evidence-backed report.
2) Evidence-backed RCA
Timeline, root cause, impact, and action items are generated with citations tied to real log lines (e.g. [1], [6], [8]) so teams can verify every claim.
3) Ask ProdRescue
On report pages, users can ask follow-up questions like "why this happened", "show the evidence", or "suggest a fix" and get incident-context answers grounded in report data.
4) GitHub actions (plan-aware)
Team: connect repo, import commits, run manual deploy analysis, add webhook automation, and submit suggested fixes for review on GitHub (no auto-merge).
Get answers. Find the fix.
Suggested Fix (preview)
- payment.Amount
+ if payment == nil {
+     return ErrInvalidPayment
+ }
+ amount := payment.Amount
Change preview
fix(incident-nginx-50): apply suggested remediation
Team plan can publish the change for review on GitHub. No auto-merge.
Had a similar incident?
Paste your logs in the workspace — ProdRescue cites every claim to an evidence line. First analysis free; no credit card required.