ProdRescue AI

Context

Repo: checkout-api · Branch: v2.15.0 · Deploy: #ed. · Incident: #aws-lamb · Severity: P1 · Status: Resolved

Incident verdict

Failure chain detected from production logs and cited evidence lines.

Generated from real production signals. Logs, Slack context, and monitoring traces are correlated before RCA and fix guidance.

ProdRescue AI Incident Report
01 / 08

AWS Lambda Timeout Cascade — Serverless Incident

May 2, 2026 · Prepared for: [Your Organization]

Severity

P1

Service outage

55 min

Peak error rate

34% (timeouts)

Users impacted

~18K validations queued

Status

Resolved

Confidential · May 2, 2026

Executive Summary

On December 1, 2025, the order-validation Lambda (512 MB, 30 s timeout) exceeded its duration limit under burst traffic. Functions attached to a VPC incurred ENI cold-start latency (~6–9 s), compounded by a full DynamoDB table scan in the handler init path on every cold start. Partial failures triggered SQS visibility timeouts and retries, building a cumulative backlog of 340k messages. Mitigations: provisioned concurrency enabled (50), the full table scan replaced with a Query against a GSI, and, in a later iteration, the VPC attachment removed where only DynamoDB was accessed (a VPC endpoint made it unnecessary). MTTR 55 min.
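The scan-to-query remediation summarized above can be sketched as follows. A minimal sketch, assuming the low-level boto3 DynamoDB API; the table and index names (`orders`, `status-index`) and the `status` attribute are hypothetical stand-ins, not values from the incident:

```python
# Sketch of the remediation: replace the cold-path full-table Scan with a
# Query against a GSI. Names here ("orders", "status-index") are assumed.

def scan_params(table: str) -> dict:
    """The problematic pattern: a Scan with no key condition reads every
    item, so its cost grows with table size and can exhaust a Lambda's
    remaining time budget on a cold start."""
    return {"TableName": table}

def query_params(table: str, index: str, status: str) -> dict:
    """The fix: a Query on a GSI touches only the matching partition."""
    return {
        "TableName": table,
        "IndexName": index,
        "KeyConditionExpression": "#s = :s",
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":s": {"S": status}},
    }
```

Either parameter dict would be passed to `client("dynamodb").scan(**...)` or `.query(**...)` respectively; the structural difference is the presence of a key condition bounding the read.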


Timeline

  • 18:05 UTC — CloudWatch: Duration p99 > 25000ms; Throttles 0
  • 18:09 UTC — SQS ApproximateNumberOfMessagesVisible spike
  • 18:14 UTC — Identified ENI init + DynamoDB scan running in init, outside the execution-context reuse path
  • 18:22 UTC — Scale provisioned concurrency; patch hotfix removing scan from cold path
  • 18:40 UTC — Backlog draining; DLQ monitored empty
  • 19:00 UTC — SLO restored
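A rough sanity check on the 18:22–18:40 drain window above, taking the 340k backlog figure at face value; the batch size and per-batch latency are assumptions, not measured values:

```python
# Back-of-envelope check that 50 provisioned-concurrency workers could
# plausibly drain the backlog in the observed window.

backlog = 340_000          # messages (from the executive summary)
drain_window_s = 18 * 60   # 18:22 -> 18:40 UTC

required_rate = backlog / drain_window_s   # ~315 msg/s needed

workers = 50               # provisioned concurrency set during mitigation
batch_size = 10            # assumed SQS batch size
batch_latency_s = 1.0      # assumed warm-path latency per batch

achievable_rate = workers * batch_size / batch_latency_s  # 500 msg/s

assert achievable_rate > required_rate
```

Under these assumptions the warm fleet clears roughly 500 msg/s against a required ~315 msg/s, consistent with the backlog draining inside the window.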

Root Cause Analysis

Primary: VPC-attached Lambda cold path + expensive Dynamo operation during first invocation per sandbox.

Contributing: No reserved concurrency; partial deploy doubled concurrent cold starts.


5 Whys Analysis

  1. Why backlog? → Lambdas too slow to drain queue.
  2. Why slow? → Timeouts and retries.
  3. Why timeouts? → Cold start + DB scan exceeded remaining time.
  4. Why scan? → Developer reused debug helper in prod path.
  5. ROOT CAUSE: Lack of architecture review for VPC necessity; no perf budget on Lambda cold path.

Prevention Checklist

  • VPC only when mandatory; otherwise VPC endpoints / no VPC
  • Provisioned concurrency for latency-sensitive queue consumers
  • Ban DynamoDB Scan in the Lambda CI rule pack
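The "ban Scan" checklist item can start as a trivial CI lint; a sketch under the assumption that a regex pass is acceptable (the pattern list would need tuning for false positives such as unrelated identifiers containing `scan`):

```python
import re

# Flags DynamoDB full-table scans in Lambda source before merge.
SCAN_PATTERN = re.compile(r"\.scan\s*\(|dynamodb:Scan")

def violations(source: str) -> list[int]:
    """Return 1-based line numbers containing a banned Scan usage."""
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if SCAN_PATTERN.search(line)]
```

A CI job would run this over changed files and fail the build when the list is non-empty; `query` calls pass untouched.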

Evidence & Log Samples

REPORT RequestId: a1b2c3d4 Duration: 30000.00 ms Billed Duration: 30000 ms Status: timeout
Init Duration: 8234.56 ms Phase: init — VPC ENI attached
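When correlating evidence lines like the two above, the timings can be extracted with a small parser. A sketch matching the `REPORT` format shown here; the field names it targets are the ones visible in this sample:

```python
import re

REPORT_RE = re.compile(
    r"Duration:\s*(?P<duration>[\d.]+)\s*ms.*?"
    r"Billed Duration:\s*(?P<billed>[\d.]+)\s*ms"
)
INIT_RE = re.compile(r"Init Duration:\s*(?P<init>[\d.]+)\s*ms")

def parse_timings(lines: list[str]) -> dict:
    """Extract duration / billed / init (in ms) from Lambda-style logs."""
    out = {}
    for line in lines:
        if (m := REPORT_RE.search(line)):
            out["duration_ms"] = float(m.group("duration"))
            out["billed_ms"] = float(m.group("billed"))
        if (m := INIT_RE.search(line)):
            out["init_ms"] = float(m.group("init"))
    return out
```

Run against the two evidence lines, this yields a 30,000 ms duration and an 8,234.56 ms init, i.e. the init phase alone consumed over a quarter of the function timeout.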

Lessons Learned

Cold starts are incident triggers — budget ENI attachment and init time into serverless SLOs.
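One way to make that budgeting concrete, using the figures in this report (30 s function timeout, ~8.2 s observed init); the 20% safety margin is an assumption, not a value from the incident:

```python
# Cold-start budget check: how much handler time remains after init,
# and how much of it should be treated as usable with headroom.

function_timeout_ms = 30_000
observed_init_ms = 8_234.56      # from the evidence page
safety_margin = 0.20             # assumed headroom, not from the report

handler_budget_ms = function_timeout_ms - observed_init_ms
usable_budget_ms = handler_budget_ms * (1 - safety_margin)

# A cold invocation whose handler work exceeds usable_budget_ms is at
# risk of timing out; enforce this in CI/SLO checks rather than in prod.
```

With these numbers the handler keeps roughly 21.8 s after init, of which about 17.4 s is usable under the assumed margin.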


How this report maps to the real product workflow

This page is a real-format example so teams can evaluate the full flow before login: input signals, evidence-backed RCA, Ask ProdRescue follow-up, and optional GitHub actions by plan.

1) Inputs & context

All plans can paste logs directly. With Slack connected (Pro / Team), pull threads or channels from war rooms and keep that context in one evidence-backed report.

2) Evidence-backed RCA

Timeline, root cause, impact, and action items are generated with citations tied to real log lines (e.g. [1], [6], [8]) so teams can verify every claim.

3) Ask ProdRescue

On report pages, users can ask follow-up questions like "why this happened", "show the evidence", or "suggest a fix" and get incident-context answers grounded in report data.

4) GitHub actions (plan-aware)

Team: connect repo, import commits, run manual deploy analysis, add webhook automation, and submit suggested fixes for review on GitHub (no auto-merge).

Free / PAYG / Pro: incident clarity · Team: automation + suggested fixes

Ask ProdRescue

Get answers. Find the fix.

Incident-aware
What is the root cause? · What changed before the incident? · What evidence supports this? · Suggest a fix · What should we do next?

Suggested Fix (preview)

- amount := payment.Amount
+ if payment == nil {
+   return ErrInvalidPayment
+ }
+ amount := payment.Amount

Change preview

fix(incident-aws-lamb): apply suggested remediation

Team plan can publish the change for review on GitHub. No auto-merge.

Similar Incident Reports

Had a similar incident?

Paste your logs in the workspace — ProdRescue cites every claim to an evidence line. First analysis free; no credit card required.

Paste your logs