AgentKits

AI Incident Response Agent

Flagship Blueprint · AgentAz™ Enhanced

Includes Agent Blueprint + Implementation Guide

An on-call co-pilot that turns a noisy alert into a structured response. It correlates the alert with metrics, logs, and recent deploys, forms an evidence-backed hypothesis, and proposes mitigations — executing low-risk, reversible ones automatically while gating anything risky behind human approval. It drafts clear status updates, respects blast radius and cost, and escalates and pages on real SEV1s instead of fumbling. Built so the worst case is a fast, well-summarized handoff to a human, never an unsafe automated action.

Tags: incident-response, sre, devops, on-call, observability, autonomous-agent, runbooks, escalation, agentaz, agent-governance, trust-level, production-readiness
Stack: Claude, LangGraph, OpenAI
Difficulty: Advanced
Setup: 50 min
Version: 2.0.0 · 2026-06-21

Overview

Alert → correlate → hypothesize → mitigate → communicate: a full first-responder loop grounded in your real telemetry.

Acts only within its authority: low-risk, reversible steps run automatically; rollbacks, scaling, and anything risky require human approval.

Evidence-based: every hypothesis cites the metric, log, or deploy that supports it — no guessing at root cause.

Fails to a human, fast: real SEV1s and ambiguous, high-blast-radius incidents are escalated and paged with a clean summary and a holding status update.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level: A4 (Limited Autonomy)
DNA Pattern: Execution (Research → Plan → Execute → Verify)
Worst-Case Action: Executes an allowlisted low-risk, reversible remediation step (e.g. restarting a stuck service) that turns out to be unnecessary. Every auto-executed step is sandboxed with a registered rollback. Irreversible or high-impact actions (production rollbacks, scaling, security or config changes) are never auto-executed; they require human approval.
Authority Boundary: May run a small allowlist of low-risk, reversible remediation steps automatically during an incident, each with a tested rollback and full audit trail. Anything risky or irreversible is proposed for human approval, not executed. It drafts status updates and escalates SEV1, but never makes an irreversible change on its own.
Verification Test: Trigger a high-risk action (e.g. scale-down production) → confirm it routes to human approval and is NOT auto-executed, and confirm every auto-executed step has a registered rollback.
Production Readiness: 6/6 dimensions passing. Tool isolation: only reversible/low-risk tools are auto-executable; risky tools are gated. Human gates: irreversible actions require approval. Confidence escalation: low confidence routes to on-call. Cost ceiling: bounded per incident. Audit trail: every action and rollback logged append-only. Escalation path: SEV1 routed to human on-call.
Last Reviewed: 2026-06-24
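
The verification test can be sketched as a small routing gate plus two checks. This is a minimal sketch, not the blueprint's shipped code: `route_tool`, `verify_gate`, and the `ROLLBACKS` registry (including every rollback step name in it) are hypothetical, and the tool names are copied from the `tool_boundary` section of agentaz.json.

```python
# Hypothetical gate: tool names mirror tool_boundary in agentaz.json.
AUTO_EXECUTABLE = {"restart_service", "clear_cache", "rotate_log", "health_check"}
APPROVAL_REQUIRED = {"rollback_deploy", "scale_service", "change_config", "modify_security_group"}

# Assumed registry: each auto-executable tool maps to the step that undoes it.
# The rollback step names are invented for illustration.
ROLLBACKS = {
    "restart_service": "restore_previous_process_state",
    "clear_cache": "rewarm_cache",
    "rotate_log": "restore_previous_log",
    "health_check": "noop",  # read-only, nothing to undo
}


def route_tool(tool: str) -> str:
    """Return 'auto', 'approval', or 'deny' for a requested tool."""
    if tool in APPROVAL_REQUIRED:
        return "approval"
    if tool in AUTO_EXECUTABLE and tool in ROLLBACKS:
        return "auto"
    return "deny"  # unknown tools, or auto tools with no rollback, never run


def verify_gate() -> bool:
    """The two checks from the verification test."""
    high_risk_gated = route_tool("scale_service") == "approval"
    rollbacks_registered = all(t in ROLLBACKS for t in AUTO_EXECUTABLE)
    return high_risk_gated and rollbacks_registered
```

Running `verify_gate()` in CI is one way to keep the check from drifting as the allowlist changes.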

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json
{
  "$schema": "./agentaz.schema.json",
  "agent_id": "incident-response-agent",
  "version": "2.0.0",
  "trust_level": "A4",
  "dna_pattern": "Execution",
  "worst_case_action": "Runs a reversible, low-risk auto-remediation that was unnecessary; rolled back. Irreversible actions require human approval.",
  "authority_boundary": "Auto-runs allowlisted reversible steps with rollback; risky/irreversible actions require human approval.",
  "last_reviewed": "2026-06-24",
  "tags": [
    "devops",
    "sre",
    "security",
    "sandboxed",
    "rollback",
    "human-approval",
    "agentaz",
    "agent-governance",
    "trust-level",
    "production-readiness"
  ],
  "tool_boundary": {
    "auto_executable_tools": [
      "restart_service",
      "clear_cache",
      "rotate_log",
      "health_check"
    ],
    "approval_required_tools": [
      "rollback_deploy",
      "scale_service",
      "change_config",
      "modify_security_group"
    ],
    "execution_tools_absent": false,
    "rollback_required": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_without_approval": [
      "rollback_deploy",
      "scale_service",
      "change_config",
      "modify_security_group"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.35,
    "alert_threshold_usd": 0.25
  },
  "loop_boundary": {
    "max_reasoning_turns": 12
  },
  "human_handoff": {
    "triggers": [
      "sev1",
      "irreversible_action",
      "low_confidence"
    ],
    "destination": "oncall_engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "actions",
      "rollbacks",
      "approvals",
      "escalations"
    ]
  }
}
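
Beyond schema validation, a few cross-field checks can catch a contract that is well-formed but self-contradictory. The sketch below is illustrative only: `contract_problems` and the three rules it applies are assumptions, not part of the AgentAz™ JSON Schema, and the embedded contract is a trimmed copy of the fields above.

```python
import json

# Trimmed copy of agentaz.json: only the fields the checks read.
CONTRACT = json.loads("""
{
  "tool_boundary": {
    "auto_executable_tools": ["restart_service", "clear_cache", "rotate_log", "health_check"],
    "approval_required_tools": ["rollback_deploy", "scale_service", "change_config", "modify_security_group"],
    "rollback_required": true
  },
  "output_boundary": {
    "never_without_approval": ["rollback_deploy", "scale_service", "change_config", "modify_security_group"]
  },
  "cost_boundary": {"max_usd_per_trace_loop": 0.35, "alert_threshold_usd": 0.25}
}
""")


def contract_problems(contract: dict) -> list[str]:
    """Return human-readable inconsistencies; an empty list means none found."""
    problems = []
    tools = contract["tool_boundary"]
    auto = set(tools["auto_executable_tools"])
    gated = set(tools["approval_required_tools"])
    never = set(contract["output_boundary"]["never_without_approval"])
    cost = contract["cost_boundary"]
    if auto & gated:
        problems.append(f"tools in both lists: {sorted(auto & gated)}")
    if never - gated:
        problems.append(f"never_without_approval not gated: {sorted(never - gated)}")
    if cost["alert_threshold_usd"] >= cost["max_usd_per_trace_loop"]:
        problems.append("alert threshold must be below the hard cost cap")
    return problems
```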

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

This is a flagship reference blueprint for AgentAz v1.0.0. AgentAz™ is open source under Apache-2.0 (spec text under CC‑BY‑4.0) — schema and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal: Bounded by the authority spec above
Trust Level: A4 (Limited Autonomy)
Tool access: Scoped tools; high-risk actions gated behind approval
Context handling: Grounded in provided inputs; cites or flags rather than guessing
Memory strategy: Task-scoped; no persistent cross-session memory
Human approval: Required on sev1, irreversible action, low confidence → on-call engineer
Audit trail: Append-only log (actions, rollbacks, approvals, escalations)
Cost & loop bounds: ≤ $0.35 per loop · ≤ 12 reasoning turns
Recovery / escalation: Escalates to on-call engineer
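
Because the spec itself does not enforce anything at runtime, the cost and loop bounds need a counter somewhere outside the model. A minimal sketch, assuming one budget object per incident loop; `LoopBudget` and its return values are hypothetical, and the defaults are the contract's `cost_boundary` and `loop_boundary` values.

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    max_usd: float = 0.35      # cost_boundary.max_usd_per_trace_loop
    alert_usd: float = 0.25    # cost_boundary.alert_threshold_usd
    max_turns: int = 12        # loop_boundary.max_reasoning_turns
    spent_usd: float = 0.0
    turns: int = 0

    def record_turn(self, cost_usd: float) -> str:
        """Charge one reasoning turn; return 'ok', 'alert', or 'halt'."""
        self.spent_usd += cost_usd
        self.turns += 1
        if self.spent_usd >= self.max_usd or self.turns >= self.max_turns:
            return "halt"   # stop reasoning and hand off to on-call
        if self.spent_usd >= self.alert_usd:
            return "alert"  # still running, but flag the spend
        return "ok"
```

In this sketch the caller is expected to stop the loop on `"halt"` and escalate with whatever evidence has been gathered so far.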

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent: Primary reasoner, Limited Autonomy authority (A4)
Tools: restart service, clear cache, rotate log, health check; approval-gated: rollback deploy, scale service, change config, modify security group
Memory: Task-scoped working context; no persistent cross-session memory
Guardrails: Worst-case classified (A4); high-risk actions gated; ≤ $0.35/loop · ≤ 12 turns
Evaluator: Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff: Escalates to on-call engineer on sev1, irreversible action, low confidence

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Misdiagnoses the incident and targets the wrong service with a remediation step.

Detection
Pre-action validation checks the action target against the alert's affected service; a mismatch raises an anomaly before anything runs.
Mitigation
Remediation actions are reversible and sandboxed; destructive steps are gated behind human approval and must match the diagnosed scope.
Recovery
Automatic rollback of the reversible step; the incident is escalated to on-call with the full diagnosis trail.
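
The pre-action check described under Detection can be as small as one comparison. A minimal sketch, assuming the alert carries a `service` field and the action follows the `{"tool", "args"}` shape from the output format; `validate_target` is a hypothetical name.

```python
def validate_target(alert: dict, action: dict) -> None:
    """Raise before execution if the action targets a different service."""
    affected = alert["service"]
    target = action["args"].get("service")
    if target != affected:
        raise ValueError(
            f"target mismatch: action targets {target!r}, alert affects {affected!r}"
        )
```

An action with no `service` argument also fails the check here, which errs toward blocking rather than running an unscoped step.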

Acts on a stale or duplicate alert for an incident that is already resolved.

Detection
Incident status and a dedup key are checked before any action.
Mitigation
An idempotency key per incident makes repeated triggers a no-op.
Recovery
The duplicate is closed and the dedup event is logged.
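
One way to make repeated triggers a no-op is to derive a key from the fields that identify an incident and refuse to act on a key already seen. This is a sketch under assumptions: the key fields (service, alert name, window start) and the in-memory store are illustrative, and a real deployment would need shared, persistent storage.

```python
import hashlib


class IncidentDeduper:
    def __init__(self) -> None:
        self._seen: dict[str, str] = {}   # idempotency key -> incident status

    @staticmethod
    def key(service: str, alert_name: str, window_start: str) -> str:
        """Stable key for one incident; the chosen fields are an assumption."""
        raw = f"{service}|{alert_name}|{window_start}"
        return hashlib.sha256(raw.encode()).hexdigest()[:16]

    def should_act(self, key: str) -> bool:
        """True only the first time a key is seen; repeats are a no-op."""
        if key in self._seen:
            return False
        self._seen[key] = "open"
        return True
```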

Remediation loops — repeated restarts that never converge.

Detection
A loop counter and max-reasoning-turn cap detect repeated identical actions.
Mitigation
Bounded retries with backoff; escalate after N attempts.
Recovery
Automation halts and pages a human with the full attempt history.
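
Bounded retries with backoff, followed by escalation, might look like the sketch below. The attempt count, the exponential schedule, and the function name are assumptions; the source only says "N attempts". The `sleep` parameter is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable


def run_with_bounded_retries(
    step: Callable[[], bool],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry a step with exponential backoff, then escalate instead of looping."""
    history = []
    for attempt in range(1, max_attempts + 1):
        ok = step()
        history.append({"attempt": attempt, "ok": ok})
        if ok:
            return {"status": "recovered", "history": history}
        if attempt < max_attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))   # 1s, 2s, 4s, ...
    return {"status": "escalate", "history": history}
```

The returned `history` is what would be attached to the page so the human sees every attempt.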

A cascading action makes the incident worse.

Detection
A health check runs after each step; if health degrades, the chain aborts.
Mitigation
One reversible action at a time with a health gate between steps.
Recovery
The last action is rolled back, automation is frozen, and control passes to a human.
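
The health gate between steps can be sketched as a loop that applies one reversible step, checks health, and undoes that step if health has degraded. `Step`, `run_chain`, and the result shape are hypothetical; the `healthy` callable stands in for whatever health check you run.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_chain(steps: list[Step], healthy: Callable[[], bool]) -> dict:
    """Apply one reversible step at a time; abort and undo on a failed health check."""
    applied = []
    for step in steps:
        step.apply()
        if not healthy():
            step.rollback()            # undo only the step that degraded health
            return {"status": "frozen", "applied": applied, "rolled_back": step.name}
        applied.append(step.name)
    return {"status": "ok", "applied": applied, "rolled_back": None}
```

A `"frozen"` result means no later step runs; the caller would hand control to a human at that point.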

Evaluation

Action correctness matters most here — whether the remediation it ran or proposed was the right one for the diagnosed incident — because a wrong action has real blast radius.

Diagnosis accuracy: Share of incidents where the identified root cause and affected service match the ground truth.
Action correctness: Of actions taken or proposed, the share that were appropriate and scoped to the actual incident.
Rollback success: When a reversible action is undone, how reliably the system returns to its prior state.
Escalation rate: How often it hands off to on-call, split into correct escalations vs missed or unnecessary ones.
Latency to first action: Time from alert to the first correct remediation step.

Recommended approach. Replay a labeled set of historical incidents in a sandbox and compare proposed actions to what on-call actually did; track rollback success and escalation quality separately. Never grade on live production traffic.
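
Scoring a replayed set could be sketched as below. The per-incident record fields (`diagnosis_correct`, `action_correct`, `rollback_attempted`, `rollback_ok`, `escalated`, `should_escalate`) are assumptions about how you label the replay results, and latency to first action is left out because it depends on how timestamps are recorded.

```python
def score_replays(results: list[dict]) -> dict:
    """Aggregate labeled per-incident replay results into summary metrics."""
    if not results:
        raise ValueError("no replay results to score")
    n = len(results)
    rollbacks = [r for r in results if r["rollback_attempted"]]
    escalated = [r for r in results if r["escalated"]]
    return {
        "diagnosis_accuracy": sum(r["diagnosis_correct"] for r in results) / n,
        "action_correctness": sum(r["action_correct"] for r in results) / n,
        # None when no rollback was attempted, so it is not confused with 0%.
        "rollback_success": (
            sum(r["rollback_ok"] for r in rollbacks) / len(rollbacks) if rollbacks else None
        ),
        "escalation_rate": len(escalated) / n,
        "correct_escalations": sum(r["should_escalate"] for r in escalated),
        "missed_escalations": sum(
            r["should_escalate"] and not r["escalated"] for r in results
        ),
    }
```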

When to use

Use it when

  • You run on-call and want faster triage on the flood of alerts, especially the repetitive, well-understood ones.
  • You have metrics, logs, and deploy history the agent can correlate to form grounded hypotheses.
  • You have runbooks with clearly safe, reversible steps that can be automated under guardrails.
  • You want consistent, timely status updates drafted during an incident.
  • You want the agent to handle first response and escalate the genuinely serious incidents to humans with context.

Avoid it when

  • You have no observability data for the agent to ground hypotheses in — it would be guessing.
  • You expect it to resolve novel SEV1s autonomously; those need experienced humans and the agent should escalate.
  • Your mitigations are all high-risk/irreversible with no safe automation surface.
  • You are unwilling to put approval gates in front of production-changing actions.

System prompt

system-prompt.md
You are an Autonomous Incident Response Agent acting as a first responder for an on-call SRE team. Your job is to triage one alert/incident: understand it, mitigate what is safe, communicate clearly, and escalate fast when it is serious. You are judged on reducing time-to-mitigate AND on never taking an unsafe action and never hiding a real incident.

== CORE PRINCIPLES ==
1. Evidence first. Form a hypothesis only from telemetry you have actually queried — metrics, logs, traces, recent deploys/changes. Cite the specific signal. Never assert a cause you cannot show.
2. Safety over speed. A fast wrong action is worse than a clean escalation. When in doubt, stabilize, communicate, and hand to a human.
3. Smallest safe action. Prefer the least invasive, most reversible mitigation that addresses the evidence.

== HARD RULES (NON-NEGOTIABLE) ==
- ACTION TIERS: You may AUTONOMOUSLY take only low-risk, reversible, explicitly allow-listed actions (e.g. restart a stateless pod, clear a cache, scale up within a cap, silence a known-false alert). Any rollback, deploy, scale-down, data operation, traffic shift, or config change to production REQUIRES human approval — propose it, do not execute it.
- NEVER hide severity. Do not downgrade or silence an alert that could be a real incident to make the board look clean. Suppress only alerts you can show are non-actionable, and say why.
- BLAST RADIUS: Estimate the blast radius before any action. If an action could affect a broad scope or a critical/customer-facing service, it is not autonomous — escalate or seek approval.
- DON'T BREAK MORE: Do not take actions that could worsen the incident (e.g. mass restarts during a thundering-herd). If unsure of an action's effect, don't take it.
- COMMUNICATE: Keep humans informed with concise, honest status updates. Never promise a resolution time or root cause you cannot support.

== SEVERITY & DECISION ==
- Assess severity (SEV1 critical/customer-facing outage or data risk; SEV2 major degradation; SEV3 minor/limited; SEV4 noise).
- AUTO_MITIGATE: SEV3/known-pattern with an allow-listed, reversible fix and confidence >= 0.8. Execute, verify, communicate.
- PROPOSE: a non-allow-listed but evidence-backed mitigation (e.g. rollback the suspect deploy). Stage it for one-click human approval with the supporting evidence.
- ESCALATE + PAGE: SEV1/SEV2, broad blast radius, data-loss/security signals, conflicting or missing evidence, or confidence < 0.6. Page on-call, post a holding update, and hand over a structured summary.

== COST CONTROL ==
Query the smallest set of signals that tests your hypothesis; do not pull every dashboard. Stop investigating once you can decide. Cap tool calls; if exceeded, escalate with current evidence. Keep updates short.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "severity": "SEV1|SEV2|SEV3|SEV4",
  "confidence": <0.0-1.0>,
  "hypothesis": "<likely cause, each claim tied to a cited signal>",
  "evidence": ["<metric/log/deploy reference>"],
  "blast_radius": "<scope and affected services/users>",
  "decision": "AUTO_MITIGATE|PROPOSE|ESCALATE",
  "actions": [ { "tool": "<tool>", "args": { ... }, "reversible": <bool>, "requires_approval": <bool> } ],
  "status_update": "<concise, honest message for the channel>",
  "escalation": { "needed": <bool>, "page": <bool>, "reason": "<why>", "handoff": "<summary + suggested next steps for the human>" }
}
If decision is ESCALATE, do not execute production-changing actions; post the holding update and hand off.
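
Since the prompt's rules are only instructions to the model, the caller may want to validate the returned object before acting on it. The sketch below checks the fields and enums from the output format and the ESCALATE rule; `validate_output` is hypothetical, and `NON_MUTATING_TOOLS` is an assumption about which tools do not change production.

```python
import json

SEVERITIES = {"SEV1", "SEV2", "SEV3", "SEV4"}
DECISIONS = {"AUTO_MITIGATE", "PROPOSE", "ESCALATE"}
REQUIRED = {"severity", "confidence", "hypothesis", "evidence", "blast_radius",
            "decision", "actions", "status_update", "escalation"}
# Assumption: tools that never change production state, so they are allowed on ESCALATE.
NON_MUTATING_TOOLS = {"page_oncall", "post_status_update"}


def validate_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the reply is usable."""
    out = json.loads(raw)
    errors = [f"missing field: {k}" for k in sorted(REQUIRED - set(out))]
    if errors:
        return errors
    if out["severity"] not in SEVERITIES:
        errors.append("unknown severity")
    if out["decision"] not in DECISIONS:
        errors.append("unknown decision")
    if not 0.0 <= out["confidence"] <= 1.0:
        errors.append("confidence out of range")
    if not out["evidence"]:
        errors.append("hypothesis has no cited evidence")
    if out["decision"] == "ESCALATE":
        for action in out["actions"]:
            if action["tool"] not in NON_MUTATING_TOOLS:
                errors.append(f"production-changing action on ESCALATE: {action['tool']}")
    return errors
```

A non-empty result would reasonably be treated as a low-confidence outcome and routed to on-call rather than retried silently.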

Setup guide

Install and connect observability

Install the agent and connect it (read-only) to your metrics, logs, and deploy systems.

shell
pipx install incident-agent
incident-agent connect --metrics prometheus --logs loki --deploys github
incident-agent doctor   # verifies read access + paging webhook

Set action authority and caps

Define what the agent may do autonomously. Everything else is propose-only. These limits are enforced outside the model.

shell
cp .env.example .env
# Then edit .env and set the values below:
ANTHROPIC_API_KEY=sk-ant-...
PAGER_WEBHOOK=...
MAX_TOOL_CALLS=8
AUTO_SCALE_CAP=2x
MODE=copilot   # copilot (propose) | responder (auto low-risk)

Allow-list safe runbook actions

Only reversible, low-blast-radius actions belong here. Risky actions stay approval-gated.

yaml
# .incident.yml
autonomous_actions:
  - restart_stateless_pod
  - clear_cache
  - scale_up_within_cap
  - silence_known_false_alert
require_approval:
  - rollback_deploy
  - scale_down
  - shift_traffic
always_escalate: [ "sev1", "data_layer", "security" ]
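
How the allow-list might combine with `MODE` can be sketched as a single authorization function. Everything here is illustrative: `authorize` and its return values are hypothetical, the policy is written as a plain dict mirroring `.incident.yml` so no YAML parser is needed, and checking `always_escalate` first is an assumed ordering.

```python
# Mirrors .incident.yml as a plain dict so the sketch needs no YAML parser.
POLICY = {
    "autonomous_actions": {"restart_stateless_pod", "clear_cache",
                           "scale_up_within_cap", "silence_known_false_alert"},
    "require_approval": {"rollback_deploy", "scale_down", "shift_traffic"},
    "always_escalate": {"sev1", "data_layer", "security"},
}


def authorize(action: str, mode: str, incident_tags: set[str]) -> str:
    """Return 'escalate', 'execute', or 'propose' for one requested action."""
    if incident_tags & POLICY["always_escalate"]:
        return "escalate"                 # checked first, regardless of action
    if action in POLICY["autonomous_actions"] and mode == "responder":
        return "execute"                  # allow-listed and auto mode enabled
    return "propose"                      # copilot mode, gated, or unlisted actions
```

Unlisted actions fall through to `"propose"`, matching the statement that everything outside the autonomous list is propose-only.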

Replay a past incident

Validate the agent's reasoning and proposed actions against a known incident before going live.

shell
incident-agent replay --incident 2026-05-INC-204 --explain
# prints severity, hypothesis, evidence, proposed actions, status update

Wire it to your alerting

Route alerts to the agent as a first responder. Start in copilot mode (proposes only), then enable responder mode for allow-listed actions once trust is built.

shell
# Alertmanager receiver -> POST https://your-host/incident/alert (HMAC)
# Promote MODE=responder after reviewing a few weeks of proposals
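
The receiver comment mentions HMAC but not the scheme. A minimal verification sketch, assuming HMAC-SHA256 over the raw request body with the hex digest sent alongside the request; the algorithm, encoding, and how the signature is transported are all assumptions to align with whatever sits between your alerting and this endpoint.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign(secret, body), received)
```

Verification should run on the exact bytes received, before any JSON parsing, because re-serializing the body can change it and invalidate the signature.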

Architecture

Tools required

get_alert: Fetch the firing alert with its service, metric, threshold, and timestamps, and check it against currently active incidents.
query_metrics: Run time-series queries (latency, error rate, saturation, traffic) around the incident window to quantify impact and find anomalies.
search_logs: Search logs and traces for errors, stack traces, and patterns correlated with the alert to support or refute a hypothesis.
list_recent_deploys: List recent deploys, feature-flag flips, and config changes near the alert time, the most common incident trigger.
run_runbook_step: Execute an allow-listed, reversible runbook action (e.g. restart a stateless service, clear a cache, scale up within a cap).
rollback_deploy: Roll back a suspect deployment. High-risk: always staged for human approval, never executed autonomously.
post_status_update: Post a concise status update to the incident channel and stakeholder list with current impact and next steps.
page_oncall: Escalate by paging the on-call engineer/secondary with a structured incident summary when severity or uncertainty warrants a human.

Workflow

  1. Receive and dedup the alert

    Pull the alert, check it against active incidents, and normalize the affected service, signal, and breach. Drop confirmed duplicates into the existing incident.

  2. Correlate telemetry

    Query the minimal set of metrics, logs, and recent deploys around the alert window needed to test a hypothesis — not every dashboard.

  3. Hypothesize and score

    Propose the most likely cause from the evidence, assign severity, and estimate blast radius, citing the specific signals behind each claim.

  4. Decide within authority

    If a known pattern with an allow-listed reversible fix and high confidence, AUTO_MITIGATE. If a riskier but evidence-backed fix, PROPOSE for approval. Otherwise ESCALATE.

  5. Act through the guard

    The executor runs only low-risk, reversible, allow-listed actions; production-changing actions are staged for one-click human approval with the evidence attached.

  6. Communicate

    Post an honest, concise status update — impact, what's known, what's next — without inventing ETAs or root causes.

  7. Escalate or resolve, then record

    Page on-call for SEV1/SEV2 or low confidence with a clean handoff; otherwise verify recovery and log the full timeline for the postmortem.
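
Step 4 of the workflow can be sketched as a pure function. The 0.8 and 0.6 thresholds come from the system prompt; everything else is an assumption, including the function name, the boolean inputs, and the handling of confidence between 0.6 and 0.8, which the prompt leaves open. The system prompt also lists SEV2 under ESCALATE + PAGE while the first example proposes on a SEV2; this sketch follows the workflow wording and treats only SEV1 as an unconditional escalation, which you would need to settle for your own policy. Blast radius is omitted for brevity.

```python
def decide(severity: str, confidence: float, fix_allow_listed: bool,
           fix_reversible: bool, has_evidence_backed_fix: bool) -> str:
    """Map an assessed incident to AUTO_MITIGATE, PROPOSE, or ESCALATE."""
    if severity == "SEV1" or confidence < 0.6:
        return "ESCALATE"
    if fix_allow_listed and fix_reversible and confidence >= 0.8:
        return "AUTO_MITIGATE"
    if has_evidence_backed_fix:
        return "PROPOSE"      # includes the 0.6-0.8 band (assumed behavior)
    return "ESCALATE"         # nothing safe to run and nothing worth proposing
```

Applied to the three examples in this blueprint, the inputs (SEV2, 0.83, rollback not allow-listed), (SEV4, 0.88, allow-listed silence), and (SEV1, 0.45) map to PROPOSE, AUTO_MITIGATE, and ESCALATE respectively.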

Examples

Error spike right after a deploy → propose rollback

Error rate jumps minutes after a deploy. Evidence points squarely at the release; rollback is risky, so the agent proposes it for approval rather than executing.

Input

input
Alert: checkout-service 5xx rate 0.4% -> 9% over 6 min.
Context available: deploy checkout-service v812 finished 4 min before the spike.

Output

output
{
  "severity": "SEV2",
  "confidence": 0.83,
  "hypothesis": "Deploy checkout-service v812 introduced the regression: 5xx began ~4 min after rollout and errors trace to a new code path added in v812.",
  "evidence": ["metric: 5xx 0.4%->9% at 14:06", "deploy: v812 completed 14:02", "logs: NullReference in PaymentMapper (new in v812) x3,412"],
  "blast_radius": "checkout-service, ~9% of checkout attempts failing; customer-facing revenue impact.",
  "decision": "PROPOSE",
  "actions": [ { "tool": "rollback_deploy", "args": { "service": "checkout-service", "to": "v811" }, "reversible": true, "requires_approval": true } ],
  "status_update": "Investigating elevated checkout errors (started 14:06, ~9% of attempts). Strong signal points to deploy v812; proposing rollback to v811 pending on-call approval. Next update in 10 min.",
  "escalation": { "needed": false, "page": false, "reason": "", "handoff": "" }
}

Note: Customer-facing errors make this SEV2, and the evidence is strong — but a rollback changes production, so the agent proposes it for one-click approval instead of firing it. The status update is specific and promises only the next update time, not a fix ETA.

Flapping non-actionable alert → safe auto-suppress

A known-noisy alert fires again with no real impact. The agent suppresses it with justification rather than paging a human at 3am.

Input

input
Alert: batch-worker CPU > 80% for 5 min (fires nightly during the scheduled report job).

Output

output
{
  "severity": "SEV4",
  "confidence": 0.88,
  "hypothesis": "Expected load: the nightly report job runs 02:00-02:20 and predictably pushes batch-worker CPU above 80%. No user-facing impact.",
  "evidence": ["metric: CPU spike aligns 02:00-02:20 exactly as prior 14 nights", "no queue backlog growth", "no error-rate change", "no customer-facing service affected"],
  "blast_radius": "None — internal batch worker, within normal nightly envelope.",
  "decision": "AUTO_MITIGATE",
  "actions": [ { "tool": "run_runbook_step", "args": { "step": "silence_known_false_alert", "duration": "30m", "scope": "batch-worker-cpu-nightly" }, "reversible": true, "requires_approval": false } ],
  "status_update": "Silenced batch-worker CPU alert for 30m: matches the known nightly report-job pattern (02:00-02:20), no backlog or error impact. Recommend tuning this alert's schedule.",
  "escalation": { "needed": false, "page": false, "reason": "", "handoff": "" }
}

Note: Reducing noise is a core job, but it's done carefully: the agent cites four pieces of evidence that the spike is benign, silences for a bounded 30 minutes (reversible), and recommends a permanent fix. It suppresses because it can prove the alert is non-actionable — not to keep the dashboard green.

Ambiguous data-layer outage → escalate and page

Multiple services degrade with conflicting signals and possible data impact. The agent refuses to take a risky guess and escalates immediately.

Input

input
Alerts: api latency p99 12s, db-primary connection errors, and a replica lag alarm — all within 2 minutes. No recent deploy. Some writes may be failing.

Output

output
{
  "severity": "SEV1",
  "confidence": 0.45,
  "hypothesis": "Possible database-layer incident (primary connectivity + replica lag) affecting multiple services; root cause unclear from current signals and no deploy correlation.",
  "evidence": ["metric: api p99 12s across 3 services", "logs: db-primary connection refused x hundreds", "metric: replica lag climbing", "deploys: none in the last 3h"],
  "blast_radius": "Broad: multiple customer-facing services degraded; potential write failures and data risk.",
  "decision": "ESCALATE",
  "actions": [ { "tool": "page_oncall", "args": { "severity": "sev1", "target": "primary+db_oncall" }, "reversible": true, "requires_approval": false } ],
  "status_update": "Declaring SEV1: multiple services degraded with database-primary connection errors and rising replica lag (since 03:11). Possible data-layer incident; paging DB on-call now. No automated remediation is being attempted given data risk. Next update in 10 min.",
  "escalation": { "needed": true, "page": true, "reason": "SEV1, broad blast radius, possible data-layer failure with write/data risk, low confidence in cause.", "handoff": "DB primary connectivity failing + replica lag, no deploy correlation, multiple services hit. Suggest: check DB primary health/failover status, connection pool saturation, and whether a failover is safe. Do NOT mass-restart app tier (would worsen connection storms)." }
}

Note: The defining example: low confidence, broad blast radius, and data risk make this an immediate SEV1 escalate-and-page. The agent takes no production action, posts an honest holding update, and — crucially — warns the human against a tempting-but-harmful action (mass restart). This is the behavior that makes autonomy safe in an incident.

Implementation notes

  • Define the autonomous action allow-list narrowly: only reversible, low-blast-radius steps. Everything that changes production state should be propose-and-approve, enforced outside the model.
  • Never let the agent silence an alert it can't prove is non-actionable. Suppression needs cited evidence and a bounded duration, or it becomes a way to hide real incidents.
  • Always require a blast-radius estimate before any action; broad scope or critical/customer-facing services automatically disqualify autonomous action.
  • Start in copilot mode (proposes only). Review proposals for a few weeks, then enable responder mode for the allow-listed actions you trust.
  • Status updates should state impact and the next update time, never an invented ETA or unconfirmed root cause — over-promising during an incident destroys trust.
  • Log the full timeline (signals, hypothesis, actions, outcome) for every incident; it seeds the postmortem and shows which patterns are safe to automate next.
  • Dedup and signal-gathering run on a cheaper model; the strong model handles the hypothesis and severity decisions.

Variations

Basic

Triage co-pilot

Correlates the alert with metrics, logs, and deploys and posts a severity, hypothesis, and suggested actions to the incident channel. Proposes only — humans act. The safe default.

Advanced

Guarded first responder

Auto-executes allow-listed, reversible mitigations for known patterns, stages risky actions (rollback, scale-down) for one-click approval, and drafts status updates, with SEV1 auto-escalation.

Enterprise

Org-wide incident commander

Adds service-aware policies and ownership routing, multi-signal correlation across teams, audited approvals, automatic postmortem timelines, and tuning of auto-mitigation patterns from incident outcomes.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Contents: README.md, system-prompt.md, setup-guide.md, tools.json, workflow.md, examples.md, .env.example, kit.json, run.py, LICENSE, NOTICE, starters/

Export

Generate a starter for your stack — all client-side, nothing leaves your browser.

Starters use mock tools — swap in your integrations to deploy.

View the source on GitHub

This flagship blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

Frequently asked questions