AgentKits

Alert Noise Reduction Agent

Production Blueprint

Includes Agent Blueprint + Implementation Guide

An agent that fights on-call alert fatigue without going deaf to real problems. It analyzes your alert stream, scores each alert by actual actionability (how often it is acted on or tied to a real incident), correlates and deduplicates related alerts, recommends concrete tuning, and applies time-boxed suppression to proven noise within strict guardrails. It is built defensively: it will not suppress any alert that has ever correlated with a real incident, never touches critical-service or SEV1 alerts autonomously, makes every suppression reversible and audited, and escalates anything uncertain to on-call.

Tags: alerting, on-call, sre, devops, observability, alert-fatigue, autonomous-agent, monitoring, agentaz, agent-governance, trust-level, production-readiness
Stack: Claude, LangGraph, OpenAI
Difficulty: Advanced
Setup: 45 min
Version: 2.0.0 · 2026-06-21

Overview

Score → correlate → tune → suppress (carefully): turns a noisy alert stream into a quieter, still-trustworthy one.

Actionability-based: it ranks alerts by how often they're actually acted on or tied to incidents, not by volume alone.

Recommends concrete tuning (thresholds, grouping, dedup) and applies time-boxed suppression only to alerts proven to be noise.

Defensive: never suppresses an alert ever linked to a real incident, never auto-touches critical/SEV1 alerts, and keeps every action reversible and audited.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level: A2 (Recommend)
DNA Pattern: Evaluation (Research → Evaluate)
Worst-Case Action: Recommends suppressing or grouping an alert that actually mattered, surfaced for review. It cannot silence, suppress, or close alerts on its own: execution tools are absent and critical alerts are never auto-suppressed.
Authority Boundary: Clusters and deduplicates alerts and recommends suppression or grouping rules for human approval. It never silences, suppresses, or closes alerts autonomously, and it never proposes suppressing a critical-severity alert. An engineer approves any rule.
Verification Test: Attempt to auto-suppress an alert → confirm suppression requires human approval and critical alerts are excluded; confirm no silence tool runs autonomously.
Production Readiness: 6/6 dimensions passing. Tool isolation: suppression requires approval. Human gates: an engineer approves rules. Confidence escalation: uncertain groupings flagged. Cost ceiling: bounded per batch. Audit trail: groupings and recommendations logged. Escalation path: critical alerts always surfaced.
Last Reviewed: 2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "alert-noise-reducer-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Recommends suppressing a meaningful alert for review. Cannot auto-suppress; criticals never suppressed.",
  "authority_boundary": "Clusters alerts and recommends suppression rules for approval; autonomous suppression absent.",
  "tags": [
    "devops-sre",
    "alerting",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_alerts",
      "cluster",
      "dedupe",
      "recommend_rule"
    ],
    "execution_tools_absent": true,
    "never_suppress_critical": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "silence_alert",
      "auto_suppress",
      "close_alert"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.2,
    "alert_threshold_usd": 0.14
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "uncertain_grouping",
      "critical_severity"
    ],
    "destination": "oncall_engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "groupings",
      "recommendations"
    ]
  }
}
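
Since the spec itself does not enforce anything at runtime, a policy layer in front of the agent could read the contract and reject out-of-bounds tool calls. The following is a minimal sketch of that idea, not part of the blueprint: `check_tool_call` is a hypothetical helper, and the contract is inlined as a subset of the agentaz.json shown above so the example runs on its own.

```python
import json

# Subset of the agentaz.json contract above, inlined for a self-contained example.
CONTRACT = json.loads("""
{
  "tool_boundary": {
    "allowed_tools": ["read_alerts", "cluster", "dedupe", "recommend_rule"],
    "execution_tools_absent": true,
    "never_suppress_critical": true
  },
  "output_boundary": {
    "never_emits": ["silence_alert", "auto_suppress", "close_alert"]
  }
}
""")


def check_tool_call(tool_name: str, contract: dict = CONTRACT) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call (hypothetical helper)."""
    # Deny-list first: these actions must never be emitted, whatever else is allowed.
    if tool_name in contract["output_boundary"]["never_emits"]:
        return False, f"{tool_name} is listed in never_emits"
    # Then least privilege: anything not explicitly allowed is rejected.
    if tool_name not in contract["tool_boundary"]["allowed_tools"]:
        return False, f"{tool_name} is not in allowed_tools"
    return True, "ok"


if __name__ == "__main__":
    for name in ("cluster", "silence_alert", "delete_rule"):
        print(name, check_tool_call(name))
```

Checking the deny-list before the allow-list keeps the rejection reason specific when a forbidden action is requested.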

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal: Bounded by the authority spec above
Trust Level: A2 (Recommend)
Tool access: Least privilege; execution tools absent (read-only)
Context handling: Grounded in provided inputs; cites or flags rather than guessing
Memory strategy: Task-scoped; no persistent cross-session memory
Human approval: Required on uncertain grouping or critical severity → on-call engineer
Audit trail: Append-only log (groupings, recommendations)
Cost & loop bounds: ≤ $0.20 per loop · ≤ 8 reasoning turns
Recovery / escalation: Escalates to on-call engineer
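
The cost and loop bounds in the matrix can be tracked with a small counter around the agent loop. This is a sketch under assumptions: `LoopBudget` and its `ok`/`warn`/`stop` return values are hypothetical, and only the three numeric limits come from the contract (`max_usd_per_trace_loop`, `alert_threshold_usd`, `max_reasoning_turns`).

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    """Tracks spend and turns for one trace loop (hypothetical helper)."""

    max_usd: float = 0.20    # cost_boundary.max_usd_per_trace_loop
    alert_usd: float = 0.14  # cost_boundary.alert_threshold_usd
    max_turns: int = 8       # loop_boundary.max_reasoning_turns
    spent_usd: float = 0.0
    turns: int = 0

    def record_turn(self, cost_usd: float) -> str:
        """Record one reasoning turn and return 'ok', 'warn', or 'stop'."""
        self.spent_usd += cost_usd
        self.turns += 1
        # Hard limits: no further turns once either ceiling is reached.
        if self.spent_usd >= self.max_usd or self.turns >= self.max_turns:
            return "stop"
        # Soft limit: the caller may log or notify, but the loop can continue.
        if self.spent_usd >= self.alert_usd:
            return "warn"
        return "ok"


if __name__ == "__main__":
    budget = LoopBudget()
    for cost in (0.05, 0.05, 0.05, 0.10):
        print(budget.record_turn(cost), round(budget.spent_usd, 2))
```

On `stop`, the system prompt's cost-control rule applies: the agent recommends based on the history it already has.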

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent: Primary reasoner with Recommend authority (A2)
Tools: read alerts, cluster, dedupe, recommend rule; execution tools absent (read-only)
Memory: Task-scoped working context; no persistent cross-session memory
Guardrails: Worst-case classified (A2); no execution tools; ≤ $0.20/loop · ≤ 8 turns
Evaluator: Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff: Escalates to on-call engineer on uncertain grouping or critical severity

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Recommends suppressing an alert that actually mattered.

Detection
Critical severity is excluded from suppression and grouping confidence is scored.
Mitigation
Suppression is a recommendation requiring approval; criticals are never suppressed.
Recovery
An engineer rejects the rule and the alert remains.

Over-groups distinct alerts, masking a second incident.

Detection
A grouping similarity threshold runs and divergent signals are flagged.
Mitigation
Uncertain groupings are flagged, not merged silently.
Recovery
The engineer splits the group.

A suppression rule persists after the underlying issue changes.

Detection
Rules are time-bounded and reviewed.
Mitigation
Rules expire and require re-approval.
Recovery
Stale rules lapse and the alert resurfaces.
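
The stale-rule failure mode relies on rules that carry their own expiry. One way to model that is sketched below; `SuppressionRule` and `active_rules` are hypothetical names, and the 24-hour duration mirrors the `MAX_SUPPRESSION` value used in the setup guide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SuppressionRule:
    """An approved, time-boxed suppression rule (hypothetical structure)."""

    alert: str
    approved_at: datetime
    duration: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.approved_at + self.duration

    def is_active(self, now: datetime) -> bool:
        # A rule stops applying the instant it expires; renewal needs re-approval.
        return now < self.expires_at


def active_rules(rules: list[SuppressionRule], now: datetime) -> list[SuppressionRule]:
    """Keep only rules that have not yet expired, so stale ones lapse on their own."""
    return [rule for rule in rules if rule.is_active(now)]


if __name__ == "__main__":
    approved = datetime(2026, 6, 1, tzinfo=timezone.utc)
    rule = SuppressionRule("batch-worker-cpu-high", approved, timedelta(hours=24))
    print(rule.is_active(approved + timedelta(hours=23)))  # still inside the time-box
    print(active_rules([rule], approved + timedelta(days=2)))  # expired, so filtered out
```

Because expiry is computed from the approval time, nobody has to remember to remove a rule: once the window passes, the alert resurfaces.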

Evaluation

Suppression precision with critical-alert safety is primary — suppressing an alert that mattered is the failure.

Suppression precision: Of alerts recommended for suppression, the share that were genuinely noise.
Critical-miss rate: Frequency of critical alerts caught in a suppression recommendation; must be zero.
Grouping accuracy: Share of alert groupings that are correct, with no masked second incident.
Rule-decay handling: Share of stale suppression rules correctly expired.
Latency: Time to a grouping or recommendation.

Recommended approach. Use a labeled alert stream with known noise versus actionable alerts; measure suppression precision and treat any suppressed critical as a hard failure. Verify groupings don't merge distinct incidents and rules expire.
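
The two headline metrics can be computed from a labeled stream along these lines. This is a rough sketch: the record fields (`recommended_suppress`, `is_noise`, `is_critical`) are assumed names for the labels, not a format the blueprint defines.

```python
def evaluate(records: list[dict]) -> dict:
    """Compute suppression precision and the critical-miss count.

    Each record is assumed to carry three booleans:
    recommended_suppress (agent output), is_noise and is_critical (ground truth).
    """
    suppressed = [r for r in records if r["recommended_suppress"]]
    true_noise = sum(1 for r in suppressed if r["is_noise"])
    critical_misses = sum(1 for r in suppressed if r["is_critical"])
    # Precision is undefined when nothing was recommended for suppression.
    precision = true_noise / len(suppressed) if suppressed else None
    return {
        "suppression_precision": precision,
        "critical_misses": critical_misses,
        # Any suppressed critical alert is a hard failure, whatever the precision.
        "passed": critical_misses == 0,
    }


if __name__ == "__main__":
    labeled = [
        {"recommended_suppress": True, "is_noise": True, "is_critical": False},
        {"recommended_suppress": True, "is_noise": False, "is_critical": False},
        {"recommended_suppress": False, "is_noise": False, "is_critical": True},
    ]
    print(evaluate(labeled))
```

Reporting `passed` separately from precision keeps a high precision score from masking a single suppressed critical alert.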

When to use

Use it when

  • On-call is drowning in alerts and real signals are getting lost in the noise.
  • You have alert history (fire/ack/incident-correlation) the agent can score actionability from.
  • You want data-backed tuning recommendations and safe, reversible suppression of proven noise.
  • You want to cut fatigue while keeping a hard guarantee that incident-linked and critical alerts are never silenced.

Avoid it when

  • You lack alert/incident history, so actionability can't be measured — suppression would be blind.
  • You expect it to autonomously silence critical-service alerts; those are recommendation-only.
  • Your 'noisy' alerts are actually under-investigated real signals.
  • You can't keep suppression reversible, time-boxed, and audited.

System prompt

system-prompt.md
You are an Alert Noise Reduction Agent helping an on-call/SRE team cut alert fatigue. You analyze alerts, recommend tuning, and suppress proven noise — WITHOUT ever silencing a real signal. You are judged on reducing non-actionable noise AND on never suppressing an alert that matters.

== CORE PRINCIPLES ==
1. Actionability, not volume. Judge an alert by evidence of whether it leads to action: ack rate, time-to-ack, and — most importantly — whether it has ever correlated with a real incident. A high-volume alert that's always acted on is signal, not noise.
2. Suppress nothing you can't prove is noise. Only recommend/auto-suppress alerts with a strong, evidence-backed non-actionability record. When in doubt, recommend tuning, not silence.
3. Reversible and time-boxed. Suppression is always temporary, scoped, auditable, and easy to undo. You never permanently delete an alert rule.

== HARD RULES (NON-NEGOTIABLE) ==
- INCIDENT-LINKED = NEVER SUPPRESS: If an alert has EVER correlated with a real incident (even once), you must not suppress it. At most, recommend tuning (threshold/grouping). This rule is absolute.
- CRITICAL SERVICES ESCALATE: For alerts on critical/customer-facing services or SEV1-capable signals, you never auto-suppress — you recommend tuning and escalate the decision to a human.
- EVIDENCE REQUIRED: Auto-suppress only with a clear record (e.g. fired many times over a meaningful window with ~0 acks and 0 incident correlations) on a non-critical signal. State the numbers.
- BOUNDED SUPPRESSION: Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged. Never an open-ended silence.
- NO BLIND DEDUP: When grouping/deduping, preserve the ability to see the underlying alerts; never collapse distinct real signals into one that hides a problem.

== METHOD ==
- Pull each alert's history: fire count, ack rate, time-to-ack, and incident correlations over a window.
- Score actionability. Correlate/dedupe related alerts into groups. Identify chronically non-actionable, never-incident-linked, non-critical alerts as noise candidates.
- For noise candidates: recommend tuning and, if enabled and within guardrails, time-box suppress. For everything else: recommend tuning only or leave as-is.

== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- AUTO_SUPPRESS: non-critical, zero incident correlation, strong non-actionable record, confidence >= 0.85. Time-boxed + tracked.
- RECOMMEND_TUNING: noisy but incident-linked at least once, or critical service, or moderate evidence. Propose thresholds/grouping; do not suppress.
- ESCALATE: critical-service/SEV1 alerts, conflicting evidence, or anything you're unsure about.

== COST CONTROL ==
Pull the history you need to score; reuse it across related alerts. Cap tool calls; if exceeded, recommend based on what you have.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "alert": "<alert name/id or group>",
  "actionability": "<score + the numbers: fires, ack rate, incident correlations over window>",
  "incident_linked": <bool>,
  "critical_service": <bool>,
  "decision": "AUTO_SUPPRESS|RECOMMEND_TUNING|ESCALATE",
  "suppression": { "applied": <bool>, "duration": "<time-box, or empty>", "scope": "<specific alert/condition>", "reversible": true },
  "tuning": ["<concrete recommendation: threshold/grouping/dedup>"],
  "rationale": "<evidence-grounded reason>",
  "escalation": { "needed": <bool>, "reason": "<critical/uncertain, or empty>" }
}
If incident_linked is true or critical_service is true, decision must NOT be AUTO_SUPPRESS.
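
Outside the prompt, the final constraint can also be checked in code on the caller side, so a malformed or rule-breaking response is caught before anything acts on it. The sketch below is illustrative only: `validate_output` is a hypothetical helper, and two checks are assumptions beyond the prompt text (missing flags are treated as protected, and `suppression.applied` must agree with the decision).

```python
import json

DECISIONS = {"AUTO_SUPPRESS", "RECOMMEND_TUNING", "ESCALATE"}


def validate_output(raw: str) -> list[str]:
    """Return a list of violations; an empty list means the object passes."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    decision = obj.get("decision")
    if decision not in DECISIONS:
        problems.append(f"unknown decision: {decision!r}")

    for flag in ("incident_linked", "critical_service"):
        if not isinstance(obj.get(flag), bool):
            problems.append(f"{flag} must be a boolean")

    # Fail closed (assumption): anything other than an explicit false is protected.
    protected = (
        obj.get("incident_linked") is not False
        or obj.get("critical_service") is not False
    )
    if decision == "AUTO_SUPPRESS" and protected:
        problems.append(
            "AUTO_SUPPRESS is not allowed for incident-linked or critical-service alerts"
        )

    # Consistency check (assumption): suppression may only be applied on AUTO_SUPPRESS.
    suppression = obj.get("suppression")
    applied = isinstance(suppression, dict) and suppression.get("applied") is True
    if applied and decision != "AUTO_SUPPRESS":
        problems.append("suppression.applied is true but decision is not AUTO_SUPPRESS")
    return problems


if __name__ == "__main__":
    response = json.dumps({
        "decision": "AUTO_SUPPRESS",
        "incident_linked": True,
        "critical_service": False,
        "suppression": {"applied": True},
    })
    print(validate_output(response))
```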


Setup guide

Install and connect alerting

Install the agent and connect it (read) to your alerting and incident systems.

shell
pipx install alert-noise-agent
alert-noise-agent connect --alerts prometheus,pagerduty --incidents pagerduty
alert-noise-agent doctor

Configure guardrails

The incident-linked and critical-service protections are enforced here, not by the model.

shell
cp .env.example .env
# then set these values in .env:
ANTHROPIC_API_KEY=sk-ant-...
NEVER_SUPPRESS_IF_INCIDENT_LINKED=true
MAX_SUPPRESSION=24h     # time-box; auto-expires
MODE=advise             # advise (recommend) | act (auto-suppress proven noise)

Mark critical services

Alerts on these are recommendation-only — never auto-suppressed.

yaml
# .alerts.yml
critical_services: ["checkout", "auth", "payments", "db-primary"]
noise_threshold: { window: 30d, min_fires: 50, max_ack_rate: 0.02, incident_correlations: 0 }
suppression: { reversible: true, max_duration: 24h }
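
To make the `noise_threshold` settings concrete, here is a sketch of how they might be applied to one alert's history. The `AlertHistory` type and `is_noise_candidate` function are hypothetical, the values are copied from the config above, and the history is assumed to already cover the 30-day window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertHistory:
    """Per-alert counts over the analysis window (hypothetical structure)."""

    name: str
    service: str
    fires: int
    acks: int
    incident_correlations: int


# Values copied from the .alerts.yml example above.
CRITICAL_SERVICES = {"checkout", "auth", "payments", "db-primary"}
NOISE_THRESHOLD = {"min_fires": 50, "max_ack_rate": 0.02, "incident_correlations": 0}


def is_noise_candidate(
    history: AlertHistory,
    threshold: dict = NOISE_THRESHOLD,
    critical_services: set = CRITICAL_SERVICES,
) -> bool:
    """True only for non-critical, never-incident-linked, rarely acked alerts."""
    if history.service in critical_services:
        return False  # critical services are recommendation-only
    if history.incident_correlations > threshold["incident_correlations"]:
        return False  # a single incident link disqualifies the alert
    if history.fires == 0 or history.fires < threshold["min_fires"]:
        return False  # too little evidence to call it noise
    return history.acks / history.fires <= threshold["max_ack_rate"]


if __name__ == "__main__":
    # 2 acks out of 312 fires is roughly the 0.6% ack rate used in the examples.
    print(is_noise_candidate(AlertHistory("batch-worker-cpu-high", "batch-worker", 312, 2, 0)))
    print(is_noise_candidate(AlertHistory("payments-error-rate", "payments", 90, 27, 0)))
```

The protective checks run before the ack-rate arithmetic, so a critical or incident-linked alert is rejected however quiet its ack history looks.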

Backtest on alert history

Replay history to confirm it would never have suppressed an incident-linked alert.

shell
alert-noise-agent backtest --range 90d --explain
# reports noise found + a hard check: suppressed-incident-linked count (must be 0)

Wire in (advise first)

Run scheduled analysis and post recommendations; enable auto-suppression for proven non-critical noise once backtests are clean.

shell
# scheduled job -> recommendations to #sre; tuning PRs to the monitoring repo
# promote MODE=act after a clean backtest

Architecture

Tools required

get_alert_stream: Retrieve the alert inventory/stream with service, severity, and owner metadata over the analysis window.
alert_history: Return an alert's fire count, ack rate, time-to-ack, and, most importantly, its correlation with past real incidents.
actionability_score: Score how actionable an alert is from its history, separating signal from chronic noise.
correlate_dedupe: Group related/duplicate alerts into a single logical signal while preserving the underlying ones.
recommend_tuning: Generate concrete tuning (threshold changes, grouping, dedup windows) for noisy alerts.
suppress_alert: Apply a time-boxed, reversible, scoped suppression. Gated: rejects incident-linked or critical-service alerts.
create_tuning_pr: Open a config/monitoring-as-code PR with the recommended tuning for human review.
escalate_to_oncall: Escalate critical-service alerts and uncertain cases to on-call with the evidence and recommendation.

Workflow

  1. Ingest the stream

    Pull the alert inventory and metadata over the analysis window.

  2. Score actionability

    For each alert, compute ack rate, time-to-ack, and fire volume, and pull incident correlations.

  3. Correlate & dedupe

    Group related/duplicate alerts so one root cause isn't ten pages, keeping the underlying alerts visible.

  4. Identify noise candidates

    Flag chronically non-actionable, never-incident-linked, non-critical alerts, and nothing else.

  5. Apply the gate

    Forbid suppressing any incident-linked or critical-service alert; those get tuning recommendations or escalation instead.

  6. Tune or suppress

    Open tuning PRs and apply bounded, reversible, time-boxed suppression only where the evidence and guardrails allow.

  7. Audit & review

    Log every action with evidence and expiry, and surface a review so on-call sees exactly what was quieted.
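
Steps 4 to 6 reduce to a small deterministic decision function, sketched below. Treat it as an illustration: the 0.85 confidence bar and the three decision names come from the system prompt, while the ordering of the checks is inferred from the worked examples and the `unsure_below` cutoff of 0.5 is purely an assumption.

```python
def decide(
    *,
    incident_linked: bool,
    critical_service: bool,
    noise_record: bool,
    confidence: float,
    mode: str = "advise",
    unsure_below: float = 0.5,  # assumed cutoff for "unsure"; not from the blueprint
) -> str:
    """Map an alert's evidence to AUTO_SUPPRESS, RECOMMEND_TUNING, or ESCALATE."""
    # Absolute rule: any past incident link rules out suppression; tune instead.
    if incident_linked:
        return "RECOMMEND_TUNING"
    # Critical services are never auto-acted on; a human decides.
    if critical_service:
        return "ESCALATE"
    # Low confidence counts as uncertainty and goes to on-call.
    if confidence < unsure_below:
        return "ESCALATE"
    # Suppression needs a strong noise record, high confidence, and MODE=act.
    if noise_record and confidence >= 0.85 and mode == "act":
        return "AUTO_SUPPRESS"
    return "RECOMMEND_TUNING"


if __name__ == "__main__":
    print(decide(incident_linked=False, critical_service=False,
                 noise_record=True, confidence=0.9, mode="act"))
    print(decide(incident_linked=True, critical_service=True,
                 noise_record=False, confidence=0.9, mode="act"))
    print(decide(incident_linked=False, critical_service=True,
                 noise_record=False, confidence=0.9, mode="act"))
```

In `advise` mode the same evidence that would trigger suppression yields a tuning recommendation, which matches the "advise first" rollout in the setup guide.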

Examples

Chronic non-actionable alert → time-boxed auto-suppress + tuning PR

An alert fires constantly, is never acked, and has never been tied to an incident, on a non-critical service. Safe, bounded noise to quiet.

Input

input
Alert: 'batch-worker-cpu-high' fired 312 times in 30d, ack rate 0.6%, 0 incident correlations, service=batch-worker (non-critical).

Output

output
{
  "alert": "batch-worker-cpu-high",
  "actionability": "Very low: 312 fires/30d, 0.6% ack rate, 0 incident correlations",
  "incident_linked": false,
  "critical_service": false,
  "decision": "AUTO_SUPPRESS",
  "suppression": { "applied": true, "duration": "24h", "scope": "batch-worker-cpu-high only", "reversible": true },
  "tuning": ["Raise CPU threshold from 80% to 92% for batch-worker, or add a 15m sustained-duration condition", "Group with batch-worker-memory into one 'batch-worker saturation' alert"],
  "rationale": "Fires constantly, almost never acked, and has never correlated with an incident on a non-critical service — textbook noise. Suppressing for 24h while a tuning PR adjusts the threshold; suppression auto-expires and is reversible.",
  "escalation": { "needed": false, "reason": "" }
}

Note: Clear, quantified non-actionability (312 fires, 0.6% acks, 0 incidents) on a non-critical service. It applies a bounded 24h suppression (auto-expiring, reversible) and — crucially — also opens a tuning PR so the real fix lands, rather than just muting forever.

Noisy but incident-linked once → recommend tuning, do NOT suppress

An alert is genuinely noisy, but it correlated with a real outage once. The absolute rule blocks suppression; it recommends tuning instead.

Input

input
Alert: 'api-latency-p99-high' fired 140 times in 30d, ack rate 8%, but correlated with 1 real SEV2 incident last month. Service=api (customer-facing).

Output

output
{
  "alert": "api-latency-p99-high",
  "actionability": "Mixed: 140 fires/30d, 8% ack rate, but 1 real SEV2 incident correlation",
  "incident_linked": true,
  "critical_service": true,
  "decision": "RECOMMEND_TUNING",
  "suppression": { "applied": false, "duration": "", "scope": "api-latency-p99-high", "reversible": true },
  "tuning": ["Add a sustained-duration condition (p99 high for >5m) to cut transient spikes", "Raise the page threshold but keep a ticket-level alert", "Group flapping instances to reduce repeat pages"],
  "rationale": "It is noisy, but it correlated with a real SEV2 once and sits on a customer-facing service — so it must not be suppressed. Instead, recommend tuning to cut transient noise while preserving the signal that caught the incident.",
  "escalation": { "needed": false, "reason": "" }
}

Note: The defining rule in action: the alert is noisy enough that muting it would be tempting — but it caught a real SEV2 once, so the incident-linked rule forbids suppression entirely. The agent instead recommends duration/threshold tuning that reduces noise while keeping the signal that matters. Noise reduction never trumps catching incidents.

Critical-service alert, noisy → escalate, never auto-act

A flapping alert on the payments service. Even with no incident link yet, the critical-service rule means a human decides.

Input

input
Alert: 'payments-error-rate' flapping, 90 fires in 30d, ack rate 30%, 0 incident correlations so far. Service=payments (critical).

Output

output
{
  "alert": "payments-error-rate",
  "actionability": "Moderate: 90 fires/30d, 30% ack rate, 0 incident correlations so far",
  "incident_linked": false,
  "critical_service": true,
  "decision": "ESCALATE",
  "suppression": { "applied": false, "duration": "", "scope": "payments-error-rate", "reversible": true },
  "tuning": ["Consider grouping by error type", "Possibly raise threshold — but only with payments-team sign-off given the criticality"],
  "rationale": "Even though it hasn't correlated with an incident yet and is somewhat acked, this is the payments service. I won't auto-suppress or auto-tune a critical-service alert; the cost of missing a payments issue is too high. Escalating to on-call with tuning options.",
  "escalation": { "needed": true, "reason": "Critical-service (payments) alert — tuning/suppression decisions require human sign-off regardless of current noise level." }
}

Note: Critical-service guardrail: payments alerts are recommendation-only, so even a plausibly-noisy one is escalated rather than touched. The agent offers tuning options but explicitly defers the decision to the payments team, because the downside of muting a real payments alert dwarfs the annoyance of noise.

Implementation notes

  • Make 'never suppress an incident-linked alert' an absolute, deterministic gate — a single past incident correlation permanently disqualifies an alert from suppression, no matter how noisy.
  • Score by actionability (acks, time-to-ack, incident correlation), not raw volume; a frequent-but-always-acted-on alert is signal.
  • Keep critical-service and SEV1-capable alerts recommendation-only and escalate decisions to humans.
  • Make every suppression time-boxed, scoped, reversible, and audited — never an open-ended silence — and pair it with a tuning PR so the root cause gets fixed.
  • Preserve visibility when deduping; collapsing distinct signals into one can hide a real problem.
  • Backtest with 'suppressed-incident-linked alerts' as a hard zero metric before enabling any auto-suppression.
  • The strong model earns its cost on the suppress-vs-tune judgment, while a cheaper model can pull and aggregate history.
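
The dedup note above can be illustrated with a grouping that keeps every member alert attached to its group. This is a sketch under assumptions: grouping by `(service, symptom)` is a stand-in for whatever similarity measure you use, and the field names are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict]) -> list[dict]:
    """Group alerts by (service, symptom) while keeping every member visible."""
    buckets: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for alert in alerts:
        buckets[(alert["service"], alert["symptom"])].append(alert)
    return [
        {
            "group": f"{service}/{symptom}",
            "count": len(members),
            "members": members,  # underlying alerts are preserved, not collapsed
        }
        for (service, symptom), members in buckets.items()
    ]


if __name__ == "__main__":
    stream = [
        {"id": "a1", "service": "batch-worker", "symptom": "cpu-high"},
        {"id": "a2", "service": "batch-worker", "symptom": "cpu-high"},
        {"id": "a3", "service": "api", "symptom": "latency-p99-high"},
    ]
    for group in group_alerts(stream):
        print(group["group"], group["count"], [m["id"] for m in group["members"]])
```

Because each group carries its members, an engineer can split a group later without going back to the raw stream.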

Variations

Basic

Noise analyzer

Scores alerts by actionability, correlates duplicates, and recommends tuning with the supporting numbers for an SRE. No suppression.

Advanced

Guarded auto-suppression

Adds time-boxed, reversible suppression of proven non-critical noise and tuning PRs, with the absolute incident-linked and critical-service guardrails enforced.

Enterprise

Org-wide alert hygiene

Adds multi-team alert inventories, monitoring-as-code PR workflows, suppression audit and auto-expiry, on-call load analytics, and tuning from outcomes — incident-linked alerts always protected.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Contents: README.md, system-prompt.md, setup-guide.md, tools.json, workflow.md, examples.md, .env.example, kit.json, run.py, LICENSE, NOTICE, starters/

Starters use mock tools — swap in your integrations to deploy.

View the source on GitHub

This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

Frequently asked questions