Frequently asked questions

Could it silence an alert that matters?

No — that's the hard guarantee. Any alert that has ever correlated with a real incident is permanently ineligible for suppression, and critical-service alerts are recommendation-only. It can only auto-suppress proven, never-incident-linked, non-critical noise.

How does it decide what's noise?

By actionability, not volume: fire count, ack rate, time-to-ack, and incident correlation over a window. A high-volume alert that's consistently acted on is treated as signal, not noise.

Is suppression permanent?

No. Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged — and it's paired with a tuning recommendation/PR so the underlying noise actually gets fixed.

What does it do with critical-service alerts?

It never auto-suppresses or auto-tunes them. It surfaces recommendations and escalates the decision to on-call, because missing a real issue on a critical service is far costlier than the noise.

Will deduping hide separate problems?

It groups related/duplicate alerts while preserving visibility into the underlying ones, so a single root cause stops paging ten times without collapsing genuinely distinct signals.

How do we trust it before going live?

Backtest it on your alert history; the key check is that it would have suppressed zero incident-linked alerts. Start in advise mode (recommendations + tuning PRs) and enable auto-suppression for non-critical proven noise only once that holds.

Alert Noise Reduction Agent

Overview

Score → correlate → tune → suppress (carefully): turns a noisy alert stream into a quieter, still-trustworthy one.

Actionability-based: it ranks alerts by how often they're actually acted on or tied to incidents, not by volume alone.

Recommends concrete tuning (thresholds, grouping, dedup) and only time-box-suppresses alerts proven to be noise.

Defensive: never suppresses an alert ever linked to a real incident, never auto-touches critical/SEV1 alerts, and keeps every action reversible and audited.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level	A2 (Recommend)
DNA Pattern	Evaluation (Research → Evaluate)
Worst-Case Action	Recommends suppressing or grouping an alert that actually mattered, surfaced for review. It cannot silence, suppress, or close alerts on its own: execution tools are absent and critical alerts are never auto-suppressed.
Authority Boundary	Clusters and deduplicates alerts and recommends suppression or grouping rules for human approval. It never silences, suppresses, or closes alerts autonomously, and it never proposes suppressing a critical-severity alert. An engineer approves any rule.
Verification Test	Attempt to auto-suppress an alert → confirm suppression requires human approval and critical alerts are excluded; confirm no silence tool runs autonomously.
Production Readiness	6/6 dimensions passing. Tool isolation: suppression requires approval. Human gates: an engineer approves rules. Confidence escalation: uncertain groupings flagged. Cost ceiling: bounded per batch. Audit trail: groupings and recommendations logged. Escalation path: critical alerts always surfaced.
Last Reviewed	2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json

{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "alert-noise-reducer-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Recommends suppressing a meaningful alert for review. Cannot auto-suppress; criticals never suppressed.",
  "authority_boundary": "Clusters alerts and recommends suppression rules for approval; autonomous suppression absent.",
  "tags": [
    "devops-sre",
    "alerting",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_alerts",
      "cluster",
      "dedupe",
      "recommend_rule"
    ],
    "execution_tools_absent": true,
    "never_suppress_critical": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "silence_alert",
      "auto_suppress",
      "close_alert"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.2,
    "alert_threshold_usd": 0.14
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "uncertain_grouping",
      "critical_severity"
    ],
    "destination": "oncall_engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "groupings",
      "recommendations"
    ]
  }
}
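
As a rough illustration of how a reviewer might sanity-check the contract before sign-off, the sketch below loads a trimmed copy of the JSON above and tests a few internal consistency properties. The function name `check_contract` and the specific checks are assumptions made for this example; they are not part of the AgentAz schema or its tooling.

```python
import json

# Trimmed copy of the contract above: only the fields these checks read.
CONTRACT = json.loads("""
{
  "tool_boundary": {
    "allowed_tools": ["read_alerts", "cluster", "dedupe", "recommend_rule"],
    "execution_tools_absent": true
  },
  "output_boundary": {
    "never_emits": ["silence_alert", "auto_suppress", "close_alert"]
  },
  "cost_boundary": {"max_usd_per_trace_loop": 0.2, "alert_threshold_usd": 0.14},
  "loop_boundary": {"max_reasoning_turns": 8}
}
""")


def check_contract(contract):
    """Hypothetical helper: return a list of problems; empty means all checks passed."""
    problems = []

    # No tool may be both allowed and listed as a forbidden output.
    allowed = set(contract["tool_boundary"]["allowed_tools"])
    forbidden = set(contract["output_boundary"]["never_emits"])
    overlap = allowed & forbidden
    if overlap:
        problems.append("tool is both allowed and forbidden: %s" % sorted(overlap))

    # The cost alert should fire before the hard per-loop ceiling is reached.
    cost = contract["cost_boundary"]
    if not cost["alert_threshold_usd"] < cost["max_usd_per_trace_loop"]:
        problems.append("cost alert threshold must be below the per-loop ceiling")

    # A loop bound below one turn would make the agent unable to run at all.
    if contract["loop_boundary"]["max_reasoning_turns"] < 1:
        problems.append("max_reasoning_turns must be at least 1")

    return problems


if __name__ == "__main__":
    print(check_contract(CONTRACT))
```

A check like this only confirms that the document agrees with itself; as the specification notes, nothing here enforces the boundaries at runtime.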

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal	Bounded by the authority spec above
Trust Level	A2 — Recommend
Tool access	Least privilege — execution tools absent (read-only)
Context handling	Grounded in provided inputs; cites or flags rather than guessing
Memory strategy	Task-scoped; no persistent cross-session memory
Human approval	Required on uncertain grouping, critical severity → oncall engineer
Audit trail	Append-only log (groupings, recommendations)
Cost & loop bounds	≤ $0.2 per loop · ≤ 8 reasoning turns
Recovery / escalation	Escalates to oncall engineer

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent	Primary reasoner — Recommend authority (A2)
Tools	read alerts, cluster, dedupe, recommend rule — execution tools absent (read-only)
Memory	Task-scoped working context; no persistent cross-session memory
Guardrails	Worst-case classified (A2); no execution tools; ≤ $0.2/loop · ≤ 8 turns
Evaluator	Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff	Escalates to oncall engineer on uncertain grouping, critical severity

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Recommends suppressing an alert that actually mattered.

Detection: Critical-severity alerts are excluded from suppression, and grouping confidence is scored.
Mitigation: Suppression is a recommendation requiring approval; criticals are never suppressed.
Recovery: An engineer rejects the rule and the alert remains.

Over-groups distinct alerts, masking a second incident.

Detection: A grouping-similarity threshold is applied, and divergent signals are flagged.
Mitigation: Uncertain groupings are flagged, not merged silently.
Recovery: The engineer splits the group.

A suppression rule persists after the underlying issue changes.

Detection: Rules are time-bounded and reviewed.
Mitigation: Rules expire and require re-approval.
Recovery: Stale rules lapse and the alert resurfaces.
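
One way to picture the last failure mode is a rule record that carries its own expiry, so a stale rule stops applying without anyone having to remember to remove it. This is a minimal sketch under assumed names (`SuppressionRule`, `split_rules`); the blueprint does not specify how rules are stored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SuppressionRule:
    alert: str
    approved_at: datetime
    duration: timedelta  # the time-box; there is deliberately no "forever" option

    @property
    def expires_at(self):
        return self.approved_at + self.duration

    def is_active(self, now):
        # Active only inside [approved_at, expires_at); afterwards the alert resurfaces.
        return self.approved_at <= now < self.expires_at


def split_rules(rules, now):
    """Partition rules into (active, lapsed) as of `now`."""
    active = [r for r in rules if r.is_active(now)]
    lapsed = [r for r in rules if now >= r.expires_at]
    return active, lapsed


if __name__ == "__main__":
    approved = datetime(2026, 1, 1, tzinfo=timezone.utc)
    rule = SuppressionRule("batch-worker-cpu-high", approved, timedelta(hours=24))
    print(split_rules([rule], approved + timedelta(hours=25)))
```

Lapsed rules would go back through approval rather than being silently renewed, which matches the mitigation stated above.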

Evaluation

Suppression precision, together with critical-alert safety, is the primary measure: suppressing an alert that mattered is the failure to avoid.

Suppression precision	Of alerts recommended for suppression, the share that were genuinely noise.
Critical-miss rate	Frequency of critical alerts caught in a suppression recommendation — must be zero.
Grouping accuracy	Share of alert groupings that are correct, with no masked second incident.
Rule-decay handling	Share of stale suppression rules correctly expired.
Latency	Time to a grouping or recommendation.

Recommended approach. Use a labeled alert stream with known noise versus actionable alerts; measure suppression precision and treat any suppressed critical as a hard failure. Verify that groupings don't merge distinct incidents and that rules expire.
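
The two headline metrics can be computed from a labeled stream in a few lines. The sketch below assumes a hypothetical `LabeledAlert` record with a ground-truth noise label and the agent's recommendation; the field names are illustrative, not a format the blueprint defines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabeledAlert:
    name: str
    is_noise: bool               # ground-truth label from the labeled stream
    is_critical: bool            # critical-severity or critical-service alert
    recommended_suppress: bool   # what the agent recommended


def suppression_precision(alerts: List[LabeledAlert]) -> Optional[float]:
    """Share of suppression recommendations that were genuinely noise."""
    recommended = [a for a in alerts if a.recommended_suppress]
    if not recommended:
        return None  # undefined when nothing was recommended
    return sum(a.is_noise for a in recommended) / len(recommended)


def critical_misses(alerts: List[LabeledAlert]) -> List[str]:
    """Names of critical alerts caught in a suppression recommendation."""
    return [a.name for a in alerts if a.recommended_suppress and a.is_critical]


def evaluate(alerts: List[LabeledAlert]) -> dict:
    misses = critical_misses(alerts)
    return {
        "suppression_precision": suppression_precision(alerts),
        "critical_misses": misses,
        "passed": not misses,  # any suppressed critical is a hard failure
    }
```

Keeping the critical-miss check as a pass/fail flag rather than folding it into an average reflects the rule that the rate must be zero, not merely low.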

When to use

Use it when

On-call is drowning in alerts and real signals are getting lost in the noise.
You have alert history (fire/ack/incident-correlation) the agent can score actionability from.
You want data-backed tuning recommendations and safe, reversible suppression of proven noise.
You want to cut fatigue while keeping a hard guarantee that incident-linked and critical alerts are never silenced.

Avoid it when

You lack alert/incident history, so actionability can't be measured — suppression would be blind.
You expect it to autonomously silence critical-service alerts; those are recommendation-only.
Your 'noisy' alerts are actually under-investigated real signals.
You can't keep suppression reversible, time-boxed, and audited.

System prompt

system-prompt.md

You are an Alert Noise Reduction Agent helping an on-call/SRE team cut alert fatigue. You analyze alerts, recommend tuning, and suppress proven noise — WITHOUT ever silencing a real signal. You are judged on reducing non-actionable noise AND on never suppressing an alert that matters.

== CORE PRINCIPLES ==
1. Actionability, not volume. Judge an alert by evidence of whether it leads to action: ack rate, time-to-ack, and — most importantly — whether it has ever correlated with a real incident. A high-volume alert that's always acted on is signal, not noise.
2. Suppress nothing you can't prove is noise. Only recommend/auto-suppress alerts with a strong, evidence-backed non-actionability record. When in doubt, recommend tuning, not silence.
3. Reversible and time-boxed. Suppression is always temporary, scoped, auditable, and easy to undo. You never permanently delete an alert rule.

== HARD RULES (NON-NEGOTIABLE) ==
- INCIDENT-LINKED = NEVER SUPPRESS: If an alert has EVER correlated with a real incident (even once), you must not suppress it. At most, recommend tuning (threshold/grouping). This rule is absolute.
- CRITICAL SERVICES ESCALATE: For alerts on critical/customer-facing services or SEV1-capable signals, you never auto-suppress — you recommend tuning and escalate the decision to a human.
- EVIDENCE REQUIRED: Auto-suppress only with a clear record (e.g. fired many times over a meaningful window with ~0 acks and 0 incident correlations) on a non-critical signal. State the numbers.
- BOUNDED SUPPRESSION: Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged. Never an open-ended silence.
- NO BLIND DEDUP: When grouping/deduping, preserve the ability to see the underlying alerts; never collapse distinct real signals into one that hides a problem.

== METHOD ==
- Pull each alert's history: fire count, ack rate, time-to-ack, and incident correlations over a window.
- Score actionability. Correlate/dedupe related alerts into groups. Identify chronically non-actionable, never-incident-linked, non-critical alerts as noise candidates.
- For noise candidates: recommend tuning and, if enabled and within guardrails, time-box suppress. For everything else: recommend tuning only or leave as-is.

== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- AUTO_SUPPRESS: non-critical, zero incident correlation, strong non-actionable record, confidence >= 0.85. Time-boxed + tracked.
- RECOMMEND_TUNING: noisy but incident-linked at least once, or critical service, or moderate evidence. Propose thresholds/grouping; do not suppress.
- ESCALATE: critical-service/SEV1 alerts, conflicting evidence, or anything you're unsure about.

== COST CONTROL ==
Pull the history you need to score; reuse it across related alerts. Cap tool calls; if exceeded, recommend based on what you have.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "alert": "<alert name/id or group>",
  "actionability": "<score + the numbers: fires, ack rate, incident correlations over window>",
  "incident_linked": <bool>,
  "critical_service": <bool>,
  "decision": "AUTO_SUPPRESS|RECOMMEND_TUNING|ESCALATE",
  "suppression": { "applied": <bool>, "duration": "<time-box, or empty>", "scope": "<specific alert/condition>", "reversible": true },
  "tuning": ["<concrete recommendation: threshold/grouping/dedup>"],
  "rationale": "<evidence-grounded reason>",
  "escalation": { "needed": <bool>, "reason": "<critical/uncertain, or empty>" }
}
If incident_linked is true or critical_service is true, decision must NOT be AUTO_SUPPRESS.
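
The closing constraint of the prompt above can also be checked outside the model, so a malformed or rule-breaking output is rejected before anything acts on it. The following is a minimal sketch, assuming the agent returns the JSON object described in the output format; `validate_output` is a hypothetical helper, not part of the shipped kit.

```python
import json

DECISIONS = {"AUTO_SUPPRESS", "RECOMMEND_TUNING", "ESCALATE"}


def validate_output(raw):
    """Return a list of violations for one agent output; empty means it passed."""
    out = json.loads(raw)
    errors = []

    decision = out.get("decision")
    if decision not in DECISIONS:
        errors.append("unknown decision: %r" % decision)

    # Fail closed: a missing or non-boolean flag is treated as a violation.
    for key in ("incident_linked", "critical_service"):
        if not isinstance(out.get(key), bool):
            errors.append("%s must be a boolean" % key)

    # The hard rule: protected alerts must never be auto-suppressed.
    if decision == "AUTO_SUPPRESS" and (out.get("incident_linked") or out.get("critical_service")):
        errors.append("AUTO_SUPPRESS on an incident-linked or critical-service alert")

    suppression = out.get("suppression") or {}
    if suppression.get("applied"):
        if decision != "AUTO_SUPPRESS":
            errors.append("suppression applied without an AUTO_SUPPRESS decision")
        if not suppression.get("duration"):
            errors.append("applied suppression has no time-box")

    return errors
```

Running the validator in code rather than relying on the prompt alone is in line with the setup guide's point that the protections are enforced by configuration, not by the model.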

Setup guide

Install and connect alerting

Install the agent and connect it (read) to your alerting and incident systems.

shell

pipx install alert-noise-agent
alert-noise-agent connect --alerts prometheus,pagerduty --incidents pagerduty
alert-noise-agent doctor

Configure guardrails

The incident-linked and critical-service protections are enforced here, not by the model.

shell

cp .env.example .env
# then set these values in .env:
ANTHROPIC_API_KEY=sk-ant-...
NEVER_SUPPRESS_IF_INCIDENT_LINKED=true
MAX_SUPPRESSION=24h     # time-box; auto-expires
MODE=advise   # advise (recommend) | act (auto-suppress proven noise)

Mark critical services

Alerts on these are recommendation-only — never auto-suppressed.

yaml

# .alerts.yml
critical_services: ["checkout", "auth", "payments", "db-primary"]
noise_threshold: { window: 30d, min_fires: 50, max_ack_rate: 0.02, incident_correlations: 0 }
suppression: { reversible: true, max_duration: 24h }
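
To show how the `noise_threshold` values might be applied, here is a small sketch that hard-codes the numbers from the config above and tests one alert's statistics against them. The `AlertStats` record and `is_noise_candidate` function are assumptions for illustration; how the real agent reads `.alerts.yml` is not documented here.

```python
from dataclasses import dataclass

# Values copied from the .alerts.yml example above (window: 30d).
CRITICAL_SERVICES = {"checkout", "auth", "payments", "db-primary"}
MIN_FIRES = 50
MAX_ACK_RATE = 0.02


@dataclass
class AlertStats:
    name: str
    service: str
    fires: int                   # fires within the window
    acks: int                    # acknowledgements within the window
    incident_correlations: int   # lifetime count; any value above 0 disqualifies


def is_noise_candidate(stats):
    """True only for non-critical, never-incident-linked, rarely acked alerts."""
    if stats.service in CRITICAL_SERVICES:
        return False
    if stats.incident_correlations > 0:
        return False
    if stats.fires < MIN_FIRES:
        return False  # too little evidence to call it noise
    return stats.acks / stats.fires <= MAX_ACK_RATE


if __name__ == "__main__":
    # 2 acks out of 312 fires is roughly a 0.6% ack rate.
    print(is_noise_candidate(AlertStats("batch-worker-cpu-high", "batch-worker", 312, 2, 0)))
```

The protective checks run before the ratio is computed, so a critical or incident-linked alert is rejected regardless of how noisy its numbers look.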

Backtest on alert history

Replay history to confirm it would never have suppressed an incident-linked alert.

shell

alert-noise-agent backtest --range 90d --explain
# reports noise found + a hard check: suppressed-incident-linked count (must be 0)

Wire in (advise first)

Run scheduled analysis and post recommendations; enable auto-suppression for proven non-critical noise once backtests are clean.

shell

# scheduled job -> recommendations to #sre; tuning PRs to the monitoring repo
# promote MODE=act after a clean backtest

Architecture

Alert-stream intake	Ingests the alert stream/inventory with metadata (service, severity, owner) for analysis over a configurable window.
History & incident correlation	Pulls each alert's fire/ack history and, critically, whether it has ever correlated with a real incident, which is the hard gate on suppression.
Actionability scoring	Scores each alert by how often it's actually acted on (ack rate, time-to-ack) versus pure firing volume, surfacing chronically non-actionable ones.
Correlation & dedup	Groups related/duplicate alerts so a single root cause doesn't page ten times, while preserving visibility into the underlying alerts.
Tuning & suppression gate	A deterministic gate forbids suppressing incident-linked or critical-service alerts; only proven, non-critical, never-incident-linked noise can be time-box suppressed.
Recommendation & PR	Produces concrete tuning (thresholds, grouping rules), often as a config PR, and applies bounded, reversible suppression where allowed.
Audit & review	Logs every suppression with its evidence, expiry, and scope, and surfaces a review so humans can see exactly what was quieted and why.
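
The correlation and dedup component is easiest to see in code: alerts are grouped under one key, but each group keeps its members so nothing is hidden. This sketch assumes a hypothetical `fingerprint` field standing in for whatever root-cause signal the real correlator uses.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Alert:
    id: str
    service: str
    fingerprint: str  # hypothetical root-cause key, e.g. derived from alert labels


@dataclass
class AlertGroup:
    key: Tuple[str, str]
    members: List[Alert] = field(default_factory=list)  # underlying alerts stay visible


def group_alerts(alerts):
    """Group alerts sharing (service, fingerprint); distinct keys are never merged."""
    groups = {}
    for alert in alerts:
        key = (alert.service, alert.fingerprint)
        if key not in groups:
            groups[key] = AlertGroup(key)
        groups[key].members.append(alert)
    return list(groups.values())


if __name__ == "__main__":
    stream = [
        Alert("a1", "batch-worker", "cpu-saturation"),
        Alert("a2", "batch-worker", "cpu-saturation"),
        Alert("a3", "api", "latency-p99"),
    ]
    for group in group_alerts(stream):
        print(group.key, [a.id for a in group.members])
```

Exact-key matching is the most conservative choice; a similarity-based grouper would need the confidence scoring and flagging described under failure modes.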

Tools required

get_alert_stream	Retrieve the alert inventory/stream with service, severity, and owner metadata over the analysis window.
alert_history	Return an alert's fire count, ack rate, time-to-ack, and, most importantly, its correlation with past real incidents.
actionability_score	Score how actionable an alert is from its history, separating signal from chronic noise.
correlate_dedupe	Group related/duplicate alerts into a single logical signal while preserving the underlying ones.
recommend_tuning	Generate concrete tuning (threshold changes, grouping, dedup windows) for noisy alerts.
suppress_alert	Apply a time-boxed, reversible, scoped suppression. Gated: rejects incident-linked or critical-service alerts.
create_tuning_pr	Open a config/monitoring-as-code PR with the recommended tuning for human review.
escalate_to_oncall	Escalate critical-service alerts and uncertain cases to on-call with the evidence and recommendation.
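
A gated `suppress_alert` could look roughly like the sketch below: the tool itself refuses protected alerts and open-ended durations, and records every accepted call. The signature, the `SuppressionRejected` exception, and the in-memory `AUDIT_LOG` are all assumptions for illustration; the kit's actual tool interface is defined in its tools.json.

```python
from datetime import datetime, timedelta, timezone

MAX_SUPPRESSION = timedelta(hours=24)  # mirrors MAX_SUPPRESSION=24h from the setup guide
AUDIT_LOG = []  # stand-in for an append-only audit store


class SuppressionRejected(Exception):
    """Raised when the gate refuses a suppression request."""


def suppress_alert(alert, *, incident_linked, critical_service, duration, evidence, now=None):
    # The gate runs in code, before any side effect, and does not trust the caller.
    if incident_linked:
        raise SuppressionRejected("%s: incident-linked alerts are never suppressed" % alert)
    if critical_service:
        raise SuppressionRejected("%s: critical-service alerts are recommendation-only" % alert)
    if not timedelta(0) < duration <= MAX_SUPPRESSION:
        raise SuppressionRejected("%s: duration must be positive and at most 24h" % alert)

    now = now or datetime.now(timezone.utc)
    entry = {
        "alert": alert,               # scoped to this one alert
        "applied_at": now,
        "expires_at": now + duration,  # auto-expiry
        "evidence": evidence,
        "reversible": True,
    }
    AUDIT_LOG.append(entry)
    return entry
```

Putting the rejection inside the tool means a mistaken AUTO_SUPPRESS decision upstream still cannot silence a protected alert.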

Workflow

1. Ingest the stream
Pull the alert inventory and metadata over the analysis window.
2. Score actionability
For each alert, compute ack rate, time-to-ack, and fire volume, and pull incident correlations.
3. Correlate & dedupe
Group related/duplicate alerts so one root cause isn't ten pages, keeping the underlying alerts visible.
4. Identify noise candidates
Flag chronically non-actionable, never-incident-linked, non-critical alerts — and nothing else.
5. Apply the gate
Forbid suppressing any incident-linked or critical-service alert; those get tuning recommendations or escalation instead.
6. Tune or suppress
Open tuning PRs and apply bounded, reversible, time-boxed suppression only where the evidence and guardrails allow.
7. Audit & review
Log every action with evidence and expiry, and surface a review so on-call sees exactly what was quieted.
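
Steps 4 to 6 reduce to a small decision function. The sketch below is one possible reading of the decision policy in the system prompt: the order of the checks (incident link first, then critical service) is an assumption chosen to match the three worked examples, and the "unsure, so escalate" branch is omitted because the source gives no lower confidence threshold for it.

```python
def decide(*, incident_linked, critical_service, strong_noise_record, confidence, mode="advise"):
    """Hypothetical decision function returning one of the three policy outcomes."""
    if incident_linked:
        # Absolute rule: tuning at most, never suppression.
        return "RECOMMEND_TUNING"
    if critical_service:
        # Critical services are recommendation-only; a human decides.
        return "ESCALATE"
    if strong_noise_record and confidence >= 0.85:
        # Auto-suppression is available only once MODE=act has been enabled.
        return "AUTO_SUPPRESS" if mode == "act" else "RECOMMEND_TUNING"
    # Moderate or weak evidence: propose tuning and leave the alert firing.
    return "RECOMMEND_TUNING"


if __name__ == "__main__":
    print(decide(incident_linked=False, critical_service=False,
                 strong_noise_record=True, confidence=0.9, mode="act"))
```

In advise mode the same inputs that would qualify for suppression fall back to a tuning recommendation, which is how the setup guide suggests starting.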

Examples

Chronic non-actionable alert → time-boxed auto-suppress + tuning PR

An alert fires constantly, is never acked, and has never been tied to an incident, on a non-critical service. Safe, bounded noise to quiet.

Input

input

Alert: 'batch-worker-cpu-high' fired 312 times in 30d, ack rate 0.6%, 0 incident correlations, service=batch-worker (non-critical).

Output

output

{
  "alert": "batch-worker-cpu-high",
  "actionability": "Very low: 312 fires/30d, 0.6% ack rate, 0 incident correlations",
  "incident_linked": false,
  "critical_service": false,
  "decision": "AUTO_SUPPRESS",
  "suppression": { "applied": true, "duration": "24h", "scope": "batch-worker-cpu-high only", "reversible": true },
  "tuning": ["Raise CPU threshold from 80% to 92% for batch-worker, or add a 15m sustained-duration condition", "Group with batch-worker-memory into one 'batch-worker saturation' alert"],
  "rationale": "Fires constantly, almost never acked, and has never correlated with an incident on a non-critical service — textbook noise. Suppressing for 24h while a tuning PR adjusts the threshold; suppression auto-expires and is reversible.",
  "escalation": { "needed": false, "reason": "" }
}

Note: Clear, quantified non-actionability (312 fires, 0.6% acks, 0 incidents) on a non-critical service. It applies a bounded 24h suppression (auto-expiring, reversible) and — crucially — also opens a tuning PR so the real fix lands, rather than just muting forever.

Noisy but incident-linked once → recommend tuning, do NOT suppress

An alert is genuinely noisy, but it correlated with a real outage once. The absolute rule blocks suppression; it recommends tuning instead.

Input

input

Alert: 'api-latency-p99-high' fired 140 times in 30d, ack rate 8%, but correlated with 1 real SEV2 incident last month. Service=api (customer-facing).

Output

output

{
  "alert": "api-latency-p99-high",
  "actionability": "Mixed: 140 fires/30d, 8% ack rate, but 1 real SEV2 incident correlation",
  "incident_linked": true,
  "critical_service": true,
  "decision": "RECOMMEND_TUNING",
  "suppression": { "applied": false, "duration": "", "scope": "api-latency-p99-high", "reversible": true },
  "tuning": ["Add a sustained-duration condition (p99 high for >5m) to cut transient spikes", "Raise the page threshold but keep a ticket-level alert", "Group flapping instances to reduce repeat pages"],
  "rationale": "It is noisy, but it correlated with a real SEV2 once and sits on a customer-facing service — so it must not be suppressed. Instead, recommend tuning to cut transient noise while preserving the signal that caught the incident.",
  "escalation": { "needed": false, "reason": "" }
}

Note: The defining rule in action: the alert is noisy enough that muting it would be tempting — but it caught a real SEV2 once, so the incident-linked rule forbids suppression entirely. The agent instead recommends duration/threshold tuning that reduces noise while keeping the signal that matters. Noise reduction never trumps catching incidents.

Critical-service alert, noisy → escalate, never auto-act

A flapping alert on the payments service. Even with no incident link yet, the critical-service rule means a human decides.

Input

input

Alert: 'payments-error-rate' flapping, 90 fires in 30d, ack rate 30%, 0 incident correlations so far. Service=payments (critical).

Output

output

{
  "alert": "payments-error-rate",
  "actionability": "Moderate: 90 fires/30d, 30% ack rate, 0 incident correlations so far",
  "incident_linked": false,
  "critical_service": true,
  "decision": "ESCALATE",
  "suppression": { "applied": false, "duration": "", "scope": "payments-error-rate", "reversible": true },
  "tuning": ["Consider grouping by error type", "Possibly raise threshold — but only with payments-team sign-off given the criticality"],
  "rationale": "Even though it hasn't correlated with an incident yet and is somewhat acked, this is the payments service. I won't auto-suppress or auto-tune a critical-service alert; the cost of missing a payments issue is too high. Escalating to on-call with tuning options.",
  "escalation": { "needed": true, "reason": "Critical-service (payments) alert — tuning/suppression decisions require human sign-off regardless of current noise level." }
}

Note: Critical-service guardrail: payments alerts are recommendation-only, so even a plausibly-noisy one is escalated rather than touched. The agent offers tuning options but explicitly defers the decision to the payments team, because the downside of muting a real payments alert dwarfs the annoyance of noise.

Implementation notes

Make 'never suppress an incident-linked alert' an absolute, deterministic gate — a single past incident correlation permanently disqualifies an alert from suppression, no matter how noisy.
Score by actionability (acks, time-to-ack, incident correlation), not raw volume; a frequent-but-always-acted-on alert is signal.
Keep critical-service and SEV1-capable alerts recommendation-only and escalate decisions to humans.
Make every suppression time-boxed, scoped, reversible, and audited — never an open-ended silence — and pair it with a tuning PR so the root cause gets fixed.
Preserve visibility when deduping; collapsing distinct signals into one can hide a real problem.
Backtest with 'suppressed-incident-linked alerts' as a hard zero metric before enabling any auto-suppression.
The strong model earns its cost on the suppress-vs-tune judgment, while a cheaper model can pull and aggregate history.

Variations

Basic

Noise analyzer

Scores alerts by actionability, correlates duplicates, and recommends tuning with the supporting numbers for an SRE. No suppression.

Advanced

Guarded auto-suppression

Adds time-boxed, reversible suppression of proven non-critical noise and tuning PRs, with the absolute incident-linked and critical-service guardrails enforced.

Enterprise

Org-wide alert hygiene

Adds multi-team alert inventories, monitoring-as-code PR workflows, suppression audit and auto-expiry, on-call load analytics, and tuning from outcomes — incident-linked alerts always protected.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Files included: README.md, system-prompt.md, setup-guide.md, tools.json, workflow.md, examples.md, .env.example, kit.json, run.py, LICENSE, NOTICE, starters/

View the source on GitHub

This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
