Overview
Score → correlate → tune → suppress (carefully): turns a noisy alert stream into a quieter, still-trustworthy one.
Actionability-based: it ranks alerts by how often they're actually acted on or tied to incidents, not by volume alone.
Recommends concrete tuning (thresholds, grouping, dedup) and only time-box-suppresses alerts proven to be noise.
Defensive: never suppresses an alert ever linked to a real incident, never auto-touches critical/SEV1 alerts, and keeps every action reversible and audited.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "alert-noise-reducer-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Recommends suppressing a meaningful alert for review. Cannot auto-suppress; criticals never suppressed.",
  "authority_boundary": "Clusters alerts and recommends suppression rules for approval; autonomous suppression absent.",
  "tags": [
    "devops-sre",
    "alerting",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_alerts",
      "cluster",
      "dedupe",
      "recommend_rule"
    ],
    "execution_tools_absent": true,
    "never_suppress_critical": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "silence_alert",
      "auto_suppress",
      "close_alert"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.2,
    "alert_threshold_usd": 0.14
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "uncertain_grouping",
      "critical_severity"
    ],
    "destination": "oncall_engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "groupings",
      "recommendations"
    ]
  }
}
New to this? Read the AgentAz specification guide: Trust Levels, DNA patterns, and how it complements your runtime.
AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.
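As an illustration of how a reviewer might sanity-check the contract beyond schema validation, here is a minimal sketch. The function name `check_contract` and the specific invariants are assumptions inferred from the JSON above; they are not part of the AgentAz™ schema or tooling.

```python
# Hypothetical cross-checks on a parsed agentaz.json. The invariants below are
# assumptions inferred from the contract shown above, not part of the schema.
REQUIRED_NEVER_EMITS = {"silence_alert", "auto_suppress", "close_alert"}


def check_contract(contract: dict) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    tools = contract.get("tool_boundary", {})
    if tools.get("execution_tools_absent") is not True:
        problems.append("tool_boundary.execution_tools_absent must be true")
    if tools.get("never_suppress_critical") is not True:
        problems.append("tool_boundary.never_suppress_critical must be true")

    never_emits = set(contract.get("output_boundary", {}).get("never_emits", []))
    missing = REQUIRED_NEVER_EMITS - never_emits
    if missing:
        problems.append("output_boundary.never_emits is missing " + ", ".join(sorted(missing)))

    cost = contract.get("cost_boundary", {})
    if not cost.get("alert_threshold_usd", 0) < cost.get("max_usd_per_trace_loop", 0):
        problems.append("cost alert threshold should sit below the per-loop cap")

    if contract.get("audit", {}).get("append_only") is not True:
        problems.append("audit.append_only must be true")
    return problems


# Trimmed copy of the contract above, inlined so the sketch runs on its own.
CONTRACT = {
    "tool_boundary": {"execution_tools_absent": True, "never_suppress_critical": True},
    "output_boundary": {"never_emits": ["silence_alert", "auto_suppress", "close_alert"]},
    "cost_boundary": {"max_usd_per_trace_loop": 0.2, "alert_threshold_usd": 0.14},
    "audit": {"append_only": True},
}
print(check_contract(CONTRACT))  # prints []
```

Checks like these could run in CI next to schema validation, so a contract edit that quietly drops a boundary fails review early.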
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Agent goal | Bounded by the authority spec above |
|---|---|
| Trust Level | A2 — Recommend |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on uncertain grouping, critical severity → oncall engineer |
| Audit trail | Append-only log (groupings, recommendations) |
| Cost & loop bounds | ≤ $0.2 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to oncall engineer |
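The cost and loop bounds in the matrix could be tracked by a small budget guard in whatever runtime hosts the agent. The sketch below is one possible shape, assuming the runtime reports a per-turn cost; the class `LoopBudget` and its return values are hypothetical, and only the three limits come from the contract.

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    """Tracks spend and turns against the contract's cost and loop boundaries."""
    max_usd: float = 0.20     # cost_boundary.max_usd_per_trace_loop
    alert_usd: float = 0.14   # cost_boundary.alert_threshold_usd
    max_turns: int = 8        # loop_boundary.max_reasoning_turns
    spent_usd: float = 0.0
    turns: int = 0

    def record_turn(self, cost_usd: float) -> str:
        """Record one finished reasoning turn and say whether another may start."""
        self.turns += 1
        self.spent_usd += cost_usd
        if self.spent_usd >= self.max_usd or self.turns >= self.max_turns:
            return "stop"   # no further turns: recommend from what is already known
        if self.spent_usd >= self.alert_usd:
            return "warn"   # past the alert threshold but still under the cap
        return "ok"
```

On "stop", the agent would fall back to recommending from the evidence already gathered, which matches the cost-control instruction in the system prompt.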
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Agent | Primary reasoner — Recommend authority (A2) |
|---|---|
| Tools | read alerts, cluster, dedupe, recommend rule — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.2/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to oncall engineer on uncertain grouping, critical severity |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Recommends suppressing an alert that actually mattered.
- Detection
- Critical severity is excluded from suppression and grouping confidence is scored.
- Mitigation
- Suppression is a recommendation requiring approval; criticals are never suppressed.
- Recovery
- An engineer rejects the rule and the alert remains.
Over-groups distinct alerts, masking a second incident.
- Detection
- A grouping similarity threshold is applied, and divergent signals within a group are flagged.
- Mitigation
- Uncertain groupings are flagged, not merged silently.
- Recovery
- The engineer splits the group.
A suppression rule persists after the underlying issue changes.
- Detection
- Rules are time-bounded and reviewed.
- Mitigation
- Rules expire and require re-approval.
- Recovery
- Stale rules lapse and the alert resurfaces.
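The third failure mode relies on rules that expire by construction. A minimal sketch of that idea follows, assuming a rule records who approved it and for how long; the `SuppressionRule` type and its field names are illustrative, not the blueprint's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SuppressionRule:
    """A time-boxed rule: it is only ever active inside [starts_at, expires_at)."""
    alert: str
    approved_by: str
    starts_at: datetime
    duration: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.starts_at + self.duration

    def is_active(self, now: datetime) -> bool:
        return self.starts_at <= now < self.expires_at


def active_rules(rules, now: datetime):
    """Expired rules drop out without any cleanup job; the alert simply resurfaces."""
    return [r for r in rules if r.is_active(now)]


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
rule = SuppressionRule("batch-worker-cpu-high", "oncall_engineer", start, timedelta(hours=24))
print(rule.is_active(start + timedelta(hours=23)))  # prints True
print(rule.is_active(start + timedelta(hours=24)))  # prints False
```

Because expiry is computed from the rule itself rather than from a scheduled cleanup, a stale rule cannot outlive its time box even if the reviewer forgets about it; renewing it means creating a new, freshly approved rule.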
Evaluation
Suppression precision with critical-alert safety is primary — suppressing an alert that mattered is the failure.
| Suppression precision | Of alerts recommended for suppression, the share that were genuinely noise. |
|---|---|
| Critical-miss rate | Frequency of critical alerts caught in a suppression recommendation — must be zero. |
| Grouping accuracy | Share of alert groupings that are correct, with no masked second incident. |
| Rule-decay handling | Share of stale suppression rules correctly expired. |
| Latency | Time to a grouping or recommendation. |
Recommended approach. Use a labeled alert stream with known noise versus actionable alerts; measure suppression precision and treat any suppressed critical as a hard failure. Verify that groupings don't merge distinct incidents and that rules expire.
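The first two metrics can be computed directly from such a labeled stream. The sketch below assumes each labeled alert carries a ground-truth noise label, a criticality flag, and the agent's recommendation; the `LabeledAlert` type and the sample data are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabeledAlert:
    name: str
    is_noise: bool               # ground-truth label from the labeled stream
    is_critical: bool
    recommended_suppress: bool   # what the agent recommended


def suppression_precision(alerts: List[LabeledAlert]) -> Optional[float]:
    """Share of suppression recommendations that were genuinely noise."""
    recommended = [a for a in alerts if a.recommended_suppress]
    if not recommended:
        return None  # undefined when nothing was recommended
    return sum(a.is_noise for a in recommended) / len(recommended)


def critical_misses(alerts: List[LabeledAlert]) -> List[str]:
    """Critical alerts caught in a suppression recommendation; must be empty."""
    return [a.name for a in alerts if a.is_critical and a.recommended_suppress]


# Made-up sample: two correct recommendations, one wrong, one critical left alone.
sample = [
    LabeledAlert("disk-temp-info", True, False, True),
    LabeledAlert("cron-heartbeat-late", True, False, True),
    LabeledAlert("queue-depth-high", False, False, True),
    LabeledAlert("checkout-5xx", False, True, False),
]
assert critical_misses(sample) == [], "hard failure: a critical alert was recommended for suppression"
```

Treating the critical-miss list as an assertion rather than a reported number reflects the "must be zero" wording in the table.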
When to use
Use it when
- On-call is drowning in alerts and real signals are getting lost in the noise.
- You have alert history (fire/ack/incident-correlation) the agent can score actionability from.
- You want data-backed tuning recommendations and safe, reversible suppression of proven noise.
- You want to cut fatigue while keeping a hard guarantee that incident-linked and critical alerts are never silenced.
Avoid it when
- You lack alert/incident history, so actionability can't be measured — suppression would be blind.
- You expect it to autonomously silence critical-service alerts; those are recommendation-only.
- Your 'noisy' alerts are actually under-investigated real signals.
- You can't keep suppression reversible, time-boxed, and audited.
System prompt
You are an Alert Noise Reduction Agent helping an on-call/SRE team cut alert fatigue. You analyze alerts, recommend tuning, and suppress proven noise — WITHOUT ever silencing a real signal. You are judged on reducing non-actionable noise AND on never suppressing an alert that matters.
== CORE PRINCIPLES ==
1. Actionability, not volume. Judge an alert by evidence of whether it leads to action: ack rate, time-to-ack, and — most importantly — whether it has ever correlated with a real incident. A high-volume alert that's always acted on is signal, not noise.
2. Suppress nothing you can't prove is noise. Only recommend/auto-suppress alerts with a strong, evidence-backed non-actionability record. When in doubt, recommend tuning, not silence.
3. Reversible and time-boxed. Suppression is always temporary, scoped, auditable, and easy to undo. You never permanently delete an alert rule.
== HARD RULES (NON-NEGOTIABLE) ==
- INCIDENT-LINKED = NEVER SUPPRESS: If an alert has EVER correlated with a real incident (even once), you must not suppress it. At most, recommend tuning (threshold/grouping). This rule is absolute.
- CRITICAL SERVICES ESCALATE: For alerts on critical/customer-facing services or SEV1-capable signals, you never auto-suppress — you recommend tuning and escalate the decision to a human.
- EVIDENCE REQUIRED: Auto-suppress only with a clear record (e.g. fired many times over a meaningful window with ~0 acks and 0 incident correlations) on a non-critical signal. State the numbers.
- BOUNDED SUPPRESSION: Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged. Never an open-ended silence.
- NO BLIND DEDUP: When grouping/deduping, preserve the ability to see the underlying alerts; never collapse distinct real signals into one that hides a problem.
== METHOD ==
- Pull each alert's history: fire count, ack rate, time-to-ack, and incident correlations over a window.
- Score actionability. Correlate/dedupe related alerts into groups. Identify chronically non-actionable, never-incident-linked, non-critical alerts as noise candidates.
- For noise candidates: recommend tuning and, if enabled and within guardrails, time-box suppress. For everything else: recommend tuning only or leave as-is.
== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- AUTO_SUPPRESS: non-critical, zero incident correlation, strong non-actionable record, confidence >= 0.85. Time-boxed + tracked.
- RECOMMEND_TUNING: noisy but incident-linked at least once, or critical service, or moderate evidence. Propose thresholds/grouping; do not suppress.
- ESCALATE: critical-service/SEV1 alerts, conflicting evidence, or anything you're unsure about.
== COST CONTROL ==
Pull the history you need to score; reuse it across related alerts. Cap tool calls; if exceeded, recommend based on what you have.
== OUTPUT FORMAT (return ONE JSON object) ==
{
  "alert": "<alert name/id or group>",
  "actionability": "<score + the numbers: fires, ack rate, incident correlations over window>",
  "incident_linked": <bool>,
  "critical_service": <bool>,
  "decision": "AUTO_SUPPRESS|RECOMMEND_TUNING|ESCALATE",
  "suppression": { "applied": <bool>, "duration": "<time-box, or empty>", "scope": "<specific alert/condition>", "reversible": true },
  "tuning": ["<concrete recommendation: threshold/grouping/dedup>"],
  "rationale": "<evidence-grounded reason>",
  "escalation": { "needed": <bool>, "reason": "<critical/uncertain, or empty>" }
}
If incident_linked is true or critical_service is true, decision must NOT be AUTO_SUPPRESS.
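The decision policy in the prompt can also be written as a deterministic function, which is useful for testing model output against it. This is a sketch under assumptions: the prompt does not say which rule wins when an alert is both incident-linked and on a critical service, so the ordering here is chosen to reproduce the three worked examples in the Examples section, and the noise thresholds reuse the `noise_threshold` values from the setup guide's `.alerts.yml`. The `AlertEvidence` type and the confidence values in the tests are hypothetical.

```python
from dataclasses import dataclass

# Thresholds mirror noise_threshold in .alerts.yml and the 0.85 confidence bar
# from the decision policy; how confidence is calibrated is left open here.
MIN_FIRES = 50
MAX_ACK_RATE = 0.02
MIN_CONFIDENCE = 0.85


@dataclass(frozen=True)
class AlertEvidence:
    name: str
    fires: int                   # fire count over the analysis window
    ack_rate: float              # 0.0-1.0
    incident_correlations: int
    critical_service: bool
    confidence: float            # calibrated confidence, 0.0-1.0


def decide(e: AlertEvidence) -> str:
    """Map evidence to AUTO_SUPPRESS, RECOMMEND_TUNING, or ESCALATE."""
    if e.incident_correlations > 0:
        return "RECOMMEND_TUNING"   # incident-linked: never suppress, tune at most
    if e.critical_service:
        return "ESCALATE"           # a human decides for critical services
    strong_noise_record = e.fires >= MIN_FIRES and e.ack_rate <= MAX_ACK_RATE
    if strong_noise_record and e.confidence >= MIN_CONFIDENCE:
        return "AUTO_SUPPRESS"      # still time-boxed and logged by the caller
    return "RECOMMEND_TUNING"       # moderate evidence: tune, do not silence
```

Note that AUTO_SUPPRESS is reachable only after both protective checks have already returned, so no combination of volume and confidence can bypass them.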
Setup guide
Install and connect alerting
Install the agent and connect it (read-only) to your alerting and incident systems.
pipx install alert-noise-agent
alert-noise-agent connect --alerts prometheus,pagerduty --incidents pagerduty
alert-noise-agent doctor
Configure guardrails
The incident-linked and critical-service protections are enforced here, not by the model.
cp .env.example .env
ANTHROPIC_API_KEY=sk-ant-...
NEVER_SUPPRESS_IF_INCIDENT_LINKED=true
MAX_SUPPRESSION=24h   # time-box; auto-expires
MODE=advise           # advise (recommend) | act (auto-suppress proven noise)
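Enforcing the protections outside the model means the model's JSON is treated as a proposal that a deterministic gate can override. The sketch below shows one way such a gate might look; `enforce_guardrails` is a hypothetical name, its parameters mirror the `MODE` and `NEVER_SUPPRESS_IF_INCIDENT_LINKED` settings above, and the fail-closed defaults are an assumption rather than documented behavior.

```python
import copy


def enforce_guardrails(output: dict, mode: str = "advise",
                       never_suppress_if_incident_linked: bool = True) -> dict:
    """Return a copy of the model output with any disallowed suppression removed."""
    gated = copy.deepcopy(output)
    # Missing flags are treated as protected, so a malformed output fails closed.
    incident_block = never_suppress_if_incident_linked and output.get("incident_linked", True)
    critical_block = output.get("critical_service", True)
    wants_suppress = (
        output.get("decision") == "AUTO_SUPPRESS"
        or output.get("suppression", {}).get("applied", False)
    )
    if wants_suppress and (incident_block or critical_block or mode != "act"):
        gated["decision"] = "ESCALATE" if critical_block else "RECOMMEND_TUNING"
        gated.setdefault("suppression", {}).update(applied=False, duration="")
    return gated
```

Because the gate runs after the model, a prompt regression or an adversarial alert name cannot turn into an applied suppression on a protected alert.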
Mark critical services
Alerts on these are recommendation-only — never auto-suppressed.
# .alerts.yml
critical_services: ["checkout", "auth", "payments", "db-primary"]
noise_threshold: { window: 30d, min_fires: 50, max_ack_rate: 0.02, incident_correlations: 0 }
suppression: { reversible: true, max_duration: 24h }
Backtest on alert history
Replay history to confirm it would never have suppressed an incident-linked alert.
alert-noise-agent backtest --range 90d --explain
# reports noise found + a hard check: suppressed-incident-linked count (must be 0)
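To make the hard check concrete, the following sketch replays historical records through a decision function and counts suppressions that hit incident-linked alerts. It is not the CLI's implementation; the record fields, the `backtest` function, and the deliberately naive volume-only policy are all hypothetical and exist only to show the check failing when it should.

```python
from typing import Callable, Iterable


def backtest(history: Iterable[dict], decide: Callable[[dict], str]) -> dict:
    """Replay history and report noise found plus the hard zero check."""
    noise_found, violations = [], []
    for record in history:
        if decide(record) != "AUTO_SUPPRESS":
            continue
        noise_found.append(record["alert"])
        if record["incident_correlations"] > 0:
            violations.append(record["alert"])  # would have silenced a real signal
    return {
        "noise_found": noise_found,
        "suppressed_incident_linked": len(violations),  # must be 0
        "violations": violations,
        "passed": not violations,
    }


def volume_only_policy(record: dict) -> str:
    """Deliberately unsafe policy: suppresses on volume alone."""
    return "AUTO_SUPPRESS" if record["fires"] >= 100 else "RECOMMEND_TUNING"


history = [
    {"alert": "batch-worker-cpu-high", "fires": 312, "incident_correlations": 0},
    {"alert": "api-latency-p99-high", "fires": 140, "incident_correlations": 1},
]
report = backtest(history, volume_only_policy)
print(report["passed"])  # prints False
```

The volume-only policy fails because it would have silenced `api-latency-p99-high`, which is exactly the case the backtest is meant to catch before auto-suppression is enabled.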
Wire in (advise first)
Run scheduled analysis and post recommendations; enable auto-suppression for proven non-critical noise once backtests are clean.
# scheduled job -> recommendations to #sre; tuning PRs to the monitoring repo
# promote MODE=act after a clean backtest
Architecture
Tools required
Workflow
1. Ingest the stream
Pull the alert inventory and metadata over the analysis window.
2. Score actionability
For each alert, compute ack rate, time-to-ack, and fire volume, and pull incident correlations.
3. Correlate & dedupe
Group related/duplicate alerts so one root cause isn't ten pages, keeping the underlying alerts visible.
4. Identify noise candidates
Flag chronically non-actionable, never-incident-linked, non-critical alerts — and nothing else.
5. Apply the gate
Forbid suppressing any incident-linked or critical-service alert; those get tuning recommendations or escalation instead.
6. Tune or suppress
Open tuning PRs and apply bounded, reversible, time-boxed suppression only where the evidence and guardrails allow.
7. Audit & review
Log every action with evidence and expiry, and surface a review so on-call sees exactly what was quieted.
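Step 3 asks for grouping that keeps the underlying alerts visible. One way to sketch that is to group by a fingerprint while retaining every member and flagging groups whose members disagree. The alert fields (`service`, `symptom`, `severity`) and the mixed-severity heuristic are assumptions for illustration; the blueprint does not specify its similarity measure.

```python
from collections import defaultdict


def group_alerts(alerts, key_fields=("service", "symptom")):
    """Group alerts by a fingerprint without discarding the underlying alerts."""
    buckets = defaultdict(list)
    for alert in alerts:
        buckets[tuple(alert[f] for f in key_fields)].append(alert)

    groups = []
    for key, members in buckets.items():
        severities = {m["severity"] for m in members}
        groups.append({
            "key": key,
            "members": members,                  # every original alert stays inspectable
            "needs_review": len(severities) > 1, # divergent signals: flag, don't merge silently
        })
    return groups


alerts = [
    {"id": "a1", "service": "batch-worker", "symptom": "cpu-high", "severity": "warning"},
    {"id": "a2", "service": "batch-worker", "symptom": "cpu-high", "severity": "warning"},
    {"id": "a3", "service": "api", "symptom": "latency", "severity": "warning"},
    {"id": "a4", "service": "api", "symptom": "latency", "severity": "critical"},
]
for group in group_alerts(alerts):
    print(group["key"], len(group["members"]), group["needs_review"])
```

A group marked `needs_review` corresponds to the `uncertain_grouping` handoff trigger: it goes to an engineer who can split it rather than being collapsed into a single page.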
Examples
Chronic non-actionable alert → time-boxed auto-suppress + tuning PR
An alert fires constantly, is never acked, and has never been tied to an incident, on a non-critical service. Safe, bounded noise to quiet.
Input
Alert: 'batch-worker-cpu-high' fired 312 times in 30d, ack rate 0.6%, 0 incident correlations, service=batch-worker (non-critical).
Output
{
  "alert": "batch-worker-cpu-high",
  "actionability": "Very low: 312 fires/30d, 0.6% ack rate, 0 incident correlations",
  "incident_linked": false,
  "critical_service": false,
  "decision": "AUTO_SUPPRESS",
  "suppression": { "applied": true, "duration": "24h", "scope": "batch-worker-cpu-high only", "reversible": true },
  "tuning": [
    "Raise CPU threshold from 80% to 92% for batch-worker, or add a 15m sustained-duration condition",
    "Group with batch-worker-memory into one 'batch-worker saturation' alert"
  ],
  "rationale": "Fires constantly, almost never acked, and has never correlated with an incident on a non-critical service: textbook noise. Suppressing for 24h while a tuning PR adjusts the threshold; suppression auto-expires and is reversible.",
  "escalation": { "needed": false, "reason": "" }
}
Note: Clear, quantified non-actionability (312 fires, 0.6% acks, 0 incidents) on a non-critical service. It applies a bounded 24h suppression (auto-expiring, reversible) and, crucially, also opens a tuning PR so the real fix lands, rather than just muting forever.
Noisy but incident-linked once → recommend tuning, do NOT suppress
An alert is genuinely noisy, but it correlated with a real outage once. The absolute rule blocks suppression; it recommends tuning instead.
Input
Alert: 'api-latency-p99-high' fired 140 times in 30d, ack rate 8%, but correlated with 1 real SEV2 incident last month. Service=api (customer-facing).
Output
{
  "alert": "api-latency-p99-high",
  "actionability": "Mixed: 140 fires/30d, 8% ack rate, but 1 real SEV2 incident correlation",
  "incident_linked": true,
  "critical_service": true,
  "decision": "RECOMMEND_TUNING",
  "suppression": { "applied": false, "duration": "", "scope": "api-latency-p99-high", "reversible": true },
  "tuning": [
    "Add a sustained-duration condition (p99 high for >5m) to cut transient spikes",
    "Raise the page threshold but keep a ticket-level alert",
    "Group flapping instances to reduce repeat pages"
  ],
  "rationale": "It is noisy, but it correlated with a real SEV2 once and sits on a customer-facing service, so it must not be suppressed. Instead, recommend tuning to cut transient noise while preserving the signal that caught the incident.",
  "escalation": { "needed": false, "reason": "" }
}
Note: The defining rule in action: the alert is noisy enough that muting it would be tempting, but it caught a real SEV2 once, so the incident-linked rule forbids suppression entirely. The agent instead recommends duration/threshold tuning that reduces noise while keeping the signal that matters. Noise reduction never trumps catching incidents.
Critical-service alert, noisy → escalate, never auto-act
A flapping alert on the payments service. Even with no incident link yet, the critical-service rule means a human decides.
Input
Alert: 'payments-error-rate' flapping, 90 fires in 30d, ack rate 30%, 0 incident correlations so far. Service=payments (critical).
Output
{
  "alert": "payments-error-rate",
  "actionability": "Moderate: 90 fires/30d, 30% ack rate, 0 incident correlations so far",
  "incident_linked": false,
  "critical_service": true,
  "decision": "ESCALATE",
  "suppression": { "applied": false, "duration": "", "scope": "payments-error-rate", "reversible": true },
  "tuning": [
    "Consider grouping by error type",
    "Possibly raise threshold, but only with payments-team sign-off given the criticality"
  ],
  "rationale": "Even though it hasn't correlated with an incident yet and is somewhat acked, this is the payments service. I won't auto-suppress or auto-tune a critical-service alert; the cost of missing a payments issue is too high. Escalating to on-call with tuning options.",
  "escalation": { "needed": true, "reason": "Critical-service (payments) alert: tuning/suppression decisions require human sign-off regardless of current noise level." }
}
Note: Critical-service guardrail: payments alerts are recommendation-only, so even a plausibly-noisy one is escalated rather than touched. The agent offers tuning options but explicitly defers the decision to the payments team, because the downside of muting a real payments alert dwarfs the annoyance of noise.
Implementation notes
- Make 'never suppress an incident-linked alert' an absolute, deterministic gate — a single past incident correlation permanently disqualifies an alert from suppression, no matter how noisy.
- Score by actionability (acks, time-to-ack, incident correlation), not raw volume; a frequent-but-always-acted-on alert is signal.
- Keep critical-service and SEV1-capable alerts recommendation-only and escalate decisions to humans.
- Make every suppression time-boxed, scoped, reversible, and audited — never an open-ended silence — and pair it with a tuning PR so the root cause gets fixed.
- Preserve visibility when deduping; collapsing distinct signals into one can hide a real problem.
- Backtest with 'suppressed-incident-linked alerts' as a hard zero metric before enabling any auto-suppression.
- The strong model earns its cost on the suppress-vs-tune judgment, while a cheaper model can pull and aggregate history.
Variations
Basic
Noise analyzer
Scores alerts by actionability, correlates duplicates, and recommends tuning with the supporting numbers for an SRE. No suppression.
Advanced
Guarded auto-suppression
Adds time-boxed, reversible suppression of proven non-critical noise and tuning PRs, with the absolute incident-linked and critical-service guardrails enforced.
Enterprise
Org-wide alert hygiene
Adds multi-team alert inventories, monitoring-as-code PR workflows, suppression audit and auto-expiry, on-call load analytics, and tuning from outcomes — incident-linked alerts always protected.
This blueprint and the AgentAz™ specification live in the central AgentKits registry, open source under Apache-2.0 (code & schema) and CC-BY-4.0 (text).
Frequently asked questions
Can it suppress an alert that has been tied to a real incident?
No, that's the hard guarantee. Any alert that has ever correlated with a real incident is permanently ineligible for suppression, and critical-service alerts are recommendation-only. It can only auto-suppress proven, never-incident-linked, non-critical noise.
How does it decide what counts as noise?
By actionability, not volume: fire count, ack rate, time-to-ack, and incident correlation over a window. A high-volume alert that's consistently acted on is treated as signal, not noise.
Is suppression permanent?
No. Every suppression is time-boxed (auto-expires), scoped to the specific alert, reversible, and logged, and it's paired with a tuning recommendation/PR so the underlying noise actually gets fixed.
How does it handle alerts on critical services?
It never auto-suppresses or auto-tunes them. It surfaces recommendations and escalates the decision to on-call, because missing a real issue on a critical service is far costlier than the noise.
How does it deduplicate alerts?
It groups related/duplicate alerts while preserving visibility into the underlying ones, so a single root cause stops paging ten times without collapsing genuinely distinct signals.
How do I roll it out safely?
Backtest it on your alert history; the key check is that it would have suppressed zero incident-linked alerts. Start in advise mode (recommendations + tuning PRs) and enable auto-suppression for non-critical proven noise only once that holds.