Overview
Validate → localize → correlate → hypothesize: turns 'why did this metric move?' into ranked, evidence-backed explanations.
Data-quality first: it checks whether the anomaly is even real (double-counts, pipeline gaps, seasonality) before trying to explain it.
Honest about causation: it localizes the move and correlates it with events, but labels correlation as correlation and never invents a cause.
Safe and bounded: read-only, cost-guarded queries, no PII exposure, and escalation when the data can't support a conclusion.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
"$schema": "./agentaz.schema.json",
"version": "2.0.0",
"last_reviewed": "2026-06-24",
"agent_id": "metric-anomaly-agent",
"trust_level": "A2",
"dna_pattern": "Synthesis",
"worst_case_action": "Proposes a wrong explanation for analyst review. Read-only; cannot change data.",
"authority_boundary": "Investigates anomalies read-only and ranks hypotheses; write tools absent.",
"tags": [
"data-analysis",
"anomaly",
"read-only",
"human-review"
],
"tool_boundary": {
"allowed_tools": [
"read_metrics",
"slice_dimensions",
"form_hypotheses",
"rank_evidence"
],
"execution_tools_absent": true,
"read_only": true
},
"output_boundary": {
"format": "structured_json",
"never_emits": [
"data_write",
"dashboard_change"
]
},
"cost_boundary": {
"max_usd_per_trace_loop": 0.25,
"alert_threshold_usd": 0.16
},
"loop_boundary": {
"max_reasoning_turns": 10
},
"human_handoff": {
"triggers": [
"competing_hypotheses",
"inconclusive"
],
"destination": "data_analyst"
},
"audit": {
"append_only": true,
"logs": [
"hypotheses",
"evidence"
]
}
}
New to this? Read the AgentAz specification guide: Trust Levels, DNA patterns, and how it complements your runtime.
This is a flagship reference blueprint for AgentAz v1.0.0. AgentAz™ is open source under Apache-2.0 (spec text under CC‑BY‑4.0) — schema and source on GitHub.
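Because the spec is a design-time document, one low-cost use is a lint that checks the contract for internal consistency before a security review. The sketch below is illustrative only. The function name `lint_contract`, the `WRITE_PREFIXES` heuristic, and the individual checks are assumptions and are not part of the AgentAz schema or its tooling. It runs on a trimmed copy of the contract shown above.

```python
import json

# Trimmed copy of the agentaz.json contract shown above.
CONTRACT = json.loads("""
{
  "agent_id": "metric-anomaly-agent",
  "tool_boundary": {
    "allowed_tools": ["read_metrics", "slice_dimensions", "form_hypotheses", "rank_evidence"],
    "execution_tools_absent": true,
    "read_only": true
  },
  "cost_boundary": {"max_usd_per_trace_loop": 0.25, "alert_threshold_usd": 0.16},
  "loop_boundary": {"max_reasoning_turns": 10},
  "human_handoff": {"triggers": ["competing_hypotheses", "inconclusive"], "destination": "data_analyst"}
}
""")

# Hypothetical name prefixes that would suggest a write-capable tool.
WRITE_PREFIXES = ("write_", "update_", "delete_", "create_", "execute_")


def lint_contract(contract: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    tools = contract["tool_boundary"]
    cost = contract["cost_boundary"]

    # A read-only agent should not list tools that look write-capable.
    if tools.get("read_only"):
        for name in tools["allowed_tools"]:
            if name.startswith(WRITE_PREFIXES):
                problems.append(f"read_only is true but tool '{name}' looks write-capable")

    # The alert should fire before the hard cap is reached.
    if cost["alert_threshold_usd"] >= cost["max_usd_per_trace_loop"]:
        problems.append("alert threshold should be below the per-loop cost cap")

    if contract["loop_boundary"]["max_reasoning_turns"] <= 0:
        problems.append("max_reasoning_turns must be positive")

    if not contract["human_handoff"]["triggers"]:
        problems.append("at least one human-handoff trigger is expected")

    return problems


print(lint_contract(CONTRACT))  # []
```

A lint like this complements schema validation: the schema checks shape, while these checks look at whether the declared boundaries agree with each other.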
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Dimension | Coverage |
|---|---|
| Agent goal | Bounded by the authority spec above |
| Trust Level | A2 — Recommend |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on competing hypotheses or inconclusive results; routed to data analyst |
| Audit trail | Append-only log (hypotheses, evidence) |
| Cost & loop bounds | ≤ $0.25 per loop · ≤ 10 reasoning turns |
| Recovery / escalation | Escalates to data analyst |
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Component | Mapping |
|---|---|
| Agent | Primary reasoner with Recommend authority (A2) |
| Tools | read metrics, slice dimensions, form hypotheses, rank evidence — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.25/loop · ≤ 10 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to data analyst on competing hypotheses or inconclusive results |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Asserts a single wrong root cause with false confidence.
- Detection: Output is ranked hypotheses with evidence, never one certainty.
- Mitigation: Read-only; competing hypotheses are always surfaced.
- Recovery: The analyst confirms against ground truth; inconclusive cases are flagged.
Presents a spurious correlation as causation.
- Detection: Each hypothesis must carry supporting evidence; correlation-not-causation is flagged.
- Mitigation: Confidence qualifiers accompany every explanation.
- Recovery: The analyst discards it; no action is taken on the explanation.
The anomaly is a data-pipeline artifact, not a real movement.
- Detection: Data-freshness and null checks run before analysis.
- Mitigation: Suspected data issues are flagged separately from business causes.
- Recovery: The case is routed to data engineering; no business conclusion is drawn.
Evaluation
Root-cause hit rate within the top hypotheses matters more than any single guess — does the true cause appear in its ranked list?
| Metric | What it measures |
|---|---|
| Top-k hypothesis accuracy | Share of anomalies where the true cause appears in the top-k ranked hypotheses. |
| Confidence calibration | Whether stated confidence matches observed correctness across cases. |
| Data-artifact detection | Share of pipeline or data-quality artifacts it correctly flags as not a real movement. |
| Inconclusive rate | How often it correctly declines to assert a cause rather than guessing. |
| Latency | Time to produce a ranked explanation. |
Recommended approach. Curate anomalies with known root causes, including data-pipeline artifacts, and measure whether the true cause appears in the ranked hypotheses and whether confidence is calibrated. Never reward a confident single wrong cause.
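The two headline metrics can be computed from a small curated set. The sketch below is a minimal illustration, assuming each evaluation case records the agent's ranked hypotheses with stated confidence and the known true cause. The case data, labels, and function names are hypothetical and do not come from a real evaluation run.

```python
from collections import defaultdict

# Hypothetical curated cases: ranked (label, stated confidence) pairs plus the known cause.
CASES = [
    {"ranked": [("deploy_regression", 0.8), ("external_outage", 0.15)], "truth": "deploy_regression"},
    {"ranked": [("duplicate_ingestion", 0.95)], "truth": "duplicate_ingestion"},
    {"ranked": [("holiday_week", 0.4), ("engagement_decay", 0.3)], "truth": "engagement_decay"},
    {"ranked": [("pricing_change", 0.7), ("campaign_end", 0.2)], "truth": "tracking_bug"},
]


def top_k_accuracy(cases, k):
    """Share of cases whose true cause appears among the first k ranked hypotheses."""
    hits = sum(1 for c in cases if c["truth"] in [label for label, _ in c["ranked"][:k]])
    return hits / len(cases)


def calibration_table(cases, n_bins=5):
    """Bucket hypotheses by stated confidence and compare with the observed hit rate."""
    bins = defaultdict(lambda: [0, 0.0, 0])  # bin -> [count, sum of confidence, correct]
    for c in cases:
        for label, conf in c["ranked"]:
            b = min(int(conf * n_bins), n_bins - 1)
            bins[b][0] += 1
            bins[b][1] += conf
            bins[b][2] += int(label == c["truth"])
    return {
        b: {"n": n, "mean_confidence": total / n, "observed_accuracy": ok / n}
        for b, (n, total, ok) in sorted(bins.items())
    }


print(top_k_accuracy(CASES, 1))  # 0.5
print(top_k_accuracy(CASES, 2))  # 0.75
for bin_index, row in calibration_table(CASES).items():
    print(bin_index, row)
```

In a well-calibrated agent, `mean_confidence` and `observed_accuracy` stay close within each bin. The fourth case, a confident wrong answer at 0.7, is exactly what this table is meant to expose.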
When to use
Use it when
- Your team gets paged on KPI moves (revenue, signups, conversion, latency-driven metrics) and burns time on manual root-cause digging.
- You have queryable metrics with useful dimensions (segment, region, platform, cohort) and an events/deploy log to correlate against.
- You want ranked hypotheses with evidence and explicit confidence — not a confident-sounding single 'cause.'
- You want a first-pass investigation that flags data-quality issues and tells you what to check next.
Avoid it when
- You expect a definitive causal verdict; observational data usually supports ranked hypotheses, not proof.
- You have no dimensional data or event log to localize and correlate against.
- You can't provide read-only, cost-bounded query access.
- The metric is too noisy or sparse to distinguish signal from noise — it should say so, not invent a story.
System prompt
You are a Metric Anomaly Investigation Agent. When a KPI moves unexpectedly, you investigate why and report ranked, evidence-backed hypotheses to a human. You are judged on correctly localizing real anomalies, honesty about causation and uncertainty, and never inventing a cause or running unsafe queries.
== CORE PRINCIPLES ==
1. Is it even real? Before explaining a move, check for data-quality issues (pipeline gaps, double-counting, late-arriving data, definition changes) and known seasonality. If the 'anomaly' is an artifact, say so and stop — do not explain a non-event.
2. Localize before theorizing. Decompose by dimensions (segment, region, platform, cohort, channel) to find WHERE the change concentrates. A localized change is the strongest clue to the cause.
3. Correlation is not causation. You work with observational data. State associations as associations, rank hypotheses by evidence strength, and never assert a single cause you can't support. Label confidence explicitly.
== HARD RULES (NON-NEGOTIABLE) ==
- READ-ONLY & COST-BOUNDED: Run only read-only queries, each bounded (filters/limits). Validate plan/cost where possible; never run an unbounded scan. If a query exceeds budget, narrow it or report what you have.
- NO FABRICATION: Never invent a cause, a number, or a correlation. Every claim ties to a query result you ran. If the data can't explain the move, say that.
- DATA-QUALITY GATE: If data-quality problems could explain the anomaly, surface them FIRST and do not present a behavioral 'explanation' as if the anomaly were confirmed real.
- NO PII: Work at aggregate/segment level. Do not surface individual records or sensitive fields.
- STAY HONEST ON CAUSATION: Present ranked hypotheses with evidence and confidence; recommend what to check to confirm. Do not overstate.
== METHOD ==
- Confirm the anomaly: compare to expected range/seasonality; check data freshness and integrity.
- Localize: decompose across dimensions; identify the segment(s) driving the move and how much each contributes.
- Correlate: line up the change against deploys, releases, flags, campaigns, pricing, and known external events in the same window.
- Hypothesize: produce ranked candidate explanations, each with supporting evidence, contribution estimate, and confidence; note what would confirm or refute each.
== DECISION ==
- REPORT: anomaly confirmed real, localized, with ranked hypotheses and confidence.
- DATA_QUALITY: the move is likely a data artifact — report that instead of a behavioral cause.
- INSUFFICIENT: real but under-determined (too noisy/sparse, or no correlating signal). Present what you found, ranked hypotheses if any, and what data/experiment would resolve it. Escalate.
== COST CONTROL ==
Query only what you need to localize and correlate; start broad, then drill into the driving segment. Reuse results. Cap tool calls; if exceeded, report current findings.
== OUTPUT FORMAT (return ONE JSON object) ==
{
"metric": "<name>",
"anomaly_confirmed": <true|false>,
"decision": "REPORT|DATA_QUALITY|INSUFFICIENT",
"magnitude": "<size/direction vs. expected>",
"localization": "<segment(s) driving it + approximate contribution>",
"hypotheses": [ { "explanation": "<candidate cause>", "evidence": "<query-based support>", "type": "correlation|likely_causal|data_quality", "confidence": <0.0-1.0>, "to_confirm": "<what would verify it>" } ],
"data_quality_notes": "<integrity/seasonality checks, or empty>",
"recommendation": "<what a human should check/do next>",
"escalation": { "needed": <bool>, "reason": "<insufficient data, or empty>" }
}
If anomaly_confirmed is false, set decision to DATA_QUALITY and do not present behavioral causes as confirmed.
Setup guide
Install and connect read-only
Install the agent and connect it to your warehouse/metrics with a read-only role and to your events/deploy log.
pipx install anomaly-agent
anomaly-agent connect --warehouse bigquery --events launchdarkly,github
anomaly-agent doctor  # verifies read-only access
Configure guardrails
Cost and safety limits are enforced by the gate, not the model.
cp .env.example .env
ANTHROPIC_API_KEY=sk-ant-...
READ_ONLY=true
MAX_SCAN_ROWS=20000000
QUERY_TIMEOUT_S=30
MAX_TOOL_CALLS=10
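To make "enforced by the gate, not the model" concrete, here is a minimal sketch of what such a gate might look like. It is not the kit's actual implementation. The class `QueryGate`, the `GateError` exception, and the keyword filter are hypothetical, and `estimated_rows` is assumed to come from the caller (for example, from a warehouse dry run). The timeout in `QUERY_TIMEOUT_S` is left to the warehouse client and is not modeled here.

```python
import re
from dataclasses import dataclass

# Conservative keyword filter (hypothetical). It can reject harmless queries that
# merely mention these words, so it backs up a read-only role rather than replacing it.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant)\b", re.IGNORECASE
)


class GateError(Exception):
    """Raised when a query is rejected before it reaches the warehouse."""


@dataclass
class QueryGate:
    max_scan_rows: int = 20_000_000  # mirrors MAX_SCAN_ROWS
    max_tool_calls: int = 10         # mirrors MAX_TOOL_CALLS
    calls_used: int = 0

    def check(self, sql: str, estimated_rows: int) -> None:
        """Reject anything that is not a single, bounded, read-only query within budget."""
        if self.calls_used >= self.max_tool_calls:
            raise GateError("tool-call cap reached; report current findings")
        statement = sql.strip().rstrip(";")
        if ";" in statement:
            raise GateError("multiple statements are not allowed")
        if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
            raise GateError("only SELECT/WITH queries are allowed")
        if FORBIDDEN.search(statement):
            raise GateError("write or DDL keyword found")
        if estimated_rows > self.max_scan_rows:
            raise GateError("estimated scan exceeds MAX_SCAN_ROWS; narrow the query")
        self.calls_used += 1  # only accepted queries count against the cap


gate = QueryGate()
gate.check("SELECT region, COUNT(*) FROM orders WHERE day = '2026-06-19' GROUP BY region", 1_000_000)
print(gate.calls_used)  # 1
```

Keeping the limits in code outside the model means a prompt-level mistake cannot raise the budget or turn a read into a write.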
Define metrics, dimensions, seasonality
Tell it how each KPI is defined, which dimensions to decompose by, and known seasonality.
# metrics/checkout_conversion.yml
definition: "paid_orders / checkout_sessions"
dimensions: [region, platform, payment_method, plan]
seasonality: "weekly (weekend dip), monthly billing spike"
expected_range: "0.28–0.34"
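The `expected_range` field is what makes a first-pass "is this even unusual?" check possible. The following sketch shows one way it could be used, assuming the range is written as two non-negative numbers separated by an en dash or hyphen. The helper names are hypothetical, and the metric definition is mirrored as a plain dict to avoid a YAML dependency.

```python
# Mirrors metrics/checkout_conversion.yml above as a plain dict.
METRIC = {
    "definition": "paid_orders / checkout_sessions",
    "dimensions": ["region", "platform", "payment_method", "plan"],
    "expected_range": "0.28–0.34",
}


def parse_expected_range(text: str) -> tuple[float, float]:
    """Parse 'low–high' (en dash or hyphen, non-negative values) into (low, high)."""
    low, high = text.replace("–", "-").split("-")
    return float(low), float(high)


def classify(value: float, expected_range: str) -> str:
    """Place an observed value relative to the configured expected range."""
    low, high = parse_expected_range(expected_range)
    if value < low:
        return "below_expected"
    if value > high:
        return "above_expected"
    return "within_expected"


print(classify(0.22, METRIC["expected_range"]))  # below_expected
print(classify(0.31, METRIC["expected_range"]))  # within_expected
```

A value outside the range is only a trigger for investigation. Seasonality and data-integrity checks still have to run before the anomaly is treated as real.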
Investigate from the CLI
Point it at a metric and window and review the localized, ranked findings.
anomaly-agent investigate --metric checkout_conversion --window 2026-06-18:2026-06-20 --explain
# prints anomaly_confirmed, localization, ranked hypotheses + confidence, next checks
Wire to alerts
Trigger an investigation when a metric alert fires; the report posts to the channel for a human.
# metric alert -> POST https://your-host/anomaly/investigate (HMAC)
# posts ranked hypotheses + recommended checks to #analytics
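The "(HMAC)" note implies the receiving endpoint checks a signature before starting an investigation. The source does not specify the scheme, so the sketch below assumes a hex-encoded HMAC-SHA256 over the raw request body with a shared secret. The function names and the idea of passing the signature in a header are assumptions for illustration.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare the received signature with the expected one in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


secret = b"example-shared-secret"  # hypothetical; load from a secret store in practice
body = b'{"metric": "checkout_conversion", "window": "2026-06-18:2026-06-20"}'
signature = sign(secret, body)

print(verify(secret, body, signature))                # True
print(verify(secret, body + b" tampered", signature))  # False
```

Verifying against the raw bytes, before any JSON parsing, avoids mismatches caused by re-serialization.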
Architecture
Tools required
Workflow
1. Intake the move
Capture the metric, window, and observed change from the alert or question, and pull the series.
2. Confirm it's real
Compare to expected range/seasonality and check data integrity. If it's a data artifact, report that and stop.
3. Localize it
Decompose across dimensions to find the segment(s) driving the move and estimate each one's contribution.
4. Find the change point
Pinpoint when the shift began so it can be aligned precisely with potential triggers.
5. Correlate events
Line the change up against deploys, flags, campaigns, pricing, and external events in the same window.
6. Rank hypotheses honestly
Produce ranked explanations with evidence, contribution, and confidence, labeling correlation vs. likely-causal — never a fabricated single cause.
7. Report or escalate
Deliver the report with what to check next; if the data can't support a conclusion, say so and escalate.
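Steps 3 and 4 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the kit's method. The contribution function only applies to additive metrics such as order counts (a ratio like checkout_conversion needs rate and mix effects separated), and the change-point function is a basic least-squares single-split estimator. All numbers and names are hypothetical.

```python
from statistics import mean


def segment_contributions(before: dict, after: dict) -> dict:
    """Share of the total change attributable to each segment (additive metrics only)."""
    total_delta = sum(after.values()) - sum(before.values())
    if total_delta == 0:
        return {}
    return {
        seg: (after.get(seg, 0) - before.get(seg, 0)) / total_delta
        for seg in sorted(set(before) | set(after))
    }


def change_point(series: list) -> int:
    """Index of the split that best separates two mean levels (needs len(series) >= 2).

    Maximizes i * (n - i) * (mean_before - mean_after)^2, which is equivalent to
    minimizing the within-segment squared error for a single split.
    """
    n = len(series)
    best_index, best_score = 1, -1.0
    for i in range(1, n):
        gap = mean(series[:i]) - mean(series[i:])
        score = i * (n - i) * gap * gap
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Hypothetical paid-order counts per segment, before and after the move.
before = {"ios_apple_pay": 300, "ios_card": 250, "web_card": 450}
after = {"ios_apple_pay": 80, "ios_card": 240, "web_card": 422}
for segment, share in segment_contributions(before, after).items():
    print(segment, round(share, 2))

# Hypothetical hourly series with a level shift starting at index 4.
print(change_point([31, 30, 32, 31, 22, 21, 23, 22]))  # 4
```

The change-point index is what step 5 lines up against deploys, flags, and campaigns. A segment carrying most of the total delta is the natural place to start that correlation.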
Examples
Conversion drop localized to one segment after a deploy → ranked hypothesis
Checkout conversion fell. The agent confirms it's real, localizes it to one platform + payment method, and correlates with a deploy — labeling it a strong association, not proven cause.
Input
Metric: checkout_conversion dropped from 0.31 to 0.22 on 2026-06-19. Investigate.
Output
{
"metric": "checkout_conversion",
"anomaly_confirmed": true,
"decision": "REPORT",
"magnitude": "-29% (0.31 -> 0.22), well below the 0.28–0.34 expected range",
"localization": "Almost entirely iOS + 'Apple Pay' checkouts (that segment fell 0.30 -> 0.08 and accounts for ~85% of the total drop); web and card checkouts are normal.",
"hypotheses": [
{ "explanation": "A regression in the Apple Pay flow introduced by mobile deploy v5.12", "evidence": "change_point at 14:10 on 06-19 aligns within minutes of v5.12 rollout; drop is confined to iOS+Apple Pay; error logs show a spike in 'payment_sheet_init_failed' on iOS post-deploy", "type": "likely_causal", "confidence": 0.8, "to_confirm": "Roll back v5.12 in a canary or check the Apple Pay init error rate before/after" },
{ "explanation": "Apple Pay outage (external)", "evidence": "no corresponding Apple system-status incident found in the window", "type": "correlation", "confidence": 0.15, "to_confirm": "Check Apple PassKit status / third-party reports" }
],
"data_quality_notes": "Data fresh and complete; not seasonality (weekday, within normal traffic).",
"recommendation": "Treat v5.12 Apple Pay regression as the leading hypothesis: check the iOS payment-sheet error and consider a rollback/canary to confirm.",
"escalation": { "needed": false, "reason": "" }
}
Note: Strong investigation: it confirms the drop is real (not seasonality/data quality), localizes ~85% of it to iOS+Apple Pay, and ties the change point to a deploy with corroborating error logs. Crucially, even the leading hypothesis is labeled 'likely_causal' at 0.8 with a concrete way to confirm; it doesn't declare proof from observational data.
'Spike' is a double-count → flag data quality, don't explain a non-event
Signups appear to spike. Before theorizing about a viral moment, the agent catches a pipeline double-count and reports that instead.
Input
Metric: daily_signups spiked from ~1,200 to ~2,400 on 2026-06-20. Explain the growth.
Output
{
"metric": "daily_signups",
"anomaly_confirmed": false,
"decision": "DATA_QUALITY",
"magnitude": "Apparent +100% (1,200 -> 2,400)",
"localization": "The 'extra' signups are exact duplicates: each user_id appears twice on 06-20 with identical timestamps from two pipeline sources.",
"hypotheses": [
{ "explanation": "Double-counting from a duplicate event stream after the 06-20 ingestion change", "evidence": "100% of the surplus rows are duplicate (user_id, timestamp) pairs; distinct user_ids = ~1,210, in line with normal", "type": "data_quality", "confidence": 0.95, "to_confirm": "De-duplicate on user_id; check the 06-20 ingestion/pipeline change" }
],
"data_quality_notes": "Likely caused by a duplicate ingestion path introduced 06-20; distinct-user count is normal (~1,210). The 'spike' is not real growth.",
"recommendation": "Do not celebrate growth — fix the duplicate ingestion and de-dupe. True signups are ~1,210, consistent with prior days.",
"escalation": { "needed": false, "reason": "" }
}
Note: The discipline that makes this trustworthy: before explaining a 'doubling' with some exciting narrative, the agent checks integrity and finds the surplus is exact duplicates. It reports a data-quality issue and explicitly says the spike isn't real, preventing a celebratory but false conclusion.
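The integrity check behind this example amounts to comparing raw rows with distinct (user_id, timestamp) pairs. A minimal sketch follows, using a handful of hypothetical events and an illustrative function name. In practice the same comparison would run as an aggregate query in the warehouse rather than over rows in memory.

```python
from collections import Counter

# Hypothetical signup events as (user_id, timestamp) pairs; two were ingested twice.
events = [
    ("u1", "2026-06-20T09:00:00Z"), ("u1", "2026-06-20T09:00:00Z"),
    ("u2", "2026-06-20T09:05:00Z"), ("u2", "2026-06-20T09:05:00Z"),
    ("u3", "2026-06-20T09:07:00Z"),
]


def duplicate_report(rows):
    """Summarize how much of the raw row count is exact duplication."""
    counts = Counter(rows)
    surplus = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "raw_rows": len(rows),
        "distinct_rows": len(counts),
        "distinct_users": len({user_id for user_id, _ in rows}),
        "surplus_rows": surplus,
        "surplus_share": surplus / len(rows) if rows else 0.0,
    }


print(duplicate_report(events))
```

When `surplus_rows` accounts for the whole apparent increase and `distinct_users` matches the normal level, the move is a pipeline artifact and the decision should be DATA_QUALITY.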
Real drop, but under-determined → ranked hypotheses + honest 'insufficient'
A metric genuinely dropped, but nothing correlates cleanly and the segment data is noisy. The agent refuses to manufacture a single cause.
Input
Metric: weekly_active_users down 6% week-over-week. No deploys, no campaign changes, drop is spread across all segments. Explain it.
Output
{
"metric": "weekly_active_users",
"anomaly_confirmed": true,
"decision": "INSUFFICIENT",
"magnitude": "-6% WoW, modestly outside normal weekly variance",
"localization": "Broad-based: the decline is spread roughly evenly across regions, platforms, and cohorts — no single segment drives it.",
"hypotheses": [
{ "explanation": "Normal variance / a soft week (holiday-adjacent)", "evidence": "drop is broad and within ~2x typical weekly swing; a regional holiday overlapped part of the week", "type": "correlation", "confidence": 0.4, "to_confirm": "Compare to the same week last year and holiday-adjusted baseline" },
{ "explanation": "Early sign of engagement decay", "evidence": "no event correlates, but prior 2 weeks were flat-to-down", "type": "correlation", "confidence": 0.3, "to_confirm": "Watch next 1–2 weeks; check retention curves by cohort" }
],
"data_quality_notes": "Data complete; no integrity issues found.",
"recommendation": "No single cause is supported by the data. Most likely a soft/holiday-adjacent week; monitor next week and compare to a holiday-adjusted YoY baseline before treating it as a trend.",
"escalation": { "needed": true, "reason": "Real but under-determined — no correlating event and a broad-based drop; needs more time/data to distinguish noise from trend." }
}
Note: The epistemic-humility example: the drop is real but broad-based with nothing correlating. Instead of inventing a tidy cause, the agent offers two low-confidence hypotheses, declines to pick a single explanation, and escalates with exactly what would resolve it. Knowing when NOT to claim a cause is what separates a trustworthy analyst from a plausible-sounding one.
Implementation notes
- Always run the data-quality and seasonality check first; explaining an anomaly that isn't real (or is a double-count) is worse than saying 'this is a data artifact.'
- Lead with localization — decomposing to the driving segment is the highest-signal step and usually points straight at the cause.
- Label every hypothesis as correlation / likely-causal / data-quality with a confidence and a way to confirm; never let observational correlation be reported as proven causation.
- Keep queries read-only and bounded, validate cost, and stay at aggregate level so no PII is exposed.
- When the data is insufficient, say so and recommend the experiment or extra data that would resolve it — a confident wrong answer is the failure mode to avoid.
- Encode each metric's definition, dimensions, and seasonality; the quality of those inputs largely determines investigation quality.
- Spend the strong model on hypothesis ranking and the honesty calls — a cheaper model can run validation and segmentation queries.
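Several of these notes can be checked mechanically on the agent's output before a human sees it. The sketch below validates a report against rules taken from the output format and decision section of the system prompt. The function name `validate_report` and the decision to require escalation on INSUFFICIENT are this example's reading of those rules, not a shipped validator.

```python
DECISIONS = {"REPORT", "DATA_QUALITY", "INSUFFICIENT"}
HYPOTHESIS_TYPES = {"correlation", "likely_causal", "data_quality"}


def validate_report(report: dict) -> list[str]:
    """Return rule violations; an empty list means the report passes these checks."""
    errors = []
    decision = report.get("decision")
    if decision not in DECISIONS:
        errors.append("decision must be REPORT, DATA_QUALITY, or INSUFFICIENT")
    if report.get("anomaly_confirmed") is False and decision != "DATA_QUALITY":
        errors.append("unconfirmed anomaly must use decision DATA_QUALITY")
    if decision == "INSUFFICIENT" and not report.get("escalation", {}).get("needed"):
        errors.append("INSUFFICIENT reports must escalate")
    for i, h in enumerate(report.get("hypotheses", [])):
        if h.get("type") not in HYPOTHESIS_TYPES:
            errors.append(f"hypothesis {i}: unknown type")
        conf = h.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            errors.append(f"hypothesis {i}: confidence must be in [0, 1]")
        if not h.get("evidence") or not h.get("to_confirm"):
            errors.append(f"hypothesis {i}: evidence and to_confirm are required")
    return errors


# Trimmed version of the first example report.
report = {
    "metric": "checkout_conversion",
    "anomaly_confirmed": True,
    "decision": "REPORT",
    "hypotheses": [
        {"explanation": "Apple Pay regression in v5.12", "evidence": "change point aligns with rollout",
         "type": "likely_causal", "confidence": 0.8, "to_confirm": "canary rollback"},
    ],
    "escalation": {"needed": False, "reason": ""},
}
print(validate_report(report))  # []
```

A check like this catches structural dishonesty, such as a hypothesis with no way to confirm it. It cannot tell whether the evidence itself is sound, which remains the analyst's call.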
Variations
Basic
Anomaly investigator
Confirms an anomaly, localizes it by segment, and returns ranked hypotheses with evidence and confidence for an analyst. Read-only and cost-bounded.
Advanced
Event-correlating root-cause assistant
Adds deploy/flag/campaign/external-event correlation, change-point detection, and a data-quality gate, with honest correlation-vs-causation labeling and next-check recommendations.
Enterprise
Governed analytics investigator
Adds a warehouse-wide metric/semantic layer, broad event integrations, cost governance, PII-safe aggregation, and tuning of seasonality/baselines from feedback at scale.
Download the Agent Blueprint
This flagship blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
Frequently asked questions
It gives ranked, evidence-backed hypotheses with explicit confidence and labels them correlation vs. likely-causal. Observational data rarely proves a single cause, so it's honest about that and tells you what would confirm each hypothesis.
It checks data quality and seasonality first — pipeline gaps, double-counting, definition changes, known cycles. If the move is a data artifact, it reports that instead of inventing a behavioral explanation.
Yes. It runs read-only, cost-bounded queries (no unbounded scans), validates query cost where possible, and works at aggregate/segment level so it doesn't surface PII.
It says so. It presents any low-confidence hypotheses, declines to manufacture a single cause, and escalates with the specific data or experiment that would resolve the question.
It decomposes the metric across your dimensions (region, platform, cohort, channel) to localize the move, then correlates the change point with deploys, feature flags, campaigns, and external events in the same window.
Accuracy depends heavily on good metric definitions, dimensions, and seasonality baselines — those are the biggest levers. It also shows its evidence and confidence so you can verify each hypothesis rather than trust a black box.