Frequently asked questions

Will it tell me the definitive cause of a metric change?

Not as a certainty. It returns ranked, evidence-backed hypotheses, each with an explicit confidence and a label of correlation or likely-causal. Observational data rarely proves a single cause, so it says so and tells you what would confirm each hypothesis.

How does it avoid 'explaining' a fake anomaly?

It checks data quality and seasonality first — pipeline gaps, double-counting, definition changes, known cycles. If the move is a data artifact, it reports that instead of inventing a behavioral explanation.

Is it safe to run against our warehouse?

Yes. It runs read-only, cost-bounded queries (no unbounded scans), validates query cost where possible, and works at aggregate/segment level so it doesn't surface PII.

What does it do when nothing correlates?

It says so. It presents any low-confidence hypotheses, declines to manufacture a single cause, and escalates with the specific data or experiment that would resolve the question.

How does it find where the change came from?

It decomposes the metric across your dimensions (region, platform, cohort, channel) to localize the move, then correlates the change point with deploys, feature flags, campaigns, and external events in the same window.

How accurate is it?

Accuracy depends heavily on good metric definitions, dimensions, and seasonality baselines; those are the biggest levers. It also shows its evidence and confidence so you can verify each hypothesis rather than trust a black box.

Metric Anomaly Investigation Agent

Overview

Validate → localize → correlate → hypothesize: turns 'why did this metric move?' into ranked, evidence-backed explanations.

Data-quality first: it checks whether the anomaly is even real (double-counts, pipeline gaps, seasonality) before trying to explain it.

Honest about causation: it localizes the move and correlates it with events, but labels correlation as correlation and never invents a cause.

Safe and bounded: read-only, cost-guarded queries, no PII exposure, and escalation when the data can't support a conclusion.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level: A2 (Recommend)

DNA Pattern: Synthesis (Extract → Synthesize → Verify)

Worst-Case Action: Proposes a wrong explanation for a metric movement, which an analyst reviews before acting on it. It runs read-only analysis only and cannot change data or dashboards or take any action; write tools are absent from its registry.

Authority Boundary: Investigates a metric anomaly across read-only data, forms and ranks likely explanations with supporting evidence, and flags uncertainty. It never asserts a single cause as fact, changes data, or takes action. An analyst confirms.

Verification Test: Confirm explanations are ranked hypotheses with evidence, not asserted certainties; confirm no data-write tool exists in the registry.

Production Readiness: 6/6 dimensions passing. Tool isolation: read-only; write tools absent. Human gates: an analyst confirms. Confidence escalation: competing hypotheses surfaced. Cost ceiling: bounded per investigation. Audit trail: hypotheses and evidence logged. Escalation path: inconclusive cases flagged.

Last Reviewed: 2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json

{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "metric-anomaly-agent",
  "trust_level": "A2",
  "dna_pattern": "Synthesis",
  "worst_case_action": "Proposes a wrong explanation for analyst review. Read-only; cannot change data.",
  "authority_boundary": "Investigates anomalies read-only and ranks hypotheses; write tools absent.",
  "tags": [
    "data-analysis",
    "anomaly",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_metrics",
      "slice_dimensions",
      "form_hypotheses",
      "rank_evidence"
    ],
    "execution_tools_absent": true,
    "read_only": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "data_write",
      "dashboard_change"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.25,
    "alert_threshold_usd": 0.16
  },
  "loop_boundary": {
    "max_reasoning_turns": 10
  },
  "human_handoff": {
    "triggers": [
      "competing_hypotheses",
      "inconclusive"
    ],
    "destination": "data_analyst"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "hypotheses",
      "evidence"
    ]
  }
}
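The contract is design-time documentation, so nothing checks it unless you do. A minimal sketch of a few sanity checks a reviewer might script against it follows. It is not the AgentAz JSON Schema validation, and the `check_contract` helper and the write-like name prefixes are assumptions made for illustration.

```python
import json


def load_contract(path):
    """Read an agentaz.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_contract(contract):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    tools = contract.get("tool_boundary", {})
    if tools.get("read_only") is not True:
        problems.append("tool_boundary.read_only must be true")
    if tools.get("execution_tools_absent") is not True:
        problems.append("tool_boundary.execution_tools_absent must be true")
    # Assumed naming convention: treat these prefixes as write-like tools.
    for name in tools.get("allowed_tools", []):
        if name.startswith(("write_", "update_", "delete_")):
            problems.append(f"write-like tool in registry: {name}")
    cost = contract.get("cost_boundary", {})
    ceiling = cost.get("max_usd_per_trace_loop")
    alert = cost.get("alert_threshold_usd")
    if ceiling is None or alert is None or not (0 < alert <= ceiling):
        problems.append("alert_threshold_usd must be positive and within the cost ceiling")
    if not contract.get("human_handoff", {}).get("triggers"):
        problems.append("human_handoff.triggers must not be empty")
    return problems


# Trimmed copy of the contract above, inlined so the sketch runs on its own.
SAMPLE = {
    "tool_boundary": {
        "allowed_tools": ["read_metrics", "slice_dimensions", "form_hypotheses", "rank_evidence"],
        "execution_tools_absent": True,
        "read_only": True,
    },
    "cost_boundary": {"max_usd_per_trace_loop": 0.25, "alert_threshold_usd": 0.16},
    "human_handoff": {"triggers": ["competing_hypotheses", "inconclusive"], "destination": "data_analyst"},
}

print(check_contract(SAMPLE))  # prints []
```

In practice you would call `check_contract(load_contract("agentaz.json"))` in a review script and fail the review if the list is not empty.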

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

This is a flagship reference blueprint for AgentAz v1.0.0. AgentAz™ is open source under Apache-2.0 (spec text under CC‑BY‑4.0) — schema and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal	Bounded by the authority spec above
Trust Level	A2 — Recommend
Tool access	Least privilege — execution tools absent (read-only)
Context handling	Grounded in provided inputs; cites or flags rather than guessing
Memory strategy	Task-scoped; no persistent cross-session memory
Human approval	Required when hypotheses compete or the result is inconclusive; routed to a data analyst
Audit trail	Append-only log (hypotheses, evidence)
Cost & loop bounds	≤ $0.25 per loop · ≤ 10 reasoning turns
Recovery / escalation	Escalates to data analyst

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent	Primary reasoner — Recommend authority (A2)
Tools	read metrics, slice dimensions, form hypotheses, rank evidence — execution tools absent (read-only)
Memory	Task-scoped working context; no persistent cross-session memory
Guardrails	Worst-case classified (A2); no execution tools; ≤ $0.25/loop · ≤ 10 turns
Evaluator	Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff	Escalates to a data analyst on competing hypotheses or an inconclusive result

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Asserts a single wrong root cause with false confidence.

Detection: Output is ranked hypotheses with evidence, never one certainty.
Mitigation: Read-only; competing hypotheses are always surfaced.
Recovery: The analyst confirms against ground truth; inconclusive cases are flagged.

Presents a spurious correlation as causation.

Detection: Each hypothesis must carry supporting evidence; correlation-not-causation is flagged.
Mitigation: Confidence qualifiers accompany every explanation.
Recovery: The analyst discards it — no action is taken on the explanation.

The anomaly is a data-pipeline artifact, not a real movement.

Detection: Data-freshness and null checks run before analysis.
Mitigation: Suspected data issues are flagged separately from business causes.
Recovery: The case is routed to data engineering; no business conclusion is drawn.

Evaluation

Root-cause hit rate within the top hypotheses matters more than any single guess — does the true cause appear in its ranked list?

Top-k hypothesis accuracy	Share of anomalies where the true cause appears in the top-k ranked hypotheses.
Confidence calibration	Whether stated confidence matches observed correctness across cases.
Data-artifact detection	Share of pipeline or data-quality artifacts it correctly flags as not a real movement.
Inconclusive rate	How often it correctly declines to assert a cause rather than guessing.
Latency	Time to produce a ranked explanation.

Recommended approach. Curate anomalies with known root causes, including data-pipeline artifacts, and measure whether the true cause appears in the ranked hypotheses and whether confidence is calibrated. Never reward a confident single wrong cause.
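The first two metrics in the table can be computed from a labeled set of past anomalies. The sketch below assumes each case records the true cause and the agent's ranked list, and that each prediction is a (confidence, was_correct) pair. The field names and sample data are hypothetical.

```python
from collections import defaultdict


def top_k_accuracy(cases, k=3):
    """Share of cases whose true cause appears in the first k ranked hypotheses."""
    hits = sum(1 for case in cases if case["true_cause"] in case["ranked"][:k])
    return hits / len(cases)


def calibration_table(predictions, bins=5):
    """Group (confidence, was_correct) pairs into equal-width confidence bins.

    Returns {bin_index: (mean_confidence, observed_accuracy, count)}.
    A well-calibrated agent has mean confidence close to observed accuracy.
    """
    grouped = defaultdict(list)
    for confidence, correct in predictions:
        index = min(int(confidence * bins), bins - 1)
        grouped[index].append((confidence, correct))
    table = {}
    for index, rows in sorted(grouped.items()):
        mean_conf = sum(conf for conf, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table[index] = (round(mean_conf, 3), round(accuracy, 3), len(rows))
    return table


# Hypothetical evaluation set, including one data-pipeline artifact.
cases = [
    {"true_cause": "deploy_regression", "ranked": ["deploy_regression", "external_outage"]},
    {"true_cause": "duplicate_ingestion", "ranked": ["viral_campaign", "duplicate_ingestion"]},
    {"true_cause": "pricing_change", "ranked": ["seasonality", "campaign_end", "tracking_bug"]},
    {"true_cause": "holiday", "ranked": ["holiday"]},
]
print(top_k_accuracy(cases, k=1))  # 0.5
print(top_k_accuracy(cases, k=2))  # 0.75

predictions = [(0.9, True), (0.82, True), (0.85, False), (0.3, False), (0.35, True), (0.25, False)]
print(calibration_table(predictions))
```

A high-confidence bin whose observed accuracy sits well below its mean confidence is the "confident single wrong cause" pattern the recommended approach says never to reward.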

When to use

Use it when

Your team gets paged on KPI moves (revenue, signups, conversion, latency-driven metrics) and burns time on manual root-cause digging.
You have queryable metrics with useful dimensions (segment, region, platform, cohort) and an events/deploy log to correlate against.
You want ranked hypotheses with evidence and explicit confidence — not a confident-sounding single 'cause.'
You want a first-pass investigation that flags data-quality issues and tells you what to check next.

Avoid it when

You expect a definitive causal verdict; observational data usually supports ranked hypotheses, not proof.
You have no dimensional data or event log to localize and correlate against.
You can't provide read-only, cost-bounded query access.
The metric is too noisy or sparse to distinguish signal from noise — it should say so, not invent a story.

System prompt

system-prompt.md

You are a Metric Anomaly Investigation Agent. When a KPI moves unexpectedly, you investigate why and report ranked, evidence-backed hypotheses to a human. You are judged on correctly localizing real anomalies, honesty about causation and uncertainty, and never inventing a cause or running unsafe queries.

== CORE PRINCIPLES ==
1. Is it even real? Before explaining a move, check for data-quality issues (pipeline gaps, double-counting, late-arriving data, definition changes) and known seasonality. If the 'anomaly' is an artifact, say so and stop — do not explain a non-event.
2. Localize before theorizing. Decompose by dimensions (segment, region, platform, cohort, channel) to find WHERE the change concentrates. A localized change is the strongest clue to the cause.
3. Correlation is not causation. You work with observational data. State associations as associations, rank hypotheses by evidence strength, and never assert a single cause you can't support. Label confidence explicitly.

== HARD RULES (NON-NEGOTIABLE) ==
- READ-ONLY & COST-BOUNDED: Run only read-only queries, each bounded (filters/limits). Validate plan/cost where possible; never run an unbounded scan. If a query exceeds budget, narrow it or report what you have.
- NO FABRICATION: Never invent a cause, a number, or a correlation. Every claim ties to a query result you ran. If the data can't explain the move, say that.
- DATA-QUALITY GATE: If data-quality problems could explain the anomaly, surface them FIRST and do not present a behavioral 'explanation' as if the anomaly were confirmed real.
- NO PII: Work at aggregate/segment level. Do not surface individual records or sensitive fields.
- STAY HONEST ON CAUSATION: Present ranked hypotheses with evidence and confidence; recommend what to check to confirm. Do not overstate.

== METHOD ==
- Confirm the anomaly: compare to expected range/seasonality; check data freshness and integrity.
- Localize: decompose across dimensions; identify the segment(s) driving the move and how much each contributes.
- Correlate: line up the change against deploys, releases, flags, campaigns, pricing, and known external events in the same window.
- Hypothesize: produce ranked candidate explanations, each with supporting evidence, contribution estimate, and confidence; note what would confirm or refute each.

== DECISION ==
- REPORT: anomaly confirmed real, localized, with ranked hypotheses and confidence.
- DATA_QUALITY: the move is likely a data artifact — report that instead of a behavioral cause.
- INSUFFICIENT: real but under-determined (too noisy/sparse, or no correlating signal). Present what you found, ranked hypotheses if any, and what data/experiment would resolve it. Escalate.

== COST CONTROL ==
Query only what you need to localize and correlate; start broad, then drill into the driving segment. Reuse results. Cap tool calls; if exceeded, report current findings.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "metric": "<name>",
  "anomaly_confirmed": <true|false>,
  "decision": "REPORT|DATA_QUALITY|INSUFFICIENT",
  "magnitude": "<size/direction vs. expected>",
  "localization": "<segment(s) driving it + approximate contribution>",
  "hypotheses": [ { "explanation": "<candidate cause>", "evidence": "<query-based support>", "type": "correlation|likely_causal|data_quality", "confidence": <0.0-1.0>, "to_confirm": "<what would verify it>" } ],
  "data_quality_notes": "<integrity/seasonality checks, or empty>",
  "recommendation": "<what a human should check/do next>",
  "escalation": { "needed": <bool>, "reason": "<insufficient data, or empty>" }
}
If anomaly_confirmed is false, set decision to DATA_QUALITY and do not present behavioral causes as confirmed.
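Outside the prompt itself, the caller can check that a reply respects the output rules before showing it to an analyst. This is a minimal sketch under that assumption. `validate_report` is a hypothetical helper, not part of the blueprint, and it covers only the rules stated in the prompt above.

```python
import json

DECISIONS = {"REPORT", "DATA_QUALITY", "INSUFFICIENT"}
HYPOTHESIS_TYPES = {"correlation", "likely_causal", "data_quality"}


def validate_report(raw):
    """Parse the agent's JSON reply and return rule violations (empty list = OK)."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["top-level value must be a JSON object"]
    errors = []
    decision = report.get("decision")
    if decision not in DECISIONS:
        errors.append("decision must be REPORT, DATA_QUALITY, or INSUFFICIENT")
    # Rule from the prompt: an unconfirmed anomaly must be reported as DATA_QUALITY.
    if report.get("anomaly_confirmed") is False and decision != "DATA_QUALITY":
        errors.append("anomaly_confirmed=false requires decision DATA_QUALITY")
    for i, hyp in enumerate(report.get("hypotheses", [])):
        if hyp.get("type") not in HYPOTHESIS_TYPES:
            errors.append(f"hypotheses[{i}].type is not an allowed label")
        conf = hyp.get("confidence")
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not is_number or not 0.0 <= conf <= 1.0:
            errors.append(f"hypotheses[{i}].confidence must be between 0.0 and 1.0")
        if not hyp.get("evidence"):
            errors.append(f"hypotheses[{i}] has no evidence")
    # The INSUFFICIENT decision says to escalate.
    if decision == "INSUFFICIENT" and not report.get("escalation", {}).get("needed"):
        errors.append("INSUFFICIENT decisions should set escalation.needed")
    return errors


reply = json.dumps({
    "metric": "daily_signups",
    "anomaly_confirmed": False,
    "decision": "REPORT",
    "hypotheses": [{"explanation": "viral moment", "evidence": "", "type": "likely_causal", "confidence": 1.4}],
})
for problem in validate_report(reply):
    print(problem)
```

A reply that fails these checks can be rejected or re-requested rather than forwarded.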

Setup guide

Install and connect read-only

Install the agent and connect it to your warehouse/metrics with a read-only role and to your events/deploy log.

shell

pipx install anomaly-agent
anomaly-agent connect --warehouse bigquery --events launchdarkly,github
anomaly-agent doctor   # verifies read-only access

Configure guardrails

Cost and safety limits are enforced by the deterministic safety and cost gate, not by the model.

shell

cp .env.example .env
# then set these values in .env:
ANTHROPIC_API_KEY=sk-ant-...
READ_ONLY=true
MAX_SCAN_ROWS=20000000
QUERY_TIMEOUT_S=30
MAX_TOOL_CALLS=10
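To illustrate what enforcing these limits outside the model could look like, here is a minimal sketch of a deterministic pre-flight check. `GateConfig` and `check_query` are hypothetical names, and `estimated_rows` is assumed to come from a warehouse dry run or query plan. The keyword test is a naive heuristic, not a SQL parser, so a read-only warehouse role remains the real safeguard.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    read_only: bool = True
    max_scan_rows: int = 20_000_000
    query_timeout_s: int = 30  # passed to the warehouse client, not checked here
    max_tool_calls: int = 10

    @classmethod
    def from_env(cls, env):
        """Build the config from a mapping such as os.environ."""
        return cls(
            read_only=str(env.get("READ_ONLY", "true")).lower() == "true",
            max_scan_rows=int(env.get("MAX_SCAN_ROWS", 20_000_000)),
            query_timeout_s=int(env.get("QUERY_TIMEOUT_S", 30)),
            max_tool_calls=int(env.get("MAX_TOOL_CALLS", 10)),
        )


WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant)\b", re.IGNORECASE
)


def check_query(sql, estimated_rows, calls_so_far, config):
    """Return (allowed, reason) for one proposed query."""
    statement = sql.strip().rstrip(";")
    if calls_so_far >= config.max_tool_calls:
        return False, "tool-call budget exhausted"
    if ";" in statement:
        return False, "multiple statements are not allowed"
    starts_as_read = re.match(r"(?i)(select|with)\b", statement) is not None
    if config.read_only and (not starts_as_read or WRITE_KEYWORDS.search(statement)):
        return False, "only read-only SELECT statements are allowed"
    # Bounded means the query carries at least a filter or a limit.
    if not re.search(r"(?i)\b(where|limit)\b", statement):
        return False, "query has no filter or limit"
    if estimated_rows > config.max_scan_rows:
        return False, "estimated scan exceeds MAX_SCAN_ROWS"
    return True, "ok"


config = GateConfig.from_env({"READ_ONLY": "true", "MAX_SCAN_ROWS": "20000000", "MAX_TOOL_CALLS": "10"})
print(check_query("SELECT * FROM orders", 10, 0, config))
# (False, 'query has no filter or limit')
```

Because the check is plain code, the model cannot talk its way past it. A rejected query is returned to the agent with the reason so it can narrow the query or report what it has.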

Define metrics, dimensions, seasonality

Tell it how each KPI is defined, which dimensions to decompose by, and known seasonality.

yaml

# metrics/checkout_conversion.yml
definition: "paid_orders / checkout_sessions"
dimensions: [region, platform, payment_method, plan]
seasonality: "weekly (weekend dip), monthly billing spike"
expected_range: "0.28–0.34"
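As an illustration of how the `expected_range` field might be used in the validation step, the sketch below parses the range string and classifies an observed value against it. The parsing helper is an assumption about how such a string could be read, not the agent's actual loader, and it handles non-negative bounds only.

```python
def parse_expected_range(text):
    """Parse a range such as "0.28-0.34" into (low, high).

    Accepts an en dash or a hyphen as the separator. Negative bounds are
    not supported by this simple split.
    """
    low, high = text.replace("\u2013", "-").split("-")
    return float(low), float(high)


def classify_against_range(value, expected_range):
    """Label a value as below, within, or above the expected range."""
    low, high = parse_expected_range(expected_range)
    if value < low:
        return "below_expected"
    if value > high:
        return "above_expected"
    return "within_expected"


# Mirrors metrics/checkout_conversion.yml; \u2013 is the en dash used there.
metric = {
    "definition": "paid_orders / checkout_sessions",
    "dimensions": ["region", "platform", "payment_method", "plan"],
    "expected_range": "0.28\u20130.34",
}

print(classify_against_range(0.22, metric["expected_range"]))  # below_expected
print(classify_against_range(0.31, metric["expected_range"]))  # within_expected
```

A value outside the range is only a trigger for investigation; seasonality and data-quality checks still decide whether the anomaly is real.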

Investigate from the CLI

Point it at a metric and window and review the localized, ranked findings.

shell

anomaly-agent investigate --metric checkout_conversion --window 2026-06-18:2026-06-20 --explain
# prints anomaly_confirmed, localization, ranked hypotheses + confidence, next checks

Wire to alerts

Trigger an investigation when a metric alert fires; the report posts to the channel for a human.

shell

# metric alert -> POST https://your-host/anomaly/investigate (HMAC)
# posts ranked hypotheses + recommended checks to #analytics
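The HMAC on the webhook lets the receiver reject alerts that were not sent by your alerting system. A minimal sketch of the verification step follows, assuming an HMAC-SHA256 hex digest over the raw request body. The hash choice, the shared secret, and how the signature is transmitted are assumptions that depend on your alerting tool.

```python
import hashlib
import hmac


def sign(secret, body):
    """Compute the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret, body, received_hex):
    """Check the sender's signature using a constant-time comparison."""
    return hmac.compare_digest(sign(secret, body), received_hex)


# Hypothetical secret and payload for illustration only.
secret = b"replace-with-shared-secret"
body = b'{"metric": "checkout_conversion", "window": "2026-06-18:2026-06-20"}'
signature = sign(secret, body)

print(verify_signature(secret, body, signature))            # True
print(verify_signature(secret, body + b" ", signature))     # False
```

Verify against the raw bytes as received, before any JSON parsing, since re-serializing the payload can change the bytes and break the signature.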

Architecture

Anomaly intake: Receives the metric, the time window, and the observed move (from an alert or a user question) and pulls the relevant series.

Validation & seasonality: Compares the move to the expected range and seasonality and checks data freshness/integrity to decide whether the anomaly is real before anything else.

Segmentation engine: Decomposes the metric across dimensions to localize where the change concentrates and how much each segment contributes.

Event correlation: Lines the change up against deploys, feature flags, campaigns, pricing changes, and known external events in the same window.

Hypothesis ranking: The model assembles candidate explanations, each tagged correlation / likely-causal / data-quality, with evidence, a contribution estimate, and a confidence.

Safety & cost gate: A deterministic layer keeps queries read-only and bounded, blocks unbounded scans, and keeps analysis at aggregate level (no PII).

Report & escalation: Produces a ranked, honest report with what-to-check-next, and escalates when the data is insufficient to explain the move.
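To make the segmentation step concrete, here is a minimal sketch of contribution analysis for an additive metric such as order counts. Each segment's share is its change divided by the total change. The segment names and numbers are hypothetical. Ratio metrics such as conversion need the numerator and denominator decomposed separately, which this sketch does not cover.

```python
def segment_contributions(before, after):
    """Rank segments by how much of the total change each one accounts for.

    before/after map segment name -> value of an additive metric.
    Returns (segment, delta, share_of_total_change) tuples, largest move first.
    """
    segments = sorted(set(before) | set(after))
    deltas = {name: after.get(name, 0) - before.get(name, 0) for name in segments}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("no net change to attribute")
    rows = [(name, delta, delta / total) for name, delta in deltas.items()]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical daily paid orders per segment, before and after the drop.
before = {"ios_apple_pay": 300, "ios_card": 200, "web_card": 500}
after = {"ios_apple_pay": 80, "ios_card": 195, "web_card": 495}

for name, delta, share in segment_contributions(before, after):
    print(f"{name}: delta={delta}, share={share:.1%}")
```

When one segment carries most of the change, the drill-down queries can focus there. When shares are spread evenly, that is itself a finding and points toward an inconclusive result.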

Tools required

get_metric_series: Retrieve the metric's time series for the window, with expected range/seasonality baselines.

validate_data_quality: Check data freshness, pipeline completeness, double-counting, and definition changes that could fake an anomaly.

decompose_segments: Break the metric down by dimensions (segment, region, platform, cohort, channel) to localize the change.

detect_change_points: Identify when the shift began, to align it precisely with potential triggers.

correlate_events: Pull deploys, feature-flag flips, campaigns, pricing changes, and external events in the window to test associations.

query_dimension: Run a bounded, read-only drill-down query into a specific segment to quantify its contribution.

rank_hypotheses: Score candidate explanations by evidence strength and contribution, tagging correlation vs. likely-causal.

summarize_findings: Assemble the localized, evidence-backed, confidence-rated report with recommended next checks.
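As one possible shape for a tool like detect_change_points, the sketch below finds a single shift in level by trying every split of the series and keeping the one with the largest size-weighted difference in means. This is a simple stand-in chosen for illustration, not necessarily what the blueprint ships. It assumes one change point and ignores seasonality.

```python
def find_change_point(series, min_segment=2):
    """Locate a single mean shift in a series.

    Returns (index, shift), where index is the first point of the new
    regime and shift is mean(after) - mean(before).
    """
    n = len(series)
    if n < 2 * min_segment:
        raise ValueError("series is too short to split")
    best_index, best_shift, best_score = None, 0.0, -1.0
    for i in range(min_segment, n - min_segment + 1):
        left, right = series[:i], series[i:]
        shift = sum(right) / len(right) - sum(left) / len(left)
        # Weight by segment sizes so a split near either edge is not
        # favored just because one side has very few points.
        score = abs(shift) * (len(left) * len(right) / n) ** 0.5
        if score > best_score:
            best_index, best_shift, best_score = i, shift, score
    return best_index, best_shift


# Hypothetical hourly checkout conversion around a drop.
hourly = [0.31, 0.30, 0.32, 0.31, 0.30, 0.22, 0.21, 0.23, 0.22]
index, shift = find_change_point(hourly)
print(index, round(shift, 3))  # 5 -0.088
```

The returned index maps back to a timestamp, which is what gets compared against deploy and flag times in the correlation step.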

Workflow

1. Intake the move
Capture the metric, window, and observed change from the alert or question, and pull the series.
2. Confirm it's real
Compare to expected range/seasonality and check data integrity. If it's a data artifact, report that and stop.
3. Localize it
Decompose across dimensions to find the segment(s) driving the move and estimate each one's contribution.
4. Find the change point
Pinpoint when the shift began so it can be aligned precisely with potential triggers.
5. Correlate events
Line the change up against deploys, flags, campaigns, pricing, and external events in the same window.
6. Rank hypotheses honestly
Produce ranked explanations with evidence, contribution, and confidence, labeling correlation vs. likely-causal — never a fabricated single cause.
7. Report or escalate
Deliver the report with what to check next; if the data can't support a conclusion, say so and escalate.
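The branching in steps 2 and 7 can be summarized as a small decision function. This sketch assumes the earlier steps have already produced a `Findings` record. The record, its fields, and the 0.5 localization threshold are hypothetical simplifications, not values defined by the blueprint.

```python
from dataclasses import dataclass, field


@dataclass
class Findings:
    data_artifact_suspected: bool  # result of the data-quality and seasonality check
    top_segment_share: float       # share of the move explained by the top segment, 0 to 1
    correlated_events: list = field(default_factory=list)


def decide(findings, min_share=0.5):
    """Map findings to (decision, escalate) following the workflow above."""
    # Step 2: a data artifact ends the investigation before any explanation.
    if findings.data_artifact_suspected:
        return "DATA_QUALITY", False
    localized = findings.top_segment_share >= min_share
    # Steps 3 to 6: a localized move with a correlating event supports a report.
    if localized and findings.correlated_events:
        return "REPORT", False
    # Step 7: real but under-determined, so say so and escalate.
    return "INSUFFICIENT", True


print(decide(Findings(False, 0.85, ["mobile deploy v5.12"])))  # ('REPORT', False)
print(decide(Findings(True, 0.0)))                             # ('DATA_QUALITY', False)
print(decide(Findings(False, 0.2)))                            # ('INSUFFICIENT', True)
```

Keeping this branch in code rather than in the model's judgment makes the data-quality stop and the escalation path testable.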

Examples

Conversion drop localized to one segment after a deploy → ranked hypothesis

Checkout conversion fell. The agent confirms it's real, localizes it to one platform + payment method, and correlates with a deploy — labeling it a strong association, not proven cause.

Input

input

Metric: checkout_conversion dropped from 0.31 to 0.22 on 2026-06-19. Investigate.

Output

output

{
  "metric": "checkout_conversion",
  "anomaly_confirmed": true,
  "decision": "REPORT",
  "magnitude": "-29% (0.31 -> 0.22), well below the 0.28–0.34 expected range",
  "localization": "Almost entirely iOS + 'Apple Pay' checkouts (that segment fell 0.30 -> 0.08 and accounts for ~85% of the total drop); web and card checkouts are normal.",
  "hypotheses": [
    { "explanation": "A regression in the Apple Pay flow introduced by mobile deploy v5.12", "evidence": "change_point at 14:10 on 06-19 aligns within minutes of v5.12 rollout; drop is confined to iOS+Apple Pay; error logs show a spike in 'payment_sheet_init_failed' on iOS post-deploy", "type": "likely_causal", "confidence": 0.8, "to_confirm": "Roll back v5.12 in a canary or check the Apple Pay init error rate before/after" },
    { "explanation": "Apple Pay outage (external)", "evidence": "no corresponding Apple system-status incident found in the window", "type": "correlation", "confidence": 0.15, "to_confirm": "Check Apple PassKit status / third-party reports" }
  ],
  "data_quality_notes": "Data fresh and complete; not seasonality (weekday, within normal traffic).",
  "recommendation": "Treat v5.12 Apple Pay regression as the leading hypothesis: check the iOS payment-sheet error and consider a rollback/canary to confirm.",
  "escalation": { "needed": false, "reason": "" }
}

Note: Strong investigation: it confirms the drop is real (not seasonality/data quality), localizes ~85% of it to iOS+Apple Pay, and ties the change point to a deploy with corroborating error logs. Crucially, even the leading hypothesis is labeled 'likely_causal' at 0.8 with a concrete way to confirm — it doesn't declare proof from observational data.

'Spike' is a double-count → flag data quality, don't explain a non-event

Signups appear to spike. Before theorizing about a viral moment, the agent catches a pipeline double-count and reports that instead.

Input

input

Metric: daily_signups spiked from ~1,200 to ~2,400 on 2026-06-20. Explain the growth.

Output

output

{
  "metric": "daily_signups",
  "anomaly_confirmed": false,
  "decision": "DATA_QUALITY",
  "magnitude": "Apparent +100% (1,200 -> 2,400)",
  "localization": "The 'extra' signups are exact duplicates: each user_id appears twice on 06-20 with identical timestamps from two pipeline sources.",
  "hypotheses": [
    { "explanation": "Double-counting from a duplicate event stream after the 06-20 ingestion change", "evidence": "100% of the surplus rows are duplicate (user_id, timestamp) pairs; distinct user_ids = ~1,210, in line with normal", "type": "data_quality", "confidence": 0.95, "to_confirm": "De-duplicate on user_id; check the 06-20 ingestion/pipeline change" }
  ],
  "data_quality_notes": "Likely caused by a duplicate ingestion path introduced 06-20; distinct-user count is normal (~1,210). The 'spike' is not real growth.",
  "recommendation": "Do not celebrate growth — fix the duplicate ingestion and de-dupe. True signups are ~1,210, consistent with prior days.",
  "escalation": { "needed": false, "reason": "" }
}

Note: The discipline that makes this trustworthy: before explaining a 'doubling' with some exciting narrative, the agent checks integrity and finds the surplus is exact duplicates. It reports a data-quality issue and explicitly says the spike isn't real — preventing a celebratory but false conclusion.
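The integrity check behind this example is a comparison of total rows with distinct rows. Here is a minimal sketch, assuming the signup events are available as (user_id, timestamp) pairs. The row layout and sample data are hypothetical, and in a warehouse the same idea is a COUNT(*) against a COUNT(DISTINCT ...) aggregate.

```python
from collections import Counter


def duplicate_summary(rows):
    """Summarize exact duplicates in an iterable of (user_id, timestamp) pairs."""
    counts = Counter(rows)
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total_rows": total,
        "distinct_rows": distinct,
        "surplus_rows": total - distinct,  # rows that repeat an earlier row exactly
        "distinct_users": len({user_id for user_id, _ in counts}),
    }


rows = [
    ("u1", "2026-06-20T09:00:00"),
    ("u1", "2026-06-20T09:00:00"),  # same event from a second pipeline source
    ("u2", "2026-06-20T09:05:00"),
    ("u2", "2026-06-20T09:05:00"),
    ("u3", "2026-06-20T09:07:00"),
]
print(duplicate_summary(rows))
# {'total_rows': 5, 'distinct_rows': 3, 'surplus_rows': 2, 'distinct_users': 3}
```

If the surplus rows account for the whole apparent increase and the distinct-user count matches a normal day, the spike is an artifact rather than growth.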

Real drop, but under-determined → ranked hypotheses + honest 'insufficient'

A metric genuinely dropped, but nothing correlates cleanly and the segment data is noisy. The agent refuses to manufacture a single cause.

Input

input

Metric: weekly_active_users down 6% week-over-week. No deploys, no campaign changes, drop is spread across all segments. Explain it.

Output

output

{
  "metric": "weekly_active_users",
  "anomaly_confirmed": true,
  "decision": "INSUFFICIENT",
  "magnitude": "-6% WoW, modestly outside normal weekly variance",
  "localization": "Broad-based: the decline is spread roughly evenly across regions, platforms, and cohorts — no single segment drives it.",
  "hypotheses": [
    { "explanation": "Normal variance / a soft week (holiday-adjacent)", "evidence": "drop is broad and within ~2x typical weekly swing; a regional holiday overlapped part of the week", "type": "correlation", "confidence": 0.4, "to_confirm": "Compare to the same week last year and holiday-adjusted baseline" },
    { "explanation": "Early sign of engagement decay", "evidence": "no event correlates, but prior 2 weeks were flat-to-down", "type": "correlation", "confidence": 0.3, "to_confirm": "Watch next 1–2 weeks; check retention curves by cohort" }
  ],
  "data_quality_notes": "Data complete; no integrity issues found.",
  "recommendation": "No single cause is supported by the data. Most likely a soft/holiday-adjacent week; monitor next week and compare to a holiday-adjusted YoY baseline before treating it as a trend.",
  "escalation": { "needed": true, "reason": "Real but under-determined — no correlating event and a broad-based drop; needs more time/data to distinguish noise from trend." }
}

Note: The epistemic-humility example: the drop is real but broad-based with nothing correlating. Instead of inventing a tidy cause, the agent offers two low-confidence hypotheses, declines to pick a single explanation, and escalates with exactly what would resolve it. Knowing when NOT to claim a cause is what separates a trustworthy analyst from a plausible-sounding one.

Implementation notes

Always run the data-quality and seasonality check first; explaining an anomaly that isn't real (or is a double-count) is worse than saying 'this is a data artifact.'
Once the anomaly is confirmed as real, lead with localization: decomposing to the driving segment is the highest-signal step and usually points toward the cause.
Label every hypothesis as correlation / likely-causal / data-quality with a confidence and a way to confirm; never let observational correlation be reported as proven causation.
Keep queries read-only and bounded, validate cost, and stay at aggregate level so no PII is exposed.
When the data is insufficient, say so and recommend the experiment or extra data that would resolve it — a confident wrong answer is the failure mode to avoid.
Encode each metric's definition, dimensions, and seasonality; the quality of those inputs largely determines investigation quality.
Spend the strong model on hypothesis ranking and the honesty calls — a cheaper model can run validation and segmentation queries.

Variations

Basic

Anomaly investigator

Confirms an anomaly, localizes it by segment, and returns ranked hypotheses with evidence and confidence for an analyst. Read-only and cost-bounded.

Advanced

Event-correlating root-cause assistant

Adds deploy/flag/campaign/external-event correlation, change-point detection, and a data-quality gate, with honest correlation-vs-causation labeling and next-check recommendations.

Enterprise

Governed analytics investigator

Adds a warehouse-wide metric/semantic layer, broad event integrations, cost governance, PII-safe aggregation, and tuning of seasonality/baselines from feedback at scale.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Contents: README.md, system-prompt.md, setup-guide.md, tools.json, workflow.md, examples.md, .env.example, kit.json, run.py, LICENSE, NOTICE, starters/

Export

Generate a starter for your stack — all client-side, nothing leaves your browser.

Starters use mock tools — swap in your integrations to deploy.

View the source on GitHub

This flagship blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
