Will it hide real bugs by quarantining tests?

No — that's its primary safety constraint. It only quarantines with positive flakiness evidence, treats any failure that reproduces or correlates with a code change as real, and escalates (never quarantines) tests on critical paths like payments and auth.

How does it decide flaky vs. real?

It re-runs the failing test in isolation, checks its pass/fail history, looks for flakiness signatures (timing, ordering, shared state, environment), and correlates against recent code changes. Flaky needs positive evidence; reproducible or change-correlated failures are real.

Does it delete flaky tests?

Never. Quarantine means a reversible skip-with-tracking-issue assigned to an owner, so the test is unblocked from CI but still gets fixed rather than forgotten.

What about tests on critical paths?

Tests covering payments, auth, or data integrity are escalate-only — it won't auto-quarantine them even if they look flaky, because an 'intermittent' failure there can be a real bug.

How is the cost of re-running controlled?

It re-runs only the single failing test in isolation, up to a configured cap, rather than the whole suite — and if a case is still ambiguous after the cap, it escalates instead of burning more runs.

How do we adopt it safely?

Start in advise mode where it only recommends, backtest on recent failures to confirm it would mask zero regressions, then enable auto-quarantine for non-critical proven flakes.

Flaky Test Triage Agent

Overview

Re-run → analyze history → classify → quarantine or route: turns a red build into a clear 'flake vs. real bug' verdict.

Evidence-based: it isolates the failure and weighs failure history and patterns (timing, ordering, environment) before judging.

Quarantines proven flakes behind a gate to unblock the pipeline — and routes real regressions straight to the code author.

Defensive: it never masks a real bug, needs repeated confirmation to call a test flaky, and escalates ambiguous or critical-path failures.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level ?A2 — Recommend

DNA PatternEvaluation (Research → Evaluate)

Worst-Case ActionLabels a real failure as flaky (or vice versa), surfaced for an engineer to review. It cannot quarantine, disable, skip, or rerun tests autonomously — those tools are absent from its registry.

Authority BoundaryAnalyzes test history to identify likely-flaky tests with supporting evidence and recommends action. An engineer decides. It never disables, quarantines, skips, or reruns tests on its own.

Verification TestAttempt to call a disable-test, quarantine, or skip tool → confirm it is absent from the agent's registry.

Production Readiness6/6 dimensions passing. Tool isolation: disable/quarantine tools absent. Human gates: an engineer decides. Confidence escalation: ambiguous cases flagged, never auto-quarantined. Cost ceiling: bounded per analysis. Audit trail: evidence and recommendations logged. Escalation path: suspected real failures flagged.

Last Reviewed2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json

{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "flaky-test-triage-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Mislabels a failure as flaky for engineer review. Cannot disable, quarantine, or skip tests.",
  "authority_boundary": "Identifies likely-flaky tests and recommends action; disable/quarantine tools absent.",
  "tags": [
    "qa-testing",
    "flaky-tests",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_test_history",
      "detect_flakiness",
      "gather_evidence",
      "recommend"
    ],
    "execution_tools_absent": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "disable_test",
      "quarantine",
      "skip",
      "rerun"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.2,
    "alert_threshold_usd": 0.14
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "ambiguous_case",
      "suspected_real_failure"
    ],
    "destination": "engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "evidence",
      "recommendations"
    ]
  }
}

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal	Bounded by the authority spec above
Trust Level	A2 — Recommend
Tool access	Least privilege — execution tools absent (read-only)
Context handling	Grounded in provided inputs; cites or flags rather than guessing
Memory strategy	Task-scoped; no persistent cross-session memory
Human approval	Required on ambiguous case, suspected real failure → engineer
Audit trail	Append-only log (evidence, recommendations)
Cost & loop bounds	≤ $0.2 per loop · ≤ 8 reasoning turns
Recovery / escalation	Escalates to engineer

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent	Primary reasoner — Recommend authority (A2)
Tools	read test history, detect flakiness, gather evidence, recommend — execution tools absent (read-only)
Memory	Task-scoped working context; no persistent cross-session memory
Guardrails	Worst-case classified (A2); no execution tools; ≤ $0.2/loop · ≤ 8 turns
Evaluator	Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff	Escalates to engineer on ambiguous case, suspected real failure

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Labels a real failure as flaky, hiding a genuine bug.

Detection: Ambiguous cases are flagged and evidence is required per verdict.
Mitigation: It recommends only — it cannot disable, quarantine, or skip tests.
Recovery: An engineer reviews; suspected real failures are surfaced.

Labels a flaky test as a real failure, wasting triage time.

Detection: Flakiness signals such as intermittency and environment-dependence are weighed with confidence.
Mitigation: Low-confidence verdicts are flagged, not asserted.
Recovery: The engineer corrects the label.

Acts on insufficient history — too few runs to judge.

Detection: A minimum-sample check runs before a verdict.
Mitigation: Thin history yields 'insufficient data', not a guess.
Recovery: More runs are gathered before judging.

Evaluation

Correctly separating flaky from real failures is primary — labeling a real failure as flaky hides a bug.

Classification accuracy	Share of failures correctly labeled flaky versus real.
Real-failure recall	Of genuine failures, the share NOT mislabeled as flaky — the critical side.
Flaky precision	Of failures labeled flaky, the share that truly are.
Insufficient-data rate	Share of thin-history cases correctly marked 'insufficient data' rather than guessed.
Latency	Time to a verdict.

Recommended approach. Use a labeled history of test runs with known flaky-versus-real outcomes; measure real-failure recall as the critical metric and flaky precision as the noise side. Include thin-history cases to test the abstain behavior. It never disables or quarantines tests.

When to use

Use it when

Flaky tests are eroding trust in your CI and engineers are re-running builds or ignoring reds.
You have CI history and the ability to re-run individual tests in isolation.
You want consistent, evidence-based flaky-vs-real classification instead of gut calls.
You want proven flakes quarantined (with an issue) to unblock merges, while real regressions still reach the author.

Avoid it when

You can't re-run individual tests or lack failure history — classification would be guessing.
You want it to delete tests or permanently silence failures; it only quarantines reversibly, with a tracking issue.
Critical-path suites (payments/auth) where you're unwilling to keep human review on quarantines.
Your 'flaky' tests are actually catching real intermittent bugs you haven't investigated.

System prompt

system-prompt.md

You are a Flaky Test Triage Agent for a CI pipeline. For ONE failing test, you determine whether it is FLAKY (the test is unreliable) or a REAL failure (the code is broken), and act accordingly. You are judged on correctly unblocking pipelines AND on never masking a real regression by mislabeling it flaky.

== CORE PRINCIPLES ==
1. Evidence before verdict. Do not call a test flaky from a single failure. Re-run it in isolation, check its failure history, and look for flakiness signatures (timing/race, test-ordering dependence, shared state, environment/network). Cite the evidence.
2. Bias toward protecting signal. A flaky test is an annoyance; a masked regression is a shipped bug. When evidence is mixed, treat the failure as potentially REAL and route it to a human, not to quarantine.
3. Quarantine is reversible and tracked. You never delete a test. Quarantine means skip-with-a-tracking-issue so it gets fixed, not forgotten.

== HARD RULES (NON-NEGOTIABLE) ==
- CONFIRMATION REQUIRED: Classify FLAKY only with positive evidence — e.g. the test passes on isolated re-run AND history shows intermittent pass/fail without a correlating code change, or a clear flakiness signature. A single pass-on-rerun is not enough on its own.
- NEVER MASK A REGRESSION: If the failure reproduces consistently on re-run, or correlates with a recent change to the code under test, it is REAL — do not quarantine; route to the author.
- CRITICAL PATHS ESCALATE: For tests covering security, auth, payments, or data integrity, do not auto-quarantine even if it looks flaky — escalate to the owner with your evidence.
- NO DELETION / BOUNDED COST: Never delete or rewrite tests. Cap the number of re-runs per test; if still ambiguous after the cap, escalate.
- TRACK EVERYTHING: Every quarantine creates a tracking issue with the evidence and an owner.

== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- QUARANTINE_FLAKY: positive flakiness evidence, not a critical path, confidence >= 0.85. Skip-with-issue to unblock; assign an owner to fix.
- REAL_FAILURE: reproduces on re-run or correlates with a recent code change. Route to the author; keep the build red.
- ESCALATE: ambiguous after the re-run budget, critical-path test, or conflicting evidence. Hand to the owner with findings.

== COST CONTROL ==
Re-run only the failing test (isolated), not the whole suite, up to the configured cap. Reuse history already pulled. Stop once you can decide.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "test": "<test id/name>",
  "verdict": "flaky|real|ambiguous",
  "confidence": <0.0-1.0>,
  "evidence": ["<isolated re-run results, history, flakiness signature, change correlation>"],
  "signature": "<timing|ordering|shared_state|environment|none>",
  "critical_path": <bool>,
  "decision": "QUARANTINE_FLAKY|REAL_FAILURE|ESCALATE",
  "actions": [ { "tool": "<tool>", "args": { ... }, "requires_approval": <bool> } ],
  "issue": "<tracking issue title + owner if quarantining, else empty>",
  "author_note": "<message to the code author if REAL, else empty>",
  "escalation": { "needed": <bool>, "reason": "<critical path / ambiguity, or empty>" }
}
If verdict is ambiguous or the test is critical-path, do NOT quarantine — ESCALATE or route as REAL.

Was this useful?

Simulate run

Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute — no API calls, nothing leaves your browser.

Frontend preview only — no data leaves your browser. Tip: press ⌘/Ctrl + Enter to run.

Setup guide

Install and connect CI

Install the agent and connect it to your CI and test reporting.

shell

pipx install flaky-triage-agent
flaky-triage-agent connect --ci github-actions --reports junit
flaky-triage-agent doctor   # verifies isolated re-run capability

Configure confirmation and caps

How many isolated re-runs, and how much evidence is needed to call flaky. Enforced outside the model.

shell

cp .env.example .env
ANTHROPIC_API_KEY=sk-ant-...
MAX_RERUNS=5
FLAKY_NEEDS: { isolated_pass: true, history_intermittent: true }
MODE=advise   # advise (recommend) | act (auto-quarantine proven flakes)

Mark critical-path suites

These never auto-quarantine, even if they look flaky.

shell

# .flaky.yml
critical_paths: ["tests/payments/**", "tests/auth/**", "tests/data_integrity/**"]
quarantine: skip_with_issue   # never delete
assign_owner_from: CODEOWNERS

Dry-run on recent failures

Replay recent red builds to check verdicts before enabling auto-quarantine.

shell

flaky-triage-agent backtest --range 14d --explain
# reports flaky/real accuracy and any regressions it would have masked (must be 0)

Wire into the pipeline

Trigger on test failures. Start in advise mode; enable auto-quarantine for non-critical proven flakes once backtests are clean.

shell

# CI failure hook -> POST https://your-host/flaky/triage (HMAC)
# promote MODE=act after a clean backtest with zero masked regressions

Architecture

Failure intakeReceives the failing test from CI with its logs, the diff/commit under test, and the job/environment context.

Isolated re-run harnessRe-runs just the failing test in isolation (and a few times within a cap) to see whether the failure reproduces deterministically or intermittently.

History & change correlationPulls the test's recent pass/fail history and checks whether the failure correlates with a recent change to the code under test — the key flaky-vs-real signal.

Signature analysisLooks for flakiness signatures — timing/races, test-ordering dependence, shared state, environment/network — to support (or refute) a flaky verdict.

Verdict & quarantine gateThe model classifies on the evidence; a deterministic gate blocks quarantine for critical-path tests and for anything that reproduces or correlates with a code change.

Quarantine & trackingReversibly skips a proven flake with an auto-filed tracking issue and owner, and routes real failures to the author — never deleting tests.

Feedback & flake registryMaintains a flakiness registry and logs verdict outcomes so repeat offenders are spotted and misclassifications tune the thresholds.

Tools required

get_failureFetch the failing test, its logs, the commit/diff under test, and the CI job/environment context.

rerun_testRe-run the single failing test in isolation (up to a cap) to check whether the failure reproduces or is intermittent.

failure_historyReturn the test's recent pass/fail history and flakiness rate across branches and runs.

analyze_patternDetect flakiness signatures — timing/races, ordering dependence, shared state, environment/network sensitivity.

change_correlationCheck whether the failure correlates with a recent change to the code the test covers (a strong 'real' signal).

quarantine_testReversibly skip a proven-flaky test (skip-with-issue). Gated: rejects critical-path tests and unconfirmed flakes.

file_issueOpen a tracking issue for a quarantined flake (or a real failure) with the evidence and an assigned owner.

escalate_to_ownerRoute ambiguous or critical-path failures to the test/code owner with the collected evidence.

Workflow

1. Intake the failure
Pull the failing test, logs, and the diff under test from CI.
2. Re-run in isolation
Re-run just that test (within the cap) to see if the failure reproduces deterministically or comes and goes.
3. Check history & change correlation
Pull pass/fail history and check whether the failure lines up with a recent change to the covered code.
4. Look for a flakiness signature
Analyze for timing/ordering/shared-state/environment causes that explain instability.
5. Decide on evidence
Call it flaky only with positive evidence; if it reproduces or correlates with a change, it's real; if unclear or critical-path, escalate.
6. Act through the gate
Quarantine proven flakes (skip-with-issue) to unblock, route real failures to the author, and never quarantine critical-path tests automatically.
7. Track & learn
File the tracking issue, update the flake registry, and feed outcomes back so repeat flakes and misclassifications are caught.

Examples

Intermittent, timing-dependent test → quarantine as flaky

A test fails once, passes on isolated re-run, and history shows it flips pass/fail with no related code change — a classic race.

Input

input

Failure: test_async_notify failed in CI. No changes to the notify module in this PR.

Output

output

{
  "test": "test_async_notify",
  "verdict": "flaky",
  "confidence": 0.9,
  "evidence": ["isolated re-run: passed 5/5", "history: failed 6 of last 80 runs across unrelated PRs", "logs show it asserts before an async callback completes (race)", "no change to the notify module in this PR"],
  "signature": "timing",
  "critical_path": false,
  "decision": "QUARANTINE_FLAKY",
  "actions": [
    { "tool": "quarantine_test", "args": { "test": "test_async_notify", "mode": "skip_with_issue" }, "requires_approval": false },
    { "tool": "file_issue", "args": { "title": "Flaky: test_async_notify (async race)", "owner": "@notify-team" }, "requires_approval": false }
  ],
  "issue": "Flaky: test_async_notify (async race) — assigned @notify-team",
  "author_note": "",
  "escalation": { "needed": false, "reason": "" }
}

Note: Strong flaky evidence: passes 5/5 in isolation, intermittent history across unrelated PRs, a clear timing/race signature, and no related code change. It quarantines reversibly (skip-with-issue) and assigns an owner — unblocking the build without losing the test.

Reproduces on re-run + correlates with the change → REAL, don't quarantine

The failure looks like it could be flaky, but it reproduces every re-run and the PR changed exactly the code under test. The agent refuses to quarantine.

Input

input

Failure: test_discount_calc failed. This PR modifies pricing/discount.py.

Output

output

{
  "test": "test_discount_calc",
  "verdict": "real",
  "confidence": 0.92,
  "evidence": ["isolated re-run: failed 5/5 (deterministic)", "this PR modifies pricing/discount.py — the exact code under test", "history: passed 100% before this PR", "assertion: expected 90.0, got 99.0 (discount not applied)"],
  "signature": "none",
  "critical_path": false,
  "decision": "REAL_FAILURE",
  "actions": [],
  "issue": "",
  "author_note": "test_discount_calc fails deterministically (5/5) and your PR changes pricing/discount.py — expected 90.0 but got 99.0, so the 10% discount isn't being applied. This looks like a real regression, not a flake. Please check the discount calculation before merging.",
  "escalation": { "needed": false, "reason": "" }
}

Note: The core safety behavior: it would be tempting (and harmful) to quarantine a red test to unblock the PR. But the failure is deterministic and correlates exactly with the changed code, so the agent calls it REAL, keeps the build red, and hands the author a precise diagnosis. Quarantining here would have shipped a pricing bug.

Looks flaky but it's a payments test → escalate, never auto-quarantine

A test on the payments path shows intermittent behavior. Even though it pattern-matches to flaky, the critical-path rule forces a human.

Input

input

Failure: test_payment_capture_retry intermittently fails (passed 3/5 on re-run). Covers the payment capture/retry path.

Output

output

{
  "test": "test_payment_capture_retry",
  "verdict": "ambiguous",
  "confidence": 0.6,
  "evidence": ["isolated re-run: passed 3/5, failed 2/5 (intermittent)", "history: occasional failures", "covers payment capture + retry logic (critical path)", "possible real intermittent bug in retry/idempotency, not just test flakiness"],
  "signature": "timing",
  "critical_path": true,
  "decision": "ESCALATE",
  "actions": [
    { "tool": "escalate_to_owner", "args": { "owner": "@payments", "priority": "high" }, "requires_approval": false }
  ],
  "issue": "",
  "author_note": "",
  "escalation": { "needed": true, "reason": "Intermittent failure on a payments-path test. It looks flaky, but intermittent failures in capture/retry can indicate a real idempotency/race bug. Not auto-quarantining a critical-path test — escalating to @payments to investigate whether the retry logic is actually correct." }
}

Note: The defining example: the test pattern-matches to flaky (passes 3/5), but it covers payments — where an 'intermittent' failure might be a real idempotency/race bug that only sometimes triggers. Auto-quarantining it could hide a money-losing defect, so the critical-path rule overrides and it escalates to the owner with that exact concern.

Implementation notes

Require positive, multi-signal evidence to call a test flaky (isolated pass + intermittent history or a clear signature). A single pass-on-rerun must never be enough — that's how real intermittent bugs get masked.
Always check change-correlation: a failure on a test whose covered code just changed is real until proven otherwise.
Make critical-path suites (payments/auth/data) escalate-only; the cost of masking a regression there is far higher than a slow build.
Quarantine reversibly (skip-with-issue) and never delete tests; every quarantine needs a tracking issue and an owner so flakes get fixed, not buried.
Bound re-runs per test for cost, and escalate (don't guess) once the budget is exhausted on an ambiguous case.
Maintain a flake registry and track 'masked regressions' as the key safety metric in backtests before enabling auto-quarantine.
Reserve the strong model for the flaky-vs-real judgment; a cheaper model can parse logs and pull history.

Variations

Basic

Flake classifier

Re-runs the failure, analyzes history and signatures, and reports a flaky/real verdict with evidence and a recommendation for an engineer. No auto-quarantine.

Advanced

Guarded auto-quarantine

Auto-quarantines proven non-critical flakes (skip-with-issue + owner) to unblock CI, routes real failures to authors, and escalates critical-path and ambiguous cases.

Enterprise

Org-wide flake management

Adds a cross-repo flake registry, trend analytics, CODEOWNERS routing, quarantine SLAs and auto-expiry, and threshold tuning from masked-regression metrics at scale.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Download Blueprint (.zip)

README.mdsystem-prompt.mdsetup-guide.mdtools.jsonworkflow.mdexamples.md.env.examplekit.jsonrun.pyLICENSENOTICEstarters/

Export

Generate a starter for your stack — all client-side, nothing leaves your browser.

ZIP

Starters use mock tools — swap in your integrations to deploy.

View the source on GitHub

This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

Flaky Test Triage Agent

Overview

AgentAz™ specification

Governance matrix

Agent component mapping

Failure modes

Evaluation

When to use

System prompt

Simulate run

Setup guide

Architecture

Tools required

Workflow

Examples

Implementation notes

Variations

Frequently asked questions

Will it hide real bugs by quarantining tests?

How does it decide flaky vs. real?

Does it delete flaky tests?

What about tests on critical paths?

How is the cost of re-running controlled?

How do we adopt it safely?

Related kits

Test Case Generation Agent

AI Bug-Fix & Draft-PR Agent

Production-Grade AI Code Review Agent

Access Request & Provisioning Agent