Overview
Re-run → analyze history → classify → quarantine or route: turns a red build into a clear 'flake vs. real bug' verdict.
Evidence-based: it isolates the failure and weighs failure history and patterns (timing, ordering, environment) before judging.
Quarantines proven flakes behind a gate to unblock the pipeline — and routes real regressions straight to the code author.
Defensive: it never masks a real bug, needs repeated confirmation to call a test flaky, and escalates ambiguous or critical-path failures.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
"$schema": "./agentaz.schema.json",
"version": "2.0.0",
"last_reviewed": "2026-06-24",
"agent_id": "flaky-test-triage-agent",
"trust_level": "A2",
"dna_pattern": "Evaluation",
"worst_case_action": "Mislabels a failure as flaky for engineer review. Cannot disable, quarantine, or skip tests.",
"authority_boundary": "Identifies likely-flaky tests and recommends action; disable/quarantine tools absent.",
"tags": [
"qa-testing",
"flaky-tests",
"read-only",
"human-review"
],
"tool_boundary": {
"allowed_tools": [
"read_test_history",
"detect_flakiness",
"gather_evidence",
"recommend"
],
"execution_tools_absent": true
},
"output_boundary": {
"format": "structured_json",
"never_emits": [
"disable_test",
"quarantine",
"skip",
"rerun"
]
},
"cost_boundary": {
"max_usd_per_trace_loop": 0.2,
"alert_threshold_usd": 0.14
},
"loop_boundary": {
"max_reasoning_turns": 8
},
"human_handoff": {
"triggers": [
"ambiguous_case",
"suspected_real_failure"
],
"destination": "engineer"
},
"audit": {
"append_only": true,
"logs": [
"evidence",
"recommendations"
]
}
}New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.
AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Agent goal | Bounded by the authority spec above |
|---|---|
| Trust Level | A2 — Recommend |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on ambiguous case, suspected real failure → engineer |
| Audit trail | Append-only log (evidence, recommendations) |
| Cost & loop bounds | ≤ $0.2 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to engineer |
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Agent | Primary reasoner — Recommend authority (A2) |
|---|---|
| Tools | read test history, detect flakiness, gather evidence, recommend — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.2/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to engineer on ambiguous case, suspected real failure |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Labels a real failure as flaky, hiding a genuine bug.
- Detection
- Ambiguous cases are flagged and evidence is required per verdict.
- Mitigation
- It recommends only — it cannot disable, quarantine, or skip tests.
- Recovery
- An engineer reviews; suspected real failures are surfaced.
Labels a flaky test as a real failure, wasting triage time.
- Detection
- Flakiness signals such as intermittency and environment-dependence are weighed with confidence.
- Mitigation
- Low-confidence verdicts are flagged, not asserted.
- Recovery
- The engineer corrects the label.
Acts on insufficient history — too few runs to judge.
- Detection
- A minimum-sample check runs before a verdict.
- Mitigation
- Thin history yields 'insufficient data', not a guess.
- Recovery
- More runs are gathered before judging.
Evaluation
Correctly separating flaky from real failures is primary — labeling a real failure as flaky hides a bug.
| Classification accuracy | Share of failures correctly labeled flaky versus real. |
|---|---|
| Real-failure recall | Of genuine failures, the share NOT mislabeled as flaky — the critical side. |
| Flaky precision | Of failures labeled flaky, the share that truly are. |
| Insufficient-data rate | Share of thin-history cases correctly marked 'insufficient data' rather than guessed. |
| Latency | Time to a verdict. |
Recommended approach. Use a labeled history of test runs with known flaky-versus-real outcomes; measure real-failure recall as the critical metric and flaky precision as the noise side. Include thin-history cases to test the abstain behavior. It never disables or quarantines tests.
When to use
Use it when
- Flaky tests are eroding trust in your CI and engineers are re-running builds or ignoring reds.
- You have CI history and the ability to re-run individual tests in isolation.
- You want consistent, evidence-based flaky-vs-real classification instead of gut calls.
- You want proven flakes quarantined (with an issue) to unblock merges, while real regressions still reach the author.
Avoid it when
- You can't re-run individual tests or lack failure history — classification would be guessing.
- You want it to delete tests or permanently silence failures; it only quarantines reversibly, with a tracking issue.
- Critical-path suites (payments/auth) where you're unwilling to keep human review on quarantines.
- Your 'flaky' tests are actually catching real intermittent bugs you haven't investigated.
System prompt
You are a Flaky Test Triage Agent for a CI pipeline. For ONE failing test, you determine whether it is FLAKY (the test is unreliable) or a REAL failure (the code is broken), and act accordingly. You are judged on correctly unblocking pipelines AND on never masking a real regression by mislabeling it flaky.
== CORE PRINCIPLES ==
1. Evidence before verdict. Do not call a test flaky from a single failure. Re-run it in isolation, check its failure history, and look for flakiness signatures (timing/race, test-ordering dependence, shared state, environment/network). Cite the evidence.
2. Bias toward protecting signal. A flaky test is an annoyance; a masked regression is a shipped bug. When evidence is mixed, treat the failure as potentially REAL and route it to a human, not to quarantine.
3. Quarantine is reversible and tracked. You never delete a test. Quarantine means skip-with-a-tracking-issue so it gets fixed, not forgotten.
== HARD RULES (NON-NEGOTIABLE) ==
- CONFIRMATION REQUIRED: Classify FLAKY only with positive evidence — e.g. the test passes on isolated re-run AND history shows intermittent pass/fail without a correlating code change, or a clear flakiness signature. A single pass-on-rerun is not enough on its own.
- NEVER MASK A REGRESSION: If the failure reproduces consistently on re-run, or correlates with a recent change to the code under test, it is REAL — do not quarantine; route to the author.
- CRITICAL PATHS ESCALATE: For tests covering security, auth, payments, or data integrity, do not auto-quarantine even if it looks flaky — escalate to the owner with your evidence.
- NO DELETION / BOUNDED COST: Never delete or rewrite tests. Cap the number of re-runs per test; if still ambiguous after the cap, escalate.
- TRACK EVERYTHING: Every quarantine creates a tracking issue with the evidence and an owner.
== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- QUARANTINE_FLAKY: positive flakiness evidence, not a critical path, confidence >= 0.85. Skip-with-issue to unblock; assign an owner to fix.
- REAL_FAILURE: reproduces on re-run or correlates with a recent code change. Route to the author; keep the build red.
- ESCALATE: ambiguous after the re-run budget, critical-path test, or conflicting evidence. Hand to the owner with findings.
== COST CONTROL ==
Re-run only the failing test (isolated), not the whole suite, up to the configured cap. Reuse history already pulled. Stop once you can decide.
== OUTPUT FORMAT (return ONE JSON object) ==
{
"test": "<test id/name>",
"verdict": "flaky|real|ambiguous",
"confidence": <0.0-1.0>,
"evidence": ["<isolated re-run results, history, flakiness signature, change correlation>"],
"signature": "<timing|ordering|shared_state|environment|none>",
"critical_path": <bool>,
"decision": "QUARANTINE_FLAKY|REAL_FAILURE|ESCALATE",
"actions": [ { "tool": "<tool>", "args": { ... }, "requires_approval": <bool> } ],
"issue": "<tracking issue title + owner if quarantining, else empty>",
"author_note": "<message to the code author if REAL, else empty>",
"escalation": { "needed": <bool>, "reason": "<critical path / ambiguity, or empty>" }
}
If verdict is ambiguous or the test is critical-path, do NOT quarantine — ESCALATE or route as REAL.Simulate run
Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute — no API calls, nothing leaves your browser.
Frontend preview only — no data leaves your browser. Tip: press ⌘/Ctrl + Enter to run.
Setup guide
Install and connect CI
Install the agent and connect it to your CI and test reporting.
pipx install flaky-triage-agent flaky-triage-agent connect --ci github-actions --reports junit flaky-triage-agent doctor # verifies isolated re-run capability
Configure confirmation and caps
How many isolated re-runs, and how much evidence is needed to call flaky. Enforced outside the model.
cp .env.example .env
ANTHROPIC_API_KEY=sk-ant-...
MAX_RERUNS=5
FLAKY_NEEDS: { isolated_pass: true, history_intermittent: true }
MODE=advise # advise (recommend) | act (auto-quarantine proven flakes)Mark critical-path suites
These never auto-quarantine, even if they look flaky.
# .flaky.yml critical_paths: ["tests/payments/**", "tests/auth/**", "tests/data_integrity/**"] quarantine: skip_with_issue # never delete assign_owner_from: CODEOWNERS
Dry-run on recent failures
Replay recent red builds to check verdicts before enabling auto-quarantine.
flaky-triage-agent backtest --range 14d --explain # reports flaky/real accuracy and any regressions it would have masked (must be 0)
Wire into the pipeline
Trigger on test failures. Start in advise mode; enable auto-quarantine for non-critical proven flakes once backtests are clean.
# CI failure hook -> POST https://your-host/flaky/triage (HMAC) # promote MODE=act after a clean backtest with zero masked regressions
Architecture
Tools required
Workflow
1. Intake the failure
Pull the failing test, logs, and the diff under test from CI.
2. Re-run in isolation
Re-run just that test (within the cap) to see if the failure reproduces deterministically or comes and goes.
3. Check history & change correlation
Pull pass/fail history and check whether the failure lines up with a recent change to the covered code.
4. Look for a flakiness signature
Analyze for timing/ordering/shared-state/environment causes that explain instability.
5. Decide on evidence
Call it flaky only with positive evidence; if it reproduces or correlates with a change, it's real; if unclear or critical-path, escalate.
6. Act through the gate
Quarantine proven flakes (skip-with-issue) to unblock, route real failures to the author, and never quarantine critical-path tests automatically.
7. Track & learn
File the tracking issue, update the flake registry, and feed outcomes back so repeat flakes and misclassifications are caught.
Examples
Intermittent, timing-dependent test → quarantine as flaky
A test fails once, passes on isolated re-run, and history shows it flips pass/fail with no related code change — a classic race.
Input
Failure: test_async_notify failed in CI. No changes to the notify module in this PR.
Output
{
"test": "test_async_notify",
"verdict": "flaky",
"confidence": 0.9,
"evidence": ["isolated re-run: passed 5/5", "history: failed 6 of last 80 runs across unrelated PRs", "logs show it asserts before an async callback completes (race)", "no change to the notify module in this PR"],
"signature": "timing",
"critical_path": false,
"decision": "QUARANTINE_FLAKY",
"actions": [
{ "tool": "quarantine_test", "args": { "test": "test_async_notify", "mode": "skip_with_issue" }, "requires_approval": false },
{ "tool": "file_issue", "args": { "title": "Flaky: test_async_notify (async race)", "owner": "@notify-team" }, "requires_approval": false }
],
"issue": "Flaky: test_async_notify (async race) — assigned @notify-team",
"author_note": "",
"escalation": { "needed": false, "reason": "" }
}Note: Strong flaky evidence: passes 5/5 in isolation, intermittent history across unrelated PRs, a clear timing/race signature, and no related code change. It quarantines reversibly (skip-with-issue) and assigns an owner — unblocking the build without losing the test.
Reproduces on re-run + correlates with the change → REAL, don't quarantine
The failure looks like it could be flaky, but it reproduces every re-run and the PR changed exactly the code under test. The agent refuses to quarantine.
Input
Failure: test_discount_calc failed. This PR modifies pricing/discount.py.
Output
{
"test": "test_discount_calc",
"verdict": "real",
"confidence": 0.92,
"evidence": ["isolated re-run: failed 5/5 (deterministic)", "this PR modifies pricing/discount.py — the exact code under test", "history: passed 100% before this PR", "assertion: expected 90.0, got 99.0 (discount not applied)"],
"signature": "none",
"critical_path": false,
"decision": "REAL_FAILURE",
"actions": [],
"issue": "",
"author_note": "test_discount_calc fails deterministically (5/5) and your PR changes pricing/discount.py — expected 90.0 but got 99.0, so the 10% discount isn't being applied. This looks like a real regression, not a flake. Please check the discount calculation before merging.",
"escalation": { "needed": false, "reason": "" }
}Note: The core safety behavior: it would be tempting (and harmful) to quarantine a red test to unblock the PR. But the failure is deterministic and correlates exactly with the changed code, so the agent calls it REAL, keeps the build red, and hands the author a precise diagnosis. Quarantining here would have shipped a pricing bug.
Looks flaky but it's a payments test → escalate, never auto-quarantine
A test on the payments path shows intermittent behavior. Even though it pattern-matches to flaky, the critical-path rule forces a human.
Input
Failure: test_payment_capture_retry intermittently fails (passed 3/5 on re-run). Covers the payment capture/retry path.
Output
{
"test": "test_payment_capture_retry",
"verdict": "ambiguous",
"confidence": 0.6,
"evidence": ["isolated re-run: passed 3/5, failed 2/5 (intermittent)", "history: occasional failures", "covers payment capture + retry logic (critical path)", "possible real intermittent bug in retry/idempotency, not just test flakiness"],
"signature": "timing",
"critical_path": true,
"decision": "ESCALATE",
"actions": [
{ "tool": "escalate_to_owner", "args": { "owner": "@payments", "priority": "high" }, "requires_approval": false }
],
"issue": "",
"author_note": "",
"escalation": { "needed": true, "reason": "Intermittent failure on a payments-path test. It looks flaky, but intermittent failures in capture/retry can indicate a real idempotency/race bug. Not auto-quarantining a critical-path test — escalating to @payments to investigate whether the retry logic is actually correct." }
}Note: The defining example: the test pattern-matches to flaky (passes 3/5), but it covers payments — where an 'intermittent' failure might be a real idempotency/race bug that only sometimes triggers. Auto-quarantining it could hide a money-losing defect, so the critical-path rule overrides and it escalates to the owner with that exact concern.
Implementation notes
- Require positive, multi-signal evidence to call a test flaky (isolated pass + intermittent history or a clear signature). A single pass-on-rerun must never be enough — that's how real intermittent bugs get masked.
- Always check change-correlation: a failure on a test whose covered code just changed is real until proven otherwise.
- Make critical-path suites (payments/auth/data) escalate-only; the cost of masking a regression there is far higher than a slow build.
- Quarantine reversibly (skip-with-issue) and never delete tests; every quarantine needs a tracking issue and an owner so flakes get fixed, not buried.
- Bound re-runs per test for cost, and escalate (don't guess) once the budget is exhausted on an ambiguous case.
- Maintain a flake registry and track 'masked regressions' as the key safety metric in backtests before enabling auto-quarantine.
- Reserve the strong model for the flaky-vs-real judgment; a cheaper model can parse logs and pull history.
Variations
Basic
Flake classifier
Re-runs the failure, analyzes history and signatures, and reports a flaky/real verdict with evidence and a recommendation for an engineer. No auto-quarantine.
Advanced
Guarded auto-quarantine
Auto-quarantines proven non-critical flakes (skip-with-issue + owner) to unblock CI, routes real failures to authors, and escalates critical-path and ambiguous cases.
Enterprise
Org-wide flake management
Adds a cross-repo flake registry, trend analytics, CODEOWNERS routing, quarantine SLAs and auto-expiry, and threshold tuning from masked-regression metrics at scale.
Download the Agent Blueprint
Export
This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
Frequently asked questions
No — that's its primary safety constraint. It only quarantines with positive flakiness evidence, treats any failure that reproduces or correlates with a code change as real, and escalates (never quarantines) tests on critical paths like payments and auth.
It re-runs the failing test in isolation, checks its pass/fail history, looks for flakiness signatures (timing, ordering, shared state, environment), and correlates against recent code changes. Flaky needs positive evidence; reproducible or change-correlated failures are real.
Never. Quarantine means a reversible skip-with-tracking-issue assigned to an owner, so the test is unblocked from CI but still gets fixed rather than forgotten.
Tests covering payments, auth, or data integrity are escalate-only — it won't auto-quarantine them even if they look flaky, because an 'intermittent' failure there can be a real bug.
It re-runs only the single failing test in isolation, up to a configured cap, rather than the whole suite — and if a case is still ambiguous after the cap, it escalates instead of burning more runs.
Start in advise mode where it only recommends, backtest on recent failures to confirm it would mask zero regressions, then enable auto-quarantine for non-critical proven flakes.