AgentKits

Flaky Test Triage Agent

Production Blueprint
0New

Includes Agent Blueprint + Implementation Guide

An agent that triages CI test failures and answers the question every engineer hates guessing: is this a flaky test or a real bug? It re-runs the failure in isolation, weighs the test's failure history and patterns, and only quarantines a test once it has evidence the test — not the code — is at fault. It is defensive by design: it never quarantines a test that might be catching a real regression, requires repeated confirmation before calling something flaky, keeps re-runs cost-bounded, never deletes tests, and escalates anything ambiguous or touching critical paths to the owner.

flaky-testsci-cdqatestingdeveloper-toolsautonomous-agenttest-quarantinedevexagentazagent-governancetrust-levelproduction-readiness
StackClaude, LangGraph, OpenAI
DifficultyAdvanced
Setup45 min
Version2.0.0 · 2026-06-21

Overview

Re-run → analyze history → classify → quarantine or route: turns a red build into a clear 'flake vs. real bug' verdict.

Evidence-based: it isolates the failure and weighs failure history and patterns (timing, ordering, environment) before judging.

Quarantines proven flakes behind a gate to unblock the pipeline — and routes real regressions straight to the code author.

Defensive: it never masks a real bug, needs repeated confirmation to call a test flaky, and escalates ambiguous or critical-path failures.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level ?A2 — Recommend
DNA PatternEvaluation (Research → Evaluate)
Worst-Case ActionLabels a real failure as flaky (or vice versa), surfaced for an engineer to review. It cannot quarantine, disable, skip, or rerun tests autonomously — those tools are absent from its registry.
Authority BoundaryAnalyzes test history to identify likely-flaky tests with supporting evidence and recommends action. An engineer decides. It never disables, quarantines, skips, or reruns tests on its own.
Verification TestAttempt to call a disable-test, quarantine, or skip tool → confirm it is absent from the agent's registry.
Production Readiness6/6 dimensions passing. Tool isolation: disable/quarantine tools absent. Human gates: an engineer decides. Confidence escalation: ambiguous cases flagged, never auto-quarantined. Cost ceiling: bounded per analysis. Audit trail: evidence and recommendations logged. Escalation path: suspected real failures flagged.
Last Reviewed2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "flaky-test-triage-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Mislabels a failure as flaky for engineer review. Cannot disable, quarantine, or skip tests.",
  "authority_boundary": "Identifies likely-flaky tests and recommends action; disable/quarantine tools absent.",
  "tags": [
    "qa-testing",
    "flaky-tests",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_test_history",
      "detect_flakiness",
      "gather_evidence",
      "recommend"
    ],
    "execution_tools_absent": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "disable_test",
      "quarantine",
      "skip",
      "rerun"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.2,
    "alert_threshold_usd": 0.14
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "ambiguous_case",
      "suspected_real_failure"
    ],
    "destination": "engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "evidence",
      "recommendations"
    ]
  }
}

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goalBounded by the authority spec above
Trust LevelA2 — Recommend
Tool accessLeast privilege — execution tools absent (read-only)
Context handlingGrounded in provided inputs; cites or flags rather than guessing
Memory strategyTask-scoped; no persistent cross-session memory
Human approvalRequired on ambiguous case, suspected real failure → engineer
Audit trailAppend-only log (evidence, recommendations)
Cost & loop bounds≤ $0.2 per loop · ≤ 8 reasoning turns
Recovery / escalationEscalates to engineer

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

AgentPrimary reasoner — Recommend authority (A2)
Toolsread test history, detect flakiness, gather evidence, recommend — execution tools absent (read-only)
MemoryTask-scoped working context; no persistent cross-session memory
GuardrailsWorst-case classified (A2); no execution tools; ≤ $0.2/loop · ≤ 8 turns
EvaluatorConfidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
HandoffEscalates to engineer on ambiguous case, suspected real failure

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Labels a real failure as flaky, hiding a genuine bug.

Detection
Ambiguous cases are flagged and evidence is required per verdict.
Mitigation
It recommends only — it cannot disable, quarantine, or skip tests.
Recovery
An engineer reviews; suspected real failures are surfaced.

Labels a flaky test as a real failure, wasting triage time.

Detection
Flakiness signals such as intermittency and environment-dependence are weighed with confidence.
Mitigation
Low-confidence verdicts are flagged, not asserted.
Recovery
The engineer corrects the label.

Acts on insufficient history — too few runs to judge.

Detection
A minimum-sample check runs before a verdict.
Mitigation
Thin history yields 'insufficient data', not a guess.
Recovery
More runs are gathered before judging.

Evaluation

Correctly separating flaky from real failures is primary — labeling a real failure as flaky hides a bug.

Classification accuracyShare of failures correctly labeled flaky versus real.
Real-failure recallOf genuine failures, the share NOT mislabeled as flaky — the critical side.
Flaky precisionOf failures labeled flaky, the share that truly are.
Insufficient-data rateShare of thin-history cases correctly marked 'insufficient data' rather than guessed.
LatencyTime to a verdict.

Recommended approach. Use a labeled history of test runs with known flaky-versus-real outcomes; measure real-failure recall as the critical metric and flaky precision as the noise side. Include thin-history cases to test the abstain behavior. It never disables or quarantines tests.

When to use

Use it when

  • Flaky tests are eroding trust in your CI and engineers are re-running builds or ignoring reds.
  • You have CI history and the ability to re-run individual tests in isolation.
  • You want consistent, evidence-based flaky-vs-real classification instead of gut calls.
  • You want proven flakes quarantined (with an issue) to unblock merges, while real regressions still reach the author.

Avoid it when

  • You can't re-run individual tests or lack failure history — classification would be guessing.
  • You want it to delete tests or permanently silence failures; it only quarantines reversibly, with a tracking issue.
  • Critical-path suites (payments/auth) where you're unwilling to keep human review on quarantines.
  • Your 'flaky' tests are actually catching real intermittent bugs you haven't investigated.

System prompt

system-prompt.md
You are a Flaky Test Triage Agent for a CI pipeline. For ONE failing test, you determine whether it is FLAKY (the test is unreliable) or a REAL failure (the code is broken), and act accordingly. You are judged on correctly unblocking pipelines AND on never masking a real regression by mislabeling it flaky.

== CORE PRINCIPLES ==
1. Evidence before verdict. Do not call a test flaky from a single failure. Re-run it in isolation, check its failure history, and look for flakiness signatures (timing/race, test-ordering dependence, shared state, environment/network). Cite the evidence.
2. Bias toward protecting signal. A flaky test is an annoyance; a masked regression is a shipped bug. When evidence is mixed, treat the failure as potentially REAL and route it to a human, not to quarantine.
3. Quarantine is reversible and tracked. You never delete a test. Quarantine means skip-with-a-tracking-issue so it gets fixed, not forgotten.

== HARD RULES (NON-NEGOTIABLE) ==
- CONFIRMATION REQUIRED: Classify FLAKY only with positive evidence — e.g. the test passes on isolated re-run AND history shows intermittent pass/fail without a correlating code change, or a clear flakiness signature. A single pass-on-rerun is not enough on its own.
- NEVER MASK A REGRESSION: If the failure reproduces consistently on re-run, or correlates with a recent change to the code under test, it is REAL — do not quarantine; route to the author.
- CRITICAL PATHS ESCALATE: For tests covering security, auth, payments, or data integrity, do not auto-quarantine even if it looks flaky — escalate to the owner with your evidence.
- NO DELETION / BOUNDED COST: Never delete or rewrite tests. Cap the number of re-runs per test; if still ambiguous after the cap, escalate.
- TRACK EVERYTHING: Every quarantine creates a tracking issue with the evidence and an owner.

== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- QUARANTINE_FLAKY: positive flakiness evidence, not a critical path, confidence >= 0.85. Skip-with-issue to unblock; assign an owner to fix.
- REAL_FAILURE: reproduces on re-run or correlates with a recent code change. Route to the author; keep the build red.
- ESCALATE: ambiguous after the re-run budget, critical-path test, or conflicting evidence. Hand to the owner with findings.

== COST CONTROL ==
Re-run only the failing test (isolated), not the whole suite, up to the configured cap. Reuse history already pulled. Stop once you can decide.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "test": "<test id/name>",
  "verdict": "flaky|real|ambiguous",
  "confidence": <0.0-1.0>,
  "evidence": ["<isolated re-run results, history, flakiness signature, change correlation>"],
  "signature": "<timing|ordering|shared_state|environment|none>",
  "critical_path": <bool>,
  "decision": "QUARANTINE_FLAKY|REAL_FAILURE|ESCALATE",
  "actions": [ { "tool": "<tool>", "args": { ... }, "requires_approval": <bool> } ],
  "issue": "<tracking issue title + owner if quarantining, else empty>",
  "author_note": "<message to the code author if REAL, else empty>",
  "escalation": { "needed": <bool>, "reason": "<critical path / ambiguity, or empty>" }
}
If verdict is ambiguous or the test is critical-path, do NOT quarantine — ESCALATE or route as REAL.
Was this useful?

Simulate run

Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute — no API calls, nothing leaves your browser.

Frontend preview only — no data leaves your browser. Tip: press ⌘/Ctrl + Enter to run.

Setup guide

Install and connect CI

Install the agent and connect it to your CI and test reporting.

shell
pipx install flaky-triage-agent
flaky-triage-agent connect --ci github-actions --reports junit
flaky-triage-agent doctor   # verifies isolated re-run capability

Configure confirmation and caps

How many isolated re-runs, and how much evidence is needed to call flaky. Enforced outside the model.

shell
cp .env.example .env
ANTHROPIC_API_KEY=sk-ant-...
MAX_RERUNS=5
FLAKY_NEEDS: { isolated_pass: true, history_intermittent: true }
MODE=advise   # advise (recommend) | act (auto-quarantine proven flakes)

Mark critical-path suites

These never auto-quarantine, even if they look flaky.

shell
# .flaky.yml
critical_paths: ["tests/payments/**", "tests/auth/**", "tests/data_integrity/**"]
quarantine: skip_with_issue   # never delete
assign_owner_from: CODEOWNERS

Dry-run on recent failures

Replay recent red builds to check verdicts before enabling auto-quarantine.

shell
flaky-triage-agent backtest --range 14d --explain
# reports flaky/real accuracy and any regressions it would have masked (must be 0)

Wire into the pipeline

Trigger on test failures. Start in advise mode; enable auto-quarantine for non-critical proven flakes once backtests are clean.

shell
# CI failure hook -> POST https://your-host/flaky/triage (HMAC)
# promote MODE=act after a clean backtest with zero masked regressions

Architecture

Tools required

get_failureFetch the failing test, its logs, the commit/diff under test, and the CI job/environment context.
rerun_testRe-run the single failing test in isolation (up to a cap) to check whether the failure reproduces or is intermittent.
failure_historyReturn the test's recent pass/fail history and flakiness rate across branches and runs.
analyze_patternDetect flakiness signatures — timing/races, ordering dependence, shared state, environment/network sensitivity.
change_correlationCheck whether the failure correlates with a recent change to the code the test covers (a strong 'real' signal).
quarantine_testReversibly skip a proven-flaky test (skip-with-issue). Gated: rejects critical-path tests and unconfirmed flakes.
file_issueOpen a tracking issue for a quarantined flake (or a real failure) with the evidence and an assigned owner.
escalate_to_ownerRoute ambiguous or critical-path failures to the test/code owner with the collected evidence.

Workflow

  1. 1. Intake the failure

    Pull the failing test, logs, and the diff under test from CI.

  2. 2. Re-run in isolation

    Re-run just that test (within the cap) to see if the failure reproduces deterministically or comes and goes.

  3. 3. Check history & change correlation

    Pull pass/fail history and check whether the failure lines up with a recent change to the covered code.

  4. 4. Look for a flakiness signature

    Analyze for timing/ordering/shared-state/environment causes that explain instability.

  5. 5. Decide on evidence

    Call it flaky only with positive evidence; if it reproduces or correlates with a change, it's real; if unclear or critical-path, escalate.

  6. 6. Act through the gate

    Quarantine proven flakes (skip-with-issue) to unblock, route real failures to the author, and never quarantine critical-path tests automatically.

  7. 7. Track & learn

    File the tracking issue, update the flake registry, and feed outcomes back so repeat flakes and misclassifications are caught.

Examples

Intermittent, timing-dependent test → quarantine as flaky

A test fails once, passes on isolated re-run, and history shows it flips pass/fail with no related code change — a classic race.

Input

input
Failure: test_async_notify failed in CI. No changes to the notify module in this PR.

Output

output
{
  "test": "test_async_notify",
  "verdict": "flaky",
  "confidence": 0.9,
  "evidence": ["isolated re-run: passed 5/5", "history: failed 6 of last 80 runs across unrelated PRs", "logs show it asserts before an async callback completes (race)", "no change to the notify module in this PR"],
  "signature": "timing",
  "critical_path": false,
  "decision": "QUARANTINE_FLAKY",
  "actions": [
    { "tool": "quarantine_test", "args": { "test": "test_async_notify", "mode": "skip_with_issue" }, "requires_approval": false },
    { "tool": "file_issue", "args": { "title": "Flaky: test_async_notify (async race)", "owner": "@notify-team" }, "requires_approval": false }
  ],
  "issue": "Flaky: test_async_notify (async race) — assigned @notify-team",
  "author_note": "",
  "escalation": { "needed": false, "reason": "" }
}

Note: Strong flaky evidence: passes 5/5 in isolation, intermittent history across unrelated PRs, a clear timing/race signature, and no related code change. It quarantines reversibly (skip-with-issue) and assigns an owner — unblocking the build without losing the test.

Reproduces on re-run + correlates with the change → REAL, don't quarantine

The failure looks like it could be flaky, but it reproduces every re-run and the PR changed exactly the code under test. The agent refuses to quarantine.

Input

input
Failure: test_discount_calc failed. This PR modifies pricing/discount.py.

Output

output
{
  "test": "test_discount_calc",
  "verdict": "real",
  "confidence": 0.92,
  "evidence": ["isolated re-run: failed 5/5 (deterministic)", "this PR modifies pricing/discount.py — the exact code under test", "history: passed 100% before this PR", "assertion: expected 90.0, got 99.0 (discount not applied)"],
  "signature": "none",
  "critical_path": false,
  "decision": "REAL_FAILURE",
  "actions": [],
  "issue": "",
  "author_note": "test_discount_calc fails deterministically (5/5) and your PR changes pricing/discount.py — expected 90.0 but got 99.0, so the 10% discount isn't being applied. This looks like a real regression, not a flake. Please check the discount calculation before merging.",
  "escalation": { "needed": false, "reason": "" }
}

Note: The core safety behavior: it would be tempting (and harmful) to quarantine a red test to unblock the PR. But the failure is deterministic and correlates exactly with the changed code, so the agent calls it REAL, keeps the build red, and hands the author a precise diagnosis. Quarantining here would have shipped a pricing bug.

Looks flaky but it's a payments test → escalate, never auto-quarantine

A test on the payments path shows intermittent behavior. Even though it pattern-matches to flaky, the critical-path rule forces a human.

Input

input
Failure: test_payment_capture_retry intermittently fails (passed 3/5 on re-run). Covers the payment capture/retry path.

Output

output
{
  "test": "test_payment_capture_retry",
  "verdict": "ambiguous",
  "confidence": 0.6,
  "evidence": ["isolated re-run: passed 3/5, failed 2/5 (intermittent)", "history: occasional failures", "covers payment capture + retry logic (critical path)", "possible real intermittent bug in retry/idempotency, not just test flakiness"],
  "signature": "timing",
  "critical_path": true,
  "decision": "ESCALATE",
  "actions": [
    { "tool": "escalate_to_owner", "args": { "owner": "@payments", "priority": "high" }, "requires_approval": false }
  ],
  "issue": "",
  "author_note": "",
  "escalation": { "needed": true, "reason": "Intermittent failure on a payments-path test. It looks flaky, but intermittent failures in capture/retry can indicate a real idempotency/race bug. Not auto-quarantining a critical-path test — escalating to @payments to investigate whether the retry logic is actually correct." }
}

Note: The defining example: the test pattern-matches to flaky (passes 3/5), but it covers payments — where an 'intermittent' failure might be a real idempotency/race bug that only sometimes triggers. Auto-quarantining it could hide a money-losing defect, so the critical-path rule overrides and it escalates to the owner with that exact concern.

Implementation notes

  • Require positive, multi-signal evidence to call a test flaky (isolated pass + intermittent history or a clear signature). A single pass-on-rerun must never be enough — that's how real intermittent bugs get masked.
  • Always check change-correlation: a failure on a test whose covered code just changed is real until proven otherwise.
  • Make critical-path suites (payments/auth/data) escalate-only; the cost of masking a regression there is far higher than a slow build.
  • Quarantine reversibly (skip-with-issue) and never delete tests; every quarantine needs a tracking issue and an owner so flakes get fixed, not buried.
  • Bound re-runs per test for cost, and escalate (don't guess) once the budget is exhausted on an ambiguous case.
  • Maintain a flake registry and track 'masked regressions' as the key safety metric in backtests before enabling auto-quarantine.
  • Reserve the strong model for the flaky-vs-real judgment; a cheaper model can parse logs and pull history.

Variations

Basic

Flake classifier

Re-runs the failure, analyzes history and signatures, and reports a flaky/real verdict with evidence and a recommendation for an engineer. No auto-quarantine.

Advanced

Guarded auto-quarantine

Auto-quarantines proven non-critical flakes (skip-with-issue + owner) to unblock CI, routes real failures to authors, and escalates critical-path and ambiguous cases.

Enterprise

Org-wide flake management

Adds a cross-repo flake registry, trend analytics, CODEOWNERS routing, quarantine SLAs and auto-expiry, and threshold tuning from masked-regression metrics at scale.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Download Blueprint (.zip)
README.mdsystem-prompt.mdsetup-guide.mdtools.jsonworkflow.mdexamples.md.env.examplekit.jsonrun.pyLICENSENOTICEstarters/

Export

Generate a starter for your stack — all client-side, nothing leaves your browser.

ZIP

Starters use mock tools — swap in your integrations to deploy.

View the source on GitHub

This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

Frequently asked questions