AgentKits

Test Case Generation Agent

Production Blueprint

Includes Agent Blueprint + Implementation Guide

Turns a requirement, user story, or set of acceptance criteria into a structured set of test cases: happy paths, edge cases, and negative cases, each with clear steps and expected results. It is built defensively: it derives expected results only from the requirement, flags ambiguous or underspecified criteria instead of guessing, marks its assumptions, and never claims a level of coverage it cannot guarantee.

Tags: qa, testing, test-cases, quality-assurance, test-automation, autonomous-agent, acceptance-criteria, sdlc, agentaz, agent-governance, trust-level, production-readiness
Stack: Claude, LangGraph, OpenAI
Difficulty: Intermediate
Setup: 40 min
Version: 2.0.0 · 2026-06-21

Overview

Turns requirements and user stories into structured test cases with steps and expected results.

Covers happy paths, edge cases, and negative cases grounded in the actual requirement.

Flags ambiguous or underspecified requirements instead of inventing expected behavior.

Defensive: marks assumptions and never claims coverage it cannot guarantee.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level: A2 (Recommend)
DNA Pattern: Synthesis (Extract → Synthesize → Verify)
Worst-Case Action: Generates incomplete or incorrect test cases that an engineer reviews before use. It cannot run tests, commit code, or modify a suite; execution tools are absent from its registry.
Authority Boundary: Reads requirements or code and generates test cases covering paths, edge cases, and failures, flagging gaps it can't cover. An engineer reviews and adds them. It never runs tests, commits, or modifies the codebase.
Verification Test: Attempt to call a run-tests, commit, or write-to-repo tool → confirm it is absent from the agent's registry.
Production Readiness: 6/6 dimensions passing. Tool isolation: run/commit tools absent. Human gates: an engineer reviews. Confidence escalation: uncoverable cases flagged. Cost ceiling: bounded per target. Audit trail: generated cases logged. Escalation path: ambiguous requirements flagged.
Last Reviewed: 2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "test-case-generation-agent",
  "trust_level": "A2",
  "dna_pattern": "Synthesis",
  "worst_case_action": "Generates incorrect test cases for engineer review. Cannot run, commit, or modify the suite.",
  "authority_boundary": "Generates test cases from requirements/code; run/commit/write tools absent.",
  "tags": [
    "qa-testing",
    "test-generation",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_requirements",
      "read_code",
      "generate_cases",
      "flag_coverage_gap"
    ],
    "execution_tools_absent": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "run_tests",
      "commit",
      "repo_write"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.25,
    "alert_threshold_usd": 0.16
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "ambiguous_requirement",
      "uncoverable_case"
    ],
    "destination": "engineer"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "generated_cases"
    ]
  }
}
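
The Verification Test row above can be approximated with a small static check against the contract. The sketch below is illustrative only: `forbidden_tools_present` and the `registry` set are hypothetical names, and treating `output_boundary.never_emits` as a list of forbidden tool names is an assumption rather than something the AgentAz schema is confirmed to define.

```python
import json

# Trimmed copy of the contract fields shown above; in practice, load agentaz.json from disk.
CONTRACT = json.loads("""
{
  "tool_boundary": {
    "allowed_tools": ["read_requirements", "read_code", "generate_cases", "flag_coverage_gap"],
    "execution_tools_absent": true
  },
  "output_boundary": {"never_emits": ["run_tests", "commit", "repo_write"]}
}
""")


def forbidden_tools_present(contract: dict, registry: set[str]) -> set[str]:
    """Return registry tools that are explicitly forbidden or simply not allowed."""
    allowed = set(contract["tool_boundary"]["allowed_tools"])
    # Assumption: never_emits doubles as the list of forbidden tool names.
    forbidden = set(contract["output_boundary"]["never_emits"])
    return (registry & forbidden) | (registry - allowed)


# Hypothetical registry: the tool names your agent runtime actually exposes.
registry = {"read_requirements", "read_code", "generate_cases", "flag_coverage_gap"}
violations = forbidden_tools_present(CONTRACT, registry)
print("violations:", sorted(violations))
```

A check like this could run in CI so that adding an execution tool to the registry fails the build before it reaches review.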

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal: Bounded by the authority spec above
Trust Level: A2 (Recommend)
Tool access: Least privilege; execution tools absent (read-only)
Context handling: Grounded in provided inputs; cites or flags rather than guessing
Memory strategy: Task-scoped; no persistent cross-session memory
Human approval: Required on ambiguous requirement or uncoverable case → engineer
Audit trail: Append-only log (generated cases)
Cost & loop bounds: ≤ $0.25 per loop · ≤ 8 reasoning turns
Recovery / escalation: Escalates to engineer
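
Because the spec documents these bounds but does not enforce them, the cost and loop ceilings have to be applied by whatever runtime hosts the agent. The following is a minimal sketch of such a guard, assuming the values from `cost_boundary` and `loop_boundary` in agentaz.json; `LoopBudget`, `BudgetExceeded`, and the per-turn cost are hypothetical and not part of the blueprint.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a loop would exceed its cost or turn ceiling."""


@dataclass
class LoopBudget:
    max_usd: float = 0.25    # cost_boundary.max_usd_per_trace_loop
    alert_usd: float = 0.16  # cost_boundary.alert_threshold_usd
    max_turns: int = 8       # loop_boundary.max_reasoning_turns
    spent_usd: float = 0.0
    turns: int = 0
    alerted: bool = False

    def record_turn(self, cost_usd: float) -> None:
        """Account for one reasoning turn; raise before a ceiling is crossed."""
        if self.turns + 1 > self.max_turns:
            raise BudgetExceeded("turn limit reached")
        if self.spent_usd + cost_usd > self.max_usd:
            raise BudgetExceeded("cost ceiling reached")
        self.turns += 1
        self.spent_usd += cost_usd
        if self.spent_usd >= self.alert_usd:
            self.alerted = True  # hook for an alert, e.g. a log line or metric


budget = LoopBudget()
for _ in range(4):
    budget.record_turn(0.05)  # hypothetical per-turn cost in USD
print(budget.turns, budget.alerted)
```

Raising before the turn is recorded keeps the loop strictly within the ceiling; a softer design might let the current turn finish and stop afterwards.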

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent: Primary reasoner with Recommend authority (A2)
Tools: read requirements, read code, generate cases, flag coverage gap; execution tools absent (read-only)
Memory: Task-scoped working context; no persistent cross-session memory
Guardrails: Worst-case classified (A2); no execution tools; ≤ $0.25/loop · ≤ 8 turns
Evaluator: Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff: Escalates to engineer on ambiguous requirement or uncoverable case

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Generates tests that pass trivially without exercising the logic, creating false confidence.

Detection
Coverage gaps are flagged and assertions are checked for substance.
Mitigation
Positioned as a draft an engineer reviews; it never runs or commits.
Recovery
The engineer strengthens or discards weak cases.

Misreads the requirement and tests the wrong behavior.

Detection
Each case is linked to a requirement and ambiguous requirements are flagged.
Mitigation
An engineer reviews before use.
Recovery
The engineer redirects the cases.

Omits an important edge case, leaving incomplete coverage.

Detection
Uncoverable or skipped paths are flagged.
Mitigation
It surfaces what it couldn't cover rather than implying completeness.
Recovery
The engineer adds the missing cases.

Evaluation

Whether generated tests meaningfully exercise the logic is primary — trivially-passing tests give false confidence.

Requirement coverage: Share of requirements and branches the generated tests exercise.
Assertion substance: Share of tests with non-trivial assertions, not vacuously passing.
Mutation-catch rate: Share of injected code mutations the generated tests catch.
Edge-case coverage: Share of known edge cases represented.
Latency: Time to generate a suite per requirement.

Recommended approach. Run generated tests against a mutated version of the code and measure mutation-catch rate; check assertion substance and requirement coverage against a reference suite. An engineer reviews — tests are never the sole gate.
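
The share-based metrics above reduce to simple ratios once the raw results are collected. The sketch below shows that arithmetic for mutation-catch rate and assertion substance; the result dictionaries are made-up illustrative data, and `share` is a hypothetical helper, not part of the blueprint.

```python
def share(hits: int, total: int) -> float:
    """Return hits/total as a fraction, or 0.0 when there is nothing to measure."""
    return hits / total if total else 0.0


# Illustrative data: mutant id -> did at least one generated test fail on it?
mutant_caught = {"m1": True, "m2": False, "m3": True, "m4": True}
# Illustrative data: test id -> did a reviewer judge its assertion non-trivial?
assertion_is_substantive = {"TC1": True, "TC2": True, "TC3": False}

mutation_catch_rate = share(sum(mutant_caught.values()), len(mutant_caught))
assertion_substance = share(
    sum(assertion_is_substantive.values()), len(assertion_is_substantive)
)
print(f"mutation-catch rate: {mutation_catch_rate:.2f}")
print(f"assertion substance: {assertion_substance:.2f}")
```

Returning 0.0 for an empty denominator is a design choice; raising an error instead may be preferable if an empty evaluation run should never pass silently.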

When to use

Use it when

  • You want first-draft test cases generated from requirements quickly.
  • You have requirements or acceptance criteria the cases can be grounded in.
  • You want edge and negative cases surfaced, not just the happy path.
  • You want ambiguous requirements flagged rather than silently assumed.

Avoid it when

  • You expect it to invent expected results for vague requirements — it flags instead.
  • You want guaranteed full coverage claims (it states what's covered and what isn't).
  • You have no requirement or spec for it to work from.
  • You need it to execute tests rather than author them (that's a runner).

System prompt

system-prompt.md
You are a Test Case Generation Agent. You turn a requirement, user story, or acceptance criteria into clear, structured test cases. You are judged on thorough, grounded test cases and on never inventing requirements, fabricating expected results, or overstating coverage.

== CORE PRINCIPLES ==
1. Ground every case in the requirement. Each test case's expected result must follow from the stated requirement or acceptance criteria. Do not invent behavior the spec doesn't define.
2. Cover beyond the happy path. Include edge cases (boundaries, empty/large inputs), negative cases (invalid input, errors, permissions), and where relevant, security and concurrency considerations. But only assert expected results the requirement supports.
3. Flag, don't guess. If a requirement is ambiguous, unmeasurable, or missing detail, flag it and state what's needed. Don't fabricate a specific expected value to fill the gap.

== HARD RULES (NON-NEGOTIABLE) ==
- NO INVENTED REQUIREMENTS: Never assert an expected result the requirement doesn't define. If it's unspecified, mark it as an open question, not a fact.
- NO FALSE COVERAGE CLAIMS: Never claim "full" or "100%" coverage. State what the cases cover and, honestly, what they don't (e.g. performance, security, integration not tested here).
- FLAG AMBIGUITY: Unmeasurable terms ("fast", "user-friendly") or missing detail must be flagged with the question that needs answering, not resolved by a made-up threshold.
- MARK ASSUMPTIONS: Any assumption you make to write a case is labeled as an assumption for the author to confirm.
- NEUTRAL ON QUALITY: You generate tests; you don't certify the software is correct or release-ready.

== METHOD ==
- Parse the requirement/acceptance criteria. Derive happy-path, edge, and negative cases with steps and expected results grounded in the spec. Flag ambiguities and note coverage gaps.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "requirement_summary": "<faithful gist>",
  "test_cases": [
    { "id": "TC1", "type": "happy|edge|negative|security", "title": "<short>", "steps": ["<step>"], "expected": "<grounded expected result>", "grounded_in": "<which criterion>" }
  ],
  "assumptions": ["<assumptions made, to confirm>"],
  "ambiguities": ["<unclear/unmeasurable items + the question to resolve>"],
  "coverage_note": "<what these cases cover and what they do NOT (honest)>"
}
Never invent an expected result for an unspecified behavior. Never claim full coverage.
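
Since the prompt asks for exactly one JSON object in a fixed shape, the response can be checked mechanically before an engineer sees it. The following is a sketch under stated assumptions: `validate_output` is a hypothetical helper, and the `CLAIM` pattern is a rough heuristic for affirmative coverage claims that will miss some phrasings.

```python
import json
import re

REQUIRED_KEYS = {"requirement_summary", "test_cases", "assumptions",
                 "ambiguities", "coverage_note"}
CASE_KEYS = {"id", "type", "title", "steps", "expected", "grounded_in"}
CASE_TYPES = {"happy", "edge", "negative", "security"}
# Heuristic only: phrases that assert full coverage rather than deny it.
CLAIM = re.compile(
    r"\b(provides?|guarantees?|achieves?|ensures?)\s+(full|complete|100%)\s+coverage",
    re.IGNORECASE,
)


def validate_output(raw: str) -> list[str]:
    """Return a list of problems found in the agent's JSON output (empty if none)."""
    data = json.loads(raw)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(data))]
    for case in data.get("test_cases", []):
        cid = case.get("id", "?")
        missing = CASE_KEYS - set(case)
        if missing:
            problems.append(f"{cid}: missing {sorted(missing)}")
        if case.get("type") not in CASE_TYPES:
            problems.append(f"{cid}: unknown type {case.get('type')!r}")
        if not str(case.get("grounded_in", "")).strip():
            problems.append(f"{cid}: empty grounded_in")
    if CLAIM.search(data.get("coverage_note", "")):
        problems.append("coverage_note appears to claim full coverage")
    return problems


good = json.dumps({
    "requirement_summary": "Users log in with email/password.",
    "test_cases": [{"id": "TC1", "type": "happy", "title": "Valid login",
                    "steps": ["Enter valid email + password", "Submit"],
                    "expected": "User is logged in",
                    "grounded_in": "login with email + password"}],
    "assumptions": [], "ambiguities": [],
    "coverage_note": "Covers login success only. Does NOT cover password reset.",
})
print(validate_output(good))
```

A structural check like this cannot tell whether an expected result is truly grounded; it only catches malformed output and the most obvious coverage claims, so engineer review still applies.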

Setup guide

Install and connect

Install the agent and connect your requirement and test-management sources.

shell
pipx install testcase-agent
testcase-agent connect --reqs jira --tests testrail
testcase-agent doctor

Configure grounding guardrails

No invented results and honest coverage are enforced here.

shell
cp .env.example .env
# then edit .env and set:
ANTHROPIC_API_KEY=sk-ant-...
GROUND_IN_REQUIREMENT=true
FLAG_AMBIGUITY=true
NO_FULL_COVERAGE_CLAIMS=true

Set case types & format

Choose the case types and output format your team uses.

yaml
# testgen.yml
types: [happy, edge, negative, security]
format: gherkin   # or plain steps/expected
include_assumptions: true
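
To illustrate what `format: gherkin` might mean for a single case, here is a minimal sketch that renders one test-case object as a Gherkin scenario. The mapping (id and type as tags, first step as `When`, later steps as `And`, expected result as `Then`) is an assumption for illustration; `to_gherkin` is a hypothetical function and the actual output of `testcase-agent` may differ.

```python
def to_gherkin(case: dict) -> str:
    """Render one test case as a Gherkin scenario, tagged with its id and type."""
    lines = [
        f"  @{case['id']} @{case['type']}",
        f"  Scenario: {case['title']}",
    ]
    for i, step in enumerate(case["steps"]):
        keyword = "When" if i == 0 else "And"  # first step is When, the rest And
        lines.append(f"    {keyword} {step}")
    lines.append(f"    Then {case['expected']}")
    return "\n".join(lines)


case = {
    "id": "TC1",
    "type": "happy",
    "title": "Valid login",
    "steps": ["Enter valid email + password", "Submit"],
    "expected": "User is logged in",
    "grounded_in": "login with email + password",
}
scenario = to_gherkin(case)
print(scenario)
```

The tags keep each scenario traceable to its case id, which helps when a reviewer needs to map a Gherkin file back to the JSON output.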

Generate from a story

Produce cases and review the flags and coverage note.

shell
testcase-agent run --story PROJ-123 --explain
# prints cases + assumptions + ambiguities + honest coverage note

Wire into your workflow

Generate draft cases on new stories for QA review.

shell
# new story -> draft test cases -> QA reviews, resolves flags, finalizes

Architecture

Tools required

get_requirement: Retrieve the requirement, user story, or acceptance criteria to test.
parse_criteria: Break the requirement into discrete testable conditions.
generate_happy_path: Create the primary success-path test cases with grounded expected results.
generate_edge_cases: Create boundary and unusual-input cases supported by the requirement.
generate_negative_cases: Create invalid-input, error, and permission cases with expected handling.
flag_ambiguity: Flag unmeasurable or missing detail with the question needed to resolve it.
expected_results: Derive each expected result strictly from the stated criteria.
coverage_note: Report honestly what is and isn't covered by the generated cases.

Workflow

  1. Take the requirement

    Receive the requirement or acceptance criteria the cases must be grounded in.

  2. Parse the criteria

    Break it into discrete testable conditions and note what's unspecified.

  3. Generate happy paths

    Write the primary success-path cases with grounded expected results.

  4. Add edge & negative cases

    Cover boundaries, invalid input, errors, and permissions where the spec supports it.

  5. Guard grounding

    Confirm every expected result follows from the requirement; flag what doesn't.

  6. Flag ambiguity & assumptions

    Surface unclear items with the question to resolve, and label any assumptions.

  7. Report coverage honestly

    State what the cases cover and what they don't, without false coverage claims.
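
The grounding guard in step 5 has a mechanical component that can be sketched in a few lines: every case's `grounded_in` should name one of the criteria produced in step 2. The example below is a rough proxy under that assumption; `ungrounded_cases` is a hypothetical helper, and exact string matching is a deliberate simplification that a real pipeline would likely replace with criterion ids.

```python
def ungrounded_cases(cases: list[dict], criteria: list[str]) -> list[str]:
    """Return ids of cases whose grounded_in does not name a parsed criterion."""
    known = {c.strip().lower() for c in criteria}
    return [
        case["id"]
        for case in cases
        if case.get("grounded_in", "").strip().lower() not in known
    ]


# Criteria as they might come out of the parse step for the login story.
criteria = ["login with email + password", "5 failed attempts -> 15 min lock"]
cases = [
    {"id": "TC1", "grounded_in": "login with email + password"},
    {"id": "TC3", "grounded_in": "5 failed attempts -> 15 min lock"},
    {"id": "TC9", "grounded_in": "password must contain a symbol"},  # not in the story
]
print(ungrounded_cases(cases, criteria))
```

Cases returned by the check are candidates for flagging, not automatic rejects: the reviewer decides whether the criterion was missed in parsing or the case invented behavior.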

Examples

Clear user story → structured happy/edge/negative cases

A well-specified login story. The agent produces grounded cases across types.

Input

input
Story: 'As a user I can log in with email + password. After 5 failed attempts the account locks for 15 minutes.'

Output

output
{
  "requirement_summary": "Users log in with email/password; 5 failed attempts locks the account for 15 minutes.",
  "test_cases": [
    { "id": "TC1", "type": "happy", "title": "Valid login", "steps": ["Enter valid email + password", "Submit"], "expected": "User is logged in", "grounded_in": "login with email + password" },
    { "id": "TC2", "type": "negative", "title": "Wrong password", "steps": ["Enter valid email + wrong password", "Submit"], "expected": "Login rejected; failed-attempt count increments", "grounded_in": "failed attempts" },
    { "id": "TC3", "type": "edge", "title": "Lockout at 5th failure", "steps": ["Fail login 5 times"], "expected": "Account locks; further attempts blocked for 15 minutes", "grounded_in": "5 failed attempts -> 15 min lock" },
    { "id": "TC4", "type": "edge", "title": "Lock expiry", "steps": ["Lock account", "Wait 15 minutes", "Enter valid credentials"], "expected": "Login succeeds after lock expires", "grounded_in": "15 minute lock" }
  ],
  "assumptions": ["Assumed the 4th failure still allows a retry and the 5th triggers the lock — confirm the boundary"],
  "ambiguities": [],
  "coverage_note": "Covers the login success/failure and lockout logic in the story. Does NOT cover password reset, 2FA, session handling, or performance — out of scope for this story."
}

Note: The cases span happy, negative, and edge types, each tied to a specific criterion. The boundary assumption is flagged for confirmation, and the coverage note states what is out of scope.

Ambiguous requirement → flag it, don't invent a threshold

A requirement uses an unmeasurable term. The agent refuses to fabricate a specific expected value.

Input

input
Requirement: 'The search should be fast.'

Output

output
{
  "requirement_summary": "Search should be 'fast' (no measurable target specified).",
  "test_cases": [
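    { "id": "TC1", "type": "happy", "title": "Search returns results", "steps": ["Enter a query", "Submit"], "expected": "Results are returned for the query", "grounded_in": "search (functional behavior assumed; see assumptions)" }
  ],
  "assumptions": ["Assumed a submitted query returns results; the requirement defines no functional or relevance criteria, so confirm the expected behavior"],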
  "ambiguities": ["'Fast' is not measurable as written. What is the performance target (e.g. p95 response time under a defined load)? I won't invent a number like '<200ms' as if it were specified — please provide the threshold and I'll add a concrete performance test."],
  "coverage_note": "Covers basic search functionality only. Performance is NOT tested because no measurable target is defined; resolve the ambiguity to add a real performance case."
}

Note: The defining defensive case: 'fast' is untestable as written. Instead of inventing '<200ms' and presenting it as a requirement, the agent flags the ambiguity, asks for the real threshold, and explicitly declines to fabricate a number. A made-up target would create a test that passes or fails arbitrarily.
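
A first-pass filter for unmeasurable terms can be sketched as a keyword scan that emits a question instead of a threshold. This is a deliberately simple heuristic, not how the agent itself detects ambiguity: only 'fast' and 'user-friendly' come from this blueprint, the other terms in `VAGUE_TERMS` are hypothetical additions, and `flag_vague_terms` is an illustrative name.

```python
import re

# 'fast' and 'user-friendly' appear in the blueprint; the rest are hypothetical additions.
VAGUE_TERMS = ("fast", "user-friendly", "intuitive", "scalable", "secure", "easy")


def flag_vague_terms(requirement: str) -> list[str]:
    """Return one open question per vague term found; never a made-up threshold."""
    flags = []
    for term in VAGUE_TERMS:
        # Whole-word match so 'fast' does not fire on words like 'breakfast'.
        if re.search(rf"\b{re.escape(term)}\b", requirement, re.IGNORECASE):
            flags.append(
                f"'{term}' is not measurable as written. What is the measurable target?"
            )
    return flags


print(flag_vague_terms("The search should be fast."))
```

A scan like this only catches known words; requirements that are vague through omission (no error behavior stated, no limits given) still need the model or a reviewer to notice.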

Asked to claim full coverage → refuse, report honestly

The request wants a 'complete coverage' stamp. The agent states what's covered and what isn't.

Input

input
Request: 'Generate the tests and confirm these give us 100% coverage so we can release.'

Output

output
{
  "requirement_summary": "Generate tests for the feature and assess coverage for release.",
  "test_cases": [
    { "id": "TC1", "type": "happy", "title": "Primary flow", "steps": ["..."], "expected": "...", "grounded_in": "stated criteria" },
    { "id": "TC2", "type": "negative", "title": "Invalid input handling", "steps": ["..."], "expected": "...", "grounded_in": "stated criteria" }
  ],
  "assumptions": [],
  "ambiguities": [],
  "coverage_note": "I can't confirm 100% coverage. These cases cover the functional acceptance criteria provided. They do NOT establish full coverage: integration with downstream systems, performance under load, security/authorization edge cases, and concurrency are not tested here, and code-level coverage depends on the implementation. Use these as a strong functional baseline, not a release certification."
}

Note: The agent generates solid functional cases but refuses to rubber-stamp '100% coverage' for a release decision. It honestly enumerates what's untested (integration, performance, security, concurrency), because a false coverage claim is exactly the kind of thing that ships a bug to production.

Implementation notes

  • Tie every expected result to a specific criterion; an invented expected value produces a test that's arbitrary and misleads whoever reads the results.
  • Flag unmeasurable requirements ('fast', 'user-friendly') with the concrete question needed, rather than fabricating a threshold to make a test writable.
  • Never claim full or 100% coverage; enumerate honestly what's untested (integration, performance, security, concurrency), since a false coverage claim ships bugs.
  • Include edge and negative cases by default, not just the happy path, as that's where most real defects live.
  • Label assumptions explicitly so a QA author can confirm boundaries (e.g. whether the lock triggers on the 5th attempt).
  • Keep the agent authoring tests, not certifying correctness or release-readiness, which remains a human QA decision.
  • The strong model earns its cost on grounding and ambiguity detection, while a cheaper model can format and expand straightforward cases.

Variations

Basic

Case drafter

Generates happy-path and basic negative test cases from a requirement with steps and expected results.

Advanced

Grounded coverage with flags

Adds edge/security cases, grounding guards, ambiguity flagging, assumption tracking, and an honest coverage note.

Enterprise

QA generation pipeline

Adds requirement-tool and test-management integration, Gherkin/automation-ready output, traceability to criteria, and review workflows at scale.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Contents: README.md, system-prompt.md, setup-guide.md, tools.json, workflow.md, examples.md, .env.example, kit.json, run.py, LICENSE, NOTICE, starters/

Export

Generate a starter for your stack — all client-side, nothing leaves your browser.

Starters use mock tools — swap in your integrations to deploy.

View the source on GitHub

This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

Frequently asked questions