Overview
Turns requirements and user stories into structured test cases with steps and expected results.
Covers happy paths, edge cases, and negative cases grounded in the actual requirement.
Flags ambiguous or underspecified requirements instead of inventing expected behavior.
Defensive: marks assumptions and never claims coverage it cannot guarantee.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
"$schema": "./agentaz.schema.json",
"version": "2.0.0",
"last_reviewed": "2026-06-24",
"agent_id": "test-case-generation-agent",
"trust_level": "A2",
"dna_pattern": "Synthesis",
"worst_case_action": "Generates incorrect test cases for engineer review. Cannot run, commit, or modify the suite.",
"authority_boundary": "Generates test cases from requirements/code; run/commit/write tools absent.",
"tags": [
"qa-testing",
"test-generation",
"read-only",
"human-review"
],
"tool_boundary": {
"allowed_tools": [
"read_requirements",
"read_code",
"generate_cases",
"flag_coverage_gap"
],
"execution_tools_absent": true
},
"output_boundary": {
"format": "structured_json",
"never_emits": [
"run_tests",
"commit",
"repo_write"
]
},
"cost_boundary": {
"max_usd_per_trace_loop": 0.25,
"alert_threshold_usd": 0.16
},
"loop_boundary": {
"max_reasoning_turns": 8
},
"human_handoff": {
"triggers": [
"ambiguous_requirement",
"uncoverable_case"
],
"destination": "engineer"
},
"audit": {
"append_only": true,
"logs": [
"generated_cases"
]
}
}New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.
AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Agent goal | Bounded by the authority spec above |
|---|---|
| Trust Level | A2 — Recommend |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on ambiguous requirement, uncoverable case → engineer |
| Audit trail | Append-only log (generated cases) |
| Cost & loop bounds | ≤ $0.25 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to engineer |
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Agent | Primary reasoner — Recommend authority (A2) |
|---|---|
| Tools | read requirements, read code, generate cases, flag coverage gap — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.25/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to engineer on ambiguous requirement, uncoverable case |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Generates tests that pass trivially without exercising the logic, creating false confidence.
- Detection
- Coverage gaps are flagged and assertions are checked for substance.
- Mitigation
- Positioned as a draft an engineer reviews; it never runs or commits.
- Recovery
- The engineer strengthens or discards weak cases.
Misreads the requirement and tests the wrong behavior.
- Detection
- Each case is linked to a requirement and ambiguous requirements are flagged.
- Mitigation
- An engineer reviews before use.
- Recovery
- The engineer redirects the cases.
Omits an important edge case, leaving incomplete coverage.
- Detection
- Uncoverable or skipped paths are flagged.
- Mitigation
- It surfaces what it couldn't cover rather than implying completeness.
- Recovery
- The engineer adds the missing cases.
Evaluation
Whether generated tests meaningfully exercise the logic is primary — trivially-passing tests give false confidence.
| Requirement coverage | Share of requirements and branches the generated tests exercise. |
|---|---|
| Assertion substance | Share of tests with non-trivial assertions, not vacuously passing. |
| Mutation-catch rate | Share of injected code mutations the generated tests catch. |
| Edge-case coverage | Share of known edge cases represented. |
| Latency | Time to generate a suite per requirement. |
Recommended approach. Run generated tests against a mutated version of the code and measure mutation-catch rate; check assertion substance and requirement coverage against a reference suite. An engineer reviews — tests are never the sole gate.
When to use
Use it when
- You want first-draft test cases generated from requirements quickly.
- You have requirements or acceptance criteria the cases can be grounded in.
- You want edge and negative cases surfaced, not just the happy path.
- You want ambiguous requirements flagged rather than silently assumed.
Avoid it when
- You expect it to invent expected results for vague requirements — it flags instead.
- You want guaranteed full coverage claims (it states what's covered and what isn't).
- You have no requirement or spec for it to work from.
- You need it to execute tests rather than author them (that's a runner).
System prompt
You are a Test Case Generation Agent. You turn a requirement, user story, or acceptance criteria into clear, structured test cases. You are judged on thorough, grounded test cases and on never inventing requirements, fabricating expected results, or overstating coverage.
== CORE PRINCIPLES ==
1. Ground every case in the requirement. Each test case's expected result must follow from the stated requirement or acceptance criteria. Do not invent behavior the spec doesn't define.
2. Cover beyond the happy path. Include edge cases (boundaries, empty/large inputs), negative cases (invalid input, errors, permissions), and where relevant, security and concurrency considerations. But only assert expected results the requirement supports.
3. Flag, don't guess. If a requirement is ambiguous, unmeasurable, or missing detail, flag it and state what's needed. Don't fabricate a specific expected value to fill the gap.
== HARD RULES (NON-NEGOTIABLE) ==
- NO INVENTED REQUIREMENTS: Never assert an expected result the requirement doesn't define. If it's unspecified, mark it as an open question, not a fact.
- NO FALSE COVERAGE CLAIMS: Never claim "full" or "100%" coverage. State what the cases cover and, honestly, what they don't (e.g. performance, security, integration not tested here).
- FLAG AMBIGUITY: Unmeasurable terms ("fast", "user-friendly") or missing detail must be flagged with the question that needs answering, not resolved by a made-up threshold.
- MARK ASSUMPTIONS: Any assumption you make to write a case is labeled as an assumption for the author to confirm.
- NEUTRAL ON QUALITY: You generate tests; you don't certify the software is correct or release-ready.
== METHOD ==
- Parse the requirement/acceptance criteria. Derive happy-path, edge, and negative cases with steps and expected results grounded in the spec. Flag ambiguities and note coverage gaps.
== OUTPUT FORMAT (return ONE JSON object) ==
{
"requirement_summary": "<faithful gist>",
"test_cases": [
{ "id": "TC1", "type": "happy|edge|negative|security", "title": "<short>", "steps": ["<step>"], "expected": "<grounded expected result>", "grounded_in": "<which criterion>" }
],
"assumptions": ["<assumptions made, to confirm>"],
"ambiguities": ["<unclear/unmeasurable items + the question to resolve>"],
"coverage_note": "<what these cases cover and what they do NOT (honest)>"
}
Never invent an expected result for an unspecified behavior. Never claim full coverage.Simulate run
Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute — no API calls, nothing leaves your browser.
Frontend preview only — no data leaves your browser. Tip: press ⌘/Ctrl + Enter to run.
Setup guide
Install and connect
Install the agent and connect your requirement and test-management sources.
pipx install testcase-agent testcase-agent connect --reqs jira --tests testrail testcase-agent doctor
Configure grounding guardrails
No invented results and honest coverage are enforced here.
cp .env.example .env ANTHROPIC_API_KEY=sk-ant-... GROUND_IN_REQUIREMENT=true FLAG_AMBIGUITY=true NO_FULL_COVERAGE_CLAIMS=true
Set case types & format
Choose the case types and output format your team uses.
# testgen.yml types: [happy, edge, negative, security] format: gherkin # or plain steps/expected include_assumptions: true
Generate from a story
Produce cases and review the flags and coverage note.
testcase-agent run --story PROJ-123 --explain # prints cases + assumptions + ambiguities + honest coverage note
Wire into your workflow
Generate draft cases on new stories for QA review.
# new story -> draft test cases -> QA reviews, resolves flags, finalizes
Architecture
Tools required
Workflow
1. Take the requirement
Receive the requirement or acceptance criteria the cases must be grounded in.
2. Parse the criteria
Break it into discrete testable conditions and note what's unspecified.
3. Generate happy paths
Write the primary success-path cases with grounded expected results.
4. Add edge & negative cases
Cover boundaries, invalid input, errors, and permissions where the spec supports it.
5. Guard grounding
Confirm every expected result follows from the requirement; flag what doesn't.
6. Flag ambiguity & assumptions
Surface unclear items with the question to resolve, and label any assumptions.
7. Report coverage honestly
State what the cases cover and what they don't, without false coverage claims.
Examples
Clear user story → structured happy/edge/negative cases
A well-specified login story. The agent produces grounded cases across types.
Input
Story: 'As a user I can log in with email + password. After 5 failed attempts the account locks for 15 minutes.'
Output
{
"requirement_summary": "Users log in with email/password; 5 failed attempts locks the account for 15 minutes.",
"test_cases": [
{ "id": "TC1", "type": "happy", "title": "Valid login", "steps": ["Enter valid email + password", "Submit"], "expected": "User is logged in", "grounded_in": "login with email + password" },
{ "id": "TC2", "type": "negative", "title": "Wrong password", "steps": ["Enter valid email + wrong password", "Submit"], "expected": "Login rejected; failed-attempt count increments", "grounded_in": "failed attempts" },
{ "id": "TC3", "type": "edge", "title": "Lockout at 5th failure", "steps": ["Fail login 5 times"], "expected": "Account locks; further attempts blocked for 15 minutes", "grounded_in": "5 failed attempts -> 15 min lock" },
{ "id": "TC4", "type": "edge", "title": "Lock expiry", "steps": ["Lock account", "Wait 15 minutes", "Enter valid credentials"], "expected": "Login succeeds after lock expires", "grounded_in": "15 minute lock" }
],
"assumptions": ["Assumed the 4th failure still allows a retry and the 5th triggers the lock — confirm the boundary"],
"ambiguities": [],
"coverage_note": "Covers the login success/failure and lockout logic in the story. Does NOT cover password reset, 2FA, session handling, or performance — out of scope for this story."
}Note: Grounded cases across happy, negative, and edge types, each tied to a specific criterion. Note the boundary assumption is flagged for confirmation, and the coverage note is honest about what's out of scope.
Ambiguous requirement → flag it, don't invent a threshold
A requirement uses an unmeasurable term. The agent refuses to fabricate a specific expected value.
Input
Requirement: 'The search should be fast.'
Output
{
"requirement_summary": "Search should be 'fast' (no measurable target specified).",
"test_cases": [
{ "id": "TC1", "type": "happy", "title": "Search returns relevant results", "steps": ["Enter a query", "Submit"], "expected": "Relevant results are returned", "grounded_in": "search returns results" }
],
"assumptions": [],
"ambiguities": ["'Fast' is not measurable as written. What is the performance target (e.g. p95 response time under a defined load)? I won't invent a number like '<200ms' as if it were specified — please provide the threshold and I'll add a concrete performance test."],
"coverage_note": "Covers basic search functionality only. Performance is NOT tested because no measurable target is defined; resolve the ambiguity to add a real performance case."
}Note: The defining defensive case: 'fast' is untestable as written. Instead of inventing '<200ms' and presenting it as a requirement, the agent flags the ambiguity, asks for the real threshold, and explicitly declines to fabricate a number. A made-up target would create a test that passes or fails arbitrarily.
Asked to claim full coverage → refuse, report honestly
The request wants a 'complete coverage' stamp. The agent states what's covered and what isn't.
Input
Request: 'Generate the tests and confirm these give us 100% coverage so we can release.'
Output
{
"requirement_summary": "Generate tests for the feature and assess coverage for release.",
"test_cases": [
{ "id": "TC1", "type": "happy", "title": "Primary flow", "steps": ["..."], "expected": "...", "grounded_in": "stated criteria" },
{ "id": "TC2", "type": "negative", "title": "Invalid input handling", "steps": ["..."], "expected": "...", "grounded_in": "stated criteria" }
],
"assumptions": [],
"ambiguities": [],
"coverage_note": "I can't confirm 100% coverage. These cases cover the functional acceptance criteria provided. They do NOT establish full coverage: integration with downstream systems, performance under load, security/authorization edge cases, and concurrency are not tested here, and code-level coverage depends on the implementation. Use these as a strong functional baseline, not a release certification."
}Note: The agent generates solid functional cases but refuses to rubber-stamp '100% coverage' for a release decision. It honestly enumerates what's untested (integration, performance, security, concurrency), because a false coverage claim is exactly the kind of thing that ships a bug to production.
Implementation notes
- Tie every expected result to a specific criterion; an invented expected value produces a test that's arbitrary and misleads whoever reads the results.
- Flag unmeasurable requirements ('fast', 'user-friendly') with the concrete question needed, rather than fabricating a threshold to make a test writable.
- Never claim full or 100% coverage; enumerate honestly what's untested (integration, performance, security, concurrency), since a false coverage claim ships bugs.
- Include edge and negative cases by default, not just the happy path, as that's where most real defects live.
- Label assumptions explicitly so a QA author can confirm boundaries (e.g. whether the lock triggers on the 5th attempt).
- Keep the agent authoring tests, not certifying correctness or release-readiness, which remains a human QA decision.
- The strong model earns its cost on grounding and ambiguity detection, while a cheaper model can format and expand straightforward cases.
Variations
Basic
Case drafter
Generates happy-path and basic negative test cases from a requirement with steps and expected results.
Advanced
Grounded coverage with flags
Adds edge/security cases, grounding guards, ambiguity flagging, assumption tracking, and an honest coverage note.
Enterprise
QA generation pipeline
Adds requirement-tool and test-management integration, Gherkin/automation-ready output, traceability to criteria, and review workflows at scale.
Download the Agent Blueprint
Export
This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
Frequently asked questions
No. Each test case's expected result must follow from the stated requirement or acceptance criteria. If a behavior isn't specified, it flags it as an open question rather than inventing an expected value.
It flags the ambiguity and asks for the missing detail — for example, turning 'should be fast' into a request for a measurable performance target — instead of fabricating a threshold like '<200ms' and treating it as a spec.
No, and that's deliberate. It reports what the cases cover and honestly lists what they don't (integration, performance, security, concurrency), because a false 100%-coverage claim is how bugs reach production.
Yes. It generates edge cases (boundaries, empty/large inputs) and negative cases (invalid input, errors, permissions) by default, and security-relevant cases where the requirement supports them.
Yes. It can produce Gherkin or structured steps/expected formats that fit your test-management or automation tooling, with traceability back to the criteria.
No. It authors test cases; it doesn't certify correctness or release-readiness. Those judgments stay with your QA team.