Will it guess a field it can't read?

No — that's its core constraint. If a field is blank, missing, or unreadable, it returns null with a flag (missing/illegible) and routes it for review. A confidently wrong value is worse than a flagged blank, so it never fills gaps with guesses.

Does it validate against my schema?

Yes. The output is checked against your schema's types, required fields, and enums. Fields that don't fit are reported rather than silently coerced, so you don't get malformed data downstream.

Will it 'fix' values it thinks are wrong?

No. It stays faithful to what the document says and flags suspicious values for review rather than correcting them, since an automatic 'fix' can hide a real error.

What about sensitive fields?

It never infers sensitive values like ID numbers or dates of birth that aren't explicitly on the document, because a wrong sensitive value carries real risk. Missing ones are flagged for a human to supply.

Can it handle scanned and handwritten forms?

Yes, with OCR — but it assesses legibility and flags anything it can't read confidently, especially handwriting and poor scans, instead of guessing.

Form-to-JSON Extraction Agent

Q: How do I know which values to trust?

Every field carries a confidence score, and low-confidence, ambiguous, or poor-OCR fields are flagged for human review. You see exactly which values are solid and which need a look.

Overview

Turns filled-in forms and documents into schema-valid JSON mapped to your fields.

Scores confidence per field so low-certainty values are visible, not hidden.

Flags illegible, ambiguous, or missing fields for human review instead of guessing.

Defensive: extracts only what's present, never fabricates a field, and validates the schema.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level ?A2 — Recommend

DNA PatternExtraction (Extract → Verify)

Worst-Case ActionReturns an incorrect or missing field, marked absent rather than guessed, for human review. It never fabricates values and cannot write extracted data anywhere — execution tools are absent.

Authority BoundaryReads a form, extracts only fields that are actually present, validates against a schema, and marks missing fields as absent. It never guesses or fabricates a value and never writes downstream. A human consumes the validated output.

Verification TestProvide a form with a blank field → confirm the agent marks it absent and does not invent a value; confirm no write tool exists.

Production Readiness6/6 dimensions passing. Tool isolation: write tools absent. Human gates: output consumed by a human/system after review. Confidence escalation: ambiguous fields flagged. Cost ceiling: bounded per form. Audit trail: extracted fields logged. Escalation path: low-confidence forms routed to review.

Last Reviewed2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json

{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "form-to-json-agent",
  "trust_level": "A2",
  "dna_pattern": "Extraction",
  "worst_case_action": "Returns a wrong/absent field for review; never fabricates. Cannot write data.",
  "authority_boundary": "Extracts only present fields and validates; never guesses; write tools absent.",
  "tags": [
    "document-processing",
    "forms",
    "extraction",
    "read-only"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_form",
      "extract_present_fields",
      "validate_schema",
      "mark_absent"
    ],
    "execution_tools_absent": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "data_write"
    ],
    "never_fabricates": true
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.18,
    "alert_threshold_usd": 0.12
  },
  "loop_boundary": {
    "max_reasoning_turns": 6
  },
  "human_handoff": {
    "triggers": [
      "ambiguous_field",
      "low_confidence"
    ],
    "destination": "review_queue"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "fields"
    ]
  }
}

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal	Bounded by the authority spec above
Trust Level	A2 — Recommend
Tool access	Least privilege — execution tools absent (read-only)
Context handling	Grounded in provided inputs; cites or flags rather than guessing
Memory strategy	Task-scoped; no persistent cross-session memory
Human approval	Required on ambiguous field, low confidence → review queue
Audit trail	Append-only log (fields)
Cost & loop bounds	≤ $0.18 per loop · ≤ 6 reasoning turns
Recovery / escalation	Escalates to review queue

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent	Primary reasoner — Recommend authority (A2)
Tools	read form, extract present fields, validate schema, mark absent — execution tools absent (read-only)
Memory	Task-scoped working context; no persistent cross-session memory
Guardrails	Worst-case classified (A2); no execution tools; ≤ $0.18/loop · ≤ 6 turns
Evaluator	Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff	Escalates to review queue on ambiguous field, low confidence

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Fabricates a value for a field that is actually blank.

Detection: Blank fields are detected and the agent marks them absent.
Mitigation: It never guesses — missing fields are marked absent, not invented.
Recovery: The human fills the gap from the source.

Misreads a field through an OCR error or wrong field mapping.

Detection: Low-confidence fields are flagged and schema validation runs.
Mitigation: Extraction only — output feeds a human or system after review.
Recovery: The reviewer corrects the field.

Maps a value to the wrong schema key.

Detection: Schema validation and type checks catch mismatches.
Mitigation: Ambiguous mappings are flagged.
Recovery: The mapping is corrected before downstream use.

Evaluation

Field-level accuracy and zero fabrication are primary — an invented value for a blank field is the silent failure.

Field accuracy	Share of extracted fields matching ground truth, reported per field type.
Fabrication rate	Frequency of invented values for blank fields — should be near zero.
Schema validity	Share of outputs that pass schema and type validation.
Mapping accuracy	Share of values mapped to the correct schema key.
Latency	Time to extract per form.

Recommended approach. Use a labeled set of forms, including ones with blank fields, with known values; report per-field accuracy and treat any invented value for a blank as a critical fabrication. Validate all output against the schema.

When to use

Use it when

You process forms or documents into structured data at volume.
You want per-field confidence and a clean schema-valid output.
You want illegible or missing fields flagged for review, not guessed.
You need faithful extraction you can trust downstream.

Avoid it when

You want it to fill in missing fields by inference — it won't guess values.
You have no schema for it to map to and validate against.
You can't review flagged low-confidence fields.
Your documents are free-form with no extractable fields.

System prompt

system-prompt.md

You are a Form-to-JSON Extraction Agent. You read a filled-in form/document and output schema-valid JSON of the fields, with per-field confidence. You are judged on accurate, faithful extraction and on NEVER guessing, inventing, or inferring a field value that isn't actually on the document.

== CORE PRINCIPLES ==
1. Extract only what's present. Output a value only if it's actually on the document. If a field is blank, missing, or unreadable, mark it as such — never fill it with a guess or an inferred value.
2. Confidence per field. Attach a confidence to every extracted value. Low-confidence, ambiguous, or poor-OCR fields are flagged for human review.
3. Faithful to the document. Capture what the form says. Don't normalize away meaning, don't "correct" values, and don't merge or split fields in ways that change them.

== HARD RULES (NON-NEGOTIABLE) ==
- NO FABRICATED FIELDS: Never invent, infer, or guess a value. Not on the document = null + "missing"/"illegible" flag. This is absolute — a confidently wrong value is worse than a flagged blank.
- NO SENSITIVE INFERENCE: Never derive a sensitive value (ID number, DOB, etc.) that isn't explicitly present.
- CONFIDENCE + FLAGS: Every field carries a confidence; low-confidence/ambiguous/illegible -> review_required.
- SCHEMA-VALID: Output must validate against the provided schema (types, required, enums). Report fields that don't fit rather than coercing silently.
- FAITHFUL: Preserve values as written; flag rather than "fix" suspicious values.

== METHOD ==
- Read the document. For each schema field, locate the value, extract it as written, score confidence, and flag missing/illegible/ambiguous. Validate against the schema. Return JSON + a review list.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "document_type": "<as identified>",
  "fields": { "<field>": { "value": "<as written|null>", "confidence": "high|medium|low", "flag": "ok|low_confidence|illegible|missing|ambiguous" } },
  "schema_valid": <bool>,
  "review_required": ["<fields needing human review and why>"],
  "note": "Extracted only values present on the document. No field was guessed or inferred."
}
Never guess a value. Flag missing/illegible fields. Validate the schema.

Was this useful?

Simulate run

Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute — no API calls, nothing leaves your browser.

Frontend preview only — no data leaves your browser. Tip: press ⌘/Ctrl + Enter to run.

Setup guide

Install and configure

Install the agent and point it at your schema.

shell

pipx install form-extract-agent
form-extract-agent init --schema ./schema.json
form-extract-agent doctor

Configure no-guess guardrails

No fabricated fields and confidence flagging are enforced here.

shell

cp .env.example .env
ANTHROPIC_API_KEY=sk-ant-...
NO_GUESSED_VALUES=true
FLAG_BELOW_CONFIDENCE=medium
SCHEMA_VALIDATE=true

Define the schema

Declare fields, types, and which are required.

shell

# schema.json
{ "name": "string", "dob": "date", "id_number": "string",
  "required": ["name"], "enums": { "state": ["active","inactive"] } }

Test on samples

Extract sample forms and review confidence and flags.

shell

form-extract-agent run --doc ./form.pdf --explain
# prints JSON + per-field confidence + review_required (no guessed values)

Wire into your pipeline

Extract incoming forms; flagged fields go to review.

shell

# document in -> extract -> schema-valid JSON; low-confidence/missing -> review queue

Architecture

Document intakeReceives the form/document and the target schema to extract into.

Field locatorLocates each schema field's value on the document, including OCR for scanned input.

Faithful extractorExtracts each value as written without correcting, normalizing away, or merging fields.

Confidence scorerAttaches a confidence to every field based on clarity and OCR quality.

No-guess guardBlocks any invented or inferred value; absent or unreadable fields become null with a flag.

Schema validatorValidates the output against the schema's types, required fields, and enums.

Review queueRoutes low-confidence, illegible, ambiguous, and missing fields to a human.

Tools required

get_documentRetrieve the form/document and the target schema.

detect_fieldsLocate each schema field's value on the document.

extract_to_schemaExtract values as written and map them to schema fields.

confidence_scoreAttach a confidence to each extracted value.

validate_schemaValidate the output against types, required fields, and enums.

flag_low_confidenceFlag low-confidence and ambiguous fields for review.

flag_missingMark blank or absent fields as missing rather than guessing.

ocr_quality_checkAssess OCR/legibility and flag illegible fields.

Workflow

1. Take the document
Receive the form/document and the target schema.
2. Locate fields
Find each schema field's value on the document, with OCR for scans.
3. Extract faithfully
Capture each value as written, without correcting or normalizing it away.
4. Score confidence
Attach a confidence to every field based on clarity and OCR quality.
5. Guard against guessing
Mark absent or unreadable fields as null with a flag — never invent a value.
6. Validate the schema
Check types, required fields, and enums; report fields that don't fit.
7. Queue for review
Route low-confidence, illegible, and missing fields to a human.

Examples

Clean form → all fields extracted, schema-valid

A clear, fully filled form extracts cleanly with high confidence.

Input

input

A typed application form with Name: Jane Smith, DOB: 1990-04-12, State: active — all legible.

Output

output

{
  "document_type": "application_form",
  "fields": {
    "name": { "value": "Jane Smith", "confidence": "high", "flag": "ok" },
    "dob": { "value": "1990-04-12", "confidence": "high", "flag": "ok" },
    "state": { "value": "active", "confidence": "high", "flag": "ok" }
  },
  "schema_valid": true,
  "review_required": [],
  "note": "Extracted only values present on the document. No field was guessed or inferred."
}

Note: Every field is present and legible, so extraction is high-confidence and schema-valid with nothing to review. The routine case the agent should fully automate.

Illegible field → don't guess, flag for review

One field is smudged/unreadable. The agent extracts the rest and flags the unreadable field rather than guessing.

Input

input

A scanned form where Name and State are clear, but the DOB is handwritten and partially smudged.

Output

output

{
  "document_type": "application_form",
  "fields": {
    "name": { "value": "Robert Lee", "confidence": "high", "flag": "ok" },
    "dob": { "value": null, "confidence": "low", "flag": "illegible" },
    "state": { "value": "active", "confidence": "high", "flag": "ok" }
  },
  "schema_valid": true,
  "review_required": ["dob — handwritten and partially smudged; could not read with confidence. Left null for human review rather than guessing a date."],
  "note": "Extracted only values present on the document. No field was guessed or inferred."
}

Note: The DOB is unreadable, so the agent returns null with an 'illegible' flag instead of guessing a plausible-looking date. A guessed date of birth on a record could be worse than a blank — flagging it for a human is the safe, correct behavior.

Asked to infer a missing value → refuse to fabricate

A required field is blank and the request is to 'just fill it in.' The agent refuses to invent a value.

Input

input

Request: 'The ID number field is blank — just put in a reasonable one so the record validates.'

Output

output

{
  "document_type": "application_form",
  "fields": {
    "name": { "value": "Maria Gomez", "confidence": "high", "flag": "ok" },
    "id_number": { "value": null, "confidence": "low", "flag": "missing" }
  },
  "schema_valid": false,
  "review_required": ["id_number — required but blank on the document. I will NOT fabricate an ID number to make the record validate; an invented identifier could be assigned to the wrong person or collide with a real one. The field is left null and flagged; obtain the real value to complete the record."],
  "note": "Extracted only values present on the document. No field was guessed or inferred."
}

Note: The defining defensive case: asked to invent a missing identifier so a record validates. The agent refuses — a fabricated ID number could collide with a real one or be tied to the wrong person. It reports schema_valid:false honestly and flags the field, rather than papering over a real data gap with a fake value. A flagged blank beats a confident lie.

Implementation notes

Never guess or infer a field value; absent or unreadable fields must be null with a flag, because a confidently wrong value propagates silently into downstream systems.
Score confidence per field and route low-confidence, illegible, and ambiguous fields to human review rather than accepting them blindly.
Validate against the schema and report fields that don't fit instead of silently coercing types or dropping data.
Stay faithful to the document: preserve values as written and flag suspicious ones rather than 'correcting' them, which can hide real errors.
Never infer sensitive values (identifiers, dates of birth) that aren't explicitly present — the harm from a wrong sensitive value is high.
Handle checkboxes, multi-value, and repeated fields explicitly so structure isn't lost or misread.
Spend the strong model on handwriting, poor scans, and ambiguity — a cheaper model can handle clean typed forms.

Variations

Basic

Field extractor

Extracts document fields into schema-valid JSON with per-field confidence. On demand.

Advanced

Guarded extraction

Adds no-guess guards, illegible/missing flagging, schema validation, and a human review queue for uncertain fields.

Enterprise

Document processing pipeline

Adds batch ingestion, OCR tuning, multi-template schemas, human-in-the-loop review, and audit trails for high-volume IDP.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).

Download Blueprint (.zip)

README.mdsystem-prompt.mdsetup-guide.mdtools.jsonworkflow.mdexamples.md.env.examplekit.jsonrun.pyLICENSENOTICEstarters/

Export

Generate a starter for your stack — all client-side, nothing leaves your browser.

ZIP

Starters use mock tools — swap in your integrations to deploy.

View the source on GitHub

This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).

Form-to-JSON Extraction Agent

Overview

AgentAz™ specification

Governance matrix

Agent component mapping

Failure modes

Evaluation

When to use

System prompt

Simulate run

Setup guide

Architecture

Tools required

Workflow

Examples

Implementation notes

Variations

Frequently asked questions

Will it guess a field it can't read?

How do I know which values to trust?

Does it validate against my schema?

Will it 'fix' values it thinks are wrong?

What about sensitive fields?

Can it handle scanned and handwritten forms?

Related kits

Invoice Data Extraction Agent

Access Request & Provisioning Agent

Account Research Agent

Action Item Tracking Agent