Overview
Line-by-line audit against your actual policy: limits, categories, receipt rules, and per-diems — each flag cites the rule it breaks.
Catches what manual review misses: duplicate submissions, out-of-policy items, and suspicious patterns across reports.
Decides within limits: clean reports auto-approve; specific items are held for review; rejections and fraud signals go to a human.
Defensive: no auto-approval over the cap or with missing receipts, and no fraud accusation without cited evidence.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "expense-report-audit-agent",
  "trust_level": "A2",
  "dna_pattern": "Evaluation",
  "worst_case_action": "Flags an expense incorrectly for human review. Cannot approve, reject, or reimburse.",
  "authority_boundary": "Audits expenses against policy and flags issues; no approval or payment tools present.",
  "tags": [
    "finance",
    "expense-audit",
    "compliance",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_expense",
      "check_policy",
      "detect_anomaly",
      "flag_violation"
    ],
    "execution_tools_absent": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "expense_approve",
      "expense_reject",
      "payment"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.25,
    "alert_threshold_usd": 0.16
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "policy_violation",
      "anomaly",
      "low_confidence"
    ],
    "destination": "finance_review"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "flags",
      "policy_refs"
    ]
  }
}
New to this? Read the AgentAz specification guide: Trust Levels, DNA patterns, and how it complements your runtime.
AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.
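Because the contract is plain JSON, its boundaries can be cross-checked mechanically before a security review. The sketch below is an illustration only: `check_spec` is a hypothetical helper, not part of any AgentAz tooling, it uses an excerpt of the fields shown above, and it does not replace validation against the published JSON Schema.

```python
import json

# Excerpt of the agentaz.json fields shown above.
SPEC_JSON = """
{
  "tool_boundary": {
    "allowed_tools": ["read_expense", "check_policy", "detect_anomaly", "flag_violation"],
    "execution_tools_absent": true
  },
  "output_boundary": {"never_emits": ["expense_approve", "expense_reject", "payment"]},
  "cost_boundary": {"max_usd_per_trace_loop": 0.25, "alert_threshold_usd": 0.16},
  "loop_boundary": {"max_reasoning_turns": 8}
}
"""


def check_spec(spec: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    allowed = set(spec["tool_boundary"]["allowed_tools"])
    forbidden = set(spec["output_boundary"]["never_emits"])
    overlap = allowed & forbidden
    if overlap:
        problems.append(f"tools both allowed and forbidden: {sorted(overlap)}")
    cost = spec["cost_boundary"]
    if cost["alert_threshold_usd"] >= cost["max_usd_per_trace_loop"]:
        problems.append("alert threshold should sit below the hard cost cap")
    if spec["loop_boundary"]["max_reasoning_turns"] < 1:
        problems.append("max_reasoning_turns must be at least 1")
    return problems


spec = json.loads(SPEC_JSON)
print(check_spec(spec) or "spec boundaries are internally consistent")
```

A check like this fits naturally in CI next to schema validation, so a pull request that adds an execution tool to `allowed_tools` fails before review.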
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Dimension | Coverage |
|---|---|
| Agent goal | Bounded by the authority spec above |
| Trust Level | A2 — Recommend |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on policy violation, anomaly, low confidence → finance review |
| Audit trail | Append-only log (flags, policy refs) |
| Cost & loop bounds | ≤ $0.25 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to finance review |
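The specification documents the cost and loop bounds but does not enforce them, so the host runtime has to. A minimal sketch of such a guard follows, assuming the host can observe the cost of each reasoning turn; `LoopBudget` and `record_turn` are hypothetical names, and the defaults mirror the values in the matrix.

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    """Tracks spend and turns for one trace loop (hypothetical host-side guard)."""
    max_usd: float = 0.25
    alert_usd: float = 0.16
    max_turns: int = 8
    spent_usd: float = 0.0
    turns: int = 0

    def record_turn(self, cost_usd: float) -> str:
        """Record one reasoning turn and return 'ok', 'alert', or 'stop'."""
        self.turns += 1
        self.spent_usd += cost_usd
        # Stop once the cost cap is exceeded or the last allowed turn is used.
        if self.spent_usd > self.max_usd or self.turns >= self.max_turns:
            return "stop"
        if self.spent_usd >= self.alert_usd:
            return "alert"
        return "ok"


budget = LoopBudget()
for cost in (0.05, 0.05, 0.07, 0.10):
    print(budget.record_turn(cost))
```

On "stop", the host would end the loop and hand the report to finance review rather than let the agent continue.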
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Component | Role in this blueprint |
|---|---|
| Agent | Primary reasoner — Recommend authority (A2) |
| Tools | read expense, check policy, detect anomaly, flag violation — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.25/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to finance review on policy violation, anomaly, low confidence |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Misses a genuine policy violation (false negative).
- Detection: Every report is screened against the full policy set, not sampled.
- Mitigation: Positioned as full-coverage screening with a human deciding exceptions.
- Recovery: The missed rule is added post-audit and the report can be re-screened.
Flags a compliant expense as a violation (false positive).
- Detection: Each finding carries confidence and cites the policy clause.
- Mitigation: Findings are recommendations a human approves; it never auto-rejects.
- Recovery: The approver clears it and the rule is tuned.
A receipt is fabricated or altered.
- Detection: The agent flags anomalies but never asserts authenticity.
- Mitigation: A human verifies authenticity.
- Recovery: Suspicious items are escalated to finance.
Evaluation
Violation recall matters most, because a missed policy breach is the costly failure; it is weighed against a tolerable false-positive rate.
| Metric | What it measures |
|---|---|
| Violation recall | Of genuine policy violations, the share it catches. |
| Precision | Of items flagged, the share that are real violations — noise resistance. |
| Policy coverage | Share of policy rules actually exercised by the screen. |
| Citation accuracy | Whether each flag cites the correct policy clause. |
| Latency | Time to audit a report. |
Recommended approach. Build a set of expense reports annotated against the full policy, with seeded violations and compliant edge cases; measure recall and precision and verify each flag cites the right clause. Include altered-receipt cases to confirm it flags rather than asserts authenticity.
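The recall, precision, and citation checks described above reduce to a few set operations. The sketch below assumes each expense line has a stable ID and that the annotated set records the expected rule per violating line; the IDs, rules, and function names are hypothetical illustration data, not results.

```python
def recall_precision(expected: set[str], flagged: set[str]) -> tuple[float, float]:
    """Recall and precision of flagged line IDs against annotated violations."""
    true_pos = len(expected & flagged)
    recall = true_pos / len(expected) if expected else 1.0
    precision = true_pos / len(flagged) if flagged else 1.0
    return recall, precision


def citation_accuracy(expected_rules: dict[str, str], cited_rules: dict[str, str]) -> float:
    """Share of correctly flagged lines whose cited rule matches the annotation."""
    common = expected_rules.keys() & cited_rules.keys()
    if not common:
        return 1.0
    return sum(expected_rules[k] == cited_rules[k] for k in common) / len(common)


# Hypothetical annotated set: line ID -> rule the line actually breaks.
expected_rules = {"EXP-1:2": "meals<=60", "EXP-2:1": "receipt_required_over=25",
                  "EXP-3:4": "hotel_per_night<=300", "EXP-4:1": "meals<=60"}
# Hypothetical agent output: line ID -> rule the agent cited.
cited_rules = {"EXP-1:2": "meals<=60", "EXP-2:1": "receipt_required_over=25",
               "EXP-3:4": "meals<=60", "EXP-5:3": "meals<=60"}

recall, precision = recall_precision(set(expected_rules), set(cited_rules))
print(f"recall={recall:.2f} precision={precision:.2f}")
print(f"citation_accuracy={citation_accuracy(expected_rules, cited_rules):.2f}")
```

Tracking recall separately from citation accuracy matters here: a flag on the right line with the wrong clause still counts as a catch, but it weakens the audit trail.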
When to use
Use it when
- Finance/AP reviews a high volume of expense reports and most of the work is policy-checking and receipt-matching.
- You have a written expense policy the agent can audit against and access to receipts/report data.
- You want consistent, documented audits with an approval trail for compliance.
- You want to auto-clear clean reports and surface only the genuine exceptions and fraud signals to humans.
Avoid it when
- You have no written, structured policy for the agent to audit against.
- You expect it to make final fraud or termination determinations — those are human decisions.
- You can't give it receipt/report access to actually verify line items.
- You are unwilling to keep approval gates on large amounts and rejections.
System prompt
You are an Expense Audit Agent in a finance operation. You audit ONE expense report against the company's written policy and decide: approve, hold specific items, reject, or escalate. You are judged on catching real policy violations and fraud, fairness and accuracy, and never approving spend you shouldn't or accusing someone without evidence.
== CORE PRINCIPLES ==
1. Policy-grounded. Every flag must cite the specific policy rule it violates (limit, category, receipt requirement, per-diem). Do not invent rules or violations; if the policy is silent, it is not a violation.
2. Evidence over suspicion. Base duplicate/fraud flags on concrete evidence (matching receipt, overlapping dates, identical amounts). Never label an employee 'fraud' without cited evidence; flag patterns for human review instead.
3. Audit each line. Approve the compliant items and flag only the specific non-compliant ones — don't reject a whole report over one bad line.
== HARD RULES (NON-NEGOTIABLE) ==
- APPROVAL LIMITS: Auto-approve ONLY when every line is within policy, required receipts are present, and the total is at or below the configured auto-approval cap. Anything above the cap, or with a policy exception, requires human approval.
- RECEIPTS REQUIRED: Do not approve an item that policy requires a receipt for if the receipt is missing or unreadable — hold it.
- NO UNFOUNDED ACCUSATIONS: Suspected duplicates/fraud are flagged with the evidence and routed to a human; never assert intent or wrongdoing.
- PII/DATA: Treat employee and financial data as sensitive; keep it in scope; redact where not needed.
- FAIRNESS: Apply the same policy consistently to every report.
== METHOD ==
- Load the report and the applicable policy. For each line: check category, amount vs. limit, receipt presence/validity, and per-diem/date rules.
- Run duplicate detection (same amount+date+merchant, or the same receipt across reports) and basic anomaly checks (e.g. mileage + flight for the same leg, weekend/personal patterns).
- Decide per line: ok / flag (with rule cited) / hold (missing doc). Then decide the report outcome.
== DECISION POLICY (calibrated confidence 0.0-1.0) ==
- APPROVE: all lines compliant, receipts present, total <= cap, confidence >= 0.85.
- HOLD: specific items missing receipts or needing a minor fix — approve the rest, hold those.
- REJECT_WITH_REASONS: clear policy violations; cite each. (Recommendation for a human to confirm.)
- ESCALATE: total over cap, suspected duplicate/fraud, policy exception, or conflicting evidence.
== COST CONTROL ==
Check only what each line needs; reuse the policy already loaded. Cap tool calls; if exceeded, approve the clearly-clean lines and escalate the rest.
== OUTPUT FORMAT (return ONE JSON object) ==
{
  "report_id": "<id>",
  "decision": "APPROVE|HOLD|REJECT_WITH_REASONS|ESCALATE",
  "confidence": <0.0-1.0>,
  "total_usd": <number>,
  "line_findings": [ { "item": "<line>", "status": "ok|flag|hold", "rule": "<policy rule cited, or empty>", "note": "<short>" } ],
  "fraud_signals": ["<evidence-based pattern, or empty>"],
  "approved_amount_usd": <number>,
  "actions": [ { "tool": "<tool>", "args": { ... }, "requires_approval": <bool> } ],
  "employee_note": "<neutral, factual; no accusation>",
  "escalation": { "needed": <bool>, "reason": "<cap/fraud/exception, or empty>" }
}
If evidence is mixed, prefer HOLD or ESCALATE over REJECT, and never accuse without cited evidence.
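Since the system prompt asks for one JSON object in a fixed shape, the caller can reject malformed output before acting on it. The following is a minimal sketch using only the standard library; `validate_output` is a hypothetical helper, and the checks cover only a few of the fields in the output format.

```python
import json

DECISIONS = {"APPROVE", "HOLD", "REJECT_WITH_REASONS", "ESCALATE"}
STATUSES = {"ok", "flag", "hold"}


def validate_output(raw: str) -> list[str]:
    """Return problems found in the agent's JSON output (empty list if none)."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(out, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    if out.get("decision") not in DECISIONS:
        problems.append("unknown decision")
    conf = out.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not (0.0 <= conf <= 1.0):
        problems.append("confidence must be a number in [0.0, 1.0]")
    for finding in out.get("line_findings", []):
        if finding.get("status") not in STATUSES:
            problems.append(f"unknown line status: {finding.get('status')!r}")
        # Core principle 1: every flag must cite a policy rule.
        if finding.get("status") == "flag" and not finding.get("rule"):
            problems.append("flagged line without a cited rule")
    if out.get("decision") == "ESCALATE" and not out.get("escalation", {}).get("needed"):
        problems.append("ESCALATE decision but escalation.needed is not true")
    return problems


sample = json.dumps({
    "report_id": "EXP-3310",
    "decision": "APPROVE",
    "confidence": 0.93,
    "line_findings": [{"item": "Hotel $180", "status": "ok",
                       "rule": "hotel_per_night<=300", "note": "within limit"}],
    "escalation": {"needed": False, "reason": ""},
})
print(validate_output(sample))  # []
```

Any non-empty result is a reasonable trigger for routing the report to a human instead of retrying silently.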
Setup guide
Install and connect your expense system
Install the agent and connect it to your expense/AP platform.
pipx install expense-audit-agent
expense-audit-agent connect --system concur
expense-audit-agent doctor
Configure limits and mode
The auto-approval cap and receipt rules are enforced deterministically, not by the model.
cp .env.example .env
# .env
ANTHROPIC_API_KEY=sk-ant-...
AUTO_APPROVE_CAP_USD=250
REQUIRE_RECEIPT_OVER_USD=25
MODE=assist  # assist (recommend) | act (auto within cap)
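"Enforced deterministically" means the cap and receipt checks run as ordinary code after the model has audited the lines. The sketch below shows one way such a gate could look; it is not the shipped implementation. `Line` and `gate` are hypothetical names, the parameters mirror the `.env` settings above, and `in_policy` stands for the outcome of the per-line audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    description: str
    amount_usd: float
    has_receipt: bool
    in_policy: bool  # outcome of the per-line audit (assumed to be provided)


def gate(lines: list[Line], cap_usd: float = 250.0,
         receipt_over_usd: float = 25.0, mode: str = "assist") -> str:
    """Return 'auto_approve' or 'human_review'. Pure function, no model call."""
    if mode != "act":
        return "human_review"  # assist mode only recommends
    if sum(line.amount_usd for line in lines) > cap_usd:
        return "human_review"  # total over the auto-approval cap
    for line in lines:
        if not line.in_policy:
            return "human_review"
        if line.amount_usd > receipt_over_usd and not line.has_receipt:
            return "human_review"  # required receipt is missing
    return "auto_approve"


clean_report = [
    Line("Hotel", 180, has_receipt=True, in_policy=True),
    Line("Taxi", 19, has_receipt=True, in_policy=True),
    Line("Lunch", 15, has_receipt=True, in_policy=True),
]
print(gate(clean_report, mode="act"))     # auto_approve
print(gate(clean_report, mode="assist"))  # human_review
```

Keeping the gate free of model calls means a prompt change or a model error cannot widen what gets approved without a human.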
Load your expense policy
Provide the structured policy the agent audits against. This is the only basis for flags.
# policy.yml
limits: { meals: 60, hotel_per_night: 300, mileage_per_mile: 0.67 }
receipt_required_over: 25
disallowed: ["alcohol_over_limit", "personal", "first_class_without_approval"]
per_diem: { domestic: 75 }
Backtest on past reports
Replay audited reports to compare the agent's findings to actual outcomes before going live.
expense-audit-agent backtest --range 90d --explain
# reports approve/flag accuracy and any missed violations
Wire into the approval flow
Route submitted reports to the agent. Start in assist mode, then enable auto-approval within the cap once backtests are clean.
# submission webhook -> POST https://your-host/expense/audit (HMAC)
# promote MODE=act for within-cap clean reports
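The webhook note above mentions HMAC without specifying a scheme. As a sketch under stated assumptions (HMAC-SHA256 over the raw request body, a hex-encoded signature, and a shared secret held by both sides; your expense platform may differ), the receiving endpoint could verify a submission like this. `verify_signature` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature over the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking timing information.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


secret = b"example-shared-secret"  # hypothetical; load from configuration in practice
body = b'{"report_id": "EXP-3310"}'
good_signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, good_signature))         # True
print(verify_signature(secret, body + b" ", good_signature))  # False
```

Verify against the raw bytes as received, before any JSON parsing, since re-serializing the body can change it and invalidate the signature.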
Architecture
Workflow
1. Intake the report
Load the report, receipts, and submitter context; load the applicable policy.
2. Audit each line
Check category, amount vs. limit, receipt presence/validity, and per-diem/date rules for every line, citing the rule on any flag.
3. Verify receipts
Confirm required receipts are present and readable and match the line; hold items that lack required proof.
4. Detect duplicates & anomalies
Run duplicate detection and pattern checks across this and prior reports, gathering evidence rather than asserting intent.
5. Decide per line and report
Approve compliant lines, hold those missing docs, flag violations with the rule, and decide the report outcome within the cap.
6. Act through the gate
Auto-approve within limits; route over-cap totals, rejections, and fraud signals to a human with the evidence.
7. Record the trail
Log each decision with the cited rule and outcome for compliance, and feed overrides back to improve the checks.
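Steps 2, 3, and 5 can be sketched as a small per-line function that returns a status and the cited rule. This is an illustration only: `audit_line` is a hypothetical helper, the policy dictionary is an excerpt of the `policy.yml` values from the setup guide, and the category names are assumed to match the policy's limit keys.

```python
POLICY = {
    "limits": {"meals": 60, "hotel_per_night": 300},
    "receipt_required_over": 25,
}


def audit_line(line: dict, policy: dict = POLICY) -> dict:
    """Return a finding with status ok / flag / hold and the rule it rests on."""
    category, amount = line["category"], line["amount"]
    limit = policy["limits"].get(category)
    rule = f"{category}<={limit}" if limit is not None else ""
    if limit is not None and amount > limit:
        return {"item": line["item"], "status": "flag", "rule": rule}
    threshold = policy["receipt_required_over"]
    if amount > threshold and not line["receipt"]:
        return {"item": line["item"], "status": "hold",
                "rule": f"receipt_required_over={threshold}"}
    # If the policy is silent on a category, the line is not a violation.
    return {"item": line["item"], "status": "ok", "rule": rule}


report = [
    {"item": "Hotel $190", "category": "hotel_per_night", "amount": 190, "receipt": True},
    {"item": "Dinner $92", "category": "meals", "amount": 92, "receipt": True},
    {"item": "Supplies $40", "category": "supplies", "amount": 40, "receipt": False},
]
findings = [audit_line(line) for line in report]
approved_usd = sum(line["amount"] for line, finding in zip(report, findings)
                   if finding["status"] == "ok")

for finding in findings:
    print(finding["item"], finding["status"], finding["rule"])
print("approved_usd:", approved_usd)
```

Only the lines with status `ok` contribute to the approved amount, which is how a report can be partly approved while specific items wait on a receipt or a manager.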
Examples
Clean report within policy → auto-approve
A standard travel report: every line within limits, receipts attached, total under the cap.
Input
Report EXP-3310, total $214: hotel $180 (receipt), taxi $19 (receipt), lunch $15 (receipt). Policy: hotel<=300/night, meals<=60, receipts over $25.
Output
{
  "report_id": "EXP-3310",
  "decision": "APPROVE",
  "confidence": 0.93,
  "total_usd": 214,
  "line_findings": [
    { "item": "Hotel $180", "status": "ok", "rule": "hotel_per_night<=300", "note": "within limit, receipt present" },
    { "item": "Taxi $19", "status": "ok", "rule": "", "note": "under receipt threshold, allowed" },
    { "item": "Lunch $15", "status": "ok", "rule": "meals<=60", "note": "within limit" }
  ],
  "fraud_signals": [],
  "approved_amount_usd": 214,
  "actions": [ { "tool": "approve_expense", "args": { "report_id": "EXP-3310", "amount_usd": 214 }, "requires_approval": false } ],
  "employee_note": "Your expense report EXP-3310 ($214) has been approved.",
  "escalation": { "needed": false, "reason": "" }
}
Note: Every line is checked against the cited policy rule, receipts are present, total is under the $250 cap, so it auto-approves and documents the rule for each line. This is the volume the kit clears instantly so humans only see exceptions.
Out-of-policy item + missing receipt → hold the line, approve the rest
Mostly clean, but one meal is over the limit and a $40 item has no receipt. The agent holds those two lines and approves the rest.
Input
Report EXP-3402, total $322: hotel $190 (receipt), dinner $92 (receipt), supplies $40 (NO receipt). Policy: meals<=60, receipts required over $25.
Output
{
  "report_id": "EXP-3402",
  "decision": "HOLD",
  "confidence": 0.9,
  "total_usd": 322,
  "line_findings": [
    { "item": "Hotel $190", "status": "ok", "rule": "hotel_per_night<=300", "note": "within limit" },
    { "item": "Dinner $92", "status": "flag", "rule": "meals<=60", "note": "$32 over the per-meal limit; needs approval or adjustment" },
    { "item": "Supplies $40", "status": "hold", "rule": "receipt_required_over=25", "note": "no receipt attached for a $40 item" }
  ],
  "fraud_signals": [],
  "approved_amount_usd": 190,
  "actions": [ { "tool": "approve_expense", "args": { "report_id": "EXP-3402", "lines": ["hotel"], "amount_usd": 190 }, "requires_approval": false } ],
  "employee_note": "Approved the hotel ($190). Two items need attention: the dinner ($92) exceeds the $60 meal limit, and the $40 supplies item needs a receipt. Please add the receipt and confirm the meal overage with your manager.",
  "escalation": { "needed": false, "reason": "" }
}
Note: Instead of rejecting the whole report over two lines, the agent approves the compliant hotel and holds exactly the two problem items, each with the cited rule. The employee note is specific and actionable, not punitive.
Duplicate receipt across reports → escalate with evidence, no accusation
A receipt submitted on this report also appears on a prior one. The agent flags the duplicate with evidence and escalates — without accusing the employee.
Input
Report EXP-3450: flight $610 with receipt #A-7782. duplicate_check finds receipt #A-7782 already reimbursed on EXP-3119 last month.
Output
{
  "report_id": "EXP-3450",
  "decision": "ESCALATE",
  "confidence": 0.8,
  "total_usd": 610,
  "line_findings": [
    { "item": "Flight $610", "status": "flag", "rule": "duplicate_submission", "note": "receipt #A-7782 matches one already reimbursed on EXP-3119 (2026-05)" }
  ],
  "fraud_signals": ["Identical receipt #A-7782 ($610) appears on EXP-3119 (already reimbursed) and EXP-3450: possible duplicate submission"],
  "approved_amount_usd": 0,
  "actions": [ { "tool": "escalate_to_finance", "args": { "queue": "duplicate_review", "evidence": ["EXP-3119", "EXP-3450", "receipt#A-7782"] }, "requires_approval": false } ],
  "employee_note": "We're reviewing report EXP-3450; the flight receipt appears to match one already reimbursed. Finance will follow up; this may simply be an accidental re-submission.",
  "escalation": { "needed": true, "reason": "Possible duplicate reimbursement: same receipt on two reports." }
}
Note: The defining defensive case: the agent has concrete evidence (same receipt number on two reports) but treats it as a possible duplicate to review, not proven fraud. It escalates with the evidence, holds the $610, and the employee note explicitly allows for an honest mistake. Evidence and fairness, never accusation.
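The duplicate check in this example amounts to grouping lines by receipt ID, or by amount, date, and merchant, across the current and prior reports. A minimal sketch follows; `find_duplicates` is a hypothetical helper, and the dates and merchant name are invented for illustration. It returns evidence only and makes no judgment about intent.

```python
from collections import defaultdict


def find_duplicates(lines: list[dict]) -> list[dict]:
    """Group lines sharing a receipt ID or the same (amount, date, merchant)."""
    by_key = defaultdict(list)
    for line in lines:
        if line.get("receipt_id"):
            by_key[("receipt", line["receipt_id"])].append(line)
        key = ("amount_date_merchant", line["amount"], line["date"], line["merchant"])
        by_key[key].append(line)
    signals = []
    for key, group in by_key.items():
        if len(group) > 1:
            signals.append({
                "match": key[0],
                "evidence": key[1:],
                "reports": sorted({line["report_id"] for line in group}),
            })
    return signals


# Hypothetical history: the prior report plus the one under audit.
lines = [
    {"report_id": "EXP-3119", "receipt_id": "A-7782", "amount": 610,
     "date": "2026-05-12", "merchant": "Example Air"},
    {"report_id": "EXP-3450", "receipt_id": "A-7782", "amount": 610,
     "date": "2026-05-12", "merchant": "Example Air"},
    {"report_id": "EXP-3450", "receipt_id": "B-1001", "amount": 19,
     "date": "2026-06-02", "merchant": "City Cab"},
]
for signal in find_duplicates(lines):
    print(signal)
```

Each signal lists the matching reports so the escalation can attach them as evidence, as the `actions` entry in the example output does.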
Implementation notes
- Enforce the auto-approval cap and receipt requirements in a deterministic gate; the model audits, the gate controls what can be approved without a human.
- Cite the specific policy rule on every flag. A finding without a rule is an opinion, not an audit — and citations make the trail defensible.
- Treat duplicates and anomalies as evidence to review, never as proven fraud; route them to a human and keep employee-facing language neutral.
- Audit per line and approve the compliant parts — rejecting whole reports over a single bad line creates friction and rework.
- Backtest against historically audited reports and track missed-violation and false-flag rates before enabling auto-approval.
- Keep employee and financial data in scope with PII discipline, and apply the policy identically to everyone for fairness and audit.
- Reserve the strong model for anomaly judgment and the report decision; a cheaper model can match receipts and categorize lines.
Variations
Basic
Audit & flag assistant
Audits each line against policy, verifies receipts, and returns flagged items with the cited rule and a recommendation for a reviewer. No auto-approval.
Advanced
Guarded auto-approval
Auto-approves clean reports within the cap, holds specific non-compliant lines, runs duplicate/anomaly detection, and escalates fraud signals and over-cap totals.
Enterprise
Governed spend audit
Adds multi-policy support, ERP/AP integration, full audit trails and SLAs, fraud-pattern analytics across employees, and check tuning from reviewer outcomes.
This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
Frequently asked questions
Only when every line is within policy, required receipts are present, and the total is within your configured cap. Anything over the cap, missing a required receipt, or showing policy exceptions is held or escalated to a human.
It audits each line against your structured written policy and cites the specific rule on every flag. If the policy is silent on something, it isn't treated as a violation — no invented rules.
No. It surfaces evidence-based patterns (like a duplicate receipt) and routes them to a human for review with the evidence attached, keeping employee-facing language neutral. It never asserts intent or wrongdoing.
It approves the compliant lines and holds only the specific problem items, with the cited rule and what's needed to fix them — rather than rejecting the whole report.
It checks for the same receipt, or the same amount/date/merchant, across the current and prior reports, and flags genuine matches as possible duplicates for human review.
Start in assist mode where it only recommends, backtest against historically audited reports, then enable auto-approval for clean within-cap reports once the results hold up.