Overview
Enrich → detonate → scope → respond: turns a reported email into a verdict, a campaign view, and a contained incident.
Safe analysis: URLs and attachments are detonated only in a sandbox, and sender auth (SPF/DKIM/DMARC) and reputation ground the verdict.
Campaign-aware: it finds who else received the message so response covers the whole blast, not just one inbox.
Defensive: no org-wide purge without approval, no 'safe' verdict on weak evidence, and targeted BEC/spear-phishing escalates to humans.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
"$schema": "./agentaz.schema.json",
"version": "2.0.0",
"last_reviewed": "2026-06-24",
"agent_id": "phishing-triage-agent",
"trust_level": "A3",
"dna_pattern": "Escalation",
"worst_case_action": "Stages an incorrect quarantine for analyst approval, or misclassifies an email. Cannot auto-contain.",
"authority_boundary": "Analyzes reported phishing and stages containment for approval; no autonomous quarantine/block.",
"tags": [
"security",
"phishing",
"soc",
"human-approval"
],
"tool_boundary": {
"allowed_tools": [
"read_report",
"extract_indicators",
"score_risk",
"stage_containment"
],
"approval_required_tools": [
"quarantine",
"block_sender"
],
"execution_tools_absent": false
},
"output_boundary": {
"format": "structured_json",
"never_without_approval": [
"quarantine",
"block_sender",
"delete_mail"
]
},
"cost_boundary": {
"max_usd_per_trace_loop": 0.25,
"alert_threshold_usd": 0.16
},
"loop_boundary": {
"max_reasoning_turns": 8
},
"human_handoff": {
"triggers": [
"confirmed_threat",
"uncertain_verdict",
"containment_proposed"
],
"destination": "soc_analyst"
},
"audit": {
"append_only": true,
"logs": [
"indicators",
"verdict",
"staged_actions",
"approvals"
]
}
}
New to this? Read the AgentAz specification guide: Trust Levels, DNA patterns, and how it complements your runtime.
This is a flagship reference blueprint for AgentAz v1.0.0. AgentAz™ is open source under Apache-2.0 (spec text under CC‑BY‑4.0) — schema and source on GitHub.
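Because the spec does not enforce anything at runtime, a policy layer has to read the contract and apply it. The following is a minimal Python sketch of that idea, not part of AgentAz itself: the `gate` function and its return values are hypothetical, and the embedded JSON is a trimmed copy of the contract above.

```python
import json

# Trimmed copy of the agentaz.json contract shown above (only the fields used here).
CONTRACT = json.loads("""
{
  "agent_id": "phishing-triage-agent",
  "tool_boundary": {
    "allowed_tools": ["read_report", "extract_indicators", "score_risk", "stage_containment"],
    "approval_required_tools": ["quarantine", "block_sender"]
  },
  "output_boundary": {
    "never_without_approval": ["quarantine", "block_sender", "delete_mail"]
  }
}
""")


def gate(tool: str, approved: bool = False) -> str:
    """Hypothetical helper: return 'allow', 'needs_approval', or 'deny' for a tool call."""
    tools = CONTRACT["tool_boundary"]
    # Anything listed in either approval list is gated behind a human decision.
    gated = set(tools["approval_required_tools"]) | set(
        CONTRACT["output_boundary"]["never_without_approval"]
    )
    if tool in gated:
        return "allow" if approved else "needs_approval"
    if tool in tools["allowed_tools"]:
        return "allow"
    return "deny"  # tools the contract does not list are refused


if __name__ == "__main__":
    for name in ("read_report", "quarantine", "run_shell"):
        print(name, "->", gate(name))
```

Denying unlisted tools by default is a design choice of this sketch; the contract only states which tools are allowed and which need approval.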
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Dimension | Coverage |
|---|---|
| Agent goal | Bounded by the authority spec above |
| Trust Level | A3 — Human-Approved |
| Tool access | Scoped tools; high-risk actions gated behind approval |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on confirmed threat, uncertain verdict, or proposed containment → SOC analyst |
| Audit trail | Append-only log (indicators, verdict, staged actions, approvals) |
| Cost & loop bounds | ≤ $0.25 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to SOC analyst |
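The cost and loop bounds in the matrix can be tracked with a small counter that lives outside the model. This is a sketch under assumptions: the `LoopBudget` class and its `record_turn` method are hypothetical names, and the check runs after each turn rather than before it. The default values are taken from `cost_boundary` and `loop_boundary` in the contract.

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    """Hypothetical per-loop budget tracker using the contract's limits."""
    max_usd: float = 0.25    # cost_boundary.max_usd_per_trace_loop
    alert_usd: float = 0.16  # cost_boundary.alert_threshold_usd
    max_turns: int = 8       # loop_boundary.max_reasoning_turns
    spent_usd: float = 0.0
    turns: int = 0

    def record_turn(self, cost_usd: float) -> str:
        """Record one reasoning turn; return 'ok', 'alert', or 'stop'."""
        self.turns += 1
        self.spent_usd += cost_usd
        # 'stop' means: do not start another turn, escalate with what you have.
        if self.spent_usd > self.max_usd or self.turns >= self.max_turns:
            return "stop"
        if self.spent_usd >= self.alert_usd:
            return "alert"
        return "ok"


if __name__ == "__main__":
    budget = LoopBudget()
    for cost in (0.10, 0.10, 0.10):
        print(budget.record_turn(cost), round(budget.spent_usd, 2))
```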
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Component | Mapping |
|---|---|
| Agent | Primary reasoner with Human-Approved authority (A3) |
| Tools | read report, extract indicators, score risk, stage containment; approval-gated: quarantine, block sender |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A3); high-risk actions gated; ≤ $0.25/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to SOC analyst on confirmed threat, uncertain verdict, or proposed containment |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Misclassifies a benign email as phishing, triggering unnecessary containment.
- Detection: A confidence score is attached and containment is staged, not auto-run.
- Mitigation: An analyst approves any quarantine or block.
- Recovery: The staged action is reversed and the message is un-quarantined.
Misses a real phish (false negative).
- Detection: Uncertain verdicts are routed up; indicators are extracted regardless of verdict.
- Mitigation: Human-in-the-loop on uncertain cases; the agent never auto-clears.
- Recovery: A post-report retro adds the missed indicator set to the rules.
Indicator extraction from a weaponized attachment exposes the analysis environment to a live payload.
- Detection: Analysis runs in a sandbox with no execution.
- Mitigation: Static indicator extraction only; live payloads are never opened.
- Recovery: The contained sample is escalated to the SOC.
Evaluation
Recall on true phishing is primary — a missed phish is the costly error — balanced against false-positive containment.
| Metric | What it measures |
|---|---|
| Verdict accuracy | Share of reported emails classified correctly as phishing or benign. |
| Recall | Of true phishing emails, the share correctly identified — weighted high. |
| Precision | Of emails flagged for containment, the share genuinely malicious — false-positive resistance. |
| Indicator accuracy | Correctness of the extracted indicators of compromise. |
| Latency | Time to a verdict per reported email. |
Recommended approach. Use a labeled corpus of reported emails — phishing, benign, and edge cases — and measure recall and precision separately. Verify staged containment never auto-runs, and check extracted indicators against known IOC feeds.
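Measuring recall and precision separately comes down to counting the four outcomes on the labeled corpus. A minimal sketch, assuming binary labels (`"phishing"` or `"benign"`) and a hypothetical `triage_metrics` helper; the sample counts are made up for illustration and are not results from this blueprint.

```python
def triage_metrics(pairs):
    """pairs: iterable of (true_label, predicted_label), each 'phishing' or 'benign'."""
    tp = fp = tn = fn = 0
    for truth, predicted in pairs:
        if truth == "phishing" and predicted == "phishing":
            tp += 1
        elif truth == "benign" and predicted == "phishing":
            fp += 1  # false-positive containment
        elif truth == "benign" and predicted == "benign":
            tn += 1
        else:
            fn += 1  # a real phish the agent did not flag
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "missed_phish": fn,
    }


if __name__ == "__main__":
    # Illustrative corpus: 10 phishing (8 caught), 10 benign (1 wrongly flagged).
    sample = (
        [("phishing", "phishing")] * 8
        + [("phishing", "benign")] * 2
        + [("benign", "benign")] * 9
        + [("benign", "phishing")] * 1
    )
    print(triage_metrics(sample))
```

Reporting the raw missed-phish count alongside recall keeps the costly error visible even when the corpus is small.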
When to use
Use it when
- Your security team triages a high volume of user-reported suspicious emails and most of the work is enrichment and obvious-case handling.
- You have a mail platform, plus a sandbox and a threat-intel source the agent can use for safe detonation and reputation lookups.
- You want consistent verdicts with an evidence trail and campaign scoping, plus a reply back to the reporter.
- You want to auto-handle the clear phishing and clear-safe cases while routing targeted attacks to humans.
Avoid it when
- You have no sandbox or threat-intel for safe analysis — verdicts would be guesses.
- You expect it to run full incident response on confirmed BEC autonomously; that needs humans.
- You can't give it scoped mail-platform access for quarantine actions.
- You are unwilling to keep approval gates on mass quarantine/purge and on high-impact blocks.
System prompt
You are a Phishing Triage Agent handling one user-reported suspicious email. Your job is to determine what it is, scope its spread, and respond safely — or escalate. You are judged on catching real phishing (never clearing a true threat), cutting false alarms, and never taking an unsafe analysis or response action.
== CORE PRINCIPLES ==
1. Evidence-based verdict. Base your classification on what you gathered — sender authentication (SPF/DKIM/DMARC), URL/domain reputation, sandbox detonation results, header anomalies, and content signals. Cite them. Never guess 'safe'.
2. Analyze safely. Detonate links and attachments ONLY in the sandbox. Never fetch, click, or open a suspect URL/attachment in the live environment, and never echo live malicious payloads.
3. Scope before you respond. A reported email is often one of many. Find the campaign so response covers everyone affected, not just the reporter.
== HARD RULES (NON-NEGOTIABLE) ==
- SANDBOX ONLY: All detonation/analysis of URLs and attachments happens in the sandbox. No live interaction with malicious infrastructure.
- MASS ACTION NEEDS APPROVAL: You may auto-quarantine the reported message and a tightly-scoped, high-confidence campaign on non-critical mailboxes. Org-wide purge, action affecting executives/critical mailboxes, or anything large-blast-radius REQUIRES human approval — propose it.
- NEVER FALSE-CLEAR: Mark an email 'safe' only with positive evidence (auth pass + known-good sender + clean indicators). Mixed or insufficient evidence is 'suspicious' → escalate, not cleared.
- BEC / SPEAR-PHISH → ESCALATE: Targeted impersonation (executive, vendor, wire/payment request), even with few classic indicators, is high-risk. Do not auto-classify-and-close; escalate to the SOC and warn about the requested action (e.g. wire).
- DATA HANDLING: Treat email content as sensitive; redact credentials/PII; stay within scope.
== RESPONSE POLICY (calibrated confidence 0.0-1.0) ==
- AUTO_CONTAIN: confirmed phishing/malicious, confidence >= 0.85, scoped to non-critical mailboxes. Quarantine the campaign, block indicators (URL/domain/sender), notify reporter.
- CLEAR (safe): positive benign evidence, confidence >= 0.85. Reassure the reporter; no action.
- PROPOSE: real but large-blast-radius response (org-wide purge, exec mailboxes). Recommend with evidence for one-click approval.
- ESCALATE: BEC/spear-phish, credential-harvest where users may have already entered creds, conflicting evidence, or confidence < 0.6.
== COST CONTROL ==
Enrich only the indicators that change the verdict; reuse results already gathered. One good detonation beats many redundant lookups. Cap tool calls; if exceeded, escalate with what you have.
== OUTPUT FORMAT (return ONE JSON object) ==
{
"verdict": "phishing|malicious|spam|safe|suspicious",
"confidence": <0.0-1.0>,
"evidence": ["<auth/reputation/sandbox/header signals>"],
"campaign": "<scope: how many recipients / similar messages, or 'single'>",
"decision": "AUTO_CONTAIN|CLEAR|PROPOSE|ESCALATE",
"actions": [ { "tool": "<tool>", "args": { ... }, "requires_approval": <bool> } ],
"reporter_reply": "<short, clear message to the user who reported it>",
"analyst_note": "<summary + cited evidence + any user-impact note>",
"escalation": { "needed": <bool>, "to": "soc|none", "reason": "<why, or empty>" }
}
If verdict is suspicious or evidence is mixed, do NOT CLEAR — ESCALATE or PROPOSE.
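The response policy in the system prompt can also be checked in code after the model returns its JSON object, so a bad decision is caught outside the model. The sketch below is an assumption about how such a check might look: `policy_violations` is a hypothetical helper, and it covers only the confidence thresholds and the no-false-clear rule stated in the prompt.

```python
def policy_violations(out: dict) -> list[str]:
    """Return the response-policy rules that a model output breaks (empty list if none)."""
    verdict = out.get("verdict")
    confidence = out.get("confidence", 0.0)
    decision = out.get("decision")
    problems = []
    if decision == "CLEAR" and (verdict != "safe" or confidence < 0.85):
        problems.append("CLEAR requires verdict 'safe' and confidence >= 0.85")
    if decision == "AUTO_CONTAIN" and (
        verdict not in ("phishing", "malicious") or confidence < 0.85
    ):
        problems.append("AUTO_CONTAIN requires phishing/malicious and confidence >= 0.85")
    if confidence < 0.6 and decision != "ESCALATE":
        problems.append("confidence < 0.6 must ESCALATE")
    if verdict == "suspicious" and decision not in ("ESCALATE", "PROPOSE"):
        problems.append("suspicious verdicts must ESCALATE or PROPOSE")
    if decision == "ESCALATE" and not out.get("escalation", {}).get("needed"):
        problems.append("ESCALATE must set escalation.needed")
    return problems


if __name__ == "__main__":
    bad = {"verdict": "suspicious", "confidence": 0.7, "decision": "CLEAR"}
    for problem in policy_violations(bad):
        print(problem)
```

An output with any violation would be routed to an analyst rather than acted on.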
Setup guide
Install and connect mail + sandbox
Install the agent and connect it to your mail platform, sandbox, and threat-intel.
pipx install phishing-triage-agent
phishing-triage-agent connect --mail o365 --sandbox cuckoo --intel virustotal
phishing-triage-agent doctor # verifies sandbox isolation + scoped mail access
Set response authority and caps
Define what may auto-contain. Mass/exec actions stay approval-gated. Enforced outside the model.
cp .env.example .env
# .env
ANTHROPIC_API_KEY=sk-ant-...
MAX_TOOL_CALLS=8
AUTO_CONTAIN_SCOPE=non_critical_mailboxes
MODE=advise # advise (recommend) | act (auto within scope)
Mark critical mailboxes & rules
Tell the agent which mailboxes are sensitive so it never auto-purges them, and how to scope campaigns.
# .phishing.yml
critical_mailboxes: ["exec/*", "finance/*", "legal/*"]
require_approval: ["org_wide_purge", "critical_mailbox_action"]
always_escalate: ["bec", "wire_request", "credential_harvest_clicked"]
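How the agent matches the `critical_mailboxes` patterns is not specified here. As an illustration only, the sketch below assumes shell-style globs and mailbox identifiers shaped like `finance/alice`; `is_critical` and `needs_approval` are hypothetical helpers.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .phishing.yml example above.
CRITICAL_MAILBOXES = ["exec/*", "finance/*", "legal/*"]


def is_critical(mailbox: str) -> bool:
    """True if the mailbox matches any critical pattern (case-insensitive, assumed)."""
    return any(fnmatchcase(mailbox.lower(), pattern) for pattern in CRITICAL_MAILBOXES)


def needs_approval(recipients: list[str]) -> bool:
    """A containment action touching any critical mailbox is approval-gated."""
    return any(is_critical(mailbox) for mailbox in recipients)


if __name__ == "__main__":
    print(is_critical("finance/alice"))
    print(needs_approval(["sales/bob", "support/carol"]))
```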
Backtest on past reports
Replay resolved phishing reports to measure verdict accuracy — especially missed true positives.
phishing-triage-agent backtest --range 30d --explain
# reports phishing/safe accuracy, FP rate, and missed-true-positive count (target ~0)
Wire to the abuse mailbox
Route user reports to the agent. Start in advise mode, then enable act mode for scoped containment once backtests are clean.
# 'Report Phishing' button / abuse@ -> POST https://your-host/phishing/report (HMAC)
# promote MODE=act after a clean backtest with zero missed true positives
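The report endpoint is marked as HMAC-protected, but the digest, encoding, and header name are not given. The sketch below assumes HMAC-SHA256 over the raw request body with a hex-encoded signature; `sign` and `verify` are hypothetical helpers, and the secret shown is a placeholder.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (algorithm is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


if __name__ == "__main__":
    secret = b"replace-with-shared-secret"  # placeholder, not a real key
    body = b'{"message_id": "example"}'
    signature = sign(secret, body)
    print(verify(secret, body, signature))
    print(verify(secret, body + b" ", signature))
```

Comparing with `hmac.compare_digest` instead of `==` avoids leaking how many leading characters matched.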
Architecture
Tools required
Workflow
1. Intake the report
Pull the reported email with full headers and raw body; extract sender, URLs, and attachments for analysis.
2. Check authentication & reputation
Evaluate SPF/DKIM/DMARC and look up URL/domain/sender reputation — the fastest legitimacy signals.
3. Detonate safely
Detonate URLs and attachments in the sandbox to reveal credential-harvest pages or malware, never touching live infrastructure.
4. Scope the campaign
Search the environment for similar messages to find every affected recipient, not just the reporter.
5. Decide the verdict
Classify on the combined evidence; only positive benign evidence clears an email, and mixed/insufficient evidence is suspicious.
6. Contain or escalate
Auto-contain scoped high-confidence phishing on non-critical mailboxes; propose mass/exec actions for approval; escalate BEC/spear-phish to the SOC.
7. Reply & learn
Send a clear reply to the reporter, log the evidence and outcome, and feed analyst agreement back to tune verdicts.
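Steps 1 and 2 of the workflow can be sketched with the Python standard library. This is an illustration under assumptions: the message is single-part plain text, the receiving server has already added an `Authentication-Results` header, and `intake` and `auth_passes` are hypothetical helpers. The sample message uses reserved example domains.

```python
import re
from email import message_from_string
from email.utils import parseaddr

RAW = """\
From: "IT Support" <no-reply@login.example.net>
To: user@example.com
Subject: Your mailbox is full
Authentication-Results: mx.example.com; spf=fail smtp.mailfrom=login.example.net; dkim=fail; dmarc=fail
Content-Type: text/plain

Re-validate here: https://login.example.net/verify
"""


def intake(raw: str) -> dict:
    """Extract sender, SPF/DKIM/DMARC results, and URLs from a raw message."""
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1]
    auth_header = msg.get("Authentication-Results", "")
    # Collect method=result pairs, e.g. spf=fail; later duplicates overwrite earlier ones.
    auth = dict(re.findall(r"\b(spf|dkim|dmarc)=(\w+)", auth_header))
    body = msg.get_payload()  # a string only for single-part messages (assumed)
    urls = re.findall(r"https?://[^\s<>\"']+", body)
    return {"sender": sender, "auth": auth, "urls": urls}


def auth_passes(auth: dict) -> bool:
    """Positive evidence requires all three checks to report 'pass'."""
    return all(auth.get(method) == "pass" for method in ("spf", "dkim", "dmarc"))


if __name__ == "__main__":
    report = intake(RAW)
    print(report)
    print("auth passes:", auth_passes(report["auth"]))
```

The extracted URLs are only collected here; per the workflow, anything that fetches them belongs in the sandbox.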
Examples
Credential-phishing with lookalike domain → contain the campaign
A classic Microsoft-login phish from a lookalike domain. Sandbox confirms a credential-harvest page; several colleagues got it too.
Input
Reported email: 'Your mailbox is full, re-validate here' from no-reply@micros0ft-login.com, link to hxxps://micros0ft-login.com/verify.
Output
{
"verdict": "phishing",
"confidence": 0.95,
"evidence": ["sender_auth_check: SPF/DKIM/DMARC all fail; domain micros0ft-login.com registered 3 days ago", "detonate_sandbox: link loads a fake Microsoft login that POSTs credentials to an external host", "url_reputation: domain on two phishing feeds"],
"campaign": "search_campaign: 23 recipients across 4 teams received the same message in the last hour (none in critical mailboxes)",
"decision": "AUTO_CONTAIN",
"actions": [
{ "tool": "quarantine_email", "args": { "scope": "campaign", "recipients": 23, "mailboxes": "non_critical" }, "requires_approval": false },
{ "tool": "block_indicator", "args": { "type": "domain", "value": "micros0ft-login.com" }, "requires_approval": false }
],
"reporter_reply": "Thanks for reporting this — it's a phishing attempt using a fake Microsoft login page. We've removed it from affected inboxes and blocked the sender. You did the right thing; never enter your password on links like this.",
"analyst_note": "Confirmed credential-phishing (auth fails, sandbox shows harvest page, domain 3 days old). Quarantined 23 copies, blocked domain. No critical mailboxes hit. Recommend checking if anyone submitted creds.",
"escalation": { "needed": false, "to": "none", "reason": "" }
}
Note: High-confidence phishing with strong evidence (auth fails + sandbox harvest page + young domain), scoped to non-critical mailboxes, so the agent auto-contains the whole 23-recipient campaign and blocks the domain, not just the one reported copy. The reporter gets a clear, encouraging reply.
Legit newsletter mistaken for phishing → clear with evidence
An anxious user reports a real marketing email. The agent confirms it's safe on positive evidence and reassures them — without taking action.
Input
Reported email: a product newsletter from news@updates.realvendor.com with an 'unsubscribe' link.
Output
{
"verdict": "safe",
"confidence": 0.9,
"evidence": ["sender_auth_check: SPF/DKIM/DMARC all pass, aligned with realvendor.com", "url_reputation: links resolve to realvendor.com, clean reputation, domain 6 years old", "detonate_sandbox: unsubscribe link loads a normal preference page, no credential prompt"],
"campaign": "single / bulk-marketing pattern, not targeted",
"decision": "CLEAR",
"actions": [],
"reporter_reply": "Good instinct to check! This one is legitimate — it's a marketing newsletter from RealVendor, and its sender authentication and links all check out. No action needed; you can unsubscribe using the link if you'd prefer not to receive it.",
"analyst_note": "Benign bulk marketing: auth passes and aligns, clean links, established domain, no harvest behavior. No action.",
"escalation": { "needed": false, "to": "none", "reason": "" }
}
Note: The agent only clears on positive evidence: auth passes and aligns, established clean domain, sandbox shows a normal page. It reassures the reporter and reinforces the good behavior of reporting, which keeps users engaged with security rather than discouraged.
CEO-impersonation wire request (BEC) → escalate, don't auto-close
A text-only email with few classic indicators asks finance to wire money urgently. Low on malware signals, high on risk. The agent escalates.
Input
Reported email to a finance staffer: 'Are you at your desk? I need you to process an urgent wire to a new vendor before EOD — keep this confidential. — [CEO name]', reply-to a Gmail address. No links or attachments.
Output
{
"verdict": "suspicious",
"confidence": 0.55,
"evidence": ["sender_auth_check: From spoofs the CEO display name; reply-to is an external gmail.com, not the corporate domain", "no URLs/attachments to detonate — text-only social-engineering", "content: urgency + confidentiality + new-vendor wire request = classic BEC pattern", "campaign: targeted to one finance employee, not bulk"],
"campaign": "single, targeted",
"decision": "ESCALATE",
"actions": [
{ "tool": "escalate_to_soc", "args": { "type": "bec", "priority": "high", "note": "Do not action the wire; verify CEO via known channel" }, "requires_approval": false }
],
"reporter_reply": "Thanks for flagging this — please do NOT process any payment from this email. It has signs of a business-email-compromise (CEO-impersonation) scam. Verify any such request with [CEO] through a known phone number, not by replying. Our security team is now reviewing it.",
"analyst_note": "Likely BEC: display-name spoof, external reply-to, urgency + confidentiality + new-vendor wire. No malware indicators to 'detonate' — the risk is the financial action. Escalated to SOC; advised the user not to wire.",
"escalation": { "needed": true, "to": "soc", "reason": "Targeted CEO-impersonation wire request (BEC) — financial-fraud risk, low classic indicators." }
}
Note: The defining case: there's no malicious URL or attachment to detonate, so a signature-driven tool would shrug. The agent recognizes the BEC social-engineering pattern, escalates to the SOC despite only moderate confidence, and, most importantly, tells the user not to wire the money and to verify out-of-band. The real payload here is the financial action, not malware.
Implementation notes
- Keep all detonation sandbox-only and enforce it outside the model; the agent must never interact with live malicious infrastructure.
- Require positive benign evidence to clear an email. Mixed or insufficient evidence is 'suspicious' and escalates — false clears are the most damaging error.
- Scope the campaign before responding so containment covers every affected recipient, not just the reporter; gate org-wide and exec-mailbox actions behind approval.
- Treat BEC/spear-phishing as escalate-by-default: these have few classic indicators but the highest impact, and the right move is a human plus a warning about the requested action.
- Always reply to the reporter — clearly, and encouragingly even when it's a false alarm. Reporting culture is a security asset worth protecting.
- Backtest on resolved reports and track missed-true-positive rate as the primary safety metric before enabling any auto-containment.
- Reserve the strong model for the verdict and BEC judgment; a cheaper model can parse headers and run reputation lookups.
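The rule against echoing live payloads is commonly met by defanging URLs before they appear in replies or notes, as the `hxxps://` form in the first example does. A minimal sketch of one such convention, assuming a hypothetical `defang_url` helper; bracketing the dots goes a step further than the example and is this sketch's own choice.

```python
from urllib.parse import urlsplit


def defang_url(url: str) -> str:
    """Rewrite a URL so it is no longer clickable: http -> hxxp, '.' -> '[.]' in the host."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower().replace("http", "hxxp")
    host = parts.netloc.replace(".", "[.]")
    rest = parts.path
    if parts.query:
        rest += f"?{parts.query}"
    if parts.fragment:
        rest += f"#{parts.fragment}"
    return f"{scheme}://{host}{rest}"


if __name__ == "__main__":
    print(defang_url("https://login.example.net/verify"))
```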
Variations
Basic
Triage & enrichment assistant
Enriches the reported email, detonates indicators in the sandbox, returns a verdict with evidence and a suggested response for an analyst. No autonomous containment.
Advanced
Guarded auto-containment
Auto-quarantines scoped, high-confidence campaigns on non-critical mailboxes and blocks indicators, with campaign scoping, reporter replies, and approval-gated mass/exec actions.
Enterprise
Governed phishing response
Adds multi-tenant mail integration, critical-mailbox policies, SOC/IR routing, full evidence audit, BEC analytics, and verdict calibration from analyst feedback at scale.
This flagship blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
Frequently asked questions
The agent auto-quarantines only a scoped, high-confidence campaign on non-critical mailboxes. Org-wide purges and anything touching executive or finance mailboxes are proposed for human approval, never executed autonomously.
Analysis is safe for your environment: all detonation happens in an isolated sandbox. The agent never clicks, fetches, or opens suspect URLs/attachments in your live environment, and it doesn't echo live payloads.
To avoid false clears, it only marks an email 'safe' with positive evidence: authentication passing and aligned, a known-good established sender, and clean sandbox/reputation results. Mixed or insufficient evidence is treated as suspicious and escalated.
CEO-impersonation and wire-request emails (business email compromise) are escalated to your SOC by default, even with few classic indicators, and the agent tells the reporter not to action the request and to verify it out-of-band. The risk is the financial action, not malware.
It scopes the whole campaign: it searches your mail environment for similar messages to find every affected recipient, so containment and blocking cover the whole campaign rather than the single reported copy.
To roll out safely, start in advise mode where it only recommends, backtest on resolved reports to confirm it isn't missing true positives, then enable auto-containment for scoped, non-critical cases.