Frequently asked questions

Will it delete emails from people's inboxes automatically?

Only for a scoped, high-confidence campaign on non-critical mailboxes. Org-wide purges and anything touching executive or finance mailboxes are proposed for human approval, never executed autonomously.

Is it safe to let it analyze malicious links and attachments?

Yes — all detonation happens in an isolated sandbox. It never clicks, fetches, or opens suspect URLs/attachments in your live environment, and it doesn't echo live payloads.
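One common way to avoid echoing a live link in replies or analyst notes is to "defang" it before it is written anywhere. The sketch below is a minimal illustration of that idea, not the agent's actual implementation; the function name `defang` and the `[.]` convention for dots are assumptions.

```python
import re


def defang(text: str) -> str:
    """Rewrite http(s) URLs so they cannot be clicked by accident:
    the scheme becomes hxxp/hxxps and dots in the host become [.]."""

    def _defang_url(match: re.Match) -> str:
        scheme, rest = match.group(1), match.group(2)
        host, sep, path = rest.partition("/")
        safe_scheme = "hxxp" + scheme[4:].lower()  # keeps a trailing "s" if present
        return safe_scheme + "://" + host.replace(".", "[.]") + sep + path

    return re.sub(r"\b(https?)://(\S+)", _defang_url, text, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(defang("Link in the report: https://micros0ft-login.com/verify"))
```

Only the host is rewritten, so the path stays readable for analysts comparing indicators.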

How does it avoid clearing a real phishing email?

It only marks an email 'safe' with positive evidence — authentication passing and aligned, a known-good established sender, and clean sandbox/reputation results. Mixed or insufficient evidence is treated as suspicious and escalated.

What about CEO-impersonation / wire-fraud emails with no links?

Those (business email compromise) are escalated to your SOC by default, even with few classic indicators, and the agent tells the reporter not to action the request and to verify it out-of-band. The risk is the financial action, not malware.

Does it handle a phishing campaign, not just one email?

Yes. It searches your mail environment for similar messages to scope every affected recipient, so containment and blocking cover the whole campaign rather than the single reported copy.

How do we adopt it safely?

Start in advise mode where it only recommends, backtest on resolved reports to confirm it isn't missing true positives, then enable auto-containment for scoped, non-critical cases.

Phishing Triage & Response Agent

Overview

Enrich → detonate → scope → respond: turns a reported email into a verdict, a campaign view, and a contained incident.

Safe analysis: URLs and attachments are detonated only in a sandbox, and sender auth (SPF/DKIM/DMARC) and reputation ground the verdict.

Campaign-aware: it finds who else received the message so the response covers the whole campaign, not just one inbox.

Defensive: no org-wide purge without approval, no 'safe' verdict on weak evidence, and targeted BEC/spear-phishing escalates to humans.

AgentAz™ specification

A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.

Trust Level	A3 (Human-Approved)
DNA Pattern	Escalation (Research → Evaluate → Plan → Escalate)
Worst-Case Action	Stages an incorrect containment action (such as a quarantine) that an analyst approves before it runs, or misclassifies a reported email. It cannot quarantine, block, or delete on its own; those actions require human approval.
Authority Boundary	Analyzes a reported email, extracts indicators, scores risk, and stages a recommended containment action for analyst approval. It never quarantines, blocks senders, or deletes mail autonomously. An analyst approves any action.
Verification Test	Stage a containment action → confirm it requires explicit analyst approval and is not auto-executed; confirm destructive actions are gated.
Production Readiness	6/6 dimensions passing. Tool isolation: containment actions gated behind approval. Human gates: an analyst approves. Confidence escalation: uncertain verdicts routed up. Cost ceiling: bounded per report. Audit trail: indicators and decisions logged. Escalation path: confirmed threats routed to the SOC.
Last Reviewed	2026-06-24

Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:

agentaz.json

{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "phishing-triage-agent",
  "trust_level": "A3",
  "dna_pattern": "Escalation",
  "worst_case_action": "Stages an incorrect quarantine for analyst approval, or misclassifies an email. Cannot auto-contain.",
  "authority_boundary": "Analyzes reported phishing and stages containment for approval; no autonomous quarantine/block.",
  "tags": [
    "security",
    "phishing",
    "soc",
    "human-approval"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_report",
      "extract_indicators",
      "score_risk",
      "stage_containment"
    ],
    "approval_required_tools": [
      "quarantine",
      "block_sender"
    ],
    "execution_tools_absent": false
  },
  "output_boundary": {
    "format": "structured_json",
    "never_without_approval": [
      "quarantine",
      "block_sender",
      "delete_mail"
    ]
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.25,
    "alert_threshold_usd": 0.16
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "confirmed_threat",
      "uncertain_verdict",
      "containment_proposed"
    ],
    "destination": "soc_analyst"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "indicators",
      "verdict",
      "staged_actions",
      "approvals"
    ]
  }
}
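Because the spec documents authority rather than enforcing it, a runtime policy layer has to read the contract itself. The following is a minimal sketch of how such a layer might classify a requested tool against `tool_boundary`. The function name `classify_tool` and its three return labels are assumptions for illustration and are not defined by the AgentAz schema.

```python
import json

# Trimmed copy of the tool_boundary section from agentaz.json above.
CONTRACT = json.loads("""
{
  "tool_boundary": {
    "allowed_tools": ["read_report", "extract_indicators", "score_risk", "stage_containment"],
    "approval_required_tools": ["quarantine", "block_sender"]
  }
}
""")


def classify_tool(contract: dict, tool: str) -> str:
    """Return 'allowed', 'needs_approval', or 'denied' (hypothetical labels)."""
    boundary = contract["tool_boundary"]
    if tool in boundary["approval_required_tools"]:
        return "needs_approval"
    if tool in boundary["allowed_tools"]:
        return "allowed"
    # Assumption: anything the contract does not list is out of bounds.
    return "denied"


if __name__ == "__main__":
    for name in ("score_risk", "quarantine", "delete_mail"):
        print(name, "->", classify_tool(CONTRACT, name))
```

Checking the approval list first means a tool that appears in both lists by mistake is still gated.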

New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.

This is a flagship reference blueprint for AgentAz v1.0.0. AgentAz™ is open source under Apache-2.0 (spec text under CC‑BY‑4.0) — schema and source on GitHub.

Governance matrix

A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.

Agent goal	Bounded by the authority spec above
Trust Level	A3 — Human-Approved
Tool access	Scoped tools; high-risk actions gated behind approval
Context handling	Grounded in provided inputs; cites or flags rather than guessing
Memory strategy	Task-scoped; no persistent cross-session memory
Human approval	Required on confirmed threat, uncertain verdict, or proposed containment → SOC analyst
Audit trail	Append-only log (indicators, verdict, staged actions, approvals)
Cost & loop bounds	≤ $0.25 per loop · ≤ 8 reasoning turns
Recovery / escalation	Escalates to the SOC analyst
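The cost and loop bounds in the matrix can be tracked with a small counter that sits outside the model. This is a sketch under stated assumptions: the class name `LoopBudget` and the `ok`/`alert`/`stop` labels are hypothetical, and only the three numeric limits come from the contract above.

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    """Tracks spend and turns for one trace loop (hypothetical helper)."""

    max_usd: float = 0.25    # cost_boundary.max_usd_per_trace_loop
    alert_usd: float = 0.16  # cost_boundary.alert_threshold_usd
    max_turns: int = 8       # loop_boundary.max_reasoning_turns
    spent_usd: float = 0.0
    turns: int = 0

    def record_turn(self, cost_usd: float) -> str:
        """Record one reasoning turn and return 'ok', 'alert', or 'stop'."""
        self.spent_usd += cost_usd
        self.turns += 1
        if self.spent_usd > self.max_usd or self.turns >= self.max_turns:
            return "stop"  # no further turns; the caller should escalate
        if self.spent_usd >= self.alert_usd:
            return "alert"
        return "ok"


if __name__ == "__main__":
    budget = LoopBudget()
    for cost in (0.05, 0.12, 0.10):
        print(budget.record_turn(cost))
```

On `stop`, the caller would hand off to the SOC analyst with whatever evidence has been gathered, in line with the escalation row above.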

Agent component mapping

A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.

Agent	Primary reasoner — Human-Approved authority (A3)
Tools	read report, extract indicators, score risk, stage containment; approval-gated: quarantine, block sender
Memory	Task-scoped working context; no persistent cross-session memory
Guardrails	Worst-case classified (A3); high-risk actions gated; ≤ $0.25/loop · ≤ 8 turns
Evaluator	Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned
Handoff	Escalates to the SOC analyst on confirmed threat, uncertain verdict, or proposed containment

Failure modes

Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.

Misclassifies a benign email as phishing, triggering unnecessary containment.

Detection: A confidence score is attached and containment is staged, not auto-run.
Mitigation: An analyst approves any quarantine or block.
Recovery: The staged action is reversed and the message is un-quarantined.

Misses a real phish (false negative).

Detection: Uncertain verdicts are routed up; indicators are extracted regardless of verdict.
Mitigation: Human-in-the-loop on uncertain cases; the agent never auto-clears.
Recovery: A post-report retro adds the missed indicator set to the rules.

Extracts indicators from a weaponized attachment, risking exposure to a live payload.

Detection: Analysis runs in a sandbox with no execution.
Mitigation: Static indicator extraction only; live payloads are never opened.
Recovery: The contained sample is escalated to the SOC.

Evaluation

Recall on true phishing is primary — a missed phish is the costly error — balanced against false-positive containment.

Verdict accuracy	Share of reported emails classified correctly as phishing or benign.
Recall	Of true phishing emails, the share correctly identified — weighted high.
Precision	Of emails flagged for containment, the share genuinely malicious — false-positive resistance.
Indicator accuracy	Correctness of the extracted indicators of compromise.
Latency	Time to a verdict per reported email.

Recommended approach. Use a labeled corpus of reported emails — phishing, benign, and edge cases — and measure recall and precision separately. Verify staged containment never auto-runs, and check extracted indicators against known IOC feeds.
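To make "measure recall and precision separately" concrete, here is a small sketch that scores a labeled corpus. The function name and the boolean-list input shape are assumptions; `labels` marks emails that are truly phishing and `flagged` marks emails the agent flagged for containment.

```python
def recall_and_precision(labels, flagged):
    """Return (recall, precision, missed_true_positives) for paired booleans."""
    pairs = list(zip(labels, flagged))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)  # missed phish
    fp = sum(1 for truth, pred in pairs if not truth and pred)  # false alarm
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision, fn


if __name__ == "__main__":
    labels = [True, True, True, False, False]
    flagged = [True, True, False, True, False]
    recall, precision, missed = recall_and_precision(labels, flagged)
    print(f"recall={recall:.2f} precision={precision:.2f} missed={missed}")
```

Reporting the missed count alongside the ratios keeps the costly error visible even when the corpus is small.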

When to use

Use it when

Your security team triages a high volume of user-reported suspicious emails and most of the work is enrichment and obvious-case handling.
You have a mail platform, plus a sandbox and a threat-intel source the agent can use for safe detonation and reputation lookups.
You want consistent verdicts with an evidence trail and campaign scoping, plus a reply back to the reporter.
You want to auto-handle the clear phishing and clear-safe cases while routing targeted attacks to humans.

Avoid it when

You have no sandbox or threat-intel source for safe analysis; verdicts would be guesses.
You expect it to run full incident response on confirmed BEC autonomously; that needs humans.
You can't give it scoped mail-platform access for quarantine actions.
You are unwilling to keep approval gates on mass quarantine/purge and on high-impact blocks.

System prompt

system-prompt.md

You are a Phishing Triage Agent handling one user-reported suspicious email. Your job is to determine what it is, scope its spread, and respond safely — or escalate. You are judged on catching real phishing (never clearing a true threat), cutting false alarms, and never taking an unsafe analysis or response action.

== CORE PRINCIPLES ==
1. Evidence-based verdict. Base your classification on what you gathered — sender authentication (SPF/DKIM/DMARC), URL/domain reputation, sandbox detonation results, header anomalies, and content signals. Cite them. Never guess 'safe'.
2. Analyze safely. Detonate links and attachments ONLY in the sandbox. Never fetch, click, or open a suspect URL/attachment in the live environment, and never echo live malicious payloads.
3. Scope before you respond. A reported email is often one of many. Find the campaign so response covers everyone affected, not just the reporter.

== HARD RULES (NON-NEGOTIABLE) ==
- SANDBOX ONLY: All detonation/analysis of URLs and attachments happens in the sandbox. No live interaction with malicious infrastructure.
- MASS ACTION NEEDS APPROVAL: You may auto-quarantine the reported message and a tightly scoped, high-confidence campaign on non-critical mailboxes. An org-wide purge, any action affecting executive or other critical mailboxes, or anything with a large blast radius REQUIRES human approval; propose it instead of executing it.
- NEVER FALSE-CLEAR: Mark an email 'safe' only with positive evidence (auth pass + known-good sender + clean indicators). Mixed or insufficient evidence is 'suspicious' → escalate, not cleared.
- BEC / SPEAR-PHISH → ESCALATE: Targeted impersonation (executive, vendor, wire/payment request), even with few classic indicators, is high-risk. Do not auto-classify-and-close; escalate to the SOC and warn about the requested action (e.g. wire).
- DATA HANDLING: Treat email content as sensitive; redact credentials/PII; stay within scope.

== RESPONSE POLICY (calibrated confidence 0.0-1.0) ==
- AUTO_CONTAIN: confirmed phishing/malicious, confidence >= 0.85, scoped to non-critical mailboxes. Quarantine the campaign, block indicators (URL/domain/sender), notify reporter.
- CLEAR (safe): positive benign evidence, confidence >= 0.85. Reassure the reporter; no action.
- PROPOSE: real but large-blast-radius response (org-wide purge, exec mailboxes). Recommend with evidence for one-click approval.
- ESCALATE: BEC/spear-phish, credential-harvest where users may have already entered creds, conflicting evidence, or confidence < 0.6.

== COST CONTROL ==
Enrich only the indicators that change the verdict; reuse results already gathered. One good detonation beats many redundant lookups. Cap tool calls; if exceeded, escalate with what you have.

== OUTPUT FORMAT (return ONE JSON object) ==
{
  "verdict": "phishing|malicious|spam|safe|suspicious",
  "confidence": <0.0-1.0>,
  "evidence": ["<auth/reputation/sandbox/header signals>"],
  "campaign": "<scope: how many recipients / similar messages, or 'single'>",
  "decision": "AUTO_CONTAIN|CLEAR|PROPOSE|ESCALATE",
  "actions": [ { "tool": "<tool>", "args": { ... }, "requires_approval": <bool> } ],
  "reporter_reply": "<short, clear message to the user who reported it>",
  "analyst_note": "<summary + cited evidence + any user-impact note>",
  "escalation": { "needed": <bool>, "to": "soc|none", "reason": "<why, or empty>" }
}
If verdict is suspicious or evidence is mixed, do NOT CLEAR — ESCALATE or PROPOSE.
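The response policy in the prompt can also be mirrored by a deterministic check outside the model. The sketch below is one possible reading of that policy, not the blueprint's own code. The function name and keyword flags are hypothetical, and the prompt does not say what to do with confidence between 0.6 and 0.85 or with a `spam` verdict, so this sketch assumes both go to a human.

```python
def decide(
    verdict: str,
    confidence: float,
    *,
    targeted_bec: bool = False,
    creds_possibly_entered: bool = False,
    large_blast_radius: bool = False,
    positive_benign_evidence: bool = False,
) -> str:
    """Map a verdict and its context to one of the four decisions."""
    # Escalate-by-default cases are checked before anything else.
    if targeted_bec or creds_possibly_entered or confidence < 0.6:
        return "ESCALATE"
    if verdict in ("phishing", "malicious") and confidence >= 0.85:
        # Large blast radius (org-wide purge, exec mailboxes) is proposed, not run.
        return "PROPOSE" if large_blast_radius else "AUTO_CONTAIN"
    if verdict == "safe" and positive_benign_evidence and confidence >= 0.85:
        return "CLEAR"
    # Assumption: everything else (mixed evidence, mid confidence, spam) escalates.
    return "ESCALATE"


if __name__ == "__main__":
    print(decide("phishing", 0.95))
    print(decide("suspicious", 0.55, targeted_bec=True))
```

A `safe` verdict without the positive-evidence flag falls through to escalation, which matches the NEVER FALSE-CLEAR rule.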


Setup guide

Install and connect mail + sandbox

Install the agent and connect it to your mail platform, sandbox, and threat-intel.

shell

pipx install phishing-triage-agent
phishing-triage-agent connect --mail o365 --sandbox cuckoo --intel virustotal
phishing-triage-agent doctor   # verifies sandbox isolation + scoped mail access

Set response authority and caps

Define what the agent may auto-contain. Mass and executive-mailbox actions stay approval-gated, and these limits are enforced outside the model.

shell

cp .env.example .env
# Then edit .env and set the values below (file contents, not shell commands):
ANTHROPIC_API_KEY=sk-ant-...
MAX_TOOL_CALLS=8
AUTO_CONTAIN_SCOPE=non_critical_mailboxes
MODE=advise   # advise (recommend) | act (auto within scope)

Mark critical mailboxes & rules

Tell the agent which mailboxes are sensitive so it never auto-purges them, and how to scope campaigns.

yaml

# .phishing.yml
critical_mailboxes: ["exec/*", "finance/*", "legal/*"]
require_approval: ["org_wide_purge", "critical_mailbox_action"]
always_escalate: ["bec", "wire_request", "credential_harvest_clicked"]
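The `critical_mailboxes` patterns above look like shell-style globs, so a check against them could be sketched as follows. This assumes mailbox identifiers take a `group/name` shape such as `finance/ap-team` and that glob matching is the intended semantics; neither is confirmed by the configuration itself, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Copied from critical_mailboxes in .phishing.yml above.
CRITICAL_MAILBOXES = ["exec/*", "finance/*", "legal/*"]


def is_critical(mailbox: str, patterns=CRITICAL_MAILBOXES) -> bool:
    """True if the mailbox identifier matches any critical pattern."""
    return any(fnmatchcase(mailbox, pattern) for pattern in patterns)


def campaign_touches_critical(recipient_mailboxes) -> bool:
    """True if at least one affected recipient is in a critical mailbox."""
    return any(is_critical(mailbox) for mailbox in recipient_mailboxes)


if __name__ == "__main__":
    print(is_critical("finance/ap-team"))
    print(campaign_touches_critical(["sales/jane", "support/queue"]))
```

`fnmatchcase` is used instead of `fnmatch` so the result does not depend on the operating system's filename case rules.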

Backtest on past reports

Replay resolved phishing reports to measure verdict accuracy — especially missed true positives.

shell

phishing-triage-agent backtest --range 30d --explain
# reports phishing/safe accuracy, FP rate, and missed-true-positive count (target ~0)

Wire to the abuse mailbox

Route user reports to the agent. Start in advise mode, then enable act mode for scoped containment once backtests are clean.

shell

# 'Report Phishing' button / abuse@ -> POST https://your-host/phishing/report (HMAC)
# promote MODE=act after a clean backtest with zero missed true positives
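The `(HMAC)` note on the report endpoint implies the receiver verifies a signature before trusting a report. A minimal sketch of that check, assuming HMAC-SHA256 over the raw request body with a hex-encoded signature: the algorithm, the encoding, and the function name are all assumptions, since the setup guide does not specify them.

```python
import hashlib
import hmac


def verify_report_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True only if signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Compare as bytes in constant time; encoding also tolerates non-ASCII input.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


if __name__ == "__main__":
    secret = b"shared-secret-for-illustration"
    body = b'{"report_id": "example"}'
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    print(verify_report_signature(secret, body, signature))
```

Verification should run on the raw bytes received, before any JSON parsing, so that re-serialization cannot change what was signed.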

Architecture

Report intake	Ingests the user-reported email (with full headers and raw body) from the mail platform or abuse mailbox and normalizes sender, subject, URLs, and attachments.
Authentication & reputation	Checks SPF/DKIM/DMARC alignment and looks up domain/URL/sender reputation against threat intel, the first and cheapest signals of legitimacy.
Sandbox detonation	Detonates URLs and attachments in an isolated sandbox to observe redirects, credential-harvest pages, and malware behavior, never touching live infrastructure.
Campaign scoping	Searches the mail environment for similar messages to determine how many recipients are affected, turning one report into the full blast radius.
Verdict & response gate	The model classifies on the combined evidence; a deterministic gate enforces sandbox-only analysis and routes mass/exec-impacting actions to human approval.
Containment & comms	Quarantines the scoped campaign, blocks indicators, and replies to the reporter; high-impact actions are staged for approval.
Escalation & learning	Routes BEC/spear-phish to the SOC with evidence, and logs analyst agreement to tune verdicts and reduce false clears/alarms.
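The deterministic gate described in the architecture can be pictured as a pass over the model's proposed actions that forces the approval flag on where policy requires it. This is a sketch under assumptions: the `org_wide` scope label, the `mode` values mirroring `MODE=advise|act`, and the function name are hypothetical, while the tool names come from the blueprint's tool list.

```python
import copy

CONTAINMENT_TOOLS = {"quarantine_email", "block_indicator"}
MASS_SCOPES = {"org_wide"}  # hypothetical label for an org-wide purge


def gate_actions(actions, *, mode="advise", touches_critical_mailbox=False):
    """Return a copy of the actions with requires_approval forced to True
    wherever policy demands it. The gate only raises the bar set by the
    model; it never clears an approval flag the model already set."""
    gated = []
    for original in actions:
        action = copy.deepcopy(original)  # leave the model output untouched
        if action["tool"] in CONTAINMENT_TOOLS:
            scope = action.get("args", {}).get("scope")
            if (
                mode != "act"                 # advise mode: recommend only
                or touches_critical_mailbox   # critical recipients involved
                or scope in MASS_SCOPES       # mass action
            ):
                action["requires_approval"] = True
        gated.append(action)
    return gated


if __name__ == "__main__":
    proposed = [
        {"tool": "quarantine_email", "args": {"scope": "campaign"}, "requires_approval": False},
        {"tool": "escalate_to_soc", "args": {"type": "bec"}, "requires_approval": False},
    ]
    for action in gate_actions(proposed, mode="act", touches_critical_mailbox=True):
        print(action["tool"], action["requires_approval"])
```

Keeping this logic in plain code rather than in the prompt is what makes the approval rule hold even if the model's output is wrong.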

Tools required

get_reported_email	Retrieve the reported email with full headers, raw body, URLs, and attachments from the mail platform/abuse mailbox.
sender_auth_check	Evaluate SPF, DKIM, and DMARC alignment and sender/domain age and reputation.
url_reputation	Look up URLs/domains against threat-intel for reputation, known-phishing lists, and first-seen age.
detonate_sandbox	Open URLs and attachments in an isolated sandbox to observe redirects, credential-harvest pages, and malware behavior. Never runs live.
search_campaign	Search the mail environment for similar/related messages to scope how many recipients received the same attack.
quarantine_email	Quarantine/remove messages. Auto-allowed for scoped, high-confidence campaigns on non-critical mailboxes; mass/exec actions are approval-gated.
block_indicator	Block a malicious URL, domain, or sender at the mail gateway/proxy to stop further delivery and clicks.
escalate_to_soc	Route to the SOC with the evidence package for BEC/spear-phish, credential-harvest exposure, or uncertain high-risk cases.

Workflow

1. Intake the report
Pull the reported email with full headers and raw body; extract sender, URLs, and attachments for analysis.
2. Check authentication & reputation
Evaluate SPF/DKIM/DMARC and look up URL/domain/sender reputation — the fastest legitimacy signals.
3. Detonate safely
Detonate URLs and attachments in the sandbox to reveal credential-harvest pages or malware, never touching live infrastructure.
4. Scope the campaign
Search the environment for similar messages to find every affected recipient, not just the reporter.
5. Decide the verdict
Classify on the combined evidence; only positive benign evidence clears an email, and mixed/insufficient evidence is suspicious.
6. Contain or escalate
Auto-contain scoped high-confidence phishing on non-critical mailboxes; propose mass/exec actions for approval; escalate BEC/spear-phish to the SOC.
7. Reply & learn
Send a clear reply to the reporter, log the evidence and outcome, and feed analyst agreement back to tune verdicts.
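Step 4 of the workflow, scoping the campaign, can be illustrated with a simple similarity rule. The sketch below treats a message as part of the campaign when it shares the reported message's sender domain or at least one URL host. That rule, the dictionary shape of a message, and the function names are assumptions for illustration; a real `search_campaign` tool would query the mail platform and likely use richer signals.

```python
from urllib.parse import urlsplit


def sender_domain(address: str) -> str:
    """Domain part of an email address, lowercased."""
    return address.rsplit("@", 1)[-1].lower()


def url_hosts(urls) -> set:
    """Set of hostnames found in the given URLs (defanged schemes still parse)."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def scope_campaign(reported: dict, candidates: list) -> list:
    """Sorted recipients of messages that share the reported message's
    sender domain or at least one URL host, including the reporter."""
    domain = sender_domain(reported["sender"])
    hosts = url_hosts(reported["urls"])
    recipients = {reported["recipient"]}
    for message in candidates:
        same_sender = sender_domain(message["sender"]) == domain
        shared_host = bool(hosts & url_hosts(message["urls"]))
        if same_sender or shared_host:
            recipients.add(message["recipient"])
    return sorted(recipients)
```

Matching on sender domain alone would over-match for shared providers such as free webmail, so a production rule would need an allowance for those.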

Examples

Credential-phishing with lookalike domain → contain the campaign

A classic Microsoft-login phish from a lookalike domain. Sandbox confirms a credential-harvest page; several colleagues got it too.

Input

input

Reported email: 'Your mailbox is full, re-validate here' from no-reply@micros0ft-login.com, link to hxxps://micros0ft-login.com/verify.

Output

output

{
  "verdict": "phishing",
  "confidence": 0.95,
  "evidence": ["sender_auth_check: SPF/DKIM/DMARC all fail; domain micros0ft-login.com registered 3 days ago", "detonate_sandbox: link loads a fake Microsoft login that POSTs credentials to an external host", "url_reputation: domain on two phishing feeds"],
  "campaign": "search_campaign: 23 recipients across 4 teams received the same message in the last hour (none in critical mailboxes)",
  "decision": "AUTO_CONTAIN",
  "actions": [
    { "tool": "quarantine_email", "args": { "scope": "campaign", "recipients": 23, "mailboxes": "non_critical" }, "requires_approval": false },
    { "tool": "block_indicator", "args": { "type": "domain", "value": "micros0ft-login.com" }, "requires_approval": false }
  ],
  "reporter_reply": "Thanks for reporting this — it's a phishing attempt using a fake Microsoft login page. We've removed it from affected inboxes and blocked the sender. You did the right thing; never enter your password on links like this.",
  "analyst_note": "Confirmed credential-phishing (auth fails, sandbox shows harvest page, domain 3 days old). Quarantined 23 copies, blocked domain. No critical mailboxes hit. Recommend checking if anyone submitted creds.",
  "escalation": { "needed": false, "to": "none", "reason": "" }
}

Note: High-confidence phishing with strong evidence (auth fails + sandbox harvest page + young domain), scoped to non-critical mailboxes, so the agent auto-contains the whole 23-recipient campaign and blocks the domain — not just the one reported copy. The reporter gets a clear, encouraging reply.

Legit newsletter mistaken for phishing → clear with evidence

An anxious user reports a real marketing email. The agent confirms it's safe on positive evidence and reassures them — without taking action.

Input

input

Reported email: a product newsletter from news@updates.realvendor.com with an 'unsubscribe' link.

Output

output

{
  "verdict": "safe",
  "confidence": 0.9,
  "evidence": ["sender_auth_check: SPF/DKIM/DMARC all pass, aligned with realvendor.com", "url_reputation: links resolve to realvendor.com, clean reputation, domain 6 years old", "detonate_sandbox: unsubscribe link loads a normal preference page, no credential prompt"],
  "campaign": "single / bulk-marketing pattern, not targeted",
  "decision": "CLEAR",
  "actions": [],
  "reporter_reply": "Good instinct to check! This one is legitimate — it's a marketing newsletter from RealVendor, and its sender authentication and links all check out. No action needed; you can unsubscribe using the link if you'd prefer not to receive it.",
  "analyst_note": "Benign bulk marketing: auth passes and aligns, clean links, established domain, no harvest behavior. No action.",
  "escalation": { "needed": false, "to": "none", "reason": "" }
}

Note: The agent only clears on positive evidence — auth passes and aligns, established clean domain, sandbox shows a normal page. It reassures the reporter and reinforces the good behavior of reporting, which keeps users engaged with security rather than discouraged.

CEO-impersonation wire request (BEC) → escalate, don't auto-close

A text-only email with few classic indicators asks finance to wire money urgently. Low on malware signals, high on risk. The agent escalates.

Input

input

Reported email to a finance staffer: 'Are you at your desk? I need you to process an urgent wire to a new vendor before EOD — keep this confidential. — [CEO name]', reply-to a Gmail address. No links or attachments.

Output

output

{
  "verdict": "suspicious",
  "confidence": 0.55,
  "evidence": ["sender_auth_check: From spoofs the CEO display name; reply-to is an external gmail.com, not the corporate domain", "no URLs/attachments to detonate — text-only social-engineering", "content: urgency + confidentiality + new-vendor wire request = classic BEC pattern", "campaign: targeted to one finance employee, not bulk"],
  "campaign": "single, targeted",
  "decision": "ESCALATE",
  "actions": [
    { "tool": "escalate_to_soc", "args": { "type": "bec", "priority": "high", "note": "Do not action the wire; verify CEO via known channel" }, "requires_approval": false }
  ],
  "reporter_reply": "Thanks for flagging this — please do NOT process any payment from this email. It has signs of a business-email-compromise (CEO-impersonation) scam. Verify any such request with [CEO] through a known phone number, not by replying. Our security team is now reviewing it.",
  "analyst_note": "Likely BEC: display-name spoof, external reply-to, urgency + confidentiality + new-vendor wire. No malware indicators to 'detonate' — the risk is the financial action. Escalated to SOC; advised the user not to wire.",
  "escalation": { "needed": true, "to": "soc", "reason": "Targeted CEO-impersonation wire request (BEC) — financial-fraud risk, low classic indicators." }
}

Note: The defining case: there's no malicious URL or attachment to detonate, so a signature-driven tool would shrug. The agent recognizes the BEC social-engineering pattern, escalates to the SOC despite only moderate confidence, and — most importantly — tells the user not to wire the money and to verify out-of-band. The real payload here is the financial action, not malware.
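Before acting on objects like the three outputs above, a caller would typically validate them against the format the system prompt asks for. The sketch below checks the required keys and a few invariants stated in the prompt. The function name and the wording of the problem messages are assumptions, and the blueprint may well ship a stricter check.

```python
REQUIRED_KEYS = {
    "verdict", "confidence", "evidence", "campaign", "decision",
    "actions", "reporter_reply", "analyst_note", "escalation",
}
VERDICTS = {"phishing", "malicious", "spam", "safe", "suspicious"}
DECISIONS = {"AUTO_CONTAIN", "CLEAR", "PROPOSE", "ESCALATE"}


def validate_output(obj: dict) -> list:
    """Return a list of problems; an empty list means the object passed."""
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if obj["verdict"] not in VERDICTS:
        problems.append("unknown verdict")
    if obj["decision"] not in DECISIONS:
        problems.append("unknown decision")
    conf = obj["confidence"]
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")
    # Prompt rule: a suspicious verdict may only ESCALATE or PROPOSE.
    if obj["verdict"] == "suspicious" and obj["decision"] not in {"ESCALATE", "PROPOSE"}:
        problems.append("suspicious verdict must ESCALATE or PROPOSE")
    # Prompt rule: only a safe verdict may be cleared.
    if obj["decision"] == "CLEAR" and obj["verdict"] != "safe":
        problems.append("CLEAR requires a safe verdict")
    escalation = obj["escalation"]
    needed = isinstance(escalation, dict) and bool(escalation.get("needed"))
    if obj["decision"] == "ESCALATE" and not needed:
        problems.append("ESCALATE requires escalation.needed = true")
    return problems
```

An object that fails validation would be treated like an uncertain verdict and routed to an analyst rather than acted on.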

Implementation notes

Keep all detonation sandbox-only and enforce it outside the model; the agent must never interact with live malicious infrastructure.
Require positive benign evidence to clear an email. Mixed or insufficient evidence is 'suspicious' and escalates — false clears are the most damaging error.
Scope the campaign before responding so containment covers every affected recipient, not just the reporter; gate org-wide and exec-mailbox actions behind approval.
Treat BEC/spear-phishing as escalate-by-default: these have few classic indicators but the highest impact, and the right move is a human plus a warning about the requested action.
Always reply to the reporter — clearly, and encouragingly even when it's a false alarm. Reporting culture is a security asset worth protecting.
Backtest on resolved reports and track missed-true-positive rate as the primary safety metric before enabling any auto-containment.
Reserve the strong model for the verdict and BEC judgment; a cheaper model can parse headers and run reputation lookups.

Variations

Basic

Triage & enrichment assistant

Enriches the reported email, detonates indicators in the sandbox, returns a verdict with evidence and a suggested response for an analyst. No autonomous containment.

Advanced

Guarded auto-containment

Auto-quarantines scoped, high-confidence campaigns on non-critical mailboxes and blocks indicators, with campaign scoping, reporter replies, and approval-gated mass/exec actions.

Enterprise

Governed phishing response

Adds multi-tenant mail integration, critical-mailbox policies, SOC/IR routing, full evidence audit, BEC analytics, and verdict calibration from analyst feedback at scale.

Download the Agent Blueprint

The complete blueprint, zipped — including a runnable run.py you can execute with one API key (Anthropic or OpenAI).


README.md, system-prompt.md, setup-guide.md, tools.json, workflow.md, examples.md, .env.example, kit.json, run.py, LICENSE, NOTICE, starters/


Starters use mock tools — swap in your integrations to deploy.


This flagship blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).


