AgentKits

How to Design Safe AI Agents: A Practical Engineering Guide

Safe agent design is mostly subtraction: remove the tools it doesn't need, remove its ability to act irreversibly alone, remove its license to answer when it doesn't know. Here's the full checklist.

Most advice on building AI agents starts with capability — what the agent can do. Designing a safe agent starts from the opposite end: what it must never do, and what happens when it gets something wrong. That inversion is the whole discipline. An agent that reasons brilliantly but has no boundary on its actions is not an advanced system; it's an incident waiting for a trigger. Here is how we approach safe agent design across every blueprint we publish, written as a practical checklist rather than a manifesto.

Start from the worst-case action, not the happy path

Before you write a prompt, answer one question: if this agent does the most damaging thing it is physically wired to do, how bad is that? Not the likely outcome — the worst one. An agent that can only read and summarize has a worst case of "wrong summary." An agent that can issue a refund has a worst case of "money moved incorrectly." Those are different systems and they deserve different controls, even if the underlying model is identical.

We make this explicit by classifying every agent on a trust scale from A1 (read-only research) to A5 (full autonomy), keyed entirely to its worst-case action. The classification drives everything downstream: how many human gates it needs, how tightly its tools are scoped, how much logging is mandatory. If you do nothing else from this article, do this — it reorders every other decision correctly.
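As a minimal sketch of keying the level to the worst-case action: the text above only describes the two endpoints of the scale, A1 (read-only research) and A5 (full autonomy), so the intermediate levels, the action names, and the mapping below are illustrative assumptions, not the published scale.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    # Only A1 and A5 are described in the article; A2-A4 are placeholders.
    A1 = 1  # read-only research
    A2 = 2
    A3 = 3
    A4 = 4
    A5 = 5  # full autonomy


# Hypothetical mapping from an action the agent is wired to perform
# to the level that action alone would demand.
ACTION_LEVEL = {
    "read_document": TrustLevel.A1,
    "draft_reply": TrustLevel.A2,
    "create_ticket": TrustLevel.A3,
    "issue_refund": TrustLevel.A4,
}


def classify(wired_actions: list[str]) -> TrustLevel:
    """Return the level of the most damaging action the agent can take."""
    if not wired_actions:
        raise ValueError("an agent with no actions has nothing to classify")
    # Unknown actions are treated as the highest level until reviewed.
    return max(ACTION_LEVEL.get(a, TrustLevel.A5) for a in wired_actions)
```

The point of the `max` is that one dangerous tool sets the level for the whole agent, however harmless the rest of its toolbox is.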

Give the agent the smallest possible toolbox

The single most effective safety control is also the most boring: don't give the agent a tool it doesn't strictly need. An agent cannot misuse a capability that isn't in its registry. If a vendor-risk agent only needs to read documents and produce an assessment, it should have no write access to anything — no ticket creation, no email, no status change. The Vendor Risk Assessor blueprint is built exactly this way: it evaluates and recommends, and the tools that would let it act on that recommendation are simply absent. "Least privilege" is an old security idea, and it transfers to agents with almost no modification.

This matters more than prompt instructions, because a prompt is a request and a missing tool is a wall. Telling an agent "do not send emails" is a suggestion the model can fail to honor under a clever input. Removing the email tool means it cannot send one no matter what it's told.
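The difference between a request and a wall can be sketched with a small tool registry. The class and tool names here are hypothetical, not taken from any blueprint or framework; the only point is that a call to an unregistered tool fails in code, regardless of what the model was told.

```python
from typing import Callable


class ToolNotAvailable(Exception):
    """Raised when the agent requests a tool that was never registered."""


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._tools[name] = fn

    def call(self, name: str, *args, **kwargs) -> object:
        # The wall: no prompt can reach a function that is not in the dict.
        if name not in self._tools:
            raise ToolNotAvailable(name)
        return self._tools[name](*args, **kwargs)


def read_document(doc_id: str) -> str:
    # Stand-in for a real read-only lookup.
    return f"contents of {doc_id}"


# A vendor-risk style agent gets read access and nothing else.
registry = ToolRegistry()
registry.register("read_document", read_document)
```

With this setup, `registry.call("send_email", ...)` raises `ToolNotAvailable` because no email function was ever registered.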

Put a human on every irreversible action

Reversibility is the line that decides where a human belongs. Reading data, drafting a reply, scoring a risk — reversible, and safe to let an agent do unsupervised. Issuing a payment, granting access, deleting a record, sending a message that commits the company — irreversible, and those should never happen without a person approving the specific action. The Access Request Handler shows the pattern cleanly: it reads the request, checks it against policy, and prepares a grant decision, but a human approves before any access actually changes. The agent does the tedious 90 percent; the human owns the 10 percent that's dangerous.

Resist the temptation to auto-approve "low-risk" cases to save time. The cost of a human glance is small. The cost of an agent that learned it can sometimes skip the human is not.
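One way to sketch the gate, assuming a hypothetical set of action names and a caller-supplied `run` function: irreversible actions refuse to execute unless a named human has approved that specific action, and there is deliberately no "low-risk" bypass.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical set of action names treated as irreversible.
IRREVERSIBLE = {"issue_payment", "grant_access", "delete_record", "send_message"}


@dataclass
class ProposedAction:
    name: str
    params: dict


class ApprovalRequired(Exception):
    """Raised when an irreversible action arrives without an approver."""


def execute(action: ProposedAction,
            run: Callable[[ProposedAction], str],
            approved_by: Optional[str] = None) -> str:
    """Run the action; irreversible ones need a named human approver."""
    if action.name in IRREVERSIBLE and not approved_by:
        # No risk-score shortcut here: every irreversible action waits.
        raise ApprovalRequired(action.name)
    return run(action)
```

In a real system the approval would be tied to the exact action and parameters rather than passed as a bare name; this sketch only shows where the check sits.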

Ground every claim, and let the agent say "I don't know"

A safe agent answers from sources, not from confidence. When an agent makes a factual claim — a policy says X, a contract allows Y — it should be able to point to where that came from, and it should route to a human when the answer isn't in its material rather than inventing one. The Policy QA Agent is designed around this rule: it answers strictly from the provided policy documents with a citation, and anything not covered goes to HR instead of getting a plausible-sounding guess. "I don't know, here's who does" is a feature, not a failure. An agent that always has an answer is an agent that will confidently be wrong.
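The answer-or-route rule can be sketched as follows. The word-overlap matching is a crude stand-in for real retrieval, and the names, the threshold of two shared words, and the "HR" default are assumptions for illustration; what matters is that every answer carries a citation and anything uncovered returns a handoff instead of a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    text: str
    citation: Optional[str]     # section id the answer came from
    escalate_to: Optional[str]  # set when the material does not cover it


def _words(text: str) -> set[str]:
    # Lowercase, strip basic punctuation, drop very short words.
    cleaned = {w.strip("?.,").lower() for w in text.split()}
    return {w for w in cleaned if len(w) > 3}


def answer_from_policy(question: str, sections: dict[str, str],
                       fallback: str = "HR") -> Answer:
    """Answer only when a section shares enough words with the question."""
    q_words = _words(question)
    best_id, best_overlap = None, 0
    for sec_id, text in sections.items():
        overlap = len(q_words & _words(text))
        if overlap > best_overlap:
            best_id, best_overlap = sec_id, overlap
    if best_id is None or best_overlap < 2:
        return Answer("I don't know; routing to " + fallback, None, fallback)
    return Answer(sections[best_id], best_id, None)
```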

Design the escalation path before you need it

Every agent will eventually hit something it shouldn't handle alone — an ambiguous case, a low-confidence call, a high-stakes decision, a sign of an emergency. Safe design means the route to a human is built in from the start, with explicit triggers, not bolted on after the first bad outcome. Decide in advance: what conditions force a handoff, and to whom? A Control Monitor agent, for instance, flags a failing compliance control and routes it to the owner — it never marks a control compliant on its own to make a number look good. The escalation isn't an exception path; it's a core feature of the design.
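Explicit triggers can be written down as data rather than left implicit in a prompt. The sketch below assumes hypothetical case fields, thresholds, and owners; each deployment would set its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Case:
    confidence: float
    amount: float
    text: str


@dataclass
class Trigger:
    name: str
    fires: Callable[[Case], bool]
    route_to: str


# Hypothetical thresholds and owners; set these per deployment.
TRIGGERS = [
    Trigger("emergency", lambda c: "urgent" in c.text.lower(), "on_call"),
    Trigger("high_stakes", lambda c: c.amount > 1000, "team_lead"),
    Trigger("low_confidence", lambda c: c.confidence < 0.7, "reviewer"),
]


def escalation_for(case: Case) -> Optional[Trigger]:
    """Return the first trigger that fires, or None if the agent may proceed."""
    for trigger in TRIGGERS:
        if trigger.fires(case):
            return trigger
    return None
```

Ordering the list from most to least severe means an emergency is routed to the on-call owner even when the case also happens to be low-confidence.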

Log everything, in a form a human can audit

If you can't reconstruct why an agent did what it did, you can't trust it in anything regulated, and increasingly you can't deploy it legally. Every consequential decision should leave an append-only trail: what the agent saw, what it decided, what it recommended, who approved. This is not just good engineering: the EU AI Act's record-keeping and traceability requirements for high-risk systems are scheduled to start applying in August 2026, and "the model decided" is not an acceptable record. Build the audit trail in from day one; retrofitting it is painful and usually incomplete.
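A minimal sketch of such a trail, using only the standard library: each entry records the four fields named above plus a hash of the previous entry, so an edit to history is detectable. The class and field names are assumptions, the log lives in memory for brevity, and nothing here is a claim about what any regulation requires in detail.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def _digest(body: dict) -> str:
    # Stable serialization so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """In-memory append-only log; each entry hashes the one before it."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, saw: str, decided: str, recommended: str,
               approved_by: Optional[str]) -> dict:
        prev = self._entries[-1]["hash"] if self._entries else ""
        body = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "saw": saw,
            "decided": decided,
            "recommended": recommended,
            "approved_by": approved_by,
            "prev": prev,
        }
        body["hash"] = _digest(body)  # hash covers every field above
        self._entries.append(body)
        return body

    def verify(self) -> bool:
        """Recompute the chain; False means an entry was altered or reordered."""
        prev = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

A production trail would also need durable, write-once storage; the hash chain only makes tampering visible, it does not prevent it.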

Bound the loops and the spend

Two failure modes get overlooked because they're not about wrong answers: an agent that loops without converging, and an agent that quietly burns budget. Cap the number of reasoning turns. Set a cost ceiling per task with an alert threshold below it. Give the agent an escape hatch so that when it can't make progress, it stops and asks rather than spinning. None of this is glamorous, and all of it prevents the 3 a.m. incident where a stuck agent ran ten thousand tool calls against a production system.
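The three bounds above (turn cap, cost ceiling with an earlier alert, and a stop-and-ask exit) can be sketched as one loop. The `step` callable, the default limits, and the status strings are hypothetical; a real agent loop would call a model and tools inside `step`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    cost: float  # spend attributed to this turn
    done: bool   # True when the task is finished


@dataclass
class RunOutcome:
    status: str  # "done", "stopped_budget", or "stopped_turns"
    turns: int
    spent: float
    alerted: bool


def run_bounded(step: Callable[[int], StepResult],
                max_turns: int = 10,
                cost_ceiling: float = 1.00,
                alert_at: float = 0.80) -> RunOutcome:
    """Run step() until it finishes or a turn or spend bound is hit."""
    spent, alerted = 0.0, False
    for turn in range(1, max_turns + 1):
        result = step(turn)
        spent += result.cost
        if spent >= alert_at:
            alerted = True  # early warning, below the hard ceiling
        if result.done:
            return RunOutcome("done", turn, spent, alerted)
        if spent >= cost_ceiling:
            # Cost is only known after a step, so the last step may overshoot.
            return RunOutcome("stopped_budget", turn, spent, alerted)
    # Escape hatch: out of turns, so stop and hand off instead of spinning.
    return RunOutcome("stopped_turns", max_turns, spent, alerted)
```

Any status other than "done" is the cue to stop and ask a human rather than retry automatically.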

Evaluate safety, not just accuracy

Teams measure whether an agent gets the task right. Fewer measure whether it stays inside its boundaries. Both matter. Track the obvious quality metrics — accuracy, precision, recall, latency — but also track the safety ones: how often it escalates, how often it tries an action it shouldn't, how often a human overrides it and why. A rising override rate is an early warning. A falling escalation rate might mean the agent got better, or it might mean it started guessing instead of asking. Watch them together.
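Watching the safety metrics together can be sketched as a comparison between two reporting windows. The record fields, the 0.05 tolerance, and the review rule are assumptions for illustration: the rule flags a window when overrides rise, or when escalations fall while overrides have not fallen with them.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    escalated: bool       # agent handed the task to a human
    blocked_action: bool  # agent tried an action it was not allowed to take
    overridden: bool      # a human changed the agent's recommendation


def safety_rates(records: list[TaskRecord]) -> dict[str, float]:
    """Share of tasks in the window that hit each safety signal."""
    n = len(records)
    if n == 0:
        return {"escalation": 0.0, "blocked": 0.0, "override": 0.0}
    return {
        "escalation": sum(r.escalated for r in records) / n,
        "blocked": sum(r.blocked_action for r in records) / n,
        "override": sum(r.overridden for r in records) / n,
    }


def needs_review(prev: dict[str, float], curr: dict[str, float],
                 tolerance: float = 0.05) -> bool:
    """Compare two windows; True means a human should look at the trend."""
    override_rose = curr["override"] - prev["override"] > tolerance
    escalation_fell = prev["escalation"] - curr["escalation"] > tolerance
    override_not_falling = curr["override"] >= prev["override"]
    # Fewer escalations with no drop in overrides may mean guessing, not skill.
    return override_rose or (escalation_fell and override_not_falling)
```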

Write the boundary down where someone can review it

All of the above only counts if it's documented in a form a reviewer can actually check before the agent ships. That's the gap most teams miss: the safety logic lives in someone's head or scattered across a prompt, and no one can sign off on it. We close that gap with a design-time specification — every blueprint carries a machine-readable record of its trust level, authority boundary, escalation triggers, and audit settings. You can read the full vocabulary on the AgentAz™ specification page, or run an agent through the AI agent risk assessment to see how it classifies. The specification doesn't enforce anything by itself — a runtime does that — but it gives a security reviewer something concrete to approve, which is the step that turns "we think it's safe" into "we can show why."
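To make "something concrete to approve" tangible, here is a sketch of a reviewable record and a few checks a reviewer might automate. The field names and checks are hypothetical and are not the AgentAz vocabulary; they only mirror the four items named above (trust level, authority boundary, escalation triggers, audit settings).

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AgentSpec:
    name: str
    trust_level: str                # e.g. "A1" through "A5"
    allowed_tools: list[str]        # authority boundary
    requires_approval: list[str]    # tools gated behind a human
    escalation_triggers: list[str]
    audit_append_only: bool = True


def review_findings(spec: AgentSpec) -> list[str]:
    """Return problems a reviewer should see before sign-off."""
    findings = []
    if spec.trust_level not in {"A1", "A2", "A3", "A4", "A5"}:
        findings.append("unknown trust level: " + spec.trust_level)
    unknown = sorted(set(spec.requires_approval) - set(spec.allowed_tools))
    if unknown:
        findings.append("approval listed for tools not in boundary: " + ", ".join(unknown))
    if not spec.escalation_triggers:
        findings.append("no escalation triggers defined")
    if not spec.audit_append_only:
        findings.append("audit log is not append-only")
    return findings


def to_json(spec: AgentSpec) -> str:
    # Machine-readable form that can be stored and diffed in review.
    return json.dumps(asdict(spec), sort_keys=True, indent=2)
```

As the paragraph above notes, a record like this enforces nothing on its own; it is the runtime's job to honor it.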

The short version

Safe agent design is mostly subtraction. Remove tools the agent doesn't need. Remove its ability to take irreversible actions alone. Remove its license to answer when it doesn't know. What's left is an agent that proposes, escalates, and records — and a human who owns the decisions that matter. That's not a limitation on what agents can do. It's the precondition for letting them do anything that counts.

