AgentKits

How to Build Production-Ready AI Agents in 2026

The gap between an agent that demos well and one you'd trust with real money is mostly about deciding, up front, what it must never do.

There's a wide gap between an agent that demos well and an agent you'd put in front of real customers or real money. The demo runs once, on a clean input, with you watching. Production runs ten thousand times a day on inputs nobody anticipated, while you're asleep. Almost everything that goes wrong with agents lives in that gap, and most of it has nothing to do with the model you picked.

We've now built and reviewed a lot of these, and the agents that survive contact with production share a surprisingly boring set of traits. None of them are clever. They're mostly about deciding, up front, what the agent is not allowed to do.

Start by writing down what it must never do

Before the prompt, before the tools, before anything: list the actions that would cause real harm if the agent took them at the wrong moment. Refunding more than someone paid. Deleting a production record. Sending an email that commits the company to something. Approving access. Diagnosing a patient. Each of those is a line the agent should never cross on its own, no matter how confident it is.

This sounds obvious and almost nobody does it. The default instinct is to make the agent capable and then bolt on safety later. That's backwards. The safe boundary is the design. Once you've written the list, the rest of the build is mostly arranging things so those lines are structurally impossible to cross, not just discouraged in a prompt.

Put the dangerous actions behind code, not behind the model

A language model is probabilistic. That's wonderful for drafting a reply and unacceptable for deciding whether to issue a $4,000 refund. The single most important architectural decision is to keep irreversible or high-risk actions out of the model's hands and behind deterministic code with a hard gate.

In practice that means an environment flag the model can't override — something like EXECUTE_ACTIONS=false by default — and a separate, dumb piece of code that checks policy before anything happens. The model can propose a refund. The code decides whether it's within the window and under the cap. If it's over, a human gets it. The model never touches the money directly. This is the difference between "the agent refunded a customer" and "the agent recommended a refund and the system, applying our rules, approved it." Only the second one is auditable.

The same pattern covers most domains. The agent drafts the email; a human sends it. The agent routes the ticket; it doesn't resolve the billing dispute. The agent flags the suspicious transaction; it doesn't freeze the account. Propose, don't execute. Route, don't act. Draft, don't send. If you internalize one idea from this whole piece, make it that one.

Make the model output structured data, not prose

An agent that returns a paragraph is an agent you can't reason about programmatically. Have it return a single JSON object with the decision, the confidence, the evidence it used, and an explicit escalation field. Now your code can branch on it, log it, and gate on it. Now "I'm not sure" is a value your system can act on instead of a sentence buried in text.

Force a schema and validate against it. When the model returns something that doesn't fit, that's a signal — treat it as a failure to handle, not a value to coerce. A surprising amount of production reliability comes from simply refusing to accept malformed output.

Confidence and escalation are features, not afterthoughts

The agents that hold up in production are the ones that know when to quit. Every decision should carry a confidence, and low confidence should route to a human rather than forcing a guess. A workflow router that sends an ambiguous request to a triage queue is doing its job. One that confidently misroutes it to the wrong team because it had to pick something is the one generating angry tickets.

Build the escalation path first, not last. Decide where uncertain cases go, who sees them, and what context travels with them. An agent without a good "I don't know, here's a human" branch will eventually be confidently wrong about something that matters, and you'll wish you'd built the off-ramp.

Ground everything, and never let it invent

The fastest way to lose trust is a fabricated fact stated with confidence. A research agent that invents a citation. A coverage assistant that promises a claim will be paid. A catalog enricher that adds "FDA approved" to a product that isn't. These aren't edge cases; they're the natural failure mode of a system designed to always produce an answer.

The fix is to ground answers in real, provided sources and to make "I don't have that" a first-class output. If a field isn't on the document, it's null and flagged — not guessed. If a question isn't covered by the policy, the answer is "this isn't addressed, here's who can help" — not a plausible-sounding invention. A flagged blank beats a confident lie every single time, and your users can tell the difference faster than you'd think.

Control cost on purpose

Agents loop. Loops cost money. Without a budget, a single bad input can spiral into a hundred tool calls and a bill you'll notice. Set a per-task ceiling on steps and tokens, and decide what the agent does when it hits the ceiling — usually, stop and escalate.

The cheapest token is the one you don't spend. Use a small, fast model for the routine 80% of cases and reserve the expensive model for the genuinely hard ones. We'll get into the economics of this in a separate piece, but the principle is simple: model tiering, caching, and hard step limits will do more for your bill than any amount of prompt golf.

Test against the cases that scare you

The happy path will work. It always does. What you need to test is the third example — the one where the input is adversarial, ambiguous, or designed to trick the agent into crossing one of those lines you wrote down at the start. Someone applying chargeback pressure to a refund agent. A steering question to a leasing assistant. A request to mark a control compliant "just to pass the audit."

For every agent we consider production-ready, the hardest test is the one where the correct behavior is to refuse. If your test suite is all happy paths, you don't have a test suite — you have a demo with extra steps. Write the adversarial cases down, run them on every change, and treat a regression on a refusal as seriously as a crash.

The unglamorous checklist

Pulling it together, here's what separates the agents that ship from the ones that get quietly turned off after the first incident:

ConcernDemo agentProduction agent
High-risk actionsModel decidesBehind deterministic code + human gate
OutputProseValidated structured JSON
UncertaintyGuessesConfidence score → escalate
FactsSometimes inventedGrounded; "I don't have that" is allowed
CostUnbounded loopsStep/token budget + model tiering
TestingHappy pathAdversarial + refusal cases
LoggingMaybe a print statementEvery decision, with its basis

The honest truth

None of this is exciting. There's no clever reasoning trick or magic framework in the list. Building a production agent is mostly the discipline of deciding what it can't do, putting the risky parts behind real code, and being honest about uncertainty. The model is the easy part — it's been the easy part for a while now.

If you want a head start, the defensive patterns above are exactly how the blueprints in our registry are built: every one keeps high-risk actions gated, grounds its answers, and ships with the hardest case being a refusal. But you don't need our kits to do this. You need the list of lines your agent must never cross, and the discipline to make those lines structural. Everything else follows from that.

Frequently asked questions

Keep reading

← All posts