The Agent Engineering Stack: How Modern AI Agents Actually Work
The model is one layer in the stack. Most of what separates a demo from a production agent lives in the others — planning, tools, memory, governance, grounding, handoff, and evaluation. Here's the whole stack.
People talk about AI agents as if the model is the agent. It isn't. The model is one layer in a stack, and most of what separates a demo from a production system lives in the other layers — the parts that decide what the agent plans, what it's allowed to touch, what it remembers, and what happens when it's unsure. If you want to understand how modern agents actually work, it helps to take the stack apart layer by layer and see what each one is responsible for.
The reasoning layer: the model
At the core is the language model doing the reasoning: interpreting the goal, deciding what to do next, generating the output. This is the part that has improved dramatically and the part everyone fixates on. But on its own, a model is a brain in a jar: it can think, but it can't reliably act, remember across sessions, or stay inside a boundary. Treating the model as the whole agent is the mistake that produces impressive demos and unshippable products.
The planning layer: turning a goal into steps
A real task rarely maps to a single model call. Something has to break a goal into an ordered sequence of subtasks with dependencies. That's the planning layer, and keeping it separate from execution is a deliberate safety choice: a planner that proposes a plan for review is far easier to trust than one that improvises actions as it goes. The Goal Decomposer blueprint does exactly this — it produces a plan and stops, leaving execution to a human or a downstream system. Plan first, act second, and never let the two blur together.
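To make "plan first, act second" concrete, here is a minimal sketch of a planner's output being validated and ordered without anything being executed. It is not the Goal Decomposer blueprint's implementation; the `Step` structure, the `order_plan` helper, and the sample steps are hypothetical names chosen for illustration. It assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Step:
    """One subtask in a proposed plan (hypothetical structure)."""
    id: str
    description: str
    depends_on: tuple[str, ...] = ()


def order_plan(steps: list[Step]) -> list[Step]:
    """Return the steps in an order that respects their dependencies.

    Raises ValueError if a dependency is unknown or the plan has a cycle.
    Nothing is executed here: the result is a plan to be reviewed.
    """
    by_id = {s.id: s for s in steps}
    for s in steps:
        for dep in s.depends_on:
            if dep not in by_id:
                raise ValueError(f"step {s.id!r} depends on unknown step {dep!r}")
    # Map each step to the set of steps that must come before it.
    graph = {s.id: set(s.depends_on) for s in steps}
    try:
        ordered_ids = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("plan contains a dependency cycle") from exc
    return [by_id[i] for i in ordered_ids]


# A plan as a model might propose it, in no particular order.
proposed = [
    Step("send", "Send the summary to the team", depends_on=("summarize",)),
    Step("summarize", "Summarize the findings", depends_on=("collect",)),
    Step("collect", "Collect last week's tickets"),
]
plan = order_plan(proposed)
for step in plan:
    print(step.id, "-", step.description)
```

The function returns data and stops. Whatever executes the plan, a person or a downstream system, is a separate component that can refuse it.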
The routing layer: getting work to the right handler
Once there's a plan, each piece has to reach the thing that can do it — a specific tool, a specialized agent, a human queue. The routing layer classifies and directs without doing the work itself. The Workflow Router is built around this single responsibility: read a request, decide where it belongs, send it there. Keeping routing as its own layer means you can reason about and test the dispatch logic independently of the work being dispatched, which is how you keep a growing system understandable.
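A small sketch shows why keeping routing separate makes it testable. The keyword classifier below is a deliberately crude stand-in for whatever a real router uses (a model call, rules, or both), and the labels and handler names are assumptions for illustration, not the Workflow Router's actual design.

```python
from typing import Callable

Handler = Callable[[str], str]


def classify(request: str) -> str:
    """Decide where a request belongs. Stand-in logic: keyword matching."""
    text = request.lower()
    if "refund" in text or "invoice" in text:
        return "billing"
    if "password" in text or "login" in text:
        return "account"
    # Anything unrecognized goes to people rather than to a guess.
    return "human_queue"


def make_router(handlers: dict[str, Handler], fallback: str = "human_queue") -> Handler:
    """Build a route() function that dispatches but does no work itself."""
    def route(request: str) -> str:
        label = classify(request)
        handler = handlers.get(label, handlers[fallback])
        return handler(request)
    return route


# Hypothetical handlers; in a real system these would be tools, agents, or queues.
handlers: dict[str, Handler] = {
    "billing": lambda r: f"billing:{r}",
    "account": lambda r: f"account:{r}",
    "human_queue": lambda r: f"human:{r}",
}
route = make_router(handlers)
print(route("Reset my password"))
```

Because `classify` has no side effects, the dispatch logic can be tested with plain strings, independently of what the handlers do.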
The tool layer: the agent's hands
Tools are how an agent affects the world — query a database, call an API, write a file. This is also where almost all the risk concentrates, because tools are the only way an agent can do something irreversible. The design rule is least privilege: the agent gets the smallest set of tools the task requires and nothing else. A capability that isn't in the registry can't be misused, regardless of what the model is convinced to attempt. Most safety engineering happens right here, at the boundary between reasoning and action.
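The registry idea can be sketched in a few lines. This is an illustrative minimum, assuming tools are plain Python callables looked up by name; `ToolRegistry`, `ToolNotAllowed`, and `read_ticket` are hypothetical names.

```python
from typing import Any, Callable


class ToolNotAllowed(Exception):
    """Raised when the agent asks for a tool outside its registry."""


class ToolRegistry:
    """Holds the only tools this agent can call."""

    def __init__(self, tools: dict[str, Callable[..., Any]]):
        self._tools = dict(tools)  # copy so callers cannot add tools later

    def allowed(self) -> list[str]:
        return sorted(self._tools)

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            # The model may request anything; only registered tools run.
            raise ToolNotAllowed(name)
        return self._tools[name](**kwargs)


def read_ticket(ticket_id: int) -> str:
    """Hypothetical read-only tool."""
    return f"ticket {ticket_id}"


# This agent summarizes tickets, so it gets read access and nothing else.
registry = ToolRegistry({"read_ticket": read_ticket})
print(registry.call("read_ticket", ticket_id=7))
```

A request for `delete_ticket` raises `ToolNotAllowed` no matter how the request was phrased, because the capability was never registered.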
The memory and context layer
Agents need to carry state — what happened earlier in the task, relevant facts, prior decisions. The memory layer manages what the agent knows at any moment and, just as importantly, what it shouldn't. Bad context handling is a quiet source of failures: stale data, leaked information from another session, a context window stuffed with noise that crowds out the signal. Deciding what goes into the agent's working memory, and what stays out, is an engineering discipline in its own right, not an afterthought.
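One way to picture context curation is as a filter with three rules: same session only, nothing stale, and a size budget filled in order of relevance. The sketch below assumes each memory item already carries a relevance score from some earlier step; `MemoryItem`, `build_context`, and the character budget are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    session_id: str
    text: str
    age_seconds: float
    relevance: float  # 0..1, assumed to come from a separate scoring step


def build_context(
    items: list[MemoryItem],
    session_id: str,
    max_age_seconds: float,
    max_chars: int,
) -> list[str]:
    """Select what goes into working memory for one session."""
    eligible = [
        i for i in items
        if i.session_id == session_id          # no leakage across sessions
        and i.age_seconds <= max_age_seconds   # no stale data
    ]
    eligible.sort(key=lambda i: i.relevance, reverse=True)
    selected: list[str] = []
    used = 0
    for item in eligible:
        if used + len(item.text) > max_chars:
            continue  # skip items that would overflow the budget
        selected.append(item.text)
        used += len(item.text)
    return selected


items = [
    MemoryItem("s1", "order 42 shipped", age_seconds=60, relevance=0.9),
    MemoryItem("s2", "other user's address", age_seconds=10, relevance=1.0),
    MemoryItem("s1", "old promo", age_seconds=999_999, relevance=0.8),
    MemoryItem("s1", "customer prefers email", age_seconds=120, relevance=0.5),
]
context = build_context(items, "s1", max_age_seconds=3600, max_chars=100)
print(context)
```

The highest-scoring item overall belongs to another session and is excluded anyway, which is the point: relevance never overrides the session boundary.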
The governance layer: guardrails and boundaries
Sitting across the whole stack is the governance layer: the rules about what the agent may do, when it must pause for a human, and what it must log. This is the layer that turns a capable agent into a deployable one. It's where worst-case classification, human-approval gates, escalation triggers, and audit trails live. A Control Monitor agent is a good illustration of governance as a first-class concern: it watches controls and flags failures but is structurally unable to mark something compliant on its own, because that authority was never granted. Governance isn't a wrapper you add at the end; it's woven through every other layer.
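"Structurally unable" can be read quite literally: the set of verdicts the agent can emit simply has no compliant value. The sketch below illustrates that idea along with a basic audit trail. It is an assumption about how such a constraint could be encoded, not the Control Monitor's actual implementation, and the control IDs are made up.

```python
from dataclasses import dataclass
from enum import Enum


class AgentVerdict(Enum):
    """Everything the agent is able to conclude about a control."""
    FAILURE_FLAGGED = "failure_flagged"
    NEEDS_HUMAN_REVIEW = "needs_human_review"
    # Deliberately no COMPLIANT member: that decision belongs to a human.


@dataclass(frozen=True)
class Finding:
    control_id: str
    verdict: AgentVerdict
    evidence: str


audit_log: list[Finding] = []  # every finding is recorded, pass or fail


def check_control(control_id: str, passed_automated_check: bool, evidence: str) -> Finding:
    """Flag failures; send passes to a human for sign-off."""
    if passed_automated_check:
        verdict = AgentVerdict.NEEDS_HUMAN_REVIEW
    else:
        verdict = AgentVerdict.FAILURE_FLAGGED
    finding = Finding(control_id, verdict, evidence)
    audit_log.append(finding)
    return finding


# Hypothetical control IDs and evidence strings.
ok = check_control("CTRL-001", True, "backup job succeeded")
bad = check_control("CTRL-002", False, "MFA disabled for 3 accounts")
print(ok.verdict.value, bad.verdict.value)
```

Even a control that passes every automated check comes out as `NEEDS_HUMAN_REVIEW`. No prompt can talk the agent into a verdict its type does not contain.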
The grounding layer: answering from sources
For any agent that makes factual claims, there has to be a layer that ties answers to evidence. Grounding is what separates "the policy says you have 30 days" from a confident hallucination. The Policy QA Agent demonstrates the discipline: it answers only from provided documents, cites the source, and routes anything uncovered to a human rather than filling the gap with a guess. The ability to say "that's not in my material" is an architectural feature, and it belongs in the stack on purpose.
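The shape of a grounded answer (text, source, or an explicit escalation) can be shown with a toy example. The word-overlap matching below is a crude stand-in for real retrieval and is only there to make the sketch runnable; the function names, the stopword list, the threshold, and the sample document are all assumptions, not the Policy QA Agent's method.

```python
import re
from dataclasses import dataclass
from typing import Optional

STOPWORDS = {"the", "a", "an", "is", "are", "do", "i", "to", "of", "for",
             "what", "how", "my", "have", "you", "in", "on"}


def terms(text: str) -> set[str]:
    """Lowercased content words of a text."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


@dataclass(frozen=True)
class Answer:
    text: Optional[str]    # None when the sources do not cover the question
    source: Optional[str]  # which document the answer came from
    escalate: bool         # True means: route to a human


def answer_from_sources(question: str, documents: dict[str, str], min_overlap: int = 2) -> Answer:
    """Answer only with a sentence found in the documents, or escalate."""
    q = terms(question)
    best_score, best_text, best_source = 0, None, None
    for source, body in documents.items():
        for sentence in body.split("."):
            score = len(q & terms(sentence))
            if score > best_score:
                best_score, best_text, best_source = score, sentence.strip() + ".", source
    if best_score < min_overlap:
        return Answer(None, None, escalate=True)
    return Answer(best_text, best_source, escalate=False)


docs = {
    "returns-policy.txt": "Customers have 30 days to return an item. "
                          "Refunds are issued to the original payment method.",
}
covered = answer_from_sources("How many days do I have to return an item?", docs)
uncovered = answer_from_sources("Do you ship to Canada?", docs)
print(covered)
print(uncovered)
```

The second question returns no text at all. Escalation is a normal return value here, not an error path.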
The handoff layer: where humans re-enter
No agent should run entirely alone on consequential work, which means the stack needs a defined way for control to return to a person. The handoff layer specifies the triggers — low confidence, high stakes, ambiguity, anything irreversible — and the destination. It's the difference between an agent that gets stuck and fails silently and one that knows when to tap a human on the shoulder. Designing these triggers up front, rather than discovering them after an incident, is a mark of a mature system.
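Writing the triggers down as code is one way to design them up front. The sketch below encodes the four triggers named above; the field names, the 0.8 confidence threshold, and the queue name are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    name: str
    confidence: float   # assumed to be a 0..1 score from the agent
    irreversible: bool
    high_stakes: bool
    ambiguous: bool


def handoff_reasons(action: ProposedAction, min_confidence: float = 0.8) -> list[str]:
    """Return every trigger that fires; an empty list means no handoff."""
    reasons = []
    if action.confidence < min_confidence:
        reasons.append("low confidence")
    if action.high_stakes:
        reasons.append("high stakes")
    if action.ambiguous:
        reasons.append("ambiguity")
    if action.irreversible:
        reasons.append("irreversible")
    return reasons


def next_destination(action: ProposedAction, queue: str = "human_review_queue") -> str:
    """Send the action to a person if any trigger fires, else let it proceed."""
    return queue if handoff_reasons(action) else "auto_execute"


routine = ProposedAction("send_reminder", 0.95, irreversible=False,
                         high_stakes=False, ambiguous=False)
risky = ProposedAction("delete_account", 0.99, irreversible=True,
                       high_stakes=True, ambiguous=False)
print(next_destination(routine), next_destination(risky))
print(handoff_reasons(risky))
```

High confidence does not exempt the risky action: the irreversibility and stakes triggers fire regardless of how sure the agent is.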
The evaluation layer: knowing if it works
Finally, you need a way to measure whether the whole stack is behaving. Evaluation tracks the quality metrics — accuracy, precision, recall, latency — and the safety metrics — escalation rate, override frequency, attempted out-of-bounds actions. An Action Item Tracker hints at why this layer matters even for "simple" agents: extraction quality drifts, ownership gets ambiguous, and without measurement you won't notice until something important is missed. You can't improve or trust what you don't measure.
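As a worked example of the two metric families, the sketch below computes precision and recall for extracted action items and an escalation rate from a run log. The sample data and the event labels are invented for illustration.

```python
def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Precision: share of predictions that are correct.
    Recall: share of expected items that were found."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def rate(events: list[str], kind: str) -> float:
    """Fraction of runs that ended with the given outcome."""
    return events.count(kind) / len(events) if events else 0.0


# Quality: what the agent extracted vs. what a reviewer says was in the meeting.
predicted = {"send report", "book room", "email vendor"}
expected = {"send report", "book room", "update budget", "call client"}
precision, recall = precision_recall(predicted, expected)

# Safety: outcomes of four runs (hypothetical labels).
events = ["completed", "escalated", "completed", "blocked_out_of_bounds"]
escalation_rate = rate(events, "escalated")
out_of_bounds_rate = rate(events, "blocked_out_of_bounds")

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"escalation_rate={escalation_rate:.2f} out_of_bounds_rate={out_of_bounds_rate:.2f}")
```

Here two of three extracted items are correct (precision 2/3) but only two of four real items were found (recall 1/2). That missed half is the quiet drift described above, and it is invisible without a reference set to compare against.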
Putting the stack together
A production agent is these layers working in concert: a model that reasons, a planner that decomposes, a router that dispatches, tools scoped to the minimum, memory that's curated, governance woven throughout, grounding that ties claims to sources, handoffs that bring humans back in, and evaluation that tells you the truth. The model gets the attention, but the engineering is in the rest. That's also why a better model doesn't make the stack obsolete — it makes the reasoning layer stronger while leaving every other layer exactly as necessary as before.
If you want to see how the governance layer is specified in practice, every blueprint we publish documents its boundaries on the AgentAz™ specification page, and the full set is in the blueprint library. Read a few and you'll start seeing the stack everywhere — and noticing immediately when a system is missing a layer it needed.
Frequently asked questions
What are the layers of a production AI agent stack?
A production agent stack has a reasoning layer (the model), a planning layer that decomposes goals, a routing layer that dispatches work, a tool layer scoped to least privilege, a memory and context layer, a governance layer of guardrails and boundaries, a grounding layer that ties claims to sources, a handoff layer for human escalation, and an evaluation layer that measures behavior.
Is the model the agent?
No. The model is the reasoning layer only. On its own it can think but cannot reliably act, remember across sessions, or stay inside a boundary. The other layers (planning, routing, tools, memory, governance, grounding, handoff, evaluation) are what turn a model into a deployable agent.
Why doesn't a better model make the rest of the stack obsolete?
Because a better model only strengthens the reasoning layer. Planning, routing, tool scoping, memory curation, governance, grounding, handoffs, and evaluation are all just as necessary regardless of model quality. The hard engineering lives outside the model, which is why agent architecture remains the differentiator.