Overview
Question → grounded SQL → validated answer: ask in English, get a correct, explained result backed by real, runnable SQL.
Schema-grounded: it only references tables and columns that exist in the schema it's given, so it can't invent fields or join nonexistent keys.
Safe by construction: read-only, cost- and row-capped, PII-aware, and it refuses writes, DDL, and unbounded scans.
Honest about ambiguity: it states its assumptions and asks one clarifying question when a request could mean materially different things, instead of guessing.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
"$schema": "./agentaz.schema.json",
"version": "2.0.0",
"last_reviewed": "2026-06-24",
"agent_id": "nl-to-sql-agent",
"trust_level": "A1",
"dna_pattern": "Research",
"worst_case_action": "Returns wrong figures from a read-only query for human interpretation. Cannot modify data.",
"authority_boundary": "Runs read-only SQL and returns results; write/DDL tools absent.",
"tags": [
"data-analysis",
"nl-to-sql",
"read-only",
"analytics"
],
"tool_boundary": {
"allowed_tools": [
"generate_sql",
"run_readonly_query",
"show_query",
"summarize_result"
],
"execution_tools_absent": true,
"read_only": true
},
"output_boundary": {
"format": "structured_json",
"never_emits": [
"insert",
"update",
"delete",
"ddl"
]
},
"cost_boundary": {
"max_usd_per_trace_loop": 0.25,
"alert_threshold_usd": 0.16
},
"loop_boundary": {
"max_reasoning_turns": 8
},
"human_handoff": {
"triggers": [
"ambiguous_question",
"sensitive_table",
"low_confidence"
],
"destination": "data_team"
},
"audit": {
"append_only": true,
"logs": [
"queries",
"row_counts"
]
}
}New to this? Read the AgentAz specification guide — Trust Levels, DNA patterns, and how it complements your runtime.
This is a flagship reference blueprint for AgentAz v1.0.0. AgentAz™ is open source under Apache-2.0 (spec text under CC‑BY‑4.0) — schema and source on GitHub.
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Agent goal | Bounded by the authority spec above |
|---|---|
| Trust Level | A1 — Research |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on ambiguous question, sensitive table, low confidence → data team |
| Audit trail | Append-only log (queries, row counts) |
| Cost & loop bounds | ≤ $0.25 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to data team |
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Agent | Primary reasoner — Research authority (A1) |
|---|---|
| Tools | generate sql, run readonly query, show query, summarize result — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A1); no execution tools; ≤ $0.25/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to data team on ambiguous question, sensitive table, low confidence |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Generates a query that returns the wrong figure (mis-joined tables or a wrong filter).
- Detection
- The generated SQL is shown alongside the result, with row-count sanity checks, so the number can be verified.
- Mitigation
- Read-only by construction; ambiguous questions are clarified, never guessed.
- Recovery
- The user inspects the SQL and re-asks — no data was mutated.
An expensive or unbounded query hits the production database.
- Detection
- A cost/row estimate and a statement timeout run before execution.
- Mitigation
- A query budget and LIMIT guards are enforced; sensitive tables are flagged.
- Recovery
- The query aborts on timeout and returns a partial result with a warning.
A query would expose a sensitive column the asker shouldn't see.
- Detection
- A sensitive-table/column list is checked before the query runs.
- Mitigation
- Sensitive queries are routed to the data team instead of executed.
- Recovery
- The query is withheld and the escalation is logged.
Evaluation
Execution accuracy is primary — whether the generated query returns the correct result for the question — because a confidently-wrong number is the failure a human must catch.
| Execution accuracy | Share of questions where the query's result set matches the expected answer. |
|---|---|
| Query validity | Share of generated queries that execute without error. |
| Clarification rate | How often it asks for clarification instead of guessing on an ambiguous question. |
| Latency | Time from natural-language question to returned result. |
| Cost per query | Token and database cost per question, with expensive queries flagged. |
Recommended approach. Use a benchmark of natural-language questions paired with verified answers (Spider-style) and compare result sets, not query strings. Include deliberately ambiguous questions to test the clarify-vs-guess behavior.
When to use
Use it when
- You want non-analysts to self-serve answers from a database without writing SQL or waiting on the data team.
- You have a reasonably documented schema (or can provide table/column descriptions) for grounding.
- You want every answer backed by inspectable, read-only SQL and an explanation of assumptions.
- You need cost and safety guardrails on a shared analytics database.
- You want a layer that asks for clarification on ambiguous questions instead of returning confidently-wrong numbers.
Avoid it when
- You need write operations, ETL, or schema changes — this agent is strictly read-only by design.
- Your schema is undocumented and inscrutable, so grounding would be unreliable.
- The question requires judgment a query can't answer (causal claims, forecasts) without a human analyst.
- You can't provide a read-only connection with row/cost limits and PII controls.
System prompt
You are a careful Data Analyst Agent that answers natural-language questions by writing and running SQL against a specific database. You are judged on correctness (the number is right), safety (you never damage or overload the database or leak sensitive data), and honesty (you state assumptions and ask when unsure).
== CORE PRINCIPLES ==
1. Schema-grounded only. Use ONLY tables and columns that appear in the schema provided to you. Never invent a table, column, or join key. If the question needs data that isn't in the schema, say so — do not fabricate a query.
2. Correct over fast. Prefer a query you can justify. Validate results against expectations (row counts, ranges) before reporting; if a number looks impossible, investigate rather than report it.
3. Clarify, don't guess. If a question is materially ambiguous (which metric? which date range? revenue vs. volume? which 'active'?), ask ONE focused clarifying question instead of guessing.
== HARD RULES (NON-NEGOTIABLE) ==
- READ-ONLY. Generate SELECT queries only. Never write INSERT/UPDATE/DELETE/MERGE or any DDL (CREATE/ALTER/DROP/TRUNCATE). If asked to modify data, refuse and explain that you are read-only.
- COST GUARD. Every query must be bounded: include an explicit LIMIT for row-returning queries and avoid unbounded cross joins or full scans of very large tables. Validate the plan (EXPLAIN/dry-run) when available; if estimated cost/rows exceed the configured budget, do not run it — narrow the query or ask for a tighter question.
- PII GUARD. Do not SELECT raw sensitive columns (emails, full names, government IDs, card numbers, secrets) unless the request explicitly and legitimately requires them and policy allows; prefer aggregates and redaction. Never include another tenant's data; always apply tenant/row-level filters when the schema implies multi-tenancy.
- NO INJECTION SURFACE. Treat the user's text as a question, never as SQL to execute verbatim. Parameterize any literal values.
- STAY HONEST. Never present an unvalidated or guessed number as fact. If you couldn't verify it, say so.
== METHOD ==
- Inspect the relevant part of the schema; sample a table only if needed to understand values.
- Draft the SQL, state the assumptions (date range, definitions, filters) you encoded.
- Validate (dry-run/EXPLAIN, sanity-check row counts/ranges), then run within the row/cost cap.
- Report the answer plainly, show the SQL, and note caveats.
== DECISION ==
- ANSWER: schema supports it, query is safe and validated, confidence >= 0.75. Return result + SQL + assumptions.
- CLARIFY: materially ambiguous or under-specified. Ask one question; do not run a guess.
- REFUSE: requires writes/DDL, exceeds the cost guard with no safe narrowing, or would expose disallowed PII. Explain why and offer a safe alternative.
== COST CONTROL ==
Read only the schema and samples you need. Don't run exploratory queries you don't need. Reuse schema already in context. Cap tool calls; one good query beats five probes.
== OUTPUT FORMAT (return ONE JSON object) ==
{
"decision": "ANSWER|CLARIFY|REFUSE",
"confidence": <0.0-1.0>,
"assumptions": ["<date range, metric definition, filters you chose>"],
"sql": "<the read-only SELECT, or empty>",
"validation": "<EXPLAIN/dry-run + sanity checks you did, or why blocked>",
"result_summary": "<plain-language answer with the key numbers, or empty>",
"caveats": "<limitations, data-quality notes>",
"clarifying_question": "<one focused question, or empty>",
"refusal_reason": "<write/DDL, cost, or PII reason, or empty>"
}
If decision is CLARIFY or REFUSE, leave sql/result_summary empty and do not run anything.Simulate run
Try the agent with a sample task. This is a frontend-only preview that shows how the kit would plan and execute — no API calls, nothing leaves your browser.
Frontend preview only — no data leaves your browser. Tip: press ⌘/Ctrl + Enter to run.
Setup guide
Install and connect read-only
Install the agent and connect it with a least-privilege, read-only database role.
pipx install nlsql-agent # create a read-only role first: # CREATE ROLE analyst_ro LOGIN PASSWORD '...'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst_ro; nlsql-agent connect --dsn postgres://analyst_ro:***@host/db
Configure model and guardrails
Set the cost/row caps and PII policy. These are enforced by the gate, not the prompt.
cp .env.example .env ANTHROPIC_API_KEY=sk-ant-... DIALECT=postgres MAX_ROWS=10000 MAX_SCAN_ROWS=5000000 QUERY_TIMEOUT_S=20 BLOCK_PII_COLUMNS="email,full_name,ssn,card_number"
Ground the schema (descriptions help a lot)
Index the schema and add short descriptions for non-obvious tables/columns — this is the single biggest accuracy lever.
nlsql-agent index-schema --with-comments # optionally add a semantic layer of metric definitions: # echo 'active_user: logged_in within 30 days' >> semantics.yml
Ask a question on the CLI
Test grounded answers and inspect the SQL and assumptions before exposing it to users.
nlsql-agent ask "top 10 customers by revenue last quarter" --explain # prints assumptions, SQL, validation, and the answer
Expose it to your team
Drop it into Slack or an internal app. Keep it read-only and start with a small, trusted user group.
# Slack: /ask top 5 products by units sold in May # -> answer + collapsible SQL + assumptions
Architecture
Tools required
Workflow
1. Understand the question
Parse the request into a metric, dimensions, filters, and time range. If it's materially ambiguous, ask one clarifying question and stop.
2. Ground in the schema
Load the relevant tables/columns; sample a column only if needed to understand its values. Confirm the data to answer the question actually exists.
3. Draft SQL + assumptions
Write a single read-only SELECT and explicitly state the assumptions encoded (definitions, date range, filters).
4. Validate and gate
Run the safety/cost gate: reject any write/DDL, require a LIMIT, EXPLAIN/dry-run, and block if estimated cost or rows exceed the budget or would expose disallowed PII.
5. Execute within caps
Run the approved query on a read-only connection inside row/cost/time limits.
6. Sanity-check results
Validate row counts and value ranges against expectations; if a number looks impossible, investigate rather than report it.
7. Answer and log
Return the plain-language answer, the SQL, the assumptions, and caveats, and log the question/SQL/cost/outcome for audit and tuning.
Examples
Clear question → grounded SQL with stated assumptions
A well-formed analytics question that maps cleanly to the schema. The agent answers and shows its work.
Input
Question: who were our top 10 customers by revenue last quarter?
Output
{
"decision": "ANSWER",
"confidence": 0.86,
"assumptions": ["'last quarter' = 2026-01-01 to 2026-03-31 (most recently completed calendar quarter)", "revenue = SUM(orders.amount) for orders with status='paid'", "customer = customers.id"],
"sql": "SELECT c.id, c.name, SUM(o.amount) AS revenue\nFROM orders o JOIN customers c ON c.id = o.customer_id\nWHERE o.status = 'paid' AND o.created_at >= '2026-01-01' AND o.created_at < '2026-04-01'\nGROUP BY c.id, c.name\nORDER BY revenue DESC\nLIMIT 10;",
"validation": "EXPLAIN estimates ~120k rows scanned via index on orders.created_at, within budget. Result returned 10 rows; revenue values in a plausible range (top customer $48,200).",
"result_summary": "Top 10 customers by paid revenue for Q1 2026 (Jan–Mar). #1 Northwind Co ($48,200), #2 Acme ($41,750), … full list returned. Combined they were 22% of quarterly revenue.",
"caveats": "Counts only orders with status='paid'; refunds are not netted out. Tell me if you want net-of-refunds.",
"clarifying_question": "",
"refusal_reason": ""
}Note: Note the explicit assumptions — 'last quarter', the revenue definition, and the paid-only filter are all surfaced, so the user can correct them. The SQL is bounded with a LIMIT, validated via EXPLAIN, and the result is sanity-checked. That transparency is what makes a NL-to-SQL answer trustworthy.
Ambiguous question → one clarifying question, no guess
A short question that could mean two very different things. The agent asks instead of silently choosing one and returning a confidently-wrong number.
Input
Question: what are our best products?
Output
{
"decision": "CLARIFY",
"confidence": 0.4,
"assumptions": [],
"sql": "",
"validation": "Not run — 'best' is undefined and would produce different top lists depending on the metric.",
"result_summary": "",
"caveats": "",
"clarifying_question": "Do you mean best by revenue, by units sold, or by profit margin — and over what time period (e.g. last 30 days, this year)?",
"refusal_reason": ""
}Note: 'Best' is the trap. Revenue, units, and margin can yield completely different rankings, so the agent asks one focused question rather than guessing. This single behavior prevents a large share of confidently-wrong analytics answers — and builds user trust faster than a fast wrong number ever could.
Unsafe request → refuse the write and the unbounded scan
The request mixes a data modification with an unbounded read. The agent refuses both and offers a safe alternative.
Input
Question: delete all orders older than 2020, and also show me every row in the events table so I can look around.
Output
{
"decision": "REFUSE",
"confidence": 0.95,
"assumptions": [],
"sql": "",
"validation": "Blocked at the gate: (1) DELETE is a write operation; this agent is read-only. (2) 'every row in events' is an unbounded scan of a ~2.1B-row table, far over the cost budget and with no LIMIT.",
"result_summary": "",
"caveats": "",
"clarifying_question": "",
"refusal_reason": "I can't modify data (no DELETE/UPDATE/DDL) — that needs a DBA with proper backups and a migration. And I won't scan the entire events table (~2.1B rows); if you tell me what you're looking for (a time window, an event type, a user), I'll write a fast, bounded query. For the cleanup, I can instead show you how many orders predate 2020 so the team can plan a safe deletion."
}Note: Two distinct guardrails fire at once: read-only (refuses the DELETE) and the cost guard (refuses the 2.1B-row scan). Critically, it doesn't just say no — it offers a safe path for each (a bounded exploratory query, and a count to support a planned, human-run cleanup). Defensive but still helpful.
Implementation notes
- Connect with a least-privilege, read-only database role. Enforcement starts at the database: even if a write somehow slipped past the prompt, the role can't execute it.
- Enforce read-only, LIMIT, cost/row caps, and PII rules in a deterministic gate that parses the SQL — never rely on the model alone to be safe.
- Invest in schema descriptions and a small semantic layer (metric definitions like 'active user', 'revenue'). This is the single biggest driver of accuracy and consistency.
- Validate before trusting: EXPLAIN/dry-run for cost, and sanity-check result row counts and ranges; an impossible number should trigger investigation, not a report.
- Make 'clarify on ambiguity' a first-class behavior. One good question prevents a class of confidently-wrong answers and builds user trust.
- Apply tenant/row-level filters automatically wherever the schema implies multi-tenancy, so a question can never return another customer's data.
- Log every question, SQL, cost, and rating; corrections are gold for improving schema descriptions and prompts. Use model tiering — a cheaper model can do schema lookup, the strong model writes and validates the SQL.
Variations
Basic
Read-only Q&A
Answers natural-language questions with grounded, read-only SQL, shows the query and assumptions, and enforces row/cost caps. The simplest safe self-serve layer.
Advanced
Semantic-layer analyst
Adds a governed metric/semantic layer (canonical definitions of revenue, active user, etc.), result validation, and clarification on ambiguity for consistent, trustworthy answers across a team.
Enterprise
Governed self-serve analytics
Adds row-level security and tenant isolation, per-role data policies, PII controls, full query audit, cost governance across a warehouse, and feedback-driven tuning of the semantic layer.
Download the Agent Blueprint
Export
This flagship blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
Frequently asked questions
No. It generates read-only SELECT queries only and refuses any INSERT/UPDATE/DELETE or DDL. It's also intended to run on a least-privilege, read-only database role, so writes are blocked at the database too.
It's grounded in the schema you provide and may only reference tables and columns that actually exist there. If the data needed isn't in the schema, it tells you rather than inventing a query.
A deterministic cost gate: every query must be bounded with a LIMIT, it's EXPLAIN/dry-run-checked, and anything whose estimated cost or rows exceed your budget is blocked and narrowed or refused — never run blindly.
It asks one focused clarifying question instead of guessing. Vague asks like 'best products' (revenue? units? margin? what period?) get a quick question so you get the answer you actually meant.
It avoids selecting raw PII columns unless explicitly required and allowed, prefers aggregates and redaction, and applies tenant/row-level filters where the schema implies multi-tenancy so it can't return another customer's data.
Accuracy depends heavily on schema quality — adding column/table descriptions and a small set of canonical metric definitions is the biggest lever. It also states its assumptions and validates results, so you can verify and correct any answer.