Overview
Clusters raw feedback (tickets, reviews, surveys) into clear themes with representative quotes.
Reports honest frequency and sample size, so a theme's weight is visible — not just its existence.
Separates real recurring signal from vocal-minority or small-sample noise, and flags low-confidence themes.
Defensive: every theme is grounded in cited source items; it never fabricates sentiment, frequency, or causation.
AgentAz™ specification
A lightweight, design-time governance spec for security review. It documents what this agent is authorized to do — and why — and pairs with whatever policy engine you already run. It does not enforce anything at runtime.
Machine-readable contract (agentaz.json), validated against the open AgentAz™ JSON Schema — bundled for offline use and published at a permanent URL:
{
  "$schema": "./agentaz.schema.json",
  "version": "2.0.0",
  "last_reviewed": "2026-06-24",
  "agent_id": "feedback-synthesis-agent",
  "trust_level": "A2",
  "dna_pattern": "Synthesis",
  "worst_case_action": "Produces an inaccurate theme for PM review. Cannot create tickets or change a roadmap.",
  "authority_boundary": "Clusters feedback into cited themes; ticket/roadmap tools absent; no fabricated feedback.",
  "tags": [
    "product-management",
    "feedback",
    "synthesis",
    "read-only",
    "human-review"
  ],
  "tool_boundary": {
    "allowed_tools": [
      "read_feedback",
      "cluster",
      "extract_quotes",
      "count_signal"
    ],
    "execution_tools_absent": true
  },
  "output_boundary": {
    "format": "structured_json",
    "never_emits": [
      "create_ticket",
      "roadmap_write"
    ],
    "never_fabricates": true
  },
  "cost_boundary": {
    "max_usd_per_trace_loop": 0.25,
    "alert_threshold_usd": 0.16
  },
  "loop_boundary": {
    "max_reasoning_turns": 8
  },
  "human_handoff": {
    "triggers": [
      "thin_theme",
      "conflicting_signal"
    ],
    "destination": "product_manager"
  },
  "audit": {
    "append_only": true,
    "logs": [
      "themes",
      "source_counts"
    ]
  }
}
New to this? Read the AgentAz specification guide: Trust Levels, DNA patterns, and how it complements your runtime.
AgentAz™ is open source under Apache-2.0 — schema (frozen v1.0.0) and source on GitHub.
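The contract is documentation rather than enforcement, but a few of its fields can be cross-checked mechanically before review. The following is a minimal Python sketch that assumes only the field names shown in agentaz.json above. The `check_contract` helper and the two checks it runs are illustrative assumptions; they are not part of the AgentAz schema or its tooling.

```python
import json

# Excerpt of the agentaz.json contract shown above (values copied verbatim).
CONTRACT = json.loads("""
{
  "tool_boundary": {
    "allowed_tools": ["read_feedback", "cluster", "extract_quotes", "count_signal"],
    "execution_tools_absent": true
  },
  "output_boundary": {"never_emits": ["create_ticket", "roadmap_write"]},
  "cost_boundary": {"max_usd_per_trace_loop": 0.25, "alert_threshold_usd": 0.16}
}
""")


def check_contract(contract: dict) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed.

    Hypothetical helper: these checks are assumptions for illustration,
    not rules defined by the AgentAz schema.
    """
    problems = []
    allowed = set(contract["tool_boundary"]["allowed_tools"])
    forbidden = set(contract["output_boundary"]["never_emits"])
    overlap = allowed & forbidden
    if overlap:
        problems.append(f"tools both allowed and forbidden: {sorted(overlap)}")
    cost = contract["cost_boundary"]
    if cost["alert_threshold_usd"] >= cost["max_usd_per_trace_loop"]:
        problems.append("alert threshold should sit below the hard cost cap")
    return problems


print(check_contract(CONTRACT))  # prints []
```

A check like this belongs in design-time review or CI; it says nothing about what the agent does at runtime.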
Governance matrix
A scannable summary of this blueprint's governance coverage, derived from its AgentAz™ specification. It documents the boundaries that already ship — not new functionality.
| Dimension | Coverage |
|---|---|
| Agent goal | Bounded by the authority spec above |
| Trust Level | A2 — Recommend |
| Tool access | Least privilege — execution tools absent (read-only) |
| Context handling | Grounded in provided inputs; cites or flags rather than guessing |
| Memory strategy | Task-scoped; no persistent cross-session memory |
| Human approval | Required on a thin theme or conflicting signal; routed to the product manager |
| Audit trail | Append-only log (themes, source counts) |
| Cost & loop bounds | ≤ $0.25 per loop · ≤ 8 reasoning turns |
| Recovery / escalation | Escalates to product manager |
Agent component mapping
A framework-neutral view of how this blueprint maps to standard agent-architecture components (the vocabulary common to ADK-style frameworks). It describes structure for clarity — not an official integration or certified compatibility.
| Component | Mapping |
|---|---|
| Agent | Primary reasoner with Recommend authority (A2) |
| Tools | read feedback, cluster, extract quotes, count signal — execution tools absent (read-only) |
| Memory | Task-scoped working context; no persistent cross-session memory |
| Guardrails | Worst-case classified (A2); no execution tools; ≤ $0.25/loop · ≤ 8 turns |
| Evaluator | Confidence and authority-boundary checks; low-confidence or out-of-bounds results are flagged, not actioned |
| Handoff | Escalates to the product manager on a thin theme or conflicting signal |
Failure modes
Specific ways this blueprint can fail, and how it is designed to detect, contain, and recover from each — the boundaries that make it safe to run, stated plainly.
Produces a theme not actually supported by the feedback.
- Detection: Themes cite representative quotes, and unsupported themes are flagged.
- Mitigation: It clusters real feedback, never fabricates, and has no ticket or roadmap tools.
- Recovery: The PM discards the weak theme.
Miscounts signal, over- or under-stating a theme's prevalence.
- Detection: Source counts accompany each theme.
- Mitigation: Actual counts are shown rather than asserted impressions.
- Recovery: The PM recounts.
Drops a small but important signal.
- Detection: Thin or conflicting themes are flagged, not dropped.
- Mitigation: It surfaces uncertainty.
- Recovery: The PM reviews the raw feedback.
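Detection of an unsupported theme depends on quotes being traceable to the corpus. One way to check that is sketched below in Python, assuming themes carry a `quotes` list as in the output format. The `unsupported_quotes` helper is hypothetical, and matching by normalized substring is an assumption, not the blueprint's documented method.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(theme: dict, corpus: list[str]) -> list[str]:
    """Return the theme's quotes that are not excerpts of any corpus item."""
    items = [normalize(item) for item in corpus]
    return [
        quote
        for quote in theme["quotes"]
        if not any(normalize(quote) in item for item in items)
    ]


# Illustrative data only.
corpus = [
    "It takes 15+ seconds to load my main dashboard every morning.",
    "Exports work fine for me.",
]
theme = {
    "theme": "Dashboard load times are slow",
    "quotes": [
        "takes 15+ seconds to load my main dashboard every morning",
        "the dashboard crashes daily",  # not in the corpus
    ],
}
print(unsupported_quotes(theme, corpus))  # prints ['the dashboard crashes daily']
```

A theme whose quotes all fail this check, or that cites no quotes at all, is a candidate for flagging rather than delivery.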
Evaluation
The primary measure is theme validity: each theme must be supported by real feedback. An unsupported theme or a miscounted prevalence counts as a failure.
| Metric | What it measures |
|---|---|
| Theme validity | Share of themes supported by representative source quotes. |
| Prevalence accuracy | Agreement of stated theme counts with the actual feedback. |
| Fabrication rate | Frequency of themes not present in the feedback — should be near zero. |
| Signal recall | Of small-but-important signals, the share retained. |
| Latency | Time to synthesize a feedback set. |
Recommended approach. Use feedback sets with human-labeled themes and counts; measure theme validity (every theme must cite real quotes) and prevalence accuracy. The agent has no ticket or roadmap tools, so evaluation covers synthesis quality only.
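Three of the metrics above can be computed directly from a labeled set. The Python sketch below assumes predicted and human-labeled themes have already been matched to shared keys, which is the hard part in practice and is not shown. The `evaluate` function, the use of mean absolute count error as the prevalence measure, and the sample data are all illustrative assumptions.

```python
def evaluate(
    predicted: dict[str, int],
    gold: dict[str, int],
    important_small: set[str],
) -> dict[str, float]:
    """Score one synthesis against human labels.

    predicted / gold map a theme key to its supporting-item count.
    important_small holds gold theme keys judged small but important.
    """
    matched = predicted.keys() & gold.keys()
    fabricated = predicted.keys() - gold.keys()
    abs_errors = [abs(predicted[key] - gold[key]) for key in matched]
    retained = important_small & predicted.keys()
    return {
        # Share of predicted themes with no counterpart in the labels.
        "fabrication_rate": len(fabricated) / len(predicted) if predicted else 0.0,
        # Prevalence accuracy, expressed here as mean absolute count error.
        "mean_abs_count_error": sum(abs_errors) / len(abs_errors) if abs_errors else 0.0,
        # Share of small-but-important labeled themes that were kept.
        "signal_recall": len(retained) / len(important_small) if important_small else 1.0,
    }


# Made-up example data.
predicted = {"slow_dashboard": 84, "csv_encoding": 3, "dark_mode": 12}
gold = {"slow_dashboard": 86, "csv_encoding": 3, "sso_login": 4}
scores = evaluate(predicted, gold, important_small={"csv_encoding", "sso_login"})
print(scores)
```

Theme validity is not covered here because it requires checking each cited quote against the corpus text.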
When to use
Use it when
- You have more user feedback than anyone can read and need it distilled into themes.
- You want representative quotes and honest frequencies to support prioritization, not vibes.
- You want to distinguish a widespread issue from a loud handful before you act on it.
- You're preparing a voice-of-customer summary, a roadmap input, or a release retro.
Avoid it when
- You want it to declare priorities or make roadmap decisions. It synthesizes evidence; humans decide.
- Your feedback is too sparse to support themes (it should flag that, not invent themes).
- You need verified causal claims about why metrics moved — that needs analysis beyond feedback text.
- You can't ensure appropriate PII care for the user text it would handle.
System prompt
You are a User Feedback Synthesis Agent for a product team. You turn a corpus of raw user feedback into themes with evidence and honest frequency. You are judged on faithful, useful synthesis — and on never fabricating a theme, a number, or a sentiment the data doesn't support.
== CORE PRINCIPLES ==
1. Grounded in cited items. Every theme must be backed by specific source items (with representative quotes). If you can't cite it, you can't claim it.
2. Honest frequency and sample. Report how many items support each theme and out of how many total. Never inflate prevalence. A theme mentioned by 3 of 400 is a 3/400 theme, not a "top issue".
3. Signal vs. noise. Distinguish a widespread recurring theme from a vocal minority or a small sample. Flag low-confidence/small-N themes explicitly rather than elevating them.
== HARD RULES (NON-NEGOTIABLE) ==
- NO FABRICATION: Never invent feedback, quotes, sentiment, or counts. Quotes must be real excerpts from the corpus. If unsure, omit.
- NO FALSE CAUSATION: Do not claim feedback explains a metric change or that one theme "causes" another unless the feedback explicitly says so. Report what users said, not an inferred causal story.
- REPRESENT FAIRLY: Don't over-weight a few emphatic items. Surface dissent and counter-themes; note when sentiment is mixed.
- FLAG WEAK EVIDENCE: Mark themes built on a small sample or ambiguous wording as low-confidence.
- PII: Treat user identifiers as sensitive; quote content, not personal data; redact where present.
== METHOD ==
- Read and dedupe the corpus. Cluster items into coherent themes by the underlying issue/request, not surface words.
- For each theme: count supporting items, pull 1-3 representative real quotes, tag sentiment, and assess confidence from frequency and clarity.
- Surface counter-signal and note sample sizes. Rank by evidenced frequency, not by how loud individual items are.
== OUTPUT FORMAT (return ONE JSON object) ==
{
  "corpus": { "total_items": <n>, "deduped": <n>, "sources": ["tickets","reviews","survey"] },
  "themes": [
    {
      "theme": "<concise theme>",
      "count": <supporting items>,
      "frequency": "<count>/<total> (<pct>%)",
      "sentiment": "positive|negative|mixed",
      "confidence": "high|medium|low",
      "quotes": ["<real excerpt>", "..."],
      "note": "<caveat, e.g. small sample / mixed / vocal minority, or empty>"
    }
  ],
  "counter_signal": ["<dissent or contradicting feedback, if any>"],
  "not_captured": "<themes too sparse to assert, or empty>",
  "caveats": ["<sampling/representativeness limits>"]
}
Rank themes by evidenced frequency. Mark small-sample or ambiguous themes low-confidence. Never assert a theme you cannot cite.
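Because the system prompt asks for one JSON object with a fixed frequency format, the result can be checked before it reaches a PM. The Python sketch below is one possible consistency check. The `check_output` helper is hypothetical, and it assumes the frequency denominator is the deduped total, as in the worked examples later on this page.

```python
import re

# Matches the "<count>/<total> (<pct>%)" form from the output format.
FREQ_RE = re.compile(r"^(\d+)/(\d+) \((\d+(?:\.\d+)?)%\)$")


def check_output(result: dict) -> list[str]:
    """Return problems found in one synthesis result (an empty list means none)."""
    problems = []
    deduped = result["corpus"]["deduped"]
    for theme in result["themes"]:
        name = theme["theme"]
        if not theme["quotes"]:
            problems.append(f"{name}: no quotes cited")
        match = FREQ_RE.match(theme["frequency"])
        if match is None:
            problems.append(f"{name}: frequency is not '<count>/<total> (<pct>%)'")
            continue
        count, total, pct = int(match[1]), int(match[2]), float(match[3])
        if count != theme["count"]:
            problems.append(f"{name}: frequency count {count} != count field {theme['count']}")
        # Assumption: the denominator is the deduped total.
        if total != deduped:
            problems.append(f"{name}: denominator {total} != deduped total {deduped}")
        # Allow rounding slack of half a percentage point.
        if total == 0 or abs(100 * count / total - pct) > 0.5:
            problems.append(f"{name}: stated {pct}% does not match {count}/{total}")
    return problems


good = {
    "corpus": {"total_items": 400, "deduped": 372},
    "themes": [
        {
            "theme": "Slow dashboard",
            "count": 86,
            "frequency": "86/372 (23%)",
            "quotes": ["takes 15+ seconds to load my main dashboard every morning"],
        }
    ],
}
print(check_output(good))  # prints []
```

This only verifies internal consistency. It cannot tell whether the count itself is right; that still needs the source items.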
Setup guide
Install and connect feedback sources
Install the agent and connect it to your feedback sources.
pipx install feedback-synth-agent
feedback-synth-agent connect --sources zendesk,appstore,typeform
feedback-synth-agent doctor
Configure honesty guardrails
Thresholds for confidence and the no-fabrication posture.
cp .env.example .env
# .env
ANTHROPIC_API_KEY=sk-ant-...
LOW_CONFIDENCE_BELOW_N=5
REQUIRE_REAL_QUOTES=true
REDACT_PII=true
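To make the thresholds concrete, here is a minimal Python sketch of how settings like these could be loaded and applied. The variable names come from the configuration above. The `Guardrails` class, its defaults, and the reading of `LOW_CONFIDENCE_BELOW_N` as "fewer than N supporting items means low confidence" are assumptions for illustration, not the agent's actual implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical container for the honesty settings."""

    low_confidence_below_n: int
    require_real_quotes: bool
    redact_pii: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Guardrails":
        def flag(name: str, default: str) -> bool:
            return env.get(name, default).strip().lower() == "true"

        # Defaults are assumed here; they err on the cautious side.
        return cls(
            low_confidence_below_n=int(env.get("LOW_CONFIDENCE_BELOW_N", "5")),
            require_real_quotes=flag("REQUIRE_REAL_QUOTES", "true"),
            redact_pii=flag("REDACT_PII", "true"),
        )

    def is_low_confidence(self, count: int) -> bool:
        """Assumed semantics: fewer than N supporting items means low confidence."""
        return count < self.low_confidence_below_n


settings = Guardrails.from_env(
    {"LOW_CONFIDENCE_BELOW_N": "5", "REQUIRE_REAL_QUOTES": "true", "REDACT_PII": "true"}
)
print(settings.is_low_confidence(3))   # prints True
print(settings.is_low_confidence(86))  # prints False
```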
Set scope and segments
Define the period and any segmentation for the synthesis.
# scope.yml
period: last_30d
segment_by: [plan_tier]
min_theme_size: 3  # smaller clusters are reported as low-confidence, not dropped silently
Run a synthesis
Generate themes and review the cited frequencies and quotes.
feedback-synth-agent run --period 30d --explain
# prints themes with count/frequency/quotes/confidence + counter-signal
Wire into your workflow
Schedule recurring synthesis and post to your PM channel or docs.
# weekly job -> voice-of-customer summary to #product (read-only synthesis)
Architecture
Tools required
Workflow
1. Ingest the corpus
Pull feedback across sources and record total volume and the source mix.
2. Dedupe & normalize
Collapse duplicates so counts reflect distinct feedback.
3. Cluster into themes
Group items by the underlying issue or request, not surface wording.
4. Attach evidence
For each theme, count supporting items, compute honest frequency, and pull real representative quotes.
5. Assess confidence & signal
Rate confidence by sample size and clarity; separate widespread signal from vocal-minority noise; surface counter-themes.
6. Rank honestly
Order themes by evidenced frequency, mark low-confidence ones, and note what was too sparse to assert.
7. Deliver & redact
Output the cited synthesis with caveats, keeping user PII out of the result.
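The counting steps of this workflow (dedupe, attach frequency, rank) can be sketched in a few lines of Python. The sketch assumes clustering has already produced one theme label per distinct item; the clustering itself is not shown. All function names are hypothetical, and the percentage rounding simply mirrors the examples on this page.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing items."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(items: list[str]) -> list[str]:
    """Keep the first occurrence of each item, comparing normalized text."""
    seen: set[str] = set()
    unique = []
    for item in items:
        key = normalize(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def frequency(count: int, total: int) -> str:
    """Format count-over-total; one decimal place only below 1%."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = 100 * count / total
    shown = f"{pct:.1f}" if pct < 1 else f"{pct:.0f}"
    return f"{count}/{total} ({shown}%)"


def rank_themes(labels: list[str], total: int) -> list[dict]:
    """Count items per theme label and order by evidenced frequency, highest first."""
    return [
        {"theme": theme, "count": n, "frequency": frequency(n, total)}
        for theme, n in Counter(labels).most_common()
    ]


# Illustrative data only.
raw = ["Dashboard is slow", "dashboard  is slow", "Dashboard takes forever", "CSV export garbles names"]
unique = dedupe(raw)  # the second item is a duplicate of the first
labels = ["slow_dashboard", "slow_dashboard", "csv_encoding"]  # hypothetical cluster output
for row in rank_themes(labels, total=len(unique)):
    print(row)
```

Exact-match deduplication is the simplest possible choice; near-duplicate detection would change the counts and is a separate design decision.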
Examples
Clear recurring theme → reported with frequency and quotes
A strong, widespread theme appears across many items. The agent reports it with an honest count and representative quotes.
Input
Corpus: 400 items (tickets+reviews). 86 mention slow load times on the dashboard.
Output
{
  "corpus": { "total_items": 400, "deduped": 372, "sources": ["tickets","reviews"] },
  "themes": [
    {
      "theme": "Dashboard load times are slow, especially with large datasets",
      "count": 86,
      "frequency": "86/372 (23%)",
      "sentiment": "negative",
      "confidence": "high",
      "quotes": ["takes 15+ seconds to load my main dashboard every morning", "unusable once I have more than a few thousand rows"],
      "note": ""
    }
  ],
  "counter_signal": ["A few power users on the new infra reported fast loads (possible rollout difference)"],
  "not_captured": "",
  "caveats": ["Tickets over-represent users who hit problems; absolute prevalence may differ from the full base"]
}
Note: A high-confidence theme: 86 of 372 deduped items (23%), with real representative quotes and a clear frequency. Note it still surfaces counter-signal (power users with fast loads) and a sampling caveat rather than overstating prevalence.
Small-sample signal → reported but flagged low-confidence
A potentially interesting issue appears only a few times. The agent reports it honestly without elevating it to a top issue.
Input
Corpus: 400 items. 3 mention a specific export-to-CSV encoding bug.
Output
{
  "corpus": { "total_items": 400, "deduped": 372, "sources": ["tickets","reviews"] },
  "themes": [
    {
      "theme": "CSV export has a character-encoding issue for non-Latin names",
      "count": 3,
      "frequency": "3/372 (0.8%)",
      "sentiment": "negative",
      "confidence": "low",
      "quotes": ["exported CSV shows garbled characters for my customers' names"],
      "note": "Small sample (3 items): could be a real narrow bug or noise. Worth a quick check, not a top priority on this evidence alone."
    }
  ],
  "counter_signal": [],
  "not_captured": "Several one-off requests appeared once each and are not asserted as themes",
  "caveats": ["3 items is too few to estimate prevalence; treat as a lead to investigate, not a validated theme"]
}
Note: The honesty case: only 3 mentions. The agent still surfaces it (it could be a real narrow bug) but explicitly marks it low-confidence with its tiny frequency, and refuses to dress it up as a major theme. This is what stops a vocal-minority issue from hijacking a roadmap.
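To see why 3 items is too few to estimate prevalence, it helps to put an interval around the proportion. The Python sketch below uses the Wilson score interval, a standard method chosen here for illustration; the blueprint does not say it computes intervals. It also assumes feedback items behave like a random sample, which the sampling caveats on this page say they usually do not, so read the numbers as a lower bound on the uncertainty.

```python
import math


def wilson_interval(count: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = count / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


for count in (3, 86):
    low, high = wilson_interval(count, 372)
    print(f"{count}/372: {100 * low:.1f}% to {100 * high:.1f}%")
```

For 3 of 372 the interval runs from roughly 0.3% to 2.3%, so the upper bound is several times the lower bound. For 86 of 372 it is roughly 19% to 28%, a far tighter range in relative terms.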
Tempting but unsupported causal claim → refuse to fabricate
The team hopes the feedback explains a churn spike. The feedback doesn't actually say that, so the agent declines to assert causation.
Input
Corpus: 400 items. Some negative comments about a recent redesign. Prompt context: 'churn went up last month — does the feedback explain it?'
Output
{
  "corpus": { "total_items": 400, "deduped": 372, "sources": ["tickets","reviews","survey"] },
  "themes": [
    {
      "theme": "Mixed reactions to the navigation redesign",
      "count": 41,
      "frequency": "41/372 (11%)",
      "sentiment": "mixed",
      "confidence": "medium",
      "quotes": ["the new nav took me a while to get used to", "actually prefer the new layout once I found things"],
      "note": "Sentiment is genuinely mixed: roughly half negative, half adapting/positive"
    }
  ],
  "counter_signal": ["Several users explicitly said they liked the redesign"],
  "not_captured": "",
  "caveats": ["This feedback does NOT establish that the redesign caused the churn increase. The corpus shows redesign reactions are mixed; no item links a cancellation to the redesign. A causal claim would require churn-cohort analysis, not feedback text alone."]
}
Note: The defining defensive case: there's a strong pull to hand the team the tidy story 'the redesign caused the churn.' The agent reports the redesign theme honestly as mixed, surfaces the positive counter-signal, and explicitly states the feedback can't establish causation, pointing to the analysis that actually could. It refuses to manufacture a satisfying but unsupported narrative.
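A crude post-check can catch causal wording that slips into a theme title or note. The Python sketch below flags such wording when none of the theme's own quotes uses causal language. It is a keyword heuristic, not the agent's documented guardrail; the `causal_claims` helper and the phrase list are assumptions, and it would also flag negated statements, so a human still needs to read what it surfaces.

```python
import re

# Assumed, non-exhaustive list of causal phrasings.
CAUSAL = re.compile(
    r"\b(caus\w*|led to|leads to|due to|because of|drove|resulted in)\b",
    re.IGNORECASE,
)


def causal_claims(result: dict) -> list[str]:
    """Return theme titles or notes with causal wording that no quote in that theme backs."""
    flagged = []
    for theme in result["themes"]:
        quoted = " ".join(theme["quotes"])
        for field in ("theme", "note"):
            text = theme.get(field, "")
            if CAUSAL.search(text) and not CAUSAL.search(quoted):
                flagged.append(text)
    return flagged


# Illustrative data: the first theme asserts causation that its quote never states.
result = {
    "themes": [
        {
            "theme": "Redesign caused the churn spike",
            "quotes": ["the new nav took me a while to get used to"],
            "note": "",
        },
        {
            "theme": "Mixed reactions to the navigation redesign",
            "quotes": ["actually prefer the new layout once I found things"],
            "note": "Sentiment is genuinely mixed",
        },
    ]
}
print(causal_claims(result))  # prints ['Redesign caused the churn spike']
```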
Implementation notes
- Require real, cited quotes for every theme and block fabricated ones; a synthesis you can't trace to source items is worse than no synthesis.
- Always report frequency as count-over-total; bare theme lists hide whether something is widespread or a handful of loud voices.
- Flag small-sample and ambiguous themes as low-confidence instead of dropping or inflating them — leads are fine, just labeled.
- Forbid causal claims unless the feedback states them; route metric-causation questions to actual analysis, not vibes from text.
- Surface counter-signal and mixed sentiment so the team sees dissent, not just the dominant narrative.
- Keep PII out: quote content, redact identifiers, and treat the corpus as sensitive user data.
- Spend the strong model on theme framing, confidence, and counter-signal — a cheaper model can dedupe and cluster.
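As an illustration of the PII note above, here is a minimal Python redaction pass for quotes. The `redact` function is hypothetical, and both patterns are deliberately simple assumptions: they cover common email and phone shapes only, miss names and account IDs, and the phone pattern can over-match other digit runs such as dates. A production setup would want a dedicated PII tool.

```python
import re

# Deliberately simple patterns; both are illustrative assumptions.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[email]", text)
    return PHONE.sub("[phone]", text)


quote = "Contact me at jane.doe@example.com or +1 (415) 555-0100 about the export bug"
print(redact(quote))
# prints: Contact me at [email] or [phone] about the export bug
```

Redacting before quotes are selected, rather than after, keeps identifiers out of intermediate logs as well as the final synthesis.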
Variations
Basic
Theme summarizer
Clusters a feedback corpus into themes with representative quotes and counts for a quick read. Single source, on demand.
Advanced
Evidence-graded synthesis
Adds honest frequency, confidence grading, signal-vs-noise separation, counter-signal, and causation guardrails across multiple sources.
Enterprise
Continuous voice-of-customer
Adds scheduled multi-source synthesis, segmentation, trend tracking over time, PII governance, and integration into roadmap and research workflows.
This blueprint and the AgentAz™ specification live in the central AgentKits registry — open source under Apache-2.0 (code & schema) and CC‑BY‑4.0 (text).
Frequently asked questions
No — that's its core constraint. Every theme must be grounded in cited source items, and quotes must be real excerpts from the corpus. If it can't cite something, it doesn't claim it.
It reports honest frequency (count over total) for every theme and flags small-sample or ambiguous themes as low-confidence, so a handful of emphatic items can't masquerade as a top issue.
Only what users actually said. It won't claim feedback caused a metric change unless the feedback states it — those causal questions need cohort analysis, and the agent will say so rather than invent a story.
No. It produces a faithful, evidence-graded synthesis for prioritization; the product team makes the roadmap decisions.
It treats the corpus as sensitive, quotes content rather than personal data, and redacts identifiers so the synthesis doesn't leak PII.
It surfaces them as low-confidence leads with their small frequency rather than dropping them silently or inflating them, so genuinely narrow issues stay visible without distorting priorities.