A visual guide to LLMs, agents, and generative UI — how they actually work, and why it matters for what you're building.
You do not need to read this straight through. Pick the track that matches the decision you are trying to make, then use the glossary and sources when a claim needs verification.
Read Chapters 1-7, 16-19, 22-34. Focus on model choice, evals, launch thresholds, cost, and when not to use AI.
Read Chapters 3, 8, 14-16, 20, 31-33. Focus on trust, failure states, AI-generated UI, and human control.
Read Chapters 1-13, 16-21, 23, 26, 28-29. Focus on schemas, tools, RAG, observability, evals, and cost.
Read Chapters 7, 18-19, 22, 24-30, 34. Focus on economics, defensibility, distribution, data, and product risk.
Every technical chapter should cash out into a product decision. If a concept does not change what you build, measure, price, or disclose to users, treat it as background knowledge and keep moving.
You think in words. LLMs think in tokens. Understanding this difference is the foundation of everything else.
When you type "Hello, how are you?" into ChatGPT or Gemini, the model doesn't see five words. It sees something like this:
A token is a chunk of text — sometimes a whole word, sometimes part of a word, sometimes just a character. The model has a fixed vocabulary (think of it as a dictionary) of roughly 30,000–100,000 tokens, and every piece of text gets broken into pieces from that dictionary.
Imagine you have a box of 50,000 unique LEGO bricks, each with a different shape. To represent any object, you combine bricks from your box. Common objects (like "the" or "hello") get their own single brick. Rare words get broken into multiple bricks. The word "tokenization" might become three bricks: token + iz + ation.
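To make that concrete, here is a toy tokenizer: greedy longest-match against a tiny invented vocabulary. Real tokenizers learn 30,000–100,000 entries with algorithms like BPE, but the text-in, token-IDs-out shape is the same:

```typescript
// Toy tokenizer: greedy longest-match against a tiny, hand-made vocabulary.
// Real tokenizers (BPE, SentencePiece) learn their entries from data;
// this only shows how text becomes a sequence of integer IDs.
const VOCAB: Record<string, number> = {
  "hello": 1, ",": 2, " how": 3, " are": 4, " you": 5, "?": 6,
  "token": 7, "iz": 8, "ation": 9,
  // ...a real vocabulary continues for tens of thousands of entries
};

function tokenize(text: string): number[] {
  const ids: number[] = [];
  let i = 0;
  while (i < text.length) {
    // Take the longest vocabulary entry that matches at position i.
    let match = "";
    for (const piece of Object.keys(VOCAB)) {
      if (text.startsWith(piece, i) && piece.length > match.length) match = piece;
    }
    if (match === "") { i += 1; continue; } // unknown character: skip (real tokenizers fall back to bytes)
    ids.push(VOCAB[match]);
    i += match.length;
  }
  return ids;
}

console.log(tokenize("hello, how are you?")); // [1, 2, 3, 4, 5, 6]  (6 tokens, not 5 words)
console.log(tokenize("tokenization"));        // [7, 8, 9]           (one word, three tokens)
```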
Every API call to an LLM is priced per token. Every model has a maximum number of tokens it can handle at once (its "context window"). When you're designing generative UI — a protocol where an LLM generates UI component trees — the size of that output in tokens directly determines what each generation costs, how long the user waits for it to stream in, and how complex a layout you can produce before hitting the context limit.
Different models use different tokenizers, but the patterns are similar:
| Text | Approximate Tokens | Why |
|---|---|---|
| Hello | 1 | Common word, gets its own token |
| authentication | 2–3 | Long word, split into parts |
| {"type": "card"} | 7–9 | JSON has lots of punctuation, each costs a token |
| A full paragraph (100 words) | ~130 | Rule of thumb: 1 token ≈ 0.75 words in English |
| A complex generative UI layout (20 components) | ~1,500–3,000 | Nested JSON structures are token-expensive |
When you design the generative UI schema, every field name you choose costs tokens. A field called backgroundColor costs more tokens than bg. But bg is ambiguous and the model might misinterpret it. This is a real product tradeoff: schema readability vs token efficiency. This is a product and UX decision as much as an engineering one.
Tokens are the fundamental unit of LLM computation. Everything — cost, speed, capability limits — flows from token counts. When someone says "this model has a 128K context window," they mean 128,000 tokens, which is roughly a 200-page book.
OpenAI's pricing is entirely token-based, and the exact prices move as model families change.S1 This means a company like Notion AI, which processes millions of documents daily, must obsess over token efficiency — every unnecessary word in their system prompt costs real money at scale.
Cursor (the AI code editor) ran into token limits early. Their codebase context feature had to be carefully designed to select only the most relevant files to include in the context — because stuffing an entire repo into the prompt would blow past token limits and cost a fortune. They built a retrieval system that picks the 5-10 most relevant files, not all 500.
Stripe optimized their fraud detection prompts to use ~40% fewer tokens by switching from verbose natural language descriptions to compressed, structured formats — cutting their API costs proportionally while maintaining accuracy.
The most important idea in modern AI is embarrassingly simple: predict the next word. That's it. Everything else — conversations, code, UI generation — is a consequence of this one trick done at extraordinary scale.
An LLM doesn't "understand" language the way you do. It's a prediction machine. Given a sequence of tokens, it calculates the probability of every possible next token, then picks one.
Here's the key insight: the model generates text one token at a time. After it picks "mat," the input becomes "The cat sat on the mat" and it predicts the next token again. This process repeats — token by token — until the model generates a special "stop" token or hits the maximum output length.
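A minimal sketch of that loop. The nextTokenProbs stub stands in for a real model's forward pass over its whole vocabulary; the loop around it is the actual shape of generation:

```typescript
// Conceptual sketch of autoregressive generation.
const STOP_TOKEN = 0;

// Stand-in for a real model forward pass: a real LLM scores every entry in its
// ~100K-token vocabulary here. This toy version just prefers stopping once the
// sequence is long enough.
function nextTokenProbs(tokens: number[]): Map<number, number> {
  return tokens.length > 8
    ? new Map([[STOP_TOKEN, 0.9], [42, 0.1]])
    : new Map([[42, 0.7], [7, 0.2], [STOP_TOKEN, 0.1]]);
}

function pickMostLikely(probs: Map<number, number>): number {
  let best = STOP_TOKEN, bestP = -1;
  for (const [token, p] of probs) if (p > bestP) { best = token; bestP = p; }
  return best;
}

function generate(prompt: number[], maxNewTokens = 256): number[] {
  const tokens = [...prompt];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = pickMostLikely(nextTokenProbs(tokens)); // one token per step
    if (next === STOP_TOKEN) break;                      // model signals it is done
    tokens.push(next);                                   // the output becomes part of the next input
  }
  return tokens;
}

console.log(generate([7, 8, 9]));
```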
You know how your phone keyboard suggests the next word? An LLM is the same idea, but instead of being trained on your text messages, it's been trained on a significant fraction of all text ever written by humans. And instead of choosing from 3 suggestions, it's choosing from 100,000 possibilities, weighted by probability. The "magic" is just autocomplete at absurd scale.
How does the model learn these probabilities? Through training on enormous amounts of text. The process is conceptually simple: show the model a chunk of text with the next token hidden, have it predict that token, compare the prediction to the real token, and nudge the model's parameters so the right answer becomes slightly more likely. Repeat billions of times.
The model doesn't memorize text. It learns patterns — statistical relationships between tokens. After seeing millions of sentences about cats sitting on things, it learns that "mat" is the most likely word after "the cat sat on the." After seeing millions of JSON objects, it learns the patterns of valid JSON. After seeing millions of code snippets, it learns programming syntax.
The process above — showing the model billions of examples and adjusting its parameters — is called training. It happens once (or periodically) and costs millions of dollars in compute. The result is a trained model — a massive file of numerical weights.
When you send a message to ChatGPT or Claude, the trained model runs the next-token prediction loop to generate a response. This is called inference. It happens billions of times per day and costs dollars (or fractions of a cent) per request.
Every time you hear "inference" in an AI conversation, mentally substitute "using the model to generate output." Inference latency = how fast you get a response. Inference cost = how much each API call costs. Inference provider = the company running the model's servers. This is the word you'll hear most often in AI product and engineering conversations.
If an LLM has seen enough examples of JSON structures that describe UI components, it can predict what a valid UI component JSON should look like. Feed it a prompt like "generate a card component with a title and two buttons" and it produces token after token of valid JSON — not because it "understands" UI, but because it's seen enough patterns to predict what comes next in that kind of document.
An LLM doesn't know what a button looks like or what a card does. It knows what a button looks like in JSON — the statistical pattern of how buttons are described in text. This is both the power (it can generate any structured format it's seen) and the limitation (it can produce something that looks right in JSON but would be terrible UI).
GitHub Copilot is literally next-token prediction applied to code. When you type function calculateTax(, Copilot predicts the most likely next tokens based on patterns from millions of public repositories. It doesn't "understand" tax law — it's seen enough tax calculation functions to predict the pattern. This is why it's great at boilerplate but stumbles on novel business logic.
Google Search autocomplete works on a similar principle — given "how to make", it predicts "pancakes" or "money" based on frequency patterns. LLMs are this concept taken to an extreme scale.
Midjourney and DALL-E use a variation of this for images: instead of predicting the next token, they predict what pixels should look like given a text description. Different modality, same core idea — pattern prediction at scale.
A context window is the total amount of text a model can "see" at once — both your input and its output combined. It's the single most important constraint in building AI products.
Think of the context window as a desk. Everything the model needs to work with — your instructions, the conversation history, any documents you've provided, AND the response it's generating — all has to fit on this desk. If it doesn't fit, it falls off the edge and the model can't see it.
| Model | Context Window | Roughly Equivalent To |
|---|---|---|
| OpenAI flagship models | Varies by model | Check the live model docs before shipping |
| Claude Sonnet / Opus family | 200K+ tokens, model-dependent | Anthropic documents 200K standard context and newer long-context options |
| Gemini Pro / Flash family | Up to 1M+ tokens, model-dependent | Google publishes current limits in AI Studio / API docs |
| Gemini Nano (on-device) | ~4–32K tokens | ~5–50 pages of text |
Model names, context windows, and prices change quickly. Treat this table as orientation, then verify against live provider docs before using it in a spec.S1S2S3
See that last row? On-device models (Gemini Nano, Apple's local models, Phi-4-mini) have dramatically smaller context windows. A prompt that works with a cloud model might completely fail on-device. Your architecture needs to handle this: shorter schemas, simpler prompts, or a fallback strategy when the on-device model can't handle the request.
Context window size shapes every product decision in an AI system:
Large cloud models: can hold complex schemas, long conversation history, and rich tool definitions, and can generate detailed, multi-component UIs. But: slower, more expensive, requires network.
Small on-device models: fast, private, work offline. But: they can only handle simple prompts and small outputs, need compressed schemas, and are limited to simpler UI generation.
The context window is a shared budget. Every token you spend on instructions is a token you can't use for output. This is why prompt engineering is an optimization problem: say enough for the model to understand the task, but no more. In generative UI systems, a bloated component schema eats into the space available for the actual UI generation.
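A back-of-the-envelope sketch of that budget. The window size and token counts are placeholders; the point is that every input token is subtracted from the space available for output:

```typescript
// The context window is one shared budget: everything you send plus everything
// the model writes back has to fit. Numbers below are illustrative placeholders.
const CONTEXT_WINDOW = 128_000;

function outputBudget(opts: { systemPrompt: number; schema: number; history: number; retrievedDocs: number }): number {
  const inputTokens = opts.systemPrompt + opts.schema + opts.history + opts.retrievedDocs;
  return CONTEXT_WINDOW - inputTokens; // what's left for the model's answer
}

// A bloated 6K-token component schema leaves less room for the UI it is supposed to describe.
console.log(outputBudget({ systemPrompt: 1_500, schema: 6_000, history: 20_000, retrievedDocs: 40_000 })); // 60500
```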
Cursor (AI code editor) lives and dies by context management. A developer's codebase might be millions of lines, but the model can only see a fraction at once. Cursor built an entire retrieval system — indexing your repo, ranking file relevance, and intelligently packing the most useful code into the context window. This "what to include" problem is their core product challenge.
NotebookLM by Google uses Gemini's 1M token window to ingest entire research papers, books, and document collections at once. Before large context windows, this required complex chunking and retrieval (RAG). Now you can just dump 50 PDFs in and ask questions. The product exists because the context window got big enough.
ChatGPT's memory feature is a workaround for context limits. Between conversations, the context window resets. So OpenAI stores a condensed summary of what it learned about you — effectively compressing your history into a few hundred tokens that fit alongside each new conversation.
When the model predicts the next token, it doesn't always pick the most likely one. Temperature controls how adventurous it gets.
Remember from Chapter 2 that the model produces a probability distribution over all possible next tokens. Temperature is a number that modifies these probabilities before the model makes its pick.
Temperature isn't the only knob. Two others matter for your work:
Top-P (nucleus sampling): Instead of considering all 100,000 possible tokens, only consider the smallest set whose combined probability exceeds P. If P=0.9, the model only picks from the top tokens that together account for 90% of the probability. This prevents the model from ever picking wildly unlikely tokens.
Top-K: Even simpler — only consider the K most likely tokens. If K=50, the model picks from the top 50 most probable tokens. The other 99,950 are eliminated entirely.
For generative UI and any system where the model's output must conform to a specific schema, you want: temperature near 0, top-P around 0.9, and top-K around 40. This keeps the model focused on producing valid, predictable output while still allowing some flexibility in how it composes the UI.
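A sketch of how those three knobs reshape a single next-token distribution. The probabilities are invented, and the order of operations (temperature, then top-k, then top-p, then sample) mirrors common implementations but varies by provider:

```typescript
type Dist = Array<{ token: string; p: number }>;

// Temperature reshapes the distribution (p^(1/T), renormalized).
// T near 0 sharpens toward the single most likely token; T > 1 flattens it.
function applyTemperature(dist: Dist, temperature: number): Dist {
  const t = Math.max(temperature, 1e-6);
  const scaled = dist.map(({ token, p }) => ({ token, p: Math.pow(p, 1 / t) }));
  const z = scaled.reduce((s, x) => s + x.p, 0);
  return scaled.map(({ token, p }) => ({ token, p: p / z }));
}

// Top-K keeps only the K most likely tokens. Top-P keeps the smallest set whose
// cumulative probability reaches P. Renormalize before sampling.
function truncate(dist: Dist, topK: number, topP: number): Dist {
  const sorted = [...dist].sort((a, b) => b.p - a.p).slice(0, topK);
  const kept: Dist = [];
  let cum = 0;
  for (const entry of sorted) {
    kept.push(entry);
    cum += entry.p;
    if (cum >= topP) break;
  }
  const z = kept.reduce((s, x) => s + x.p, 0);
  return kept.map(({ token, p }) => ({ token, p: p / z }));
}

function sample(dist: Dist): string {
  let r = Math.random();
  for (const { token, p } of dist) { r -= p; if (r <= 0) return token; }
  return dist[dist.length - 1].token;
}

// Invented distribution for "The cat sat on the ___"
const next: Dist = [
  { token: "mat", p: 0.55 }, { token: "floor", p: 0.25 },
  { token: "sofa", p: 0.15 }, { token: "moon", p: 0.05 },
];

// Near-deterministic settings for schema-constrained output (temp ~0, top-p 0.9, top-k 40):
console.log(sample(truncate(applyTemperature(next, 0.1), 40, 0.9))); // almost always "mat"
```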
GitHub Copilot uses low temperature (~0.1–0.2) for code completions. You want predictable, syntactically correct code — not creative surprises. When Copilot suggests a function body, it should be the most likely correct implementation, not a novel experiment.
ChatGPT's creative writing mode uses higher temperature (~0.7–1.0). When you ask it to write a story, you want variety — the same prompt should produce different stories each time. Low temperature would produce the same story every time, which feels robotic.
Jasper AI (marketing copy tool) lets users adjust a "creativity slider" — which maps directly to temperature. "More creative" = higher temperature for brainstorming taglines. "More precise" = lower temperature for factual product descriptions. They turned a technical parameter into a UX feature.
Standard models respond instantly. Reasoning models pause, think step-by-step, and then answer. They cost 5–10x more, take seconds to start, and beat everything else on the hard stuff. The product question is when that trade is worth it.
Remember from Chapter 2 that LLMs predict one token at a time. A reasoning model does something different: before producing its visible answer, it generates internal "thinking tokens" — a private chain of reasoning that the user may or may not see.
Standard model: a student who blurts out the answer immediately. Reasoning model: a student who pulls out scratch paper, works through the problem step by step, then gives you the final answer. The scratch paper (thinking tokens) takes time and costs money, but for hard problems the answer is dramatically better.
Reasoning models create a fundamentally different interaction pattern: the user waits several seconds or more while the model thinks, so the UI has to show progress instead of an instant answer.
The decision framework is simple: if you wouldn't need scratch paper for this problem, don't use a reasoning model. "What's the weather?" doesn't need reasoning. "Analyze this contract for liability risks across three jurisdictions" does. The product decision is whether to route automatically (like model routing in Chapter 7) or let the user choose.
Cursor uses reasoning models selectively: standard models for autocomplete and quick edits, reasoning models for complex multi-file refactors. The user doesn't choose — the system routes based on task complexity.
ChatGPT shows a collapsible "Thought for X seconds" indicator. Users can expand it to see the chain of thought or collapse it and just read the answer. This progressive disclosure pattern has become the standard.
Claude's Extended Thinking offers four effort levels (low, medium, high, max). Higher effort = more thinking tokens = longer wait = better answers on hard problems. The API exposes this as a parameter, letting product teams tune the tradeoff per feature.
LLMs naturally produce free-flowing text. But generative UI needs valid JSON. Structured output is how we force a creative, probabilistic system to produce machine-readable data.
Without structured output, if you ask an LLM to "generate a card component," you might get:
Sure! Here's a card component for you:
The card has a title "Weather Today" and shows the
current temperature of 72°F with a sunny icon...
That's nice prose, but your UI renderer can't do anything with it. What you need is:
{
"type": "Card",
"children": [
{ "type": "Text", "value": "Weather Today", "style": "headline" },
{ "type": "Row", "children": [
{ "type": "Icon", "name": "sunny" },
{ "type": "Text", "value": "72°F", "style": "display" }
]}
]
}
Structured output constrains the model toward valid JSON that matches your schema. Providers differ in how strict this is, and even schema-valid output can contain wrong values, so still validate before acting.S9 There are three main approaches, and understanding the differences is critical:
Here's what a real function calling setup looks like. This is the exact pattern that generative UI would use:
// You send this to the API alongside your prompt:
{
"tools": [{
"type": "function",
"function": {
"name": "render_ui",
"description": "Generate a UI component tree for the user's request",
"parameters": {
"type": "object",
"properties": {
"root": {
"type": "object",
"properties": {
"type": { "enum": ["Card", "Column", "Row", "List"] },
"children": {
"type": "array",
"items": { "$ref": "#/$defs/Component" }
}
}
}
},
"required": ["root"]
}
}
}]
}
// The model's response is constrained to this schema:
{
"tool_calls": [{
"function": {
"name": "render_ui",
"arguments": {
"root": {
"type": "Card",
"children": [
{ "type": "Text", "value": "Weather", "style": "headline" },
{ "type": "Text", "value": "72°F Sunny", "style": "body" }
]
}
}
}
}]
}
A generative UI protocol is, at its core, a function calling schema for generating UI component trees. The schema defines what components exist (Card, Row, Column, Text, Button, Image...), what properties each has, and how they nest. The renderer — React on web, SwiftUI on iOS, Jetpack Compose on Android — maps these to native components. The model's job is to fill in the values. The tighter your schema, the more reliable the output. The looser, the more creative — but more likely to break.
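A sketch of the renderer half of that contract. It maps the component tree to HTML strings for brevity; a production renderer would map the same tree to React, SwiftUI, or Compose components. The component names and fields are illustrative, not a published schema:

```typescript
// A UI node is whatever the schema allows: a type, optional props, optional children.
type UINode = {
  type: string;              // "Card", "Row", "Text", ... (the model may emit types you don't expect)
  value?: string;
  name?: string;
  style?: string;
  children?: UINode[];
};

// Recursive renderer: walk the tree the model produced and map each node to markup.
// Unknown node types fall back to rendering their children, never to a crash.
function render(node: UINode): string {
  const kids = (node.children ?? []).map(render).join("");
  switch (node.type) {
    case "Card":   return `<section class="card">${kids}</section>`;
    case "Column": return `<div class="col">${kids}</div>`;
    case "Row":    return `<div class="row">${kids}</div>`;
    case "Text":   return `<span class="${node.style ?? "body"}">${node.value ?? ""}</span>`;
    case "Icon":   return `<i data-icon="${node.name ?? ""}"></i>`;
    default:       return kids; // graceful fallback for anything this schema version doesn't know
  }
}

const weatherCard: UINode = {
  type: "Card",
  children: [
    { type: "Text", value: "Weather Today", style: "headline" },
    { type: "Row", children: [{ type: "Icon", name: "sunny" }, { type: "Text", value: "72°F" }] },
  ],
};
console.log(render(weatherCard));
```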
This is where design instinct becomes a superpower. Schema design is UX design for machines:
{"type": "Card", "variant": "elevated"|"filled"|"outlined"}
✅ Always valid
✅ Predictable rendering
❌ Limited expressiveness
❌ Model can't improvise
{"type": "string", "style": "object"}
✅ Creative flexibility
✅ Can handle novel requests
❌ Might generate invalid UIs
❌ Harder to render reliably
Shopify's Sidekick uses function calling to let merchants manage their store via natural language. "Give me a 20% discount on winter jackets" triggers a structured tool call with exact parameters: { action: "create_discount", collection: "winter-jackets", percentage: 20 }. Free-text output would be useless — Shopify's backend needs machine-readable instructions.
Zapier's AI Actions connects ChatGPT to 6,000+ apps using structured output. When you say "add this to my Notion database," the model generates a structured API call that Zapier can execute. The schema for each integration is pre-defined — the model fills in the values.
Vercel's v0 generates React code from natural language descriptions. Under the hood, it uses structured output to produce a specific code format with metadata (component name, imports, props). The output isn't "creative writing that happens to be code" — it's schema-constrained generation optimized for parseability and rendering.
Not all models are created equal. Choosing which model to use for which task is one of the most impactful product decisions you'll make.
Every major AI provider offers a family of models at different capability/cost/speed tradeoffs. Think of it like cars: you don't drive an 18-wheeler to get groceries, and you don't use a Smart car to haul lumber.
Sophisticated AI products don't use a single model — they route requests to different models based on complexity. This is called model routing or cascading.
Model selection isn't a one-time decision — it's a runtime decision made for every request. A production AI product needs a routing layer: simple tasks go to a small model (Haiku, Flash, on-device), standard tasks go to a mid-tier model, and complex tasks go to a frontier model. Designing this routing logic is a core product and architecture decision.
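A sketch of what that routing layer can look like. The tier names, thresholds, and keyword heuristic are all stand-ins; production routers typically use a small classifier model rather than regexes:

```typescript
// Minimal routing sketch: classify the request, pick a model tier.
type Tier = "on-device" | "fast-cloud" | "frontier";

interface Request { text: string; needsTools: boolean; contextTokens: number; }

function route(req: Request): Tier {
  const hardSignals = /refactor|analyze|compare|multi-step|contract|architecture/i.test(req.text);
  if (!req.needsTools && req.contextTokens < 2_000 && !hardSignals) return "on-device"; // cheap, private, instant
  if (hardSignals || req.contextTokens > 50_000) return "frontier";                      // pay for quality
  return "fast-cloud";                                                                    // the default tier
}

console.log(route({ text: "Summarize this note", needsTools: false, contextTokens: 800 }));                          // on-device
console.log(route({ text: "Analyze this contract for liability risks", needsTools: true, contextTokens: 12_000 })); // frontier
```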
Perplexity routes queries across multiple models. Simple factual lookups go to a fast, cheap model. Deep research queries go to a frontier model. They built a classifier that evaluates query complexity in <50ms and routes accordingly — cutting their average cost per query by ~60% while maintaining quality where it matters.
Notion AI uses different models for different features: a lightweight model for autocomplete suggestions (speed matters most), a mid-tier model for summarization (balance of speed and quality), and a frontier model for complex writing tasks (quality matters most). Each feature has its own model selection, not one model for everything.
Samsung Galaxy AI on the S24/S25 series does exactly the on-device/cloud routing described here. Simple tasks (text summarization, live translate) run on-device via a smaller model. Complex tasks (generative edit in photos, chat assist) go to cloud. The user doesn't know or care which model is running — they just see the result.
Modern models read text, see images, hear audio, and watch video. The input box is no longer a box. The design problem is figuring out which modalities your product actually needs and which are demo candy.
Every chapter so far has been implicitly text-centric. But as of 2026, every frontier model is natively multimodal — processing text, images, audio, and sometimes video within a single inference call.
Multimodal AI enables new input patterns that were impossible with text-only models.
Multimodal doesn't just add input types — it changes the fundamental interaction model. Text-only AI is "describe your problem." Multimodal AI is "show me your problem." This is a massive reduction in friction for users who struggle to articulate complex visual or spatial information in words.
Google Lens evolved from a standalone visual search tool into Gemini's eyes. Circle to Search on Pixel/Samsung lets you highlight anything on screen and ask questions about it — multimodal inference running on what you see.
Be My Eyes (accessibility app) uses GPT-4o's vision to describe the world to blind users in real-time. A user points their phone camera and the model narrates what it sees. This was impossible before multimodal.
NotebookLM ingests entire PDFs, slides, and images as visual tokens. You can ask "what's the chart on page 7 showing?" and it answers based on the actual visual layout, not just extracted text.
An LLM by itself can only generate text. Function calling is how we give it hands — the ability to actually do things in the real world: check calendars, send messages, query databases, and generate UIs.
Imagine you hire a brilliant consultant who knows everything about everything — but they're locked in a room with no phone, no computer, and no internet. They can give you amazing advice, but they can't actually do anything. That's an LLM without function calling.
Function calling gives the consultant a phone. You tell them: "Here are the apps on this phone and what each one does." When they need to check something or take an action, they tell you which app to use and what to type in. You execute it, show them the result, and they continue their work.
The function calling lifecycle has exactly four steps: (1) you send the prompt along with definitions of the tools that exist, (2) the model responds with a request to call a specific tool with specific arguments, (3) your code executes the function and sends the result back, and (4) the model uses that result to continue, either with another tool call or with a final answer. Every agentic system — including generative UI — follows this pattern.
The model never actually executes functions. It generates a request to call a function. Your application code runs the function and feeds the result back. This is important for security (the model can't directly access your APIs without your code mediating) and for control (you can validate, log, rate-limit, or reject tool calls before executing them).
The concept is identical across providers, but the API syntax differs slightly:
// Defining tools
tools: [{
type: "function",
function: {
name: "get_weather",
description: "Get current weather for a city",
parameters: {
type: "object",
properties: {
city: { type: "string", description: "City name" }
},
required: ["city"]
}
}
}]
// Model response when it wants to call a tool:
{
"choices": [{
"message": {
"tool_calls": [{
"id": "call_abc123",
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"San Jose\"}"
}
}]
}
}]
}
// Defining tools
tools: [{
name: "get_weather",
description: "Get current weather for a city",
input_schema: {
type: "object",
properties: {
city: { type: "string", description: "City name" }
},
required: ["city"]
}
}]
// Model response when it wants to call a tool:
{
"content": [{
"type": "tool_use",
"id": "toolu_abc123",
"name": "get_weather",
"input": { "city": "San Jose" }
}]
}
// Defining tools
tools: [{
function_declarations: [{
name: "get_weather",
description: "Get current weather for a city",
parameters: {
type: "object",
properties: {
city: { type: "string", description: "City name" }
},
required: ["city"]
}
}]
}]
// Model response when it wants to call a tool:
{
"candidates": [{
"content": {
"parts": [{
"functionCall": {
"name": "get_weather",
"args": { "city": "San Jose" }
}
}]
}
}]
}
Notice the pattern: the schema definition is nearly identical (JSON Schema), but each provider wraps it differently. The model's response always contains: which function to call, and what arguments to pass. Your code handles the rest.
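Here is a sketch of the application-side handling for the OpenAI-style response shape shown above. The get_weather stub and the exact return shape are illustrative; the point is validate, execute, and feed the result back:

```typescript
// Application-side handling of a single tool call: validate, execute, feed back.
// The model only *requested* the call; this code decides whether it actually runs.
type ToolCall = { id: string; function: { name: string; arguments: string } };

const toolImplementations: Record<string, (args: any) => Promise<unknown>> = {
  get_weather: async ({ city }: { city: string }) => ({ city, tempF: 72, condition: "sunny" }), // stubbed
};

async function executeToolCall(call: ToolCall): Promise<{ tool_call_id: string; content: string }> {
  const impl = toolImplementations[call.function.name];
  if (!impl) {
    // Hallucinated tool: tell the model rather than crashing.
    return { tool_call_id: call.id, content: JSON.stringify({ error: `unknown tool ${call.function.name}` }) };
  }
  let args: unknown;
  try {
    args = JSON.parse(call.function.arguments); // OpenAI-style arguments arrive as a JSON string
  } catch {
    return { tool_call_id: call.id, content: JSON.stringify({ error: "arguments were not valid JSON" }) };
  }
  // Real code would also validate `args` against the tool's JSON Schema here,
  // and gate irreversible actions behind a user confirmation.
  const result = await impl(args);
  return { tool_call_id: call.id, content: JSON.stringify(result) };
}
```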
In generative UI systems, the "tools" aren't weather APIs — they're app capabilities. A fitness app might expose tools like log_workout, get_weekly_stats, set_goal. The agent calls these tools, gets the data back, and then generates a UI component tree to display the results. Generative UI is function calling where the final output is a rendered interface instead of text.
The model chooses which tool to call based entirely on the description field and parameter descriptions. Bad descriptions lead to wrong tool selection. This is a product/design decision:
"description": "Weather function"
Model doesn't know when to use it, might confuse it with a climate function or a forecast function.
"description": "Get the current temperature and conditions for a specific city. Returns temp in °F, condition (sunny/cloudy/rainy), and humidity percentage."
Model knows exactly what it gets back and when to use it.
ChatGPT Plugins (now GPT Actions) was one of the first mass-market implementations of function calling. When you ask ChatGPT to "find flights to Tokyo," it calls the Kayak plugin's search_flights function with structured parameters. Thousands of businesses built plugins — each one is just a function calling schema that lets GPT interact with their service.
Siri and Alexa were doing a primitive version of function calling before LLMs. "Set a timer for 5 minutes" maps to an intent (set_timer) with a slot (duration: 5min). The difference with LLM-based function calling is flexibility: you don't need to pre-define every possible phrasing. The model figures out the intent and extracts the parameters from any natural language input.
Anthropic's Claude introduced "computer use" tool calls — the model can call functions like click(x, y), type(text), and screenshot() to operate a desktop computer. Same function calling pattern, radically different tools. This is where agents start interacting with the physical world.
LLMs only know what was in their training data. RAG (Retrieval Augmented Generation) connects them to external knowledge at query time: your documents, your database, your company wiki.
Imagine you're taking an exam. A standard LLM takes it closed-book, answering from memory. RAG takes it open-book: before answering, it searches a library, pulls out relevant pages, reads them, and answers using both memory and the retrieved material. Most production AI assistants are open-book exams. The interesting work is in how you build and search the library.
RAG is usually drawn as four boxes. That hides where it actually breaks. Real systems have eight stages, split between work you do once at build time and work you do on every query.
The boringest stage in the pipeline is the one that decides whether RAG works. A chunk too big returns noisy passages with the answer buried inside. A chunk too small loses the surrounding context the model needs to interpret it. There's no universal right answer — different content shapes want different strategies.
Fixed-size chunking (sketched in code below): split every N tokens (typically 200–800), with overlap. Fast to build, predictable. Cuts mid-sentence, mid-table, mid-thought.
Best for: uniform prose like blog posts, marketing copy, news.
Structure-aware chunking: split on meaningful boundaries (paragraphs, headings, sections). Preserves the author's structure. Slower to build, harder to tune.
Best for: docs with strong structure — manuals, contracts, policy documents.
Small-to-large (parent document) retrieval: index small chunks for retrieval but return larger parents at generation time. Best of both: precise hits, full context.
Best for: long documents where the answer needs surrounding context — research papers, legal filings.
Code- and table-aware chunking: chunk by function, class, or table — never split a unit of code. Often paired with AST parsing.
Best for: codebases, API docs, structured data.
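The fixed-size strategy is small enough to sketch in full. Words stand in for tokens here; a real pipeline would count tokens with the model's own tokenizer:

```typescript
// Fixed-size chunking with overlap. Words stand in for tokens;
// swap in a real tokenizer's counts for production use.
function chunkFixed(text: string, chunkSize = 400, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached
  }
  return chunks;
}

// Overlap means the sentence that ends one chunk also starts the next,
// so a fact straddling a boundary is still retrievable from at least one chunk.
```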
Vector search alone is the 2023 baseline. The 2026 production stack adds three things:
Hybrid search combines dense vectors (good for meaning) with sparse keyword search like BM25 (good for exact matches: error codes, product SKUs, legal citations). Vectors miss "ERR_4032"; BM25 nails it. Run both, merge the results (a fusion sketch follows after this list). This single change usually beats any amount of tuning to vectors alone.
Reranking takes the top 20–100 results from the cheap retriever and re-scores them with a slower, more accurate cross-encoder model (Cohere Rerank, Voyage Rerank, or a fine-tuned encoder). Cross-encoders look at the query and document together, so they catch nuance that bi-encoder vectors miss. Typical lift: 10–30% on retrieval quality for the cost of one extra model call.
Query rewriting handles the gap between how users phrase questions and how documents phrase answers. HyDE (Hypothetical Document Embeddings) is the well-known move: ask the LLM to draft a hypothetical answer first, then embed and search using that. The drafted answer often shares more vocabulary with real documents than the original question did. For multi-turn chats, query rewriting also rolls earlier turns into a self-contained search query so retrieval doesn't lose context.
Knowledge graphs deserve a mention. When relationships matter more than passages — "who reports to whom," "what's connected to this incident" — a graph beats vectors. Most teams won't need this; the ones that do, know.
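One common way to merge the vector and keyword lists from hybrid search is reciprocal rank fusion: each list contributes a score based only on rank, so you never have to reconcile cosine similarities with BM25 scores. A sketch, with k=60 as the conventional constant and made-up document IDs:

```typescript
// Reciprocal rank fusion: merge two ranked lists of document IDs by rank position.
// score(doc) = sum over lists of 1 / (k + rank_in_list); k=60 is the usual constant.
function fuseRanks(vectorHits: string[], keywordHits: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const hits of [vectorHits, keywordHits]) {
    hits.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([docId]) => docId);
}

// The vector search understood the meaning; BM25 nailed the exact error code.
console.log(fuseRanks(
  ["doc_billing_faq", "doc_refund_policy", "doc_err_4032"],
  ["doc_err_4032", "doc_release_notes"],
)); // doc_err_4032 ranks first: it appears in both lists
```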
By 2026 these are three real options, not one. Picking the wrong one costs months. The shape of the choice:
| | RAG | Fine-tuning | Long-context (stuff it all in) |
|---|---|---|---|
| Solves | Knowledge the model lacks | Behavior, tone, format the prompt can't get right | Single large document at a time |
| Freshness | As fresh as your indexer | Frozen at training time | Whatever you paste in |
| Cost shape | Per-query retrieval cost; cheap to update | Up-front training cost; cheap inference | High per-query token cost |
| Fails when | Retrieval misses the right chunk | Use case shifts; data drifts | Corpus is too big or context is too noisy |
| Reach for it when | Corpus changes often or is large | Output style or domain language won't budge with prompting | Corpus is small, stable, and fits in 200K–1M tokens |
The 2026 default order: prompt engineering first, then long-context if the corpus fits, then RAG when it doesn't, then fine-tuning only if behavior is still off. Many teams skip straight to fine-tuning because it sounds sophisticated. Most regret it.
RAG creates UX problems pure chatbots don't have, and the design work is what separates trusted products from suspicious ones.
RAG quality is bottlenecked by retrieval, not generation. A frontier model with the wrong context produces a fluent wrong answer. A weaker model with the right context produces a useful right answer. When RAG feels broken, the fix is almost always upstream of the LLM: better chunks, hybrid search, a reranker.
Perplexity made citations the entire interface. Every claim is numbered and linked back to a source. Users trust it for research because they can verify, not because the model is special.
NotebookLM scopes RAG to documents you upload, never general training data. "Based on your sources" appears on every response. That scope clarity is the trust signal.
Cursor runs RAG over your codebase: it embeds your repo, retrieves relevant files for each request, and adds them to the context. The "intelligence" people praise is mostly retrieval quality, not the underlying model.
Glean and Elastic built enterprise search on RAG. The hard problem isn't retrieval — it's enforcing per-document access permissions so employees only see what they're allowed to see.
Cohere Rerank and Voyage Rerank dominate the reranker market. They're a single API call you bolt onto an existing vector search and they routinely deliver double-digit recall improvements. Most enterprise RAG stacks use one or the other.
A single function call is useful. But real agents call multiple functions in sequence, make decisions based on results, and adapt when things go wrong. This is the agentic loop.
The difference between a chatbot and an agent is simple: a chatbot responds once. An agent keeps going until the task is done.
Let's trace through a realistic generative UI agent scenario: the user says "Schedule dinner with Alex this Friday at a good restaurant near home."
check_calendar({ person: "Alex", date: "2026-04-10" }) → { free: true, available: ["6pm-9pm"] }
search_restaurants({ near: "home", cuisine: "any", rating: ">4.0" }) → [{ name: "Osteria", rating: 4.5 }, { name: "Sushi Gen", rating: 4.3 }, ...]
check_reservation({ restaurant: "Osteria", date: "2026-04-10", time: "7pm", party: 2 }) → { available: false, next_available: "8pm" }
check_reservation({ restaurant: "Osteria", date: "2026-04-10", time: "8pm", party: 2 }) → { available: true }
Four tool calls, each building on the last. The model maintained context across all of them, made decisions based on intermediate results, and adapted when the first time slot wasn't available. That's an agent.
Each iteration of the loop is a separate API call. The entire conversation history — including all previous tool calls and results — gets sent back to the model each time. This is why context windows matter so much: a complex 10-step agent task might consume thousands of tokens just in history before the model even starts thinking about the next step.
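The loop itself is compact enough to sketch. callModel and executeTool are placeholders for your provider SDK call and your own validated tool dispatch; everything else is the real shape of the loop:

```typescript
// The agentic loop: call the model, run whatever tools it requests, feed the
// results back, repeat until it answers in plain text or hits the iteration cap.
type ToolCall = { id: string; name: string; args: unknown };
type Message = { role: "user" | "assistant" | "tool"; content: string; toolCalls?: ToolCall[] };

async function runAgent(
  userRequest: string,
  callModel: (history: Message[]) => Promise<Message>, // wraps your provider's API
  executeTool: (call: ToolCall) => Promise<string>,    // your validated tool dispatch
  maxIterations = 10,                                  // agents must not loop forever
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userRequest }];
  for (let i = 0; i < maxIterations; i++) {
    const reply = await callModel(history);            // the full history goes back every time
    history.push(reply);
    if (!reply.toolCalls?.length) return reply.content; // no tool call: the task is done
    for (const call of reply.toolCalls) {
      history.push({ role: "tool", content: await executeTool(call) }); // observation for the next turn
    }
  }
  return "Stopped after the maximum number of steps without finishing the task.";
}
```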
Claude Code (Anthropic's coding agent) is a textbook agentic loop. You say "refactor this module to use dependency injection." It thinks ("I need to read the file first"), acts (reads the file), observes (sees the current structure), thinks again ("I see 3 classes that need interfaces"), acts (edits file 1), observes (checks for errors), and loops until all files are updated and tests pass. A single user request can trigger 20+ iterations of the loop.
Devin (the AI software engineer by Cognition) chains together even longer loops: reading GitHub issues → planning an implementation → writing code → running tests → debugging failures → committing. Each step feeds into the next. When tests fail, it doesn't just stop — it reads the error, reasons about the cause, and tries a fix. Some tasks run 50+ loop iterations.
Google's Deep Research (in Gemini) uses extended agentic loops for research. It searches the web, reads articles, identifies gaps in its knowledge, searches again with refined queries, synthesizes findings, and produces a report. One research question can trigger dozens of search-read-think cycles over several minutes.
Function calling lets a model use tools. But who decides which tools exist and how to connect to them? That's the problem MCP solves.
MCP (Model Context Protocol) is an open standard created by Anthropic that standardizes how AI models discover and use tools across any application. Think of it as USB for AI.S4
Before USB: Every device had its own cable. Your printer had a parallel port cable. Your mouse had a PS/2 connector. Your camera had a proprietary cable. If you wanted to connect a new device, you needed to find the right cable and install a custom driver.
After USB: One port, one standard. Plug anything in and it works. The computer asks "what are you?" and the device says "I'm a keyboard" or "I'm a camera" and they negotiate automatically.
MCP is USB for AI. Instead of every app building custom integrations with every AI model, MCP provides one standard protocol. An AI agent asks "what can you do?" and the app says "here are my functions." The agent can immediately use them.
An MCP server provides three things to the AI model: tools (functions the agent can call, such as create_event or search_files), resources (data the agent can read), and prompts (reusable prompt templates).
Here's what a simple MCP server for a fitness app looks like:
// MCP Server: Fitness Tracker
{
"name": "fitness-tracker",
"version": "1.0",
"tools": [
{
"name": "log_workout",
"description": "Record a completed workout session",
"inputSchema": {
"type": "object",
"properties": {
"exercise": { "type": "string", "description": "e.g. 'bench press'" },
"sets": { "type": "number" },
"reps": { "type": "number" },
"weight_lbs": { "type": "number" }
},
"required": ["exercise", "sets", "reps"]
}
},
{
"name": "get_weekly_summary",
"description": "Get workout stats for the current week",
"inputSchema": {
"type": "object",
"properties": {
"week_offset": {
"type": "number",
"description": "0 = this week, -1 = last week"
}
}
}
},
{
"name": "set_goal",
"description": "Set a fitness goal for a specific exercise",
"inputSchema": {
"type": "object",
"properties": {
"exercise": { "type": "string" },
"target_weight": { "type": "number" },
"target_date": { "type": "string", "format": "date" }
},
"required": ["exercise", "target_weight"]
}
}
]
}
MCP defines what an app can DO. A generative UI protocol defines what the result LOOKS LIKE.
An app exposes its capabilities via MCP ("I can log workouts, show summaries, set goals"). When an agent calls those tools, generative UI renders the results as native components ("here's a card showing your weekly summary with a progress bar toward your goal").
Together, these standards mean: every app becomes agent-accessible with native, beautiful UI — without the app developer building a custom AI integration. That's the platform pattern emerging across the industry.
Anthropic launched MCP in late 2024 and adoption has been rapid. As of early 2026, there are MCP servers for Slack, GitHub, Google Drive, Notion, Linear, Jira, Figma, Postgres databases, and hundreds more. Claude Desktop, Cursor, Windsurf, and other AI tools can connect to any MCP server — one protocol, instant integration.
Block (Square) and Apollo were early enterprise adopters, building internal MCP servers so their AI tools could interact with proprietary systems. Instead of building custom ChatGPT plugins AND custom Claude integrations AND custom Gemini integrations, they build one MCP server and it works everywhere.
Figma's MCP server lets AI agents read design files, inspect components, and even generate code from designs — all through standard MCP tool calls. This is the "USB for AI" vision in action: Figma implements MCP once, and every AI tool that speaks MCP can now interact with Figma designs.
Google Stitch shipped an MCP server in 2026, letting external AI agents interact with Stitch design projects programmatically. This shows how quickly MCP is becoming the default integration layer — even AI design tools are adopting it.
There's more than one way to build an agent. The orchestration pattern you choose shapes everything: reliability, speed, cost, and user experience.
ReAct (reason + act): this is the pattern from Chapter 8 — the model alternates between thinking and acting. It's the most common and most flexible pattern.
Parallel tool calls: when multiple independent tools need to be called, a smart agent calls them all at once instead of sequentially:
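A sketch of the difference, with made-up tool names; the only change from the sequential version is wrapping the independent calls in Promise.all:

```typescript
// Sequential: three independent lookups take the sum of their latencies.
// Parallel: they take roughly the slowest one. Tool names are illustrative.
async function gatherTripContext(city: string, date: string) {
  const [weather, calendar, restaurants] = await Promise.all([
    callTool("get_weather", { city, date }),
    callTool("check_calendar", { date }),
    callTool("search_restaurants", { near: city, rating: ">4.0" }),
  ]);
  return { weather, calendar, restaurants }; // hand all three results back to the model in one turn
}

// Stand-in for the real tool dispatch from the function calling chapter.
async function callTool(name: string, args: Record<string, unknown>): Promise<unknown> {
  return { tool: name, args, ok: true };
}
```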
Routing: a lightweight model classifies the request and routes it to specialized handlers.
In production, most agents use a combination of these patterns. The router picks the right model, that model uses ReAct for complex tasks with parallel tool calls where possible. Designing this orchestration logic — deciding which pattern for which scenario — is a core product decision that shapes the user experience.
Uber's customer support AI uses a router pattern: a fast classifier determines if the query is about a ride issue, a payment issue, or an Eats issue, then routes to a specialized agent for each domain. Each specialized agent has its own tool set and system prompt optimized for that domain. This is cheaper and more accurate than one monolithic agent handling everything.
LangChain and LlamaIndex popularized orchestration frameworks that make these patterns composable. LangChain's "agent executor" implements the ReAct loop. Their "sequential chain" implements linear pipelines. Their "router chain" implements the routing pattern. These frameworks exist because orchestration is hard enough to warrant dedicated tooling.
OpenAI's Assistants API handles orchestration server-side — you define tools and the API manages the think-act-observe loop for you, calling your functions and feeding results back automatically. This is a bet that most developers don't want to build their own orchestration layer — they just want to define tools and let the platform handle the rest.
The primer has covered how agents work mechanically. This chapter covers what happens when you ship them to real users — the UX patterns, trust frameworks, and protocols that make agents usable.
57% of organizations now have agents in production. But "production" doesn't mean "autonomous." The biggest lesson from 2025-2026: users want agents that are powerful but controllable. The UX challenge is designing the right level of autonomy for each context.
A new category: agents that literally see and control screens. Claude Computer Use operates a full macOS desktop. OpenAI's Operator controls a remote browser. Google's Project Mariner works inside Chrome. These agents take screenshots, click buttons, type text, and navigate apps just like a human would.
The UX challenge is unique: the user watches their screen being controlled by an AI. This requires real-time observation (screen sharing), permission gates before sensitive actions, and a kill switch to stop the agent immediately.
A multi-step agent fails in ways a single API call doesn't. It picked the wrong tool. It called the right tool with bad arguments. It looped. It silently degraded after a model upgrade. The only way to debug any of this is a trace — a structured log of every model call, every tool call, every input, every output, in order. By 2026 this is standard infrastructure: each step gets a span, the trace tree shows the full reasoning path, and you can replay a failing run in isolation. LangSmith, Braintrust, and Langfuse are common platforms; OpenTelemetry's GenAI semantic conventions define emerging shared fields for model calls, tool calls, token usage, latency, and errors.S7 The headline rule: if you can't replay a bad run with the exact same inputs, you can't fix it. Build trace capture before you build the second tool.
Intercom's Fin is one of the most successful customer service agents in production. It resolves 50%+ of support tickets autonomously but escalates to human agents for complex cases, a textbook confidence-based escalation pattern.
Replit Agent builds entire applications from natural language. It shows its plan (intent preview), executes steps one at a time (audit trail), and asks for approval before deploying (autonomy gate). Users can see every file it creates and modify any step.
A2A (Agent-to-Agent) is the emerging open protocol for agents to delegate work to other agents — one agent can hand off subtasks to specialized peers without bespoke integration code. Alongside MCP (agent-to-tools) and AG-UI (agent-to-frontend), these three protocols form the infrastructure layer for multi-agent systems.
Agents fail. APIs time out, models hallucinate, tool calls return unexpected data. How you handle failure defines the user experience.
In a traditional app, errors are predictable: network error, invalid input, server down. In an agentic system, you get entirely new failure modes:
| Failure Type | Example | How to Handle |
|---|---|---|
| Wrong tool selection | Agent calls send_email when user wanted send_message | Confirmation step before executing irreversible actions |
| Invalid arguments | Agent passes "date": "next Friday" instead of "2026-04-10" | Validate arguments against schema before executing; ask model to retry with correct format |
| Tool execution failure | Restaurant API is down | Return structured error to model; let it try alternatives or inform user |
| Hallucinated tool | Agent tries to call book_flight but no such tool exists | Validate tool name before execution; return "tool not found" to model |
| Infinite loop | Agent keeps retrying a failed action | Set max iteration count (e.g., 5 loops max); break and inform user |
| Schema violation in output | Generative UI output has invalid component nesting | Validate against schema; show fallback UI; log for monitoring |
A well-designed agentic UI should degrade gracefully, stepping down through simpler and simpler fallbacks instead of failing outright:
Your generative UI schema should include first-class error and loading states. A component that says "state": "loading" renders a skeleton screen. "state": "partial" renders available data with placeholders. "state": "error" renders a retry card. These aren't afterthoughts — they're the most important states to design because they're what users see when things go wrong (which is often).
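One way those states can look in a schema, sketched as a discriminated union. The field names are illustrative, not a published protocol:

```typescript
// Loading, partial, and error are schema states, not exceptions.
// The renderer always has something sensible to draw.
type ComponentState =
  | { state: "ready"; component: { type: string; children?: unknown[] } }
  | { state: "loading"; skeleton: "card" | "list" | "row" }                 // render a skeleton screen
  | { state: "partial"; component: { type: string }; missing: string[] }    // show what arrived, placeholder the rest
  | { state: "error"; message: string; retryable: boolean };                // render a retry card

function renderFallback(node: ComponentState): string {
  switch (node.state) {
    case "ready":   return `render ${node.component.type}`;
    case "loading": return `skeleton:${node.skeleton}`;
    case "partial": return `render ${node.component.type} with placeholders for ${node.missing.join(", ")}`;
    case "error":   return node.retryable ? "retry card" : "static error card";
  }
}
```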
ChatGPT's browsing feature frequently hits websites that block it. Instead of crashing, it tells the user "I wasn't able to access that site" and offers to try alternative sources. This conversational fallback pattern — admit the failure, explain why, offer alternatives — is the baseline every agentic product should hit.
Tesla Autopilot is the hardware analogy for graceful degradation. Full self-driving → lane keeping → adaptive cruise → manual control. Each level is a fallback when the one above it can't handle the situation. It never just stops working — it degrades to a less capable but still functional mode and alerts the driver.
Alexa's confidence thresholds show a different approach: when the model's confidence in its interpretation is below a threshold, it asks for confirmation instead of acting. "Did you mean turn off the bedroom lights?" This is cheaper and safer than executing a wrong action and having to undo it. For generative UI, a confirmation card before irreversible actions follows the same principle.
Notion AI handles hallucination risk by always including an "AI-generated" badge on its outputs and providing the source material alongside the summary. This UI-level pattern — flagging uncertainty visually — is something generative UI should consider as a first-class component state.
Guardrails are the protective systems that prevent AI from generating harmful content, leaking private data, or acting beyond its intended scope. In 2026, they're also a regulatory requirement.
An LLM without guardrails will attempt anything you ask. Guardrails constrain it — blocking harmful content, filtering personal data, preventing jailbreaks, and keeping the AI focused on its intended task. Think of them as the brakes on a very powerful car.
When a guardrail triggers, the user sees... something. What they see is a design decision that directly affects trust:
Four principles for trustworthy AI design: Transparency (users know they're interacting with AI), Proportionality (restrictions match the risk level), Reversibility (actions can be undone), and Contestability (users can challenge AI decisions). These aren't just good design — they're increasingly legal requirements.
A model scoring 90% on a benchmark might still frustrate real users. Evaluation is how you close the gap between measured performance and experienced quality.
Quality is the #1 barrier to production AI — cited by 32% of teams as their top challenge. "Running evals" is the AI equivalent of usability testing: you systematically check whether the system works for real scenarios, track scores over time, and use the data to decide what to ship.
This is the hardest and most important step. Writing a grading rubric is the same skill UX researchers use when creating annotation guides for usability studies — and it has the same failure mode: a vague rubric produces noisy, irreproducible scores no matter how good the grader is.
// Example rubric for a customer support bot
{
"accuracy": {
"5": "Correct answer with all relevant details",
"3": "Partially correct, some wrong info",
"1": "Completely wrong or hallucinated"
},
"helpfulness": {
"5": "Fully resolved the user's issue",
"3": "Some useful info but didn't resolve",
"1": "Useless or made things worse"
}
}
Using a strong model to grade a weaker model's output is now standard. It scales. It's cheap relative to humans. And it's full of biases that quietly invalidate your scores if you're not careful.
Two practical defaults: require chain-of-thought from the judge ("explain your reasoning before giving a score") — it forces the judge to actually engage with the rubric instead of pattern-matching. And calibrate against humans periodically: have humans grade 50–100 examples, compare to the judge, and treat low-agreement criteria as untrustworthy until you fix the rubric.
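A sketch of an LLM-as-judge call built on the rubric above. callModel is a placeholder for whichever provider SDK you use, and the JSON output shape is our own convention, not a standard:

```typescript
// Build the grading prompt for an LLM judge: rubric plus a chain-of-thought requirement.
const RUBRIC = {
  accuracy:    { 5: "Correct answer with all relevant details", 3: "Partially correct, some wrong info", 1: "Completely wrong or hallucinated" },
  helpfulness: { 5: "Fully resolved the user's issue",          3: "Some useful info but didn't resolve", 1: "Useless or made things worse" },
};

interface Judgment { reasoning: string; accuracy: number; helpfulness: number; }

function buildJudgePrompt(userQuery: string, botAnswer: string): string {
  return [
    "You are grading a customer support bot's answer.",
    `Rubric: ${JSON.stringify(RUBRIC)}`,
    `User query: ${userQuery}`,
    `Bot answer: ${botAnswer}`,
    // Reasoning first: forces the judge to engage with the rubric before scoring.
    'Respond with a single JSON object: {"reasoning": "...", "accuracy": 1-5, "helpfulness": 1-5}. Write your full reasoning in the reasoning field before the scores.',
  ].join("\n\n");
}

async function judge(callModel: (prompt: string) => Promise<string>, query: string, answer: string): Promise<Judgment> {
  const raw = await callModel(buildJudgePrompt(query, answer)); // callModel = your provider SDK
  return JSON.parse(raw) as Judgment;
}
```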
Every model upgrade — Sonnet 4.5 to 4.6 to 4.7, GPT-4o to GPT-5, Gemini 2.5 to whatever's next — is a behavior change in production. Sometimes it's a big upgrade. Sometimes a regression on the queries that matter most to you. The only way to know is to keep a frozen golden set and re-run it on every new model.
The minimum viable version: ~200 examples that represent your real query distribution (sampled from production logs, scrubbed), with expected outputs or rubric scores. On every model upgrade or prompt change, re-run the set, diff the per-example scores against the previous run, and flag regressions before they ship. Most eval platforms (Braintrust, LangSmith, Langfuse) make this a one-click operation.
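The diff step is small enough to sketch end to end. The example IDs and scores are invented; the scores are whatever your grader produced per golden-set example:

```typescript
// Compare the new run against the previous run, per example, and flag regressions.
// Scores are whatever your grader emits (rubric points, pass/fail as 1/0, etc.).
type Run = Record<string, number>; // exampleId -> score

function findRegressions(previous: Run, current: Run, tolerance = 0.5): string[] {
  return Object.keys(previous).filter(id => (current[id] ?? 0) < previous[id] - tolerance);
}

const lastRelease: Run     = { "refund-policy-q1": 5, "billing-err-4032": 4, "cancel-subscription": 5 };
const candidateModel: Run  = { "refund-policy-q1": 5, "billing-err-4032": 2, "cancel-subscription": 5 };

// "billing-err-4032" regressed from 4 to 2: block the upgrade until you know why.
console.log(findRegressions(lastRelease, candidateModel)); // ["billing-err-4032"]
```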
Two non-obvious things:
Offline evals tell you the model is correct. Online evals tell you the product works. They measure different things and you need both.
Drop the "types of evaluation" framing — it's the wrong axis. The right axis is: what dimension of quality are you measuring, and which method gives you the cheapest reliable signal on it?
| Dimension | What it answers | Cheapest reliable method |
|---|---|---|
| Correctness | Did it produce the right answer? | Automated checks against labeled examples |
| Helpfulness | Did it actually solve the user's problem? | LLM-judge with a rubric, audited against humans |
| Safety | Did it avoid harmful, off-policy, or sensitive output? | Automated guardrails + adversarial test set |
| Latency & cost | Is it fast and affordable enough? | Production telemetry (TTFT, p50/p95, $/task) |
| Real-world impact | Are users better off because of it? | Online A/B tests on outcome metrics |
Benchmark scores measure the model. Custom evals measure your product. A model that scores 95% on MMLU can produce a terrible support bot if the 5% failures land on the queries your users care about most. Build evals on the queries you actually see in production.
Braintrust, LangSmith, and Langfuse dominate the eval-platform market. They handle test-set runs, grading, regression tracking, and online tracing in one place. Most production AI teams pick one and never look back.
OpenAI, Anthropic, and Google all run frozen internal eval suites against every model release. The public benchmarks (MMLU, SWE-bench, HumanEval) are a small fraction of what they actually measure. The real evals are private and use-case specific — exactly the kind you should be building.
METR's 2025 study found experienced developers using AI coding tools were 19% slower, despite believing they were 20% faster.S6 The perception gap is exactly why offline accuracy and online outcomes both have to be measured. Either one alone lies.
Mobile platforms are increasingly running models both in the cloud AND on the device itself. Understanding this dual architecture is essential for anyone designing AI-powered experiences.
Every AI request faces a three-way tradeoff between latency, quality, and privacy. Whether you're building for mobile, web, or desktop — there is no option that wins on all three.
| | On-Device (Nano, Phi, Apple) | Cloud — Fast (Flash, Haiku) | Cloud — Frontier (Pro, Sonnet) |
|---|---|---|---|
| Latency | ~50–200ms | ~200–500ms | ~1–3s |
| Context Window | ~4–32K tokens | ~1M tokens | ~1M tokens |
| Cost | Free (runs on device) | Very low | Moderate |
| Privacy | Data never leaves phone | Data sent to server | Data sent to server |
| Offline | Yes | No | No |
| UI Generation | Simple components only | Standard layouts | Complex, multi-component |
| Best For | Quick actions, autocomplete, simple classification | Most generative UI tasks | Complex reasoning, multi-step agents |
Your generative UI protocol needs to work across this entire spectrum. That means: compact schemas that fit in Nano's small context window, graceful degradation when the on-device model can't handle a complex layout, and a clear escalation path from on-device → cloud when needed. This is a core architectural decision that shapes the entire protocol design.
Apple Intelligence implements a tiered approach almost identical to what generative UI needs. Simple tasks (notification summaries, smart reply suggestions, text proofreading) run entirely on-device via Apple's ~3B parameter model. Complex tasks (image generation with Image Playground, deep writing assistance) route to Apple's "Private Cloud Compute" servers. The decision happens automatically — the user never chooses.
On-device call screening (available on Pixel and Galaxy devices) is a pure on-device success story. A local model transcribes the caller's speech in real-time — no network needed. It works because the task (speech→text for a short utterance) fits comfortably within an on-device model's capability. This is the kind of scoped, well-defined task that on-device excels at.
Samsung Galaxy AI's Live Translate runs on-device for real-time phone call translation. The latency requirement (sub-200ms) makes cloud infeasible. But for complex features like Chat Assist (rewriting messages in different tones), they route to cloud because tone-shifting requires more sophisticated reasoning than the on-device model can handle.
Spotify's DJ feature uses cloud models to generate the DJ's commentary (creative, personalized text) but on-device models for the voice synthesis (latency-critical). Splitting one feature across on-device and cloud models — each doing what it's best at — is a pattern you'll use in generative UI.
Every AI product lives inside a triangle: quality, speed, and cost. You can optimize two at the expense of the third. Every design decision shifts the balance.
| Metric | Threshold | Why It Matters |
|---|---|---|
| Time to first token (TTFT) | < 200ms ideal | Users perceive streaming responses as 40-60% faster than waiting |
| Output token cost | 3-8x input cost | Every word the AI writes costs more than every word it reads |
| Streaming | Non-negotiable | Show tokens as they generate — never make users stare at a spinner |
| Prompt caching | 50-90% savings | Reusing system prompts across calls is the easiest cost win |
Prompt caching is the rare optimization that's both massive and free. By 2026 every major provider supports it (Anthropic, OpenAI, Google), and any production app with a non-trivial system prompt that isn't using it is leaving 50–90% of input cost on the table. It deserves more than a bullet point.
What it actually does. Every API call has a prefix that almost never changes — the system prompt, tool definitions, few-shot examples, sometimes a long retrieved document. The provider runs that prefix through the model once and stores the resulting internal state on their side. When your next call arrives within the cache window, the provider skips re-processing the prefix and starts from the cached state. You're billed at roughly 10% of the normal input rate for the cached portion.
What's cacheable. Anything that's identical across calls and lives at the start of the prompt: system instructions, tool/function definitions, few-shot examples, large retrieved documents shared across users. Anything that varies per user — chat history, the user's question — has to come after the cacheable prefix.
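What opting in looks like, using Anthropic's explicit cache_control marker as the example because it makes the prefix boundary visible (OpenAI and Gemini apply prefix caching more automatically). Field names follow Anthropic's documented request shape at the time of writing; the model name and prompt contents are placeholders, so verify against current docs:

```typescript
// Sketch of an Anthropic-style request body with an explicit cache breakpoint.
// Everything up to and including the cache_control block (tool definitions plus
// the system prompt) is the stable prefix; the per-user message comes after and
// is never cached.
const LONG_SYSTEM_PROMPT = "You are the support assistant for ..."; // imagine ~2,000 tokens here
const TOOL_DEFINITIONS: unknown[] = [];                             // your real tool schemas

const requestBody = {
  model: "claude-sonnet-latest",              // placeholder model name
  max_tokens: 1024,
  tools: TOOL_DEFINITIONS,                    // stable across calls
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,               // the prompt you reuse on every call
      cache_control: { type: "ephemeral" },   // cache everything up to and including this block
    },
  ],
  messages: [
    { role: "user", content: "Why was I charged twice this month?" }, // varies per request
  ],
};
```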
The 5-minute TTL gotcha. Most providers expire idle caches after ~5 minutes. A product with steady traffic gets near-100% cache hits and pockets the savings. A product with bursty or low traffic mostly pays for misses. If your traffic is uneven, batch user requests through a small pool of "warm" sessions, or consider Anthropic's longer-TTL extended cache (1 hour) for high-leverage prompts.
Worked example. A support bot with a 2,000-token system prompt, processing 1,000 messages/hour at ~80% cache hit rate on Sonnet 4.6: caching saves roughly $50/day, or $1,500/month — roughly 70% of what you'd otherwise spend on input tokens for that prompt (an 80% hit rate at a ~90% discount). That's an extra ~$18K/year in margin, gained by adding one parameter to your API call.
Prompt caching vs KV caching. They sound similar and people conflate them. KV caching is what happens inside a single generation — the model caches its own intermediate state so it doesn't recompute earlier tokens as it generates each new one. It's automatic and you don't think about it. Prompt caching is what happens across calls — the provider caches your prompt prefix between requests, billed to you. It's opt-in and you absolutely think about it.
In Chapter 2 we defined inference as the process of using a trained model to generate output. Every response your product generates is an inference call. The AI industry has developed several techniques to make these calls faster and cheaper — and understanding them helps you make product architecture decisions.
An important distinction: the company that trained a model isn't always the company that serves it. Meta trains Llama, but you can run Llama inference through AWS Bedrock, Together AI, Fireworks, Groq, or your own servers. This decoupling matters because it lets you shop for inference on price, latency, and reliability without changing which model you use.
Inference is where all the money flows in production AI. Training is a one-time cost borne by the model lab. Inference is an ongoing cost borne by every product using the model. When someone says "AI is expensive," they mean inference is expensive. When someone says "AI is getting cheaper," they mean inference prices are dropping (they fell roughly 80% between early 2025 and early 2026). Every optimization in this section — caching, speculation, quantization — exists to make inference cheaper and faster.
Every design decision is a cost decision. Longer system prompts mean more input tokens. Verbose AI responses mean more output tokens, which cost 3–8x more. Model routing (Chapter 7) is the biggest lever after caching: send simple tasks to cheap models, complex tasks to expensive ones. A well-designed routing system can cut costs 30–50% with no perceptible quality loss.
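A routing layer doesn't have to be sophisticated to pay for itself. Here's a minimal sketch of the idea — the heuristic and model names are placeholders; production routers usually rely on a small classifier model or task metadata rather than string matching.

```typescript
// Minimal model-routing sketch. Model IDs and the "looksSimple" heuristic
// are illustrative — swap in your own routing signal.
type Route = { model: string; maxTokens: number };

function routeRequest(task: string): Route {
  const looksSimple =
    task.length < 400 && !/analy|plan|debug|refactor|multi-step/i.test(task);

  return looksSimple
    ? { model: "small-fast-model", maxTokens: 512 }  // cheap tier
    : { model: "frontier-model", maxTokens: 4096 };  // expensive tier
}

// Usage: pick the route, then make the API call with route.model.
const route = routeRequest("Summarize this ticket in one sentence.");
console.log(route.model); // "small-fast-model"
```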
By 2026 open models — Llama, Mistral, DeepSeek, Qwen — are competitive with closed frontier models on most non-frontier tasks. That makes "should we self-host?" a real question instead of an obvious no. The answer is still usually no, but the exceptions matter.
| Self-hosting beats APIs when… | APIs beat self-hosting when… |
|---|---|
| You have very high steady volume (millions of requests/day) and unit economics dominate everything else. | You have variable or low volume — GPU utilization tanks, ops cost dominates. |
| Data residency or compliance forbids sending data to third parties (healthcare, defense, regulated finance). | You can use a regional or VPC-deployed API endpoint instead — most providers offer this now. |
| You're fine-tuning heavily and need full control over weights and training. | Provider fine-tuning APIs (LoRA endpoints) cover the use case. |
| You need a model the labs don't sell — a specific size, an older checkpoint, an embedding model with custom tokenizer. | You can pick from a frontier model + a cheap routing model and that's enough. |
| Latency is so tight that even a colocated API endpoint isn't fast enough (Groq, Cerebras territory). | 200–500ms TTFT from a hosted API is acceptable. |
The hidden cost of self-hosting isn't the GPUs. It's the ops team. Running production inference well — autoscaling, monitoring, model upgrades, security patching, handling traffic spikes — is a full-time SRE function. Most teams that try it eventually move back to a hosted inference provider (AWS Bedrock, Together AI, Fireworks, Groq) which gives them the open-model portability without the ops burden. The genuinely-self-hosted population is small and specialized.
Streaming is the single biggest UX win. ChatGPT, Claude, and Gemini all stream tokens as they generate. Users read faster than models write, so streaming feels interactive rather than like waiting. Products that show a loading spinner until the full response is ready feel dramatically slower, even when total latency is identical.
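A minimal streaming sketch using the OpenAI Node SDK (model ID illustrative): the only change from a blocking call is `stream: true` and a loop that writes tokens as they arrive.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Stream tokens to the user as they arrive instead of waiting for the full
// completion — same total latency, dramatically better perceived speed.
const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini", // illustrative
  messages: [{ role: "user", content: "Explain prompt caching in two sentences." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```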
Prompt caching is now standard across major providers. OpenAI documents up to 90% input-token cost reductions and latency wins when repeated prompt prefixes hit the cache.S8 The teams that wire this in early usually get a cost win without changing the product.
Groq and Cerebras serve open models on custom silicon at speeds frontier APIs can't match — hundreds of tokens per second on Llama-class models. For latency-bound use cases (live voice, autocomplete), they're a different category of product, not just a cheaper one.
Cursor's model routing sends most autocomplete to a small fast model, escalates harder requests to a frontier model, and uses Anthropic's longer-TTL caching on the codebase context. Three optimizations stacked: routing, caching, model size. None of them are visible to the user.
Everything in this book converges here. Generative UI takes an LLM's structured output and transforms it into rendered interface components — React on web, SwiftUI on iOS, Jetpack Compose on Android, or any other renderer.
Let's trace the full journey — from a user's voice to pixels on screen — using everything we've learned:
At its core, generative UI defines a tree of components. Each component has a type, properties, and optional children. The model generates this tree, and a platform renderer turns it into native UI:
// generative UI response for "Show my workout stats"
{
"root": {
"type": "Column",
"children": [
{
"type": "Text",
"value": "This Week's Workouts",
"style": "headlineMedium"
},
{
"type": "Card",
"variant": "elevated",
"children": [
{
"type": "Row",
"mainAxisAlignment": "spaceBetween",
"children": [
{ "type": "Text", "value": "Sessions", "style": "labelLarge" },
{ "type": "Text", "value": "4 of 5", "style": "bodyLarge" }
]
},
{
"type": "LinearProgressIndicator",
"progress": 0.8,
"color": "primary"
}
]
},
{
"type": "Card",
"variant": "outlined",
"children": [
{ "type": "Text", "value": "Top Exercise", "style": "labelLarge" },
{ "type": "Text", "value": "Bench Press — 185 lbs × 5", "style": "titleMedium" },
{
"type": "Button",
"label": "View Details",
"action": { "type": "navigate", "target": "workout_detail" }
}
]
}
]
}
}
Notice: the component names (Column, Card, Row, Text, Button, LinearProgressIndicator) map to standard UI primitives available in any framework — React, SwiftUI, Compose, Flutter. The style tokens (headlineMedium, labelLarge) map to a design system. The renderer walks this tree and emits native components for whatever platform you target.
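A renderer can be surprisingly small. The sketch below (React, heavily simplified — the registry entries and styling are placeholder assumptions, and a real renderer validates against the schema and maps style tokens to a design system first) shows the core move: look up each node's `type` in a component registry, render its children recursively, and drop anything unknown.

```tsx
import React from "react";

// A node in the generated tree: a type, optional children, arbitrary props.
type UINode = { type: string; children?: UINode[]; [prop: string]: unknown };

// Registry mapping schema component names to concrete implementations.
// These mappings are simplified placeholders, not a real design system.
const registry: Record<string, (node: UINode, kids: React.ReactNode) => React.ReactNode> = {
  Column: (_n, kids) => <div style={{ display: "flex", flexDirection: "column" }}>{kids}</div>,
  Row: (_n, kids) => <div style={{ display: "flex", justifyContent: "space-between" }}>{kids}</div>,
  Card: (_n, kids) => <section className="card">{kids}</section>,
  Text: (n) => <span>{String(n.value ?? "")}</span>,
  Button: (n) => <button>{String(n.label ?? "")}</button>,
};

function renderNode(node: UINode, key?: React.Key): React.ReactNode {
  const render = registry[node.type];
  if (!render) return null; // unknown component types are dropped, not crashed on
  const kids = node.children?.map((child, i) => renderNode(child, i));
  return <React.Fragment key={key}>{render(node, kids)}</React.Fragment>;
}
```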
The schema layer — between the model's output and the rendered interface — is where UX decisions live. Engineers understand the rendering. ML engineers understand the model. The schema in the middle is where UX meets AI constraints. Which components to include, how to handle responsive layouts, what error states to support, how to balance expressiveness with reliability — these are judgment calls that require understanding UX, AI constraints, AND the component system. This is the new design surface.
Vercel's v0 is the closest public analog to generative UI — it takes natural language prompts and generates React/Next.js UI components. Under the hood, it produces a structured component tree (JSX) from an LLM, just like generative UI produces a JSON tree for Compose or any other renderer. v0 proved the concept works commercially: developers pay for AI-generated UI built on the same structured-output-to-rendered-component pipeline.
Google Stitch generates UI from natural language prompts as HTML/CSS — a strong signal that designers want to describe interfaces conversationally and get rendered output. Tools like Antigravity then convert HTML to React. The pattern is clear: AI generates a description, a protocol standardizes it, and a renderer turns it into platform-native UI.
App Intents is Apple's version of this pipeline for SwiftUI. When Siri handles a request, App Intents defines the structured data, and SwiftUI renders the result as a native widget or Live Activity. Apple has a complete pipeline from voice → structured intent → native UI. Every platform is converging on this pattern: structured AI output → platform-native rendering.
Microsoft's Copilot in Microsoft 365 generates "Adaptive Cards" — a JSON-based UI format that renders natively in Teams, Outlook, and other Microsoft apps. This is structurally identical to generative UI: a JSON schema defines the component tree, a renderer turns it into native UI. Adaptive Cards has been in production since 2017 and handles billions of renders. It proves the pattern works at scale.
Out of the box, an LLM is a generalist. Skills, instructions, and project configs are how you turn it into a specialist that knows your tools, your conventions, and your workflow.
Every major AI provider has built a system for customizing how their models behave. The names differ but the core idea is the same: give the model persistent context about how you want it to work, not just what you want it to do right now.
Think of it as a spectrum from simple to powerful:
The simplest form of customization. A system prompt is a set of instructions sent to the model before the user's message. Every API call can include one. It tells the model who it is and how to behave.
// System prompt example
{
"system": "You are a senior UX researcher. When analyzing user feedback,
always identify the underlying need behind the stated request.
Format findings as: Observation → Insight → Recommendation.",
"messages": [
{ "role": "user", "content": "Users keep asking for a dark mode toggle." }
]
}
System prompts are powerful but have a key limitation: they're ephemeral. Every new conversation starts from scratch unless you manually include the system prompt again. They also eat into your context window (Chapter 3) since they're sent with every message.
The next step up: save your system prompts as reusable personas that persist across conversations. Each provider has a different name for this:
| Provider | Feature | What It Does |
|---|---|---|
| OpenAI | Custom GPTs | Package a system prompt + tools + knowledge files into a shareable persona. "GPT that reviews design specs against WCAG guidelines." |
| Google | Gems | Custom Gemini personas with persistent instructions. "A Gem that writes PRDs in our team's format." |
| Anthropic | Projects + System Prompts | Project-scoped instructions and knowledge files that apply to all conversations within a project. |
Custom personas are really just saved system prompts with a UI wrapper. The model doesn't fundamentally change. But the UX impact is significant: instead of copy-pasting instructions every time, you have a persistent specialist you can return to. For teams, this means you can create shared personas that encode team conventions.
This is where things get interesting for people who work in code-adjacent roles. Project configs give the AI persistent knowledge about a specific project, its conventions, and its structure.
The pattern was popularized by AI coding tools and is now spreading to broader AI workflows:
| Tool | Config File | What It Contains |
|---|---|---|
| Claude Code | CLAUDE.md | Project conventions, architecture decisions, coding standards, team preferences. Lives in the repo root. Claude reads it automatically at the start of every session. |
| Cursor | .cursorrules | Similar to CLAUDE.md. Rules about code style, preferred libraries, patterns to follow or avoid. Cursor loads it as context for every AI interaction in that project. |
| GitHub Copilot | .github/copilot-instructions.md | Repository-level instructions for Copilot. Defines conventions specific to the codebase. |
| Windsurf | .windsurfrules | Project rules for the Windsurf editor's AI assistant. Same pattern, different file name. |
A project config is like the onboarding document you'd give a new team member on their first day. "Here's how we name things. Here's our folder structure. Here are the libraries we use and why. Here's what we've tried before that didn't work." Except instead of a human reading it once and gradually forgetting, the AI reads it at the start of every single session.
In Claude Code's workflow, there's an important distinction between two types of files:
CLAUDE.md — persistent project context. Describes the codebase, conventions, architecture, and preferences. Doesn't change between tasks. Think of it as the project's constitution.
Example: "This is a Next.js 15 app using Tailwind. We use server components by default. All API routes go in /app/api. Never use class components."
plan.md — a task-specific planning document. Created for a specific feature or work session. Breaks down the task into steps, tracks progress, and captures decisions made along the way. Temporary and task-scoped.
Example: "Task: Add dark mode. Step 1: Create theme context ✅. Step 2: Update Tailwind config ✅. Step 3: Add toggle component. Step 4: Persist preference."
The two work together: CLAUDE.md tells the agent how to work in this project. plan.md tells it what to work on right now. One is stable, the other is ephemeral.
Skills go beyond instructions. A skill is a packaged capability that the AI can execute, not just follow. Skills combine instructions, tool definitions, and sometimes code into a reusable module.
The key difference between a skill and a system prompt: a system prompt says "you are a UX researcher." A skill says "when the user asks you to analyze feedback, here's the exact process to follow, here are the tools to use, here are examples of good output, and here's how to format the result." Skills are procedural, not just descriptive.
Each AI provider has a different philosophy and architecture for customization. Understanding these differences matters because they shape what you can build and how portable your workflows are.
Claude's approach is file-based: CLAUDE.md and skill files live in your repo. If you switch tools or providers, those files still work as documentation. A Custom GPT lives on OpenAI's platform. If you leave, you lose it. Google's Gems are tied to your Google account.
Claude's skill system is modular. You can have a "create-docx" skill, a "design-doc" skill, and a "frontend" skill, and they compose together in the same session. The model reads whichever skill files are relevant. OpenAI's Custom GPTs are monolithic: one GPT, one system prompt, one set of tools. You can't easily mix GPTs together.
This is where MCP (Chapter 12) becomes the differentiator. Claude uses MCP as the universal protocol for connecting to external tools. OpenAI uses GPT Actions (custom API integrations defined per-GPT). Google uses Extensions (pre-built connectors to Workspace apps). MCP is open and any tool can implement it. GPT Actions and Extensions are vendor-specific.
Cursor + .cursorrules has become the most widely adopted project config pattern among developers. Teams commit their .cursorrules file to the repo, encoding conventions like "use TypeScript strict mode," "prefer server components," "use this testing pattern." New team members get AI assistance that already knows the team's standards from day one.
Custom GPTs for design teams: Several design orgs have built internal GPTs for their specific workflows. "Upload a screenshot and this GPT audits it against our design system." "Paste user feedback and this GPT categorizes it by our taxonomy." These are essentially skills packaged as sharable apps.
Claude Projects for research teams: Research teams upload papers, transcripts, and frameworks into Claude Projects. The project-level knowledge means every conversation starts with deep context about the research domain, without re-explaining the background each time.
The lines between these approaches are blurring fast. MCP is being adopted beyond Claude (Cursor, Windsurf, and others now support it). OpenAI is moving toward more composable tools. Google is opening up Gemini's extension system.
The convergence point: a world where you define your team's AI configuration once (conventions, tools, knowledge, workflows) and it works across any AI tool your team uses. We're not there yet, but the trajectory is clear. The teams investing in structured AI customization now will have a significant advantage as these systems mature.
Customization (above) is what you tell the AI. Memory is what the AI learns about you over time. Every major provider now has a memory system: ChatGPT stores facts across conversations, Claude encrypts and exports memories, Gemini imports histories from competitors.
It helps to name the layers — what lives in the current session's context, what's saved per project, and what persists about the user across everything — because they have different shelf lives and different governance needs.
The UX patterns for memory are standardizing: transparency (see what the AI remembers), editing (correct or delete memories), scoping (global vs project-specific), staleness management (refresh outdated info), and portability (export and import between providers).
A wrong memory is worse than no memory. The AI will confidently act on outdated information ("you said you wanted a vegetarian option" — no, that was last year). The mitigation is governance: time-stamp every memory, summarize aggressively to compress old facts, let users edit or delete, and bias toward forgetting over hoarding. Treat memory like a database that needs a retention policy, not an attic.
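What that governance can look like in data terms — a sketch with illustrative field names, not any provider's actual memory format:

```typescript
// Sketch of a governed memory record. The point: every memory carries a
// timestamp, a scope, and an expiry, so the product can surface, refresh,
// or forget it deliberately instead of hoarding it.
interface MemoryRecord {
  id: string;
  fact: string;                           // "prefers vegetarian options"
  scope: "global" | "project";            // where this memory may be applied
  source: "stated" | "inferred";          // the user said it vs the model guessed it
  createdAt: Date;
  lastConfirmedAt: Date;                  // refreshed whenever the user re-confirms it
  expiresAt: Date;                        // bias toward forgetting: everything expires
  userEditable: true;                     // users can view, correct, or delete it
}

// Staleness check: old, unconfirmed memories get re-verified before use.
function isStale(m: MemoryRecord, now = new Date()): boolean {
  const ninetyDays = 90 * 24 * 60 * 60 * 1000;
  return now.getTime() - m.lastConfirmedAt.getTime() > ninetyDays || now > m.expiresAt;
}
```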
Skills and project configs are how AI goes from "generic tool" to "team member who knows our workflow." The investment isn't in the technology but in the articulation: writing down how your team works, what your conventions are, and what good output looks like. That documentation becomes the AI's training manual. Teams that can articulate their process clearly will get dramatically more value from AI than teams that can't.
Not every AI feature should be a chatbot. There are distinct product patterns, each with different UX, architecture, and user expectations. Choosing wrong costs months.
Every AI product maps to one of six archetypes. The archetype determines the interaction model, the trust requirements, the latency budget, and the failure modes you need to design for.
The most common product debate in 2026: should this feature be a copilot (AI assists, human decides) or an agent (AI acts, human supervises)? The answer depends on three factors:
The best way to understand archetypes: take one user need and see how each pattern handles it differently.
Products don't stay in one archetype. They evolve along a predictable path — usually from lower autonomy to higher:
GitHub Copilot started as autocomplete (generation), became a chat sidebar (copilot), and is becoming an agent (Copilot Workspace). Notion started with AI writing (generation), added Q&A (search), and is moving toward AI workflows (agent). The archetype you launch with isn't the archetype you'll have in two years.
Linear uses classification invisibly: AI auto-labels issues by priority and team. Users don't interact with the AI directly — it just makes the product smarter. This is the highest-ROI, lowest-risk archetype.
Figma AI is copilot-patterned: it suggests layouts, generates variants, and fills text — but the designer is always in control. The canvas is the workspace; AI is the assistant.
Cursor spans three archetypes simultaneously: autocomplete (generation), chat panel (copilot), and Composer (agent). Each mode has different trust levels, latency budgets, and UI patterns.
Prompt engineering was about writing good instructions. Context engineering is about designing the entire information environment the model sees — and it's now the most important product decision in AI.
The term "context engineering" was popularized by Andrej Karpathy in 2025 and has since become the standard framing. The insight: what matters isn't just the prompt — it's everything in the context window. System instructions, retrieved documents, conversation history, tool outputs, and examples all shape the model's behavior.
In traditional software, the product spec becomes code. In AI products, the product spec is the system prompt. Want the bot to be concise? That's a prompt instruction. Want it to always cite sources? Prompt instruction. Want it to refuse certain topics? Prompt instruction. The system prompt is the single most leveraged artifact in an AI product — and iterating on it is how you ship improvements without changing any code.
The weak prompt: "Summarize this feedback."
Result: Generic summary, no structure, misses key themes, inconsistent length across runs.
The strong prompt: "You are a UX researcher analyzing user feedback. For each piece of feedback, identify: (1) the stated request, (2) the underlying need, (3) severity (1-5). Respond in JSON."
Result: Consistent, structured, actionable. Same format every time.
The shift from "prompt engineering" to "context engineering" reflects a maturation: it's not about clever wording tricks anymore. It's about designing the entire information environment. What documents get retrieved? How much conversation history is retained? Which tools are exposed? How are examples selected? These are product architecture decisions that happen to be expressed as text in a context window.
Anthropic's Claude system prompt is thousands of tokens long and is treated as a living product document. Changes go through eval suites before deployment. It defines Claude's personality, capabilities, limitations, and behavior — it IS the product.
Cursor dynamically constructs context for each request: relevant code files (retrieved via embeddings), the user's recent edits, linter errors, and the project's .cursorrules file. No two requests see the same context. The "intelligence" of Cursor is largely in how well it selects what to include.
Traditional SaaS costs almost nothing per additional user. AI products spend real money on every API call. That single fact rewrites pricing, margins, and which features are worth shipping at all.
The fundamental economic difference: serving one more user on Figma costs Figma almost nothing. Serving one more query on ChatGPT costs OpenAI real money — model inference, compute, and API fees. This marginal cost per request is what makes AI product economics different from everything that came before.
| Model | How It Works | Example | Tradeoff |
|---|---|---|---|
| Per-seat subscription | Fixed price per user/month | ChatGPT Plus ($20/mo), Cursor Pro ($20/mo) | Simple, predictable. But heavy users cost you money, light users subsidize them. |
| Usage-based | Pay per token / API call | OpenAI API, Anthropic API, Google Vertex | Fair pricing, scales with value. But unpredictable bills scare customers. |
| Hybrid | Base subscription + usage overages | Claude Pro (base + message limits) | Best of both: predictable base, usage upside. Most common in 2026. |
| Free tier + premium | Basic AI free, advanced features paid | Notion AI, Grammarly, Perplexity | Great for adoption. Risk: free tier costs real money to serve. |
| Embedded / platform | AI baked into a product you already pay for | Apple Intelligence, Galaxy AI, Google Workspace AI | No separate pricing. AI is a feature, not a product. Funded by the parent product. |
In AI products, every design decision is a cost decision. A longer system prompt = more input tokens per call. A "show your reasoning" feature = 5-10x more output tokens. A RAG pipeline = embedding costs + retrieval costs + longer context. A multi-step agent = multiple API calls per user action. Product teams that don't model these costs before building frequently discover their feature is economically unviable at scale.
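A back-of-envelope cost model makes these tradeoffs concrete. The sketch below uses placeholder prices — pull current rates from the provider pricing pages in the Sources chapter before trusting any number it produces.

```typescript
// Back-of-envelope request cost model. Prices are placeholders — always use
// the provider's live pricing before relying on the output.
interface CostInputs {
  inputTokens: number;       // system prompt + context + user message
  cachedInputTokens: number; // portion of input served from the prompt cache
  outputTokens: number;      // the model's response
}

const PRICE_PER_MILLION = {
  input: 3.0,        // placeholder $/1M input tokens
  cachedInput: 0.3,  // cached tokens billed at ~10% of input (provider-dependent)
  output: 15.0,      // placeholder $/1M output tokens — note the 3–8x multiple
};

function costPerRequest(c: CostInputs): number {
  return (
    ((c.inputTokens - c.cachedInputTokens) * PRICE_PER_MILLION.input +
      c.cachedInputTokens * PRICE_PER_MILLION.cachedInput +
      c.outputTokens * PRICE_PER_MILLION.output) /
    1_000_000
  );
}

// A verbose "show your reasoning" feature multiplies outputTokens, not inputTokens —
// which is why it can quietly dominate the bill.
console.log(costPerRequest({ inputTokens: 3000, cachedInputTokens: 2000, outputTokens: 800 }));
```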
When every product can access the same foundation models, what makes yours defensible? The models are commoditizing. The value is moving to the layers above and below.
Here's the uncomfortable truth: if your product's value proposition is "we use GPT-4o to do X," your competitor can ship the same thing in a week. The model is an API call. The moat is everything else.
The most durable moat in AI is the data flywheel: user interactions → better training/eval data → improved product → more users → more interactions. Products that capture and learn from usage data compound their advantage over time. Products that just wrap an API don't. This is why "AI-native" companies (built around the flywheel) are structurally advantaged over "AI-added" companies (bolted AI onto an existing product).
AI developer products have unique DX challenges: non-deterministic outputs, complex debugging, and the need to "try before you buy." The playground isn't a nice-to-have — it's the product.
When a developer evaluates an AI API or tool, they go through a predictable journey: try it in a playground → read the docs → build a prototype → hit edge cases → decide to commit or abandon. The DX at each stage determines conversion.
| Stage | What They Need | DX Pattern |
|---|---|---|
| Explore | Can this do what I need? | Interactive playground with real models. Zero setup. Shareable results. |
| Prototype | Can I build with this? | SDKs in major languages. Quickstart that works in < 5 min. Copy-paste examples. |
| Build | How do I handle the hard parts? | Streaming docs, error handling guides, prompt engineering tutorials, eval tooling. |
| Scale | Can I rely on this? | Rate limits, uptime SLAs, cost calculators, usage dashboards, model versioning. |
| Debug | Why did it break? | Observability: request logs, token counts, latency traces, response diffs. |
How quickly does a developer go from zero to a working demo? This determines adoption more than any feature list. The target: under 5 minutes for a "hello world" equivalent.
Anthropic's Workbench lets developers test prompts, compare models side-by-side, and share results — all in-browser, before writing any code. The "try it" path has zero friction.
Vercel's AI SDK became the standard for building AI features in web apps because it abstracted away streaming, provider switching, and tool use into a clean TypeScript API. Good SDK design = adoption.
Stripe's API docs (pre-AI) set the DX standard that AI companies now emulate: interactive code examples, copy-paste SDKs, real API keys in the docs. The best AI developer products apply these same principles to a much harder problem space.
The AI landscape has dozens of layers and hundreds of companies. Understanding where your product sits — and who the adjacent players are — is essential for strategic positioning.
| Layer | Build When | Buy When | Key Tradeoff |
|---|---|---|---|
| Model | You need fine-tuned behavior | Almost always buy/rent | Training costs $1M+ |
| Orchestration | Complex multi-agent flows | Standard agentic patterns | LangChain is fast but opinionated |
| Vector DB | Unique scaling or privacy needs | Standard RAG | Pinecone/Weaviate vs pgvector |
| Eval | Highly domain-specific metrics | Standard accuracy/quality | Custom evals + Braintrust hybrid |
| Guardrails | Regulated industry (health, finance) | Standard content safety | Compliance needs drive build |
Think of an AI product as a restaurant. You don't grow your own wheat (compute), breed your own cows (train models), or manufacture your own pans (build infra) — you buy those. But you DO create your own recipes (prompts), design your own menu (product), and build the dining room (UX). The moat is what the customer sees and tastes, not what happens in the supply chain.
The strategic question for any AI product: which layer are you in, and who are you depending on? If you're an application, you depend on model providers. If you're a model provider, you depend on compute. Every layer has leverage over the ones above it and dependency on the ones below.
The "barbell" pattern: most value accrues at the top (apps that own the user relationship and data) and the bottom (compute providers with physical infrastructure). The middle layers — model APIs, orchestration frameworks, vector databases — face the most commoditization pressure. The products that thrive in the middle are those that become essential workflow infrastructure (LangChain) or own a critical data layer (Pinecone).
The demo always works. Production is where AI products fail. Understanding this gap is the difference between a successful launch and an embarrassing one.
Every AI product team has experienced this: you build a prototype, demo it to leadership, everyone is amazed. Then you ship it to real users and it immediately breaks in ways you never anticipated. This isn't a bug — it's a fundamental property of AI systems.
| Category | Demo Doesn't Test | Production Requires |
|---|---|---|
| Input diversity | 5-10 curated examples | Handling any input, including adversarial ones |
| Error handling | "It works" path only | Timeouts, rate limits, model errors, bad inputs |
| Latency | Acceptable for a demo | P95 latency under 3s for every request |
| Cost | Free during prototype | $X per query × millions of queries = real money |
| Monitoring | You watch it yourself | Automated alerts, dashboards, anomaly detection |
| Model updates | Pinned to one version | Model provider updates break your prompts |
Your product runs on a model you don't control. When the provider updates that model, your prompts can break without any change on your end. This has happened to nearly every AI product team.
The #1 cause of production AI failures: the long tail of user inputs. Your eval suite covers the 90% case. The remaining 10% of queries — ambiguous, multi-language, misspelled, out-of-scope, adversarial — is where the product breaks. Building for the long tail means investing as much in error handling, fallbacks, and edge-case coverage as you do in the happy path.
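In code, "investing in the unhappy path" mostly means wrapping every model call defensively. A sketch, with `callModel` and `validateOutput` as placeholders for your own functions:

```typescript
// Defensive wrapper sketch for the unhappy paths a demo never exercises:
// timeouts, transient provider errors, and outputs that fail validation.
async function generateWithFallback(
  prompt: string,
  callModel: (p: string, signal: AbortSignal) => Promise<string>,
  validateOutput: (raw: string) => boolean,
): Promise<{ ok: boolean; text: string }> {
  for (let attempt = 0; attempt < 2; attempt++) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 10_000); // hard latency budget
    try {
      const raw = await callModel(prompt, controller.signal);
      if (validateOutput(raw)) return { ok: true, text: raw };
      // Invalid output: fall through and retry once before giving up.
    } catch {
      // Timeout, rate limit, or provider error: retry once, then degrade.
    } finally {
      clearTimeout(timeout);
    }
  }
  // Graceful degradation beats a blank screen: show a non-AI fallback.
  return { ok: false, text: "Sorry — I couldn't generate that. Try rephrasing, or use search instead." };
}
```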
Google's AI Overviews launch in 2024 became a cautionary tale. The demo was polished. Real users immediately surfaced absurd answers — the AI suggested adding glue to pizza and eating rocks. Google had to add guardrails, limit triggers, and rethink the entire rollout within days. The gap between "works on curated queries" and "works on everything people actually search" was enormous.
Notion AI took a different approach: they shipped with aggressive guardrails (refusing many edge cases) and gradually expanded capabilities based on production data. Start conservative, expand with evidence. Slower launch, fewer crises.
The most powerful AI products aren't the ones with the best model on day one. They're the ones that learn from every user interaction and compound that learning into a better product.
A data flywheel is a self-reinforcing loop: the product generates data from user interactions, that data improves the product, the improved product attracts more users, and more users generate more data. This is the core growth loop for AI products.
Collecting user data for improvement creates a tension with user expectations of privacy. Different companies handle this very differently:
| Provider | Trains on Your Data? | User Control |
|---|---|---|
| OpenAI | Yes, by default (consumer). No (API + Team). | Opt-out available in settings |
| Anthropic | No. Never trains on conversations. | Memory is encrypted, exportable |
| Yes for free tier. No for Workspace paid. | Can import/export memory |
The trust implication: products that DON'T train on user data can advertise that as a feature. Products that DO train on data get a better flywheel but face privacy scrutiny. This is a genuine strategic tradeoff, not a clear right answer.
The flywheel isn't automatic. You have to design for it. That means: building feedback mechanisms into the UX (easy thumbs up/down, correction flows), creating data pipelines that turn feedback into eval datasets, and establishing processes to regularly retrain or re-prompt based on what you learn. Companies that capture data but never close the loop don't have a flywheel — they have a data warehouse.
Spotify's Discover Weekly is the canonical data flywheel. Every listen, skip, save, and playlist add feeds back into the recommendation model. After 10+ years, the compound advantage is enormous — a new competitor can't replicate a decade of behavioral data.
Tesla Autopilot processes billions of miles of driving data from its fleet. Every car contributes to the training data. More cars → more data → better driving → more customers → more cars. The fleet IS the moat.
ChatGPT's RLHF loop: Human feedback on responses trains the reward model, which improves the base model, which produces better responses, which generate more subscriptions, which fund more human raters. OpenAI turned user feedback into a direct product improvement cycle.
The best AI PMs are the ones who kill AI features that shouldn't exist. Knowing when a lookup table, a rule, or a simple search is the right answer is rarer and more valuable than knowing how to build with LLMs.
Every chapter in this primer has implicitly said "here's how AI does X." This chapter asks the opposite question: when is AI the wrong tool?
For any proposed AI feature, ask: "what would this look like without AI?" Often the non-AI version is faster, cheaper, more reliable, and good enough:
| AI Feature | Non-AI Alternative | AI Justified? |
|---|---|---|
| AI-powered search | Good keyword search with filters | Only if semantic understanding genuinely matters |
| AI-generated summaries | Human-written abstracts or excerpts | Only at scale where humans can't keep up |
| AI categorization | Rule-based classifier or dropdown | Only if categories are fuzzy and input varies widely |
| AI writing assistant | Templates and snippets library | Only if the output truly needs to be novel each time |
| AI-powered recommendations | Curated lists, popularity sorting | Only with enough user data to personalize |
The question isn't "can AI do this?" — it almost always can. The question is "does AI do this better than the alternatives, at a cost we can sustain, with a failure rate we can tolerate?" If the answer to any of those is no, the right product decision is to not use AI. This takes more courage than shipping an AI feature, and it's what distinguishes senior product thinking from hype-driven building.
Linear uses AI for issue classification but uses deterministic rules for workflow automation (status changes, assignments, notifications). They could use AI for everything — they chose not to, because rules are faster, cheaper, and 100% predictable for structured workflows.
Stripe Radar combines ML fraud detection with hard rules. Some fraud patterns are simple enough for rules ("block transactions over $10K from new accounts in high-risk countries"). ML handles the fuzzy cases. The hybrid is more reliable than either alone.
Traditional user research asks "what do you need?" AI product research is harder because users can't articulate needs for a technology they don't fully understand. The methods have to change.
Nobody asked for "next-token prediction." They asked for "help me write faster." The translation from human need to AI capability is a skill most product teams haven't developed yet.
The most important research question for any AI feature: "What's the delta?" Not "is the AI good?" but "is the AI better than what exists today?" If the current experience is a blank text field and the AI fills in a draft, the delta is huge. If the current experience is a well-designed template library and the AI generates slightly different text, the delta is small. Ship features with big deltas. Kill features with small ones.
AI product research isn't about asking users "do you want AI?" (they'll say yes). It's about measuring whether AI actually improves their outcome vs the non-AI alternative. The METR study found AI coding tools made developers 19% slower despite them believing they were 20% faster.S6 Without rigorous measurement, you're flying on perception, not reality.
Notion tested AI features as prompt prototypes before committing engineering resources. They wired a simple GPT call into their editor, tested with 20 users in a single afternoon, and learned that users wanted AI for "fill in this table" more than "write me a paragraph." This redirected six months of roadmap in one day.
Figma ran A/B tests on AI-generated layout suggestions where the control group got random layouts and the test group got AI layouts. The delta was measurable: AI layouts were chosen 3x more often. That data justified the investment.
The goal isn't maximum trust. It's calibrated trust — users trust the AI exactly as much as it deserves to be trusted. Over-trust and under-trust are both product failures.
This is the most nuanced design challenge in AI. Every product sits somewhere on a trust spectrum, and getting the calibration wrong has real consequences.
| Design Pattern | What It Does | When to Use |
|---|---|---|
| Confidence indicators | Show how certain the AI is ("High confidence" / "Not sure") | When accuracy varies by query type |
| Source attribution | Show where the answer came from (citations, links) | Any factual or knowledge-based task |
| Verification prompts | "Does this look right?" before executing | Irreversible actions or high-stakes outputs |
| Uncertainty language | "I think..." vs "The answer is..." in AI responses | When hallucination risk is moderate |
| AI-generated badges | Clear labels that content was made by AI | Always (and legally required in EU by Aug 2026) |
| Edit affordances | Make AI output easily editable, not a final answer | Any generation or drafting task |
| Comparison views | Show AI suggestion alongside the original | Editing, rewriting, refactoring tasks |
| Fallback visibility | Show what happens if AI is wrong (undo, revert) | Any action the AI takes on behalf of user |
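Several of these patterns can be carried in the payload itself, so the renderer can't forget them. A sketch with illustrative field names — not part of any real schema:

```typescript
// Trust patterns from the table above, expressed as fields a renderer can act on.
interface AIGeneratedCard {
  aiGenerated: true;                          // always badge AI content
  confidence: "high" | "medium" | "low";      // drives a confidence indicator
  sources: { title: string; url: string }[];  // source attribution
  body: string;                               // generated content, rendered as editable
  requiresConfirmation: boolean;              // verification prompt before irreversible actions
  undoToken?: string;                         // fallback visibility: how to revert what the AI did
}
```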
Trust is not a feature you add — it's a property that emerges from dozens of design decisions. The font size of an "AI generated" label. Whether the AI says "The answer is" vs "I believe the answer is." Whether the edit button is prominent or hidden. Whether errors are admitted openly or buried. Each micro-decision shifts calibration. The best AI products get this right not through one big trust feature, but through consistent, deliberate calibration across every interaction.
Traditional PMs ship features that work or don't. AI PMs ship features that work 92% of the time and need judgment calls about the other 8%. This chapter is about building that judgment.
Every AI product decision lives in a fog of uncertainty that traditional product decisions don't have. The model might hallucinate. The model provider might change the model. The cost might be unsustainable. A competitor might ship the same thing next week. Here's how to navigate that fog.
The hardest part of being an AI PM isn't the technology — it's translating probabilistic reality into language that leadership, legal, marketing, and sales can act on.
"The model has a 93% accuracy rate with a P95 latency of 2.3 seconds on our eval suite, though performance degrades on out-of-distribution inputs."
"It gets the right answer 19 out of 20 times. When it's wrong, users see a 'not sure' indicator and can retry. We're monitoring the 5% and improving weekly."
The decisions that define your career as an AI PM aren't the easy ones. They're the ones where there's no clear right answer:
These have no textbook answer. They require product judgment built from experience, values, and deep understanding of your users and business. The primer gives you the technical foundation. The gray areas are where you earn your title.
The defining characteristic of a strong AI PM isn't technical knowledge (you have that now) or business acumen (you're building that). It's comfort with ambiguity. The ability to make a decision with 70% confidence, communicate the uncertainty honestly, build in reversibility, and iterate based on data. Traditional PMs ship and move on. AI PMs ship, monitor, learn, and adjust — continuously. The product is never "done" because the model, the data, and the users are always changing.
Anthropic's responsible scaling policy is a decision framework for uncertainty at the company level: they define capability thresholds and pre-commit to safety measures at each level. This "decide the framework in advance, not in the moment" approach works for product teams too — define your quality thresholds, failure protocols, and escalation paths before you need them.
Notion AI's launch strategy was a masterclass in uncertainty management: ship with aggressive guardrails (the AI refuses many edge cases), measure what users actually try, expand capabilities based on real data. They chose "too conservative at launch, loosen over time" over "too permissive at launch, tighten after incidents." One approach builds trust. The other destroys it.
This chapter connects all 33 previous chapters into a single workflow. Here's how a team actually goes from "we should add AI" to a shipped, monitored, improving feature.
AI features blur traditional roles. Here's how responsibilities are shifting in 2026:
The lines between PM, designer, and engineer are blurring on AI teams. Microsoft reorganized in 2025 around a unified "Applied AI" function that merges traditional PM and engineering under a single "builder" role. Google's DeepMind product teams have designers writing prompts and PMs reviewing eval results. Startups like Vercel ship AI features where a single person handles prompt design, eval, and UX — because with API-based AI, you don't need separate specialists for each step.
The implication: the most valuable people on AI teams are T-shaped — deep in one discipline, but able to contribute across the prompt-eval-UX loop. A designer who understands evals. A PM who can write and iterate prompts. An engineer who thinks about trust patterns. This primer exists to build that cross-disciplinary fluency.
Microsoft's March 2026 Copilot reorg merged its consumer and commercial AI teams under a single leader and freed its AI CEO (Mustafa Suleyman) to focus on models. The signal: AI products are no longer a side project staffed by a few people — they're central enough to warrant dedicated, unified organizations with engineering, product, and design working as one unit.
Figma's AI team puts designers directly into the eval process and has designers writing system prompts — a designer wrote the first system prompt for Figma Make. Their Head of AI Product emphasizes keeping teams small by having everyone touch code, enabled by AI tooling that makes this feasible.
Anthropic's Claude Code team describes their approach openly: "designers ship code, engineers make product decisions, product managers build prototypes and evals." They have PMs, but the PM's job has shifted — instead of writing specs and handing them off, PMs build working prototypes with Claude Code and use evals to validate ideas. The team replaced documentation-first thinking with prototype-first thinking.
The workflow above gets the feature shipped. The checklist below keeps it from becoming a beautiful demo that quietly leaks data, loses trust, or collapses when the model changes.
Prompt injection. Treat retrieved docs, webpages, tool results, and emails as untrusted input. Add allowlisted tools, permission gates, output validation, and tests for data exfiltration attempts.
Action permissions. Classify actions as draft, reversible, externally visible, or irreversible. Require confirmation for sends, purchases, access changes, deletions, and sensitive-data transmission (a minimal gate sketch follows this checklist).
Data governance. Define what user data is logged, retained, used for evals, used for training, redacted, or excluded. Make consent and deletion paths part of the product surface.
Model drift. Pin model versions where possible, run scheduled regression evals, and record model IDs in traces so quality changes can be explained after a provider update.
Accessibility. Require semantic labels, focus order, contrast, motion controls, and readable error states in the UI schema. Generated interfaces do not get a pass on accessibility.
Content provenance and IP. Decide where generated text, code, and images can be used, what sources need attribution, and which workflows need legal review before launch.
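A minimal sketch of the action-permission gate from the "Action permissions" item above — the risk tiers come straight from that classification; the specific actions are illustrative:

```typescript
// Action-classification gate. Tune the tiers and the gated actions to your product.
type ActionRisk = "draft" | "reversible" | "externally_visible" | "irreversible";

const ALWAYS_CONFIRM: ActionRisk[] = ["externally_visible", "irreversible"];

interface AgentAction {
  name: string; // e.g. "send_email", "update_doc", "delete_record"
  risk: ActionRisk;
}

function requiresConfirmation(action: AgentAction): boolean {
  return ALWAYS_CONFIRM.includes(action.risk);
}

// Usage: the agent proposes an action; the app decides whether to execute it
// immediately or surface a confirmation UI first.
console.log(requiresConfirmation({ name: "send_email", risk: "externally_visible" })); // true
console.log(requiresConfirmation({ name: "update_doc", risk: "reversible" }));         // false
```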
The feature one-pager: problem, user task, why AI is needed, non-AI fallback, data access, model route, permissions, success metric, failure cost, launch threshold.
The eval spec: 3-5 criteria, examples of pass/fail, grader type, golden set owner, minimum score, regression threshold, review cadence.
The launch readiness checklist: red-team cases, prompt-injection tests, telemetry, rollback path, human escalation, cost alert, legal/privacy review, support playbook.
The RAG design checklist: corpus, permissions, freshness, chunking strategy, retriever, reranker, citation style, empty-result behavior, source-quality metric.
Most durable AI products converge on the same skeleton: app UI → AI gateway → model router → prompt/context builder → model call → tool executor/RAG → validators/guardrails → trace/eval logger → user feedback loop. If one of those boxes is missing, know why.
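One way to keep yourself honest about those boxes is to give each one an explicit contract. A type-level sketch — all names are illustrative stand-ins for your real request and response shapes:

```typescript
// Placeholder payload types.
type UserRequest = { userId: string; text: string };
type AuthedRequest = UserRequest & { permissions: string[] };
type RoutedRequest = AuthedRequest & { model: string };
type Prompt = { system: string; messages: { role: string; content: string }[] };
type ModelOutput = { text: string; toolCalls?: unknown[] };
type SafeOutput = { text: string; flagged: boolean };

// Each box in the skeleton becomes a stage with an explicit input and output,
// so it can be tested, swapped, and monitored on its own.
type Stage<In, Out> = (input: In) => Promise<Out>;

interface Pipeline {
  gateway: Stage<UserRequest, AuthedRequest>;    // authn, rate limits, abuse checks
  router: Stage<AuthedRequest, RoutedRequest>;   // pick the model tier for this task
  contextBuilder: Stage<RoutedRequest, Prompt>;  // system prompt, retrieval, history
  modelCall: Stage<Prompt, ModelOutput>;         // the inference call (ideally streaming)
  toolExecutor: Stage<ModelOutput, ModelOutput>; // tool calls / RAG follow-ups
  guardrails: Stage<ModelOutput, SafeOutput>;    // validation, safety, schema checks
  logger: Stage<SafeOutput, SafeOutput>;         // traces, token counts, eval sampling
}
```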
Every term from this primer, in one place. Reference this whenever a concept from an earlier chapter comes up.
| Term | Plain English | Chapter |
|---|---|---|
| Token | A chunk of text (~4 characters) that models process. The "atom" of AI. | 1 |
| Next-token prediction | How LLMs work: predict the most likely next word, repeat. | 2 |
| Context window | The model's working memory. Everything it can "see" at once. | 3 |
| Temperature | Randomness dial. Low = predictable. High = creative. | 4 |
| Reasoning model | Model that "thinks" step-by-step before answering. Slower, costlier, better on hard problems. | 5 |
| Structured output | Forcing the model to respond in a specific format (JSON, XML). | 6 |
| Model routing | Sending simple tasks to cheap models, hard tasks to expensive ones. | 7 |
| Multimodal | AI that processes text, images, audio, and video. | 8 |
| Function calling | Model outputs a structured request to use a tool (API, database, etc). | 9 |
| RAG | Retrieval Augmented Generation. Searching a knowledge base before answering. | 10 |
| Embedding | Converting text into numbers that capture meaning. Powers semantic search. | 10 |
| Vector database | Database optimized for storing and searching embeddings. | 10 |
| Hybrid search | Combining dense vector search (meaning) with sparse keyword search like BM25 (exact match) and merging results. | 10 |
| Reranking | Re-scoring the top retrieval results with a slower, more accurate cross-encoder model. Biggest single quality lever in production RAG. | 10 |
| HyDE | Hypothetical Document Embeddings. Have the LLM draft an answer first, then embed and search using that — closes the vocabulary gap between questions and documents. | 10 |
| Agentic loop | Think → Act → Observe → Repeat. How AI agents reason through multi-step tasks. | 11 |
| MCP | Model Context Protocol. Universal standard for connecting AI to tools. | 12 |
| A2A | Agent-to-Agent protocol. Standard for agents communicating with each other. | 14 |
| Guardrails | Safety systems that filter inputs/outputs (content, PII, jailbreaks). | 16 |
| Evals | Systematically testing AI against a set of examples to measure quality. | 17 |
| LLM-as-judge | Using a strong model to grade a weaker model's outputs. | 17 |
| Inference | Using a trained model to generate output. Every API call is an inference call. Distinct from training (which creates the model). | 2, 19 |
| TTFT | Time to First Token. How fast the model starts responding. <200ms feels instant. | 19 |
| Streaming | Showing tokens as they generate rather than waiting for the full response. | 19 |
| Prompt caching | Reusing processed prompt prefixes across calls. Cached tokens billed at ~10% of normal rate. The biggest production cost lever. | 19 |
| KV caching | Internal caching of attention state during a single generation so the model doesn't recompute earlier tokens. Automatic, not the same as prompt caching. | 19 |
| Speculative decoding | A small fast model drafts tokens, the large model verifies in batch. 2–3x speed at the same quality. | 19 |
| Quantization | Reducing the precision of model weights (e.g. 16-bit to 4-bit) to make models smaller and faster, with a small quality cost. | 19 |
| Distillation | Training a smaller "student" model to mimic a larger "teacher" model. How frontier capabilities flow down to cheap, fast models. | 19 |
| OpenTelemetry GenAI | Emerging standard for emitting LLM and agent traces (model calls, tool calls, latencies) into your existing observability stack. | 14 |
| Generative UI | AI generating actual interface components (cards, forms) not just text. | 20 |
| System prompt | Hidden instructions that define the AI's behavior, identity, and rules. | 21, 23 |
| Context engineering | Designing everything in the context window: prompt, docs, tools, history. | 23 |
| Data flywheel | Users → data → better product → more users. The core AI growth loop. | 29 |
| Trust calibration | Designing so users trust AI exactly as much as it deserves. | 32 |
| Hallucination | When the model confidently states something false. | 17, 32 |
| Fine-tuning | Retraining a model on custom data to change its behavior. | 10, 21 |
| RLHF | Reinforcement Learning from Human Feedback. How models learn to be helpful. | 29 |
Curated resources to continue learning. Organized by section. The sources below are also the live references for volatile claims like pricing, context windows, regulatory timing, and model capabilities.
AI facts age quickly. Any number tied to model price, context window, latency, benchmark score, legal deadline, or provider feature should be rechecked before publishing externally.
Live API pricing for current model families and token rates.
Context window docs and extended-thinking token behavior.
Google Gemini pricing and model limits.
Anthropic MCP docs for protocol concepts and product support.
European Commission overview for transparency and high-risk obligations.
Original METR writeup on early-2025 AI and experienced developer productivity.
Semantic conventions for GenAI spans, metrics, events, and agent traces.
OpenAI prompt caching guide for cacheable prefixes, latency, and cost behavior.
Structured Outputs announcement and caveats for schema-constrained generation.
The best way to learn: explain it to someone else.
If you can't, you don't know it yet.