AI Primer · 36 Chapters

From tokens to product strategy.

A visual guide to LLMs, agents, and generative UI — how they actually work, and why it matters for what you're building.

@adhithya

How to Use This Primer

You do not need to read this straight through. Pick the track that matches the decision you are trying to make, then use the glossary and sources when a claim needs verification.

PM Track

Read Chapters 1-7, 16-19, 22-34. Focus on model choice, evals, launch thresholds, cost, and when not to use AI.

Design Track

Read Chapters 3, 8, 14-16, 20, 31-33. Focus on trust, failure states, AI-generated UI, and human control.

Engineer Track

Read Chapters 1-13, 16-21, 23, 26, 28-29. Focus on schemas, tools, RAG, observability, evals, and cost.

Founder / Strategy Track

Read Chapters 7, 18-19, 22, 24-30, 34. Focus on economics, defensibility, distribution, data, and product risk.

Reading Rule

Every technical chapter should cash out into a product decision. If a concept does not change what you build, measure, price, or disclose to users, treat it as background knowledge and keep moving.

01

Tokens — The Atoms of Language

You think in words. LLMs think in tokens. Understanding this difference is the foundation of everything else.

When you type "Hello, how are you?" into ChatGPT or Gemini, the model doesn't see words at all. It sees something like this:

[Figure: "Hello, how are you?" splits into Hello / , / how / are / you / ? → token IDs 15496, 11, 1268, 527, 345, 30 — 6 tokens]
Each word (and punctuation mark) becomes a numbered token. The model only sees numbers.

A token is a chunk of text — sometimes a whole word, sometimes part of a word, sometimes just a character. The model has a fixed vocabulary (think of it as a dictionary) of roughly 30,000–100,000 tokens, and every piece of text gets broken into pieces from that dictionary.
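
You can get a usable estimate without a real tokenizer. Below is a minimal sketch using the common English rules of thumb (~4 characters or ~0.75 words per token); real tokenizers vary by model, so use the provider's tokenizer for exact counts:

// Rough token estimate for English text. Real tokenizers (BPE variants)
// differ per model; this is for budgeting, not billing.
function estimateTokens(text: string): number {
  const byChars = text.length / 4; // rule of thumb: ~4 characters per token
  const byWords = text.split(/\s+/).filter(Boolean).length / 0.75; // ~0.75 words per token
  return Math.round((byChars + byWords) / 2); // average the two heuristics
}

console.log(estimateTokens("Hello, how are you?")); // ~5; the real count above is 6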

Analogy: LEGO Bricks

Imagine you have a box of 50,000 unique LEGO bricks, each with a different shape. To represent any object, you combine bricks from your box. Common objects (like "the" or "hello") get their own single brick. Rare words get broken into multiple bricks. The word "tokenization" might become three bricks: token + iz + ation.

Why tokens matter for product teams

Every API call to an LLM is priced per token. Every model has a maximum number of tokens it can handle at once (its "context window"). When you're designing generative UI — a protocol where an LLM generates UI component trees — the size of that output in tokens directly determines what each generation costs, how long the user waits, and whether the result fits in the context window at all.

Tokenization in practice

Different models use different tokenizers, but the patterns are similar:

| Text | Approximate Tokens | Why |
| --- | --- | --- |
| Hello | 1 | Common word, gets its own token |
| authentication | 2–3 | Long word, split into parts |
| {"type": "card"} | 7–9 | JSON has lots of punctuation; each mark costs a token |
| A full paragraph (100 words) | ~130 | Rule of thumb: 1 token ≈ 0.75 words in English |
| A complex generative UI layout (20 components) | ~1,500–3,000 | Nested JSON structures are token-expensive |
Why This Matters for AI-Generated UI

When you design the generative UI schema, every field name you choose costs tokens. A field called backgroundColor costs more tokens than bg. But bg is ambiguous and the model might misinterpret it. This is a real product tradeoff between schema readability and token efficiency, and it is as much a product and UX decision as an engineering one.

Key takeaway

Tokens are the fundamental unit of LLM computation. Everything — cost, speed, capability limits — flows from token counts. When someone says "this model has a 128K context window," they mean 128,000 tokens, which is roughly a 200-page book.

In the Wild: How Token Economics Shape Real Products

OpenAI's pricing is entirely token-based, and the exact prices move as model families change.S1 This means a company like Notion AI, which processes millions of documents daily, must obsess over token efficiency — every unnecessary word in their system prompt costs real money at scale.

Cursor (the AI code editor) ran into token limits early. Their codebase context feature had to be carefully designed to select only the most relevant files to include in the context — because stuffing an entire repo into the prompt would blow past token limits and cost a fortune. They built a retrieval system that picks the 5-10 most relevant files, not all 500.

Stripe optimized their fraud detection prompts to use ~40% fewer tokens by switching from verbose natural language descriptions to compressed, structured formats — cutting their API costs proportionally while maintaining accuracy.

02

Next-Token Prediction — How LLMs Think

The most important idea in modern AI is embarrassingly simple: predict the next word. That's it. Everything else — conversations, code, UI generation — is a consequence of this one trick done at extraordinary scale.

An LLM doesn't "understand" language the way you do. It's a prediction machine. Given a sequence of tokens, it calculates the probability of every possible next token, then picks one.

[Figure: input tokens "The cat sat on the" feed a neural network (billions of parameters, trained on the internet), which outputs a probability for each possible next token: mat 68%, floor 15%, table 7%, roof 3%, plus 99,993 other tokens with tiny probabilities]
The model assigns a probability to every possible next token. "mat" wins at 68%. This happens for every single token in the output.

Here's the key insight: the model generates text one token at a time. After it picks "mat," the input becomes "The cat sat on the mat" and it predicts the next token again. This process repeats — token by token — until the model generates a special "stop" token or hits the maximum output length.
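
The loop itself is tiny. Here is a minimal sketch, with the neural network replaced by a toy stand-in; everything below is illustrative, not any provider's API:

// Minimal sketch of the generation loop. A real forward pass over billions
// of parameters is replaced here by a toy stand-in function.
type TokenProbs = Map<number, number>;

const STOP_TOKEN = 0; // special token meaning "I'm done"

// Toy stand-in: keeps favoring token 42 until the sequence reaches
// length 8, then favors the stop token.
function predictNextTokenProbs(tokens: number[]): TokenProbs {
  return tokens.length < 8
    ? new Map([[42, 0.68], [7, 0.15], [99, 0.07], [STOP_TOKEN, 0.1]])
    : new Map([[STOP_TOKEN, 0.9], [42, 0.1]]);
}

function argmax(probs: TokenProbs): number {
  let best = STOP_TOKEN, bestP = -Infinity;
  for (const [token, p] of probs) if (p > bestP) { bestP = p; best = token; }
  return best;
}

function generate(prompt: number[], maxTokens: number): number[] {
  const output = [...prompt];
  for (let i = 0; i < maxTokens; i++) {
    const probs = predictNextTokenProbs(output); // one full model pass per token
    const next = argmax(probs);                  // greedy pick (see Chapter 4 for sampling)
    if (next === STOP_TOKEN) break;              // the model decided it is finished
    output.push(next);                           // the pick becomes part of the next input
  }
  return output;
}

console.log(generate([15496, 11], 20)); // the sequence grows one prediction at a time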

Analogy: Autocomplete on Steroids

You know how your phone keyboard suggests the next word? An LLM is the same idea, but instead of being trained on your text messages, it's been trained on a significant fraction of all text ever written by humans. And instead of choosing from 3 suggestions, it's choosing from 100,000 possibilities, weighted by probability. The "magic" is just autocomplete at absurd scale.

The training process (simplified)

How does the model learn these probabilities? Through training on enormous amounts of text. The process is conceptually simple:

  1. Take text: "The cat sat on the mat"
  2. Hide the last word: "The cat sat on the ___" (answer: "mat")
  3. Check and adjust: if the model guessed "floor", adjust its weights toward "mat"
Repeat billions of times with trillions of words.
Training = showing the model billions of examples and adjusting its parameters whenever it guesses wrong. After enough repetitions, it gets very good at guessing.

The model doesn't memorize text. It learns patterns — statistical relationships between tokens. After seeing millions of sentences about cats sitting on things, it learns that "mat" is the most likely word after "the cat sat on the." After seeing millions of JSON objects, it learns the patterns of valid JSON. After seeing millions of code snippets, it learns programming syntax.

Training vs Inference: learning vs using

The process above — showing the model billions of examples and adjusting its parameters — is called training. It happens once (or periodically) and costs millions of dollars in compute. The result is a trained model — a massive file of numerical weights.

When you send a message to ChatGPT or Claude, the trained model runs the next-token prediction loop to generate a response. This is called inference. It happens billions of times per day and costs dollars (or fractions of a cent) per request.

| | Training (learning) | Inference (using) |
| --- | --- | --- |
| Happens | Once, over weeks/months | Every API call, billions/day |
| Cost | $10M – $100M+ | $0.001 – $0.10 per request |
| Who does it | OpenAI, Anthropic, Google | You (via API) or the user |
| Output | A trained model (weights file) | A generated response |
| Bottom line | You don't do this. The lab does. | This is what you pay for. Every call. |
Training creates the model. Inference uses the model. When people say "AI costs are high," they almost always mean inference costs — the per-request price of generating responses at scale.
Key Idea

Every time you hear "inference" in an AI conversation, mentally substitute "using the model to generate output." Inference latency = how fast you get a response. Inference cost = how much each API call costs. Inference provider = the company running the model's servers. This is the word you'll hear most often in AI product and engineering conversations.

This is why LLMs can generate UI

If an LLM has seen enough examples of JSON structures that describe UI components, it can predict what a valid UI component JSON should look like. Feed it a prompt like "generate a card component with a title and two buttons" and it produces token after token of valid JSON — not because it "understands" UI, but because it's seen enough patterns to predict what comes next in that kind of document.

Key Idea

An LLM doesn't know what a button looks like or what a card does. It knows what a button looks like in JSON — the statistical pattern of how buttons are described in text. This is both the power (it can generate any structured format it's seen) and the limitation (it can produce something that looks right in JSON but would be terrible UI).

In the Wild: Next-Token Prediction Powers Everything

GitHub Copilot is literally next-token prediction applied to code. When you type function calculateTax(, Copilot predicts the most likely next tokens based on patterns from millions of public repositories. It doesn't "understand" tax law — it's seen enough tax calculation functions to predict the pattern. This is why it's great at boilerplate but stumbles on novel business logic.

Google Search autocomplete works on a similar principle — given "how to make", it predicts "pancakes" or "money" based on frequency patterns. LLMs are this concept taken to an extreme scale.

Midjourney and DALL-E use a variation of this for images: instead of predicting the next token, they predict what pixels should look like given a text description. Different modality, same core idea — pattern prediction at scale.

03

Context Windows — The Model's Working Memory

A context window is the total amount of text a model can "see" at once — both your input and its output combined. It's the single most important constraint in building AI products.

Think of the context window as a desk. Everything the model needs to work with — your instructions, the conversation history, any documents you've provided, AND the response it's generating — all has to fit on this desk. If it doesn't fit, it falls off the edge and the model can't see it.

[Figure: a 128K-token context window (≈200 pages) holding: System Prompt ("You are a UI generator…", ~500 tokens), User Messages (your prompts, conversation history, ~2,000 tokens), Tool Definitions (function schemas, component definitions, ~3,000 tokens), and Model Output (the response being generated now, ~2,000 tokens), leaving ~120K unused. Input + output must fit within the window; overflow means truncation or failure.]
The context window is shared between everything: your instructions, conversation history, tool definitions, AND the model's response.

Context window sizes across models

| Model | Context Window | Notes / Rough Equivalent |
| --- | --- | --- |
| OpenAI flagship models | Varies by model | Check the live model docs before shipping |
| Claude Sonnet / Opus family | 200K+ tokens, model-dependent | Anthropic documents 200K standard context and newer long-context options |
| Gemini Pro / Flash family | Up to 1M+ tokens, model-dependent | Google publishes current limits in AI Studio / API docs |
| Gemini Nano (on-device) | ~4–32K tokens | ~5–50 pages of text |

Model names, context windows, and prices change quickly. Treat this table as orientation, then verify against live provider docs before using it in a spec.S1S2S3

The Nano constraint

See that last row? On-device models (Gemini Nano, Apple's local models, Phi-4-mini) have dramatically smaller context windows. A prompt that works with a cloud model might completely fail on-device. Your architecture needs to handle this: shorter schemas, simpler prompts, or a fallback strategy when the on-device model can't handle the request.


Why this matters for product decisions

Context window size shapes every product decision in an AI system:

Big context window (cloud)

Can hold complex schemas, long conversation history, and rich tool definitions. Generates detailed, multi-component UIs. But: slower, more expensive, requires network.

Small context window (on-device)

Fast, private, works offline. But: can only handle simple prompts and small output. Needs compressed schemas. Limited to simpler UI generation.

Key Idea

The context window is a shared budget. Every token you spend on instructions is a token you can't use for output. This is why prompt engineering is an optimization problem: say enough for the model to understand the task, but no more. In generative UI systems, a bloated component schema eats into the space available for the actual UI generation.
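
The budget arithmetic is worth automating. A back-of-envelope check like the following, with illustrative numbers rather than any provider's API, catches overflow before a request is sent:

// Shared-budget arithmetic for a single request. All numbers illustrative.
interface PromptParts {
  system: number;    // system prompt tokens
  history: number;   // conversation history tokens
  tools: number;     // tool/schema definition tokens
  retrieved: number; // RAG context tokens
}

function maxOutputTokens(window: number, parts: PromptParts): number {
  const inputTotal = parts.system + parts.history + parts.tools + parts.retrieved;
  const remaining = window - inputTotal;
  if (remaining <= 0) throw new Error(`Input overflows the window by ${-remaining} tokens`);
  return remaining;
}

// The 128K window with the example budget from the figure above:
console.log(maxOutputTokens(128_000, { system: 500, history: 2_000, tools: 3_000, retrieved: 0 }));
// → 122_500 tokens left for the model's reply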

In the Wild: Context Window as Product Differentiator

Cursor (AI code editor) lives and dies by context management. A developer's codebase might be millions of lines, but the model can only see a fraction at once. Cursor built an entire retrieval system — indexing your repo, ranking file relevance, and intelligently packing the most useful code into the context window. This "what to include" problem is their core product challenge.

NotebookLM by Google uses Gemini's 1M token window to ingest entire research papers, books, and document collections at once. Before large context windows, this required complex chunking and retrieval (RAG). Now you can just dump 50 PDFs in and ask questions. The product exists because the context window got big enough.

ChatGPT's memory feature is a workaround for context limits. Between conversations, the context window resets. So OpenAI stores a condensed summary of what it learned about you — effectively compressing your history into a few hundred tokens that fit alongside each new conversation.

04

Temperature and Sampling — Controlling Creativity

When the model predicts the next token, it doesn't always pick the most likely one. Temperature controls how adventurous it gets.

Remember from Chapter 2 that the model produces a probability distribution over all possible next tokens. Temperature is a number that modifies these probabilities before the model makes its pick.

[Figure: the same prompt at three temperatures. Temperature = 0 (deterministic): always picks "mat". Temperature = 0.7 (balanced): usually "mat", sometimes others. Temperature = 1.5 (creative/chaotic): anything goes, even "moon".]

The rule for generative UI: use low temperature (0–0.3) for UI generation. You want reliable, schema-compliant JSON, not creative surprises. Creativity in UI should come from the prompt, not the randomness.
Low temperature = predictable, reliable output. High temperature = diverse, surprising output. For UI generation, you almost always want low.

Other sampling parameters

Temperature isn't the only knob. Two others matter for your work:

Top-P (nucleus sampling): Instead of considering all 100,000 possible tokens, only consider the smallest set whose combined probability exceeds P. If P=0.9, the model only picks from the top tokens that together account for 90% of the probability. This prevents the model from ever picking wildly unlikely tokens.

Top-K: Even simpler — only consider the K most likely tokens. If K=50, the model picks from the top 50 most probable tokens. The other 99,950 are eliminated entirely.
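
All three knobs compose into a single sampling step. The sketch below shows how they interact; in hosted APIs you pass these as request parameters rather than sampling yourself, and the token IDs and logits here are made up:

// One sampling step with temperature, top-K, and top-P applied in order.
function sample(logits: Map<number, number>, temperature: number, topK: number, topP: number): number {
  // 1. Temperature: divide logits before softmax. <1 sharpens, >1 flattens.
  const t = Math.max(temperature, 1e-6); // temperature 0 collapses to greedy
  const scaled = [...logits].map(([tok, l]) => [tok, l / t] as const);

  // 2. Softmax -> probabilities (subtracting the max for numeric stability).
  const maxL = Math.max(...scaled.map(([, l]) => l));
  const exps = scaled.map(([tok, l]) => [tok, Math.exp(l - maxL)] as const);
  const sum = exps.reduce((s, [, e]) => s + e, 0);
  let probs = exps.map(([tok, e]) => [tok, e / sum] as const)
                  .sort((a, b) => b[1] - a[1]);

  // 3. Top-K: keep only the K most likely tokens.
  probs = probs.slice(0, topK);

  // 4. Top-P: keep the smallest prefix whose cumulative probability reaches P.
  let cum = 0;
  probs = probs.filter(([, p]) => (cum += p) - p < topP);

  // 5. Renormalize and draw one token at random, weighted by probability.
  const total = probs.reduce((s, [, p]) => s + p, 0);
  let r = Math.random() * total;
  for (const [tok, p] of probs) { r -= p; if (r <= 0) return tok; }
  return probs[0][0];
}

// Made-up logits for four candidate tokens:
const logits = new Map([[101, 2.0], [102, 1.0], [103, 0.5], [104, -1.0]]);
console.log(sample(logits, 0.7, 3, 0.9)); // usually 101, sometimes 102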

Key Idea

For generative UI and any system where the model's output must conform to a specific schema, you want: temperature near 0, top-P around 0.9, and top-K around 40. This keeps the model focused on producing valid, predictable output while still allowing some flexibility in how it composes the UI.

In the Wild: Temperature Settings Across Products

GitHub Copilot uses low temperature (~0.1–0.2) for code completions. You want predictable, syntactically correct code — not creative surprises. When Copilot suggests a function body, it should be the most likely correct implementation, not a novel experiment.

ChatGPT's creative writing mode uses higher temperature (~0.7–1.0). When you ask it to write a story, you want variety — the same prompt should produce different stories each time. Low temperature would produce the same story every time, which feels robotic.

Jasper AI (marketing copy tool) lets users adjust a "creativity slider" — which maps directly to temperature. "More creative" = higher temperature for brainstorming taglines. "More precise" = lower temperature for factual product descriptions. They turned a technical parameter into a UX feature.

05

Reasoning Models — When AI Needs to Think

Standard models respond instantly. Reasoning models pause, think step-by-step, and then answer. They cost 5–10x more, take seconds to start, and beat everything else on the hard stuff. The product question is when that trade is worth it.

Remember from Chapter 2 that LLMs predict one token at a time. A reasoning model does something different: before producing its visible answer, it generates internal "thinking tokens" — a private chain of reasoning that the user may or may not see.

[Figure: Standard model: prompt → answer (1–3 seconds). Reasoning model: prompt → [think…] → answer (5–60s), with the thinking tokens hidden. Standard = fast, cheap, most tasks (GPT-4o, Claude Sonnet, Gemini Flash). Reasoning = slower, costly, hard tasks (o3, Claude Thinking, Deep Think).]
Reasoning models spend extra compute "thinking" before answering. The thinking tokens are consumed from the context window but often hidden from the user.
Analogy: Scratch Paper

Standard model: a student who blurts out the answer immediately. Reasoning model: a student who pulls out scratch paper, works through the problem step by step, then gives you the final answer. The scratch paper (thinking tokens) takes time and costs money, but for hard problems the answer is dramatically better.

The UX challenge

Reasoning models create a fundamentally different interaction pattern: the user may wait seconds, sometimes a minute, before the first visible token. The product has to make that wait legible: show that thinking is happening, give a sense of how long it will take, and decide whether to let the user peek at the reasoning.

Key Idea

The decision framework is simple: if you wouldn't need scratch paper for this problem, don't use a reasoning model. "What's the weather?" doesn't need reasoning. "Analyze this contract for liability risks across three jurisdictions" does. The product decision is whether to route automatically (like model routing in Chapter 7) or let the user choose.

In the Wild

Cursor uses reasoning models selectively: standard models for autocomplete and quick edits, reasoning models for complex multi-file refactors. The user doesn't choose — the system routes based on task complexity.

ChatGPT shows a collapsible "Thought for X seconds" indicator. Users can expand it to see the chain of thought or collapse it and just read the answer. This progressive disclosure pattern has become the standard.

Claude's Extended Thinking lets teams dial how much the model thinks. More effort = more thinking tokens = longer wait = better answers on hard problems. The API exposes this as a parameter, letting product teams tune the tradeoff per feature.

06

Structured Output — Making Models Predictable

LLMs naturally produce free-flowing text. But generative UI needs valid JSON. Structured output is how we force a creative, probabilistic system to produce machine-readable data.

Without structured output, if you ask an LLM to "generate a card component," you might get:

Sure! Here's a card component for you:

The card has a title "Weather Today" and shows the
current temperature of 72°F with a sunny icon...

That's nice prose, but your UI renderer can't do anything with it. What you need is:

{
  "type": "Card",
  "children": [
    { "type": "Text", "value": "Weather Today", "style": "headline" },
    { "type": "Row", "children": [
      { "type": "Icon", "name": "sunny" },
      { "type": "Text", "value": "72°F", "style": "display" }
    ]}
  ]
}

Structured output constrains the model toward valid JSON that matches your schema. Providers differ in how strict this is, and even schema-valid output can contain wrong values, so still validate before acting.S9 There are three main approaches, and understanding the differences is critical:

Prompt-Based "Respond only in valid JSON matching this schema..." Reliability: ~70–85% JSON Mode API forces output to be valid JSON. No schema enforced. Reliability: ~95% Function Calling Define a function with JSON Schema. Model MUST return matching output. Reliability: ~99%+ How Structured Output Actually Works (Function Calling) You define a schema {"type":"Card",...} Model generates tokens but constrained to schema Output guaranteed to match your schema HOW CONSTRAINED DECODING WORKS: At each step, the API masks out tokens that would produce invalid JSON. If the schema requires "type" next, only that token can be generated. Like autocomplete that prevents you from typing invalid characters.
Function calling with schema enforcement is the gold standard for generative UI: it makes the output parseable, then your app still validates semantics and permissions.
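
Even near-perfect schema compliance doesn't make output safe to render: values can still be wrong and trees can still be pathological. Here is a sketch of the app-side check, with hypothetical component names and limits:

// App-side validation after the model returns a "schema-valid" tree.
// Component names and nesting rules here are hypothetical examples.
type Component = { type: string; children?: Component[]; [key: string]: unknown };

const ALLOWED = new Set(["Card", "Column", "Row", "List", "Text", "Icon", "Button"]);
const MAX_DEPTH = 8; // guard against pathological nesting

function validate(node: Component, depth = 0): string[] {
  const errors: string[] = [];
  if (!ALLOWED.has(node.type)) errors.push(`Unknown component: ${node.type}`);
  if (depth > MAX_DEPTH) errors.push("Tree too deep");
  for (const child of node.children ?? []) errors.push(...validate(child, depth + 1));
  return errors;
}

// Render only if validate() returns no errors; otherwise fall back to text (Chapter 15).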

Function calling in practice

Here's what a real function calling setup looks like. This is the exact pattern that generative UI would use:

// You send this to the API alongside your prompt:
{
  "tools": [{
    "type": "function",
    "function": {
      "name": "render_ui",
      "description": "Generate a UI component tree for the user's request",
      "parameters": {
        "type": "object",
        "properties": {
          "root": {
            "type": "object",
            "properties": {
              "type": { "enum": ["Card", "Column", "Row", "List"] },
              "children": {
                "type": "array",
                "items": { "$ref": "#/$defs/Component" }
              }
            }
          }
        },
        "required": ["root"]
      }
    }
  }]
}

// The model's response is constrained to this schema:
{
  "tool_calls": [{
    "function": {
      "name": "render_ui",
      "arguments": {
        "root": {
          "type": "Card",
          "children": [
            { "type": "Text", "value": "Weather", "style": "headline" },
            { "type": "Text", "value": "72°F Sunny", "style": "body" }
          ]
        }
      }
    }
  }]
}
Why This Matters for AI-Generated UI

A generative UI protocol is, at its core, a function calling schema for generating UI component trees. The schema defines what components exist (Card, Row, Column, Text, Button, Image...), what properties each has, and how they nest. The renderer — React on web, SwiftUI on iOS, Jetpack Compose on Android — maps these to native components. The model's job is to fill in the values. The tighter your schema, the more reliable the output. The looser, the more creative — but more likely to break.

The schema design tradeoff

This is where design instinct becomes a superpower. Schema design is UX design for machines:

Tight Schema

{"type": "Card", "variant": "elevated"|"filled"|"outlined"}

✅ Always valid
✅ Predictable rendering
❌ Limited expressiveness
❌ Model can't improvise

Loose Schema

{"type": "string", "style": "object"}

✅ Creative flexibility
✅ Can handle novel requests
❌ Might generate invalid UIs
❌ Harder to render reliably

In the Wild: Structured Output Powers Production Systems

Shopify's Sidekick uses function calling to let merchants manage their store via natural language. "Give me a 20% discount on winter jackets" triggers a structured tool call with exact parameters: { action: "create_discount", collection: "winter-jackets", percentage: 20 }. Free-text output would be useless — Shopify's backend needs machine-readable instructions.

Zapier's AI Actions connects ChatGPT to 6,000+ apps using structured output. When you say "add this to my Notion database," the model generates a structured API call that Zapier can execute. The schema for each integration is pre-defined — the model fills in the values.

Vercel's v0 generates React code from natural language descriptions. Under the hood, it uses structured output to produce a specific code format with metadata (component name, imports, props). The output isn't "creative writing that happens to be code" — it's schema-constrained generation optimized for parseability and rendering.

07

Model Variants — Choosing the Right Brain

Not all models are created equal. Choosing which model to use for which task is one of the most impactful product decisions you'll make.

Every major AI provider offers a family of models at different capability/cost/speed tradeoffs. Think of it like cars: you don't drive an 18-wheeler to get groceries, and you don't use a Smart car to haul lumber.

[Figure: model tiers plotted from speed/cost efficiency to capability. Frontier (GPT-4o, Claude Opus, Gemini 2.5 Pro): complex reasoning, multi-step agent tasks. Workhorse (Claude Sonnet, Gemini 2.5 Flash): most production workloads, the UI-generation sweet spot. Fast/cheap (Claude Haiku, GPT-4o Mini): simple classify/extract. On-device (Gemini Nano): privacy, offline.]
Model selection is a spectrum. The right model depends on your task, latency requirements, cost budget, and privacy constraints.

Model routing: using multiple models

Sophisticated AI products don't use a single model — they route requests to different models based on complexity. This is called model routing or cascading.

User Request "Show my schedule" Router classify Simple → Nano "What time is it?" Medium → Flash "Show my schedule" Complex → Pro "Plan my week around..." ~50ms, free ~300ms, $0.001 ~2s, $0.02
A router classifies each request's complexity and sends it to the cheapest/fastest model that can handle it. This is how you scale agentic UI without blowing your cost budget.
Key Idea

Model selection isn't a one-time decision — it's a runtime decision made for every request. A production AI product needs a routing layer: simple tasks go to a small model (Haiku, Flash, on-device), standard tasks go to a mid-tier model, and complex tasks go to a frontier model. Designing this routing logic is a core product and architecture decision.
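
In code, the router is often just a cheap classifier in front of a lookup table. A minimal sketch with made-up tier names, model IDs, and classification rules:

// Minimal model router. Tiers, model names, and rules are all illustrative.
type Tier = "on-device" | "fast" | "mid" | "frontier";

// A hypothetical cheap classifier; in production this might itself be a
// small, fast model call. (An on-device check would typically come first.)
function classify(request: string): Tier {
  if (request.length < 40) return "fast"; // short lookups
  if (/plan|analyze|compare|refactor/i.test(request)) return "frontier";
  return "mid";
}

const MODEL_FOR_TIER: Record<Tier, string> = {
  "on-device": "nano-local", // placeholder names: map to real model IDs
  "fast": "small-cloud",
  "mid": "workhorse",
  "frontier": "frontier-pro",
};

function route(request: string): string {
  return MODEL_FOR_TIER[classify(request)];
}

console.log(route("What time is it?")); // → small-cloud
console.log(route("Plan my week around my gym schedule and Alex's visit")); // → frontier-pro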

In the Wild: Model Routing in Production

Perplexity routes queries across multiple models. Simple factual lookups go to a fast, cheap model. Deep research queries go to a frontier model. They built a classifier that evaluates query complexity in <50ms and routes accordingly — cutting their average cost per query by ~60% while maintaining quality where it matters.

Notion AI uses different models for different features: a lightweight model for autocomplete suggestions (speed matters most), a mid-tier model for summarization (balance of speed and quality), and a frontier model for complex writing tasks (quality matters most). Each feature has its own model selection, not one model for everything.

Samsung Galaxy AI on the S24/S25 series does exactly the on-device/cloud routing described here. Simple tasks (text summarization, live translate) run on-device via a smaller model. Complex tasks (generative edit in photos, chat assist) go to cloud. The user doesn't know or care which model is running — they just see the result.

08

Multimodal AI — Beyond Text

Modern models read text, see images, hear audio, and watch video. The input box is no longer a box. The design problem is figuring out which modalities your product actually needs and which are demo candy.

Every chapter so far has been implicitly text-centric. But as of 2026, every frontier model is natively multimodal — processing text, images, audio, and sometimes video within a single inference call.

How different modalities become tokens:

| Modality | Conversion | Rough token cost |
| --- | --- | --- |
| Text | Words → subword tokens | ~1.3 tokens per word |
| Image | Pixels → visual patches | ~765+ tokens per image |
| Audio | Waveform → spectrograms | ~1,000 tokens per minute |
| Video | Frames → visual + audio | ~10K+ tokens per minute |

All modalities share the same context window: a single image costs as many tokens as 600 words of text.
Every modality gets converted to tokens that share the same context window. Images and video are expensive — a single photo uses as many tokens as a full page of text.

What this means for product teams

Multimodal AI enables new input patterns that were impossible with text-only models: photograph a problem instead of describing it, circle something on screen and ask about it, or upload a document and ask about the chart on page 7.

Key Idea

Multimodal doesn't just add input types — it changes the fundamental interaction model. Text-only AI is "describe your problem." Multimodal AI is "show me your problem." This is a massive reduction in friction for users who struggle to articulate complex visual or spatial information in words.

In the Wild

Google Lens evolved from a standalone visual search tool into Gemini's eyes. Circle to Search on Pixel/Samsung lets you highlight anything on screen and ask questions about it — multimodal inference running on what you see.

Be My Eyes (accessibility app) uses GPT-4o's vision to describe the world to blind users in real-time. A user points their phone camera and the model narrates what it sees. This was impossible before multimodal.

NotebookLM ingests entire PDFs, slides, and images as visual tokens. You can ask "what's the chart on page 7 showing?" and it answers based on the actual visual layout, not just extracted text.

Agentic Architecture
09

Function Calling — Teaching Models to Use Tools

An LLM by itself can only generate text. Function calling is how we give it hands — the ability to actually do things in the real world: check calendars, send messages, query databases, and generate UIs.

Imagine you hire a brilliant consultant who knows everything about everything — but they're locked in a room with no phone, no computer, and no internet. They can give you amazing advice, but they can't actually do anything. That's an LLM without function calling.

Function calling gives the consultant a phone. You tell them: "Here are the apps on this phone and what each one does." When they need to check something or take an action, they tell you which app to use and what to type in. You execute it, show them the result, and they continue their work.

The mechanics

The function calling lifecycle has exactly four steps. Every agentic system — including generative UI — follows this pattern:

  1. YOU define the available tools.
     tools: [ { name: "get_weather", params: {city} }, { name: "send_message", params: {to, body} } ]
  2. MODEL decides which tool to call (and with what arguments).
     User says: "What's the weather in San Jose?" → Model: { tool: "get_weather", args: { city: "San Jose" } }
  3. YOUR CODE executes the function and returns the result.
     You call the weather API → get back { temp: 72, condition: "sunny" } and send this result back to the model.
  4. MODEL generates the final response using the data.
     Model: "It's 72°F and sunny in San Jose right now." With generative UI, it generates a weather Card component with the data filled in.
The four-step function calling lifecycle. The model never calls the function itself — it tells YOUR code what to call, and your code executes it. The model is the brain; your code is the hands.
Critical Insight

The model never actually executes functions. It generates a request to call a function. Your application code runs the function and feeds the result back. This is important for security (the model can't directly access your APIs without your code mediating) and for control (you can validate, log, rate-limit, or reject tool calls before executing them).
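
That mediation layer is ordinary code you own. A minimal sketch of the dispatch step, with a hypothetical handler registry:

// Your code sits between the model's tool-call request and execution.
// Handler names and validation policy here are hypothetical.
type ToolCall = { name: string; args: Record<string, unknown> };

const handlers: Record<string, (args: Record<string, unknown>) => Promise<unknown>> = {
  get_weather: async () => ({ temp: 72, condition: "sunny" }), // stub implementation
};

async function dispatch(call: ToolCall): Promise<unknown> {
  const handler = handlers[call.name];
  if (!handler) return { error: `Unknown tool: ${call.name}` }; // hallucinated tool: tell the model
  console.log("tool call:", call.name, call.args);              // audit-log every call
  // Validate, rate-limit, or require approval here before running anything irreversible.
  return handler(call.args);
}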

How different providers implement it

The concept is identical across providers, but the API syntax differs slightly:

OpenAI / GPT

// Defining tools
tools: [{
  type: "function",
  function: {
    name: "get_weather",
    description: "Get current weather for a city",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name" }
      },
      required: ["city"]
    }
  }
}]

// Model response when it wants to call a tool:
{
  "choices": [{
    "message": {
      "tool_calls": [{
        "id": "call_abc123",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"San Jose\"}"
        }
      }]
    }
  }]
}

Anthropic / Claude

// Defining tools
tools: [{
  name: "get_weather",
  description: "Get current weather for a city",
  input_schema: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name" }
    },
    required: ["city"]
  }
}]

// Model response when it wants to call a tool:
{
  "content": [{
    "type": "tool_use",
    "id": "toolu_abc123",
    "name": "get_weather",
    "input": { "city": "San Jose" }
  }]
}

Google / Gemini

// Defining tools
tools: [{
  function_declarations: [{
    name: "get_weather",
    description: "Get current weather for a city",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name" }
      },
      required: ["city"]
    }
  }]
}]

// Model response when it wants to call a tool:
{
  "candidates": [{
    "content": {
      "parts": [{
        "functionCall": {
          "name": "get_weather",
          "args": { "city": "San Jose" }
        }
      }]
    }
  }]
}

Notice the pattern: the schema definition is nearly identical (JSON Schema), but each provider wraps it differently. The model's response always contains: which function to call, and what arguments to pass. Your code handles the rest.

Why This Matters for AI-Generated UI

In generative UI systems, the "tools" aren't weather APIs — they're app capabilities. A fitness app might expose tools like log_workout, get_weekly_stats, set_goal. The agent calls these tools, gets the data back, and then generates a UI component tree to display the results. Generative UI is function calling where the final output is a rendered interface instead of text.

Why tool descriptions matter enormously

The model chooses which tool to call based entirely on the description field and parameter descriptions. Bad descriptions lead to wrong tool selection. This is a product/design decision:

Bad Description

"description": "Weather function"

Model doesn't know when to use it, might confuse it with a climate function or a forecast function.

Good Description

"description": "Get the current temperature and conditions for a specific city. Returns temp in °F, condition (sunny/cloudy/rainy), and humidity percentage."

Model knows exactly what it gets back and when to use it.

In the Wild: Function Calling at Scale

ChatGPT Plugins (now GPT Actions) was one of the first mass-market implementations of function calling. When you ask ChatGPT to "find flights to Tokyo," it calls the Kayak plugin's search_flights function with structured parameters. Thousands of businesses built plugins — each one is just a function calling schema that lets GPT interact with their service.

Siri and Alexa were doing a primitive version of function calling before LLMs. "Set a timer for 5 minutes" maps to an intent (set_timer) with a slot (duration: 5min). The difference with LLM-based function calling is flexibility: you don't need to pre-define every possible phrasing. The model figures out the intent and extracts the parameters from any natural language input.

Anthropic's Claude introduced "computer use" tool calls — the model can call functions like click(x, y), type(text), and screenshot() to operate a desktop computer. Same function calling pattern, radically different tools. This is where agents start interacting with the physical world.

10

RAG — Giving AI Access to Knowledge

LLMs only know what was in their training data. RAG (Retrieval Augmented Generation) connects them to external knowledge at query time: your documents, your database, your company wiki.

Imagine you're taking an exam. A standard LLM takes it closed-book, answering from memory. RAG takes it open-book: before answering, it searches a library, pulls out relevant pages, reads them, and answers using both memory and the retrieved material. Most production AI assistants are open-book exams. The interesting work is in how you build and search the library.

The pipeline, end to end

RAG is usually drawn as four boxes. That hides where it actually breaks. Real systems have eight stages, split between work you do once at build time and work you do on every query.

The RAG pipeline (and where each stage breaks):

Build time — runs when documents change:

  1. Ingest: pull docs from sources. PDFs, wikis, tickets, DBs.
  2. Chunk: split into passages. Too big = noisy hits. Too small = no context.
  3. Embed: text → vector. Wrong embedding model = bad recall forever.
  4. Index: store vectors in a DB. Stale index = stale answers.

Query time — runs on every user message:

  5. Retrieve: embed the query, fetch top-K. Misses here the model can't recover.
  6. Rerank: re-score the top-K with a cross-encoder. The cheap step that fixes recall.
  7. Augment: stuff passages into the prompt. Order and labels change the answer.
  8. Generate: the model answers with citations. Hallucinations if context is thin.

Embeddings, in one line: text gets converted to a vector — a coordinate in "meaning space." "Refund policy" and "return items" land near each other; vector search finds neighbors by meaning, not keywords. That's the whole trick.
Build-time work happens when your corpus changes. Query-time work happens on every user message. Most teams skip steps 2 and 6, then wonder why their RAG underperforms.

Chunking is where most teams lose

The most boring stage in the pipeline is the one that decides whether RAG works. A chunk too big returns noisy passages with the answer buried inside. A chunk too small loses the surrounding context the model needs to interpret it. There's no universal right answer — different content shapes want different strategies (a minimal baseline chunker is sketched after this list).

Fixed-size

Split every N tokens (typically 200–800), with overlap. Fast to build, predictable. Cuts mid-sentence, mid-table, mid-thought.

Best for: uniform prose like blog posts, marketing copy, news.

Semantic / structural

Split on meaningful boundaries: paragraphs, headings, sections. Preserves the author's structure. Slower to build, harder to tune.

Best for: docs with strong structure — manuals, contracts, policy documents.

Hierarchical

Index small chunks for retrieval but return larger parents at generation time. Best of both: precise hits, full context.

Best for: long documents where the answer needs surrounding context — research papers, legal filings.

Code- or schema-aware

Chunk by function, class, or table — never split a unit of code. Often paired with AST parsing.

Best for: codebases, API docs, structured data.
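
Here is that minimal fixed-size chunker with overlap, the baseline most teams start from. Sizes are counted in words for simplicity; production code would count tokens with the model's tokenizer:

// Fixed-size chunking with overlap. A real implementation would measure
// chunk size in tokens, not words.
function chunk(text: string, size = 300, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

// Overlap means a sentence cut at one chunk boundary still appears whole
// in the neighboring chunk, at the cost of some duplicated storage.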

Beyond plain vector search

Vector search alone is the 2023 baseline. The 2026 production stack adds three things:

Hybrid search combines dense vectors (good for meaning) with sparse keyword search like BM25 (good for exact matches: error codes, product SKUs, legal citations). Vectors miss "ERR_4032"; BM25 nails it. Run both, merge the results. This single change usually beats any amount of tuning to vectors alone.

Reranking takes the top 20–100 results from the cheap retriever and re-scores them with a slower, more accurate cross-encoder model (Cohere Rerank, Voyage Rerank, or a fine-tuned encoder). Cross-encoders look at the query and document together, so they catch nuance that bi-encoder vectors miss. Typical lift: 10–30% on retrieval quality for the cost of one extra model call.

Query rewriting handles the gap between how users phrase questions and how documents phrase answers. HyDE (Hypothetical Document Embeddings) is the well-known move: ask the LLM to draft a hypothetical answer first, then embed and search using that. The drafted answer often shares more vocabulary with real documents than the original question did. For multi-turn chats, query rewriting also rolls earlier turns into a self-contained search query so retrieval doesn't lose context.
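
Put together, the query-time upgrades look like the sketch below. Every service call is a stub standing in for real infrastructure (a vector DB, a BM25 index, a reranker API); the shapes are illustrative:

// Query-time RAG with hybrid search + reranking, using stand-in stubs.
type Hit = { id: string; text: string; score: number };

async function vectorSearch(query: string, k: number): Promise<Hit[]> { return []; }  // dense: by meaning
async function keywordSearch(query: string, k: number): Promise<Hit[]> { return []; } // sparse/BM25: exact matches
async function rerank(query: string, hits: Hit[], topN: number): Promise<Hit[]> {
  return hits.slice(0, topN); // a real reranker re-scores query+doc pairs with a cross-encoder
}

async function retrieve(query: string): Promise<Hit[]> {
  // Hybrid: run dense and sparse retrieval in parallel.
  const [dense, sparse] = await Promise.all([vectorSearch(query, 50), keywordSearch(query, 50)]);
  // Merge and deduplicate by document id.
  const merged = [...new Map([...dense, ...sparse].map((h) => [h.id, h])).values()];
  // Re-score candidates with the slower, more accurate model; keep the best few.
  return rerank(query, merged, 8); // the 8 passages that go into the prompt
}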

Knowledge graphs deserve a mention. When relationships matter more than passages — "who reports to whom," "what's connected to this incident" — a graph beats vectors. Most teams won't need this; the ones that do, know.

RAG vs fine-tuning vs long-context

By 2026 these are three real options, not one. Picking the wrong one costs months. The shape of the choice:

| | RAG | Fine-tuning | Long-context (stuff it all in) |
| --- | --- | --- | --- |
| Solves | Knowledge the model lacks | Behavior, tone, format the prompt can't get right | Single large document at a time |
| Freshness | As fresh as your indexer | Frozen at training time | Whatever you paste in |
| Cost shape | Per-query retrieval cost; cheap to update | Up-front training cost; cheap inference | High per-query token cost |
| Fails when | Retrieval misses the right chunk | Use case shifts; data drifts | Corpus is too big or context is too noisy |
| Reach for it when | Corpus changes often or is large | Output style or domain language won't budge with prompting | Corpus is small, stable, and fits in 200K–1M tokens |

The 2026 default order: prompt engineering first, then long-context if the corpus fits, then RAG when it doesn't, then fine-tuning only if behavior is still off. Many teams skip straight to fine-tuning because it sounds sophisticated. Most regret it.


Why RAG matters for product design

RAG creates UX problems pure chatbots don't have: where did this answer come from, which sources is the assistant allowed to see, and what happens when retrieval finds nothing. The design work on citations, scope clarity, and permissions is what separates trusted products from suspicious ones.

Key Idea

RAG quality is bottlenecked by retrieval, not generation. A frontier model with the wrong context produces a fluent wrong answer. A weaker model with the right context produces a useful right answer. When RAG feels broken, the fix is almost always upstream of the LLM: better chunks, hybrid search, a reranker.

In the Wild

Perplexity made citations the entire interface. Every claim is numbered and linked back to a source. Users trust it for research because they can verify, not because the model is special.

NotebookLM scopes RAG to documents you upload, never general training data. "Based on your sources" appears on every response. That scope clarity is the trust signal.

Cursor runs RAG over your codebase: it embeds your repo, retrieves relevant files for each request, and adds them to the context. The "intelligence" people praise is mostly retrieval quality, not the underlying model.

Glean and Elastic built enterprise search on RAG. The hard problem isn't retrieval — it's enforcing per-document access permissions so employees only see what they're allowed to see.

Cohere Rerank and Voyage Rerank dominate the reranker market. They're a single API call you bolt onto an existing vector search and they routinely deliver double-digit recall improvements. Most enterprise RAG stacks use one or the other.

11

The Agentic Loop — How Agents Reason and Act

A single function call is useful. But real agents call multiple functions in sequence, make decisions based on results, and adapt when things go wrong. This is the agentic loop.

The difference between a chatbot and an agent is simple: a chatbot responds once. An agent keeps going until the task is done.

[Figure: the agentic loop. User: "Dinner with Alex" → THINK ("Check calendar, find restaurant, book") → ACT (check_calendar({ person: "Alex" })) → OBSERVE ({ free, times: ["6pm","7pm"] }) → loop until done → RESPOND (generate the final UI / text response to the user).]
The ReAct pattern: Think → Act → Observe → repeat until done. Each loop is a separate API call where the model sees all previous steps.

A real multi-step example

Let's trace through a realistic generative UI agent scenario: the user says "Schedule dinner with Alex this Friday at a good restaurant near home."

  1. THINK: "I need Alex's availability first."
     ACT: check_calendar({ person: "Alex", date: "2026-04-10" })
     OBSERVE: { free: true, available: ["6pm-9pm"] }

  2. THINK: "She's free. Now find restaurants near home."
     ACT: search_restaurants({ near: "home", cuisine: "any", rating: ">4.0" })
     OBSERVE: [{ name: "Osteria", rating: 4.5 }, { name: "Sushi Gen", rating: 4.3 }, ...]

  3. THINK: "Found options. Check if Osteria has availability."
     ACT: check_reservation({ restaurant: "Osteria", date: "2026-04-10", time: "7pm", party: 2 })
     OBSERVE: { available: false, next_available: "8pm" }

  4. THINK: "7pm is full. 8pm works within Alex's window. Present options to user."
     ACT: Generate generative UI response — a card showing Osteria at 8pm with "Book" and "See other options" buttons.

Four tool calls, each building on the last. The model maintained context across all of them, made decisions based on intermediate results, and adapted when the first time slot wasn't available. That's an agent.

Key Idea

Each iteration of the loop is a separate API call. The entire conversation history — including all previous tool calls and results — gets sent back to the model each time. This is why context windows matter so much: a complex 10-step agent task might consume thousands of tokens just in history before the model even starts thinking about the next step.
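
The loop is short enough to read in full. Below is a minimal sketch with the model and tool executor stubbed out; a real implementation would send the history plus tool definitions to a provider API:

// The agentic loop: call the model, execute any tool it asks for,
// append the result to history, repeat. All functions are stand-ins.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type ModelTurn = { toolCall?: { name: string; args: object }; text?: string };

// Toy model: asks for one tool call, then answers once it has seen a result.
async function callModel(history: Message[]): Promise<ModelTurn> {
  const usedTool = history.some((m) => m.role === "tool");
  return usedTool
    ? { text: "Alex is free Friday 6-9pm." }
    : { toolCall: { name: "check_calendar", args: { person: "Alex" } } };
}

async function dispatch(call: { name: string; args: object }): Promise<unknown> {
  return { free: true, available: ["6pm-9pm"] }; // stub executor (see Chapter 9)
}

async function runAgent(userRequest: string, maxSteps = 10): Promise<string> {
  const history: Message[] = [{ role: "user", content: userRequest }];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await callModel(history);        // the FULL history goes back every call
    if (!turn.toolCall) return turn.text ?? "";   // no tool call means a final answer
    const result = await dispatch(turn.toolCall); // your code runs the tool
    history.push({ role: "assistant", content: JSON.stringify(turn.toolCall) });
    history.push({ role: "tool", content: JSON.stringify(result) });
  }
  return "Stopped: hit the max-step guard (see Chapter 15).";
}

runAgent("Is Alex free Friday?").then(console.log);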

In the Wild: Agentic Loops in Production

Claude Code (Anthropic's coding agent) is a textbook agentic loop. You say "refactor this module to use dependency injection." It thinks ("I need to read the file first"), acts (reads the file), observes (sees the current structure), thinks again ("I see 3 classes that need interfaces"), acts (edits file 1), observes (checks for errors), and loops until all files are updated and tests pass. A single user request can trigger 20+ iterations of the loop.

Devin (the AI software engineer by Cognition) chains together even longer loops: reading GitHub issues → planning an implementation → writing code → running tests → debugging failures → committing. Each step feeds into the next. When tests fail, it doesn't just stop — it reads the error, reasons about the cause, and tries a fix. Some tasks run 50+ loop iterations.

Google's Deep Research (in Gemini) uses extended agentic loops for research. It searches the web, reads articles, identifies gaps in its knowledge, searches again with refined queries, synthesizes findings, and produces a report. One research question can trigger dozens of search-read-think cycles over several minutes.

12

MCP — The Universal Plug for AI

Function calling lets a model use tools. But who decides which tools exist and how to connect to them? That's the problem MCP solves.

MCP (Model Context Protocol) is an open standard created by Anthropic that standardizes how AI models discover and use tools across any application. Think of it as USB for AI.S4

Analogy: Before and After USB

Before USB: Every device had its own cable. Your printer had a parallel port cable. Your mouse had a PS/2 connector. Your camera had a proprietary cable. If you wanted to connect a new device, you needed to find the right cable and install a custom driver.

After USB: One port, one standard. Plug anything in and it works. The computer asks "what are you?" and the device says "I'm a keyboard" or "I'm a camera" and they negotiate automatically.

MCP is USB for AI. Instead of every app building custom integrations with every AI model, MCP provides one standard protocol. An AI agent asks "what can you do?" and the app says "here are my functions." The agent can immediately use them.

[Figure: without MCP, the AI model needs a custom integration (code, schema, auth) for each of the Calendar, Email, Maps, and Fitness APIs. With MCP, one standard protocol layer sits between the model and every app: one MCP server per app, and any model works. The architecture has three roles: MCP Host (the AI app, e.g. Claude, Cursor), MCP Client (manages connections), and MCP Server (one per app: Calendar, Email, ...).]
MCP standardizes the connection between AI models and apps. One protocol, infinite tools.

What an MCP server exposes

An MCP server provides three things to the AI model:

  1. Tools — Functions the model can call (like create_event, search_files)
  2. Resources — Data the model can read (like file contents, database records)
  3. Prompts — Reusable prompt templates (like "summarize this document")

Here's what a simple MCP server for a fitness app looks like:

// MCP Server: Fitness Tracker
{
  "name": "fitness-tracker",
  "version": "1.0",
  "tools": [
    {
      "name": "log_workout",
      "description": "Record a completed workout session",
      "inputSchema": {
        "type": "object",
        "properties": {
          "exercise": { "type": "string", "description": "e.g. 'bench press'" },
          "sets": { "type": "number" },
          "reps": { "type": "number" },
          "weight_lbs": { "type": "number" }
        },
        "required": ["exercise", "sets", "reps"]
      }
    },
    {
      "name": "get_weekly_summary",
      "description": "Get workout stats for the current week",
      "inputSchema": {
        "type": "object",
        "properties": {
          "week_offset": {
            "type": "number",
            "description": "0 = this week, -1 = last week"
          }
        }
      }
    },
    {
      "name": "set_goal",
      "description": "Set a fitness goal for a specific exercise",
      "inputSchema": {
        "type": "object",
        "properties": {
          "exercise": { "type": "string" },
          "target_weight": { "type": "number" },
          "target_date": { "type": "string", "format": "date" }
        },
        "required": ["exercise", "target_weight"]
      }
    }
  ]
}
MCP + AI-Generated UI: The Platform Pattern

MCP defines what an app can DO. A generative UI protocol defines what the result LOOKS LIKE.

An app exposes its capabilities via MCP ("I can log workouts, show summaries, set goals"). When an agent calls those tools, generative UI renders the results as native components ("here's a card showing your weekly summary with a progress bar toward your goal").

Together, these standards mean: every app becomes agent-accessible with native, beautiful UI — without the app developer building a custom AI integration. That's the platform pattern emerging across the industry.

In the Wild: MCP Adoption Is Accelerating

Anthropic launched MCP in late 2024 and adoption has been rapid. As of early 2026, there are MCP servers for Slack, GitHub, Google Drive, Notion, Linear, Jira, Figma, Postgres databases, and hundreds more. Claude Desktop, Cursor, Windsurf, and other AI tools can connect to any MCP server — one protocol, instant integration.

Block (Square) and Apollo were early enterprise adopters, building internal MCP servers so their AI tools could interact with proprietary systems. Instead of building custom ChatGPT plugins AND custom Claude integrations AND custom Gemini integrations, they build one MCP server and it works everywhere.

Figma's MCP server lets AI agents read design files, inspect components, and even generate code from designs — all through standard MCP tool calls. This is the "USB for AI" vision in action: Figma implements MCP once, and every AI tool that speaks MCP can now interact with Figma designs.

Google Stitch shipped an MCP server in 2026, letting external AI agents interact with Stitch design projects programmatically. This shows how quickly MCP is becoming the default integration layer — even AI design tools are adopting it.

13

Orchestration Patterns — ReAct, Chains, and Routing

There's more than one way to build an agent. The orchestration pattern you choose shapes everything: reliability, speed, cost, and user experience.

Pattern 1: ReAct (Reasoning + Acting)

This is the pattern from Chapter 8 — the model alternates between thinking and acting. It's the most common and most flexible pattern.

Think "I need..." Act call_tool() Observe result = ... Think "Next..." Act call_tool() ...
ReAct: flexible, adaptive, handles complex multi-step tasks. Downside: each step is an API call (slow, expensive).

Pattern 2: Parallel Tool Calls

When multiple independent tools need to be called, a smart agent calls them all at once instead of sequentially:

[Figure: Think: "need weather AND calendar" → get_weather("SJ") and get_calendar("Fri") issued in parallel → both results arrive at once → generate the combined UI.]
Parallel execution: when tools are independent, call them simultaneously. Cuts latency in half (or more).
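
In code, the parallel pattern is just issuing the independent calls together and awaiting them as a group. A sketch with stubbed-out tools:

// Parallel tool execution for independent calls (hypothetical tools).
async function getWeather(city: string) { return { temp: 72, condition: "sunny" }; } // stub
async function getCalendar(day: string) { return { events: ["Dinner 7pm"] }; }       // stub

async function gatherContext() {
  // Independent tools: run simultaneously instead of one after another.
  const [weather, calendar] = await Promise.all([getWeather("San Jose"), getCalendar("Friday")]);
  return { weather, calendar }; // both results feed one combined UI generation
}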

Pattern 3: Router Pattern

A lightweight model classifies the request and routes it to specialized handlers:

[Figure: user request → router (fast classifier) → simple query → Nano; standard task → Flash; complex reasoning → Pro → generative UI response.]
The router pattern optimizes cost and latency by sending each request to the cheapest model that can handle it.
Key Idea

In production, most agents use a combination of these patterns. The router picks the right model, that model uses ReAct for complex tasks with parallel tool calls where possible. Designing this orchestration logic — deciding which pattern for which scenario — is a core product decision that shapes the user experience.

In the Wild: Orchestration Patterns in Production

Uber's customer support AI uses a router pattern: a fast classifier determines if the query is about a ride issue, a payment issue, or an Eats issue, then routes to a specialized agent for each domain. Each specialized agent has its own tool set and system prompt optimized for that domain. This is cheaper and more accurate than one monolithic agent handling everything.

LangChain and LlamaIndex popularized orchestration frameworks that make these patterns composable. LangChain's "agent executor" implements the ReAct loop. Their "sequential chain" implements linear pipelines. Their "router chain" implements the routing pattern. These frameworks exist because orchestration is hard enough to warrant dedicated tooling.

OpenAI's Assistants API handles orchestration server-side — you define tools and the API manages the think-act-observe loop for you, calling your functions and feeding results back automatically. This is a bet that most developers don't want to build their own orchestration layer — they just want to define tools and let the platform handle the rest.

14

Agents in Production — Trust, Control, and the Real World

The primer has covered how agents work mechanically. This chapter covers what happens when you ship them to real users — the UX patterns, trust frameworks, and protocols that make agents usable.

57% of organizations now have agents in production. But "production" doesn't mean "autonomous." The biggest lesson from 2025-2026: users want agents that are powerful but controllable. The UX challenge is designing the right level of autonomy for each context.

[Figure: the autonomy spectrum, from low to full autonomy. Draft: agent suggests, human decides. Approve: agent acts with human approval. Monitor: agent acts, human watches and can stop it. Autonomous: agent acts freely, reports afterward. Most production agents in 2026 sit in the "Approve" or "Monitor" zone, not fully autonomous. The right level depends on: risk of the action, reversibility, user trust, and domain sensitivity.]
The autonomy dial: product teams choose where each agent action sits on this spectrum. Higher risk = more human control.

Six core agent UX patterns

  1. Intent Preview: Agent shows its plan before acting. "I'll check your calendar, find restaurants nearby, and book one. OK?"
  2. Autonomy Dial: Users can adjust how much control the agent has. Gmail's Smart Compose (draft) vs Autopilot (send).
  3. Action Audit: Every agent action is logged and visible. Users can see what happened, when, and why.
  4. Confidence Signals: Agent shows how certain it is. High confidence = proceed. Low confidence = ask for human input.
  5. Escalation Pathways: When the agent is stuck or unsure, it hands off to a human smoothly — not as a failure, but as designed behavior.
  6. Scope Cards: Panels showing what the agent can and cannot access. "This agent can read your calendar and email but cannot make purchases."

Computer Use agents

A new category: agents that literally see and control screens. Claude Computer Use operates a full desktop environment. OpenAI's Operator controls a remote browser. Google's Project Mariner works inside Chrome. These agents take screenshots, click buttons, type text, and navigate apps just like a human would.

The UX challenge is unique: the user watches their screen being controlled by an AI. This requires real-time observation (screen sharing), permission gates before sensitive actions, and a kill switch to stop the agent immediately.

Observability — debugging an agent that thinks for itself

A multi-step agent fails in ways a single API call doesn't. It picked the wrong tool. It called the right tool with bad arguments. It looped. It silently degraded after a model upgrade. The only way to debug any of this is a trace — a structured log of every model call, every tool call, every input, every output, in order. By 2026 this is standard infrastructure: each step gets a span, the trace tree shows the full reasoning path, and you can replay a failing run in isolation. LangSmith, Braintrust, and Langfuse are common platforms; OpenTelemetry's GenAI semantic conventions define emerging shared fields for model calls, tool calls, token usage, latency, and errors.S7 The headline rule: if you can't replay a bad run with the exact same inputs, you can't fix it. Build trace capture before you build the second tool.
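A minimal sketch of that trace structure: one span per model or tool call, grouped under a trace id so a failing run can be replayed with identical inputs. Field names here are illustrative; in production you would emit OpenTelemetry spans rather than push to a local array.

// One span per step; the exact inputs are captured so a bad run can be replayed.

interface Span {
  traceId: string;
  kind: "model_call" | "tool_call";
  name: string;            // model name or tool name
  input: unknown;          // exact input, required for replay
  output?: unknown;
  startedAt: number;
  endedAt?: number;
  error?: string;
}

const spans: Span[] = [];

async function traced<T>(
  traceId: string,
  kind: Span["kind"],
  name: string,
  input: unknown,
  fn: () => Promise<T>,
): Promise<T> {
  const span: Span = { traceId, kind, name, input, startedAt: Date.now() };
  spans.push(span);
  try {
    const output = await fn();
    span.output = output;
    return output;
  } catch (err) {
    span.error = String(err); // failed spans are the ones you replay
    throw err;
  } finally {
    span.endedAt = Date.now();
  }
}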

In the Wild

Intercom's Fin is one of the most successful customer service agents in production. It resolves 50%+ of support tickets autonomously but escalates to human agents for complex cases, a textbook confidence-based escalation pattern.

Replit Agent builds entire applications from natural language. It shows its plan (intent preview), executes steps one at a time (audit trail), and asks for approval before deploying (autonomy gate). Users can see every file it creates and modify any step.

A2A (Agent-to-Agent) is the emerging open protocol for agents to delegate work to other agents — one agent can hand off subtasks to specialized peers without bespoke integration code. Alongside MCP (agent-to-tools) and AG-UI (agent-to-frontend), these three protocols form the infrastructure layer for multi-agent systems.

Production
15

Error Handling — When Agents Fail Gracefully

Agents fail. APIs time out, models hallucinate, tool calls return unexpected data. How you handle failure defines the user experience.

In a traditional app, errors are predictable: network error, invalid input, server down. In an agentic system, you get entirely new failure modes:

Failure TypeExampleHow to Handle
Wrong tool selection Agent calls send_email when user wanted send_message Confirmation step before executing irreversible actions
Invalid arguments Agent passes "date": "next Friday" instead of "2026-04-10" Validate arguments against schema before executing; ask model to retry with correct format (sketched after this table)
Tool execution failure Restaurant API is down Return structured error to model; let it try alternatives or inform user
Hallucinated tool Agent tries to call book_flight but no such tool exists Validate tool name before execution; return "tool not found" to model
Infinite loop Agent keeps retrying a failed action Set max iteration count (e.g., 5 loops max); break and inform user
Schema violation in output generative UI output has invalid component nesting Validate against schema; show fallback UI; log for monitoring
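One way to implement the validation rows above, sketched in TypeScript with the Zod library. The book_table tool and its schema are hypothetical; the pattern is what matters: validate before executing, and return structured errors to the model so it can retry instead of crashing the run.

import { z } from "zod";

// Illustrative tool schema. Rejects "next Friday" by requiring an ISO date.
const bookTableArgs = z.object({
  restaurant: z.string().min(1),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/, "date must be YYYY-MM-DD"),
  partySize: z.number().int().min(1).max(20),
});

function validateToolCall(toolName: string, rawArgs: unknown) {
  if (toolName !== "book_table") {
    // Hallucinated tool: tell the model rather than throwing.
    return { ok: false as const, error: `tool not found: ${toolName}` };
  }
  const parsed = bookTableArgs.safeParse(rawArgs);
  return parsed.success
    ? { ok: true as const, args: parsed.data }
    : { ok: false as const, error: parsed.error.message };
}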

The graceful degradation ladder

A well-designed agentic UI should degrade gracefully through these stages:

✅ Full success Full generative UI rendered ⚡ Partial success Show available data + loading states 🔄 Retry with fallback Try alternative tool or simpler model 💬 Text fallback Fall back to plain text response
Design for failure at every level. The user should always see something useful — never a blank screen or cryptic error.
The Same Request, Three States Restaurant Agent ✓ Booked Osteria Fri 8PM · 2 guests Add to Calendar Modify Reservation Full Success Complete UI rendered Restaurant Agent ⏳ Checking... Osteria Fri · Finding times... Partial + Skeleton Show what we have Restaurant Agent ✕ Couldn't book Reservation service is unavailable Try Again Call Restaurant Error + Fallback Always offer a next step User asked: "Book Osteria for Friday at 7pm"
The same user request rendered in three states. A well-designed generative UI handles all three gracefully — the user always sees something useful and always has a next step.
Error States in AI-Generated UI

Your generative UI schema should include first-class error and loading states. A component that says "state": "loading" renders a skeleton screen. "state": "partial" renders available data with placeholders. "state": "error" renders a retry card. These aren't afterthoughts — they're the most important states to design because they're what users see when things go wrong (which is often).
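A sketch of what state-aware rendering can look like, using this chapter's state names. The render targets are placeholders for whatever your platform renderer actually emits:

// First-class states: every component can render ready, loading, partial, or error.

type ComponentState = "ready" | "loading" | "partial" | "error";

interface UIComponent {
  type: string;                   // "Card", "Text", ...
  state?: ComponentState;         // defaults to "ready"
  props?: Record<string, unknown>;
  children?: UIComponent[];
}

function render(node: UIComponent): string {
  switch (node.state ?? "ready") {
    case "loading":
      return `<Skeleton for=${node.type} />`;       // skeleton screen
    case "partial":
      return `<${node.type} placeholders=true />`;  // available data + placeholders
    case "error":
      return `<RetryCard failed=${node.type} />`;   // always offer a next step
    default:
      return `<${node.type}>${(node.children ?? []).map(render).join("")}</${node.type}>`;
  }
}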

In the Wild: How Products Handle Agent Failures

ChatGPT's browsing feature frequently hits websites that block it. Instead of crashing, it tells the user "I wasn't able to access that site" and offers to try alternative sources. This conversational fallback pattern — admit the failure, explain why, offer alternatives — is the baseline every agentic product should hit.

Tesla Autopilot is the hardware analogy for graceful degradation. Full self-driving → lane keeping → adaptive cruise → manual control. Each level is a fallback when the one above it can't handle the situation. It never just stops working — it degrades to a less capable but still functional mode and alerts the driver.

Alexa's confidence thresholds show a different approach: when the model's confidence in its interpretation is below a threshold, it asks for confirmation instead of acting. "Did you mean turn off the bedroom lights?" This is cheaper and safer than executing a wrong action and having to undo it. For generative UI, a confirmation card before irreversible actions follows the same principle.

Notion AI handles hallucination risk by always including an "AI-generated" badge on its outputs and providing the source material alongside the summary. This UI-level pattern — flagging uncertainty visually — is something generative UI should consider as a first-class component state.

16

AI Safety & Guardrails — Keeping AI in Bounds

Guardrails are the protective systems that prevent AI from generating harmful content, leaking private data, or acting beyond its intended scope. In 2026, they're also a regulatory requirement.

An LLM without guardrails will attempt anything you ask. Guardrails constrain it — blocking harmful content, filtering personal data, preventing jailbreaks, and keeping the AI focused on its intended task. Think of them as the brakes on a very powerful car.

Layered Guardrail Architecture User Input INPUT GUARDS Jailbreak detection PII filtering Topic control LLM OUTPUT GUARDS Hallucination check Content safety Action validation Safe Output EU AI Act (Aug 2026): Transparency requirements are UX mandates Users informed of AI · Content labeled · Deepfakes disclosed
Guardrails wrap around the LLM on both input and output sides. The EU AI Act adds transparency requirements that are explicitly UX obligations.S5
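A minimal sketch of that layered architecture in TypeScript. The individual checks are stubs (production systems use dedicated classifiers and PII detectors, not regexes), but the shape is the point: input guards run before the model, output guards after, and each refusal message is UX copy you choose deliberately.

// Layered guards: every check returns allowed/blocked plus a reason.

type GuardResult = { allowed: boolean; reason?: string };

const inputGuards: Array<(text: string) => GuardResult> = [
  (t) => ({ allowed: !/ignore (all )?previous instructions/i.test(t),
            reason: "possible jailbreak" }),
  (t) => ({ allowed: !/\b\d{3}-\d{2}-\d{4}\b/.test(t), reason: "possible SSN" }),
];

const outputGuards: Array<(text: string) => GuardResult> = [
  (t) => ({ allowed: t.length < 10_000, reason: "runaway output" }),
];

async function guardedCall(
  userInput: string,
  callModel: (input: string) => Promise<string>,
): Promise<string> {
  for (const guard of inputGuards) {
    const r = guard(userInput);
    // What the user sees on a block is a design decision, not an afterthought.
    if (!r.allowed) return `Request blocked: ${r.reason}`;
  }
  const output = await callModel(userInput);
  for (const guard of outputGuards) {
    const r = guard(output);
    if (!r.allowed) return "Response withheld. Please rephrase and try again.";
  }
  return output;
}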

The UX of guardrails

When a guardrail triggers, the user sees... something. Exactly what they see is a design decision that directly affects trust.

Key Idea

Four principles for trustworthy AI design: Transparency (users know they're interacting with AI), Proportionality (restrictions match the risk level), Reversibility (actions can be undone), and Contestability (users can challenge AI decisions). These aren't just good design — they're increasingly legal requirements.

17

Evaluation — Measuring If AI Actually Works

A model scoring 90% on a benchmark might still frustrate real users. Evaluation is how you close the gap between measured performance and experienced quality.

Quality is the #1 barrier to production AI — cited by 32% of teams as their top challenge. "Running evals" is the AI equivalent of usability testing: you systematically check whether the system works for real scenarios, track scores over time, and use the data to decide what to ship.

The eval process, step by step

The Eval Loop 1. Define What does "good" look like? 2. Build Tests 50-500 examples with expected output 3. Run Full system: prompt + RAG + tools 4. Grade Auto / LLM / human review 5. Track Over time Change prompt / model / pipeline → run again → compare Example: Improving a support bot Monday: 200 test cases. Accuracy: 91%. Helpfulness: 4.2/5. Tuesday: Changed prompt. Accuracy: 91%. Helpfulness: 3.8/5. ⚠️ Wednesday: Adjusted. Accuracy: 92%. Helpfulness: 4.1/5. ✓ Ship it. Every change triggers a new eval run. Scores tell you if it helped or hurt.
The eval loop: define → build test set → run → grade → track. Every change triggers a new run.

Step 1: Define what "good" means

This is the hardest and most important step. Writing a grading rubric is the same skill UX researchers use when creating annotation guides for usability studies — and it has the same failure mode: a vague rubric produces noisy, irreproducible scores no matter how good the grader is.

// Example rubric for a customer support bot
{
  "accuracy": {
    "5": "Correct answer with all relevant details",
    "3": "Partially correct, some wrong info",
    "1": "Completely wrong or hallucinated"
  },
  "helpfulness": {
    "5": "Fully resolved the user's issue",
    "3": "Some useful info but didn't resolve",
    "1": "Useless or made things worse"
  }
}
Try It: Starter rubric for your use case
Pick what you're building. Get a JSON rubric you can paste into your eval harness as a starting point.

The three grading approaches

Three Ways to Grade AI Output Automated Did the JSON parse? Does it contain the answer? Fast · Cheap · Objective Best for: factual tasks LLM-as-Judge Strong model grades weaker model's output Scalable · ~85% human agree Best for: tone, helpfulness Human Review Real people grade the outputs Gold standard · Slow Best for: edge cases, trust In practice: Automated (100%) → LLM-judge on failures (20%) → Human on sample (5%) A grading funnel — most issues caught automatically, nuance handled by AI, humans validate.
Most teams use all three in a funnel: automated catches obvious issues, LLM-judge handles nuance, humans validate the hardest cases.

LLM-as-judge, the careful version

Using a strong model to grade a weaker model's output is now standard. It scales. It's cheap relative to humans. And it's full of biases that quietly invalidate your scores if you're not careful.

Four ways an LLM judge gives you the wrong answer Position bias Prefers whichever option comes first (or second). Fix: randomize order, run twice. Length bias Longer answer = "more thorough" in judge's eyes. Fix: penalize length in rubric. Self-preference A model judging its own family scores it higher. Fix: judge with a different family. Vague rubric "Rate helpfulness 1-5" drifts run to run. Fix: anchor each score to examples. Two patterns that beat naive judging Pairwise > pointwise. "Which is better, A or B?" is more reliable than "rate this 1-5." Models are bad at absolute scoring, decent at comparison. Build leaderboards from pairwise wins.
LLM judges aren't oracles. Audit them against a small human-graded set before you trust the dashboard. If judge-vs-human agreement is below ~80%, your numbers are noise.

Two practical defaults: require chain-of-thought from the judge ("explain your reasoning before giving a score") — it forces the judge to actually engage with the rubric instead of pattern-matching. And calibrate against humans periodically: have humans grade 50–100 examples, compare to the judge, and treat low-agreement criteria as untrustworthy until you fix the rubric.
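Here is what those two defaults can look like in code: a sketch of a pairwise judge that randomizes answer order (to counter position bias) and requires reasoning before the verdict. The prompt wording and callModel() are assumptions, not any eval platform's API.

// Pairwise judge with chain-of-thought and randomized A/B order.

async function pairwiseJudge(
  question: string,
  answer1: string,
  answer2: string,
  callModel: (prompt: string) => Promise<string>,
): Promise<"answer1" | "answer2"> {
  // Randomize which answer appears first; run twice and average in practice.
  const swap = Math.random() < 0.5;
  const [a, b] = swap ? [answer2, answer1] : [answer1, answer2];

  const prompt = `Question: ${question}

Response A: ${a}

Response B: ${b}

Explain your reasoning step by step, comparing accuracy and helpfulness.
Then, on the final line, output exactly "WINNER: A" or "WINNER: B".`;

  const verdict = await callModel(prompt);
  const aWon = /WINNER:\s*A$/.test(verdict.trim());
  return aWon !== swap ? "answer1" : "answer2"; // undo the shuffle
}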

Regression testing across model versions

Every model upgrade — Sonnet 4.5 to 4.6 to 4.7, GPT-4o to GPT-5, Gemini 2.5 to whatever's next — is a behavior change in production. Sometimes it's a big upgrade. Sometimes a regression on the queries that matter most to you. The only way to know is to keep a frozen golden set and re-run it on every new model.

The minimum viable version: ~200 examples that represent your real query distribution (sampled from production logs, scrubbed), with expected outputs or rubric scores. On every model upgrade or prompt change, re-run the set, diff the per-example scores against the previous run, and flag regressions before they ship. Most eval platforms (Braintrust, LangSmith, Langfuse) make this a one-click operation.
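A minimal sketch of the diff step, assuming per-example scores from each run. The types and the threshold are illustrative:

// Compare the candidate run against the previous run and flag regressions.

interface RunResult { exampleId: string; score: number } // 0..1

function diffRuns(
  previous: RunResult[],
  current: RunResult[],
  threshold = 0.1, // flag per-example drops larger than this
): string[] {
  const prevById = new Map(previous.map((r) => [r.exampleId, r.score]));
  return current
    .filter((r) => {
      const before = prevById.get(r.exampleId);
      return before !== undefined && before - r.score > threshold;
    })
    .map((r) => `REGRESSION ${r.exampleId}: ${prevById.get(r.exampleId)} -> ${r.score}`);
}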


Online evals — what users actually do

Offline evals tell you the model is correct. Online evals tell you the product works. They measure different things and you need both.

The five things every eval program tracks

Drop the "types of evaluation" framing — it's the wrong axis. The right axis is: what dimension of quality are you measuring, and which method gives you the cheapest reliable signal on it?

DimensionWhat it answersCheapest reliable method
CorrectnessDid it produce the right answer?Automated checks against labeled examples
HelpfulnessDid it actually solve the user's problem?LLM-judge with a rubric, audited against humans
SafetyDid it avoid harmful, off-policy, or sensitive output?Automated guardrails + adversarial test set
Latency & costIs it fast and affordable enough?Production telemetry (TTFT, p50/p95, $/task)
Real-world impactAre users better off because of it?Online A/B tests on outcome metrics
Key Idea

Benchmark scores measure the model. Custom evals measure your product. A model that scores 95% on MMLU can produce a terrible support bot if the 5% failures land on the queries your users care about most. Build evals on the queries you actually see in production.

In the Wild

Braintrust, LangSmith, and Langfuse dominate the eval-platform market. They handle test-set runs, grading, regression tracking, and online tracing in one place. Most production AI teams pick one and never look back.

OpenAI, Anthropic, and Google all run frozen internal eval suites against every model release. The public benchmarks (MMLU, SWE-bench, HumanEval) are a small fraction of what they actually measure. The real evals are private and use-case specific — exactly the kind you should be building.

METR's 2025 study found experienced developers using AI coding tools were 19% slower, despite believing they were 20% faster.S6 The perception gap is exactly why offline accuracy and online outcomes both have to be measured. Either one alone lies.

18

On-Device vs Cloud — The Tradeoff Triangle

Mobile platforms are increasingly running models both in the cloud AND on the device itself. Understanding this dual architecture is essential for anyone designing AI-powered experiences.

Every AI request faces a three-way tradeoff between latency, quality, and privacy. Whether you're building for mobile, web, or desktop, there is no option that wins on all three.

Speed (Latency) Quality (Capability) Privacy (On-device) Gemini Nano On-device Gemini Pro Cloud Hybrid Best of both Pick any two. You can't have all three at maximum.
The tradeoff triangle. On-device = fast + private but limited capability. Cloud = powerful but slower and data leaves the device. Hybrid strategies try to balance all three.

Practical comparison

On-Device (Nano, Phi, Apple)Cloud — Fast (Flash, Haiku)Cloud — Frontier (Pro, Sonnet)
Latency~50–200ms~200–500ms~1–3s
Context Window~4–32K tokens~1M tokens~1M tokens
CostFree (runs on device)Very lowModerate
PrivacyData never leaves phoneData sent to serverData sent to server
OfflineYesNoNo
UI GenerationSimple components onlyStandard layoutsComplex, multi-component
Best ForQuick actions, autocomplete, simple classificationMost generative UI tasksComplex reasoning, multi-step agents
Key Idea for generative UI Design

Your generative UI protocol needs to work across this entire spectrum. That means: compact schemas that fit in Nano's small context window, graceful degradation when the on-device model can't handle a complex layout, and a clear escalation path from on-device → cloud when needed. This is a core architectural decision that shapes the entire protocol design.

In the Wild: On-Device vs Cloud in Shipping Products

Apple Intelligence implements a tiered approach almost identical to what generative UI needs. Simple tasks (notification summaries, smart reply suggestions, text proofreading) run entirely on-device via Apple's ~3B parameter model. Complex tasks (image generation with Image Playground, deep writing assistance) route to Apple's "Private Cloud Compute" servers. The decision happens automatically — the user never chooses.

On-device call screening (available on Pixel and Galaxy devices) is a pure on-device success story. A local model transcribes the caller's speech in real-time — no network needed. It works because the task (speech→text for a short utterance) fits comfortably within an on-device model's capability. This is the kind of scoped, well-defined task that on-device excels at.

Samsung Galaxy AI's Live Translate runs on-device for real-time phone call translation. The latency requirement (sub-200ms) makes cloud infeasible. But for complex features like Chat Assist (rewriting messages in different tones), they route to cloud because tone-shifting requires more sophisticated reasoning than the on-device model can handle.

Spotify's DJ feature uses cloud models to generate the DJ's commentary (creative, personalized text) but on-device models for the voice synthesis (latency-critical). Splitting one feature across on-device and cloud models — each doing what it's best at — is a pattern you'll use in generative UI.

19

Cost, Latency & the Performance Triangle

Every AI product lives inside a triangle: quality, speed, and cost. You can optimize two at the expense of the third. Every design decision shifts the balance.

Quality Accuracy Speed Latency Cost $/request Pick any two.
The iron triangle: fast + cheap = lower quality. High quality + fast = expensive. High quality + cheap = slow.

The numbers designers should know

MetricThresholdWhy It Matters
Time to first token (TTFT)< 200ms idealUsers perceive streaming responses as 40-60% faster than waiting
Output token cost3-8x input costEvery word the AI writes costs more than every word it reads
StreamingNon-negotiableShow tokens as they generate — never make users stare at a spinner
Prompt caching50-90% savingsReusing system prompts across calls is the easiest cost win
Try It: AI Cost Calculator (2026)
Per-query, per-conversation, and per-MAU cost. Toggle prompt caching to see the single biggest cost lever in production AI.

Prompt caching — the single biggest cost lever in 2026

Prompt caching is the rare optimization that's both massive and free. By 2026 every major provider supports it (Anthropic, OpenAI, Google), and any production app with a non-trivial system prompt that isn't using it is leaving 50–90% of input cost on the table. It deserves more than a bullet point.

What it actually does. Every API call has a prefix that almost never changes — the system prompt, tool definitions, few-shot examples, sometimes a long retrieved document. The provider runs that prefix through the model once and stores the resulting internal state on their side. When your next call arrives within the cache window, the provider skips re-processing the prefix and starts from the cached state. You're billed at roughly 10% of the normal input rate for the cached portion.

Prompt caching: cache miss vs cache hit Cache MISS (first call, or after TTL expires) Process prefix (2,000 tokens) — full price User msg Generate output Cost: 2,000 × full input rate + output. Provider stores prefix state for ~5 minutes. Cache HIT (subsequent calls within TTL) Reuse cached prefix — billed at ~10% rate User msg Generate output Cost: 2,000 × ~10% input rate + output. Effectively the prefix is free for active sessions. The TTL gotcha: caches expire after ~5 minutes idle. Bursty traffic gets misses; steady traffic gets hits.
Cache misses are full-price. Cache hits are nearly free for the cached portion. Throughput patterns determine which one you mostly get.

What's cacheable. Anything that's identical across calls and lives at the start of the prompt: system instructions, tool/function definitions, few-shot examples, large retrieved documents shared across users. Anything that varies per user — chat history, the user's question — has to come after the cacheable prefix.
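In code, opting in is one field. A sketch using Anthropic's cache_control parameter (OpenAI applies prefix caching automatically, so there is no equivalent field to set); the model name and prompt contents are placeholders:

import Anthropic from "@anthropic-ai/sdk";

// Stable prefix first, marked cacheable; per-user content after.

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const STABLE_SYSTEM_PROMPT =
  "You are a support agent for Acme. Follow the policies below..."; // stands in for ~2,000 stable tokens

async function cachedCall(userMessage: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5", // illustrative model name
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: STABLE_SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // cache everything up to here
      },
    ],
    // Everything that varies per call comes after the cached prefix.
    messages: [{ role: "user", content: userMessage }],
  });
}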

The 5-minute TTL gotcha. Most providers expire idle caches after ~5 minutes. A product with steady traffic gets near-100% cache hits and pockets the savings. A product with bursty or low traffic mostly pays for misses. If your traffic is uneven, batch user requests through a small pool of "warm" sessions, or consider Anthropic's longer-TTL extended cache (1 hour) for high-leverage prompts.

Worked example. A support bot with a 2,000-token system prompt, processing 1,000 messages/hour at ~80% cache hit rate on Sonnet 4.6: caching saves roughly $50/day, or $1,500/month — about 90% of what you'd otherwise spend on input tokens. That's an extra ~$18K/year in margin, gained by adding one parameter to your API call.

Key Idea

Prompt caching vs KV caching. They sound similar and people conflate them. KV caching is what happens inside a single generation — the model caches its own intermediate state so it doesn't recompute earlier tokens as it generates each new one. It's automatic and you don't think about it. Prompt caching is what happens across calls — the provider caches your prompt prefix between requests, billed to you. It's opt-in and you absolutely think about it.

How inference gets faster and cheaper

In Chapter 2 we defined inference as the process of using a trained model to generate output. Every response your product generates is an inference call. The AI industry has developed several techniques to make these calls faster and cheaper — and understanding them helps you make product architecture decisions.

Inference Optimization Techniques Prompt Caching Reuse processed system prompts across calls. If 2K tokens repeat every call, cache them. Impact: 50-90% cost savings on input tokens Speculative Decoding Small fast model drafts tokens. Large model verifies in batch. Keeps quality, gains speed. Impact: 2-3x speed, zero quality loss Quantization Reduce precision of model weights (32-bit → 4-bit). Smaller, faster model. Impact: 2-4x smaller, slight quality trade KV Caching Cache intermediate computations from earlier tokens so the model doesn't reprocess them. Impact: faster generation of long responses You don't implement these — the model provider does. But they affect the pricing and speed you get.
Four inference optimizations that shape the AI products you build. Prompt caching is the only one you directly control. The others happen at the provider level but affect your cost and latency.

The inference provider landscape

An important distinction: the company that trained a model isn't always the company that serves it. Meta trains Llama. But you can run Llama inference through AWS Bedrock, Together AI, Fireworks, Groq, or your own servers. This decoupling matters because the same model can come with very different prices, speeds, and reliability depending on who serves it, which makes the serving provider a product decision, not just an infrastructure detail.

Key Idea

Inference is where all the money flows in production AI. Training is a one-time cost borne by the model lab. Inference is an ongoing cost borne by every product using the model. When someone says "AI is expensive," they mean inference is expensive. When someone says "AI is getting cheaper," they mean inference prices are dropping (they fell roughly 80% between early 2025 and early 2026). Every optimization in this section — caching, speculation, quantization — exists to make inference cheaper and faster.

Key Idea

Every design decision is a cost decision. Longer system prompts mean more input tokens. Verbose AI responses mean more output tokens, which cost 3–8x more. Model routing (Chapter 7) is the biggest lever after caching: send simple tasks to cheap models, complex tasks to expensive ones. A well-designed routing system can cut costs 30–50% with no perceptible quality loss.

When self-hosting open models beats APIs (and when it doesn't)

By 2026 open models — Llama, Mistral, DeepSeek, Qwen — are competitive with closed frontier models on most non-frontier tasks. That makes "should we self-host?" a real question instead of an obvious no. The answer is still usually no, but the exceptions matter.

Self-hosting beats APIs when…APIs beat self-hosting when…
You have very high steady volume (millions of requests/day) and unit economics dominate everything else. You have variable or low volume — GPU utilization tanks, ops cost dominates.
Data residency or compliance forbids sending data to third parties (healthcare, defense, regulated finance). You can use a regional or VPC-deployed API endpoint instead — most providers offer this now.
You're fine-tuning heavily and need full control over weights and training. Provider fine-tuning APIs (LoRA endpoints) cover the use case.
You need a model the labs don't sell — a specific size, an older checkpoint, an embedding model with custom tokenizer. You can pick from a frontier model + a cheap routing model and that's enough.
Latency is so tight that even a colocated API endpoint isn't fast enough (Groq, Cerebras territory). 200–500ms TTFT from a hosted API is acceptable.

The hidden cost of self-hosting isn't the GPUs. It's the ops team. Running production inference well — autoscaling, monitoring, model upgrades, security patching, handling traffic spikes — is a full-time SRE function. Most teams that try it eventually move back to a hosted inference provider (AWS Bedrock, Together AI, Fireworks, Groq) which gives them the open-model portability without the ops burden. The genuinely-self-hosted population is small and specialized.

In the Wild

Streaming is the single biggest UX win. ChatGPT, Claude, and Gemini all stream tokens as they generate. Users start reading while the model is still writing, so streaming feels interactive rather than like waiting. Products that show a loading spinner until the full response is ready feel dramatically slower, even when total latency is identical.

Prompt caching is now standard across major providers. OpenAI documents up to 90% input-token cost reductions and latency wins when repeated prompt prefixes hit the cache.S8 The teams that wire this in early usually get a cost win without changing the product.

Groq and Cerebras serve open models on custom silicon at speeds frontier APIs can't match — hundreds of tokens per second on Llama-class models. For latency-bound use cases (live voice, autocomplete), they're a different category of product, not just a cheaper one.

Cursor's model routing sends most autocomplete to a small fast model, escalates harder requests to a frontier model, and uses Anthropic's longer-TTL caching on the codebase context. Three optimizations stacked: routing, caching, model size. None of them are visible to the user.

20

AI-Generated UI — From Model Output to Rendered Interface

Everything in this book converges here. Generative UI takes an LLM's structured output and transforms it into rendered interface components — React on web, SwiftUI on iOS, Jetpack Compose on Android, or any other renderer.

Let's trace the full journey — from a user's voice to pixels on screen — using everything we've learned:

User: "Show my workout stats" Natural language intent ① TOKENIZE → [4438, ...] (Ch.1) ② ROUTE → Flash (Ch.6) ③ DISCOVER TOOLS (MCP) → Tools (Ch.9) ④ CALL TOOLS → get_summary() ⑤ GENERATE generative UI JSON { type: Card, ... } (Ch.5) ⑥ RENDER → Native UI Native UI on screen 🎉 ← Chapters 1-4 ← Chapter 6 ← Chapter 9 ← Chapters 7-8 ← Chapter 5 ← Generative UI Everything in this book flows through this pipeline
The complete generative UI pipeline — from natural language to native UI. Each step maps to a chapter in this book. Step 6 is where generative UI meets the native rendering layer.

The generative UI schema (conceptual)

At its core, generative UI defines a tree of components. Each component has a type, properties, and optional children. The model generates this tree, and a platform renderer turns it into native UI:

// generative UI response for "Show my workout stats"
{
  "root": {
    "type": "Column",
    "children": [
      {
        "type": "Text",
        "value": "This Week's Workouts",
        "style": "headlineMedium"
      },
      {
        "type": "Card",
        "variant": "elevated",
        "children": [
          {
            "type": "Row",
            "mainAxisAlignment": "spaceBetween",
            "children": [
              { "type": "Text", "value": "Sessions", "style": "labelLarge" },
              { "type": "Text", "value": "4 of 5", "style": "bodyLarge" }
            ]
          },
          {
            "type": "LinearProgressIndicator",
            "progress": 0.8,
            "color": "primary"
          }
        ]
      },
      {
        "type": "Card",
        "variant": "outlined",
        "children": [
          { "type": "Text", "value": "Top Exercise", "style": "labelLarge" },
          { "type": "Text", "value": "Bench Press — 185 lbs × 5", "style": "titleMedium" },
          {
            "type": "Button",
            "label": "View Details",
            "action": { "type": "navigate", "target": "workout_detail" }
          }
        ]
      }
    ]
  }
}
9:41 Fitness Agent This Week Sessions 4 of 5 Top Exercise Bench Press 185 lbs × 5 Log Workout Set Goal ✦ Generated by AI agent User said: "Show my workout stats" → Agent called get_weekly_summary() → Generated from JSON
The complete pipeline in action: user speaks → agent reasons → tool executes → structured JSON returned → native UI rendered. Every component on this screen was generated, not hand-coded.

Notice: the component names (Column, Card, Row, Text, Button, LinearProgressIndicator) map to standard UI primitives available in any framework — React, SwiftUI, Compose, Flutter. The style tokens (headlineMedium, labelLarge) map to a design system. The renderer walks this tree and emits native components for whatever platform you target.
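A sketch of that renderer walk, targeting React as the example platform. The component map is illustrative; a real registry would cover every schema type, map style tokens to design-system components, and log unknown types instead of silently dropping them.

import { createElement, type ReactNode } from "react";

interface UINode {
  type: string;
  children?: UINode[];
  [prop: string]: unknown;
}

// Illustrative registry: bare HTML elements stand in for design-system components.
const COMPONENT_MAP: Record<string, string> = {
  Column: "div",
  Row: "div",
  Card: "section",
  Text: "span",
  Button: "button",
};

function renderNode(node: UINode, key?: number): ReactNode {
  const { type, children, value, ...props } = node;
  const target = COMPONENT_MAP[type];
  if (!target) return null; // unknown type: fail closed, log for monitoring
  return createElement(
    target,
    { key, ...props },
    // Leaf text lives in "value"; container types recurse into children.
    value != null ? String(value) : children?.map((c, i) => renderNode(c, i)),
  );
}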

Why This Matters

The schema layer — between the model's output and the rendered interface — is where UX decisions live. Engineers understand the rendering. ML engineers understand the model. The schema in the middle is where UX meets AI constraints. Which components to include, how to handle responsive layouts, what error states to support, how to balance expressiveness with reliability — these are judgment calls that require understanding UX, AI constraints, AND the component system. This is the new design surface.

In the Wild: AI-Generated UI Is Already Shipping

Vercel's v0 is the closest public analog to generative UI — it takes natural language prompts and generates React/Next.js UI components. Under the hood, it produces a structured component tree (JSX) from an LLM, just like generative UI produces a JSON tree for Compose. v0 proved the concept works commercially: developers pay for AI-generated UI.

Google Stitch generates UI from natural language prompts as HTML/CSS, proving that designers want to describe interfaces conversationally and get rendered output. Tools like Antigravity then convert HTML to React. The pattern is clear: AI generates a description, a protocol standardizes it, and a renderer turns it into platform-native UI.

Apple's App Intents framework is Apple's version of this pipeline for SwiftUI. When Siri handles a request, App Intents defines the structured data, and SwiftUI renders the result as a native widget or Live Activity. Apple has a complete pipeline from voice → structured intent → native UI. Every platform is converging on this pattern: structured AI output → platform-native rendering.

Microsoft's Copilot in Microsoft 365 generates "Adaptive Cards" — a JSON-based UI format that renders natively in Teams, Outlook, and other Microsoft apps. This is structurally identical to generative UI: a JSON schema defines the component tree, a renderer turns it into native UI. Adaptive Cards has been in production since 2017 and handles billions of renders. It proves the pattern works at scale.

21

Skills & Customization — Teaching AI How You Work

Out of the box, an LLM is a generalist. Skills, instructions, and project configs are how you turn it into a specialist that knows your tools, your conventions, and your workflow.

Every major AI provider has built a system for customizing how their models behave. The names differ but the core idea is the same: give the model persistent context about how you want it to work, not just what you want it to do right now.

Think of it as a spectrum from simple to powerful:

Simple Powerful System Prompts One-shot instructions Custom Personas Reusable behaviors Project Configs Codebase-aware context Skills & Tools Executable capabilities
The customization spectrum: from a single instruction to full modular tool systems.

Layer 1: System Prompts

The simplest form of customization. A system prompt is a set of instructions sent to the model before the user's message. Every API call can include one. It tells the model who it is and how to behave.

// System prompt example
{
  "system": "You are a senior UX researcher. When analyzing user feedback, always identify the underlying need behind the stated request. Format findings as: Observation → Insight → Recommendation.",
  "messages": [
    { "role": "user", "content": "Users keep asking for a dark mode toggle." }
  ]
}

System prompts are powerful but have a key limitation: they're ephemeral. Every new conversation starts from scratch unless you manually include the system prompt again. They also eat into your context window (Chapter 3) since they're sent with every message.

Layer 2: Custom Personas & Reusable Behaviors

The next step up: save your system prompts as reusable personas that persist across conversations. Each provider has a different name for this:

ProviderFeatureWhat It Does
OpenAICustom GPTsPackage a system prompt + tools + knowledge files into a shareable persona. "GPT that reviews design specs against WCAG guidelines."
GoogleGemsCustom Gemini personas with persistent instructions. "A Gem that writes PRDs in our team's format."
AnthropicProjects + System PromptsProject-scoped instructions and knowledge files that apply to all conversations within a project.
Key Idea

Custom personas are really just saved system prompts with a UI wrapper. The model doesn't fundamentally change. But the UX impact is significant: instead of copy-pasting instructions every time, you have a persistent specialist you can return to. For teams, this means you can create shared personas that encode team conventions.

Layer 3: Project Configs — Codebase-Aware Context

This is where things get interesting for people who work in code-adjacent roles. Project configs give the AI persistent knowledge about a specific project, its conventions, and its structure.

The pattern was popularized by AI coding tools and is now spreading to broader AI workflows:

ToolConfig FileWhat It Contains
Claude CodeCLAUDE.mdProject conventions, architecture decisions, coding standards, team preferences. Lives in the repo root. Claude reads it automatically at the start of every session.
Cursor.cursorrulesSimilar to CLAUDE.md. Rules about code style, preferred libraries, patterns to follow or avoid. Cursor loads it as context for every AI interaction in that project.
GitHub Copilot.github/copilot-instructions.mdRepository-level instructions for Copilot. Defines conventions specific to the codebase.
Windsurf.windsurfrulesProject rules for the Windsurf editor's AI assistant. Same pattern, different file name.
Analogy: The Onboarding Doc

A project config is like the onboarding document you'd give a new team member on their first day. "Here's how we name things. Here's our folder structure. Here are the libraries we use and why. Here's what we've tried before that didn't work." Except instead of a human reading it once and gradually forgetting, the AI reads it at the start of every single session.

CLAUDE.md vs plan.md

In Claude Code's workflow, there's an important distinction between two types of files:

CLAUDE.md — "Who you are"

Persistent project context. Describes the codebase, conventions, architecture, and preferences. Doesn't change between tasks. Think of it as the project's constitution.

Example: "This is a Next.js 15 app using Tailwind. We use server components by default. All API routes go in /app/api. Never use class components."

plan.md — "What to do next"

Task-specific planning document. Created for a specific feature or work session. Breaks down the task into steps, tracks progress, and captures decisions made along the way. Temporary and task-scoped.

Example: "Task: Add dark mode. Step 1: Create theme context ✅. Step 2: Update Tailwind config ✅. Step 3: Add toggle component. Step 4: Persist preference."

The two work together: CLAUDE.md tells the agent how to work in this project. plan.md tells it what to work on right now. One is stable, the other is ephemeral.

Layer 4: Skills — Modular, Executable Capabilities

Skills go beyond instructions. A skill is a packaged capability that the AI can execute, not just follow. Skills combine instructions, tool definitions, and sometimes code into a reusable module.

Anatomy of a Skill SKILL: "Create Design Doc" Instructions How to structure the document Tools File creation, template lookup Examples Good/bad output samples Trigger: "When the user asks to write a PRD or design doc, use this skill" Output: A formatted design doc using your team's template
A skill packages instructions + tools + examples + a trigger condition into a reusable module.

The key difference between a skill and a system prompt: a system prompt says "you are a UX researcher." A skill says "when the user asks you to analyze feedback, here's the exact process to follow, here are the tools to use, here are examples of good output, and here's how to format the result." Skills are procedural, not just descriptive.

How the major providers approach skills

Each AI provider has a different philosophy and architecture for customization. Understanding these differences matters because they shape what you can build and how portable your workflows are.

Provider Approaches to Skills & Customization Anthropic / Claude Philosophy: Open protocols CLAUDE.md plan.md Skill files (SKILL.md) MCP servers Projects + Knowledge File-based, composable, lives in your repo. MCP for tool integration. Open ecosystem OpenAI / GPT Philosophy: App store model Custom GPTs GPT Actions (APIs) Knowledge files Assistants API Memory (cross-chat) Platform-hosted, GUI-built, shareable via GPT Store. Actions for API integration. Walled garden Google / Gemini Philosophy: Integrated suite Gems Extensions (Workspace) NotebookLM sources Vertex AI agents Google AI Studio Deeply tied to Google Workspace. Extensions for Gmail, Docs, Drive. Suite-native
Three approaches: Anthropic builds open protocols (MCP, file-based configs). OpenAI builds a marketplace (GPT Store). Google builds into their existing suite (Workspace extensions).

The key differences that matter

Portability

Claude's approach is file-based: CLAUDE.md and skill files live in your repo. If you switch tools or providers, those files still work as documentation. A Custom GPT lives on OpenAI's platform. If you leave, you lose it. Google's Gems are tied to your Google account.

Composability

Claude's skill system is modular. You can have a "create-docx" skill, a "design-doc" skill, and a "frontend" skill, and they compose together in the same session. The model reads whichever skill files are relevant. OpenAI's Custom GPTs are monolithic: one GPT, one system prompt, one set of tools. You can't easily mix GPTs together.

Tool integration

This is where MCP (Chapter 9) becomes the differentiator. Claude uses MCP as the universal protocol for connecting to external tools. OpenAI uses GPT Actions (custom API integrations defined per-GPT). Google uses Extensions (pre-built connectors to Workspace apps). MCP is open and any tool can implement it. GPT Actions and Extensions are vendor-specific.

In the Wild: How Teams Actually Use This

Cursor + .cursorrules has become the most widely adopted project config pattern among developers. Teams commit their .cursorrules file to the repo, encoding conventions like "use TypeScript strict mode," "prefer server components," "use this testing pattern." New team members get AI assistance that already knows the team's standards from day one.

Custom GPTs for design teams: Several design orgs have built internal GPTs for their specific workflows. "Upload a screenshot and this GPT audits it against our design system." "Paste user feedback and this GPT categorizes it by our taxonomy." These are essentially skills packaged as sharable apps.

Claude Projects for research teams: Research teams upload papers, transcripts, and frameworks into Claude Projects. The project-level knowledge means every conversation starts with deep context about the research domain, without re-explaining the background each time.

Where this is heading

The lines between these approaches are blurring fast. MCP is being adopted beyond Claude (Cursor, Windsurf, and others now support it). OpenAI is moving toward more composable tools. Google is opening up Gemini's extension system.

The convergence point: a world where you define your team's AI configuration once (conventions, tools, knowledge, workflows) and it works across any AI tool your team uses. We're not there yet, but the trajectory is clear. The teams investing in structured AI customization now will have a significant advantage as these systems mature.

Memory and Personalization

Customization (above) is what you tell the AI. Memory is what the AI learns about you over time. Every major provider now has a memory system: ChatGPT stores facts across conversations, Claude encrypts and exports memories, Gemini imports histories from competitors.


The UX patterns for memory are standardizing: transparency (see what the AI remembers), editing (correct or delete memories), scoping (global vs project-specific), staleness management (refresh outdated info), and portability (export and import between providers).

Key Idea

A wrong memory is worse than no memory. The AI will confidently act on outdated information ("you said you wanted a vegetarian option" — no, that was last year). The mitigation is governance: time-stamp every memory, summarize aggressively to compress old facts, let users edit or delete, and bias toward forgetting over hoarding. Treat memory like a database that needs a retention policy, not an attic.
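A sketch of what that governance can look like in code. The retention window and field names are assumptions; the point is that staleness is checked at retrieval time, not left to the model.

// Every memory carries a timestamp; retrieval drops anything past retention.

interface Memory {
  fact: string;
  learnedAt: Date;
  scope: "global" | "project";
}

const RETENTION_DAYS = 180; // illustrative; tune per scope and domain

function freshMemories(memories: Memory[], now = new Date()): Memory[] {
  const cutoff = now.getTime() - RETENTION_DAYS * 24 * 60 * 60 * 1000;
  // Bias toward forgetting: stale facts get dropped (or queued for
  // re-confirmation with the user), never silently trusted.
  return memories.filter((m) => m.learnedAt.getTime() >= cutoff);
}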

Key Idea

Skills and project configs are how AI goes from "generic tool" to "team member who knows our workflow." The investment isn't in the technology but in the articulation: writing down how your team works, what your conventions are, and what good output looks like. That documentation becomes the AI's training manual. Teams that can articulate their process clearly will get dramatically more value from AI than teams that can't.

Product & Strategy
22

AI Product Archetypes — Choosing the Right Pattern

Not every AI feature should be a chatbot. There are distinct product patterns, each with different UX, architecture, and user expectations. Choosing wrong costs months.

Every AI product maps to one of six archetypes. The archetype determines the interaction model, the trust requirements, the latency budget, and the failure modes you need to design for.

Six AI Product Archetypes Chat Open-ended conversation ChatGPT, Claude, Gemini User leads, AI follows Copilot AI assists in your workflow GitHub Copilot, Notion AI User leads, AI augments Agent AI acts autonomously Devin, Replit Agent AI leads, user supervises Search Question → sourced answer Perplexity, Glean Accuracy > creativity Generation Create content from prompts Midjourney, Jasper, v0 Creativity > accuracy Classification Sort, label, route data Spam filters, triage, tagging Invisible to the user The archetype determines everything else Interaction model · Trust requirements · Latency budget · Error handling · Pricing
Most AI products are one of these six patterns. Many products combine two (search + chat, copilot + generation). The primary archetype drives the core UX.

The copilot vs agent decision

The most common product debate in 2026: should this feature be a copilot (AI assists, human decides) or an agent (AI acts, human supervises)? The answer depends on the same factors as the autonomy spectrum in Chapter 14: the risk of the action, its reversibility, and how much trust users have already built with the system.

Same need, six implementations

The best way to understand archetypes: take one user need and see how each pattern handles it differently.

User need: "Help me write an email" Chat "Write a follow-up email to the client about pricing" User copies result into Gmail Copilot You start typing, AI suggests completions inline as you write Tab to accept, keep typing Agent "Follow up with the client" Agent drafts, sends, logs it You review after the fact Search "Find the email template for pricing follow-ups" Returns existing templates Generation Click "Generate reply" button. Full email appears in editor. Edit and send yourself Classification AI auto-detects email needs a reply and nudges you Invisible — just a reminder The archetype shapes user effort, AI autonomy, and risk profile Chat: user drives. Copilot: shared control. Agent: AI drives. Search: retrieval only. Generation: creation only. Classification: invisible.
Same user need, six different products. The archetype you choose determines the entire UX, engineering architecture, and trust model.

How products evolve across archetypes

Products don't stay in one archetype. They evolve along a predictable path — usually from lower autonomy to higher:

Product Evolution Timelines GitHub Copilot Generate Copilot Agent Notion AI Generate Search Agent Perplexity Search Chat Agent
Products evolve toward higher autonomy over time. The archetype you launch with isn't the archetype you'll have in two years.
Key Idea

The archetype isn't fixed — products evolve along the spectrum. GitHub Copilot started as autocomplete (generation), became a chat sidebar (copilot), and is becoming an agent (Copilot Workspace). Notion started with AI writing (generation), added Q&A (search), and is moving toward AI workflows (agent). The archetype you launch with isn't the archetype you'll have in two years.

In the Wild

Linear uses classification invisibly: AI auto-labels issues by priority and team. Users don't interact with the AI directly — it just makes the product smarter. This is the highest-ROI, lowest-risk archetype.

Figma AI is copilot-patterned: it suggests layouts, generates variants, and fills text — but the designer is always in control. The canvas is the workspace; AI is the assistant.

Cursor spans three archetypes simultaneously: autocomplete (generation), chat panel (copilot), and Composer (agent). Each mode has different trust levels, latency budgets, and UI patterns.

23

Context Engineering — The Prompt Is the Product

Prompt engineering was about writing good instructions. Context engineering is about designing the entire information environment the model sees — and it's now the most important product decision in AI.

The term "context engineering" was popularized by Andrej Karpathy in 2025 and has since become the standard framing. The insight: what matters isn't just the prompt — it's everything in the context window. System instructions, retrieved documents, conversation history, tool outputs, and examples all shape the model's behavior.

Context Engineering: What the Model Actually Sees THE CONTEXT WINDOW System Prompt Identity + rules RAG Docs Retrieved info Tools Available actions Examples Few-shot patterns Conversation History + user msg Every piece is a product decision What to include, what to exclude, how to order it, how much space to give each piece — these shape behavior.
Context engineering means designing what fills this window. The model's behavior is a function of everything it sees, not just the user's message.
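One way to make those decisions explicit in code: a sketch that assembles the window with a token budget, filling sections in priority order. The budget, the rough token heuristic, and the priority order are all assumptions you would tune per product.

// Context assembly as an explicit product decision: what goes in, in what
// order, under what budget.

interface ContextParts {
  systemPrompt: string;
  retrievedDocs: string[];
  fewShotExamples: string[];
  history: string[]; // oldest first
  userMessage: string;
}

const approxTokens = (s: string) => Math.ceil(s.length / 4); // rough heuristic

function buildContext(parts: ContextParts, budget = 8_000): string {
  let used = approxTokens(parts.systemPrompt) + approxTokens(parts.userMessage);

  const include = (blocks: string[]) =>
    blocks.filter((b) => {
      const cost = approxTokens(b);
      if (used + cost > budget) return false; // exclusion is also a decision
      used += cost;
      return true;
    });

  // Priority order: examples shape format, docs ground facts, then as much
  // recent history as still fits.
  const examples = include(parts.fewShotExamples);
  const docs = include(parts.retrievedDocs);
  // Walk history newest-first for inclusion, then restore chronological order.
  const history = include(parts.history.slice().reverse()).reverse();

  return [parts.systemPrompt, ...examples, ...docs, ...history, parts.userMessage]
    .join("\n\n");
}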

The system prompt is the product spec

In traditional software, the product spec becomes code. In AI products, the product spec is the system prompt. Want the bot to be concise? That's a prompt instruction. Want it to always cite sources? Prompt instruction. Want it to refuse certain topics? Prompt instruction. The system prompt is the single most leveraged artifact in an AI product — and iterating on it is how you ship improvements without changing any code.

Prompt techniques every PM should know

Core Prompt Techniques Zero-Shot Just ask. No examples. "Classify this as spam or not spam" Few-Shot Give 2-5 examples first. "Here are 3 labeled examples. Now classify:" Chain-of-Thought Ask model to reason "Think step by step before answering" Role Assignment Set an identity + expertise. "You are a senior UX researcher with 10 years..." Structured Output Constrain the format. "Respond in JSON with fields: summary, score"
Five techniques that cover 90% of prompt engineering. Most production systems combine several: role + few-shot + structured output is the most common stack.

Before and after: prompt quality matters

Weak Prompt

Summarize this feedback.

Result: Generic summary, no structure, misses key themes, inconsistent length across runs.

Engineered Prompt

You are a UX researcher analyzing user feedback. For each piece of feedback, identify: (1) the stated request, (2) the underlying need, (3) severity (1-5). Respond in JSON.

Result: Consistent, structured, actionable. Same format every time.

Prompt iteration as product development

Key Idea

The shift from "prompt engineering" to "context engineering" reflects a maturation: it's not about clever wording tricks anymore. It's about designing the entire information environment. What documents get retrieved? How much conversation history is retained? Which tools are exposed? How are examples selected? These are product architecture decisions that happen to be expressed as text in a context window.

In the Wild

Anthropic's Claude system prompt is thousands of tokens long and is treated as a living product document. Changes go through eval suites before deployment. It defines Claude's personality, capabilities, limitations, and behavior — it IS the product.

Cursor dynamically constructs context for each request: relevant code files (retrieved via embeddings), the user's recent edits, linter errors, and the project's .cursorrules file. No two requests see the same context. The "intelligence" of Cursor is largely in how well it selects what to include.

24

AI Business Models — How AI Products Make Money

Traditional SaaS costs almost nothing per additional user. AI products spend real money on every API call. That single fact rewrites pricing, margins, and which features are worth shipping at all.

The fundamental economic difference: serving one more user on Figma costs Figma almost nothing. Serving one more query on ChatGPT costs OpenAI real money — model inference, compute, and API fees. This marginal cost per request is what makes AI product economics different from everything that came before.

Pricing models in the wild

ModelHow It WorksExampleTradeoff
Per-seat subscriptionFixed price per user/monthChatGPT Plus ($20/mo), Cursor Pro ($20/mo)Simple, predictable. But heavy users cost you money, light users subsidize them.
Usage-basedPay per token / API callOpenAI API, Anthropic API, Google VertexFair pricing, scales with value. But unpredictable bills scare customers.
HybridBase subscription + usage overagesClaude Pro (base + message limits)Best of both: predictable base, usage upside. Most common in 2026.
Free tier + premiumBasic AI free, advanced features paidNotion AI, Grammarly, PerplexityGreat for adoption. Risk: free tier costs real money to serve.
Embedded / platformAI baked into a product you already pay forApple Intelligence, Galaxy AI, Google Workspace AINo separate pricing. AI is a feature, not a product. Funded by the parent product.

Unit economics designers should understand

Worked example: unit economics of a support bot

Can This Product Make Money? The scenario You're building an AI customer support bot for SaaS companies. Costs per customer Average queries/day: 300 Cost per query: $0.03 = $270/mo in API costs Revenue per customer Subscription: $500/month = $230/mo gross margin The catch Gross margin: 46%. Traditional SaaS: 85%. Heavy users (1000 queries/day) cost $900/mo — they lose you money at $500.
Every AI product PM needs to model this math before building. The question isn't "can we build it?" — it's "can we afford to run it?"
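The same math as a runnable sketch, so you can substitute your own numbers. All inputs are the scenario's assumptions, not benchmarks.

// Unit economics of the support bot above: API cost vs subscription revenue.

function monthlyMargin(opts: {
  queriesPerDay: number;
  costPerQuery: number;   // blended API cost in dollars
  subscription: number;   // monthly price per customer
}) {
  const apiCost = opts.queriesPerDay * opts.costPerQuery * 30;
  const gross = opts.subscription - apiCost;
  return {
    apiCost,                                                  // $270 in the base scenario
    gross,                                                    // $230
    marginPct: Math.round((gross / opts.subscription) * 100), // 46%
  };
}

// Base customer: 46% margin. Heavy user: -80%, i.e., they lose you money.
console.log(monthlyMargin({ queriesPerDay: 300, costPerQuery: 0.03, subscription: 500 }));
console.log(monthlyMargin({ queriesPerDay: 1000, costPerQuery: 0.03, subscription: 500 }));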
Key Idea

In AI products, every design decision is a cost decision. A longer system prompt = more input tokens per call. A "show your reasoning" feature = 5-10x more output tokens. A RAG pipeline = embedding costs + retrieval costs + longer context. A multi-step agent = multiple API calls per user action. Product teams that don't model these costs before building frequently discover their feature is economically unviable at scale.

25

Moats & Defensibility — What's Yours When Everyone Has AI

When every product can access the same foundation models, what makes yours defensible? The models are commoditizing. The value is moving to the layers above and below.

Here's the uncomfortable truth: if your product's value proposition is "we use GPT-4o to do X," your competitor can ship the same thing in a week. The model is an API call. The moat is everything else.

Where Value Accumulates in AI Foundation Models (commoditizing) Context + Data Layer (where moats form) Product + UX + Distribution (where value captures)
Models are becoming interchangeable. Defensibility lives in the data layer (proprietary data, fine-tuned behaviors) and the product layer (UX, distribution, workflow integration).

The six AI moats

  1. Proprietary data: Data the model was fine-tuned on or that your RAG pipeline accesses. Bloomberg built BloombergGPT on proprietary financial data no competitor has.
  2. Data flywheel: User interactions improve the product, which attracts more users, generating more data. Every Spotify listening session makes recommendations better.
  3. Workflow integration: When your AI is deeply embedded in the user's daily workflow, switching costs are enormous. Cursor's deep IDE integration makes it painful to leave.
  4. Distribution: Reaching users first. Apple Intelligence ships on every iPhone. Google AI is in every search. Distribution > technology.
  5. Domain expertise: Understanding the problem deeply enough to build the right eval suite, the right guardrails, and the right UX for a specific vertical. Harvey (legal AI) knows law; a general chatbot doesn't.
  6. Compounding context: The longer a user stays, the more the system knows about them. Memory, preferences, project history. Switching means starting from zero.

The moat audit: evaluating your product

Try It: Rate Your Product's Moat
Score your product 1-5 on each dimension. Be honest.
Moat Audit: Rate Your Product 1-5 on Each Proprietary Data Do you have data no one else has? Data Flywheel Does usage improve the product? Workflow Integration Is switching painful? Distribution Can you reach users first? Domain Expertise Deep domain knowledge? Compounding Context Does the AI learn each user? The "What if" test: What happens if OpenAI ships this feature natively? If your answer is "we're dead" — you don't have a moat. If "we're fine" — you probably do.
Score your product on each dimension. If you're below 3 on all six, your product is a feature — not a company.
Key Idea

The most durable moat in AI is the data flywheel: user interactions → better training/eval data → improved product → more users → more interactions. Products that capture and learn from usage data compound their advantage over time. Products that just wrap an API don't. This is why "AI-native" companies (built around the flywheel) are structurally advantaged over "AI-added" companies (bolted AI onto an existing product).

26

Developer Experience for AI — Designing for Builders

AI developer products have unique DX challenges: non-deterministic outputs, complex debugging, and the need to "try before you buy." The playground isn't a nice-to-have — it's the product.

When a developer evaluates an AI API or tool, they go through a predictable journey: try it in a playground → read the docs → build a prototype → hit edge cases → decide to commit or abandon. The DX at each stage determines conversion.

The AI developer journey

Stage | What They Need | DX Pattern
Explore | Can this do what I need? | Interactive playground with real models. Zero setup. Shareable results.
Prototype | Can I build with this? | SDKs in major languages. Quickstart that works in < 5 min. Copy-paste examples.
Build | How do I handle the hard parts? | Streaming docs, error handling guides, prompt engineering tutorials, eval tooling.
Scale | Can I rely on this? | Rate limits, uptime SLAs, cost calculators, usage dashboards, model versioning.
Debug | Why did it break? | Observability: request logs, token counts, latency traces, response diffs.

What makes AI DX different

Time to wow: the single most important DX metric

How quickly does a developer go from zero to a working demo? This determines adoption more than any feature list. The target: under 5 minutes for a "hello world" equivalent.

Time to Wow: Industry Benchmarks
Anthropic Console ~30 sec · OpenAI Playground ~1 min · First API call (any SDK) ~5 min · Working prototype ~30 min
Every step that adds friction before "wow" loses a percentage of developers. The playground-to-API-call path must be frictionless.
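To make "time to wow" concrete, here is what a five-minute first API call can look like: a minimal sketch against OpenAI's chat completions endpoint, runnable in Node 18+. The model name and prompt are placeholders; check the live model docs (S1) before copying.

```ts
// Minimal "hello world" call to OpenAI's chat completions API.
// Assumes OPENAI_API_KEY is set; the model name is a placeholder.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Say hello in five words." }],
  }),
});

const data = await res.json();
console.log(data.choices[0].message.content);
```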

AI-specific API design decisions

In the Wild

Anthropic's Workbench lets developers test prompts, compare models side-by-side, and share results — all in-browser, before writing any code. The "try it" path has zero friction.

Vercel's AI SDK became the standard for building AI features in web apps because it abstracted away streaming, provider switching, and tool use into a clean TypeScript API. Good SDK design = adoption.

Stripe's API docs (pre-AI) set the DX standard that AI companies now emulate: interactive code examples, copy-paste SDKs, real API keys in the docs. The best AI developer products apply these same principles to a much harder problem space.
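To make the SDK point concrete, here is a sketch of provider switching with the Vercel AI SDK's generateText. The model IDs are illustrative; check current provider docs for the exact names.

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

// Same call, different provider: the abstraction that drove adoption.
// Model IDs below are illustrative.
const model = process.env.USE_CLAUDE
  ? anthropic("claude-sonnet-4-5")
  : openai("gpt-4o");

const { text } = await generateText({
  model,
  prompt: "Summarize this release note in one sentence: ...",
});
console.log(text);
```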

27

The AI Ecosystem Map — Where Everything Fits

The AI landscape has dozens of layers and hundreds of companies. Understanding where your product sits — and who the adjacent players are — is essential for strategic positioning.

The AI Stack (2026)
Compute: NVIDIA, AWS, Azure, GCP, CoreWeave
Models: OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek
Infra: LangChain, Pinecone, Weaviate
Eval: Braintrust, LangSmith, Arize
Dev Tools: Cursor, Claude Code, Copilot, Windsurf, Replit
Applications: ChatGPT, Perplexity, Notion AI, Harvey, Glean
Six layers, from compute at the bottom to user-facing applications at the top. Most value is captured at the top (applications) and the bottom (compute). The middle layers are in a race to avoid commoditization.

Build vs buy at each layer

Layer | Build When | Buy When | Key Tradeoff
Model | You need fine-tuned behavior | Almost always buy/rent | Training costs $1M+
Orchestration | Complex multi-agent flows | Standard agentic patterns | LangChain is fast but opinionated
Vector DB | Unique scaling or privacy needs | Standard RAG | Pinecone/Weaviate vs pgvector
Eval | Highly domain-specific metrics | Standard accuracy/quality | Custom evals + Braintrust hybrid
Guardrails | Regulated industry (health, finance) | Standard content safety | Compliance needs drive build
Analogy: The Restaurant Kitchen

You don't grow your own wheat (compute), breed your own cows (train models), or manufacture your own pans (build infra) — you buy those. But you DO create your own recipes (prompts), design your own menu (product), and build the dining room (UX). The moat is what the customer sees and tastes, not what happens in the supply chain.

The strategic question for any AI product: which layer are you in, and who are you depending on? If you're an application, you depend on model providers. If you're a model provider, you depend on compute. Every layer has leverage over the ones above it and dependency on the ones below.

Key Idea

The "barbell" pattern: most value accrues at the top (apps that own the user relationship and data) and the bottom (compute providers with physical infrastructure). The middle layers — model APIs, orchestration frameworks, vector databases — face the most commoditization pressure. The products that thrive in the middle are those that become essential workflow infrastructure (LangChain) or own a critical data layer (Pinecone).

28

The Demo-to-Production Gap — Why AI Products Break at Scale

The demo always works. Production is where AI products fail. Understanding this gap is the difference between a successful launch and an embarrassing one.

Every AI product team has experienced this: you build a prototype, demo it to leadership, everyone is amazed. Then you ship it to real users and it immediately breaks in ways you never anticipated. This isn't a bug — it's a fundamental property of AI systems.

Why demos lie

The production readiness checklist

Category | Demo Doesn't Test | Production Requires
Input diversity | 5-10 curated examples | Handling any input, including adversarial ones
Error handling | The "it works" path only | Timeouts, rate limits, model errors, bad inputs
Latency | Acceptable for a demo | P95 latency under 3s at production load
Cost | Free during prototype | $X per query × millions of queries = real money
Monitoring | You watch it yourself | Automated alerts, dashboards, anomaly detection
Model updates | Pinned to one version | Surviving provider updates that break your prompts

Model regression: when the ground shifts under you

Your product runs on a model you don't control. When the provider updates that model, your prompts can break without any change on your end. This has happened to nearly every AI product team.

The Model Update Trap: your prompt works on model v2.1 → the provider ships v2.2 (you didn't change anything) → your prompt breaks (the output format changed) → users report bugs → you scramble to fix. Defense: pin to model versions, run evals against new versions before upgrading, and budget time for prompt migration.
The model update trap. Your product broke and you didn't change a single line of code. This is unique to AI and catches every team at least once.
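A sketch of the defense in TypeScript. `runEvalSuite` is a hypothetical helper standing in for your own eval harness (Ch 17), and the pinned model ID is illustrative.

```ts
// Pin a dated snapshot, never a moving alias like "latest".
const PINNED_MODEL = "gpt-4o-2024-08-06"; // illustrative snapshot ID

interface EvalResult {
  score: number;        // aggregate quality score from your eval suite
  formatErrors: number; // e.g. count of invalid JSON outputs
}

// Hypothetical: wraps whatever eval tooling you use (Braintrust, custom, ...).
declare function runEvalSuite(model: string): Promise<EvalResult>;

// Gate every upgrade on a head-to-head eval run against the pinned version.
async function safeToUpgrade(candidate: string): Promise<boolean> {
  const baseline = await runEvalSuite(PINNED_MODEL);
  const next = await runEvalSuite(candidate);
  return next.score >= baseline.score && next.formatErrors <= baseline.formatErrors;
}
```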
Key Idea

The #1 cause of production AI failures: the long tail of user inputs. Your eval suite covers the 90% case. The remaining 10% of queries — ambiguous, multi-language, misspelled, out-of-scope, adversarial — is where the product breaks. Building for the long tail means investing as much in error handling, fallbacks, and edge-case coverage as you do in the happy path.

In the Wild

Google's AI Overviews launch in 2024 became a cautionary tale. The demo was polished. Real users immediately surfaced absurd answers — the AI suggested adding glue to pizza and eating rocks. Google had to add guardrails, limit triggers, and rethink the entire rollout within days. The gap between "works on curated queries" and "works on everything people actually search" was enormous.

Notion AI took a different approach: they shipped with aggressive guardrails (refusing many edge cases) and gradually expanded capabilities based on production data. Start conservative, expand with evidence. Slower launch, fewer crises.

29

Data Flywheels — How AI Products Get Better Over Time

The most powerful AI products aren't the ones with the best model on day one. They're the ones that learn from every user interaction and compound that learning into a better product.

A data flywheel is a self-reinforcing loop: the product generates data from user interactions, that data improves the product, the improved product attracts more users, and more users generate more data. This is the core growth loop for AI products.

The AI Data Flywheel: users interact with the product → interactions generate data → data improves evals + model → better product attracts more users → and the loop compounds.
The flywheel: more users → more data → better product → more users. This loop is the primary source of long-term competitive advantage in AI.

What data to capture

Closing the loop: from thumbs down to better product

The Feedback Pipeline: user gives a 👎 ("wrong answer") → log + label (tag the failure type) → add to evals (new test case) → iterate the prompt (fix the failure) → ship + verify (eval score improves). Time from thumbs-down to fix: the best teams do this in days, not months. Companies that capture data but never close this loop don't have a flywheel. They have a data warehouse.
The feedback pipeline: a single thumbs-down becomes a test case, which triggers a prompt fix, which improves scores. This is the flywheel in practice.
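A minimal sketch of the log-to-eval step. The shapes below are hypothetical; adapt the field names to your own logging and eval tooling.

```ts
// Hypothetical shapes; adapt field names to your stack.
interface FeedbackEvent {
  requestId: string;
  input: string;
  output: string;
  rating: "up" | "down";
}

interface EvalCase {
  input: string;
  badOutput: string;  // what the model said that was wrong
  failureTag: string; // e.g. "wrong-answer", "bad-format", "hallucination"
  source: string;     // traceability back to the original request
}

// The heart of the pipeline: every thumbs-down becomes a regression test.
function toEvalCase(event: FeedbackEvent, failureTag: string): EvalCase | null {
  if (event.rating !== "down") return null;
  return {
    input: event.input,
    badOutput: event.output,
    failureTag,
    source: `feedback:${event.requestId}`,
  };
}
```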

The privacy tension

Collecting user data for improvement creates a tension with user expectations of privacy. Different companies handle this very differently:

Provider | Trains on Your Data? | User Control
OpenAI | Yes by default (consumer); no (API + Team) | Opt-out available in settings
Anthropic | No; never trains on conversations | Memory is encrypted, exportable
Google | Yes for free tier; no for Workspace paid | Can import/export memory

The trust implication: products that DON'T train on user data can advertise that as a feature. Products that DO train on data get a better flywheel but face privacy scrutiny. This is a genuine strategic tradeoff, not a clear right answer.

Key Idea

The flywheel isn't automatic. You have to design for it. That means: building feedback mechanisms into the UX (easy thumbs up/down, correction flows), creating data pipelines that turn feedback into eval datasets, and establishing processes to regularly retrain or re-prompt based on what you learn. Companies that capture data but never close the loop don't have a flywheel — they have a data warehouse.

In the Wild

Spotify's Discover Weekly is the canonical data flywheel. Every listen, skip, save, and playlist add feeds back into the recommendation model. After 10+ years, the compound advantage is enormous — a new competitor can't replicate a decade of behavioral data.

Tesla Autopilot processes billions of miles of driving data from its fleet. Every car contributes to the training data. More cars → more data → better driving → more customers → more cars. The fleet IS the moat.

ChatGPT's RLHF loop: Human feedback on responses trains the reward model, which improves the base model, which produces better responses, which generate more subscriptions, which fund more human raters. OpenAI turned user feedback into a direct product improvement cycle.

Judgment
30

When NOT to Use AI — The Most Valuable Skill

The best AI PMs are the ones who kill AI features that shouldn't exist. Knowing when a lookup table, a rule, or a simple search is the right answer is rarer and more valuable than knowing how to build with LLMs.

Every chapter in this primer has implicitly said "here's how AI does X." This chapter asks the opposite question: when is AI the wrong tool?

The "Should This Be AI?" Decision Tree New feature idea Is the output deterministic? Yes Use rules/code No Is a wrong answer dangerous? Yes Human-in-loop No Can you afford the cost per call? No Simplify first Yes AI is likely the right tool
Run every feature idea through this tree before writing a single prompt. Most "AI features" that fail in production would have been caught at step 1 or 2.
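The tree is simple enough to write as code. A sketch, with the three questions as booleans you answer honestly per feature; it encodes the chapter's judgment, not a universal standard.

```ts
interface FeatureIdea {
  outputIsDeterministic: boolean;  // could rules or a lookup table produce it?
  wrongAnswerIsDangerous: boolean; // harm, not just annoyance, when it fails
  costPerCallAffordable: boolean;  // at projected volume, not demo volume
}

type Verdict = "use-rules" | "human-in-loop" | "simplify-first" | "ai-likely-right";

// Mirrors the decision tree above, one question per branch.
function shouldThisBeAI(idea: FeatureIdea): Verdict {
  if (idea.outputIsDeterministic) return "use-rules";
  if (idea.wrongAnswerIsDangerous) return "human-in-loop";
  if (!idea.costPerCallAffordable) return "simplify-first";
  return "ai-likely-right";
}
```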

The replacement test

For any proposed AI feature, ask: "what would this look like without AI?" Often the non-AI version is faster, cheaper, more reliable, and good enough:

AI Feature | Non-AI Alternative | AI Justified?
AI-powered search | Good keyword search with filters | Only if semantic understanding genuinely matters
AI-generated summaries | Human-written abstracts or excerpts | Only at scale where humans can't keep up
AI categorization | Rule-based classifier or dropdown | Only if categories are fuzzy and input varies widely
AI writing assistant | Templates and snippets library | Only if the output truly needs to be novel each time
AI-powered recommendations | Curated lists, popularity sorting | Only with enough user data to personalize
Key Idea

The question isn't "can AI do this?" — it almost always can. The question is "does AI do this better than the alternatives, at a cost we can sustain, with a failure rate we can tolerate?" If the answer to any of those is no, the right product decision is to not use AI. This takes more courage than shipping an AI feature, and it's what distinguishes senior product thinking from hype-driven building.

In the Wild

Linear uses AI for issue classification but uses deterministic rules for workflow automation (status changes, assignments, notifications). They could use AI for everything — they chose not to, because rules are faster, cheaper, and 100% predictable for structured workflows.

Stripe Radar combines ML fraud detection with hard rules. Some fraud patterns are simple enough for rules ("block transactions over $10K from new accounts in high-risk countries"). ML handles the fuzzy cases. The hybrid is more reliable than either alone.

31

Researching AI Products — When Users Can't Tell You What They Want

Traditional user research asks "what do you need?" AI product research is harder because users can't articulate needs for a technology they don't fully understand. The methods have to change.

Nobody asked for "next-token prediction." They asked for "help me write faster." The translation from human need to AI capability is a skill most product teams haven't developed yet.

What's different about AI research

Traditional Research vs AI Product Research
Traditional: output is deterministic (testable) · users can preview before shipping · success is binary (task done or not) · errors are reproducible · users know what they want ("Move button to the left")
AI products: output varies every time · can't fully preview (stochastic) · success is a spectrum (how good?) · errors are hard to reproduce · users don't know what's possible ("Make it... smarter?")
AI product research requires different methods because the outputs are probabilistic, the capabilities are hard to preview, and users often can't articulate what they need.

Research methods that work for AI

The delta question

The most important research question for any AI feature: "What's the delta?" Not "is the AI good?" but "is the AI better than what exists today?" If the current experience is a blank text field and the AI fills in a draft, the delta is huge. If the current experience is a well-designed template library and the AI generates slightly different text, the delta is small. Ship features with big deltas. Kill features with small ones.

Key Idea

AI product research isn't about asking users "do you want AI?" (they'll say yes). It's about measuring whether AI actually improves their outcome vs the non-AI alternative. The METR study found AI coding tools made experienced developers 19% slower, even though they believed they were 20% faster (S6). Without rigorous measurement, you're flying on perception, not reality.

In the Wild

Notion tested AI features as prompt prototypes before committing engineering resources. They wired a simple GPT call into their editor, tested with 20 users in a single afternoon, and learned that users wanted AI for "fill in this table" more than "write me a paragraph." This redirected six months of roadmap in one day.

Figma ran A/B tests on AI-generated layout suggestions where the control group got random layouts and the test group got AI layouts. The delta was measurable: AI layouts were chosen 3x more often. That data justified the investment.

32

Trust Calibration — Designing the Right Level of Trust

The goal isn't maximum trust. It's calibrated trust — users trust the AI exactly as much as it deserves to be trusted. Over-trust and under-trust are both product failures.

This is the most nuanced design challenge in AI. Every product sits somewhere on a trust spectrum, and getting the calibration wrong has real consequences.

The Trust Calibration Problem
Under-trust: users ignore AI suggestions. Wasted investment. "I don't trust this thing."
Calibrated ✓: users trust when the AI is right and verify when it might be wrong. "Useful, but I'll double-check."
Over-trust: users accept without checking. Dangerous when the AI is wrong. "The AI said so, must be right."
Both extremes are product failures: under-trust means users don't adopt; over-trust means users get harmed. The goal is the middle: calibrated trust that matches actual AI reliability.
Trust calibration is the primary functional metric for AI products. It measures whether users rely on the AI appropriately — not too much, not too little.

The trust design toolkit

Design Pattern | What It Does | When to Use
Confidence indicators | Show how certain the AI is ("High confidence" / "Not sure") | When accuracy varies by query type
Source attribution | Show where the answer came from (citations, links) | Any factual or knowledge-based task
Verification prompts | "Does this look right?" before executing | Irreversible actions or high-stakes outputs
Uncertainty language | "I think..." vs "The answer is..." in AI responses | When hallucination risk is moderate
AI-generated badges | Clear labels that content was made by AI | Always (and legally required in the EU by Aug 2026)
Edit affordances | Make AI output easily editable, not a final answer | Any generation or drafting task
Comparison views | Show the AI suggestion alongside the original | Editing, rewriting, refactoring tasks
Fallback visibility | Show what happens if the AI is wrong (undo, revert) | Any action the AI takes on the user's behalf
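As one example of how a pattern from the table becomes code, here is a sketch of confidence indicators. The thresholds are illustrative; calibrate them against your own eval data, not intuition.

```ts
type ConfidenceLabel = "High confidence" | "Likely correct" | "Not sure";

// Illustrative thresholds; tune each band against measured accuracy.
function confidenceLabel(score: number): ConfidenceLabel {
  if (score >= 0.9) return "High confidence";
  if (score >= 0.6) return "Likely correct";
  return "Not sure";
}
```

A "Not sure" label pairs naturally with a verification prompt from the same table: low confidence should change the interaction, not just the wording.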

Trust calibration by domain

Higher Stakes = More Trust Scaffolding
Low stakes (autocomplete, tags): minimal signals, auto-apply OK
Medium (drafts, summaries): edit affordances, AI badge
High (financial, hiring): sources + confidence, human approval
Critical (medical, legal): full audit trail, human mandatory
The higher the stakes, the more trust scaffolding you need. Autocomplete needs almost none. Medical diagnosis needs everything.
Key Idea

Trust is not a feature you add — it's a property that emerges from dozens of design decisions. The font size of an "AI generated" label. Whether the AI says "The answer is" vs "I believe the answer is." Whether the edit button is prominent or hidden. Whether errors are admitted openly or buried. Each micro-decision shifts calibration. The best AI products get this right not through one big trust feature, but through consistent, deliberate calibration across every interaction.

33

Decisions Under Uncertainty — The AI PM's Core Skill

Traditional PMs ship features that work or don't. AI PMs ship features that work 92% of the time and need judgment calls about the other 8%. This chapter is about building that judgment.

Every AI product decision lives in a fog of uncertainty that traditional product decisions don't have. The model might hallucinate. The model provider might change the model. The cost might be unsustainable. A competitor might ship the same thing next week. Here's how to navigate that fog.

The launch threshold framework

"Is This Good Enough to Ship?" Three questions to answer before launch: 1. What's the failure rate? Measure with evals (Ch 17) 2. What's the failure cost? Annoying? Embarrassing? Harmful? 3. What's the fallback? Undo? Human escalation? Nothing? Ship when: rate is low AND cost is manageable AND fallback exists. A 5% failure rate is fine for autocomplete (low cost, easy undo). A 5% failure rate is not fine for medical advice (high cost, no undo).
The same failure rate can be "ship it" or "absolutely not" depending on what happens when it fails. Context determines the threshold.
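The framework as a gate function: a sketch with illustrative thresholds. Your domain, users, and legal context set the real numbers.

```ts
interface LaunchAssessment {
  failureRate: number; // measured with evals (Ch 17), e.g. 0.05
  failureCost: "annoying" | "embarrassing" | "harmful";
  fallback: "undo" | "human-escalation" | "none";
}

// Illustrative thresholds only; the point is that all three questions
// must be answered before shipping, not these exact numbers.
function readyToShip(a: LaunchAssessment): boolean {
  if (a.fallback === "none") return false; // no recovery path, no launch
  if (a.failureCost === "harmful") {
    return a.failureRate < 0.001 && a.fallback === "human-escalation";
  }
  if (a.failureCost === "embarrassing") return a.failureRate < 0.02;
  return a.failureRate < 0.1; // annoying failures with an easy undo
}
```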

Communicating uncertainty to stakeholders

The hardest part of being an AI PM isn't the technology — it's translating probabilistic reality into language that leadership, legal, marketing, and sales can act on.

How engineers say it

"The model has a 93% accuracy rate with a P95 latency of 2.3 seconds on our eval suite, though performance degrades on out-of-distribution inputs."

How PMs should say it

"It gets the right answer 19 out of 20 times. When it's wrong, users see a 'not sure' indicator and can retry. We're monitoring the 5% and improving weekly."

The gray area decisions

The decisions that define your career as an AI PM aren't the easy ones. They're the ones where there's no clear right answer.

These have no textbook answer. They require product judgment built from experience, values, and deep understanding of your users and business. The primer gives you the technical foundation. The gray areas are where you earn your title.

Key Idea

The defining characteristic of a strong AI PM isn't technical knowledge (you have that now) or business acumen (you're building that). It's comfort with ambiguity. The ability to make a decision with 70% confidence, communicate the uncertainty honestly, build in reversibility, and iterate based on data. Traditional PMs ship and move on. AI PMs ship, monitor, learn, and adjust — continuously. The product is never "done" because the model, the data, and the users are always changing.

In the Wild

Anthropic's responsible scaling policy is a decision framework for uncertainty at the company level: they define capability thresholds and pre-commit to safety measures at each level. This "decide the framework in advance, not in the moment" approach works for product teams too — define your quality thresholds, failure protocols, and escalation paths before you need them.

Notion AI's launch strategy was a masterclass in uncertainty management: ship with aggressive guardrails (the AI refuses many edge cases), measure what users actually try, expand capabilities based on real data. They chose "too conservative at launch, loosen over time" over "too permissive at launch, tighten after incidents." One approach builds trust. The other destroys it.

Putting It All Together
34

How an AI Feature Ships — The End-to-End Process

This chapter connects all 33 previous chapters into a single workflow. Here's how a team actually goes from "we should add AI" to a shipped, monitored, improving feature.

The AI Feature Lifecycle
1. Validate: should this be AI? (Ch 30, 31)
2. Prototype: prompt + API in hours (Ch 23, 7)
3. Build evals: define "good," test it (Ch 17)
4. Production stack: RAG, tools, guardrails (Ch 10, 9, 16, 19)
5. Launch with trust scaffolding: guardrails, trust, fallbacks (Ch 14, 16, 28, 32)
6. Monitor → learn → improve, forever: feedback → evals → prompt iteration → ship (Ch 17, 23, 29)
An AI feature is never "done." The model changes, users change, data changes. Step 6 runs forever.
Six phases, referencing chapters throughout the primer. The lifecycle is linear to launch, then becomes a continuous loop.

Who does what

AI features blur traditional roles. Here's how responsibilities are shifting in 2026:

Who Owns What in AI Product Development (matrix across PM, Design, Eng, and Research; ● = owns/leads, ○ = contributes). Tasks covered: system prompt, eval rubric, model selection, trust patterns, pipeline/infra, ship/no-ship call.
The system prompt is co-owned by PM and Design. Eval rubrics are co-owned by PM and Research. These shared responsibilities are new — traditional products have clearer ownership lines.

The emerging "AI builder" role

The lines between PM, designer, and engineer are blurring on AI teams. Microsoft reorganized in 2025 around a unified "Applied AI" function that merges traditional PM and engineering under a single "builder" role. Google's DeepMind product teams have designers writing prompts and PMs reviewing eval results. Startups like Vercel ship AI features where a single person handles prompt design, eval, and UX — because with API-based AI, you don't need separate specialists for each step.

The implication: the most valuable people on AI teams are T-shaped — deep in one discipline, but able to contribute across the prompt-eval-UX loop. A designer who understands evals. A PM who can write and iterate prompts. An engineer who thinks about trust patterns. This primer exists to build that cross-disciplinary fluency.

In the Wild

Microsoft's March 2026 Copilot reorg merged its consumer and commercial AI teams under a single leader and freed its AI CEO (Mustafa Suleyman) to focus on models. The signal: AI products are no longer a side project staffed by a few people — they're central enough to warrant dedicated, unified organizations with engineering, product, and design working as one unit.

Figma's AI team puts designers directly into the eval process and has designers writing system prompts — a designer wrote the first system prompt for Figma Make. Their Head of AI Product emphasizes keeping teams small by having everyone touch code, which AI tooling now makes feasible.

Anthropic's Claude Code team describes their approach openly: "designers ship code, engineers make product decisions, product managers build prototypes and evals." They have PMs, but the PM's job has shifted — instead of writing specs and handing them off, PMs build working prototypes with Claude Code and use evals to validate ideas. The team replaced documentation-first thinking with prototype-first thinking.

Production Addendum: The Parts Teams Forget

The workflow above gets the feature shipped. The checklist below keeps it from becoming a beautiful demo that quietly leaks data, loses trust, or collapses when the model changes.

Security & Prompt Injection

Treat retrieved docs, webpages, tool results, and emails as untrusted input. Add allowlisted tools, permission gates, output validation, and tests for data exfiltration attempts.
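A sketch of the first two defenses, with hypothetical tool names. Real systems layer this with model-level mitigations, sandboxing, and human review.

```ts
// Only tools on this list may ever be executed; names are hypothetical.
const ALLOWED_TOOLS = new Set(["search_docs", "get_calendar", "draft_email"]);

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Permission gate: the model proposes, the application disposes.
function authorizeToolCall(call: ToolCall): ToolCall {
  if (!ALLOWED_TOOLS.has(call.name)) {
    throw new Error(`Blocked non-allowlisted tool: ${call.name}`);
  }
  return call;
}

// Crude output validation for secret exfiltration: one layer, not a full defense.
function validateOutput(text: string): string {
  const leakPatterns = [/sk-[A-Za-z0-9]{20,}/, /BEGIN [A-Z ]*PRIVATE KEY/];
  if (leakPatterns.some((p) => p.test(text))) {
    throw new Error("Possible secret detected in model output");
  }
  return text;
}
```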

Human Permission Model

Classify actions as draft, reversible, externally visible, or irreversible. Require confirmation for sends, purchases, access changes, deletions, and sensitive-data transmission.
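The same classification, sketched as a lookup plus a confirmation gate. The tool-to-class mapping is illustrative; assigning yours is a product decision, not an engineering detail.

```ts
type ActionClass = "draft" | "reversible" | "externally-visible" | "irreversible";

// Illustrative mapping; classify every tool your agent can call.
const TOOL_CLASS: Record<string, ActionClass> = {
  draft_email: "draft",
  rename_file: "reversible",
  send_email: "externally-visible",
  delete_project: "irreversible",
};

// Unknown tools default to the strictest class.
function needsConfirmation(tool: string): boolean {
  const cls = TOOL_CLASS[tool] ?? "irreversible";
  return cls === "externally-visible" || cls === "irreversible";
}
```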

Privacy & Retention

Define what user data is logged, retained, used for evals, used for training, redacted, or excluded. Make consent and deletion paths part of the product surface.

Model Drift

Pin model versions where possible, run scheduled regression evals, and record model IDs in traces so quality changes can be explained after a provider update.

Accessibility & Generated UI

Require semantic labels, focus order, contrast, motion controls, and readable error states in the UI schema. Generated interfaces do not get a pass on accessibility.
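One way to make that requirement enforceable rather than aspirational: validate generated components before render. A sketch with illustrative field names, not a published schema.

```ts
// A generated button cannot reach the screen without its labels.
interface GeneratedButton {
  type: "button";
  label: string;     // visible text, required
  ariaLabel: string; // screen-reader text, required, no silent default
  action: string;
}

function validateButton(node: Partial<GeneratedButton>): GeneratedButton {
  if (node.type !== "button" || !node.label || !node.ariaLabel || !node.action) {
    throw new Error("Generated button rejected: missing type, label, ariaLabel, or action");
  }
  return node as GeneratedButton;
}
```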

Copyright & IP

Decide where generated text, code, and images can be used, what sources need attribution, and which workflows need legal review before launch.

Practical Templates

AI Feature PRD

Problem, user task, why AI is needed, non-AI fallback, data access, model route, permissions, success metric, failure cost, launch threshold.

Eval Rubric

3-5 criteria, examples of pass/fail, grader type, golden set owner, minimum score, regression threshold, review cadence.

Launch Readiness

Red-team cases, prompt-injection tests, telemetry, rollback path, human escalation, cost alert, legal/privacy review, support playbook.

RAG Design Worksheet

Corpus, permissions, freshness, chunking strategy, retriever, reranker, citation style, empty-result behavior, source-quality metric.

Deployment Architecture

The most durable AI products converge on the same skeleton: app UI → AI gateway → model router → prompt/context builder → model call → tool executor/RAG → validators/guardrails → trace/eval logger → user feedback loop. If one of those boxes is missing, know why.
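The skeleton as one function, each "box" a named step. Every helper below is a hypothetical stand-in for your own module; the point is the shape, not the names.

```ts
// Hypothetical stand-ins for your own modules.
declare function routeModel(task: string): string;                          // model router
declare function buildContext(task: string, input: string): string;         // prompt/context builder
declare function callModel(model: string, prompt: string): Promise<string>; // model call (tools/RAG inside)
declare function runGuardrails(output: string): string;                     // validators/guardrails
declare function logTrace(event: Record<string, unknown>): void;            // trace/eval logger

async function handleRequest(task: string, input: string): Promise<string> {
  const model = routeModel(task);
  const prompt = buildContext(task, input);
  const raw = await callModel(model, prompt);
  const safe = runGuardrails(raw);
  logTrace({ task, model, output: safe }); // feeds evals and the feedback loop
  return safe;
}
```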

35

Glossary — Quick Reference

Every term from this primer, in one place. Reference this whenever a concept from an earlier chapter comes up.

Term | Plain English | Chapter
Token | A chunk of text (~4 characters) that models process. The "atom" of AI. | 1
Next-token prediction | How LLMs work: predict the most likely next word, repeat. | 2
Context window | The model's working memory. Everything it can "see" at once. | 3
Temperature | Randomness dial. Low = predictable. High = creative. | 4
Reasoning model | Model that "thinks" step-by-step before answering. Slower, costlier, better on hard problems. | 5
Structured output | Forcing the model to respond in a specific format (JSON, XML). | 6
Model routing | Sending simple tasks to cheap models, hard tasks to expensive ones. | 7
Multimodal | AI that processes text, images, audio, and video. | 8
Function calling | Model outputs a structured request to use a tool (API, database, etc). | 9
RAG | Retrieval Augmented Generation. Searching a knowledge base before answering. | 10
Embedding | Converting text into numbers that capture meaning. Powers semantic search. | 10
Vector database | Database optimized for storing and searching embeddings. | 10
Hybrid search | Combining dense vector search (meaning) with sparse keyword search like BM25 (exact match) and merging results. | 10
Reranking | Re-scoring the top retrieval results with a slower, more accurate cross-encoder model. Biggest single quality lever in production RAG. | 10
HyDE | Hypothetical Document Embeddings. Have the LLM draft an answer first, then embed and search using that — closes the vocabulary gap between questions and documents. | 10
Agentic loop | Think → Act → Observe → Repeat. How AI agents reason through multi-step tasks. | 11
MCP | Model Context Protocol. Universal standard for connecting AI to tools. | 12
A2A | Agent-to-Agent protocol. Standard for agents communicating with each other. | 14
Guardrails | Safety systems that filter inputs/outputs (content, PII, jailbreaks). | 16
Evals | Systematically testing AI against a set of examples to measure quality. | 17
LLM-as-judge | Using a strong model to grade a weaker model's outputs. | 17
Inference | Using a trained model to generate output. Every API call is an inference call. Distinct from training (which creates the model). | 2, 19
TTFT | Time to First Token. How fast the model starts responding. <200ms feels instant. | 19
Streaming | Showing tokens as they generate rather than waiting for the full response. | 19
Prompt caching | Reusing processed prompt prefixes across calls. Cached tokens billed at ~10% of normal rate. The biggest production cost lever. | 19
KV caching | Internal caching of attention state during a single generation so the model doesn't recompute earlier tokens. Automatic, not the same as prompt caching. | 19
Speculative decoding | A small fast model drafts tokens, the large model verifies in batch. 2–3x speed at the same quality. | 19
Quantization | Reducing the precision of model weights (e.g. 16-bit to 4-bit) to make models smaller and faster, with a small quality cost. | 19
Distillation | Training a smaller "student" model to mimic a larger "teacher" model. How frontier capabilities flow down to cheap, fast models. | 19
OpenTelemetry GenAI | Emerging standard for emitting LLM and agent traces (model calls, tool calls, latencies) into your existing observability stack. | 14
Generative UI | AI generating actual interface components (cards, forms) not just text. | 20
System prompt | Hidden instructions that define the AI's behavior, identity, and rules. | 21, 23
Context engineering | Designing everything in the context window: prompt, docs, tools, history. | 23
Data flywheel | Users → data → better product → more users. The core AI growth loop. | 29
Trust calibration | Designing so users trust AI exactly as much as it deserves. | 32
Hallucination | When the model confidently states something false. | 17, 32
Fine-tuning | Retraining a model on custom data to change its behavior. | 10, 21
RLHF | Reinforcement Learning from Human Feedback. How models learn to be helpful. | 29
36

Further Reading — Go Deeper

Curated resources to continue learning. Organized by section. The sources below are also the live references for volatile claims like pricing, context windows, regulatory timing, and model capabilities.

Sources & Fact-Check Notes

AI facts age quickly. Any number tied to model price, context window, latency, benchmark score, legal deadline, or provider feature should be rechecked before publishing externally.

S1 · OpenAI pricing

Live API pricing for current model families and token rates.

S2 · Anthropic context windows

Context window docs and extended-thinking token behavior.

S3 · Gemini API docs

Google Gemini pricing and model limits.

S4 · Model Context Protocol

Anthropic MCP docs for protocol concepts and product support.

S5 · EU AI Act

European Commission overview for transparency and high-risk obligations.

S6 · METR productivity study

Original METR writeup on early-2025 AI and experienced developer productivity.

S7 · OpenTelemetry GenAI

Semantic conventions for GenAI spans, metrics, events, and agent traces.

S8 · Prompt caching

OpenAI prompt caching guide for cacheable prefixes, latency, and cost behavior.

S9 · Structured Outputs

Structured Outputs announcement and caveats for schema-constrained generation.

Foundations (Ch 1-8)

Agents (Ch 9-13)

Production (Ch 14-21)

Product & Strategy (Ch 22-29)

Judgment (Ch 30-35)

You made it — now teach it.

The best way to learn: explain it to someone else.
If you can't, you don't know it yet.

@adhithya