A visual guide to LLMs, agents, and generative UI — how they actually work, and why it matters for what you're building.
You do not need to read this straight through. Pick the track that matches the decision you are trying to make, then use the glossary and sources when a claim needs verification.
Read Chapters 1-7, 16-19, 22-34. Focus on model choice, evals, launch thresholds, cost, and when not to use AI.
Read Chapters 3, 8, 14-16, 20, 31-33. Focus on trust, failure states, AI-generated UI, and human control.
Read Chapters 1-13, 16-21, 23, 26, 28-29. Focus on schemas, tools, RAG, observability, evals, and cost.
Read Chapters 7, 18-19, 22, 24-30, 34. Focus on economics, defensibility, distribution, data, and product risk.
Every technical chapter should cash out into a product decision. If a concept does not change what you build, measure, price, or disclose to users, treat it as background knowledge and keep moving.
You think in words. LLMs think in tokens. Understanding this difference is the foundation of everything else.
When you type "Hello, how are you?" into ChatGPT or Gemini, the model doesn't see five words. It sees something like this:
A token is a chunk of text — sometimes a whole word, sometimes part of a word, sometimes just a character. The model has a fixed vocabulary (think of it as a dictionary) of roughly 30,000–100,000 tokens, and every piece of text gets broken into pieces from that dictionary.
Imagine you have a box of 50,000 unique LEGO bricks, each with a different shape. To represent any object, you combine bricks from your box. Common objects (like "the" or "hello") get their own single brick. Rare words get broken into multiple bricks. The word "tokenization" might become three bricks: token + iz + ation.
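To make that concrete, here is a toy tokenizer: greedy longest-match against a tiny invented vocabulary. Real tokenizers learn 30,000–100,000 entries with algorithms like BPE, but the text-in, token-IDs-out shape is the same:

```typescript
// Toy tokenizer: greedy longest-match against a tiny, hand-made vocabulary.
// Real tokenizers (BPE, SentencePiece) learn their entries from data;
// this only shows how text becomes a sequence of integer IDs.
const VOCAB: Record<string, number> = {
  "hello": 1, ",": 2, " how": 3, " are": 4, " you": 5, "?": 6,
  "token": 7, "iz": 8, "ation": 9,
  // ...a real vocabulary continues for tens of thousands of entries
};

function tokenize(text: string): number[] {
  const ids: number[] = [];
  let i = 0;
  while (i < text.length) {
    // Take the longest vocabulary entry that matches at position i.
    let match = "";
    for (const piece of Object.keys(VOCAB)) {
      if (text.startsWith(piece, i) && piece.length > match.length) match = piece;
    }
    if (match === "") { i += 1; continue; } // unknown character: skip (real tokenizers fall back to bytes)
    ids.push(VOCAB[match]);
    i += match.length;
  }
  return ids;
}

console.log(tokenize("hello, how are you?")); // [1, 2, 3, 4, 5, 6]  (6 tokens, not 5 words)
console.log(tokenize("tokenization"));        // [7, 8, 9]           (one word, three tokens)
```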
Every API call to an LLM is priced per token. Every model has a maximum number of tokens it can handle at once (its "context window"). When you're designing generative UI — a protocol where an LLM generates UI component trees — the size of that output in tokens directly determines what each generation costs, how long the user waits for it to stream in, and how complex a layout you can produce before hitting the context limit.
Different models use different tokenizers, but the patterns are similar:
| Text | Approximate Tokens | Why |
|---|---|---|
| Hello | 1 | Common word, gets its own token |
| authentication | 2–3 | Long word, split into parts |
| {"type": "card"} | 7–9 | JSON has lots of punctuation, each costs a token |
| A full paragraph (100 words) | ~130 | Rule of thumb: 1 token ≈ 0.75 words in English |
| A complex generative UI layout (20 components) | ~1,500–3,000 | Nested JSON structures are token-expensive |
When you design the generative UI schema, every field name you choose costs tokens. A field called backgroundColor costs more tokens than bg. But bg is ambiguous and the model might misinterpret it. This is a real product tradeoff: schema readability vs token efficiency. This is a product and UX decision as much as an engineering one.
Tokens are the fundamental unit of LLM computation. Everything — cost, speed, capability limits — flows from token counts. When someone says "this model has a 128K context window," they mean 128,000 tokens, which is roughly a 200-page book.
OpenAI's pricing is entirely token-based, and the exact prices move as model families change.S1 This means a company like Notion AI, which processes millions of documents daily, must obsess over token efficiency — every unnecessary word in their system prompt costs real money at scale.
Cursor (the AI code editor) ran into token limits early. Their codebase context feature had to be carefully designed to select only the most relevant files to include in the context — because stuffing an entire repo into the prompt would blow past token limits and cost a fortune. They built a retrieval system that picks the 5-10 most relevant files, not all 500.
Stripe optimized their fraud detection prompts to use ~40% fewer tokens by switching from verbose natural language descriptions to compressed, structured formats — cutting their API costs proportionally while maintaining accuracy.
The most important idea in modern AI is embarrassingly simple: predict the next word. That's it. Everything else — conversations, code, UI generation — is a consequence of this one trick done at extraordinary scale.
An LLM doesn't "understand" language the way you do. It's a prediction machine. Given a sequence of tokens, it calculates the probability of every possible next token, then picks one.
Here's the key insight: the model generates text one token at a time. After it picks "mat," the input becomes "The cat sat on the mat" and it predicts the next token again. This process repeats — token by token — until the model generates a special "stop" token or hits the maximum output length.
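A minimal sketch of that loop. The nextTokenProbs stub stands in for a real model's forward pass over its whole vocabulary; the loop around it is the actual shape of generation:

```typescript
// Conceptual sketch of autoregressive generation.
const STOP_TOKEN = 0;

// Stand-in for a real model forward pass: a real LLM scores every entry in its
// ~100K-token vocabulary here. This toy version just prefers stopping once the
// sequence is long enough.
function nextTokenProbs(tokens: number[]): Map<number, number> {
  return tokens.length > 8
    ? new Map([[STOP_TOKEN, 0.9], [42, 0.1]])
    : new Map([[42, 0.7], [7, 0.2], [STOP_TOKEN, 0.1]]);
}

function pickMostLikely(probs: Map<number, number>): number {
  let best = STOP_TOKEN, bestP = -1;
  for (const [token, p] of probs) if (p > bestP) { best = token; bestP = p; }
  return best;
}

function generate(prompt: number[], maxNewTokens = 256): number[] {
  const tokens = [...prompt];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = pickMostLikely(nextTokenProbs(tokens)); // one token per step
    if (next === STOP_TOKEN) break;                      // model signals it is done
    tokens.push(next);                                   // the output becomes part of the next input
  }
  return tokens;
}

console.log(generate([7, 8, 9]));
```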
You know how your phone keyboard suggests the next word? An LLM is the same idea, but instead of being trained on your text messages, it's been trained on a significant fraction of all text ever written by humans. And instead of choosing from 3 suggestions, it's choosing from 100,000 possibilities, weighted by probability. The "magic" is just autocomplete at absurd scale.
How does the model learn these probabilities? Through training on enormous amounts of text. The process is conceptually simple: show the model a chunk of text with the next token hidden, have it predict that token, compare the prediction to the real token, and nudge the model's parameters so the right answer becomes slightly more likely. Repeat billions of times.
The model doesn't memorize text. It learns patterns — statistical relationships between tokens. After seeing millions of sentences about cats sitting on things, it learns that "mat" is the most likely word after "the cat sat on the." After seeing millions of JSON objects, it learns the patterns of valid JSON. After seeing millions of code snippets, it learns programming syntax.
The process above — showing the model billions of examples and adjusting its parameters — is called training. It happens once (or periodically) and costs millions of dollars in compute. The result is a trained model — a massive file of numerical weights.
When you send a message to ChatGPT or Claude, the trained model runs the next-token prediction loop to generate a response. This is called inference. It happens billions of times per day and costs dollars (or fractions of a cent) per request.
Every time you hear "inference" in an AI conversation, mentally substitute "using the model to generate output." Inference latency = how fast you get a response. Inference cost = how much each API call costs. Inference provider = the company running the model's servers. This is the word you'll hear most often in AI product and engineering conversations.
If an LLM has seen enough examples of JSON structures that describe UI components, it can predict what a valid UI component JSON should look like. Feed it a prompt like "generate a card component with a title and two buttons" and it produces token after token of valid JSON — not because it "understands" UI, but because it's seen enough patterns to predict what comes next in that kind of document.
An LLM doesn't know what a button looks like or what a card does. It knows what a button looks like in JSON — the statistical pattern of how buttons are described in text. This is both the power (it can generate any structured format it's seen) and the limitation (it can produce something that looks right in JSON but would be terrible UI).
GitHub Copilot is literally next-token prediction applied to code. When you type function calculateTax(, Copilot predicts the most likely next tokens based on patterns from millions of public repositories. It doesn't "understand" tax law — it's seen enough tax calculation functions to predict the pattern. This is why it's great at boilerplate but stumbles on novel business logic.
Google Search autocomplete works on a similar principle — given "how to make", it predicts "pancakes" or "money" based on frequency patterns. LLMs are this concept taken to an extreme scale.
Midjourney and DALL-E use a variation of this for images: instead of predicting the next token, they predict what pixels should look like given a text description. Different modality, same core idea — pattern prediction at scale.
A context window is the total amount of text a model can "see" at once — both your input and its output combined. It's the single most important constraint in building AI products.
Think of the context window as a desk. Everything the model needs to work with — your instructions, the conversation history, any documents you've provided, AND the response it's generating — all has to fit on this desk. If it doesn't fit, it falls off the edge and the model can't see it.
| Model | Context Window | Roughly Equivalent To |
|---|---|---|
| OpenAI flagship models | Varies by model | Check the live model docs before shipping |
| Claude Sonnet / Opus family | 200K+ tokens, model-dependent | Anthropic documents 200K standard context and newer long-context options |
| Gemini Pro / Flash family | Up to 1M+ tokens, model-dependent | Google publishes current limits in AI Studio / API docs |
| Gemini Nano (on-device) | ~4–32K tokens | ~5–50 pages of text |
Model names, context windows, and prices change quickly. Treat this table as orientation, then verify against live provider docs before using it in a spec.S1S2S3
See that last row? On-device models (Gemini Nano, Apple's local models, Phi-4-mini) have dramatically smaller context windows. A prompt that works with a cloud model might completely fail on-device. Your architecture needs to handle this: shorter schemas, simpler prompts, or a fallback strategy when the on-device model can't handle the request.
Context window size shapes every product decision in an AI system:
Large cloud models: can hold complex schemas, long conversation history, and rich tool definitions, and can generate detailed, multi-component UIs. But: slower, more expensive, requires network.
Small on-device models: fast, private, work offline. But: they can only handle simple prompts and small outputs, need compressed schemas, and are limited to simpler UI generation.
The context window is a shared budget. Every token you spend on instructions is a token you can't use for output. This is why prompt engineering is an optimization problem: say enough for the model to understand the task, but no more. In generative UI systems, a bloated component schema eats into the space available for the actual UI generation.
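A back-of-the-envelope sketch of that budget. The window size and token counts are placeholders; the point is that every input token is subtracted from the space available for output:

```typescript
// The context window is one shared budget: everything you send plus everything
// the model writes back has to fit. Numbers below are illustrative placeholders.
const CONTEXT_WINDOW = 128_000;

function outputBudget(opts: { systemPrompt: number; schema: number; history: number; retrievedDocs: number }): number {
  const inputTokens = opts.systemPrompt + opts.schema + opts.history + opts.retrievedDocs;
  return CONTEXT_WINDOW - inputTokens; // what's left for the model's answer
}

// A bloated 6K-token component schema leaves less room for the UI it is supposed to describe.
console.log(outputBudget({ systemPrompt: 1_500, schema: 6_000, history: 20_000, retrievedDocs: 40_000 })); // 60500
```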
Cursor (AI code editor) lives and dies by context management. A developer's codebase might be millions of lines, but the model can only see a fraction at once. Cursor built an entire retrieval system — indexing your repo, ranking file relevance, and intelligently packing the most useful code into the context window. This "what to include" problem is their core product challenge.
NotebookLM by Google uses Gemini's 1M token window to ingest entire research papers, books, and document collections at once. Before large context windows, this required complex chunking and retrieval (RAG). Now you can just dump 50 PDFs in and ask questions. The product exists because the context window got big enough.
ChatGPT's memory feature is a workaround for context limits. Between conversations, the context window resets. So OpenAI stores a condensed summary of what it learned about you — effectively compressing your history into a few hundred tokens that fit alongside each new conversation.
When the model predicts the next token, it doesn't always pick the most likely one. Temperature controls how adventurous it gets.
Remember from Chapter 2 that the model produces a probability distribution over all possible next tokens. Temperature is a number that modifies these probabilities before the model makes its pick.
Temperature isn't the only knob. Two others matter for your work:
Top-P (nucleus sampling): Instead of considering all 100,000 possible tokens, only consider the smallest set whose combined probability exceeds P. If P=0.9, the model only picks from the top tokens that together account for 90% of the probability. This prevents the model from ever picking wildly unlikely tokens.
Top-K: Even simpler — only consider the K most likely tokens. If K=50, the model picks from the top 50 most probable tokens. The other 99,950 are eliminated entirely.
For generative UI and any system where the model's output must conform to a specific schema, you want: temperature near 0, top-P around 0.9, and top-K around 40. This keeps the model focused on producing valid, predictable output while still allowing some flexibility in how it composes the UI.
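A sketch of how those three knobs reshape a single next-token distribution. The probabilities are invented, and the order of operations (temperature, then top-k, then top-p, then sample) mirrors common implementations but varies by provider:

```typescript
type Dist = Array<{ token: string; p: number }>;

// Temperature reshapes the distribution (p^(1/T), renormalized).
// T near 0 sharpens toward the single most likely token; T > 1 flattens it.
function applyTemperature(dist: Dist, temperature: number): Dist {
  const t = Math.max(temperature, 1e-6);
  const scaled = dist.map(({ token, p }) => ({ token, p: Math.pow(p, 1 / t) }));
  const z = scaled.reduce((s, x) => s + x.p, 0);
  return scaled.map(({ token, p }) => ({ token, p: p / z }));
}

// Top-K keeps only the K most likely tokens. Top-P keeps the smallest set whose
// cumulative probability reaches P. Renormalize before sampling.
function truncate(dist: Dist, topK: number, topP: number): Dist {
  const sorted = [...dist].sort((a, b) => b.p - a.p).slice(0, topK);
  const kept: Dist = [];
  let cum = 0;
  for (const entry of sorted) {
    kept.push(entry);
    cum += entry.p;
    if (cum >= topP) break;
  }
  const z = kept.reduce((s, x) => s + x.p, 0);
  return kept.map(({ token, p }) => ({ token, p: p / z }));
}

function sample(dist: Dist): string {
  let r = Math.random();
  for (const { token, p } of dist) { r -= p; if (r <= 0) return token; }
  return dist[dist.length - 1].token;
}

// Invented distribution for "The cat sat on the ___"
const next: Dist = [
  { token: "mat", p: 0.55 }, { token: "floor", p: 0.25 },
  { token: "sofa", p: 0.15 }, { token: "moon", p: 0.05 },
];

// Near-deterministic settings for schema-constrained output (temp ~0, top-p 0.9, top-k 40):
console.log(sample(truncate(applyTemperature(next, 0.1), 40, 0.9))); // almost always "mat"
```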
GitHub Copilot uses low temperature (~0.1–0.2) for code completions. You want predictable, syntactically correct code — not creative surprises. When Copilot suggests a function body, it should be the most likely correct implementation, not a novel experiment.
ChatGPT's creative writing mode uses higher temperature (~0.7–1.0). When you ask it to write a story, you want variety — the same prompt should produce different stories each time. Low temperature would produce the same story every time, which feels robotic.
Jasper AI (marketing copy tool) lets users adjust a "creativity slider" — which maps directly to temperature. "More creative" = higher temperature for brainstorming taglines. "More precise" = lower temperature for factual product descriptions. They turned a technical parameter into a UX feature.
Standard models respond instantly. Reasoning models pause, think step-by-step, and then answer. They cost 5–10x more, take seconds to start, and beat everything else on the hard stuff. The product question is when that trade is worth it.
Remember from Chapter 2 that LLMs predict one token at a time. A reasoning model does something different: before producing its visible answer, it generates internal "thinking tokens" — a private chain of reasoning that the user may or may not see.
Standard model: a student who blurts out the answer immediately. Reasoning model: a student who pulls out scratch paper, works through the problem step by step, then gives you the final answer. The scratch paper (thinking tokens) takes time and costs money, but for hard problems the answer is dramatically better.
Reasoning models create a fundamentally different interaction pattern: the user waits several seconds or more while the model thinks, so the UI has to show progress instead of an instant answer.
The decision framework is simple: if you wouldn't need scratch paper for this problem, don't use a reasoning model. "What's the weather?" doesn't need reasoning. "Analyze this contract for liability risks across three jurisdictions" does. The product decision is whether to route automatically (like model routing in Chapter 7) or let the user choose.
Cursor uses reasoning models selectively: standard models for autocomplete and quick edits, reasoning models for complex multi-file refactors. The user doesn't choose — the system routes based on task complexity.
ChatGPT shows a collapsible "Thought for X seconds" indicator. Users can expand it to see the chain of thought or collapse it and just read the answer. This progressive disclosure pattern has become the standard.
Claude's Extended Thinking offers four effort levels (low, medium, high, max). Higher effort = more thinking tokens = longer wait = better answers on hard problems. The API exposes this as a parameter, letting product teams tune the tradeoff per feature.
LLMs naturally produce free-flowing text. But generative UI needs valid JSON. Structured output is how we force a creative, probabilistic system to produce machine-readable data.
Without structured output, if you ask an LLM to "generate a card component," you might get:
Sure! Here's a card component for you:
The card has a title "Weather Today" and shows the
current temperature of 72°F with a sunny icon...
That's nice prose, but your UI renderer can't do anything with it. What you need is:
{
"type": "Card",
"children": [
{ "type": "Text", "value": "Weather Today", "style": "headline" },
{ "type": "Row", "children": [
{ "type": "Icon", "name": "sunny" },
{ "type": "Text", "value": "72°F", "style": "display" }
]}
]
}
Structured output constrains the model toward valid JSON that matches your schema. Providers differ in how strict this is, and even schema-valid output can contain wrong values, so still validate before acting.S9 There are three main approaches, and understanding the differences is critical:
Here's what a real function calling setup looks like. This is the exact pattern that generative UI would use:
// You send this to the API alongside your prompt:
{
"tools": [{
"type": "function",
"function": {
"name": "render_ui",
"description": "Generate a UI component tree for the user's request",
"parameters": {
"type": "object",
"properties": {
"root": {
"type": "object",
"properties": {
"type": { "enum": ["Card", "Column", "Row", "List"] },
"children": {
"type": "array",
"items": { "$ref": "#/$defs/Component" }
}
}
}
},
"required": ["root"]
}
}
}]
}
// The model's response is constrained to this schema:
{
"tool_calls": [{
"function": {
"name": "render_ui",
"arguments": {
"root": {
"type": "Card",
"children": [
{ "type": "Text", "value": "Weather", "style": "headline" },
{ "type": "Text", "value": "72°F Sunny", "style": "body" }
]
}
}
}
}]
}
A generative UI protocol is, at its core, a function calling schema for generating UI component trees. The schema defines what components exist (Card, Row, Column, Text, Button, Image...), what properties each has, and how they nest. The renderer — React on web, SwiftUI on iOS, Jetpack Compose on Android — maps these to native components. The model's job is to fill in the values. The tighter your schema, the more reliable the output. The looser, the more creative — but more likely to break.
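A sketch of the renderer half of that contract. It maps the component tree to HTML strings for brevity; a production renderer would map the same tree to React, SwiftUI, or Compose components. The component names and fields are illustrative, not a published schema:

```typescript
// A UI node is whatever the schema allows: a type, optional props, optional children.
type UINode = {
  type: string;              // "Card", "Row", "Text", ... (the model may emit types you don't expect)
  value?: string;
  name?: string;
  style?: string;
  children?: UINode[];
};

// Recursive renderer: walk the tree the model produced and map each node to markup.
// Unknown node types fall back to rendering their children, never to a crash.
function render(node: UINode): string {
  const kids = (node.children ?? []).map(render).join("");
  switch (node.type) {
    case "Card":   return `<section class="card">${kids}</section>`;
    case "Column": return `<div class="col">${kids}</div>`;
    case "Row":    return `<div class="row">${kids}</div>`;
    case "Text":   return `<span class="${node.style ?? "body"}">${node.value ?? ""}</span>`;
    case "Icon":   return `<i data-icon="${node.name ?? ""}"></i>`;
    default:       return kids; // graceful fallback for anything this schema version doesn't know
  }
}

const weatherCard: UINode = {
  type: "Card",
  children: [
    { type: "Text", value: "Weather Today", style: "headline" },
    { type: "Row", children: [{ type: "Icon", name: "sunny" }, { type: "Text", value: "72°F" }] },
  ],
};
console.log(render(weatherCard));
```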
This is where design instinct becomes a superpower. Schema design is UX design for machines:
{"type": "Card", "variant": "elevated"|"filled"|"outlined"}
✅ Always valid
✅ Predictable rendering
❌ Limited expressiveness
❌ Model can't improvise
{"type": "string", "style": "object"}
✅ Creative flexibility
✅ Can handle novel requests
❌ Might generate invalid UIs
❌ Harder to render reliably
Shopify's Sidekick uses function calling to let merchants manage their store via natural language. "Give me a 20% discount on winter jackets" triggers a structured tool call with exact parameters: { action: "create_discount", collection: "winter-jackets", percentage: 20 }. Free-text output would be useless — Shopify's backend needs machine-readable instructions.
Zapier's AI Actions connects ChatGPT to 6,000+ apps using structured output. When you say "add this to my Notion database," the model generates a structured API call that Zapier can execute. The schema for each integration is pre-defined — the model fills in the values.
Vercel's v0 generates React code from natural language descriptions. Under the hood, it uses structured output to produce a specific code format with metadata (component name, imports, props). The output isn't "creative writing that happens to be code" — it's schema-constrained generation optimized for parseability and rendering.
Not all models are created equal. Choosing which model to use for which task is one of the most impactful product decisions you'll make.
Every major AI provider offers a family of models at different capability/cost/speed tradeoffs. Think of it like cars: you don't drive an 18-wheeler to get groceries, and you don't use a Smart car to haul lumber.
Sophisticated AI products don't use a single model — they route requests to different models based on complexity. This is called model routing or cascading.
Model selection isn't a one-time decision — it's a runtime decision made for every request. A production AI product needs a routing layer: simple tasks go to a small model (Haiku, Flash, on-device), standard tasks go to a mid-tier model, and complex tasks go to a frontier model. Designing this routing logic is a core product and architecture decision.
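A sketch of what that routing layer can look like. The tier names, thresholds, and keyword heuristic are all stand-ins; production routers typically use a small classifier model rather than regexes:

```typescript
// Minimal routing sketch: classify the request, pick a model tier.
type Tier = "on-device" | "fast-cloud" | "frontier";

interface Request { text: string; needsTools: boolean; contextTokens: number; }

function route(req: Request): Tier {
  const hardSignals = /refactor|analyze|compare|multi-step|contract|architecture/i.test(req.text);
  if (!req.needsTools && req.contextTokens < 2_000 && !hardSignals) return "on-device"; // cheap, private, instant
  if (hardSignals || req.contextTokens > 50_000) return "frontier";                      // pay for quality
  return "fast-cloud";                                                                    // the default tier
}

console.log(route({ text: "Summarize this note", needsTools: false, contextTokens: 800 }));                          // on-device
console.log(route({ text: "Analyze this contract for liability risks", needsTools: true, contextTokens: 12_000 })); // frontier
```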
Perplexity routes queries across multiple models. Simple factual lookups go to a fast, cheap model. Deep research queries go to a frontier model. They built a classifier that evaluates query complexity in <50ms and routes accordingly — cutting their average cost per query by ~60% while maintaining quality where it matters.
Notion AI uses different models for different features: a lightweight model for autocomplete suggestions (speed matters most), a mid-tier model for summarization (balance of speed and quality), and a frontier model for complex writing tasks (quality matters most). Each feature has its own model selection, not one model for everything.
Samsung Galaxy AI on the S24/S25 series does exactly the on-device/cloud routing described here. Simple tasks (text summarization, live translate) run on-device via a smaller model. Complex tasks (generative edit in photos, chat assist) go to cloud. The user doesn't know or care which model is running — they just see the result.
Modern models read text, see images, hear audio, and watch video. The input box is no longer a box. The design problem is figuring out which modalities your product actually needs and which are demo candy.
Every chapter so far has been implicitly text-centric. But as of 2026, every frontier model is natively multimodal — processing text, images, audio, and sometimes video within a single inference call.
Multimodal AI enables new input patterns that were impossible with text-only models.
Multimodal doesn't just add input types — it changes the fundamental interaction model. Text-only AI is "describe your problem." Multimodal AI is "show me your problem." This is a massive reduction in friction for users who struggle to articulate complex visual or spatial information in words.
Google Lens evolved from a standalone visual search tool into Gemini's eyes. Circle to Search on Pixel/Samsung lets you highlight anything on screen and ask questions about it — multimodal inference running on what you see.
Be My Eyes (accessibility app) uses GPT-4o's vision to describe the world to blind users in real-time. A user points their phone camera and the model narrates what it sees. This was impossible before multimodal.
NotebookLM ingests entire PDFs, slides, and images as visual tokens. You can ask "what's the chart on page 7 showing?" and it answers based on the actual visual layout, not just extracted text.
An LLM by itself can only generate text. Function calling is how we give it hands — the ability to actually do things in the real world: check calendars, send messages, query databases, and generate UIs.
Imagine you hire a brilliant consultant who knows everything about everything — but they're locked in a room with no phone, no computer, and no internet. They can give you amazing advice, but they can't actually do anything. That's an LLM without function calling.
Function calling gives the consultant a phone. You tell them: "Here are the apps on this phone and what each one does." When they need to check something or take an action, they tell you which app to use and what to type in. You execute it, show them the result, and they continue their work.
The function calling lifecycle has exactly four steps: (1) you send the prompt along with definitions of the tools that exist, (2) the model responds with a request to call a specific tool with specific arguments, (3) your code executes the function and sends the result back, and (4) the model uses that result to continue, either with another tool call or with a final answer. Every agentic system — including generative UI — follows this pattern.
The model never actually executes functions. It generates a request to call a function. Your application code runs the function and feeds the result back. This is important for security (the model can't directly access your APIs without your code mediating) and for control (you can validate, log, rate-limit, or reject tool calls before executing them).
The concept is identical across providers, but the API syntax differs slightly:
// Defining tools
tools: [{
type: "function",
function: {
name: "get_weather",
description: "Get current weather for a city",
parameters: {
type: "object",
properties: {
city: { type: "string", description: "City name" }
},
required: ["city"]
}
}
}]
// Model response when it wants to call a tool:
{
"choices": [{
"message": {
"tool_calls": [{
"id": "call_abc123",
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"San Jose\"}"
}
}]
}
}]
}
// Defining tools
tools: [{
name: "get_weather",
description: "Get current weather for a city",
input_schema: {
type: "object",
properties: {
city: { type: "string", description: "City name" }
},
required: ["city"]
}
}]
// Model response when it wants to call a tool:
{
"content": [{
"type": "tool_use",
"id": "toolu_abc123",
"name": "get_weather",
"input": { "city": "San Jose" }
}]
}
// Defining tools
tools: [{
function_declarations: [{
name: "get_weather",
description: "Get current weather for a city",
parameters: {
type: "object",
properties: {
city: { type: "string", description: "City name" }
},
required: ["city"]
}
}]
}]
// Model response when it wants to call a tool:
{
"candidates": [{
"content": {
"parts": [{
"functionCall": {
"name": "get_weather",
"args": { "city": "San Jose" }
}
}]
}
}]
}
Notice the pattern: the schema definition is nearly identical (JSON Schema), but each provider wraps it differently. The model's response always contains: which function to call, and what arguments to pass. Your code handles the rest.
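Here is a sketch of the application-side handling for the OpenAI-style response shape shown above. The get_weather stub and the exact return shape are illustrative; the point is validate, execute, and feed the result back:

```typescript
// Application-side handling of a single tool call: validate, execute, feed back.
// The model only *requested* the call; this code decides whether it actually runs.
type ToolCall = { id: string; function: { name: string; arguments: string } };

const toolImplementations: Record<string, (args: any) => Promise<unknown>> = {
  get_weather: async ({ city }: { city: string }) => ({ city, tempF: 72, condition: "sunny" }), // stubbed
};

async function executeToolCall(call: ToolCall): Promise<{ tool_call_id: string; content: string }> {
  const impl = toolImplementations[call.function.name];
  if (!impl) {
    // Hallucinated tool: tell the model rather than crashing.
    return { tool_call_id: call.id, content: JSON.stringify({ error: `unknown tool ${call.function.name}` }) };
  }
  let args: unknown;
  try {
    args = JSON.parse(call.function.arguments); // OpenAI-style arguments arrive as a JSON string
  } catch {
    return { tool_call_id: call.id, content: JSON.stringify({ error: "arguments were not valid JSON" }) };
  }
  // Real code would also validate `args` against the tool's JSON Schema here,
  // and gate irreversible actions behind a user confirmation.
  const result = await impl(args);
  return { tool_call_id: call.id, content: JSON.stringify(result) };
}
```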
In generative UI systems, the "tools" aren't weather APIs — they're app capabilities. A fitness app might expose tools like log_workout, get_weekly_stats, set_goal. The agent calls these tools, gets the data back, and then generates a UI component tree to display the results. Generative UI is function calling where the final output is a rendered interface instead of text.
The model chooses which tool to call based entirely on the description field and parameter descriptions. Bad descriptions lead to wrong tool selection. This is a product/design decision:
"description": "Weather function"
Model doesn't know when to use it, might confuse it with a climate function or a forecast function.
"description": "Get the current temperature and conditions for a specific city. Returns temp in °F, condition (sunny/cloudy/rainy), and humidity percentage."
Model knows exactly what it gets back and when to use it.
ChatGPT Plugins (now GPT Actions) was one of the first mass-market implementations of function calling. When you ask ChatGPT to "find flights to Tokyo," it calls the Kayak plugin's search_flights function with structured parameters. Thousands of businesses built plugins — each one is just a function calling schema that lets GPT interact with their service.
Siri and Alexa were doing a primitive version of function calling before LLMs. "Set a timer for 5 minutes" maps to an intent (set_timer) with a slot (duration: 5min). The difference with LLM-based function calling is flexibility: you don't need to pre-define every possible phrasing. The model figures out the intent and extracts the parameters from any natural language input.
Anthropic's Claude introduced "computer use" tool calls — the model can call functions like click(x, y), type(text), and screenshot() to operate a desktop computer. Same function calling pattern, radically different tools. This is where agents start interacting with the physical world.
LLMs only know what was in their training data. RAG (Retrieval Augmented Generation) connects them to external knowledge at query time: your documents, your database, your company wiki.
Imagine you're taking an exam. A standard LLM takes it closed-book, answering from memory. RAG takes it open-book: before answering, it searches a library, pulls out relevant pages, reads them, and answers using both memory and the retrieved material. Most production AI assistants are open-book exams. The interesting work is in how you build and search the library.
RAG is usually drawn as four boxes. That hides where it actually breaks. Real systems have eight stages, split between work you do once at build time and work you do on every query.
The boringest stage in the pipeline is the one that decides whether RAG works. A chunk too big returns noisy passages with the answer buried inside. A chunk too small loses the surrounding context the model needs to interpret it. There's no universal right answer — different content shapes want different strategies.
Fixed-size chunking (sketched in code below): split every N tokens (typically 200–800), with overlap. Fast to build, predictable. Cuts mid-sentence, mid-table, mid-thought.
Best for: uniform prose like blog posts, marketing copy, news.
Structure-aware chunking: split on meaningful boundaries (paragraphs, headings, sections). Preserves the author's structure. Slower to build, harder to tune.
Best for: docs with strong structure — manuals, contracts, policy documents.
Small-to-large (parent document) retrieval: index small chunks for retrieval but return larger parents at generation time. Best of both: precise hits, full context.
Best for: long documents where the answer needs surrounding context — research papers, legal filings.
Code- and table-aware chunking: chunk by function, class, or table — never split a unit of code. Often paired with AST parsing.
Best for: codebases, API docs, structured data.
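The fixed-size strategy is small enough to sketch in full. Words stand in for tokens here; a real pipeline would count tokens with the model's own tokenizer:

```typescript
// Fixed-size chunking with overlap. Words stand in for tokens;
// swap in a real tokenizer's counts for production use.
function chunkFixed(text: string, chunkSize = 400, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached
  }
  return chunks;
}

// Overlap means the sentence that ends one chunk also starts the next,
// so a fact straddling a boundary is still retrievable from at least one chunk.
```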
Vector search alone is the 2023 baseline. The 2026 production stack adds three things:
Hybrid search combines dense vectors (good for meaning) with sparse keyword search like BM25 (good for exact matches: error codes, product SKUs, legal citations). Vectors miss "ERR_4032"; BM25 nails it. Run both, merge the results (a fusion sketch follows after this list). This single change usually beats any amount of tuning to vectors alone.
Reranking takes the top 20–100 results from the cheap retriever and re-scores them with a slower, more accurate cross-encoder model (Cohere Rerank, Voyage Rerank, or a fine-tuned encoder). Cross-encoders look at the query and document together, so they catch nuance that bi-encoder vectors miss. Typical lift: 10–30% on retrieval quality for the cost of one extra model call.
Query rewriting handles the gap between how users phrase questions and how documents phrase answers. HyDE (Hypothetical Document Embeddings) is the well-known move: ask the LLM to draft a hypothetical answer first, then embed and search using that. The drafted answer often shares more vocabulary with real documents than the original question did. For multi-turn chats, query rewriting also rolls earlier turns into a self-contained search query so retrieval doesn't lose context.
Knowledge graphs deserve a mention. When relationships matter more than passages — "who reports to whom," "what's connected to this incident" — a graph beats vectors. Most teams won't need this; the ones that do, know.
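One common way to merge the vector and keyword lists from hybrid search is reciprocal rank fusion: each list contributes a score based only on rank, so you never have to reconcile cosine similarities with BM25 scores. A sketch, with k=60 as the conventional constant and made-up document IDs:

```typescript
// Reciprocal rank fusion: merge two ranked lists of document IDs by rank position.
// score(doc) = sum over lists of 1 / (k + rank_in_list); k=60 is the usual constant.
function fuseRanks(vectorHits: string[], keywordHits: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const hits of [vectorHits, keywordHits]) {
    hits.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([docId]) => docId);
}

// The vector search understood the meaning; BM25 nailed the exact error code.
console.log(fuseRanks(
  ["doc_billing_faq", "doc_refund_policy", "doc_err_4032"],
  ["doc_err_4032", "doc_release_notes"],
)); // doc_err_4032 ranks first: it appears in both lists
```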
By 2026 these are three real options, not one. Picking the wrong one costs months. The shape of the choice:
| | RAG | Fine-tuning | Long-context (stuff it all in) |
|---|---|---|---|
| Solves | Knowledge the model lacks | Behavior, tone, format the prompt can't get right | Single large document at a time |
| Freshness | As fresh as your indexer | Frozen at training time | Whatever you paste in |
| Cost shape | Per-query retrieval cost; cheap to update | Up-front training cost; cheap inference | High per-query token cost |
| Fails when | Retrieval misses the right chunk | Use case shifts; data drifts | Corpus is too big or context is too noisy |
| Reach for it when | Corpus changes often or is large | Output style or domain language won't budge with prompting | Corpus is small, stable, and fits in 200K–1M tokens |
The 2026 default order: prompt engineering first, then long-context if the corpus fits, then RAG when it doesn't, then fine-tuning only if behavior is still off. Many teams skip straight to fine-tuning because it sounds sophisticated. Most regret it.
RAG creates UX problems pure chatbots don't have, and the design work is what separates trusted products from suspicious ones.
RAG quality is bottlenecked by retrieval, not generation. A frontier model with the wrong context produces a fluent wrong answer. A weaker model with the right context produces a useful right answer. When RAG feels broken, the fix is almost always upstream of the LLM: better chunks, hybrid search, a reranker.
Perplexity made citations the entire interface. Every claim is numbered and linked back to a source. Users trust it for research because they can verify, not because the model is special.
NotebookLM scopes RAG to documents you upload, never general training data. "Based on your sources" appears on every response. That scope clarity is the trust signal.
Cursor runs RAG over your codebase: it embeds your repo, retrieves relevant files for each request, and adds them to the context. The "intelligence" people praise is mostly retrieval quality, not the underlying model.
Glean and Elastic built enterprise search on RAG. The hard problem isn't retrieval — it's enforcing per-document access permissions so employees only see what they're allowed to see.
Cohere Rerank and Voyage Rerank dominate the reranker market. They're a single API call you bolt onto an existing vector search and they routinely deliver double-digit recall improvements. Most enterprise RAG stacks use one or the other.
A single function call is useful. But real agents call multiple functions in sequence, make decisions based on results, and adapt when things go wrong. This is the agentic loop.
The difference between a chatbot and an agent is simple: a chatbot responds once. An agent keeps going until the task is done.
Let's trace through a realistic generative UI agent scenario: the user says "Schedule dinner with Alex this Friday at a good restaurant near home."
check_calendar({ person: "Alex", date: "2026-04-10" }) → { free: true, available: ["6pm-9pm"] }
search_restaurants({ near: "home", cuisine: "any", rating: ">4.0" }) → [{ name: "Osteria", rating: 4.5 }, { name: "Sushi Gen", rating: 4.3 }, ...]
check_reservation({ restaurant: "Osteria", date: "2026-04-10", time: "7pm", party: 2 }) → { available: false, next_available: "8pm" }
check_reservation({ restaurant: "Osteria", date: "2026-04-10", time: "8pm", party: 2 }) → { available: true }
Four tool calls, each building on the last. The model maintained context across all of them, made decisions based on intermediate results, and adapted when the first time slot wasn't available. That's an agent.
Each iteration of the loop is a separate API call. The entire conversation history — including all previous tool calls and results — gets sent back to the model each time. This is why context windows matter so much: a complex 10-step agent task might consume thousands of tokens just in history before the model even starts thinking about the next step.
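The loop itself is compact enough to sketch. callModel and executeTool are placeholders for your provider SDK call and your own validated tool dispatch; everything else is the real shape of the loop:

```typescript
// The agentic loop: call the model, run whatever tools it requests, feed the
// results back, repeat until it answers in plain text or hits the iteration cap.
type ToolCall = { id: string; name: string; args: unknown };
type Message = { role: "user" | "assistant" | "tool"; content: string; toolCalls?: ToolCall[] };

async function runAgent(
  userRequest: string,
  callModel: (history: Message[]) => Promise<Message>, // wraps your provider's API
  executeTool: (call: ToolCall) => Promise<string>,    // your validated tool dispatch
  maxIterations = 10,                                  // agents must not loop forever
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userRequest }];
  for (let i = 0; i < maxIterations; i++) {
    const reply = await callModel(history);            // the full history goes back every time
    history.push(reply);
    if (!reply.toolCalls?.length) return reply.content; // no tool call: the task is done
    for (const call of reply.toolCalls) {
      history.push({ role: "tool", content: await executeTool(call) }); // observation for the next turn
    }
  }
  return "Stopped after the maximum number of steps without finishing the task.";
}
```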
Claude Code (Anthropic's coding agent) is a textbook agentic loop. You say "refactor this module to use dependency injection." It thinks ("I need to read the file first"), acts (reads the file), observes (sees the current structure), thinks again ("I see 3 classes that need interfaces"), acts (edits file 1), observes (checks for errors), and loops until all files are updated and tests pass. A single user request can trigger 20+ iterations of the loop.
Devin (the AI software engineer by Cognition) chains together even longer loops: reading GitHub issues → planning an implementation → writing code → running tests → debugging failures → committing. Each step feeds into the next. When tests fail, it doesn't just stop — it reads the error, reasons about the cause, and tries a fix. Some tasks run 50+ loop iterations.
Google's Deep Research (in Gemini) uses extended agentic loops for research. It searches the web, reads articles, identifies gaps in its knowledge, searches again with refined queries, synthesizes findings, and produces a report. One research question can trigger dozens of search-read-think cycles over several minutes.
Function calling lets a model use tools. But who decides which tools exist and how to connect to them? That's the problem MCP solves.
MCP (Model Context Protocol) is an open standard created by Anthropic that standardizes how AI models discover and use tools across any application. Think of it as USB for AI.S4
Before USB: Every device had its own cable. Your printer had a parallel port cable. Your mouse had a PS/2 connector. Your camera had a proprietary cable. If you wanted to connect a new device, you needed to find the right cable and install a custom driver.
After USB: One port, one standard. Plug anything in and it works. The computer asks "what are you?" and the device says "I'm a keyboard" or "I'm a camera" and they negotiate automatically.
MCP is USB for AI. Instead of every app building custom integrations with every AI model, MCP provides one standard protocol. An AI agent asks "what can you do?" and the app says "here are my functions." The agent can immediately use them.
An MCP server provides three things to the AI model: tools (functions the agent can call, such as create_event or search_files), resources (data the agent can read), and prompts (reusable prompt templates).
Here's what a simple MCP server for a fitness app looks like:
// MCP Server: Fitness Tracker
{
"name": "fitness-tracker",
"version": "1.0",
"tools": [
{
"name": "log_workout",
"description": "Record a completed workout session",
"inputSchema": {
"type": "object",
"properties": {
"exercise": { "type": "string", "description": "e.g. 'bench press'" },
"sets": { "type": "number" },
"reps": { "type": "number" },
"weight_lbs": { "type": "number" }
},
"required": ["exercise", "sets", "reps"]
}
},
{
"name": "get_weekly_summary",
"description": "Get workout stats for the current week",
"inputSchema": {
"type": "object",
"properties": {
"week_offset": {
"type": "number",
"description": "0 = this week, -1 = last week"
}
}
}
},
{
"name": "set_goal",
"description": "Set a fitness goal for a specific exercise",
"inputSchema": {
"type": "object",
"properties": {
"exercise": { "type": "string" },
"target_weight": { "type": "number" },
"target_date": { "type": "string", "format": "date" }
},
"required": ["exercise", "target_weight"]
}
}
]
}
MCP defines what an app can DO. A generative UI protocol defines what the result LOOKS LIKE.
An app exposes its capabilities via MCP ("I can log workouts, show summaries, set goals"). When an agent calls those tools, generative UI renders the results as native components ("here's a card showing your weekly summary with a progress bar toward your goal").
Together, these standards mean: every app becomes agent-accessible with native, beautiful UI — without the app developer building a custom AI integration. That's the platform pattern emerging across the industry.
Anthropic launched MCP in late 2024 and adoption has been rapid. As of early 2026, there are MCP servers for Slack, GitHub, Google Drive, Notion, Linear, Jira, Figma, Postgres databases, and hundreds more. Claude Desktop, Cursor, Windsurf, and other AI tools can connect to any MCP server — one protocol, instant integration.
Block (Square) and Apollo were early enterprise adopters, building internal MCP servers so their AI tools could interact with proprietary systems. Instead of building custom ChatGPT plugins AND custom Claude integrations AND custom Gemini integrations, they build one MCP server and it works everywhere.
Figma's MCP server lets AI agents read design files, inspect components, and even generate code from designs — all through standard MCP tool calls. This is the "USB for AI" vision in action: Figma implements MCP once, and every AI tool that speaks MCP can now interact with Figma designs.
Google Stitch shipped an MCP server in 2026, letting external AI agents interact with Stitch design projects programmatically. This shows how quickly MCP is becoming the default integration layer — even AI design tools are adopting it.
There's more than one way to build an agent. The orchestration pattern you choose shapes everything: reliability, speed, cost, and user experience.
ReAct (reason + act): this is the pattern from Chapter 8 — the model alternates between thinking and acting. It's the most common and most flexible pattern.
Parallel tool calls: when multiple independent tools need to be called, a smart agent calls them all at once instead of sequentially:
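A sketch of the difference, with made-up tool names; the only change from the sequential version is wrapping the independent calls in Promise.all:

```typescript
// Sequential: three independent lookups take the sum of their latencies.
// Parallel: they take roughly the slowest one. Tool names are illustrative.
async function gatherTripContext(city: string, date: string) {
  const [weather, calendar, restaurants] = await Promise.all([
    callTool("get_weather", { city, date }),
    callTool("check_calendar", { date }),
    callTool("search_restaurants", { near: city, rating: ">4.0" }),
  ]);
  return { weather, calendar, restaurants }; // hand all three results back to the model in one turn
}

// Stand-in for the real tool dispatch from the function calling chapter.
async function callTool(name: string, args: Record<string, unknown>): Promise<unknown> {
  return { tool: name, args, ok: true };
}
```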
Routing: a lightweight model classifies the request and routes it to specialized handlers.
In production, most agents use a combination of these patterns. The router picks the right model, that model uses ReAct for complex tasks with parallel tool calls where possible. Designing this orchestration logic — deciding which pattern for which scenario — is a core product decision that shapes the user experience.
Uber's customer support AI uses a router pattern: a fast classifier determines if the query is about a ride issue, a payment issue, or an Eats issue, then routes to a specialized agent for each domain. Each specialized agent has its own tool set and system prompt optimized for that domain. This is cheaper and more accurate than one monolithic agent handling everything.
LangChain and LlamaIndex popularized orchestration frameworks that make these patterns composable. LangChain's "agent executor" implements the ReAct loop. Their "sequential chain" implements linear pipelines. Their "router chain" implements the routing pattern. These frameworks exist because orchestration is hard enough to warrant dedicated tooling.
OpenAI's Assistants API handles orchestration server-side — you define tools and the API manages the think-act-observe loop for you, calling your functions and feeding results back automatically. This is a bet that most developers don't want to build their own orchestration layer — they just want to define tools and let the platform handle the rest.
The primer has covered how agents work mechanically. This chapter covers what happens when you ship them to real users — the UX patterns, trust frameworks, and protocols that make agents usable.
57% of organizations now have agents in production. But "production" doesn't mean "autonomous." The biggest lesson from 2025-2026: users want agents that are powerful but controllable. The UX challenge is designing the right level of autonomy for each context.
A new category: agents that literally see and control screens. Claude Computer Use operates a full macOS desktop. OpenAI's Operator controls a remote browser. Google's Project Mariner works inside Chrome. These agents take screenshots, click buttons, type text, and navigate apps just like a human would.
The UX challenge is unique: the user watches their screen being controlled by an AI. This requires real-time observation (screen sharing), permission gates before sensitive actions, and a kill switch to stop the agent immediately.
A multi-step agent fails in ways a single API call doesn't. It picked the wrong tool. It called the right tool with bad arguments. It looped. It silently degraded after a model upgrade. The only way to debug any of this is a trace — a structured log of every model call, every tool call, every input, every output, in order. By 2026 this is standard infrastructure: each step gets a span, the trace tree shows the full reasoning path, and you can replay a failing run in isolation. LangSmith, Braintrust, and Langfuse are common platforms; OpenTelemetry's GenAI semantic conventions define emerging shared fields for model calls, tool calls, token usage, latency, and errors.S7 The headline rule: if you can't replay a bad run with the exact same inputs, you can't fix it. Build trace capture before you build the second tool.
Intercom's Fin is one of the most successful customer service agents in production. It resolves 50%+ of support tickets autonomously but escalates to human agents for complex cases, a textbook confidence-based escalation pattern.
Replit Agent builds entire applications from natural language. It shows its plan (intent preview), executes steps one at a time (audit trail), and asks for approval before deploying (autonomy gate). Users can see every file it creates and modify any step.
A2A (Agent-to-Agent) is the emerging open protocol for agents to delegate work to other agents — one agent can hand off subtasks to specialized peers without bespoke integration code. Alongside MCP (agent-to-tools) and AG-UI (agent-to-frontend), these three protocols form the infrastructure layer for multi-agent systems.
Agents fail. APIs time out, models hallucinate, tool calls return unexpected data. How you handle failure defines the user experience.
In a traditional app, errors are predictable: network error, invalid input, server down. In an agentic system, you get entirely new failure modes:
| Failure Type | Example | How to Handle |
|---|---|---|
| Wrong tool selection | Agent calls send_email when user wanted send_message | Confirmation step before executing irreversible actions |
| Invalid arguments | Agent passes "date": "next Friday" instead of "2026-04-10" | Validate arguments against schema before executing; ask model to retry with correct format |
| Tool execution failure | Restaurant API is down | Return structured error to model; let it try alternatives or inform user |
| Hallucinated tool | Agent tries to call book_flight but no such tool exists | Validate tool name before execution; return "tool not found" to model |
| Infinite loop | Agent keeps retrying a failed action | Set max iteration count (e.g., 5 loops max); break and inform user |
| Schema violation in output | Generative UI output has invalid component nesting | Validate against schema; show fallback UI; log for monitoring |
A well-designed agentic UI should degrade gracefully, stepping down through simpler and simpler fallbacks instead of failing outright:
Your generative UI schema should include first-class error and loading states. A component that says "state": "loading" renders a skeleton screen. "state": "partial" renders available data with placeholders. "state": "error" renders a retry card. These aren't afterthoughts — they're the most important states to design because they're what users see when things go wrong (which is often).
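One way those states can look in a schema, sketched as a discriminated union. The field names are illustrative, not a published protocol:

```typescript
// Loading, partial, and error are schema states, not exceptions.
// The renderer always has something sensible to draw.
type ComponentState =
  | { state: "ready"; component: { type: string; children?: unknown[] } }
  | { state: "loading"; skeleton: "card" | "list" | "row" }                 // render a skeleton screen
  | { state: "partial"; component: { type: string }; missing: string[] }    // show what arrived, placeholder the rest
  | { state: "error"; message: string; retryable: boolean };                // render a retry card

function renderFallback(node: ComponentState): string {
  switch (node.state) {
    case "ready":   return `render ${node.component.type}`;
    case "loading": return `skeleton:${node.skeleton}`;
    case "partial": return `render ${node.component.type} with placeholders for ${node.missing.join(", ")}`;
    case "error":   return node.retryable ? "retry card" : "static error card";
  }
}
```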
ChatGPT's browsing feature frequently hits websites that block it. Instead of crashing, it tells the user "I wasn't able to access that site" and offers to try alternative sources. This conversational fallback pattern — admit the failure, explain why, offer alternatives — is the baseline every agentic product should hit.
Tesla Autopilot is the hardware analogy for graceful degradation. Full self-driving → lane keeping → adaptive cruise → manual control. Each level is a fallback when the one above it can't handle the situation. It never just stops working — it degrades to a less capable but still functional mode and alerts the driver.
Alexa's confidence thresholds show a different approach: when the model's confidence in its interpretation is below a threshold, it asks for confirmation instead of acting. "Did you mean turn off the bedroom lights?" This is cheaper and safer than executing a wrong action and having to undo it. For generative UI, a confirmation card before irreversible actions follows the same principle.
Notion AI handles hallucination risk by always including an "AI-generated" badge on its outputs and providing the source material alongside the summary. This UI-level pattern — flagging uncertainty visually — is something generative UI should consider as a first-class component state.
Guardrails are the protective systems that prevent AI from generating harmful content, leaking private data, or acting beyond its intended scope. In 2026, they're also a regulatory requirement.
An LLM without guardrails will attempt anything you ask. Guardrails constrain it — blocking harmful content, filtering personal data, preventing jailbreaks, and keeping the AI focused on its intended task. Think of them as the brakes on a very powerful car.
When a guardrail triggers, the user sees... something. What they see is a design decision that directly affects trust:
Four principles for trustworthy AI design: Transparency (users know they're interacting with AI), Proportionality (restrictions match the risk level), Reversibility (actions can be undone), and Contestability (users can challenge AI decisions). These aren't just good design — they're increasingly legal requirements.
A model scoring 90% on a benchmark might still frustrate real users. Evaluation is how you close the gap between measured performance and experienced quality.
Quality is the #1 barrier to production AI — cited by 32% of teams as their top challenge. "Running evals" is the AI equivalent of usability testing: you systematically check whether the system works for real scenarios, track scores over time, and use the data to decide what to ship.
This is the hardest and most important step. Writing a grading rubric is the same skill UX researchers use when creating annotation guides for usability studies — and it has the same failure mode: a vague rubric produces noisy, irreproducible scores no matter how good the grader is.
// Example rubric for a customer support bot
{
"accuracy": {
"5": "Correct answer with all relevant details",
"3": "Partially correct, some wrong info",
"1": "Completely wrong or hallucinated"
},
"helpfulness": {
"5": "Fully resolved the user's issue",
"3": "Some useful info but didn't resolve",
"1": "Useless or made things worse"
}
}
Using a strong model to grade a weaker model's output is now standard. It scales. It's cheap relative to humans. And it's full of biases that quietly invalidate your scores if you're not careful.
Two practical defaults: require chain-of-thought from the judge ("explain your reasoning before giving a score") — it forces the judge to actually engage with the rubric instead of pattern-matching. And calibrate against humans periodically: have humans grade 50–100 examples, compare to the judge, and treat low-agreement criteria as untrustworthy until you fix the rubric.
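A sketch of an LLM-as-judge call built on the rubric above. callModel is a placeholder for whichever provider SDK you use, and the JSON output shape is our own convention, not a standard:

```typescript
// Build the grading prompt for an LLM judge: rubric plus a chain-of-thought requirement.
const RUBRIC = {
  accuracy:    { 5: "Correct answer with all relevant details", 3: "Partially correct, some wrong info", 1: "Completely wrong or hallucinated" },
  helpfulness: { 5: "Fully resolved the user's issue",          3: "Some useful info but didn't resolve", 1: "Useless or made things worse" },
};

interface Judgment { reasoning: string; accuracy: number; helpfulness: number; }

function buildJudgePrompt(userQuery: string, botAnswer: string): string {
  return [
    "You are grading a customer support bot's answer.",
    `Rubric: ${JSON.stringify(RUBRIC)}`,
    `User query: ${userQuery}`,
    `Bot answer: ${botAnswer}`,
    // Reasoning first: forces the judge to engage with the rubric before scoring.
    'Respond with a single JSON object: {"reasoning": "...", "accuracy": 1-5, "helpfulness": 1-5}. Write your full reasoning in the reasoning field before the scores.',
  ].join("\n\n");
}

async function judge(callModel: (prompt: string) => Promise<string>, query: string, answer: string): Promise<Judgment> {
  const raw = await callModel(buildJudgePrompt(query, answer)); // callModel = your provider SDK
  return JSON.parse(raw) as Judgment;
}
```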
Every model upgrade — Sonnet 4.5 to 4.6 to 4.7, GPT-4o to GPT-5, Gemini 2.5 to whatever's next — is a behavior change in production. Sometimes it's a big upgrade. Sometimes a regression on the queries that matter most to you. The only way to know is to keep a frozen golden set and re-run it on every new model.
The minimum viable version: ~200 examples that represent your real query distribution (sampled from production logs, scrubbed), with expected outputs or rubric scores. On every model upgrade or prompt change, re-run the set, diff the per-example scores against the previous run, and flag regressions before they ship. Most eval platforms (Braintrust, LangSmith, Langfuse) make this a one-click operation.
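The diff step is small enough to sketch end to end. The example IDs and scores are invented; the scores are whatever your grader produced per golden-set example:

```typescript
// Compare the new run against the previous run, per example, and flag regressions.
// Scores are whatever your grader emits (rubric points, pass/fail as 1/0, etc.).
type Run = Record<string, number>; // exampleId -> score

function findRegressions(previous: Run, current: Run, tolerance = 0.5): string[] {
  return Object.keys(previous).filter(id => (current[id] ?? 0) < previous[id] - tolerance);
}

const lastRelease: Run     = { "refund-policy-q1": 5, "billing-err-4032": 4, "cancel-subscription": 5 };
const candidateModel: Run  = { "refund-policy-q1": 5, "billing-err-4032": 2, "cancel-subscription": 5 };

// "billing-err-4032" regressed from 4 to 2: block the upgrade until you know why.
console.log(findRegressions(lastRelease, candidateModel)); // ["billing-err-4032"]
```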
Two non-obvious things:
Offline evals tell you the model is correct. Online evals tell you the product works. They measure different things and you need both.
Drop the "types of evaluation" framing — it's the wrong axis. The right axis is: what dimension of quality are you measuring, and which method gives you the cheapest reliable signal on it?
| Dimension | What it answers | Cheapest reliable method |
|---|---|---|
| Correctness | Did it produce the right answer? | Automated checks against labeled examples |
| Helpfulness | Did it actually solve the user's problem? | LLM-judge with a rubric, audited against humans |
| Safety | Did it avoid harmful, off-policy, or sensitive output? | Automated guardrails + adversarial test set |
| Latency & cost | Is it fast and affordable enough? | Production telemetry (TTFT, p50/p95, $/task) |
| Real-world impact | Are users better off because of it? | Online A/B tests on outcome metrics |
Benchmark scores measure the model. Custom evals measure your product. A model that scores 95% on MMLU can produce a terrible support bot if the 5% failures land on the queries your users care about most. Build evals on the queries you actually see in production.
Braintrust, LangSmith, and Langfuse dominate the eval-platform market. They handle test-set runs, grading, regression tracking, and online tracing in one place. Most production AI teams pick one and never look back.
OpenAI, Anthropic, and Google all run frozen internal eval suites against every model release. The public benchmarks (MMLU, SWE-bench, HumanEval) are a small fraction of what they actually measure. The real evals are private and use-case specific — exactly the kind you should be building.
METR's 2025 study found experienced developers using AI coding tools were 19% slower, despite believing they were 20% faster.S6 The perception gap is exactly why offline accuracy and online outcomes both have to be measured. Either one alone lies.
Mobile platforms are increasingly running models both in the cloud AND on the device itself. Understanding this dual architecture is essential for anyone designing AI-powered experiences.
Every AI request faces a three-way tradeoff between latency, quality, and privacy. Whether you're building for mobile, web, or desktop — there is no option that wins on all three.
| | On-Device (Nano, Phi, Apple) | Cloud — Fast (Flash, Haiku) | Cloud — Frontier (Pro, Sonnet) |
|---|---|---|---|
| Latency | ~50–200ms | ~200–500ms | ~1–3s |
| Context Window | ~4–32K tokens | ~1M tokens | ~1M tokens |
| Cost | Free (runs on device) | Very low | Moderate |
| Privacy | Data never leaves phone | Data sent to server | Data sent to server |
| Offline | Yes | No | No |
| UI Generation | Simple components only | Standard layouts | Complex, multi-component |
| Best For | Quick actions, autocomplete, simple classification | Most generative UI tasks | Complex reasoning, multi-step agents |
Your generative UI protocol needs to work across this entire spectrum. That means: compact schemas that fit in Nano's small context window, graceful degradation when the on-device model can't handle a complex layout, and a clear escalation path from on-device → cloud when needed. This is a core architectural decision that shapes the entire protocol design.
Apple Intelligence implements a tiered approach almost identical to what generative UI needs. Simple tasks (notification summaries, smart reply suggestions, text proofreading) run entirely on-device via Apple's ~3B parameter model. Complex tasks (image generation with Image Playground, deep writing assistance) route to Apple's "Private Cloud Compute" servers. The decision happens automatically — the user never chooses.
On-device call screening (available on Pixel and Galaxy devices) is a pure on-device success story. A local model transcribes the caller's speech in real-time — no network needed. It works because the task (speech→text for a short utterance) fits comfortably within an on-device model's capability. This is the kind of scoped, well-defined task that on-device excels at.
Samsung Galaxy AI's Live Translate runs on-device for real-time phone call translation. The latency requirement (sub-200ms) makes cloud infeasible. But for complex features like Chat Assist (rewriting messages in different tones), they route to cloud because tone-shifting requires more sophisticated reasoning than the on-device model can handle.
Spotify's DJ feature uses cloud models to generate the DJ's commentary (creative, personalized text) but on-device models for the voice synthesis (latency-critical). Splitting one feature across on-device and cloud models — each doing what it's best at — is a pattern you'll use in generative UI.
Every AI product lives inside a triangle: quality, speed, and cost. You can optimize two at the expense of the third. Every design decision shifts the balance.
| Metric | Threshold | Why It Matters |
|---|---|---|
| Time to first token (TTFT) | < 200ms ideal | Users perceive streaming responses as 40-60% faster than waiting |
| Output token cost | 3-8x input cost | Every word the AI writes costs more than every word it reads |
| Streaming | Non-negotiable | Show tokens as they generate — never make users stare at a spinner |
| Prompt caching | 50-90% savings | Reusing system prompts across calls is the easiest cost win |
Prompt caching is the rare optimization that's both massive and free. By 2026 every major provider supports it (Anthropic, OpenAI, Google), and any production app with a non-trivial system prompt that isn't using it is leaving 50–90% of input cost on the table. It deserves more than a bullet point.
What it actually does. Every API call has a prefix that almost never changes — the system prompt, tool definitions, few-shot examples, sometimes a long retrieved document. The provider runs that prefix through the model once and stores the resulting internal state on their side. When your next call arrives within the cache window, the provider skips re-processing the prefix and starts from the cached state. You're billed at roughly 10% of the normal input rate for the cached portion.
What's cacheable. Anything that's identical across calls and lives at the start of the prompt: system instructions, tool/function definitions, few-shot examples, large retrieved documents shared across users. Anything that varies per user — chat history, the user's question — has to come after the cacheable prefix.
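What opting in looks like, using Anthropic's explicit cache_control marker as the example because it makes the prefix boundary visible (OpenAI and Gemini apply prefix caching more automatically). Field names follow Anthropic's documented request shape at the time of writing; the model name and prompt contents are placeholders, so verify against current docs:

```typescript
// Sketch of an Anthropic-style request body with an explicit cache breakpoint.
// Everything up to and including the cache_control block (tool definitions plus
// the system prompt) is the stable prefix; the per-user message comes after and
// is never cached.
const LONG_SYSTEM_PROMPT = "You are the support assistant for ..."; // imagine ~2,000 tokens here
const TOOL_DEFINITIONS: unknown[] = [];                             // your real tool schemas

const requestBody = {
  model: "claude-sonnet-latest",              // placeholder model name
  max_tokens: 1024,
  tools: TOOL_DEFINITIONS,                    // stable across calls
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,               // the prompt you reuse on every call
      cache_control: { type: "ephemeral" },   // cache everything up to and including this block
    },
  ],
  messages: [
    { role: "user", content: "Why was I charged twice this month?" }, // varies per request
  ],
};
```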
The 5-minute TTL gotcha. Most providers expire idle caches after ~5 minutes. A product with steady traffic gets near-100% cache hits and pockets the savings. A product with bursty or low traffic mostly pays for misses. If your traffic is uneven, batch user requests through a small pool of "warm" sessions, or consider Anthropic's longer-TTL extended cache (1 hour) for high-leverage prompts.
Worked example. A support bot with a 2,000-token system prompt, processing 1,000 messages/hour at ~80% cache hit rate on Sonnet 4.6: caching saves roughly $50/day, or $1,500/month — roughly 70% of what you'd otherwise spend on input tokens for that prompt (an 80% hit rate at a ~90% discount). That's an extra ~$18K/year in margin, gained by adding one parameter to your API call.
Prompt caching vs KV caching. They sound similar and people conflate them. KV caching is what happens inside a single generation — the model caches its own intermediate state so it doesn't recompute earlier tokens as it generates each new one. It's automatic and you don't think about it. Prompt caching is what happens across calls — the provider caches your prompt prefix between requests, billed to you. It's opt-in and you absolutely think about it.
In Chapter 2 we defined inference as the process of using a trained model to generate output. Every response your product generates is an inference call. The AI industry has developed several techniques to make these calls faster and cheaper — and understanding them helps you make product architecture decisions.
An important distinction: the company that trained a model isn't always the company that serves it. Meta trains Llama, but you can run Llama inference through AWS Bedrock, Together AI, Fireworks, Groq, or your own servers. This decoupling matters because it lets you shop for inference on price, latency, and reliability without changing which model you use.
Inference is where all the money flows in production AI. Training is a one-time cost borne by the model lab. Inference is an ongoing cost borne by every product using the model. When someone says "AI is expensive," they mean inference is expensive. When someone says "AI is getting cheaper," they mean inference prices are dropping (they fell roughly 80% between early 2025 and early 2026). Every optimization in this section — caching, speculation, quantization — exists to make inference cheaper and faster.
Every design decision is a cost decision. Longer system prompts mean more input tokens. Verbose AI responses mean more output tokens, which cost 3–8x more. Model routing (Chapter 7) is the biggest lever after caching: send simple tasks to cheap models, complex tasks to expensive ones. A well-designed routing system can cut costs 30–50% with no perceptible quality loss.
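A routing layer doesn't have to be sophisticated to pay for itself. Here's a minimal sketch of the idea — the heuristic and model names are placeholders; production routers usually rely on a small classifier model or task metadata rather than string matching.

```typescript
// Minimal model-routing sketch. Model IDs and the "looksSimple" heuristic
// are illustrative — swap in your own routing signal.
type Route = { model: string; maxTokens: number };

function routeRequest(task: string): Route {
  const looksSimple =
    task.length < 400 && !/analy|plan|debug|refactor|multi-step/i.test(task);

  return looksSimple
    ? { model: "small-fast-model", maxTokens: 512 }  // cheap tier
    : { model: "frontier-model", maxTokens: 4096 };  // expensive tier
}

// Usage: pick the route, then make the API call with route.model.
const route = routeRequest("Summarize this ticket in one sentence.");
console.log(route.model); // "small-fast-model"
```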
By 2026 open models — Llama, Mistral, DeepSeek, Qwen — are competitive with closed frontier models on most non-frontier tasks. That makes "should we self-host?" a real question instead of an obvious no. The answer is still usually no, but the exceptions matter.
| Self-hosting beats APIs when… | APIs beat self-hosting when… |
|---|---|
| You have very high steady volume (millions of requests/day) and unit economics dominate everything else. | You have variable or low volume — GPU utilization tanks, ops cost dominates. |
| Data residency or compliance forbids sending data to third parties (healthcare, defense, regulated finance). | You can use a regional or VPC-deployed API endpoint instead — most providers offer this now. |
| You're fine-tuning heavily and need full control over weights and training. | Provider fine-tuning APIs (LoRA endpoints) cover the use case. |
| You need a model the labs don't sell — a specific size, an older checkpoint, an embedding model with custom tokenizer. | You can pick from a frontier model + a cheap routing model and that's enough. |
| Latency is so tight that even a colocated API endpoint isn't fast enough (Groq, Cerebras territory). | 200–500ms TTFT from a hosted API is acceptable. |
The hidden cost of self-hosting isn't the GPUs. It's the ops team. Running production inference well — autoscaling, monitoring, model upgrades, security patching, handling traffic spikes — is a full-time SRE function. Most teams that try it eventually move back to a hosted inference provider (AWS Bedrock, Together AI, Fireworks, Groq) which gives them the open-model portability without the ops burden. The genuinely-self-hosted population is small and specialized.
Streaming is the single biggest UX win. ChatGPT, Claude, and Gemini all stream tokens as they generate. Users read faster than models write, so streaming feels interactive rather than like waiting. Products that show a loading spinner until the full response is ready feel dramatically slower, even when total latency is identical.
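A minimal streaming sketch using the OpenAI Node SDK (model ID illustrative): the only change from a blocking call is `stream: true` and a loop that writes tokens as they arrive.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Stream tokens to the user as they arrive instead of waiting for the full
// completion — same total latency, dramatically better perceived speed.
const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini", // illustrative
  messages: [{ role: "user", content: "Explain prompt caching in two sentences." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```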
Prompt caching is now standard across major providers. OpenAI documents up to 90% input-token cost reductions and latency wins when repeated prompt prefixes hit the cache.S8 The teams that wire this in early usually get a cost win without changing the product.
Groq and Cerebras serve open models on custom silicon at speeds frontier APIs can't match — hundreds of tokens per second on Llama-class models. For latency-bound use cases (live voice, autocomplete), they're a different category of product, not just a cheaper one.
Cursor's model routing sends most autocomplete to a small fast model, escalates harder requests to a frontier model, and uses Anthropic's longer-TTL caching on the codebase context. Three optimizations stacked: routing, caching, model size. None of them are visible to the user.
Everything in this book converges here. Generative UI takes an LLM's structured output and transforms it into rendered interface components — React on web, SwiftUI on iOS, Jetpack Compose on Android, or any other renderer.
Let's trace the full journey — from a user's voice to pixels on screen — using everything we've learned:
At its core, generative UI defines a tree of components. Each component has a type, properties, and optional children. The model generates this tree, and a platform renderer turns it into native UI:
// generative UI response for "Show my workout stats"
{
"root": {
"type": "Column",
"children": [
{
"type": "Text",
"value": "This Week's Workouts",
"style": "headlineMedium"
},
{
"type": "Card",
"variant": "elevated",
"children": [
{
"type": "Row",
"mainAxisAlignment": "spaceBetween",
"children": [
{ "type": "Text", "value": "Sessions", "style": "labelLarge" },
{ "type": "Text", "value": "4 of 5", "style": "bodyLarge" }
]
},
{
"type": "LinearProgressIndicator",
"progress": 0.8,
"color": "primary"
}
]
},
{
"type": "Card",
"variant": "outlined",
"children": [
{ "type": "Text", "value": "Top Exercise", "style": "labelLarge" },
{ "type": "Text", "value": "Bench Press — 185 lbs × 5", "style": "titleMedium" },
{
"type": "Button",
"label": "View Details",
"action": { "type": "navigate", "target": "workout_detail" }
}
]
}
]
}
}
Notice: the component names (Column, Card, Row, Text, Button, LinearProgressIndicator) map to standard UI primitives available in any framework — React, SwiftUI, Compose, Flutter. The style tokens (headlineMedium, labelLarge) map to a design system. The renderer walks this tree and emits native components for whatever platform you target.
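A renderer can be surprisingly small. The sketch below (React, heavily simplified — the registry entries and styling are placeholder assumptions, and a real renderer validates against the schema and maps style tokens to a design system first) shows the core move: look up each node's `type` in a component registry, render its children recursively, and drop anything unknown.

```tsx
import React from "react";

// A node in the generated tree: a type, optional children, arbitrary props.
type UINode = { type: string; children?: UINode[]; [prop: string]: unknown };

// Registry mapping schema component names to concrete implementations.
// These mappings are simplified placeholders, not a real design system.
const registry: Record<string, (node: UINode, kids: React.ReactNode) => React.ReactNode> = {
  Column: (_n, kids) => <div style={{ display: "flex", flexDirection: "column" }}>{kids}</div>,
  Row: (_n, kids) => <div style={{ display: "flex", justifyContent: "space-between" }}>{kids}</div>,
  Card: (_n, kids) => <section className="card">{kids}</section>,
  Text: (n) => <span>{String(n.value ?? "")}</span>,
  Button: (n) => <button>{String(n.label ?? "")}</button>,
};

function renderNode(node: UINode, key?: React.Key): React.ReactNode {
  const render = registry[node.type];
  if (!render) return null; // unknown component types are dropped, not crashed on
  const kids = node.children?.map((child, i) => renderNode(child, i));
  return <React.Fragment key={key}>{render(node, kids)}</React.Fragment>;
}
```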
The schema layer — between the model's output and the rendered interface — is where UX decisions live. Engineers understand the rendering. ML engineers understand the model. The schema in the middle is where UX meets AI constraints. Which components to include, how to handle responsive layouts, what error states to support, how to balance expressiveness with reliability — these are judgment calls that require understanding UX, AI constraints, AND the component system. This is the new design surface.
Vercel's v0 is the closest public analog to generative UI — it takes natural language prompts and generates React/Next.js UI components. Under the hood, it produces a structured component tree (JSX) from an LLM, just like generative UI produces a JSON tree for Compose or any other renderer. v0 proved the concept works commercially: developers pay for AI-generated UI built on the same structured-output-to-rendered-component pipeline.
Google Stitch generates UI from natural language prompts as HTML/CSS — a strong signal that designers want to describe interfaces conversationally and get rendered output. Tools like Antigravity then convert HTML to React. The pattern is clear: AI generates a description, a protocol standardizes it, and a renderer turns it into platform-native UI.
App Intents is Apple's version of this pipeline for SwiftUI. When Siri handles a request, App Intents defines the structured data, and SwiftUI renders the result as a native widget or Live Activity. Apple has a complete pipeline from voice → structured intent → native UI. Every platform is converging on this pattern: structured AI output → platform-native rendering.
Microsoft's Copilot in Microsoft 365 generates "Adaptive Cards" — a JSON-based UI format that renders natively in Teams, Outlook, and other Microsoft apps. This is structurally identical to generative UI: a JSON schema defines the component tree, a renderer turns it into native UI. Adaptive Cards has been in production since 2017 and handles billions of renders. It proves the pattern works at scale.
Out of the box, an LLM is a generalist. Skills, instructions, and project configs are how you turn it into a specialist that knows your tools, your conventions, and your workflow.
Every major AI provider has built a system for customizing how their models behave. The names differ but the core idea is the same: give the model persistent context about how you want it to work, not just what you want it to do right now.
Think of it as a spectrum from simple to powerful:
The simplest form of customization. A system prompt is a set of instructions sent to the model before the user's message. Every API call can include one. It tells the model who it is and how to behave.
// System prompt example
{
"system": "You are a senior UX researcher. When analyzing user feedback,
always identify the underlying need behind the stated request.
Format findings as: Observation → Insight → Recommendation.",
"messages": [
{ "role": "user", "content": "Users keep asking for a dark mode toggle." }
]
}
System prompts are powerful but have a key limitation: they're ephemeral. Every new conversation starts from scratch unless you manually include the system prompt again. They also eat into your context window (Chapter 3) since they're sent with every message.
The next step up: save your system prompts as reusable personas that persist across conversations. Each provider has a different name for this:
| Provider | Feature | What It Does |
|---|---|---|
| OpenAI | Custom GPTs | Package a system prompt + tools + knowledge files into a shareable persona. "GPT that reviews design specs against WCAG guidelines." |
| Google | Gems | Custom Gemini personas with persistent instructions. "A Gem that writes PRDs in our team's format." |
| Anthropic | Projects + System Prompts | Project-scoped instructions and knowledge files that apply to all conversations within a project. |
Custom personas are really just saved system prompts with a UI wrapper. The model doesn't fundamentally change. But the UX impact is significant: instead of copy-pasting instructions every time, you have a persistent specialist you can return to. For teams, this means you can create shared personas that encode team conventions.
This is where things get interesting for people who work in code-adjacent roles. Project configs give the AI persistent knowledge about a specific project, its conventions, and its structure.
The pattern was popularized by AI coding tools and is now spreading to broader AI workflows:
| Tool | Config File | What It Contains |
|---|---|---|
| Claude Code | CLAUDE.md | Project conventions, architecture decisions, coding standards, team preferences. Lives in the repo root. Claude reads it automatically at the start of every session. |
| Cursor | .cursorrules | Similar to CLAUDE.md. Rules about code style, preferred libraries, patterns to follow or avoid. Cursor loads it as context for every AI interaction in that project. |
| GitHub Copilot | .github/copilot-instructions.md | Repository-level instructions for Copilot. Defines conventions specific to the codebase. |
| Windsurf | .windsurfrules | Project rules for the Windsurf editor's AI assistant. Same pattern, different file name. |
A project config is like the onboarding document you'd give a new team member on their first day. "Here's how we name things. Here's our folder structure. Here are the libraries we use and why. Here's what we've tried before that didn't work." Except instead of a human reading it once and gradually forgetting, the AI reads it at the start of every single session.
In Claude Code's workflow, there's an important distinction between two types of files:
CLAUDE.md — persistent project context. Describes the codebase, conventions, architecture, and preferences. Doesn't change between tasks. Think of it as the project's constitution.
Example: "This is a Next.js 15 app using Tailwind. We use server components by default. All API routes go in /app/api. Never use class components."
plan.md — a task-specific planning document. Created for a specific feature or work session. Breaks down the task into steps, tracks progress, and captures decisions made along the way. Temporary and task-scoped.
Example: "Task: Add dark mode. Step 1: Create theme context ✅. Step 2: Update Tailwind config ✅. Step 3: Add toggle component. Step 4: Persist preference."
The two work together: CLAUDE.md tells the agent how to work in this project. plan.md tells it what to work on right now. One is stable, the other is ephemeral.
Skills go beyond instructions. A skill is a packaged capability that the AI can execute, not just follow. Skills combine instructions, tool definitions, and sometimes code into a reusable module.
The key difference between a skill and a system prompt: a system prompt says "you are a UX researcher." A skill says "when the user asks you to analyze feedback, here's the exact process to follow, here are the tools to use, here are examples of good output, and here's how to format the result." Skills are procedural, not just descriptive.
Each AI provider has a different philosophy and architecture for customization. Understanding these differences matters because they shape what you can build and how portable your workflows are.
Claude's approach is file-based: CLAUDE.md and skill files live in your repo. If you switch tools or providers, those files still work as documentation. A Custom GPT lives on OpenAI's platform. If you leave, you lose it. Google's Gems are tied to your Google account.
Claude's skill system is modular. You can have a "create-docx" skill, a "design-doc" skill, and a "frontend" skill, and they compose together in the same session. The model reads whichever skill files are relevant. OpenAI's Custom GPTs are monolithic: one GPT, one system prompt, one set of tools. You can't easily mix GPTs together.
This is where MCP (Chapter 12) becomes the differentiator. Claude uses MCP as the universal protocol for connecting to external tools. OpenAI uses GPT Actions (custom API integrations defined per-GPT). Google uses Extensions (pre-built connectors to Workspace apps). MCP is open and any tool can implement it. GPT Actions and Extensions are vendor-specific.
Cursor + .cursorrules has become the most widely adopted project config pattern among developers. Teams commit their .cursorrules file to the repo, encoding conventions like "use TypeScript strict mode," "prefer server components," "use this testing pattern." New team members get AI assistance that already knows the team's standards from day one.
Custom GPTs for design teams: Several design orgs have built internal GPTs for their specific workflows. "Upload a screenshot and this GPT audits it against our design system." "Paste user feedback and this GPT categorizes it by our taxonomy." These are essentially skills packaged as sharable apps.
Claude Projects for research teams: Research teams upload papers, transcripts, and frameworks into Claude Projects. The project-level knowledge means every conversation starts with deep context about the research domain, without re-explaining the background each time.
The lines between these approaches are blurring fast. MCP is being adopted beyond Claude (Cursor, Windsurf, and others now support it). OpenAI is moving toward more composable tools. Google is opening up Gemini's extension system.
The convergence point: a world where you define your team's AI configuration once (conventions, tools, knowledge, workflows) and it works across any AI tool your team uses. We're not there yet, but the trajectory is clear. The teams investing in structured AI customization now will have a significant advantage as these systems mature.
Customization (above) is what you tell the AI. Memory is what the AI learns about you over time. Every major provider now has a memory system: ChatGPT stores facts across conversations, Claude encrypts and exports memories, Gemini imports histories from competitors.
It helps to name the layers — what lives in the current session's context, what's saved per project, and what persists about the user across everything — because they have different shelf lives and different governance needs.
The UX patterns for memory are standardizing: transparency (see what the AI remembers), editing (correct or delete memories), scoping (global vs project-specific), staleness management (refresh outdated info), and portability (export and import between providers).
A wrong memory is worse than no memory. The AI will confidently act on outdated information ("you said you wanted a vegetarian option" — no, that was last year). The mitigation is governance: time-stamp every memory, summarize aggressively to compress old facts, let users edit or delete, and bias toward forgetting over hoarding. Treat memory like a database that needs a retention policy, not an attic.
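What that governance can look like in data terms — a sketch with illustrative field names, not any provider's actual memory format:

```typescript
// Sketch of a governed memory record. The point: every memory carries a
// timestamp, a scope, and an expiry, so the product can surface, refresh,
// or forget it deliberately instead of hoarding it.
interface MemoryRecord {
  id: string;
  fact: string;                           // "prefers vegetarian options"
  scope: "global" | "project";            // where this memory may be applied
  source: "stated" | "inferred";          // the user said it vs the model guessed it
  createdAt: Date;
  lastConfirmedAt: Date;                  // refreshed whenever the user re-confirms it
  expiresAt: Date;                        // bias toward forgetting: everything expires
  userEditable: true;                     // users can view, correct, or delete it
}

// Staleness check: old, unconfirmed memories get re-verified before use.
function isStale(m: MemoryRecord, now = new Date()): boolean {
  const ninetyDays = 90 * 24 * 60 * 60 * 1000;
  return now.getTime() - m.lastConfirmedAt.getTime() > ninetyDays || now > m.expiresAt;
}
```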
Skills and project configs are how AI goes from "generic tool" to "team member who knows our workflow." The investment isn't in the technology but in the articulation: writing down how your team works, what your conventions are, and what good output looks like. That documentation becomes the AI's training manual. Teams that can articulate their process clearly will get dramatically more value from AI than teams that can't.
Not every AI feature should be a chatbot. There are distinct product patterns, each with different UX, architecture, and user expectations. Choosing wrong costs months.
Every AI product maps to one of six archetypes. The archetype determines the interaction model, the trust requirements, the latency budget, and the failure modes you need to design for.
The most common product debate in 2026: should this feature be a copilot (AI assists, human decides) or an agent (AI acts, human supervises)? The answer depends on three factors:
The best way to understand archetypes: take one user need and see how each pattern handles it differently.
Products don't stay in one archetype. They evolve along a predictable path — usually from lower autonomy to higher:
GitHub Copilot started as autocomplete (generation), became a chat sidebar (copilot), and is becoming an agent (Copilot Workspace). Notion started with AI writing (generation), added Q&A (search), and is moving toward AI workflows (agent). The archetype you launch with isn't the archetype you'll have in two years.
Linear uses classification invisibly: AI auto-labels issues by priority and team. Users don't interact with the AI directly — it just makes the product smarter. This is the highest-ROI, lowest-risk archetype.
Figma AI is copilot-patterned: it suggests layouts, generates variants, and fills text — but the designer is always in control. The canvas is the workspace; AI is the assistant.
Cursor spans three archetypes simultaneously: autocomplete (generation), chat panel (copilot), and Composer (agent). Each mode has different trust levels, latency budgets, and UI patterns.
Prompt engineering was about writing good instructions. Context engineering is about designing the entire information environment the model sees — and it's now the most important product decision in AI.
The term "context engineering" was popularized by Andrej Karpathy in 2025 and has since become the standard framing. The insight: what matters isn't just the prompt — it's everything in the context window. System instructions, retrieved documents, conversation history, tool outputs, and examples all shape the model's behavior.
In traditional software, the product spec becomes code. In AI products, the product spec is the system prompt. Want the bot to be concise? That's a prompt instruction. Want it to always cite sources? Prompt instruction. Want it to refuse certain topics? Prompt instruction. The system prompt is the single most leveraged artifact in an AI product — and iterating on it is how you ship improvements without changing any code.
The weak prompt: "Summarize this feedback."
Result: Generic summary, no structure, misses key themes, inconsistent length across runs.
The strong prompt: "You are a UX researcher analyzing user feedback. For each piece of feedback, identify: (1) the stated request, (2) the underlying need, (3) severity (1-5). Respond in JSON."
Result: Consistent, structured, actionable. Same format every time.
The shift from "prompt engineering" to "context engineering" reflects a maturation: it's not about clever wording tricks anymore. It's about designing the entire information environment. What documents get retrieved? How much conversation history is retained? Which tools are exposed? How are examples selected? These are product architecture decisions that happen to be expressed as text in a context window.
Anthropic's Claude system prompt is thousands of tokens long and is treated as a living product document. Changes go through eval suites before deployment. It defines Claude's personality, capabilities, limitations, and behavior — it IS the product.
Cursor dynamically constructs context for each request: relevant code files (retrieved via embeddings), the user's recent edits, linter errors, and the project's .cursorrules file. No two requests see the same context. The "intelligence" of Cursor is largely in how well it selects what to include.
Traditional SaaS costs almost nothing per additional user. AI products spend real money on every API call. That single fact rewrites pricing, margins, and which features are worth shipping at all.
The fundamental economic difference: serving one more user on Figma costs Figma almost nothing. Serving one more query on ChatGPT costs OpenAI real money — model inference, compute, and API fees. This marginal cost per request is what makes AI product economics different from everything that came before.
| Model | How It Works | Example | Tradeoff |
|---|---|---|---|
| Per-seat subscription | Fixed price per user/month | ChatGPT Plus ($20/mo), Cursor Pro ($20/mo) | Simple, predictable. But heavy users cost you money, light users subsidize them. |
| Usage-based | Pay per token / API call | OpenAI API, Anthropic API, Google Vertex | Fair pricing, scales with value. But unpredictable bills scare customers. |
| Hybrid | Base subscription + usage overages | Claude Pro (base + message limits) | Best of both: predictable base, usage upside. Most common in 2026. |
| Free tier + premium | Basic AI free, advanced features paid | Notion AI, Grammarly, Perplexity | Great for adoption. Risk: free tier costs real money to serve. |
| Embedded / platform | AI baked into a product you already pay for | Apple Intelligence, Galaxy AI, Google Workspace AI | No separate pricing. AI is a feature, not a product. Funded by the parent product. |
In AI products, every design decision is a cost decision. A longer system prompt = more input tokens per call. A "show your reasoning" feature = 5-10x more output tokens. A RAG pipeline = embedding costs + retrieval costs + longer context. A multi-step agent = multiple API calls per user action. Product teams that don't model these costs before building frequently discover their feature is economically unviable at scale.
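A back-of-envelope cost model makes these tradeoffs concrete. The sketch below uses placeholder prices — pull current rates from the provider pricing pages in the Sources chapter before trusting any number it produces.

```typescript
// Back-of-envelope request cost model. Prices are placeholders — always use
// the provider's live pricing before relying on the output.
interface CostInputs {
  inputTokens: number;       // system prompt + context + user message
  cachedInputTokens: number; // portion of input served from the prompt cache
  outputTokens: number;      // the model's response
}

const PRICE_PER_MILLION = {
  input: 3.0,        // placeholder $/1M input tokens
  cachedInput: 0.3,  // cached tokens billed at ~10% of input (provider-dependent)
  output: 15.0,      // placeholder $/1M output tokens — note the 3–8x multiple
};

function costPerRequest(c: CostInputs): number {
  return (
    ((c.inputTokens - c.cachedInputTokens) * PRICE_PER_MILLION.input +
      c.cachedInputTokens * PRICE_PER_MILLION.cachedInput +
      c.outputTokens * PRICE_PER_MILLION.output) /
    1_000_000
  );
}

// A verbose "show your reasoning" feature multiplies outputTokens, not inputTokens —
// which is why it can quietly dominate the bill.
console.log(costPerRequest({ inputTokens: 3000, cachedInputTokens: 2000, outputTokens: 800 }));
```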
When every product can access the same foundation models, what makes yours defensible? The models are commoditizing. The value is moving to the layers above and below.
Here's the uncomfortable truth: if your product's value proposition is "we use GPT-4o to do X," your competitor can ship the same thing in a week. The model is an API call. The moat is everything else.
The most durable moat in AI is the data flywheel: user interactions → better training/eval data → improved product → more users → more interactions. Products that capture and learn from usage data compound their advantage over time. Products that just wrap an API don't. This is why "AI-native" companies (built around the flywheel) are structurally advantaged over "AI-added" companies (bolted AI onto an existing product).
AI developer products have unique DX challenges: non-deterministic outputs, complex debugging, and the need to "try before you buy." The playground isn't a nice-to-have — it's the product.
When a developer evaluates an AI API or tool, they go through a predictable journey: try it in a playground → read the docs → build a prototype → hit edge cases → decide to commit or abandon. The DX at each stage determines conversion.
| Stage | What They Need | DX Pattern |
|---|---|---|
| Explore | Can this do what I need? | Interactive playground with real models. Zero setup. Shareable results. |
| Prototype | Can I build with this? | SDKs in major languages. Quickstart that works in < 5 min. Copy-paste examples. |
| Build | How do I handle the hard parts? | Streaming docs, error handling guides, prompt engineering tutorials, eval tooling. |
| Scale | Can I rely on this? | Rate limits, uptime SLAs, cost calculators, usage dashboards, model versioning. |
| Debug | Why did it break? | Observability: request logs, token counts, latency traces, response diffs. |
How quickly does a developer go from zero to a working demo? This determines adoption more than any feature list. The target: under 5 minutes for a "hello world" equivalent.
Anthropic's Workbench lets developers test prompts, compare models side-by-side, and share results — all in-browser, before writing any code. The "try it" path has zero friction.
Vercel's AI SDK became the standard for building AI features in web apps because it abstracted away streaming, provider switching, and tool use into a clean TypeScript API. Good SDK design = adoption.
Stripe's API docs (pre-AI) set the DX standard that AI companies now emulate: interactive code examples, copy-paste SDKs, real API keys in the docs. The best AI developer products apply these same principles to a much harder problem space.
The AI landscape has dozens of layers and hundreds of companies. Understanding where your product sits — and who the adjacent players are — is essential for strategic positioning.
| Layer | Build When | Buy When | Key Tradeoff |
|---|---|---|---|
| Model | You need fine-tuned behavior | Almost always buy/rent | Training costs $1M+ |
| Orchestration | Complex multi-agent flows | Standard agentic patterns | LangChain is fast but opinionated |
| Vector DB | Unique scaling or privacy needs | Standard RAG | Pinecone/Weaviate vs pgvector |
| Eval | Highly domain-specific metrics | Standard accuracy/quality | Custom evals + Braintrust hybrid |
| Guardrails | Regulated industry (health, finance) | Standard content safety | Compliance needs drive build |
Think of an AI product as a restaurant. You don't grow your own wheat (compute), breed your own cows (train models), or manufacture your own pans (build infra) — you buy those. But you DO create your own recipes (prompts), design your own menu (product), and build the dining room (UX). The moat is what the customer sees and tastes, not what happens in the supply chain.
The strategic question for any AI product: which layer are you in, and who are you depending on? If you're an application, you depend on model providers. If you're a model provider, you depend on compute. Every layer has leverage over the ones above it and dependency on the ones below.
The "barbell" pattern: most value accrues at the top (apps that own the user relationship and data) and the bottom (compute providers with physical infrastructure). The middle layers — model APIs, orchestration frameworks, vector databases — face the most commoditization pressure. The products that thrive in the middle are those that become essential workflow infrastructure (LangChain) or own a critical data layer (Pinecone).
The demo always works. Production is where AI products fail. Understanding this gap is the difference between a successful launch and an embarrassing one.
Every AI product team has experienced this: you build a prototype, demo it to leadership, everyone is amazed. Then you ship it to real users and it immediately breaks in ways you never anticipated. This isn't a bug — it's a fundamental property of AI systems.
| Category | Demo Doesn't Test | Production Requires |
|---|---|---|
| Input diversity | 5-10 curated examples | Handling any input, including adversarial ones |
| Error handling | "It works" path only | Timeouts, rate limits, model errors, bad inputs |
| Latency | Acceptable for a demo | P95 latency under 3s for every request |
| Cost | Free during prototype | $X per query × millions of queries = real money |
| Monitoring | You watch it yourself | Automated alerts, dashboards, anomaly detection |
| Model updates | Pinned to one version | Model provider updates break your prompts |
Your product runs on a model you don't control. When the provider updates that model, your prompts can break without any change on your end. This has happened to nearly every AI product team.
The #1 cause of production AI failures: the long tail of user inputs. Your eval suite covers the 90% case. The remaining 10% of queries — ambiguous, multi-language, misspelled, out-of-scope, adversarial — is where the product breaks. Building for the long tail means investing as much in error handling, fallbacks, and edge-case coverage as you do in the happy path.
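In code, "investing in the unhappy path" mostly means wrapping every model call defensively. A sketch, with `callModel` and `validateOutput` as placeholders for your own functions:

```typescript
// Defensive wrapper sketch for the unhappy paths a demo never exercises:
// timeouts, transient provider errors, and outputs that fail validation.
async function generateWithFallback(
  prompt: string,
  callModel: (p: string, signal: AbortSignal) => Promise<string>,
  validateOutput: (raw: string) => boolean,
): Promise<{ ok: boolean; text: string }> {
  for (let attempt = 0; attempt < 2; attempt++) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 10_000); // hard latency budget
    try {
      const raw = await callModel(prompt, controller.signal);
      if (validateOutput(raw)) return { ok: true, text: raw };
      // Invalid output: fall through and retry once before giving up.
    } catch {
      // Timeout, rate limit, or provider error: retry once, then degrade.
    } finally {
      clearTimeout(timeout);
    }
  }
  // Graceful degradation beats a blank screen: show a non-AI fallback.
  return { ok: false, text: "Sorry — I couldn't generate that. Try rephrasing, or use search instead." };
}
```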
Google's AI Overviews launch in 2024 became a cautionary tale. The demo was polished. Real users immediately surfaced absurd answers — the AI suggested adding glue to pizza and eating rocks. Google had to add guardrails, limit triggers, and rethink the entire rollout within days. The gap between "works on curated queries" and "works on everything people actually search" was enormous.
Notion AI took a different approach: they shipped with aggressive guardrails (refusing many edge cases) and gradually expanded capabilities based on production data. Start conservative, expand with evidence. Slower launch, fewer crises.
The most powerful AI products aren't the ones with the best model on day one. They're the ones that learn from every user interaction and compound that learning into a better product.
A data flywheel is a self-reinforcing loop: the product generates data from user interactions, that data improves the product, the improved product attracts more users, and more users generate more data. This is the core growth loop for AI products.
Collecting user data for improvement creates a tension with user expectations of privacy. Different companies handle this very differently:
| Provider | Trains on Your Data? | User Control |
|---|---|---|
| OpenAI | Yes, by default (consumer). No (API + Team). | Opt-out available in settings |
| Anthropic | No. Never trains on conversations. | Memory is encrypted, exportable |
| Yes for free tier. No for Workspace paid. | Can import/export memory |
The trust implication: products that DON'T train on user data can advertise that as a feature. Products that DO train on data get a better flywheel but face privacy scrutiny. This is a genuine strategic tradeoff, not a clear right answer.
The flywheel isn't automatic. You have to design for it. That means: building feedback mechanisms into the UX (easy thumbs up/down, correction flows), creating data pipelines that turn feedback into eval datasets, and establishing processes to regularly retrain or re-prompt based on what you learn. Companies that capture data but never close the loop don't have a flywheel — they have a data warehouse.
Spotify's Discover Weekly is the canonical data flywheel. Every listen, skip, save, and playlist add feeds back into the recommendation model. After 10+ years, the compound advantage is enormous — a new competitor can't replicate a decade of behavioral data.
Tesla Autopilot processes billions of miles of driving data from its fleet. Every car contributes to the training data. More cars → more data → better driving → more customers → more cars. The fleet IS the moat.
ChatGPT's RLHF loop: Human feedback on responses trains the reward model, which improves the base model, which produces better responses, which generate more subscriptions, which fund more human raters. OpenAI turned user feedback into a direct product improvement cycle.
The best AI PMs are the ones who kill AI features that shouldn't exist. Knowing when a lookup table, a rule, or a simple search is the right answer is rarer and more valuable than knowing how to build with LLMs.
Every chapter in this primer has implicitly said "here's how AI does X." This chapter asks the opposite question: when is AI the wrong tool?
For any proposed AI feature, ask: "what would this look like without AI?" Often the non-AI version is faster, cheaper, more reliable, and good enough:
| AI Feature | Non-AI Alternative | AI Justified? |
|---|---|---|
| AI-powered search | Good keyword search with filters | Only if semantic understanding genuinely matters |
| AI-generated summaries | Human-written abstracts or excerpts | Only at scale where humans can't keep up |
| AI categorization | Rule-based classifier or dropdown | Only if categories are fuzzy and input varies widely |
| AI writing assistant | Templates and snippets library | Only if the output truly needs to be novel each time |
| AI-powered recommendations | Curated lists, popularity sorting | Only with enough user data to personalize |
The question isn't "can AI do this?" — it almost always can. The question is "does AI do this better than the alternatives, at a cost we can sustain, with a failure rate we can tolerate?" If the answer to any of those is no, the right product decision is to not use AI. This takes more courage than shipping an AI feature, and it's what distinguishes senior product thinking from hype-driven building.
Linear uses AI for issue classification but uses deterministic rules for workflow automation (status changes, assignments, notifications). They could use AI for everything — they chose not to, because rules are faster, cheaper, and 100% predictable for structured workflows.
Stripe Radar combines ML fraud detection with hard rules. Some fraud patterns are simple enough for rules ("block transactions over $10K from new accounts in high-risk countries"). ML handles the fuzzy cases. The hybrid is more reliable than either alone.
Traditional user research asks "what do you need?" AI product research is harder because users can't articulate needs for a technology they don't fully understand. The methods have to change.
Nobody asked for "next-token prediction." They asked for "help me write faster." The translation from human need to AI capability is a skill most product teams haven't developed yet.
The most important research question for any AI feature: "What's the delta?" Not "is the AI good?" but "is the AI better than what exists today?" If the current experience is a blank text field and the AI fills in a draft, the delta is huge. If the current experience is a well-designed template library and the AI generates slightly different text, the delta is small. Ship features with big deltas. Kill features with small ones.
AI product research isn't about asking users "do you want AI?" (they'll say yes). It's about measuring whether AI actually improves their outcome vs the non-AI alternative. The METR study found AI coding tools made developers 19% slower despite them believing they were 20% faster.S6 Without rigorous measurement, you're flying on perception, not reality.
Notion tested AI features as prompt prototypes before committing engineering resources. They wired a simple GPT call into their editor, tested with 20 users in a single afternoon, and learned that users wanted AI for "fill in this table" more than "write me a paragraph." This redirected six months of roadmap in one day.
Figma ran A/B tests on AI-generated layout suggestions where the control group got random layouts and the test group got AI layouts. The delta was measurable: AI layouts were chosen 3x more often. That data justified the investment.
The goal isn't maximum trust. It's calibrated trust — users trust the AI exactly as much as it deserves to be trusted. Over-trust and under-trust are both product failures.
This is the most nuanced design challenge in AI. Every product sits somewhere on a trust spectrum, and getting the calibration wrong has real consequences.
| Design Pattern | What It Does | When to Use |
|---|---|---|
| Confidence indicators | Show how certain the AI is ("High confidence" / "Not sure") | When accuracy varies by query type |
| Source attribution | Show where the answer came from (citations, links) | Any factual or knowledge-based task |
| Verification prompts | "Does this look right?" before executing | Irreversible actions or high-stakes outputs |
| Uncertainty language | "I think..." vs "The answer is..." in AI responses | When hallucination risk is moderate |
| AI-generated badges | Clear labels that content was made by AI | Always (and legally required in EU by Aug 2026) |
| Edit affordances | Make AI output easily editable, not a final answer | Any generation or drafting task |
| Comparison views | Show AI suggestion alongside the original | Editing, rewriting, refactoring tasks |
| Fallback visibility | Show what happens if AI is wrong (undo, revert) | Any action the AI takes on behalf of user |
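Several of these patterns can be carried in the payload itself, so the renderer can't forget them. A sketch with illustrative field names — not part of any real schema:

```typescript
// Trust patterns from the table above, expressed as fields a renderer can act on.
interface AIGeneratedCard {
  aiGenerated: true;                          // always badge AI content
  confidence: "high" | "medium" | "low";      // drives a confidence indicator
  sources: { title: string; url: string }[];  // source attribution
  body: string;                               // generated content, rendered as editable
  requiresConfirmation: boolean;              // verification prompt before irreversible actions
  undoToken?: string;                         // fallback visibility: how to revert what the AI did
}
```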
Trust is not a feature you add — it's a property that emerges from dozens of design decisions. The font size of an "AI generated" label. Whether the AI says "The answer is" vs "I believe the answer is." Whether the edit button is prominent or hidden. Whether errors are admitted openly or buried. Each micro-decision shifts calibration. The best AI products get this right not through one big trust feature, but through consistent, deliberate calibration across every interaction.
Traditional PMs ship features that work or don't. AI PMs ship features that work 92% of the time and need judgment calls about the other 8%. This chapter is about building that judgment.
Every AI product decision lives in a fog of uncertainty that traditional product decisions don't have. The model might hallucinate. The model provider might change the model. The cost might be unsustainable. A competitor might ship the same thing next week. Here's how to navigate that fog.
The hardest part of being an AI PM isn't the technology — it's translating probabilistic reality into language that leadership, legal, marketing, and sales can act on.
"The model has a 93% accuracy rate with a P95 latency of 2.3 seconds on our eval suite, though performance degrades on out-of-distribution inputs."
"It gets the right answer 19 out of 20 times. When it's wrong, users see a 'not sure' indicator and can retry. We're monitoring the 5% and improving weekly."
The decisions that define your career as an AI PM aren't the easy ones. They're the ones where there's no clear right answer:
These have no textbook answer. They require product judgment built from experience, values, and deep understanding of your users and business. The primer gives you the technical foundation. The gray areas are where you earn your title.
The defining characteristic of a strong AI PM isn't technical knowledge (you have that now) or business acumen (you're building that). It's comfort with ambiguity. The ability to make a decision with 70% confidence, communicate the uncertainty honestly, build in reversibility, and iterate based on data. Traditional PMs ship and move on. AI PMs ship, monitor, learn, and adjust — continuously. The product is never "done" because the model, the data, and the users are always changing.
Anthropic's responsible scaling policy is a decision framework for uncertainty at the company level: they define capability thresholds and pre-commit to safety measures at each level. This "decide the framework in advance, not in the moment" approach works for product teams too — define your quality thresholds, failure protocols, and escalation paths before you need them.
Notion AI's launch strategy was a masterclass in uncertainty management: ship with aggressive guardrails (the AI refuses many edge cases), measure what users actually try, expand capabilities based on real data. They chose "too conservative at launch, loosen over time" over "too permissive at launch, tighten after incidents." One approach builds trust. The other destroys it.
This chapter connects all 33 previous chapters into a single workflow. Here's how a team actually goes from "we should add AI" to a shipped, monitored, improving feature.
AI features blur traditional roles. Here's how responsibilities are shifting in 2026:
The lines between PM, designer, and engineer are blurring on AI teams. Microsoft reorganized in 2025 around a unified "Applied AI" function that merges traditional PM and engineering under a single "builder" role. Google's DeepMind product teams have designers writing prompts and PMs reviewing eval results. Startups like Vercel ship AI features where a single person handles prompt design, eval, and UX — because with API-based AI, you don't need separate specialists for each step.
The implication: the most valuable people on AI teams are T-shaped — deep in one discipline, but able to contribute across the prompt-eval-UX loop. A designer who understands evals. A PM who can write and iterate prompts. An engineer who thinks about trust patterns. This primer exists to build that cross-disciplinary fluency.
Microsoft's March 2026 Copilot reorg merged its consumer and commercial AI teams under a single leader and freed its AI CEO (Mustafa Suleyman) to focus on models. The signal: AI products are no longer a side project staffed by a few people — they're central enough to warrant dedicated, unified organizations with engineering, product, and design working as one unit.
Figma's AI team puts designers directly into the eval process and has designers writing system prompts — a designer wrote the first system prompt for Figma Make. Their Head of AI Product emphasizes keeping teams small by having everyone touch code, enabled by AI tooling that makes this feasible.
Anthropic's Claude Code team describes their approach openly: "designers ship code, engineers make product decisions, product managers build prototypes and evals." They have PMs, but the PM's job has shifted — instead of writing specs and handing them off, PMs build working prototypes with Claude Code and use evals to validate ideas. The team replaced documentation-first thinking with prototype-first thinking.
The workflow above gets the feature shipped. The checklist below keeps it from becoming a beautiful demo that quietly leaks data, loses trust, or collapses when the model changes.
Prompt injection. Treat retrieved docs, webpages, tool results, and emails as untrusted input. Add allowlisted tools, permission gates, output validation, and tests for data exfiltration attempts.
Action permissions. Classify actions as draft, reversible, externally visible, or irreversible. Require confirmation for sends, purchases, access changes, deletions, and sensitive-data transmission (a minimal gate sketch follows this checklist).
Data governance. Define what user data is logged, retained, used for evals, used for training, redacted, or excluded. Make consent and deletion paths part of the product surface.
Model drift. Pin model versions where possible, run scheduled regression evals, and record model IDs in traces so quality changes can be explained after a provider update.
Accessibility. Require semantic labels, focus order, contrast, motion controls, and readable error states in the UI schema. Generated interfaces do not get a pass on accessibility.
Content provenance and IP. Decide where generated text, code, and images can be used, what sources need attribution, and which workflows need legal review before launch.
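A minimal sketch of the action-permission gate from the "Action permissions" item above — the risk tiers come straight from that classification; the specific actions are illustrative:

```typescript
// Action-classification gate. Tune the tiers and the gated actions to your product.
type ActionRisk = "draft" | "reversible" | "externally_visible" | "irreversible";

const ALWAYS_CONFIRM: ActionRisk[] = ["externally_visible", "irreversible"];

interface AgentAction {
  name: string; // e.g. "send_email", "update_doc", "delete_record"
  risk: ActionRisk;
}

function requiresConfirmation(action: AgentAction): boolean {
  return ALWAYS_CONFIRM.includes(action.risk);
}

// Usage: the agent proposes an action; the app decides whether to execute it
// immediately or surface a confirmation UI first.
console.log(requiresConfirmation({ name: "send_email", risk: "externally_visible" })); // true
console.log(requiresConfirmation({ name: "update_doc", risk: "reversible" }));         // false
```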
The feature one-pager: problem, user task, why AI is needed, non-AI fallback, data access, model route, permissions, success metric, failure cost, launch threshold.
The eval spec: 3-5 criteria, examples of pass/fail, grader type, golden set owner, minimum score, regression threshold, review cadence.
The launch readiness checklist: red-team cases, prompt-injection tests, telemetry, rollback path, human escalation, cost alert, legal/privacy review, support playbook.
The RAG design checklist: corpus, permissions, freshness, chunking strategy, retriever, reranker, citation style, empty-result behavior, source-quality metric.
Most durable AI products converge on the same skeleton: app UI → AI gateway → model router → prompt/context builder → model call → tool executor/RAG → validators/guardrails → trace/eval logger → user feedback loop. If one of those boxes is missing, know why.
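One way to keep yourself honest about those boxes is to give each one an explicit contract. A type-level sketch — all names are illustrative stand-ins for your real request and response shapes:

```typescript
// Placeholder payload types.
type UserRequest = { userId: string; text: string };
type AuthedRequest = UserRequest & { permissions: string[] };
type RoutedRequest = AuthedRequest & { model: string };
type Prompt = { system: string; messages: { role: string; content: string }[] };
type ModelOutput = { text: string; toolCalls?: unknown[] };
type SafeOutput = { text: string; flagged: boolean };

// Each box in the skeleton becomes a stage with an explicit input and output,
// so it can be tested, swapped, and monitored on its own.
type Stage<In, Out> = (input: In) => Promise<Out>;

interface Pipeline {
  gateway: Stage<UserRequest, AuthedRequest>;    // authn, rate limits, abuse checks
  router: Stage<AuthedRequest, RoutedRequest>;   // pick the model tier for this task
  contextBuilder: Stage<RoutedRequest, Prompt>;  // system prompt, retrieval, history
  modelCall: Stage<Prompt, ModelOutput>;         // the inference call (ideally streaming)
  toolExecutor: Stage<ModelOutput, ModelOutput>; // tool calls / RAG follow-ups
  guardrails: Stage<ModelOutput, SafeOutput>;    // validation, safety, schema checks
  logger: Stage<SafeOutput, SafeOutput>;         // traces, token counts, eval sampling
}
```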
Every term from this primer, in one place. Reference this whenever a concept from an earlier chapter comes up.
| Term | Plain English | Chapter |
|---|---|---|
| Token | A chunk of text (~4 characters) that models process. The "atom" of AI. | 1 |
| Next-token prediction | How LLMs work: predict the most likely next word, repeat. | 2 |
| Context window | The model's working memory. Everything it can "see" at once. | 3 |
| Temperature | Randomness dial. Low = predictable. High = creative. | 4 |
| Reasoning model | Model that "thinks" step-by-step before answering. Slower, costlier, better on hard problems. | 5 |
| Structured output | Forcing the model to respond in a specific format (JSON, XML). | 6 |
| Model routing | Sending simple tasks to cheap models, hard tasks to expensive ones. | 7 |
| Multimodal | AI that processes text, images, audio, and video. | 8 |
| Function calling | Model outputs a structured request to use a tool (API, database, etc). | 9 |
| RAG | Retrieval Augmented Generation. Searching a knowledge base before answering. | 10 |
| Embedding | Converting text into numbers that capture meaning. Powers semantic search. | 10 |
| Vector database | Database optimized for storing and searching embeddings. | 10 |
| Hybrid search | Combining dense vector search (meaning) with sparse keyword search like BM25 (exact match) and merging results. | 10 |
| Reranking | Re-scoring the top retrieval results with a slower, more accurate cross-encoder model. Biggest single quality lever in production RAG. | 10 |
| HyDE | Hypothetical Document Embeddings. Have the LLM draft an answer first, then embed and search using that — closes the vocabulary gap between questions and documents. | 10 |
| Agentic loop | Think → Act → Observe → Repeat. How AI agents reason through multi-step tasks. | 11 |
| MCP | Model Context Protocol. Universal standard for connecting AI to tools. | 12 |
| A2A | Agent-to-Agent protocol. Standard for agents communicating with each other. | 14 |
| Guardrails | Safety systems that filter inputs/outputs (content, PII, jailbreaks). | 16 |
| Evals | Systematically testing AI against a set of examples to measure quality. | 17 |
| LLM-as-judge | Using a strong model to grade a weaker model's outputs. | 17 |
| Inference | Using a trained model to generate output. Every API call is an inference call. Distinct from training (which creates the model). | 2, 19 |
| TTFT | Time to First Token. How fast the model starts responding. <200ms feels instant. | 19 |
| Streaming | Showing tokens as they generate rather than waiting for the full response. | 19 |
| Prompt caching | Reusing processed prompt prefixes across calls. Cached tokens billed at ~10% of normal rate. The biggest production cost lever. | 19 |
| KV caching | Internal caching of attention state during a single generation so the model doesn't recompute earlier tokens. Automatic, not the same as prompt caching. | 19 |
| Speculative decoding | A small fast model drafts tokens, the large model verifies in batch. 2–3x speed at the same quality. | 19 |
| Quantization | Reducing the precision of model weights (e.g. 16-bit to 4-bit) to make models smaller and faster, with a small quality cost. | 19 |
| Distillation | Training a smaller "student" model to mimic a larger "teacher" model. How frontier capabilities flow down to cheap, fast models. | 19 |
| OpenTelemetry GenAI | Emerging standard for emitting LLM and agent traces (model calls, tool calls, latencies) into your existing observability stack. | 14 |
| Generative UI | AI generating actual interface components (cards, forms) not just text. | 20 |
| System prompt | Hidden instructions that define the AI's behavior, identity, and rules. | 21, 23 |
| Context engineering | Designing everything in the context window: prompt, docs, tools, history. | 23 |
| Data flywheel | Users → data → better product → more users. The core AI growth loop. | 29 |
| Trust calibration | Designing so users trust AI exactly as much as it deserves. | 32 |
| Hallucination | When the model confidently states something false. | 17, 32 |
| Fine-tuning | Retraining a model on custom data to change its behavior. | 10, 21 |
| RLHF | Reinforcement Learning from Human Feedback. How models learn to be helpful. | 29 |
Curated resources to continue learning. Organized by section. The sources below are also the live references for volatile claims like pricing, context windows, regulatory timing, and model capabilities.
AI facts age quickly. Any number tied to model price, context window, latency, benchmark score, legal deadline, or provider feature should be rechecked before publishing externally.
Live API pricing for current model families and token rates.
Context window docs and extended-thinking token behavior.
Google Gemini pricing and model limits.
Anthropic MCP docs for protocol concepts and product support.
European Commission overview for transparency and high-risk obligations.
Original METR writeup on early-2025 AI and experienced developer productivity.
Semantic conventions for GenAI spans, metrics, events, and agent traces.
OpenAI prompt caching guide for cacheable prefixes, latency, and cost behavior.
Structured Outputs announcement and caveats for schema-constrained generation.
The best way to learn: explain it to someone else.
If you can't, you don't know it yet.