2.1 How Gen AI Models Actually Work (Without the PhD)

Before you can make good product decisions with AI, you need a working mental model of what's happening inside. You don't need to write code. You do need to know what the machine is doing well enough to predict when it will fail your users.

2.1.1 The Core Idea: Next-Token Prediction

Every major LLM — GPT-4, Claude, Gemini, Llama — does fundamentally the same thing:

It predicts the most likely next piece of text, one token at a time.

That's it. When ChatGPT writes a paragraph, it isn't "thinking" about the paragraph. It generates token #1, then uses that to generate token #2, then token #3, and so on — hundreds or thousands of times.

Analogy: The World's Best Autocomplete

Imagine the autocomplete on your phone keyboard, but trained on essentially the entire public internet — books, Wikipedia, code repositories, Reddit threads, scientific papers, news articles. Instead of suggesting the next word, it suggests the next token (a token is roughly ¾ of a word). It does this so well that the output reads like it was written by a thoughtful human.

Why this matters for PMs: The model isn't "understanding" your user's question in the way a human support agent does. It's pattern-matching against hundreds of billions of parameters to produce statistically likely continuations. This distinction explains almost every failure mode you'll encounter in production.


2.1.2 Tokenization: How Models See Text

Models don't read words. They read tokens — chunks of text that might be a whole word, part of a word, or even a single character.

| Text | Tokens (approximate) |
|---|---|
| "Hello world" | ["Hello", " world"] — 2 tokens |
| "unbelievable" | ["un", "believ", "able"] — 3 tokens |
| "GPT-4" | ["G", "PT", "-", "4"] — 4 tokens |
| "こんにちは" | 3-5 tokens, depending on the model |

Why Tokenization Matters to PMs

  • Cost: You pay per token (input + output). GPT-4o charges ~$2.50/million input tokens, ~$10/million output tokens. A customer service response might be 200-500 tokens. Multiply by millions of users and this is a real line item.
  • Context window limits: Models have a maximum number of tokens they can process at once. GPT-4 Turbo handles 128K tokens (~300 pages). Claude 3.5 handles 200K tokens. Gemini 1.5 Pro handles up to 1M tokens. If your product needs to process a 500-page legal document in one shot, model choice matters.
  • Non-English tax: Many languages tokenize less efficiently. The same sentence in Japanese or Hindi might use 2-3x more tokens than in English, making your product more expensive for those users. This has direct implications for international product launches.
  • Math and code: Numbers tokenize unpredictably. "123456" might become ["123", "456"], which is one reason models are unreliable at arithmetic β€” they literally don't see the number as a single entity.

2.1.3 Transformers & Attention: The Architecture That Changed Everything

Every frontier LLM is built on the Transformer architecture, introduced in Google's 2017 paper "Attention Is All You Need."

Analogy: The Brilliant Reader

Imagine you're reading a 50-page document to answer a question. A normal reader goes top to bottom, remembering less as they go. A Transformer is like a reader who can instantly highlight the 15 most relevant sentences across all 50 pages — even if they're far apart — and synthesize an answer from just those highlights.

This "highlighting" is the attention mechanism.

How Attention Works (PM-Friendly Version)

When the model processes your prompt, every token "looks at" every other token and assigns an attention score — how relevant is this other token to understanding me in context?

Example prompt: "The bank by the river was steep."

  • The word "bank" is ambiguous (financial institution vs. riverbank).
  • The attention mechanism lets "bank" look at "river" and "steep," giving those words high attention scores, so the model correctly interprets "bank" as a riverbank.

Multi-head attention means the model runs many of these attention calculations in parallel, each one learning to focus on different types of relationships β€” syntax, semantics, long-range references, etc.
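The scoring step above can be sketched in a few lines. This is a toy illustration of scaled dot-product attention with hand-picked 2-d "embeddings" (real models use learned vectors with thousands of dimensions and many heads); the vectors for "bank", "the", and "river" are invented for the example.

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(query, keys):
    """Scaled dot-product attention: how much 'query' attends to each key."""
    d = len(query)
    dots = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in keys]
    return softmax(dots)

# Toy vectors: 'bank' should attend more to 'river' than to 'the'.
bank = [1.0, 0.2]
the = [0.1, 0.0]
river = [0.9, 0.4]
weights = attention_scores(bank, [the, river])
print(weights)  # the second weight (for 'river') is larger
```

The weights always sum to 1, so attention is a budget the token spreads over its context: here "river" gets most of it, which is the disambiguation described above.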

Why Transformers Won

Before Transformers, models (like RNNs/LSTMs) processed text sequentially — word by word, left to right. This was:

  • Slow (couldn't parallelize)
  • Forgetful (information degraded over long sequences)

Transformers process all tokens in parallel and can attend to any position equally. This is why:

  • Training became massively parallelizable across thousands of GPUs
  • Models got dramatically better at long-range coherence
  • Scale became the dominant strategy (more data + more parameters = better results)


2.1.4 The Training Process: Three Phases

Understanding how a model is trained tells you why it behaves certain ways in your product.

Phase 1: Pre-training (The Knowledge Phase)

What happens: The model reads trillions of tokens from the internet, books, code, and other text. It learns to predict the next token. This is unsupervised — no human labels needed.

Analogy: Imagine a medical student who reads every textbook, journal article, and patient record ever published. They absorb an enormous amount of knowledge — but they've never actually talked to a patient.

What it produces: A base model (also called a "pre-trained model"). Base models are powerful but awkward — they'll complete any text plausibly, but they don't know how to have a conversation. If you type a question, a base model might generate 10 related questions rather than an answer, because that's a plausible continuation of "question" text.

Cost and scale:

  • GPT-4's training cost is estimated at $100M+ in compute
  • Llama 3 405B was trained on 15T+ tokens
  • Gemini Ultra was trained on Google's entire TPU fleet

Phase 2: Fine-tuning / Instruction Tuning (The Behavior Phase)

What happens: Humans create thousands of example conversations: "When a user asks X, a good response looks like Y." The model is fine-tuned on these examples.

Analogy: That medical student now does their residency — they learn how to interact with patients, follow protocols, and give useful responses rather than just reciting textbook passages.

What it produces: An instruction-tuned model that can follow directions, answer questions, and hold conversations. ChatGPT, Claude, and Gemini are all instruction-tuned models.

Phase 3: RLHF / RLAIF (The Alignment Phase)

What happens: Human raters compare pairs of model outputs and choose which is better. This feedback is used to further train the model via Reinforcement Learning from Human Feedback (RLHF). Anthropic and Google also use AI-generated feedback (RLAIF — Reinforcement Learning from AI Feedback).

Analogy: The doctor now gets patient satisfaction surveys, peer reviews, and malpractice guidelines — they learn not just to be knowledgeable and responsive, but to be careful, ethical, and aligned with what patients actually need.

What it produces: A model that is more helpful, less harmful, and more aligned with user expectations. This is why Claude tends toward caution, ChatGPT toward helpfulness, and Gemini toward Google-ecosystem integration.

PM Insight: The alignment phase is where the model's "personality" and safety guardrails get baked in. When you find a model is too cautious (won't answer medical questions) or too permissive (generates harmful content), that's an RLHF tuning decision — not a knowledge gap.


2.1.5 Inference: What Happens When a User Hits "Send"

When a user submits a prompt in your product:

  1. Tokenization: The prompt is broken into tokens
  2. Encoding: Each token is converted into a high-dimensional vector (embedding)
  3. Forward pass: Tokens flow through the Transformer layers (GPT-4 reportedly has ~120 layers), with attention computed at each layer
  4. Output distribution: The model produces a probability distribution over all possible next tokens (~100K vocabulary)
  5. Sampling: A token is selected based on the probability distribution (controlled by temperature — more on this below)
  6. Repeat: Steps 3-5 repeat until the model produces a stop token or hits the max length

Temperature controls randomness:

  • Temperature 0: Always picks the most likely token → deterministic, repetitive, "safe"
  • Temperature 0.7: Balanced — good for most product use cases
  • Temperature 1.0+: More creative/random — useful for brainstorming, dangerous for factual tasks

PM Decision Point: Temperature is one of your most important product levers. A customer support bot should use low temperature (0.1-0.3). A creative writing assistant should use higher temperature (0.7-1.0). Getting this wrong creates either a robotic experience or an unreliable one.
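The sampling step (step 5 above) can be sketched concretely. This is a minimal illustration of temperature scaling over a toy vocabulary of three tokens; the logit values are invented, and real systems layer top-k/top-p filtering on top of this.

```python
import math
import random

def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick the next token; temperature rescales the distribution."""
    if temperature == 0:  # greedy decoding: always the most likely token
        return max(logits, key=logits.get)
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    r, acc = rng.random(), 0.0
    for token, e in exps.items():
        acc += e / z
        if r <= acc:
            return token
    return token  # fallback for floating-point rounding

logits = {"the": 2.0, "a": 1.0, "banana": -1.0}
rng = random.Random(0)
print(sample_token(logits, 0, rng))    # always "the"
print(sample_token(logits, 1.0, rng))  # usually "the", occasionally others
```

Lowering the temperature sharpens the distribution toward the top token (robotic but reliable); raising it flattens the distribution so unlikely tokens get real probability mass (creative but riskier), which is exactly the product trade-off described above.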


2.1.6 What PMs Need to Know vs. What Engineers Handle

| Concept | PM Must Understand | Engineer Handles |
|---|---|---|
| Tokenization | Cost implications, context window limits, multilingual impact | Token vocabulary design, BPE algorithm implementation |
| Attention | Why models handle some tasks well (long-range reasoning) and others poorly (precise counting) | Attention head configuration, KV-cache optimization |
| Training phases | How each phase shapes model behavior and limitations | Hyperparameter tuning, distributed training infrastructure |
| Temperature | Product impact on user experience; when to use high vs. low | Sampling algorithms (top-k, top-p, beam search) |
| Context window | Maximum input size, what to include/exclude in prompts | Context compression, retrieval augmentation, chunking strategies |
| Fine-tuning | When your use case needs it, what data you need, cost/timeline | Training loops, LoRA adapters, evaluation metrics |
| Inference cost | Unit economics per API call, latency budgets, caching strategy | GPU provisioning, batching, model quantization |

2.2 Critical Limitations of Foundation Models

This is the most important subsection for PMs. Every AI product failure you've read about traces back to one of these limitations. Memorize them.

Limitation 1: Hallucination (Confident Fabrication)

What it is: Models generate plausible-sounding but factually incorrect information with full confidence. They don't "know" they're wrong.

Why it happens: The model is optimizing for statistically likely text, not truth. If a plausible-sounding answer exists in the latent space, the model will produce it β€” whether or not it's factually accurate.

Real-world examples:

  • Google Bard launch (2023): In its very first public demo, Bard claimed the James Webb Space Telescope took the first pictures of exoplanets outside our solar system. This was wrong. Google's stock dropped ~$100B in market cap that day.
  • ChatGPT legal citations: Lawyer Steven Schwartz used ChatGPT to write a legal brief that cited six entirely fabricated court cases. The judge sanctioned him for submitting fake citations.
  • Microsoft Bing Chat (early 2023): Confidently provided incorrect financial data from earnings reports, hallucinating specific revenue numbers that didn't exist.
  • Amazon product listings: AI-generated product descriptions have included fabricated specifications and non-existent features.

Hallucination rate benchmarks (approximate):

| Model | Hallucination Rate (general QA) |
|---|---|
| GPT-4o | ~3-5% |
| Claude 3.5 Sonnet | ~3-4% |
| Gemini 1.5 Pro | ~4-6% |
| Llama 3 70B | ~6-10% |
| GPT-3.5 | ~15-20% |

PM Mitigation Strategies:

  • Implement RAG (Retrieval-Augmented Generation) to ground responses in verified data
  • Add source citations and confidence indicators to the UI
  • Use model-as-judge verification (have a second model check the first)
  • Design human-in-the-loop workflows for high-stakes outputs
  • Never ship AI in domains where a single hallucination has severe consequences (medical diagnosis, legal advice, financial trading) without robust guardrails
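The RAG strategy can be sketched end to end. This is a minimal sketch, assuming a toy keyword retriever and an invented two-document knowledge base; production systems use vector search, and the model call itself is left out (the point is the grounded prompt the application builds).

```python
# Minimal RAG sketch: retrieve verified snippets, then instruct the model
# to answer ONLY from them and cite them. Documents here are illustrative.

KNOWLEDGE_BASE = [
    {"id": "kb-1", "text": "Refunds are available within 30 days of purchase."},
    {"id": "kb-2", "text": "Shipping to the EU takes 5-7 business days."},
]

def retrieve(query: str, k: int = 2):
    """Toy keyword-overlap retriever; real systems use embeddings."""
    words = set(query.lower().split())
    scored = [(len(words & set(d["text"].lower().split())), d)
              for d in KNOWLEDGE_BASE]
    return [d for score, d in sorted(scored, key=lambda x: -x[0])[:k] if score]

def build_grounded_prompt(query: str) -> str:
    docs = retrieve(query)
    sources = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return (f"Answer using ONLY the sources below and cite their ids. "
            f"If the sources are insufficient, say so.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {query}")

print(build_grounded_prompt("How long do refunds take?"))
```

The instruction to refuse when sources are insufficient is the key product detail: it trades a wrong answer for an honest "I don't know," which is usually the right call in high-stakes domains.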


Limitation 2: Knowledge Cutoff (Frozen in Time)

What it is: Models only know what was in their training data. They have no awareness of events after their training cutoff date.

Specific cutoff dates (as of early 2025):

| Model | Approximate Knowledge Cutoff |
|---|---|
| GPT-4o | October 2023 |
| Claude 3.5 Sonnet | Early 2024 |
| Gemini 1.5 Pro | ~Late 2023 (with Search grounding) |
| Llama 3.1 | December 2023 |

Real-world impact:

  • Customer service bots that can't answer questions about product updates or policy changes made after training
  • Travel planning tools that suggest restaurants that have closed, reference pre-pandemic travel rules, or miss new visa requirements
  • Financial analysis tools that don't know about recent earnings, regulatory changes, or market events
  • HR chatbots that reference outdated company policies

PM Mitigation Strategies:

  • Implement RAG with a frequently updated knowledge base
  • Use web search/grounding (Gemini's Google Search grounding, ChatGPT's Browse feature)
  • Clearly display knowledge cutoff dates to users
  • Build pipelines to inject current context into prompts
  • Design products that fail gracefully when asked about recent events ("I don't have information about events after [date]. Let me search for the latest...")


Limitation 3: Reasoning Gaps (Brittle Logic)

What it is: Models can appear to reason through complex problems but fail in predictable, sometimes bizarre ways — especially on novel problems that require true logical deduction rather than pattern matching.

Where models struggle:

  • Multi-step math: Ask GPT-4 to multiply 3,847 × 9,261 and it will often get it wrong. It's not calculating — it's guessing based on patterns.
  • Spatial reasoning: "I'm facing north. I turn left. I turn left again. What direction am I facing?" Models frequently get this wrong.
  • Counterfactual reasoning: "If the Roman Empire never fell, what language would modern France speak?" Models tend to give superficially plausible but logically inconsistent answers.
  • Planning and constraint satisfaction: "Schedule 5 meetings in 3 rooms over 2 days with these 12 constraints" — models struggle with combinatorial constraint problems.
  • Negation and logic puzzles: "Which of these statements is NOT true?" — models are measurably worse with negation.

Real-world product implication: If you're building a product that requires precise calculation, deterministic logic, or constraint satisfaction — don't rely on the LLM for that part. Use the LLM for natural language understanding and generation; use traditional code for computation.
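The split between "LLM for language, code for computation" can be sketched in a few lines. The structured request format below is illustrative (real providers each define their own function-calling schema); the point is that the multiplication itself runs in ordinary code and is exact every time.

```python
# Deterministic calculator tool the application owns. The model's job is
# only to recognize the request and emit a structured call, never to
# guess the digits itself.

def multiply_tool(a: int, b: int) -> int:
    return a * b

# Illustrative structured request a model might emit for 3,847 x 9,261:
tool_request = {"tool": "multiply", "args": {"a": 3847, "b": 9261}}

TOOLS = {"multiply": multiply_tool}
result = TOOLS[tool_request["tool"]](**tool_request["args"])
print(result)  # 35627067, exact every time
```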

The o1/o3 evolution: OpenAI's o1 and o3 "reasoning" models use chain-of-thought at inference time to dramatically improve on reasoning tasks, but they're slower and more expensive. Anthropic has also invested heavily in reasoning for Claude. This is an active area of improvement — but it's not solved.


Limitation 4: No Persistent Memory (Goldfish Problem)

What it is: Every conversation starts from zero. The model has no memory of previous interactions with the same user unless you explicitly provide that context.

Analogy: Imagine a brilliant consultant who gives great advice — but every time you call, they have total amnesia. You have to re-explain your company, your goals, and everything you discussed last time.

Real-world impact:

  • ChatGPT's memory feature: OpenAI added a "memory" layer that stores user facts between conversations — but this is an engineering solution built on top of the model, not an intrinsic capability.
  • Customer support: A user contacts your AI support bot for the 5th time about the same billing issue. Without memory engineering, the bot has no idea about the previous 4 conversations.
  • Personalization: AI assistants can't learn user preferences over time without external memory systems.

What this means architecturally: You need to build memory yourself — user profiles, conversation history databases, retrieval systems that inject relevant past context into each prompt. This is a significant engineering investment.
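The core pattern is simple even though production versions are not. This minimal sketch persists each user's turns in memory and injects the most recent ones into every new prompt; the user id and messages are invented, and real systems add a database, retrieval, and summarization of older history.

```python
from collections import defaultdict

class ConversationMemory:
    """Toy external memory layer: store turns, replay recent ones."""

    def __init__(self, max_turns: int = 6):
        self.history = defaultdict(list)  # user_id -> [(role, text), ...]
        self.max_turns = max_turns

    def remember(self, user_id: str, role: str, text: str) -> None:
        self.history[user_id].append((role, text))

    def build_prompt(self, user_id: str, new_message: str) -> str:
        """Inject the last few turns ahead of the new message."""
        recent = self.history[user_id][-self.max_turns:]
        lines = [f"{role}: {text}" for role, text in recent]
        lines.append(f"user: {new_message}")
        return "\n".join(lines)

memory = ConversationMemory()
memory.remember("u42", "user", "My invoice #123 was double-charged.")
memory.remember("u42", "assistant", "I see the duplicate charge on #123.")
print(memory.build_prompt("u42", "Any update on my refund?"))
```

Because the model itself is stateless, everything it "remembers" is whatever this layer chooses to put back into the prompt, which is why memory quality is a product decision, not a model property.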


Limitation 5: No Real-World Interaction (Locked in a Box)

What it is: Base models can only read text in and write text out. They cannot browse the web, run code, access databases, call APIs, send emails, or interact with any external system.

Why this matters: A PM might imagine "the AI will check our inventory database and respond to the customer." But the model can't do any of that. Engineers have to build:

  • Tool calling / function calling: The model outputs a structured request (e.g., "call function check_inventory(product_id=X)"), and your application layer actually executes it
  • Agents: Frameworks where the model plans a sequence of tool calls to accomplish a complex task
  • Plugins / integrations: Connections to external data sources and services
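The tool-calling loop the application layer must implement can be sketched with a stub model. Everything here is illustrative: `fake_model` stands in for a real LLM API, and the message format is a simplified version of the structured requests providers actually return.

```python
def check_inventory(product_id: str) -> int:
    """Application-owned tool; the model never touches the database itself."""
    return {"X": 12}.get(product_id, 0)

def fake_model(messages):
    """Stub LLM: first asks for a tool call, then answers with its result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "check_inventory",
                              "args": {"product_id": "X"}}}
    count = messages[-1]["content"]
    return {"text": f"We have {count} units in stock."}

def run_agent(user_message: str) -> str:
    """The loop your application owns: call model, execute tools, repeat."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = fake_model(messages)
        if "tool_call" not in reply:
            return reply["text"]
        call = reply["tool_call"]
        result = {"check_inventory": check_inventory}[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": result})

print(run_agent("Is product X in stock?"))
```

Note that the model only ever produces text describing what it wants done; the application decides whether and how to actually do it, which is also where permissioning and safety checks belong.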

Real-world examples:

  • ChatGPT Plugins (2023): OpenAI launched and then deprecated plugins, replacing them with GPTs and function calling — showing how hard it is to get tool use right
  • Amazon Bedrock Agents: AWS's framework for giving models access to company APIs and knowledge bases
  • Google Gemini + Workspace: Gemini accessing Gmail, Docs, and Calendar isn't the model's native ability — it's an integration layer built by Google engineering


Limitation 6: Context Window Constraints

What it is: Models can only process a fixed amount of text at once. Anything beyond the context window is simply invisible to the model.

Context window sizes (as of early 2025):

| Model | Context Window | Rough Page Equivalent |
|---|---|---|
| GPT-4o | 128K tokens | ~300 pages |
| Claude 3.5 Sonnet | 200K tokens | ~500 pages |
| Gemini 1.5 Pro | 1M tokens | ~2,500 pages |
| Llama 3.1 405B | 128K tokens | ~300 pages |
| Mistral Large | 128K tokens | ~300 pages |

The "lost in the middle" problem: Even within the context window, models pay less attention to information in the middle of long contexts. Information at the beginning and end gets more attention. Google's research on Gemini and UC Berkeley's research on GPT-4 both confirmed this.

PM implications:

  • If your product processes long documents, choose models with large context windows (Gemini 1.5 Pro's 1M tokens is a differentiator)
  • Put the most important information at the beginning or end of the prompt
  • Consider chunking strategies for documents that exceed the context window
  • Longer contexts = higher cost and latency
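The chunking bullet above can be sketched with the same ~4-characters-per-token heuristic used earlier. This is a minimal sketch: real pipelines count tokens with the model's actual tokenizer, split at sentence or section boundaries, and overlap adjacent chunks so context isn't cut mid-thought.

```python
# Split a document into chunks that each fit a token budget, using the
# rough 4-chars-per-token heuristic for English text.

def chunk_by_tokens(text: str, max_tokens: int, chars_per_token: int = 4):
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

doc = "lorem ipsum " * 500
chunks = chunk_by_tokens(doc, max_tokens=100)
print(f"{len(chunks)} chunks, each under ~100 tokens")
```

Each chunk can then be summarized or embedded separately and retrieved on demand, which is how products handle documents far beyond any context window.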


Limitation 7: Safety, Bias, and Alignment Issues

What it is: Models can reflect biases present in training data, generate harmful content if guardrails are circumvented, and behave in ways that don't align with your product's values.

Real-world examples:

  • Gemini image generation (2024): Generated historically inaccurate images (e.g., diverse Nazi soldiers) due to over-aggressive diversity prompts, leading Google to temporarily pause the feature
  • GPT-4 jailbreaks: Users discovered prompts that bypassed safety filters, leading to harmful content generation
  • Resume screening bias: AI models used for recruiting have shown bias against certain demographics

PM responsibilities:

  • Define your product's safety requirements clearly
  • Implement content filtering and moderation layers
  • Test extensively for bias across user demographics
  • Monitor production outputs for safety violations
  • Build user reporting mechanisms


Limitation Summary Matrix

| Limitation | Severity | Can Be Mitigated? | Primary Mitigation |
|---|---|---|---|
| Hallucination | 🔴 Critical | Partially | RAG, citations, human review |
| Knowledge Cutoff | 🟡 Moderate | Yes | RAG, search grounding |
| Reasoning Gaps | 🟡 Moderate | Improving | Chain-of-thought, code tools, reasoning models |
| No Memory | 🟡 Moderate | Yes | External memory systems, conversation history |
| No Real-World Access | 🟡 Moderate | Yes | Tool calling, agents, integrations |
| Context Limits | 🟡 Moderate | Yes | Chunking, retrieval, model selection |
| Safety/Bias | 🔴 Critical | Partially | Guardrails, monitoring, testing |

2.3 The Foundation Model as a Building Block

The "Base + Enhancements" Mental Model

Think of a foundation model like a smartphone's base operating system. iOS or Android out of the box is useful — but the real value comes from:

  • Apps (tools and integrations)
  • Your data (photos, contacts, files)
  • Settings and preferences (personalization)
  • Accessories (hardware peripherals)

Similarly, a foundation model out of the box is impressive but incomplete. The real product value comes from layering enhancements on top:

┌──────────────────────────────────────────────────────┐
│  YOUR PRODUCT (User-facing experience)               │
├──────────────────────────────────────────────────────┤
│  Guardrails & Safety Layer                           │
├──────────────────────────────────────────────────────┤
│  Agent / Orchestration Layer (LangChain, custom)     │
├──────────────────────────────────────────────────────┤
│  Tool Integrations (APIs, DBs, Search)               │
├──────────────────────────────────────────────────────┤
│  RAG Layer (Your proprietary knowledge)              │
├──────────────────────────────────────────────────────┤
│  Prompt Engineering / System Instructions            │
├──────────────────────────────────────────────────────┤
│  Fine-tuning Layer (optional, domain-specific)       │
├──────────────────────────────────────────────────────┤
│  ┌──────────────────────────────────────────────┐    │
│  │  FOUNDATION MODEL (GPT-4, Claude, Gemini,    │    │
│  │  Llama, Mistral)                             │    │
│  └──────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────┘

Each layer gives you leverage to overcome a base model limitation:

| Base Limitation | Enhancement Layer | Example |
|---|---|---|
| Hallucination | RAG + citations | Perplexity grounds answers in web sources |
| Knowledge cutoff | Search integration | Gemini uses Google Search for real-time data |
| No memory | Memory / user profile system | ChatGPT's memory feature, Character.AI's long-term memory |
| No tool use | Function calling + agents | Expedia's AI travel planner booking flights |
| Safety gaps | Guardrails + content filters | Anthropic's Constitutional AI, OpenAI's moderation API |
| Reasoning limits | Code interpreter + chain of thought | ChatGPT Code Interpreter for math problems |

2.4 Foundation Model Comparison for PMs

Major Model Comparison (Early 2025)

| Dimension | GPT-4o (OpenAI) | Claude 3.5 Sonnet (Anthropic) | Gemini 1.5 Pro (Google) | Llama 3.1 405B (Meta) | Mistral Large (Mistral) |
|---|---|---|---|---|---|
| Strengths | Broadest capabilities, strong at code, excellent tool use | Best at long documents, nuanced writing, safety-conscious, strong coding | Massive context window (1M), multimodal, Google ecosystem | Open-source, self-hostable, strong performance | Open-weight, EU-based, efficient, strong multilingual |
| Weaknesses | Expensive at scale, closed-source | Slightly weaker on math/code vs GPT-4, more conservative | Inconsistent quality vs GPT-4/Claude, less mature API ecosystem | Requires infrastructure to self-host, slightly behind on benchmarks | Smaller community, less mature tooling |
| Context Window | 128K | 200K | 1M | 128K | 128K |
| Multimodal | Text, image, audio, video | Text, image | Text, image, audio, video | Text, image | Text, image |
| Best For | General-purpose, complex reasoning, code generation | Long-form content, analysis, safety-critical applications | Data-heavy applications, Google Workspace integration, ultra-long context | On-premises deployment, cost-sensitive at scale, customization | European data sovereignty, multilingual, cost-efficient |
| Pricing (approx. input/output per 1M tokens) | $2.50 / $10 | $3 / $15 | $1.25 / $5 | Free (self-hosted infra costs) | $2 / $6 |
| API Maturity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ (via providers) | ⭐⭐⭐ |

2.5 The Model Selection Framework

When choosing a foundation model for a product, use this decision framework:

Step 1: Define Your Requirements

| Requirement | Questions to Answer |
|---|---|
| Task type | Conversational? Analytical? Creative? Code generation? |
| Quality bar | What error rate is acceptable? What's the cost of a mistake? |
| Latency | Does the user expect real-time response (<2s) or can they wait? |
| Volume | How many requests per day/month? |
| Data sensitivity | Can data leave your infrastructure? Regulatory requirements? |
| Modalities | Text only? Need image/audio/video understanding? |
| Context needs | How much input data per request? |
| Languages | Which languages must be supported? |

Step 2: Apply the Cost–Latency–Quality Triangle

              QUALITY
               /\
              /  \
             /    \
            / Pick \
           / Two    \
          /          \
         /____________\
      COST           LATENCY
      (Low)          (Fast)

You can optimize for two of three. The third will suffer.

| Optimization | Result | Example |
|---|---|---|
| High Quality + Low Latency | 💰 Expensive | GPT-4o for real-time premium support → high API costs |
| High Quality + Low Cost | 🐌 Slow | Batch processing with GPT-4o → queue and run overnight |
| Low Cost + Low Latency | 📉 Lower Quality | GPT-4o-mini or Llama 3 8B → fast and cheap but less capable |

PM Strategy: Tiered Model Architecture

The best AI products don't use one model — they route requests to different models based on complexity:

  • Tier 1 (Simple queries): Small/fast model (GPT-4o-mini, Claude Haiku, Gemini Flash) β€” $0.15/1M input tokens, <500ms latency
  • Tier 2 (Standard queries): Mid-tier model (GPT-4o, Claude Sonnet, Gemini Pro) β€” $2-3/1M input tokens, 1-3s latency
  • Tier 3 (Complex reasoning): Premium model (GPT-4o, o1, Claude Opus) β€” $10-15/1M input tokens, 5-30s latency

Real-world example: Notion AI reportedly uses this tiered approach — simple formatting tasks use a smaller model, while complex writing tasks use a more powerful one. This can reduce costs by 60-80% while maintaining quality where it matters.
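A tiered router can be sketched with simple heuristics. This is a minimal sketch: the keyword and length rules below are invented placeholders, and production routers typically use a small classifier model (or the cheap model itself) to judge query complexity.

```python
# Route each query to a model tier by a crude complexity heuristic.
# Model names mirror the tiers described above.

TIERS = {
    "small": "gpt-4o-mini",  # simple queries
    "mid": "gpt-4o",         # standard queries
    "large": "o1",           # complex reasoning
}

REASONING_HINTS = ("why", "plan", "compare", "analyze")

def route(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in REASONING_HINTS):
        return TIERS["large"]
    if len(q.split()) > 30:      # long queries get the mid tier
        return TIERS["mid"]
    return TIERS["small"]

print(route("What are your store hours?"))                # small tier
print(route("Compare these two refund policies for me"))  # large tier
```

Even a crude router like this captures the economics: if most traffic is simple, most requests never touch the expensive tier.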

Step 3: Evaluate Against Your Use Case

Run a structured evaluation (or "eval"):

  1. Create 100+ representative test cases from real user queries
  2. Run each test case against 2-3 candidate models
  3. Have domain experts grade outputs on a rubric (accuracy, completeness, tone, safety)
  4. Calculate cost per query for each model
  5. Measure latency (P50, P95, P99) for each model
  6. Make a data-driven decision
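The reporting half of steps 3 and 5 can be sketched directly. The latency numbers and grades below are invented sample data standing in for real eval runs; the nearest-rank percentile is a simplification that is adequate for P50/P95/P99 reporting.

```python
# Summarize an eval run: pass rate from expert grades, plus latency
# percentiles. In practice these lists come from your test harness.

def percentile(values, p):
    """Nearest-rank percentile (good enough for P50/P95/P99 reports)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [820, 910, 1100, 950, 3400, 890, 1020, 980, 870, 4100]
grades = ["pass", "pass", "fail", "pass", "pass",
          "pass", "fail", "pass", "pass", "pass"]

print("P50:", percentile(latencies_ms, 50), "ms")
print("P95:", percentile(latencies_ms, 95), "ms")
print("pass rate:", grades.count("pass") / len(grades))
```

Note how a couple of slow outliers barely move P50 but dominate P95, which is why the tail percentiles, not the average, should drive your latency budget.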

Do not choose a model based on benchmark leaderboards alone. Benchmarks like MMLU, HumanEval, and HellaSwag measure narrow capabilities. Your product has specific needs that may not correlate with benchmark rankings.


2.6 Build vs. Buy Decision Framework

The Spectrum

| Approach | Description | When to Use | Examples |
|---|---|---|---|
| Use an API (Buy) | Call OpenAI/Anthropic/Google APIs directly | Most products. Fast to market, lowest upfront cost. | Intercom using GPT-4 for support bots |
| Fine-tune a hosted model | Customize a provider's model on your data | When you need domain-specific behavior that prompting can't achieve | A medical company fine-tuning GPT-4 on clinical guidelines |
| Deploy an open-source model | Self-host Llama, Mistral, or similar | Data sovereignty requirements, very high volume (API costs prohibitive), need full control | A European bank deploying Mistral on-premises for regulatory compliance |
| Train from scratch | Build your own foundation model | Almost never. Only if you're Google, Meta, or a well-funded AI lab. | Bloomberg training BloombergGPT on financial data |

Decision Matrix

| Factor | Use API | Fine-tune | Self-Host Open Source | Train from Scratch |
|---|---|---|---|---|
| Time to market | Days-weeks | Weeks-months | Months | Years |
| Upfront cost | ~$0 | $1K-100K | $100K-1M+ (infra) | $10M-100M+ |
| Ongoing cost | Per-token API fees | Per-token + fine-tuning cost | Infrastructure + ops | Infrastructure + ops + research |
| Data privacy | Data sent to provider | Data sent for fine-tuning | Full control | Full control |
| Customization | Prompt engineering only | Moderate | High | Complete |
| Maintenance | Provider handles updates | Re-fine-tune periodically | You manage everything | You manage everything |
| Team needed | PM + 1-2 engineers | PM + ML engineer | ML team + infra team | Large ML research team |

PM Decision Rule of Thumb

START with API → PROVE product-market fit → OPTIMIZE with fine-tuning or self-hosting IF needed

Do not start with self-hosting or training from scratch. 90%+ of AI products should start with API access. You can always move to fine-tuning or self-hosting after you've validated that:

  1. Users want the product
  2. The API approach has clear, measurable limitations you can't solve with prompt engineering or RAG
  3. You have the data and team to justify the investment

Real-world example: Duolingo started by integrating GPT-4 via API for its "Explain My Answer" and "Roleplay" features (Duolingo Max). They didn't build their own model. They validated user demand first, then optimized.


2.7 PM Action Items & Exercises

Exercise 1: Token Cost Calculator

Pick a product feature you've shipped (or want to build). Estimate:

  • Average tokens per user query (input)
  • Average tokens per AI response (output)
  • Expected daily active users
  • Queries per user per day

Calculate: Monthly API cost = (input_tokens × input_price + output_tokens × output_price) × queries_per_user × DAU × 30

Now calculate the same cost using three different models (GPT-4o, Claude Sonnet, Gemini Flash). What's the difference?
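The exercise formula can be sketched directly. The GPT-4o and Claude Sonnet rates match the figures quoted earlier in this chapter; the Gemini Flash rate is an assumed Flash-tier price for illustration, and the usage numbers are placeholders for your own estimates.

```python
# Exercise 1's formula, run against three price points.
# Prices are per 1M tokens; the Flash price is an assumption.

MODELS = {
    "gpt-4o":        {"in": 2.50, "out": 10.00},
    "claude-sonnet": {"in": 3.00, "out": 15.00},
    "gemini-flash":  {"in": 0.15, "out": 0.60},  # assumed Flash-tier price
}

def monthly_cost(input_tokens, output_tokens, queries_per_user, dau,
                 price, days=30):
    per_query = (input_tokens * price["in"] +
                 output_tokens * price["out"]) / 1_000_000
    return per_query * queries_per_user * dau * days

# Placeholder usage: 300 input + 400 output tokens, 3 queries/day, 100K DAU.
for name, price in MODELS.items():
    print(f"{name}: ${monthly_cost(300, 400, 3, 100_000, price):,.0f}/month")
```

Running the same usage profile through different price points is the fastest way to see how much model choice moves your unit economics.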

Exercise 2: Hallucination Audit

Take an AI-powered feature in a product you use (ChatGPT, Perplexity, Gemini, Copilot). Ask it 10 factual questions in your domain of expertise. For each answer:

  • Is it correct?
  • Is it confident?
  • Would an average user be able to detect if it's wrong?
  • What would the business consequence be if wrong?

Exercise 3: Limitation Mapping

For a product you're working on (or a product you admire), fill in this table:

| Feature | Which Limitation Is Most Dangerous? | Current Mitigation | Gap |
|---|---|---|---|
| Feature 1 | | | |
| Feature 2 | | | |
| Feature 3 | | | |

Exercise 4: Model Selection

You're the PM for a customer support chatbot at a large e-commerce company (think Amazon scale). Walk through the Model Selection Framework:

  1. Define your requirements using the Step 1 table
  2. Where do you sit on the Cost-Latency-Quality triangle?
  3. Which 2-3 models would you evaluate?
  4. What would your tiered architecture look like?
  5. Build vs. Buy — where do you start, and how do you evolve?


2.8 Discussion Questions

  1. The Hallucination Dilemma: Your CEO wants to launch an AI-powered financial advisor that gives personalized investment recommendations. The best model you've tested still hallucinates ~3% of the time. How do you think about the risk? What guardrails would you require before launch? Would you launch at all?

  2. Open vs. Closed Models: Meta's strategy is to release Llama as open-source, while OpenAI keeps GPT-4 closed. What are the product implications of building on each? How does this affect your competitive moat? What happens if OpenAI changes pricing by 5x?

  3. The Context Window Race: Google's Gemini can process 1M tokens (a few books). Does this change what products are possible? What use cases were impossible at 4K tokens that are now viable at 1M? Is "just make the context window bigger" a substitute for better retrieval systems?

  4. Vendor Lock-in: You've built your product on GPT-4's API with heavy prompt engineering specific to GPT-4's behavior. Anthropic releases a model that's 50% cheaper with comparable quality. How hard is it to switch? What would you do differently from the start to maintain model portability?

  5. The "Good Enough" Model: When is a smaller, cheaper model (GPT-4o-mini, Llama 8B) actually the better product choice than the most powerful model? Can "worse" AI create a "better" product?

  6. Cost at Scale: Your AI feature costs $0.02 per query. You have 10M DAU making 5 queries/day. That's $1M/day in API costs. How do you make the unit economics work? What levers do you have?


2.9 Key Takeaways

  1. LLMs are next-token predictors, not thinking machines. They generate statistically likely text. Understanding this explains most of their failure modes and helps you set correct user expectations.

  2. The three training phases shape the product. Pre-training gives knowledge, instruction-tuning gives behavior, RLHF gives alignment. Each phase is a lever that determines how the model behaves in your product.

  3. Seven critical limitations define your product's risk surface. Hallucination, knowledge cutoff, reasoning gaps, no persistent memory, no real-world interaction, context limits, and safety/bias. Every feature should be mapped against these limitations with explicit mitigation plans.

  4. Foundation models are building blocks, not finished products. The value you create as a PM comes from the enhancement layers you add on top: RAG, tools, memory, guardrails, and orchestration. The model is the foundation — not the house.

  5. Model selection is a product decision, not just a technical one. It affects cost, quality, latency, data privacy, vendor lock-in, and user experience. Use the Cost–Latency–Quality triangle and tiered architecture to make informed decisions.

  6. Start with APIs, prove value, then optimize. Don't over-invest in self-hosting or fine-tuning before you've validated product-market fit. The "Build vs. Buy" spectrum is a journey, not a one-time choice.

  7. The model landscape changes every 3-6 months. What's frontier today will be mid-tier tomorrow. Design your product architecture to be model-agnostic wherever possible. Your competitive moat comes from your data, your UX, and your integration layers — not from which model you use.