2.1 How Gen AI Models Actually Work (Without the PhD)
Before you can make good product decisions with AI, you need a working mental model of what's happening inside. You don't need to write code. You do need to know what the machine is doing well enough to predict when it will fail your users.
2.1.1 The Core Idea: Next-Token Prediction
Every major LLM (GPT-4, Claude, Gemini, Llama) does fundamentally the same thing:
It predicts the most likely next piece of text, one token at a time.
That's it. When ChatGPT writes a paragraph, it isn't "thinking" about the paragraph. It generates token #1, then uses that to generate token #2, then token #3, and so on, hundreds or thousands of times.
Analogy: The World's Best Autocomplete
Imagine the autocomplete on your phone keyboard, but trained on essentially the entire public internet: books, Wikipedia, code repositories, Reddit threads, scientific papers, news articles. Instead of suggesting the next word, it suggests the next token (a token is roughly ¾ of a word). It does this so well that the output reads like it was written by a thoughtful human.
Why this matters for PMs: The model isn't "understanding" your user's question in the way a human support agent does. It's pattern-matching against hundreds of billions of parameters to produce statistically likely continuations. This distinction explains almost every failure mode you'll encounter in production.
2.1.2 Tokenization: How Models See Text
Models don't read words. They read tokens: chunks of text that might be a whole word, part of a word, or even a single character.
| Text | Tokens (approximate) |
|---|---|
| "Hello world" | ["Hello", " world"] β 2 tokens |
| "unbelievable" | ["un", "believ", "able"] β 3 tokens |
| "GPT-4" | ["G", "PT", "-", "4"] β 4 tokens |
| "γγγ«γ‘γ―" | Could be 3-5 tokens depending on the model |
Why Tokenization Matters to PMs
- Cost: You pay per token (input + output). GPT-4o charges ~$2.50/million input tokens, ~$10/million output tokens. A customer service response might be 200-500 tokens. Multiply by millions of users and this is a real line item.
- Context window limits: Models have a maximum number of tokens they can process at once. GPT-4 Turbo handles 128K tokens (~300 pages). Claude 3.5 handles 200K tokens. Gemini 1.5 Pro handles up to 1M tokens. If your product needs to process a 500-page legal document in one shot, model choice matters.
- Non-English tax: Many languages tokenize less efficiently. The same sentence in Japanese or Hindi might use 2-3x more tokens than in English, making your product more expensive for those users. This has direct implications for international product launches.
- Math and code: Numbers tokenize unpredictably. "123456" might become ["123", "456"], which is one reason models are unreliable at arithmetic: they literally don't see the number as a single entity.
2.1.3 Transformers & Attention: The Architecture That Changed Everything
Every frontier LLM is built on the Transformer architecture, introduced in Google's 2017 paper "Attention Is All You Need."
Analogy: The Brilliant Reader
Imagine you're reading a 50-page document to answer a question. A normal reader goes top to bottom, remembering less as they go. A Transformer is like a reader who can instantly highlight the 15 most relevant sentences across all 50 pages, even if they're far apart, and synthesize an answer from just those highlights.
This "highlighting" is the attention mechanism.
How Attention Works (PM-Friendly Version)
When the model processes your prompt, every token "looks at" every other token and assigns an attention score: how relevant is this other token to understanding me in context?
Example prompt: "The bank by the river was steep."
- The word "bank" is ambiguous (financial institution vs. riverbank).
- The attention mechanism lets "bank" look at "river" and "steep," giving those words high attention scores, so the model correctly interprets "bank" as a riverbank.
Multi-head attention means the model runs many of these attention calculations in parallel, each one learning to focus on different types of relationships: syntax, semantics, long-range references, etc.
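A toy sketch of the scoring idea in Python: score a query vector against each key vector via dot products, then normalize with softmax. The 2-D "embeddings" for the riverbank sentence are invented for illustration; real models use thousands of dimensions and learned weights.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention: score each key against the query,
    then normalize the scores so they sum to 1."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Toy vectors for the other words in "The bank by the river was steep."
words = ["the", "by", "river", "steep"]
vecs  = [[0.1, 0.0], [0.1, 0.1], [0.9, 0.8], [0.8, 0.7]]
bank  = [0.9, 0.9]   # query vector for the ambiguous word "bank"

weights = attention_weights(bank, vecs)
for w, a in zip(words, weights):
    print(f"{w:>6}: {a:.2f}")   # "river" and "steep" get the highest weights
```

With these made-up vectors, "river" and "steep" dominate the distribution, which is the intuition behind the model disambiguating "bank."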
Why Transformers Won
Before Transformers, models (like RNNs/LSTMs) processed text sequentially, word by word, left to right. This made them:

- Slow (couldn't parallelize)
- Forgetful (information degraded over long sequences)

Transformers process all tokens in parallel and can attend to any position equally. This is why:

- Training became massively parallelizable across thousands of GPUs
- Models got dramatically better at long-range coherence
- Scale became the dominant strategy (more data + more parameters = better results)
2.1.4 The Training Process: Three Phases
Understanding how a model is trained tells you why it behaves certain ways in your product.
Phase 1: Pre-training (The Knowledge Phase)
What happens: The model reads trillions of tokens from the internet, books, code, and other text. It learns to predict the next token. This is unsupervised: no human labels needed.
Analogy: Imagine a medical student who reads every textbook, journal article, and patient record ever published. They absorb an enormous amount of knowledge β but they've never actually talked to a patient.
What it produces: A base model (also called a "pre-trained model"). Base models are powerful but awkward: they'll complete any text plausibly, but they don't know how to have a conversation. If you type a question, a base model might generate 10 related questions rather than an answer, because that's a plausible continuation of "question" text.
Cost and scale:

- GPT-4's training cost is estimated at $100M+ in compute
- Llama 3 405B was trained on 15T+ tokens
- Gemini Ultra was trained on Google's entire TPU fleet
Phase 2: Fine-tuning / Instruction Tuning (The Behavior Phase)
What happens: Humans create thousands of example conversations: "When a user asks X, a good response looks like Y." The model is fine-tuned on these examples.
Analogy: That medical student now does their residency: they learn how to interact with patients, follow protocols, and give useful responses rather than just reciting textbook passages.
What it produces: An instruction-tuned model that can follow directions, answer questions, and hold conversations. ChatGPT, Claude, and Gemini are all instruction-tuned models.
Phase 3: RLHF / RLAIF (The Alignment Phase)
What happens: Human raters compare pairs of model outputs and choose which is better. This feedback is used to further train the model via Reinforcement Learning from Human Feedback (RLHF). Anthropic and Google also use AI-generated feedback (RLAIF, Reinforcement Learning from AI Feedback).
Analogy: The doctor now gets patient satisfaction surveys, peer reviews, and malpractice guidelines. They learn not just to be knowledgeable and responsive, but to be careful, ethical, and aligned with what patients actually need.
What it produces: A model that is more helpful, less harmful, and more aligned with user expectations. This is why Claude tends toward caution, ChatGPT toward helpfulness, and Gemini toward Google-ecosystem integration.
PM Insight: The alignment phase is where the model's "personality" and safety guardrails get baked in. When you find a model is too cautious (won't answer medical questions) or too permissive (generates harmful content), that's an RLHF tuning decision, not a knowledge gap.
2.1.5 Inference: What Happens When a User Hits "Send"
When a user submits a prompt in your product:
1. Tokenization: The prompt is broken into tokens
2. Encoding: Each token is converted into a high-dimensional vector (embedding)
3. Forward pass: Tokens flow through the Transformer layers (GPT-4 reportedly has ~120 layers), with attention computed at each layer
4. Output distribution: The model produces a probability distribution over all possible next tokens (~100K vocabulary)
5. Sampling: A token is selected based on the probability distribution (controlled by temperature; more on this below)
6. Repeat: Steps 3-5 repeat until the model produces a stop token or hits the max length
Temperature controls randomness:

- Temperature 0: Always picks the most likely token; deterministic, repetitive, "safe"
- Temperature 0.7: Balanced; good for most product use cases
- Temperature 1.0+: More creative/random; useful for brainstorming, dangerous for factual tasks
PM Decision Point: Temperature is one of your most important product levers. A customer support bot should use low temperature (0.1-0.3). A creative writing assistant should use higher temperature (0.7-1.0). Getting this wrong creates either a robotic experience or an unreliable one.
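A minimal sketch of temperature-scaled sampling, treating temperature 0 as greedy argmax (which is roughly how most APIs behave); the logits are made-up numbers:

```python
import math
import random

def sample_token(logits, temperature):
    """Sample a next-token index from raw scores (logits).
    Temperature 0 is treated as greedy argmax; higher values flatten
    the distribution and increase randomness."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5, 0.1]   # hypothetical scores for 4 candidate tokens
print(sample_token(logits, temperature=0))    # always index 0 (greedy)
print(sample_token(logits, temperature=1.0))  # usually 0, sometimes others
```

Dividing the logits by a larger temperature squeezes them together before softmax, which is why high temperature makes unlikely tokens more probable.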
2.1.6 What PMs Need to Know vs. What Engineers Handle
| Concept | PM Must Understand | Engineer Handles |
|---|---|---|
| Tokenization | Cost implications, context window limits, multilingual impact | Token vocabulary design, BPE algorithm implementation |
| Attention | Why models handle some tasks well (long-range reasoning) and others poorly (precise counting) | Attention head configuration, KV-cache optimization |
| Training phases | How each phase shapes model behavior and limitations | Hyperparameter tuning, distributed training infrastructure |
| Temperature | Product impact on user experience; when to use high vs. low | Sampling algorithms (top-k, top-p, beam search) |
| Context window | Maximum input size, what to include/exclude in prompts | Context compression, retrieval augmentation, chunking strategies |
| Fine-tuning | When your use case needs it, what data you need, cost/timeline | Training loops, LoRA adapters, evaluation metrics |
| Inference cost | Unit economics per API call, latency budgets, caching strategy | GPU provisioning, batching, model quantization |
2.2 Critical Limitations of Foundation Models
This is the most important subsection for PMs. Every AI product failure you've read about traces back to one of these limitations. Memorize them.
Limitation 1: Hallucination (Confident Fabrication)
What it is: Models generate plausible-sounding but factually incorrect information with full confidence. They don't "know" they're wrong.
Why it happens: The model is optimizing for statistically likely text, not truth. If a plausible-sounding answer exists in the latent space, the model will produce it, whether or not it's factually accurate.
Real-world examples:

- Google Bard launch (2023): In its very first public demo, Bard claimed the James Webb Space Telescope took the first pictures of exoplanets outside our solar system. This was wrong. Google's stock dropped ~$100B in market cap that day.
- ChatGPT legal citations: Lawyer Steven Schwartz used ChatGPT to write a legal brief that cited six entirely fabricated court cases. The judge sanctioned him for submitting fake citations.
- Microsoft Bing Chat (early 2023): Confidently provided incorrect financial data from earnings reports, hallucinating specific revenue numbers that didn't exist.
- Amazon product reviews: AI-generated product descriptions have included fabricated specifications and non-existent features.
Hallucination rate benchmarks (approximate):

| Model | Hallucination Rate (general QA) |
|---|---|
| GPT-4o | ~3-5% |
| Claude 3.5 Sonnet | ~3-4% |
| Gemini 1.5 Pro | ~4-6% |
| Llama 3 70B | ~6-10% |
| GPT-3.5 | ~15-20% |
PM Mitigation Strategies:

- Implement RAG (Retrieval-Augmented Generation) to ground responses in verified data
- Add source citations and confidence indicators to the UI
- Use model-as-judge verification (have a second model check the first)
- Design human-in-the-loop workflows for high-stakes outputs
- Never ship AI in domains where a single hallucination has severe consequences (medical diagnosis, legal advice, financial trading) without robust guardrails
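A minimal sketch of the RAG idea: retrieve a few verified snippets, then build a prompt that instructs the model to answer only from them. The keyword-overlap retrieval and the document snippets are placeholders; production systems use embedding search and a vector database.

```python
import re

def retrieve(query, documents, k=2):
    """Naive retrieval: rank documents by word overlap with the query.
    A stand-in for embedding similarity search."""
    q = set(re.findall(r"\w+", query.lower()))
    scored = sorted(documents,
                    key=lambda d: len(q & set(re.findall(r"\w+", d.lower()))),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(query, documents):
    """Assemble a prompt that grounds the model in retrieved snippets."""
    snippets = retrieve(query, documents)
    context = "\n".join(f"[{i+1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using ONLY the sources below. Cite them as [1], [2]. "
        "If the answer is not in the sources, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Refunds are processed within 5 business days of approval.",
    "Premium plans include priority support via chat.",
    "Shipping to Canada takes 7-10 business days.",
]
prompt = build_grounded_prompt("How long do refunds take to process?", docs)
print(prompt)   # ready to send to the model of your choice
```

The "ONLY the sources below" instruction plus citations is what lets the UI show users where an answer came from.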
Limitation 2: Knowledge Cutoff (Frozen in Time)
What it is: Models only know what was in their training data. They have no awareness of events after their training cutoff date.
Specific cutoff dates (as of early 2025):

| Model | Approximate Knowledge Cutoff |
|---|---|
| GPT-4o | October 2023 |
| Claude 3.5 Sonnet | Early 2024 |
| Gemini 1.5 Pro | ~Late 2023 (with Search grounding) |
| Llama 3.1 | December 2023 |
Real-world impact:

- Customer service bots that can't answer questions about product updates or policy changes made after training
- Travel planning tools that suggest restaurants that have closed, reference pre-pandemic travel rules, or miss new visa requirements
- Financial analysis tools that don't know about recent earnings, regulatory changes, or market events
- HR chatbots that reference outdated company policies
PM Mitigation Strategies:

- Implement RAG with a frequently updated knowledge base
- Use web search/grounding (Gemini's Google Search grounding, ChatGPT's Browse feature)
- Clearly display knowledge cutoff dates to users
- Build pipelines to inject current context into prompts
- Design products that fail gracefully when asked about recent events ("I don't have information about events after [date]. Let me search for the latest...")
Limitation 3: Reasoning Gaps (Brittle Logic)
What it is: Models can appear to reason through complex problems but fail in predictable, sometimes bizarre ways, especially on novel problems that require true logical deduction rather than pattern matching.
Where models struggle:

- Multi-step math: Ask GPT-4 to multiply 3,847 × 9,261 and it will often get it wrong. It's not calculating; it's guessing based on patterns.
- Spatial reasoning: "I'm facing north. I turn left. I turn left again. What direction am I facing?" Models frequently get this wrong.
- Counterfactual reasoning: "If the Roman Empire never fell, what language would modern France speak?" Models tend to give superficially plausible but logically inconsistent answers.
- Planning and constraint satisfaction: "Schedule 5 meetings in 3 rooms over 2 days with these 12 constraints." Models struggle with combinatorial constraint problems.
- Negation and logic puzzles: "Which of these statements is NOT true?" Models are measurably worse with negation.
Real-world product implication: If you're building a product that requires precise calculation, deterministic logic, or constraint satisfaction, don't rely on the LLM for that part. Use the LLM for natural language understanding and generation; use traditional code for computation.
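A sketch of that division of labor: assume the model has (hypothetically) extracted the arithmetic expression from the user's message; your application then computes it exactly in code instead of asking the model to guess the answer.

```python
import ast
import operator

# Exact arithmetic belongs in code, not in the model. Here the model's
# (hypothetical) only job is turning "what's 3,847 times 9,261?" into
# the expression string; your application computes it deterministically.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str):
    """Evaluate a basic arithmetic expression safely (no eval())."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("3847 * 9261"))   # 35627067 -- exact, every time
```

This is the same pattern behind ChatGPT's Code Interpreter: the model writes or requests the computation, and a real runtime executes it.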
The o1/o3 evolution: OpenAI's o1 and o3 "reasoning" models use chain-of-thought at inference time to dramatically improve on reasoning tasks. But they're slower and more expensive. Anthropic's Claude 3.5 also invested heavily in reasoning. This is an active area of improvement β but it's not solved.
Limitation 4: No Persistent Memory (Goldfish Problem)
What it is: Every conversation starts from zero. The model has no memory of previous interactions with the same user unless you explicitly provide that context.
Analogy: Imagine a brilliant consultant who gives great advice, but every time you call, they have total amnesia. You have to re-explain your company, your goals, and everything you discussed last time.
Real-world impact:

- ChatGPT's memory feature: OpenAI added a "memory" layer that stores user facts between conversations, but this is an engineering solution built on top of the model, not an intrinsic capability.
- Customer support: A user contacts your AI support bot for the 5th time about the same billing issue. Without memory engineering, the bot has no idea about the previous 4 conversations.
- Personalization: AI assistants can't learn user preferences over time without external memory systems.
What this means architecturally: You need to build memory yourself: user profiles, conversation history databases, retrieval systems that inject relevant past context into each prompt. This is a significant engineering investment.
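A minimal sketch of such an external memory layer, assuming a simple per-user turn log; production systems add summarization and retrieval on top of this idea:

```python
from collections import defaultdict

class ConversationMemory:
    """Store each user's past turns and inject the most recent ones
    into every new prompt. The model itself remembers nothing."""

    def __init__(self, max_turns=5):
        self.history = defaultdict(list)
        self.max_turns = max_turns

    def remember(self, user_id, role, text):
        self.history[user_id].append((role, text))

    def build_prompt(self, user_id, new_message):
        """Prepend the last few turns so the model sees prior context."""
        recent = self.history[user_id][-self.max_turns:]
        lines = [f"{role}: {text}" for role, text in recent]
        return "\n".join(lines + [f"user: {new_message}"])

memory = ConversationMemory()
memory.remember("u42", "user", "My September invoice was charged twice.")
memory.remember("u42", "assistant", "I've flagged the duplicate charge for review.")
print(memory.build_prompt("u42", "Any update on my billing issue?"))
```

Everything the model "remembers" here is just text your application chose to put back into the prompt.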
Limitation 5: No Real-World Interaction (Locked in a Box)
What it is: Base models can only read text in and write text out. They cannot browse the web, run code, access databases, call APIs, send emails, or interact with any external system.
Why this matters: A PM might imagine "the AI will check our inventory database and respond to the customer." But the model can't do any of that. Engineers have to build:
- Tool calling / function calling: The model outputs a structured request (e.g., "call function check_inventory(product_id=X)"), and your application layer actually executes it
- Agents: Frameworks where the model plans a sequence of tool calls to accomplish a complex task
- Plugins / integrations: Connections to external data sources and services
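A sketch of the application side of tool calling. The JSON shape and the check_inventory function are invented for illustration; real providers define their own function-calling schemas, but the pattern (model emits a structured request, your code validates and executes it) is the same.

```python
import json

# Hypothetical tools your application exposes. The model never runs
# these; it only emits a structured request that your code executes.
def check_inventory(product_id: str) -> dict:
    fake_db = {"SKU-123": 14, "SKU-999": 0}   # stand-in for a real database
    return {"product_id": product_id, "in_stock": fake_db.get(product_id, 0)}

TOOLS = {"check_inventory": check_inventory}

def dispatch(model_output: str) -> dict:
    """Parse the model's structured tool request and execute it."""
    request = json.loads(model_output)        # e.g. from a function-calling API
    tool = TOOLS.get(request["function"])
    if tool is None:
        raise ValueError(f"unknown tool: {request['function']}")
    return tool(**request["arguments"])

# Imagine the model replied with this instead of plain text:
model_output = '{"function": "check_inventory", "arguments": {"product_id": "SKU-123"}}'
print(dispatch(model_output))
```

The tool's result is then fed back into the next prompt so the model can phrase a natural-language answer for the user.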
Real-world examples:

- ChatGPT Plugins (2023): OpenAI launched and then deprecated plugins, replacing them with GPTs and function calling, showing how hard it is to get tool use right
- Amazon Bedrock Agents: AWS's framework for giving models access to company APIs and knowledge bases
- Google Gemini + Workspace: Gemini accessing Gmail, Docs, and Calendar isn't the model's native ability; it's Google engineering integration layers
Limitation 6: Context Window Constraints
What it is: Models can only process a fixed amount of text at once. Anything beyond the context window is simply invisible to the model.
Context window sizes (as of early 2025):

| Model | Context Window | Rough Page Equivalent |
|---|---|---|
| GPT-4o | 128K tokens | ~300 pages |
| Claude 3.5 Sonnet | 200K tokens | ~500 pages |
| Gemini 1.5 Pro | 1M tokens | ~2,500 pages |
| Llama 3.1 405B | 128K tokens | ~300 pages |
| Mistral Large | 128K tokens | ~300 pages |
The "lost in the middle" problem: Even within the context window, models pay less attention to information in the middle of long contexts. Information at the beginning and end gets more attention. Google's research on Gemini and UC Berkeley's research on GPT-4 both confirmed this.
PM implications:

- If your product processes long documents, choose models with large context windows (Gemini 1.5 Pro's 1M tokens is a differentiator)
- Put the most important information at the beginning or end of the prompt
- Consider chunking strategies for documents that exceed the context window
- Longer contexts = higher cost and latency
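A minimal sketch of fixed-size chunking with overlap. Integer lists stand in for real tokenizer output, and production pipelines usually also respect sentence or section boundaries:

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=100):
    """Split a token list into overlapping chunks so no passage is
    cut off without surrounding context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap
    return chunks

doc = list(range(2500))              # stand-in for a tokenized document
chunks = chunk_tokens(doc, 1000, 100)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk repeats the last 100 tokens of the previous one, so a sentence straddling a boundary still appears intact in at least one chunk.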
Limitation 7: Safety, Bias, and Alignment Issues
What it is: Models can reflect biases present in training data, generate harmful content if guardrails are circumvented, and behave in ways that don't align with your product's values.
Real-world examples:

- Gemini image generation (2024): Generated historically inaccurate images (e.g., diverse Nazi soldiers) due to over-aggressive diversity prompts, leading Google to temporarily pause the feature
- GPT-4 jailbreaks: Users discovered prompts that bypassed safety filters, leading to harmful content generation
- Resume screening bias: AI models used for recruiting have shown bias against certain demographics
PM responsibilities:

- Define your product's safety requirements clearly
- Implement content filtering and moderation layers
- Test extensively for bias across user demographics
- Monitor production outputs for safety violations
- Build user reporting mechanisms
Limitation Summary Matrix
| Limitation | Severity | Can Be Mitigated? | Primary Mitigation |
|---|---|---|---|
| Hallucination | 🔴 Critical | Partially | RAG, citations, human review |
| Knowledge Cutoff | 🟡 Moderate | Yes | RAG, search grounding |
| Reasoning Gaps | 🟡 Moderate | Improving | Chain-of-thought, code tools, reasoning models |
| No Memory | 🟡 Moderate | Yes | External memory systems, conversation history |
| No Real-World Access | 🟡 Moderate | Yes | Tool calling, agents, integrations |
| Context Limits | 🟡 Moderate | Yes | Chunking, retrieval, model selection |
| Safety/Bias | 🔴 Critical | Partially | Guardrails, monitoring, testing |
2.3 The Foundation Model as a Building Block
The "Base + Enhancements" Mental Model
Think of a foundation model like a smartphone's base operating system. iOS or Android out of the box is useful, but the real value comes from:

- Apps (tools and integrations)
- Your data (photos, contacts, files)
- Settings and preferences (personalization)
- Accessories (hardware peripherals)
Similarly, a foundation model out of the box is impressive but incomplete. The real product value comes from layering enhancements on top:
┌──────────────────────────────────────────────────────┐
│ YOUR PRODUCT (User-facing experience)                │
├──────────────────────────────────────────────────────┤
│ Guardrails & Safety Layer                            │
├──────────────────────────────────────────────────────┤
│ Agent / Orchestration Layer (LangChain, custom)      │
├──────────────────────────────────────────────────────┤
│ Tool Integrations (APIs, DBs, Search)                │
├──────────────────────────────────────────────────────┤
│ RAG Layer (Your proprietary knowledge)               │
├──────────────────────────────────────────────────────┤
│ Prompt Engineering / System Instructions             │
├──────────────────────────────────────────────────────┤
│ Fine-tuning Layer (optional, domain-specific)        │
├──────────────────────────────────────────────────────┤
│ ┌────────────────────────────────────────────┐       │
│ │ FOUNDATION MODEL (GPT-4, Claude, Gemini,   │       │
│ │ Llama, Mistral)                            │       │
│ └────────────────────────────────────────────┘       │
└──────────────────────────────────────────────────────┘
Each layer gives you leverage to overcome a base model limitation:
| Base Limitation | Enhancement Layer | Example |
|---|---|---|
| Hallucination | RAG + Citations | Perplexity grounds answers in web sources |
| Knowledge cutoff | Search integration | Gemini uses Google Search for real-time data |
| No memory | Memory / user profile system | ChatGPT's memory feature, character.ai's long-term memory |
| No tool use | Function calling + agents | Expedia's AI travel planner booking flights |
| Safety gaps | Guardrails + content filters | Anthropic's Constitutional AI, OpenAI's moderation API |
| Reasoning limits | Code interpreter + chain of thought | ChatGPT Code Interpreter for math problems |
2.4 Foundation Model Comparison for PMs
Major Model Comparison (Early 2025)
| Dimension | GPT-4o (OpenAI) | Claude 3.5 Sonnet (Anthropic) | Gemini 1.5 Pro (Google) | Llama 3.1 405B (Meta) | Mistral Large (Mistral) |
|---|---|---|---|---|---|
| Strengths | Broadest capabilities, strong at code, excellent tool use | Best at long documents, nuanced writing, safety-conscious, strong coding | Massive context window (1M), multimodal, Google ecosystem | Open-source, self-hostable, strong performance | Open-weight, EU-based, efficient, strong multilingual |
| Weaknesses | Expensive at scale, closed-source | Slightly weaker on math/code vs GPT-4, more conservative | Inconsistent quality vs GPT-4/Claude, less mature API ecosystem | Requires infrastructure to self-host, slightly behind on benchmarks | Smaller community, less mature tooling |
| Context Window | 128K | 200K | 1M | 128K | 128K |
| Multimodal | Text, image, audio, video | Text, image | Text, image, audio, video | Text, image | Text, image |
| Best For | General-purpose, complex reasoning, code generation | Long-form content, analysis, safety-critical applications | Data-heavy applications, Google Workspace integration, ultra-long context | On-premises deployment, cost-sensitive at scale, customization | European data sovereignty, multilingual, cost-efficient |
| Pricing (approx input/output per 1M tokens) | $2.50 / $10 | $3 / $15 | $1.25 / $5 | Free (self-hosted infra costs) | $2 / $6 |
| API Maturity | ★★★★★ | ★★★★ | ★★★★ | ★★★ (via providers) | ★★★ |
2.5 The Model Selection Framework
When choosing a foundation model for a product, use this decision framework:
Step 1: Define Your Requirements
| Requirement | Questions to Answer |
|---|---|
| Task type | Conversational? Analytical? Creative? Code generation? |
| Quality bar | What error rate is acceptable? What's the cost of a mistake? |
| Latency | Does the user expect real-time response (<2s) or can they wait? |
| Volume | How many requests per day/month? |
| Data sensitivity | Can data leave your infrastructure? Regulatory requirements? |
| Modalities | Text only? Need image/audio/video understanding? |
| Context needs | How much input data per request? |
| Languages | Which languages must be supported? |
Step 2: Apply the Cost-Latency-Quality Triangle
QUALITY
/\
/ \
/ \
/ Pick \
/ Two \
/ \
/____________\
COST LATENCY
(Low) (Fast)
You can optimize for two of three. The third will suffer.
| Optimization | Result | Example |
|---|---|---|
| High Quality + Low Latency | Expensive | GPT-4o for real-time premium support (high API costs) |
| High Quality + Low Cost | Slow | Batch processing with GPT-4o: queue and run overnight |
| Low Cost + Low Latency | Lower Quality | GPT-4o-mini or Llama 3 8B: fast and cheap but less capable |
PM Strategy: Tiered Model Architecture
The best AI products don't use one model; they route requests to different models based on complexity:
- Tier 1 (Simple queries): Small/fast model (GPT-4o-mini, Claude Haiku, Gemini Flash). ~$0.15/1M input tokens, <500ms latency
- Tier 2 (Standard queries): Mid-tier model (GPT-4o, Claude Sonnet, Gemini Pro). ~$2-3/1M input tokens, 1-3s latency
- Tier 3 (Complex reasoning): Premium model (GPT-4o, o1, Claude Opus). ~$10-15/1M input tokens, 5-30s latency
Real-world example: Notion AI reportedly uses this tiered approach: simple formatting tasks use a smaller model, while complex writing tasks use a more powerful one. This can reduce costs by 60-80% while maintaining quality where it matters.
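A toy sketch of such a router. The word-count and keyword heuristics and the tier names are purely illustrative; real systems typically use a trained classifier or a small LLM to pick the tier.

```python
def route_query(query: str) -> str:
    """Route a query to a model tier based on rough complexity signals.
    Heuristics and tier names are illustrative placeholders."""
    words = query.split()
    complex_markers = {"analyze", "compare", "plan", "why", "strategy"}
    if len(words) <= 8 and not complex_markers & {w.lower() for w in words}:
        return "tier1-small"      # e.g., GPT-4o-mini / Haiku / Flash
    if len(words) <= 40:
        return "tier2-standard"   # e.g., GPT-4o / Sonnet / Pro
    return "tier3-premium"        # e.g., o1 / Opus

print(route_query("Where is my order?"))
print(route_query("Compare these two refund policies for me"))
```

The savings come from the traffic distribution: if most queries are tier 1, the expensive model only sees the minority that actually needs it.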
Step 3: Evaluate Against Your Use Case
Run a structured evaluation (or "eval"):

1. Create 100+ representative test cases from real user queries
2. Run each test case against 2-3 candidate models
3. Have domain experts grade outputs on a rubric (accuracy, completeness, tone, safety)
4. Calculate cost per query for each model
5. Measure latency (P50, P95, P99) for each model
6. Make a data-driven decision
Do not choose a model based on benchmark leaderboards alone. Benchmarks like MMLU, HumanEval, and HellaSwag measure narrow capabilities. Your product has specific needs that may not correlate with benchmark rankings.
2.6 Build vs. Buy Decision Framework
The Spectrum
| Approach | Description | When to Use | Examples |
|---|---|---|---|
| Use an API (Buy) | Call OpenAI/Anthropic/Google APIs directly | Most products. Fast to market, lowest upfront cost. | Intercom using GPT-4 for support bots |
| Fine-tune a hosted model | Customize a provider's model on your data | When you need domain-specific behavior that prompting can't achieve | A medical company fine-tuning GPT-4 on clinical guidelines |
| Deploy an open-source model | Self-host Llama, Mistral, or similar | Data sovereignty requirements, very high volume (API costs prohibitive), need full control | A European bank deploying Mistral on-premises for regulatory compliance |
| Train from scratch | Build your own foundation model | Almost never. Only if you're Google, Meta, or a well-funded AI lab. | Bloomberg training BloombergGPT on financial data |
Decision Matrix
| Factor | Use API | Fine-tune | Self-Host Open Source | Train from Scratch |
|---|---|---|---|---|
| Time to market | Days-weeks | Weeks-months | Months | Years |
| Upfront cost | ~$0 | $1K-100K | $100K-1M+ (infra) | $10M-100M+ |
| Ongoing cost | Per-token API fees | Per-token + fine-tuning cost | Infrastructure + ops | Infrastructure + ops + research |
| Data privacy | Data sent to provider | Data sent for fine-tuning | Full control | Full control |
| Customization | Prompt engineering only | Moderate | High | Complete |
| Maintenance | Provider handles updates | Re-fine-tune periodically | You manage everything | You manage everything |
| Team needed | PM + 1-2 engineers | PM + ML engineer | ML team + infra team | Large ML research team |
PM Decision Rule of Thumb
START with API → PROVE product-market fit → OPTIMIZE with fine-tuning or self-hosting IF needed
Do not start with self-hosting or training from scratch. 90%+ of AI products should start with API access. You can always move to fine-tuning or self-hosting after you've validated that:

1. Users want the product
2. The API approach has clear, measurable limitations you can't solve with prompt engineering or RAG
3. You have the data and team to justify the investment
Real-world example: Duolingo started by integrating GPT-4 via API for its "Explain My Answer" and "Roleplay" features (Duolingo Max). They didn't build their own model. They validated user demand first, then optimized.
2.7 PM Action Items & Exercises
Exercise 1: Token Cost Calculator
Pick a product feature you've shipped (or want to build). Estimate:

- Average tokens per user query (input)
- Average tokens per AI response (output)
- Expected daily active users
- Queries per user per day
Calculate: Monthly API cost = (input_tokens × input_price + output_tokens × output_price) × queries_per_user × DAU × 30
Now calculate the same cost using three different models (GPT-4o, Claude Sonnet, Gemini Flash). What's the difference?
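A starting-point sketch for that calculation in Python. The token counts, traffic numbers, and the per-model prices are illustrative; plug in your own:

```python
def monthly_api_cost(input_tokens, output_tokens, queries_per_user_per_day,
                     dau, price_in_per_m, price_out_per_m):
    """Monthly API cost in USD, given per-1M-token prices."""
    per_query = (input_tokens * price_in_per_m +
                 output_tokens * price_out_per_m) / 1_000_000
    return per_query * queries_per_user_per_day * dau * 30

# Illustrative scenario: 300 input + 400 output tokens per query,
# 3 queries/user/day, 50K DAU. Prices are approximate list prices.
for name, p_in, p_out in [("GPT-4o", 2.50, 10.00),
                          ("Claude 3.5 Sonnet", 3.00, 15.00),
                          ("Gemini Flash (approx)", 0.15, 0.60)]:
    cost = monthly_api_cost(300, 400, 3, 50_000, p_in, p_out)
    print(f"{name}: ${cost:,.0f}/month")
```

Even in this small scenario the cheapest and most expensive models differ by more than an order of magnitude per month, which is the point of the exercise.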
Exercise 2: Hallucination Audit
Take an AI-powered feature in a product you use (ChatGPT, Perplexity, Gemini, Copilot). Ask it 10 factual questions in your domain of expertise. For each answer:

- Is it correct?
- Is it confident?
- Would an average user be able to detect if it's wrong?
- What would the business consequence be if wrong?
Exercise 3: Limitation Mapping
For a product you're working on (or a product you admire), fill in this table:
| Feature | Which Limitation Is Most Dangerous? | Current Mitigation | Gap |
|---|---|---|---|
| Feature 1 | | | |
| Feature 2 | | | |
| Feature 3 | | | |
Exercise 4: Model Selection
You're the PM for a customer support chatbot at a large e-commerce company (think Amazon scale). Walk through the Model Selection Framework:

1. Define your requirements using the Step 1 table
2. Where do you sit on the Cost-Latency-Quality triangle?
3. Which 2-3 models would you evaluate?
4. What would your tiered architecture look like?
5. Build vs. Buy: where do you start and how do you evolve?
2.8 Discussion Questions
- The Hallucination Dilemma: Your CEO wants to launch an AI-powered financial advisor that gives personalized investment recommendations. The best model you've tested still hallucinates ~3% of the time. How do you think about the risk? What guardrails would you require before launch? Would you launch at all?
- Open vs. Closed Models: Meta's strategy is to release Llama as open-source, while OpenAI keeps GPT-4 closed. What are the product implications of building on each? How does this affect your competitive moat? What happens if OpenAI changes pricing by 5x?
- The Context Window Race: Google's Gemini can process 1M tokens (a few books). Does this change what products are possible? What use cases were impossible at 4K tokens that are now viable at 1M? Is "just make the context window bigger" a substitute for better retrieval systems?
- Vendor Lock-in: You've built your product on GPT-4's API with heavy prompt engineering specific to GPT-4's behavior. Anthropic releases a model that's 50% cheaper with comparable quality. How hard is it to switch? What would you do differently from the start to maintain model portability?
- The "Good Enough" Model: When is a smaller, cheaper model (GPT-4o-mini, Llama 8B) actually the better product choice than the most powerful model? Can "worse" AI create a "better" product?
- Cost at Scale: Your AI feature costs $0.02 per query. You have 10M DAU making 5 queries/day. That's $1M/day in API costs. How do you make the unit economics work? What levers do you have?
2.9 Key Takeaways
- LLMs are next-token predictors, not thinking machines. They generate statistically likely text. Understanding this explains most of their failure modes and helps you set correct user expectations.
- The three training phases shape the product. Pre-training gives knowledge, instruction-tuning gives behavior, RLHF gives alignment. Each phase is a lever that determines how the model behaves in your product.
- Seven critical limitations define your product's risk surface. Hallucination, knowledge cutoff, reasoning gaps, no persistent memory, no real-world interaction, context limits, and safety/bias. Every feature should be mapped against these limitations with explicit mitigation plans.
- Foundation models are building blocks, not finished products. The value you create as a PM comes from the enhancement layers you add on top: RAG, tools, memory, guardrails, and orchestration. The model is the foundation, not the house.
- Model selection is a product decision, not just a technical one. It affects cost, quality, latency, data privacy, vendor lock-in, and user experience. Use the Cost-Latency-Quality triangle and tiered architecture to make informed decisions.
- Start with APIs, prove value, then optimize. Don't over-invest in self-hosting or fine-tuning before you've validated product-market fit. The "Build vs. Buy" spectrum is a journey, not a one-time choice.
- The model landscape changes every 3-6 months. What's frontier today will be mid-tier tomorrow. Design your product architecture to be model-agnostic wherever possible. Your competitive moat comes from your data, your UX, and your integration layers, not from which model you use.