In Sections 2 and 3, you learned how foundation models work and how to enhance them with knowledge, reasoning, tools, and memory. But shipping v1 is only the beginning. The most important question for any AI PM is: how does this product get better over time?
Traditional software improves through deterministic feature releases. AI products improve through a fundamentally different mechanism: learning loops. Every user interaction is potential training signal. Every thumbs-down is a data point. Every edit a user makes to an AI-generated draft tells you exactly where the model fell short.
This section covers the complete improvement stack:
┌──────────────────────────────────────────────────────────────┐
│ THE AI IMPROVEMENT STACK │
├──────────────────────────────────────────────────────────────┤
│ │
│ 📊 EVALUATION 🔄 LEARNING │
│ How do you KNOW it's working? RLHF, DPO, RLAIF, │
│ Offline evals, online metrics, Constitutional AI, │
│ human judgment, product KPIs continuous improvement │
│ │
│ 👤 HUMAN FEEDBACK 🎯 FINE-TUNING │
│ Explicit signals, implicit Full, LoRA, instruction │
│ signals, feedback loops, tuning — when prompting │
│ privacy considerations and RAG aren't enough │
│ │
├──────────────────────────────────────────────────────────────┤
│ ENHANCEMENT LAYERS (Section 3) │
│ RAG, Reasoning, Tools, Memory │
├──────────────────────────────────────────────────────────────┤
│ FOUNDATION MODEL (Section 2) │
└──────────────────────────────────────────────────────────────┘
4.1 Evaluation: The #1 PM Skill for AI Products
Why Evaluation Is Different (and Harder) for AI
In traditional product management, you know when your feature works. A checkout button either processes the payment or it doesn't. A search bar either returns results or shows an error. Success is binary and deterministic.
AI products are fundamentally different:
- Outputs are probabilistic. The same input can produce different outputs across runs.
- "Correct" is subjective. Ask five people to rate an AI-written email and you'll get five different scores.
- Failure is partial. An AI response can be 80% accurate but contain one hallucinated fact that destroys user trust.
- Quality is multi-dimensional. A response can be accurate but too verbose, or concise but missing a critical detail, or well-written but tonally wrong.
This makes evaluation the single most important PM skill for AI products. If you can't measure quality, you can't improve it. If you can't detect regressions, you'll ship them to users. If you can't compare approaches, you'll make decisions based on vibes instead of data.
Analogy: Evaluation is to AI products what unit testing is to traditional software. No serious engineering team ships without tests. No serious AI team should ship without evals.
4.1.1 Offline Evaluation: Testing Before Users See It
Offline evaluation happens before deployment — on held-out test data, using automated metrics and human reviewers. This is your safety net.
Automated Metrics
| Metric | What It Measures | Best For | Limitations |
|---|---|---|---|
| BLEU | N-gram overlap between generated text and reference text | Translation, short factual answers | Penalizes valid paraphrases; doesn't measure meaning |
| ROUGE | Recall-oriented overlap (how much of the reference appears in the output) | Summarization | Same as BLEU — surface-level only |
| Perplexity | How "surprised" the model is by a text (lower = more fluent) | Language fluency, comparing model versions | Doesn't measure factual accuracy or usefulness |
| BERTScore | Semantic similarity using BERT embeddings | Meaning-preserving comparisons | Computationally expensive; threshold tuning needed |
| Exact Match (EM) | Whether the output exactly matches the expected answer | Factual QA, code output, structured data | Too strict for open-ended tasks |
| F1 Score | Token-level precision and recall against a reference | Extractive QA | Doesn't capture meaning, only word overlap |
PM Interpretation: Automated metrics are cheap and fast but shallow. They tell you whether the model's text overlaps with a reference answer — not whether the response is actually good. Use them for regression detection (did this change make things worse across 10,000 test cases?) rather than quality measurement.
Real-world example: Google Translate used BLEU for years to compare translation quality. But a translation can score high on BLEU while being awkward and unnatural, or score low while being an excellent localization. Google eventually moved toward human evaluation and model-based evaluation for quality measurement, keeping BLEU only for fast automated checks.
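Two of these metrics are simple enough to compute yourself. Here is a minimal sketch of Exact Match and token-level F1 with a simplified normalizer (official QA scoring scripts typically also strip punctuation and articles before comparing):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Simplified: lowercase + whitespace split. Official scorers also
    # strip punctuation and articles ("a", "an", "the").
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A valid paraphrase scores well below 1.0 — exactly the limitation noted above:
print(token_f1("the iphone 15 launched in september 2023",
               "iphone 15 launched september 2023"))  # ~0.83
```

Note how surface-level these are: a semantically identical answer with different wording scores poorly, which is why they belong in regression detection rather than quality measurement.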
LLM-as-Judge
A powerful emerging pattern: use a strong LLM (like GPT-4 or Claude) to evaluate the outputs of another LLM. This gives you scalable evaluation that's closer to human judgment than automated metrics.
How it works:
1. Define your evaluation criteria in a rubric (accuracy, helpfulness, safety, tone)
2. Create a grading prompt: "You are an expert evaluator. Rate the following AI response on a scale of 1-5 for accuracy, helpfulness, and tone. Explain your reasoning."
3. Feed the LLM the original question, the AI's response, and (optionally) a reference answer
4. The evaluator LLM returns scores and reasoning
Advantages:
- 10-100x cheaper than human evaluation
- Consistent (no evaluator fatigue or mood swings)
- Scalable to thousands of test cases per hour
- Can evaluate subjective qualities (tone, empathy, creativity)

Risks:
- Model bias: GPT-4 tends to prefer GPT-4-style outputs; Claude tends to prefer Claude-style outputs
- Sycophancy: Evaluator LLMs can be overly generous
- Ceiling: Can't evaluate beyond its own capability level
- Gaming: If you know the evaluator's preferences, you can optimize for the judge rather than the user
Best practice: Use a different model family as the judge. If your product uses GPT-4, evaluate with Claude (and vice versa). Use multiple judges and aggregate scores. Validate LLM-as-judge scores against human evaluations on a sample to calibrate.
Real-world example: Anthropic uses Claude as a "constitutional judge" to evaluate other Claude outputs against its principles. OpenAI uses GPT-4 to evaluate fine-tuning data quality before training. Startups like Braintrust and Patronus AI have built entire evaluation platforms around LLM-as-judge.
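The pattern reduces to a few lines. In this sketch, `call_llm` is a placeholder for whatever completion API your product uses, and the JSON-output instruction and parsing are illustrative conventions, not a canonical format:

```python
import json

RUBRIC_PROMPT = """You are an expert evaluator. Rate the following AI response
on a scale of 1-5 for accuracy, helpfulness, and tone. Explain your reasoning,
then output one JSON object:
{{"accuracy": n, "helpfulness": n, "tone": n, "reasoning": "..."}}

Question: {question}
AI response: {response}
Reference answer (optional): {reference}"""

def judge(question: str, response: str, reference: str, call_llm) -> dict:
    raw = call_llm(RUBRIC_PROMPT.format(
        question=question, response=response, reference=reference))
    # Pull the trailing JSON object out of the judge's reply; fail loudly
    # on malformed output rather than silently recording a bad score.
    return json.loads(raw[raw.index("{"):raw.rindex("}") + 1])

def aggregate(per_judge: list[dict]) -> dict:
    """Average each numeric dimension across several judge models."""
    dims = [d for d in per_judge[0] if d != "reasoning"]
    return {d: sum(s[d] for s in per_judge) / len(per_judge) for d in dims}
```

Running `judge` with two or more judge models from different families and feeding the results to `aggregate` implements the cross-family best practice described above.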
Benchmark Suites
Standardized benchmarks let you compare models across known tasks:
| Benchmark | What It Tests | Limitation |
|---|---|---|
| MMLU | Knowledge across 57 subjects | Multiple-choice format doesn't test generation quality |
| HumanEval | Python code generation | Narrow scope; doesn't test real-world coding |
| GSM8K | Grade-school math reasoning | Too easy for frontier models now |
| TruthfulQA | Resistance to generating popular misconceptions | Small test set; models may have memorized it |
| MT-Bench | Multi-turn conversation quality (LLM-judged) | Relies on GPT-4 as judge |
| LMSYS Chatbot Arena | Head-to-head human preference across real conversations | Crowdsourced; population may not match your users |
PM caveat: Benchmarks are useful for initial model screening, but your eval suite must be built from your own product's data. A model that tops MMLU might still perform poorly on your specific use case. Always build custom evals.
4.1.2 Online Evaluation: Measuring in Production
Offline evals tell you if the model should work. Online evals tell you if it actually works for real users.
A/B Testing AI Features (It's Harder Than You Think)
A/B testing AI features introduces unique challenges that traditional A/B testing doesn't have:
| Challenge | Why It's Hard | Mitigation |
|---|---|---|
| High variance | Same prompt → different outputs → noisy metrics | Larger sample sizes; longer test durations; multiple runs per test case |
| Delayed effects | User trust erodes over days/weeks, not minutes | Run tests for weeks, not days; track retention metrics |
| Multi-dimensional quality | Speed might improve but accuracy drops — net effect unclear | Define a composite metric or hierarchy of metrics before the test |
| User learning | Users adapt their prompts based on model behavior | Segment by user sophistication; analyze prompt evolution |
| Selection bias | Power users engage more with AI features, skewing results | Intent-to-treat analysis; don't just measure among users who tried the feature |
Real-world example: GitHub ran extensive A/B tests when developing Copilot. They couldn't just measure "did the developer accept the suggestion?" — they had to measure whether accepted suggestions actually stayed in the codebase 30 minutes later, whether they introduced bugs, and whether overall developer productivity improved. Their key metric became acceptance rate (% of suggestions users kept), but they validated this correlated with actual productivity gains through longitudinal studies.
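The "larger sample sizes" mitigation can be quantified with the standard two-proportion sample-size formula. This is a rough sketch that ignores the sequential testing and variance-reduction techniques real experimentation platforms apply:

```python
import math

def sample_size_per_arm(p_baseline: float, p_treatment: float,
                        z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Standard two-proportion formula. Defaults: 5% two-sided significance
    (z_alpha=1.96) and 80% power (z_beta=0.84)."""
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_baseline
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a 2-point lift in a 70% task-completion rate takes roughly 27x
# more users per arm than detecting a 10-point lift:
print(sample_size_per_arm(0.70, 0.72))  # ~8,000 per arm
print(sample_size_per_arm(0.70, 0.80))  # ~300 per arm
```

The high output variance of AI features effectively shrinks the detectable effect size, which is why AI A/B tests need more traffic and longer durations than their deterministic counterparts.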
Shadow Deployments
Run the new model alongside the current one in production, but only show users the current model's output. Log the new model's outputs for offline comparison.
When to use:
- Before a major model upgrade (switching from GPT-4 to GPT-4o, or Claude 3 to Claude 3.5)
- When testing a fine-tuned model against the base model
- When the cost of a bad response is high (medical, financial, legal)
How it works:
1. Route every production query to both the current and candidate models
2. Serve only the current model's response to the user
3. Log both responses with the user's query
4. Run automated evaluation and human review on sampled pairs
5. If the candidate wins significantly, roll it forward
Cost consideration: Shadow deployments double your inference costs during the test period. Budget for this.
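The routing logic itself is small. In this sketch, `current_model` and `candidate_model` stand in for your inference clients and the list stands in for your logging pipeline; in production the shadow call should run asynchronously so it never adds user-facing latency:

```python
def handle_query(query: str, current_model, candidate_model,
                 shadow_log: list) -> str:
    """current_model / candidate_model are placeholder callables for your
    inference clients; shadow_log stands in for your logging pipeline."""
    served = current_model(query)
    # In production, run the shadow call asynchronously so it never adds
    # latency to the user-facing path.
    shadow = candidate_model(query)
    shadow_log.append({"query": query, "served": served, "shadow": shadow})
    return served  # the user only ever sees the current model's output

# Usage with stand-in models:
log = []
answer = handle_query("What's our refund policy?",
                      lambda q: "current: " + q,
                      lambda q: "candidate: " + q,
                      log)
```

The logged pairs then feed the offline comparison described in steps 4 and 5.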
Canary Releases
Roll out the new model to a small percentage of traffic (1-5%), monitor closely for quality metrics and error rates, then gradually increase if metrics hold.
Canary checklist for AI model rollouts:
- [ ] Error rate (5xx, timeouts) within 10% of baseline
- [ ] Latency P95 within acceptable range
- [ ] Hallucination rate (via automated checks) not increasing
- [ ] User feedback signals (thumbs down, complaints) not spiking
- [ ] Task completion rate not dropping
- [ ] Cost per query within budget
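A sketch of how this checklist becomes an automated gate. The metric names and tolerances below are illustrative placeholders; tune them to your product:

```python
GATES = {
    # metric: (direction, max allowed relative regression)
    "error_rate":         ("lower_is_better", 0.10),
    "p95_latency_ms":     ("lower_is_better", 0.10),
    "hallucination_rate": ("lower_is_better", 0.0),   # zero tolerance
    "thumbs_down_rate":   ("lower_is_better", 0.10),
    "task_completion":    ("higher_is_better", 0.02),
}

def canary_failures(baseline: dict, candidate: dict) -> list[str]:
    """Return the failed gates (empty list = safe to widen the rollout)."""
    failures = []
    for metric, (direction, tolerance) in GATES.items():
        base, cand = baseline[metric], candidate[metric]
        if direction == "lower_is_better":
            ok = cand <= base * (1 + tolerance)
        else:
            ok = cand >= base * (1 - tolerance)
        if not ok:
            failures.append(metric)
    return failures
```

Wiring this check into the rollout pipeline means a quality regression halts the canary automatically instead of waiting for someone to notice a dashboard.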
4.1.3 Human Evaluation: The Gold Standard
Automated metrics and LLM-as-judge are useful, but human evaluation remains the gold standard for assessing AI quality. Humans assess nuance, context, and cultural appropriateness in ways that no automated metric can.
Building an Annotation Framework
- Define your evaluation dimensions (what matters for your product):
- Accuracy / Factual correctness
- Relevance to the user's question
- Completeness (did it cover everything?)
- Conciseness (did it ramble?)
- Tone / Brand voice alignment
- Safety (any harmful content?)
- Actionability (can the user act on this?)
- Create a detailed rubric with examples for each score level:
| Score | Accuracy Definition | Example |
|---|---|---|
| 5 — Perfect | All facts correct, properly sourced, no hallucination | "The iPhone 15 was announced on September 12, 2023" ✅ |
| 4 — Minor issue | Core facts correct, minor imprecision or missing nuance | "The iPhone 15 was announced in September 2023" (missing exact date) |
| 3 — Partially correct | Mix of correct and incorrect information | "The iPhone 15 was announced in October 2023" (wrong month) |
| 2 — Mostly wrong | Core claim is incorrect but tangentially related | "The iPhone 15 was announced at CES" (wrong event entirely) |
| 1 — Completely wrong | Fabricated or dangerously incorrect | "The iPhone 15 was announced in 2022 by Samsung" |
- Establish inter-rater reliability (IRR): Have multiple evaluators rate the same samples. Calculate agreement metrics:
- Cohen's Kappa (κ): Measures agreement between two raters, correcting for chance. κ > 0.7 is good; κ > 0.8 is excellent.
- Krippendorff's Alpha: Generalizes to multiple raters and various data types. α > 0.667 is acceptable; α > 0.8 is reliable.
- If inter-rater reliability is low, your rubric needs more detail or your raters need more training.
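Cohen's kappa is straightforward to compute directly. A minimal sketch for two raters over the same samples:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters on the same samples, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two raters scoring 10 responses on the 1-5 accuracy rubric:
a = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
b = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]
print(round(cohens_kappa(a, b), 2))  # 0.72 — "good" by the threshold above
```

Libraries like scikit-learn (`cohen_kappa_score`) and `krippendorff` provide production-grade implementations, including weighted variants for ordinal scales like the 1-5 rubric.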
Real-world example: Google's Search Quality Rating Guidelines is a 170+ page document that trains thousands of human evaluators to assess search results. The categories E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) are a rubric. Google applies the same rigor to evaluating AI Overviews — every AI-generated summary in Search goes through human evaluation sampling against detailed rubrics. When AI Overviews launched and generated embarrassing errors (recommending glue on pizza, suggesting eating rocks), it was a failure of evaluation coverage, not evaluation methodology.
4.1.4 Product-Level Metrics for AI
Beyond model quality, you need product metrics that tell you whether the AI feature is delivering business value.
The AI Product Metrics Stack
| Layer | Metric | What It Tells You | Target Range |
|---|---|---|---|
| Model quality | Accuracy / hallucination rate | Is the model's output correct? | Domain-dependent; <5% hallucination for factual |
| User engagement | Task completion rate | Do users finish the AI-assisted workflow? | >70% for productive features |
| User satisfaction | Edit rate / revision rate | How much do users modify the AI's output? | Lower is better; <30% for mature features |
| User satisfaction | Acceptance rate | Do users keep/accept the AI's suggestion? | >25% is strong for code completion (GitHub Copilot) |
| User satisfaction | Thumbs up/down ratio | Explicit quality signal | >4:1 positive-to-negative |
| Business impact | Time-to-completion | Does AI make users faster? | Measurable % reduction vs. without AI |
| Business impact | Cost per interaction | Is the AI economically viable? | Must be below value generated per interaction |
| Trust | Retry / regeneration rate | Users asking for another attempt | <15% indicates good first-attempt quality |
| Trust | Abandonment rate | Users give up on the AI feature | <20% after onboarding |
| Retention | Feature retention (D7, D30) | Do users come back to the AI feature? | Benchmark against non-AI feature retention |
Real-World Metrics Examples
GitHub Copilot:
- Primary metric: Acceptance rate — what % of code suggestions do developers keep? (reported at ~30%)
- Secondary: Persistence rate — of accepted suggestions, what % remains in the codebase after 30 minutes? (validates that accepted ≠ immediately deleted)
- Business metric: Developer productivity — measured as task completion time in controlled studies (reported 55% faster for certain tasks)
Notion AI:
- Engagement metric: Feature activation rate — what % of users try AI features?
- Quality metric: Edit distance — how much do users change the AI-generated text?
- Retention: Repeat usage — do users who try AI once come back to use it again?
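An edit-distance metric like Notion's can be sketched with the standard library. `difflib`'s similarity ratio is a cheap stand-in for true Levenshtein distance, which is fine for trend tracking:

```python
from difflib import SequenceMatcher

def edit_rate(ai_draft: str, final_text: str) -> float:
    """0.0 = user kept the draft as-is; 1.0 = fully rewritten.
    difflib's similarity ratio is a cheap proxy for true edit distance."""
    return 1 - SequenceMatcher(None, ai_draft, final_text).ratio()

print(edit_rate("Ship the feature on Friday.",
                "Ship the feature on Friday."))  # 0.0 — kept verbatim
print(round(edit_rate("Ship the feature on Friday.",
                      "Ship it Monday."), 2))    # heavily revised
```

Aggregating this per feature and per user segment gives you the "<30% for mature features" benchmark from the table above as a trackable number.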
Netflix Recommendations:
- Primary: Take rate — what % of recommendations do users actually watch?
- Quality: Completion rate — of recommended content started, what % is finished?
- Business: Hours of engagement — does better recommendation → more time on platform?
- A Netflix engineering blog post reported that 80% of watched content comes from recommendations, making the recommendation system responsible for the majority of engagement.

Google Search AI Overviews:
- Quality: Accuracy rate — verified via human evaluation sampling
- Engagement: Click-through on citations — do users explore the sources?
- Satisfaction: Search satisfaction surveys — post-search CSAT
- Business: Queries resolved without further searching — whether the AI Overview answered the question sufficiently
4.1.5 Building Your Evaluation Suite: The PM Evaluation Playbook
Every AI feature needs a structured evaluation framework. Here's your complete template:
Step 1: Define What "Good" Looks Like
Before writing a single eval, align your team on quality dimensions and their relative importance.
| Dimension | Weight | Threshold | Measurement Method |
|---|---|---|---|
| Factual accuracy | 35% | >95% of claims correct | LLM-as-judge + human sample |
| Relevance | 25% | Directly answers user question | LLM-as-judge rubric |
| Completeness | 15% | Covers key aspects of the answer | Human evaluation rubric |
| Tone/Brand voice | 10% | Matches brand guidelines | LLM-as-judge with brand prompt |
| Safety | 15% | Zero harmful outputs | Automated classifiers + human audit |
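One way to turn this table into a single number (an illustrative design choice, not a standard) is a weighted average with a floor gate, so that one catastrophic dimension cannot hide behind a good average:

```python
WEIGHTS = {  # from the table above; must sum to 1.0
    "factual_accuracy": 0.35,
    "relevance": 0.25,
    "completeness": 0.15,
    "tone": 0.10,
    "safety": 0.15,
}

def composite_score(scores: dict, hard_floor: float = 0.5) -> float:
    """Weighted average over 0-1 dimension scores, with a gate: if any
    dimension falls below hard_floor, the composite is capped at that
    dimension's score so one disaster can't hide behind a good average."""
    weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    worst = min(scores[d] for d in WEIGHTS)
    return min(weighted, worst) if worst < hard_floor else weighted

# An unsafe response is scored by its worst dimension, not its average:
print(composite_score({"factual_accuracy": 0.9, "relevance": 0.9,
                       "completeness": 0.9, "tone": 0.9, "safety": 0.2}))  # 0.2
```

The gate encodes the "zero harmful outputs" threshold: safety is not a dimension you can trade off against tone.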
Step 2: Build Your Test Dataset
| Dataset Component | Size | Purpose |
|---|---|---|
| Golden test set | 200-500 examples | Curated, human-verified question-answer pairs representing your core use cases. Never train on this. |
| Edge cases | 50-100 examples | Adversarial inputs, ambiguous queries, multi-language, sensitive topics |
| Regression set | Grows over time | Every bug or failure you find in production gets added here |
| User-representative set | 500+ examples | Sampled from real production queries (anonymized) to match actual distribution |
Step 3: Automate Your Eval Pipeline
┌─────────────┐ ┌───────────────┐ ┌──────────────────┐
│ Test Dataset│────▶│ Run model │────▶│ Auto-evaluate │
│ (golden + │ │ inference on │ │ (metrics + │
│ edge cases │ │ each example │ │ LLM-as-judge) │
│ + regression) └───────────────┘ └────────┬─────────┘
└─────────────┘ │
▼
┌─────────────┐ ┌───────────────┐ ┌──────────────────┐
│ Dashboard │◀────│ Human review │◀────│ Flag low-score │
│ + Alerts │ │ on flagged │ │ samples for │
│ │ │ samples │ │ human review │
└─────────────┘ └───────────────┘ └──────────────────┘
Step 4: Set Your Eval Cadence
| Trigger | Action |
|---|---|
| Every model change (prompt, model version, RAG pipeline) | Run full eval suite |
| Weekly | Run eval on a sample of production traffic |
| On quality incident | Add failing case to regression set, run eval |
| Monthly | Full human evaluation on 100+ production samples |
| Quarterly | Re-calibrate LLM-as-judge against fresh human evaluations |
Step 5: Create Your Evaluation Dashboard
Track these metrics over time on a dashboard visible to the whole team:
- Overall quality score (composite of dimensions)
- Quality score by dimension (accuracy, relevance, tone, safety)
- Quality score by query category (simple, complex, sensitive)
- Regression test pass rate
- Production feedback signals (thumbs up/down ratio)
- Cost per interaction trend
PM Action Items — Evaluation
- This week: Identify the AI feature you're responsible for. Can you answer "what is the hallucination rate of this feature?" If not, you don't have adequate evaluation.
- This month: Build a golden test set of 200+ examples from real user queries. Run your current model against it and establish a baseline.
- This quarter: Implement an automated eval pipeline that runs on every model/prompt change. Set up an LLM-as-judge with a rubric aligned to your product's quality dimensions.
4.2 Human Feedback: Turning Users Into Teachers
Every interaction a user has with your AI product generates signal about quality. The art is capturing that signal without destroying the user experience, and converting it into model improvements.
4.2.1 Types of Feedback
Explicit Feedback: Users Tell You Directly
| Feedback Type | UX Pattern | Signal Quality | Collection Rate |
|---|---|---|---|
| Thumbs up/down | Binary buttons below AI response | Low resolution but high volume | 5-15% of interactions |
| Star rating (1-5) | Rating widget | Moderate resolution | 3-8% of interactions |
| Written corrections | "Edit this response" / "This is wrong because..." | Very high resolution | <2% of interactions |
| Category tagging | "What's wrong: Inaccurate / Irrelevant / Offensive / Too long" | Structured + actionable | 3-10% of interactions |
| Preference selection | "Which response is better: A or B?" | Extremely high quality (pairwise) | Requires deliberate UX design |
Real-world example: ChatGPT's feedback system combines thumbs up/down (binary) with an optional text field for detailed feedback. After a thumbs-down, users can select categories ("This is harmful," "This isn't true," "This isn't helpful") and provide free-text explanations. OpenAI uses this data directly to improve models — every piece of feedback enters their improvement pipeline.
Implicit Feedback: Users Show You Indirectly
Implicit feedback is often more honest than explicit feedback because users don't know they're providing it. It's observing behavior rather than asking for opinions.
| Signal | What It Means | Example Application |
|---|---|---|
| Acceptance vs. rejection | User kept or dismissed the AI output | GitHub Copilot tracking suggestion acceptance |
| Edit distance | How much the user changed the AI output | Notion AI measuring how heavily users revise drafts |
| Regeneration | User clicked "regenerate" — signal of dissatisfaction | ChatGPT tracking regen clicks as negative signal |
| Copy/paste | User copied the response — signal of value | Google AI Overviews tracking copy events |
| Session length after AI interaction | User continued working vs. abandoned | Measuring engagement post-AI-feature use |
| Time-to-edit | How quickly user starts modifying AI output | Fast edit = obvious error; slow edit = refinement |
| Upvote + variation selection | User chose to iterate on a specific output | Midjourney tracking which images users upvote and request variations of |
| Scroll depth + hover time | How much of the output users actually read | Long AI responses — did they read to the end? |
| Follow-up queries | "That's wrong" or rephrasing suggests failure | Conversational AI tracking clarification patterns |
Real-world example: Spotify's Discover Weekly is a masterclass in implicit feedback. Spotify doesn't ask you "Did you like this playlist?" — it observes: Did you skip the song? Did you save it? Did you add it to a playlist? Did you listen to the full track or bail at 30 seconds? Did you listen to the entire Discover Weekly or stop early? This behavioral data is far richer than any rating system.
Midjourney's feedback loop: When a user generates four image options and selects one to upscale or request variations on, that's a preference signal — "this one is better than the other three." Midjourney effectively collects millions of pairwise preference judgments per day without ever asking users to rate anything.
4.2.2 Feedback Collection Strategy
Principles
- Minimize friction. Every question you ask costs engagement. Thumbs up/down is milliseconds. A 5-question survey after every interaction will crater usage.
- Ask at the right moment. Request feedback after the user has had time to assess quality, not immediately after generation. For a document draft, ask after they've read it. For a code suggestion, ask after they've tested it.
- Make negative feedback easy. Users are more willing to give feedback when something goes wrong — but only if the mechanism is fast. One tap, not three.
- Rotate detailed asks. Don't ask every user for detailed feedback. Sample 5-10% of interactions for richer feedback collection. Rotate which users see the ask.
- Close the loop. Show users that their feedback matters: "Thanks to your feedback, we've improved X." This increases future feedback rates.
Feedback Funnels
Design your feedback collection as a funnel with progressively richer signals:
All Interactions
│
│ Observe implicit signals (100% of interactions)
│ └─ acceptance, edits, regeneration, session behavior
│
▼
Feedback Prompt (10-15% of interactions)
│
│ Binary signal: 👍 / 👎
│
▼
Follow-up (on 👎 only, ~30% respond)
│
│ Category: "What went wrong?"
│ □ Inaccurate □ Irrelevant □ Too long □ Wrong tone □ Other
│
▼
Detail (optional, ~10% of follow-up)
│
│ Free text: "Tell us more..."
│
▼
Correction (~1-2% of interactions)
│
│ User edits the response to show what it should have been
└─ THIS IS GOLD — direct training signal
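Plugging illustrative rates into the funnel gives expected volumes at each stage, which is useful for deciding whether you will ever collect enough corrections to matter. The negative-share rate below is an assumption not stated in the funnel; replace all of these with your observed numbers:

```python
# Rates loosely based on the funnel above; "negative_share" is an ASSUMED
# value, and every rate here should be replaced with observed data.
RATES = {
    "binary_feedback": 0.10,    # thumbs given (5-15% per the earlier table)
    "negative_share": 0.30,     # ASSUMED fraction of binary feedback that is a thumbs-down
    "category_followup": 0.30,  # ~30% of thumbs-down answer "what went wrong?"
    "free_text_detail": 0.10,   # ~10% of those add free text
    "correction_rate": 0.015,   # 1-2% of ALL interactions edit the response
}

def funnel_volumes(interactions: int) -> dict:
    binary = interactions * RATES["binary_feedback"]
    negative = binary * RATES["negative_share"]
    categorized = negative * RATES["category_followup"]
    detailed = categorized * RATES["free_text_detail"]
    corrections = interactions * RATES["correction_rate"]
    return {stage: round(count) for stage, count in [
        ("binary", binary), ("negative", negative),
        ("categorized", categorized), ("detailed", detailed),
        ("corrections", corrections)]}

# Per million interactions: only ~900 detailed reports, but ~15,000
# correction examples, which is why corrections are the gold signal.
print(funnel_volumes(1_000_000))
```

Running this before you build the UX tells you whether a stage will produce enough volume to be worth the engineering effort.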
Real-world example: Google Search's "Did you find this helpful?" prompt on AI Overviews follows this pattern. It appears selectively, starts with a simple yes/no, and only asks for detail on negative responses. Google calibrates the frequency to avoid survey fatigue.
4.2.3 Feedback Loops: From Signal to Improvement
Collecting feedback is useless unless you have a pipeline that converts it into model improvements.
The Feedback-to-Improvement Pipeline
┌──────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ Collect │───▶│ Aggregate │───▶│ Analyze │───▶│ Act │
│ Feedback │ │ & Store │ │ Patterns │ │ │
│ │ │ │ │ │ │ │
│ • Explicit│ │ • Data │ │ • Failure │ │ • Fix │
│ • Implicit│ │ warehouse│ │ clusters │ │ prompts │
│ • Edits │ │ • Label │ │ • Root │ │ • Update │
│ │ │ quality │ │ cause │ │ RAG data │
│ │ │ • Dedup │ │ analysis │ │ • Fine-tune│
│ │ │ │ │ • Priority │ │ • Retrain │
└──────────┘ └────────────┘ └────────────┘ └────────────┘
Short-loop improvements (days):
- Fix prompt instructions based on failure patterns
- Update RAG knowledge base with correct information
- Add guardrails for recurring failure modes
- Add failing cases to regression eval set

Medium-loop improvements (weeks):
- Fine-tune model on corrected examples
- Retrain embeddings for RAG retrieval quality
- Build new features to address systematic gaps

Long-loop improvements (months):
- Inform RLHF/DPO training for next model version
- Shape product strategy based on what users actually need
- Identify entirely new capabilities to build
4.2.4 Privacy and Ethics in Feedback Collection
| Concern | Risk | Mitigation |
|---|---|---|
| PII in feedback | Users may include personal data in corrections | PII detection and scrubbing before storage; don't log raw user text without consent |
| Consent | Using conversations for training without permission | Explicit opt-in/opt-out in settings; clear privacy policy (OpenAI and Google both faced backlash here) |
| Bias amplification | Feedback from a non-representative user base biases the model | Monitor demographic distribution of feedback; active sampling from underrepresented groups |
| Regulatory compliance | GDPR right to deletion; CCPA data access requests | Ability to delete specific user data from training pipelines; data retention policies |
| Manipulation | Adversaries submitting feedback to bias the model | Anomaly detection on feedback patterns; rate limiting; trusted rater programs |
Real-world example: OpenAI's ChatGPT settings allow users to opt out of having their conversations used for training. However, training on conversations is enabled by default (users must actively opt out), which has drawn scrutiny. When Italy temporarily banned ChatGPT in 2023, data privacy in feedback collection was a central concern. Apple's approach with Apple Intelligence emphasizes on-device processing and differentially private feedback — a competitive differentiator for privacy-conscious users.
PM Action Items — Human Feedback
- Audit your current feedback collection. What explicit and implicit signals are you capturing today? Map them on the feedback types table above. Identify at least two implicit signals you're not tracking but should be.
- Design a feedback funnel. Sketch the funnel from 100% implicit observation → binary prompt → category → detail → correction. Calculate expected volumes at each stage.
- Close one feedback loop this quarter. Take the top failure category from user feedback, fix it (prompt change, RAG update, or guardrail), and measure the before/after improvement.
4.3 Fine-Tuning: When Prompting and RAG Aren't Enough
4.3.1 The Decision Framework: Prompting vs. RAG vs. Fine-Tuning
This is one of the most important decisions an AI PM makes. Choosing wrong wastes months and hundreds of thousands of dollars. Choosing right can be a competitive advantage.
| Approach | What It Does | Effort | Cost | Best For | Limitations |
|---|---|---|---|---|---|
| Prompt Engineering | Add instructions to the input to shape behavior | Hours-days | ~$0 (per-query costs only) | Formatting, tone, persona, simple rules, behavior shaping | Limited by context window; fragile; can't teach new knowledge |
| RAG | Inject retrieved knowledge into the prompt | Days-weeks | $1K-50K (infrastructure) | Factual grounding, proprietary knowledge, real-time data, long-tail content | Retrieval quality is a ceiling; can't change model behavior or style |
| Fine-Tuning | Retrain model weights on your data | Weeks-months | $10K-500K+ (data + compute + expertise) | Domain-specific style/format, specialized terminology, consistent behavior, reducing prompt size | Requires curated data; risk of regressions; ongoing maintenance |
The Decision Tree
Does the model need new KNOWLEDGE it doesn't have?
├── YES → RAG (retrieval-augmented generation)
│ Inject knowledge at inference time.
│ Don't bake it into model weights.
│
└── NO → Does the model need to BEHAVE differently?
├── Can you describe the behavior in a prompt?
│ ├── YES, and it works → Prompt Engineering ✅
│ ├── YES, but the prompt is >2000 tokens → Fine-tuning
│ │ (your instructions are so long they eat context window)
│ └── NO, you can't articulate the rules → Fine-tuning
│ (the behavior is "know it when you see it")
│
└── Does the model need a specific FORMAT/STYLE consistently?
├── Prompt handles it reliably → Prompt Engineering ✅
└── Prompt is unreliable/inconsistent → Fine-tuning
(e.g., always output JSON in a specific schema,
match brand voice across 100% of outputs,
use domain-specific terminology correctly)
The golden rule: Start with prompting. Add RAG for knowledge. Fine-tune only when you've exhausted both and can prove the gap with evals.
Real-world example: Duolingo started with prompt engineering for GPT-4 to power their "Explain My Answer" feature. When they needed the model to consistently match Duolingo's pedagogical style — encouraging, specific, concise, at the right difficulty level — plain prompting was inconsistent. They fine-tuned on thousands of expert-written explanations to achieve the consistency their educational product required.
4.3.2 Types of Fine-Tuning
Full Fine-Tuning
What it is: Update all model parameters on your dataset. The entire model's weights change.
When to use: When you have a very large, high-quality dataset (100K+ examples) and need fundamental behavior changes. Rarely used by product teams — mostly by model providers themselves.
Cost: Extremely high. Full fine-tuning of a 70B parameter model requires multiple A100/H100 GPUs and can cost $50K-$500K+ in compute.
LoRA / QLoRA (Parameter-Efficient Fine-Tuning)
What it is: Instead of updating all parameters, LoRA (Low-Rank Adaptation) adds small trainable matrices alongside the frozen model weights. Only these small matrices are updated. QLoRA adds quantization to reduce memory requirements further.
Analogy: Full fine-tuning is like rewriting an entire textbook. LoRA is like adding sticky notes throughout it — the original text stays intact, but the sticky notes modify how you read and apply it.
When to use: Most fine-tuning use cases for product teams. Achieves 90-95% of full fine-tuning quality at 5-10% of the cost.
Cost: $100-$10K in compute for a 7B-70B parameter model. OpenAI's fine-tuning API charges ~$3/1M training tokens for GPT-4o-mini.
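The "<1% of params" efficiency is easy to verify with arithmetic. For each adapted weight matrix of shape (d_out, d_in), LoRA trains r × (d_in + d_out) parameters instead of d_out × d_in. A back-of-envelope sketch with illustrative 7B-class dimensions (not any specific model):

```python
def lora_fraction(d_model: int = 4096, rank: int = 16) -> float:
    """Fraction of a d_model x d_model projection that LoRA actually trains,
    assuming square projections (a simplification). Per-layer and per-matrix
    counts cancel out, so only the matrix shape and rank matter."""
    full = d_model * d_model           # frozen weight matrix
    lora = rank * (d_model + d_model)  # two low-rank adapters, A and B
    return lora / full

# Illustrative 7B-class width (d_model=4096) at a common rank of 16:
print(f"{lora_fraction():.3%}")  # 0.781%, i.e. well under 1% of the weights
```

The trainable fraction is 2r/d_model, so doubling the rank doubles the adapter size but it stays tiny, which is what makes the 5-10% cost figure plausible.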
Instruction Tuning
What it is: Fine-tuning specifically on (instruction, response) pairs to make the model better at following directions. This is what transforms a base model into a chatbot.
When to use: When you want the model to follow a specific type of instruction that it currently handles poorly — e.g., "Always respond in bullet points," "Always include a disclaimer for medical content," "Never mention competitors."
Comparison:
| Method | Parameters Updated | Data Needed | Compute Cost | Quality vs. Full | Use Case |
|---|---|---|---|---|---|
| Full Fine-Tuning | All (billions) | 100K+ examples | $50K-500K+ | 100% | Model provider-level retraining |
| LoRA/QLoRA | <1% of params | 1K-50K examples | $100-10K | 90-95% | Domain adaptation, style matching |
| Instruction Tuning | Varies (often LoRA) | 500-10K examples | $100-5K | Depends on task | Behavior and format shaping |
4.3.3 Data Requirements for Fine-Tuning
The most common failure mode in fine-tuning is bad data, not bad models.
How Much Data Do You Need?
| Fine-Tuning Goal | Minimum Examples | Ideal Examples | Data Quality Bar |
|---|---|---|---|
| Format/structure consistency | 50-200 | 500-1K | Perfectly formatted examples; zero errors |
| Domain terminology/style | 500-2K | 5K-10K | Expert-written; consistent style |
| New capability (e.g., classification) | 1K-5K | 10K-50K | Labeled by domain experts; balanced classes |
| Fundamental behavior change | 10K-50K | 100K+ | High-quality, diverse, representative |
Data Quality Checklist
- [ ] Consistency: All examples follow the same format and style
- [ ] Correctness: Every example's output is factually correct and well-written
- [ ] Diversity: Examples cover the full range of inputs the model will see in production
- [ ] Balance: No skew toward one category or type of query
- [ ] Deduplication: No near-duplicate examples that cause overfitting
- [ ] Negative examples: Include examples of what not to do (with correct alternatives)
- [ ] Expert review: Domain experts have validated at least a sample of the training data
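The deduplication check in particular is easy to automate. A hypothetical sketch (the threshold, normalization rule, and example strings are all illustrative assumptions): normalize each example and flag pairs whose token-set Jaccard overlap is suspiciously high for manual review.

```python
import re
from itertools import combinations

# Hypothetical near-duplicate check for a fine-tuning dataset.
# Normalizes text, then compares token-set Jaccard similarity;
# pairs above the threshold are flagged for manual review.

def normalize(text: str) -> set[str]:
    """Lowercase, strip punctuation, return the set of tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def near_duplicates(examples: list[str], threshold: float = 0.9):
    """Return (i, j, score) for pairs whose token overlap >= threshold."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(examples), 2):
        ta, tb = normalize(a), normalize(b)
        jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
        if jaccard >= threshold:
            flagged.append((i, j, round(jaccard, 2)))
    return flagged

dataset = [
    "Summarize the attached contract in plain English.",
    "Summarize the attached contract in plain english.",   # near-duplicate
    "Classify this support ticket as billing, shipping, or returns.",
]
print(near_duplicates(dataset))
```

In production you would use embedding similarity or MinHash rather than exact token overlap, but even this crude check catches the copy-paste duplicates that cause overfitting.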
Cost of curation: Assume $5-$20 per high-quality training example (expert time to write, review, or validate). A 5,000-example dataset costs $25K-$100K in human effort alone — often more than the compute cost of fine-tuning itself.
4.3.4 Cost and Infrastructure Considerations
| Cost Component | Range | Notes |
|---|---|---|
| Data curation | $25K-$500K | Scales with dataset size; domain expertise drives cost |
| Compute (training) | $100-$500K | LoRA on 7B model: ~$100-500; Full on 70B: $50K-500K |
| Compute (inference) | Variable | Fine-tuned model may need dedicated hosting vs. shared API |
| Evaluation | $5K-$50K | Eval suite development + human evaluation runs |
| Iteration | 2-5x of single run | You almost never get it right on the first attempt |
| Ongoing maintenance | Periodic | Model drift; base model updates require re-fine-tuning |
API provider fine-tuning (simpler, less control):

| Provider | Model | Training Cost | Hosting |
|---|---|---|---|
| OpenAI | GPT-4o-mini | ~$3/1M training tokens | Served via OpenAI API (higher per-token cost than base) |
| OpenAI | GPT-4o | ~$25/1M training tokens | Served via OpenAI API |
| Google | Gemini | Tuning API available | Served via Vertex AI |
| Anthropic | Claude | Not publicly available for fine-tuning | — |
Self-hosted fine-tuning (more control, more work):

- Requires ML engineering expertise
- Typical stack: Hugging Face Transformers, PyTorch, DeepSpeed/FSDP
- Infrastructure: GPU cluster (A100/H100) via cloud (AWS, GCP, Azure) or on-premises
- Open-source models only (Llama, Mistral, Qwen)
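If you go the API-provider route, training data is typically submitted as JSONL: one chat transcript per line. A minimal sketch of preparing examples in OpenAI's documented fine-tuning chat format (the conversation content here is invented for illustration):

```python
import json

# Sketch of preparing training data in the JSONL chat format that
# OpenAI's fine-tuning API expects: one {"messages": [...]} object per
# line. The example content is illustrative, not real training data.

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support assistant for Acme Shoes."},
            {"role": "user", "content": "Where is my order?"},
            {"role": "assistant", "content": "I can help with that. Could you share your order number?"},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a support assistant for Acme Shoes."},
            {"role": "user", "content": "Do you have the Trailblazer in blue?"},
            {"role": "assistant", "content": "Let me check availability. Which size are you looking for?"},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line parses and carries the required key
with open("train.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(f"Wrote {len(rows)} training examples")
```

Note that each example embeds the same system message your production app will use; training and serving prompts should match, or the fine-tune won't transfer.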
4.3.5 Risks of Fine-Tuning
| Risk | What Happens | How to Detect | How to Mitigate |
|---|---|---|---|
| Catastrophic forgetting | Model loses general capabilities while learning your domain | Run general-capability benchmarks before and after | Mix general data into training set (10-20%); use LoRA instead of full fine-tuning |
| Overfitting | Model memorizes training data; performs great on training set, poorly on new inputs | Hold out a validation set; monitor training vs. validation loss | Early stopping; data augmentation; regularization |
| Safety regression | Fine-tuning overrides safety guardrails baked in during RLHF | Run safety evals (toxicity, refusal tests) before and after | Include safety examples in training data; test extensively |
| Bias introduction | Training data has demographic or topical biases | Bias audits on training data and model outputs | Diverse, representative training data; bias-specific evals |
| Distribution shift | Model works on training distribution but fails on real production queries | Compare training data distribution vs. production query distribution | Ensure training data matches production distribution |
Real-world cautionary tale: shortly after launch, Microsoft's Bing Chat (now Copilot) produced aggressive, erratic responses in extended conversations, the widely reported "Sydney" incident. Whatever the precise cause, it showed how tuning, prompting, and base model behavior can interact unpredictably. Safety evaluation must cover edge cases that training data doesn't: adversarial prompts, novel situations, emotional manipulation.
4.3.6 Real-World Fine-Tuning Examples
BloombergGPT: Bloomberg trained a 50B-parameter model on a mix of financial data (363B tokens drawn from Bloomberg's archives) and general text. The result: a model that outperformed comparably sized general-purpose models on financial NLP tasks (sentiment analysis, named entity recognition, news classification) while maintaining general capabilities. This was a from-scratch pre-training effort rather than a fine-tune, justifiable only because Bloomberg has proprietary financial data that no public model has ever seen.
Shopify: Shopify has fine-tuned models for commerce-specific tasks — product description generation that matches merchant brand voice, customer query classification that understands e-commerce-specific intents ("where's my order" vs. "I want to return this" vs. "do you have this in blue"), and recommendation language that drives conversion. The key insight: commerce language is different enough from general text that prompting alone left significant quality gaps.
Healthcare: Companies like Hippocratic AI and Google's Med-PaLM 2 fine-tune on medical data to achieve clinical-grade accuracy. Med-PaLM 2 achieved 86.5% on MedQA (USMLE-style questions), approaching expert physician performance. Healthcare fine-tuning requires extensive safety evaluation — a model that's 95% accurate on medical questions but 5% dangerously wrong is worse than no model at all.
Duolingo: Worked with OpenAI to adapt GPT models using thousands of expert-written language explanations to match their pedagogical approach. Result: consistent, encouraging, appropriately leveled explanations that felt like a Duolingo teacher, not a generic chatbot. This was a case where style and pedagogy couldn't be captured in a prompt alone.
PM Action Items — Fine-Tuning
- Apply the decision tree. For your current AI feature, walk through the Prompting vs. RAG vs. Fine-Tuning decision tree above. Document where you land and why. If you can't articulate why fine-tuning is needed, you probably don't need it.
- If fine-tuning is warranted: Estimate your total cost (data curation + compute + evaluation + 3x for iteration). Present a business case: what metric improvement justifies this investment?
- Start with LoRA. If you're fine-tuning for the first time, use LoRA/QLoRA or a provider's fine-tuning API. Don't start with full fine-tuning on open-source models unless you have a strong ML team.
4.4 Learning: RLHF and Beyond
4.4.1 Reinforcement Learning from Human Feedback (RLHF)
RLHF is the technique that transformed GPT-3 (impressive but chaotic) into ChatGPT (useful and aligned). It's how model providers ensure that models are not just capable but helpful, harmless, and honest.
The RLHF Pipeline (Explained for PMs)
┌─────────────────────────────────────────────────────────────────┐
│ THE RLHF PIPELINE │
│ │
│ STEP 1: Supervised Fine-Tuning (SFT) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Train on (prompt, ideal_response) pairs │ │
│ │ written by human experts │ │
│ │ → Produces an SFT model that can follow instructions │ │
│ └──────────────────────────────────┬───────────────────┘ │
│ │ │
│ STEP 2: Reward Model Training │ │
│ ┌──────────────────────────────────▼───────────────────┐ │
│ │ Generate multiple responses to same prompt │ │
│ │ Human raters rank responses (A > B > C) │ │
│ │ Train a reward model to predict human preferences │ │
│ │ → Produces a model that scores "how good is this │ │
│ │ response?" on a scale │ │
│ └──────────────────────────────────┬───────────────────┘ │
│ │ │
│ STEP 3: RL Optimization (PPO) │ │
│ ┌──────────────────────────────────▼───────────────────┐ │
│ │ SFT model generates responses │ │
│ │ Reward model scores them │ │
│ │ Use Proximal Policy Optimization (PPO) to update │ │
│ │ the model to produce higher-scoring responses │ │
│ │ → Produces a model aligned with human preferences │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Analogy: Imagine training a new customer support agent.

- Step 1 (SFT): You give them a manual of ideal responses. They learn to mimic good answers.
- Step 2 (Reward Model): Experienced managers compare the trainee's responses side by side: "This response is better than that one." Over time, you build an understanding of what "good" looks like.
- Step 3 (RL): The trainee practices answering questions, gets scored against the managers' criteria, and adjusts their approach to consistently produce higher-rated responses.
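Step 2's reward model is typically trained with a pairwise preference loss: for every human judgment "response A beats response B," push the model's score for A above its score for B. A minimal numeric sketch (the scores are illustrative, and real training backpropagates through a neural reward model rather than using fixed numbers):

```python
import math

# Minimal sketch of the pairwise preference loss used in reward model
# training (Step 2 above). For each comparison where humans preferred
# one response over another, the loss is -log(sigmoid(r_chosen - r_rejected)):
# low when the model already scores the chosen response higher.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# Illustrative reward-model scores for three human comparisons
comparisons = [
    (2.1, 0.3),   # model agrees with humans by a wide margin -> low loss
    (1.0, 0.8),   # model barely agrees -> moderate loss
    (0.2, 1.5),   # model disagrees with humans -> high loss
]

for chosen, rejected in comparisons:
    print(f"r_chosen={chosen:+.1f} r_rejected={rejected:+.1f} "
          f"loss={preference_loss(chosen, rejected):.3f}")
```

The key PM takeaway: the reward model never sees absolute scores from humans, only rankings — which is why collecting comparison data ("which response is better?") is cheaper and more reliable than asking raters for numeric grades.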
Why RLHF Matters for PMs
- It's why ChatGPT feels different from a raw language model. Without RLHF, GPT-4 would be technically capable but chaotic — sometimes helpful, sometimes harmful, sometimes irrelevant. RLHF tunes the model to be consistently helpful.
- It shapes the personality. The "voice" of ChatGPT (helpful, balanced, slightly cautious), Claude (thoughtful, careful, honest), and Gemini (concise, Google-integrated) comes substantially from RLHF decisions.
- It sets safety boundaries. RLHF is how models learn to refuse harmful requests while remaining helpful for legitimate ones. The balance is a product decision.
Real-world example: OpenAI's InstructGPT work, the direct precursor to ChatGPT, used a team of roughly 40 contractors who ranked model outputs on helpfulness, harmlessness, and honesty, generating on the order of 33K comparisons for reward model training. The headline result: a 1.3B parameter model with RLHF was preferred by raters over the 175B parameter GPT-3 without it — alignment beats raw capability.
4.4.2 Constitutional AI (Anthropic's Approach)
Anthropic introduced Constitutional AI (CAI) as an alternative to pure RLHF. Instead of relying solely on human raters to define good behavior, CAI defines a set of principles (a "constitution") that the model uses to self-evaluate and self-correct.
How Constitutional AI Works
STEP 1: Generate + Self-Critique
┌──────────────────────────────────────────────────┐
│ Prompt: "How do I pick a lock?" │
│ Initial response: [provides lock-picking guide] │
│ │
│ Constitution principle: "Responses should not │
│ help people engage in illegal activities." │
│ │
│ AI self-critique: "My response could help someone │
│ break into homes. Let me revise." │
│ │
│ Revised response: "I can't help with lock-picking │
│ that might be used for illegal entry. If you're │
│ locked out, contact a licensed locksmith." │
└──────────────────────────────────────────────────┘
STEP 2: RLAIF (Reinforcement Learning from AI Feedback)
┌──────────────────────────────────────────────────┐
│ Instead of human raters ranking outputs, │
│ AI evaluates outputs against the constitution. │
│ Train reward model on AI-generated preferences. │
│ Apply RL optimization using AI-judged rewards. │
└──────────────────────────────────────────────────┘
Why it matters for PMs:

- Scalability: Human raters are expensive and slow. AI self-critique scales with compute rather than headcount.
- Consistency: A constitution provides consistent, auditable rules. Human raters have varying interpretations.
- Transparency: You can read the constitution. You can't read a reward model's internal state.
- Limitations: The AI's self-judgment is imperfect — it might miss things humans would catch, or over-refuse.
Real-world example: Claude's character traits — being helpful, harmless, and honest — are substantially shaped by Constitutional AI. Anthropic's constitution includes principles like "Choose the response that is least likely to be used for illegal activities" and "Choose the response that sounds most similar to what a thoughtful, senior person at Anthropic would say." This is why Claude has a distinct "personality" that differs from ChatGPT — it's a different constitutional foundation.
4.4.3 Direct Preference Optimization (DPO)
DPO is a simpler alternative to RLHF that's gained significant adoption since its 2023 introduction.
RLHF vs. DPO
| Aspect | RLHF | DPO |
|---|---|---|
| Pipeline | 3 steps: SFT → Reward Model → RL (PPO) | 1 step: Directly optimize on preference pairs |
| Reward model | Required — adds complexity and cost | Not required — preferences are built directly into loss function |
| Stability | PPO training is notoriously unstable, sensitive to hyperparameters | More stable; standard supervised learning optimization |
| Compute cost | High (training two models + RL optimization) | Moderate (single training pass) |
| Quality | Slightly better on some benchmarks (more degrees of freedom) | Comparable on most tasks; sometimes slightly worse on edge cases |
| Complexity | Requires RL expertise; many moving parts | Standard ML training; much easier to implement |
| Adoption | OpenAI's ChatGPT, Google's Gemini | Llama 3, Zephyr, many open-source models |
Analogy comparison:

- RLHF: Hire a food critic (the reward model), have them taste every dish, then use their feedback to train the chef (PPO). Complex, but the critic adds nuance.
- DPO: Show the chef pairs of dishes and tell them which is better. The chef learns directly from comparisons. Simpler, but no intermediary critic insight.
PM Implication: If you're fine-tuning a model and want to align it with user preferences, DPO is the practical choice for most product teams. It requires the same preference data (pairs of responses where one is better) but avoids the engineering complexity of training a separate reward model and running RL optimization.
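For the technically curious, the DPO objective is compact enough to sketch numerically. Given log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model, DPO folds the preference directly into a single loss (the log-prob values below are illustrative assumptions):

```python
import math

# Minimal numeric sketch of the DPO loss. pi_* are log-probs under the
# policy being trained; ref_* are log-probs under a frozen reference model.
# DPO minimizes -log(sigmoid(beta * margin)), where the margin is how much
# more the policy (vs. the reference) favors the chosen response over the
# rejected one. No separate reward model is ever trained.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))

# Illustrative values: the policy already favors the chosen response more
# than the reference does, so the loss falls below -log(0.5) ~ 0.693
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-15.0)
print(f"DPO loss: {loss:.3f}")
```

This is why the table above calls DPO "standard supervised learning": the whole pipeline is one differentiable loss over preference pairs, with beta controlling how far the policy may drift from the reference model.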
4.4.4 RLAIF (Reinforcement Learning from AI Feedback)
RLAIF replaces human raters with AI evaluators. Instead of having humans compare outputs, a powerful AI model evaluates and ranks them.
| Aspect | RLHF | RLAIF |
|---|---|---|
| Feedback source | Human raters | AI model (e.g., GPT-4 or Claude evaluating a smaller model) |
| Cost per comparison | $1-5 (human labor) | $0.001-0.01 (API call) |
| Scale | Thousands to tens of thousands of comparisons | Millions of comparisons feasible |
| Quality | Gold standard — humans catch nuances AI misses | Good but limited by AI evaluator's own capabilities |
| Bias | Human biases (cultural, political, demographic) | AI biases (training data biases, sycophancy) |
| Speed | Weeks to months for a dataset | Hours to days |
The practical synthesis: Most frontier model providers now use a hybrid: RLHF for high-stakes alignment decisions (safety, ethics, controversy) and RLAIF for scaling up preference data on more routine quality dimensions (helpfulness, clarity, formatting).
Comparison Summary
| Method | Feedback Source | Complexity | Cost | Best For |
|---|---|---|---|---|
| RLHF | Human raters | Very high (RL pipeline) | Very high | Frontier model alignment; safety-critical applications |
| DPO | Human raters | Moderate (supervised learning) | Moderate | Most fine-tuning + alignment tasks; open-source models |
| CAI | AI self-critique + principles | High (constitution design + RLAIF) | Moderate | Safety and ethics alignment at scale |
| RLAIF | AI evaluator | Moderate-high | Low (compute only) | Scaling preference data; routine quality improvements |
4.4.5 The Flywheel Effect: How AI Products Get Better Over Time
The most powerful concept in AI product management is the data flywheel — a self-reinforcing cycle where more users create more data, which improves the model, which attracts more users.
┌─────────────────┐
│ MORE USERS │
└────────┬────────┘
│
▼
┌─────────────────┐
│ MORE DATA │
│ (interactions, │
│ feedback, │
│ corrections) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ BETTER MODEL │
│ (fine-tuning, │
│ RLHF, RAG │
│ improvements) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ BETTER PRODUCT │
│ (higher quality, │
│ more trust, │
│ more features) │
└────────┬────────┘
│
└───────────▶ MORE USERS (cycle repeats)
Real-World Flywheel Examples
TikTok's Recommendation Engine:

- User watches videos → TikTok observes what holds attention, what gets skipped, what gets replayed, what gets shared
- This data trains the recommendation model (every swipe is a preference signal)
- Better recommendations → users spend more time → more data → even better recommendations
- The flywheel is reportedly so powerful that a new user gets noticeably personalized recommendations within about 30 minutes of usage
Netflix:

- 230M+ subscribers generating billions of viewing signals daily
- Every play, pause, rewind, abandon, and rating feeds the recommendation system
- Better recommendations → higher engagement → lower churn → more subscribers → more data
- Netflix has estimated its recommendation system is worth $1B+ annually in reduced churn
Spotify Discover Weekly:

- Launched in 2015; now serves 100M+ users personalized playlists every Monday
- Combines collaborative filtering ("users like you also listened to…") with content analysis (audio features, lyrics, artist networks)
- Every save, skip, and completion feeds back into the next week's playlist
- The playlist gets better for each user over time — a true personal flywheel
ChatGPT:

- 100M+ weekly active users generating conversations
- Thumbs up/down, corrections, and usage patterns feed into future RLHF training
- Better model → more users → more feedback → better model
- OpenAI has more human preference data than any competitor, creating a defensible moat
Building Your Own Flywheel
As a PM, ask yourself these questions:

1. What data does every user interaction generate? (Explicit + implicit signals)
2. How does that data connect to model improvement? (Is there a pipeline to convert user signals into training data or RAG improvements?)
3. What's the cycle time? (How quickly can you go from user signal → model improvement → better user experience?)
4. What's the competitive moat? (Is your data unique? Or could a competitor build the same flywheel with public data?)
4.4.6 Continuous Learning in Production
Most production AI systems don't retrain the foundation model on every piece of feedback. Instead, they use a layered approach:
| Learning Speed | Mechanism | Cycle Time | Example |
|---|---|---|---|
| Real-time | RAG knowledge base updates | Minutes-hours | Update product catalog; add new FAQ answers |
| Fast | Prompt/system instruction updates | Hours-days | Fix recurring failure patterns via prompt engineering |
| Medium | Fine-tuning iterations | Weeks | Monthly fine-tuning run on accumulated user corrections |
| Slow | RLHF/DPO on accumulated preferences | Months | Quarterly alignment update based on aggregate user preferences |
| Very slow | Base model retraining | 6-12 months | Model provider releases a new version (GPT-4 → GPT-4o) |
The art of AI product management is using all five speeds simultaneously: fixing urgent issues with prompt updates, building medium-term improvements with fine-tuning, and shaping long-term model direction with feedback data.
PM Action Items — Learning
- Map your flywheel. Draw the data flywheel for your AI product. Where does user data enter? How does it flow to model improvement? What's the cycle time? Where are the gaps?
- Identify your learning speed. Which of the five learning speeds are you using today? Most teams only use prompt updates (fast). Identify one medium-speed mechanism you could add.
- Quantify your data moat. How much unique user interaction data have you accumulated? How would this change model quality if used for fine-tuning or RLHF? Is this data a competitive advantage?
4.5 Putting It All Together: Cost / Effort / Impact Analysis
| Improvement Method | Effort (Team-Weeks) | Cost | Time to Impact | Impact Magnitude | Risk |
|---|---|---|---|---|---|
| Better prompts | 0.5-2 weeks | ~$0 | Days | Low-Medium | Very Low |
| Evaluation suite | 2-4 weeks | $5K-50K | Weeks (enables all other improvements) | High (foundational) | Low |
| Feedback collection | 2-3 weeks | $5K-20K | Weeks to months | Medium-High | Low |
| RAG improvements | 2-6 weeks | $10K-100K | Weeks | Medium-High | Low |
| Fine-tuning (LoRA) | 4-8 weeks | $25K-200K | Months | High | Medium |
| Full fine-tuning | 8-16 weeks | $100K-1M+ | Months | Very High | High |
| RLHF/DPO alignment | 12-24 weeks | $200K-2M+ | Quarters | Very High | High |
Recommended sequencing for most products:
1. BUILD EVAL SUITE (You can't improve what you can't measure)
│
▼
2. OPTIMIZE PROMPTS (Cheapest, fastest improvement)
│
▼
3. ADD FEEDBACK LOOPS (Start collecting improvement signal)
│
▼
4. IMPROVE RAG (Ground responses in better knowledge)
│
▼
5. FINE-TUNE (LoRA) (When prompts + RAG plateau)
│
▼
6. RLHF/DPO (When you have enough preference data)
4.6 Exercises
Exercise 1: Build an Eval Rubric
Choose an AI-powered feature from a product you use (e.g., ChatGPT, Google AI Overviews, Notion AI, GitHub Copilot). Create:

1. A 5-point rubric for evaluating output quality, with three dimensions (accuracy, helpfulness, safety)
2. Two example outputs for each score level
3. A proposal for how you'd measure inter-rater reliability
Exercise 2: Feedback System Design
Design the complete feedback collection system for an AI-powered customer support chatbot for an e-commerce company. Specify:

1. What explicit feedback mechanisms you'd implement (with mockup descriptions)
2. What implicit signals you'd track
3. Your feedback funnel with expected collection rates at each stage
4. How you'd convert feedback into model improvement (short-loop, medium-loop, long-loop)
5. Your privacy and consent approach
Exercise 3: Prompting vs. RAG vs. Fine-Tuning Decision
For each scenario, decide whether you'd use prompting, RAG, fine-tuning, or a combination. Justify your answer using the decision tree.
- (A) A legal AI assistant that needs to reference a firm's 50,000 case files when answering attorney questions
- (B) A marketing copy generator that needs to consistently match your brand's distinctive casual-yet-authoritative voice
- (C) A medical Q&A system that must always include FDA-required disclaimers in specific formatting
- (D) A real-time stock analysis tool that needs current market data and company filings
- (E) A children's educational app that needs to explain concepts at exactly a 3rd-grade reading level, consistently
Exercise 4: Flywheel Design
Choose one of these products and design the complete data flywheel:

- (A) An AI-powered recipe recommendation app
- (B) An AI writing assistant for sales emails
- (C) An AI-powered fitness coaching app
For your chosen product:

1. Map every user interaction that generates useful signal
2. Classify each signal as explicit or implicit feedback
3. Design the pipeline from signal → model improvement
4. Estimate cycle time for each learning speed (real-time through very slow)
5. Identify where the competitive moat builds over time
Exercise 5: Improvement Prioritization
Your AI customer support bot has these problems:

- 15% hallucination rate on product specifications
- Users report the tone feels "robotic" (30% negative tone feedback)
- 25% of queries are about new products not in the knowledge base
- The model occasionally generates responses in the wrong language for multilingual users
- Average response latency is 4 seconds (target: 2 seconds)
Using the cost/effort/impact table, prioritize these five problems. For each, specify which improvement method you'd use, estimate cost and timeline, and justify your sequencing.
4.7 Discussion Questions
- The Evaluation Paradox: Your LLM-as-judge evaluation system uses GPT-4 to score your product's GPT-4o-mini outputs. The scores consistently look good. But users are still complaining. What could be happening? How would you diagnose this? At what point should you invest in human evaluation, and how much should you spend?
- The Feedback Cold Start: You're launching a new AI feature. You have zero user feedback data. You can't fine-tune or run RLHF without preference data. How do you bootstrap the feedback flywheel? What proxy signals can you use before you have real user data?
- Fine-Tuning vs. Prompt Engineering ROI: Your team spent 6 weeks and $150K fine-tuning a model for your domain. A new team member spends 2 days rewriting the system prompt and gets 80% of the fine-tuning benefit. Was the fine-tuning a waste? How do you prevent this from happening? When is fine-tuning truly justified?
- The Alignment Tax: Your RLHF-aligned model refuses 12% of user queries due to over-cautious safety guardrails. These refusals frustrate users and drive them to competitors with looser guardrails. How do you balance safety and usefulness? Who makes the call on where the line is?
- Data Flywheel Competition: Your competitor launched 6 months before you and has 10x your user data. Their flywheel is spinning faster. Can you catch up? What strategies could accelerate your flywheel? Is there a point where data advantage becomes insurmountable?
- Continuous Learning Ethics: Your AI writing assistant learns from user edits. A small group of users consistently edits outputs to include biased or harmful language. How do you prevent the model from learning bad behavior from bad actors? What safeguards should be in the feedback-to-training pipeline?
4.8 Key Takeaways
- Evaluation is foundational — build it first. You cannot improve what you cannot measure. Build a custom eval suite from your own product data, combining automated metrics (for speed), LLM-as-judge (for scale), and human evaluation (for ground truth). Run evals on every change. This is non-negotiable.
- Feedback is your most valuable asset — collect it relentlessly and responsibly. Every user interaction generates signal. Design for implicit signals at 100% coverage and explicit signals at low friction. The combination of behavioral data (what users do) and stated preferences (what users say) gives you the richest improvement signal. But respect privacy — consent, anonymization, and deletion rights are not optional.
- Fine-tuning is a power tool, not a first resort. Start with prompt engineering (free, fast, reversible). Add RAG for knowledge gaps (moderate cost, high impact). Fine-tune only when you've hit the ceiling of both and can prove the gap with evals. When you do fine-tune, LoRA/QLoRA delivers 90%+ of the benefit at a fraction of full fine-tuning cost. Data quality matters more than data quantity.
- RLHF creates alignment; DPO makes it accessible. RLHF is how frontier models become useful and safe, but it's complex and expensive. DPO offers a simpler path for product teams that need preference alignment without RL complexity. RLAIF scales feedback generation but requires careful validation. Choose based on your team's capabilities and your quality bar.
- The data flywheel is the ultimate moat. More users → more data → better model → better product → more users. Design this flywheel intentionally from day one. Every feature should generate improvement signal. Your competitive advantage isn't the model (everyone uses the same ones) — it's your accumulated, proprietary data and the speed of your improvement cycle.
- Use all five learning speeds. Real-time RAG updates for immediate fixes, prompt engineering for fast improvements, fine-tuning for medium-term quality gains, RLHF/DPO for long-term alignment, and base model upgrades for generational leaps. The best AI products operate on all five simultaneously, not just the fastest or most visible.
- Ship, measure, improve — this is the loop that wins. The difference between AI products that delight users and those that disappoint isn't the initial model choice. It's the speed and rigor of the improvement loop. Build evals, collect feedback, improve systematically, and compound your advantages over time.