Priority:
Must Track
Should Track
Good to Track
5 Phases · 20 Metrics · 0 Jargon
Accuracy
Is the AI getting the right answer?
Must
How to track
Ask your data team: "Out of every 100 predictions, how many are correct?" Good baseline, but never rely on it alone.
Why it matters
A fraud model at 99.9% accuracy sounds great, until you learn it just says "not fraud" every time. If only 0.1% of transactions are fraud, it's right 99.9% of the time while catching nothing.
PM action
Ask: "What happens when the model is wrong?" If wrong = costly, pair accuracy with precision & recall. Never ship on accuracy alone.
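To see the trap in numbers, here is a minimal sketch. The 100,000 transactions and the always-"not fraud" model are made-up illustrations, not real data.

```python
# Hypothetical data: 100,000 transactions, 0.1% of them fraud.
total = 100_000
fraud = 100

labels = [1] * fraud + [0] * (total - fraud)   # 1 = fraud, 0 = not fraud
predictions = [0] * total                      # a "model" that always says "not fraud"

# Accuracy: share of predictions that match the true label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / total

# How many real fraud cases did the model actually flag?
fraud_caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

print(f"Accuracy: {accuracy:.1%}")
print(f"Fraud caught: {fraud_caught} of {fraud}")
```

High accuracy, zero fraud caught: that is why accuracy needs a partner metric.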
Precision vs Recall
Quality of catches vs completeness of catches
Must
How to track
Precision = Of everything AI flagged, how much was truly correct? Recall = Of everything that should've been caught, how much did AI find?
Why it matters
Spam filter: High precision = clean inbox, no real emails wrongly flagged. High recall = catches all spam, but some real emails get swept up. You can't max both.
PM action
Decide which mistake is worse. Cancer screening → optimize recall (don't miss real cases). Email spam → optimize precision (don't block real emails).
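If it helps to see the two definitions side by side, here is a rough sketch. The function name and the spam-filter counts are invented for illustration.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw counts.

    true_pos:  items the AI flagged that really were positive
    false_pos: items the AI flagged that were actually fine
    false_neg: real positives the AI missed
    """
    flagged = true_pos + false_pos           # everything the AI flagged
    should_catch = true_pos + false_neg      # everything it should have caught
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / should_catch if should_catch else 0.0
    return precision, recall

# Hypothetical spam filter: 90 spam caught, 10 real emails wrongly flagged, 30 spam missed.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(f"Precision: {precision:.0%}, Recall: {recall:.0%}")
```

In this made-up case the filter rarely blocks real email (high precision) but still lets a quarter of the spam through (lower recall).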
Training Loss
Is the model improving as it learns?
Should
How to track
Ask your ML team to show two lines on a chart: training score and validation score. Both should improve together over time.
Why it matters
If training keeps improving but validation gets worse, the model is memorizing, not learning. Like a student who aces practice tests but fails the real exam.
PM action
When you see the gap widening, tell your team: "We need more diverse data or a simpler model." Don't wait for launch to discover this.
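One way to spot the widening gap automatically is sketched below. It assumes the two lines are loss values (lower is better), and the function name, the `patience` setting, and the numbers are all hypothetical.

```python
def first_overfit_epoch(train_loss, val_loss, patience=2):
    """Return the first epoch of a run where validation loss rises while
    training loss keeps falling for `patience` epochs in a row, else None."""
    streak = 0
    for i in range(1, len(val_loss)):
        val_worse = val_loss[i] > val_loss[i - 1]
        train_better = train_loss[i] < train_loss[i - 1]
        if val_worse and train_better:
            streak += 1
            if streak == patience:
                return i - patience + 1   # epoch where the divergence began
        else:
            streak = 0
    return None

# Made-up loss curves, one value per epoch (epoch 0 is the first).
train = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32]
val = [0.95, 0.78, 0.66, 0.64, 0.69, 0.75]

epoch = first_overfit_epoch(train, val)
print(f"Gap starts widening at epoch: {epoch}")
```

A check like this only flags the symptom; the fix is still more diverse data or a simpler model.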
Data Quality
Garbage in = garbage out
Must
How to track
Check 4 things: Is data complete (no missing fields)? Consistent (same format)? Fresh (recent)? Accurate (correct labels)?
Why it matters
A recommendation engine on 2-year-old data with 30% gaps will always lose to a simple model on clean, fresh data. Better data beats better algorithms.
PM action
Before asking "can we try a better model?" ask "can we get better data?" Less exciting, almost always higher impact.
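Two of the four checks (complete and fresh) are easy to automate. The sketch below is only an illustration: the field names, the one-year freshness cutoff, and the sample records are assumptions, not a standard.

```python
from datetime import date

REQUIRED_FIELDS = ("user_id", "item_id", "updated")  # hypothetical schema

def quality_report(records, today, max_age_days=365):
    """Share of records with a missing required field, and share older than the cutoff."""
    total = len(records)
    if total == 0:
        return {"incomplete_share": 0.0, "stale_share": 0.0}
    incomplete = sum(
        1 for r in records if any(r.get(f) is None for f in REQUIRED_FIELDS)
    )
    stale = sum(
        1 for r in records
        if r.get("updated") is not None
        and (today - r["updated"]).days > max_age_days
    )
    return {"incomplete_share": incomplete / total, "stale_share": stale / total}

# Made-up sample: one record has a gap, one is over two years old.
records = [
    {"user_id": 1, "item_id": "A", "updated": date(2024, 5, 1)},
    {"user_id": 2, "item_id": None, "updated": date(2024, 4, 1)},
    {"user_id": 3, "item_id": "C", "updated": date(2022, 1, 15)},
    {"user_id": 4, "item_id": "D", "updated": date(2024, 3, 10)},
]
report = quality_report(records, today=date(2024, 6, 1))
print(report)
```

Consistency and label accuracy usually need domain-specific rules or manual spot checks, so they are left out here.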
F1 Score
Single grade combining precision + recall
Should
How to track
Your data team calculates this. Think of it as a combined grade from 0 (terrible) to 1 (perfect). If either precision or recall is bad, F1 drops hard.
Why it matters
Great for comparing models: "Model A = 0.78, Model B = 0.84" is a simple decision. Medical AI below 0.7? Probably not ready for production.
PM action
Use for version comparisons. But always ask your team what precision/recall tradeoff is hiding behind the single number.
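For the curious, F1 is the harmonic mean of precision and recall, which is why one weak side drags it down. A small sketch with made-up inputs:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 = terrible, 1 = perfect)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.80, 0.80)   # both sides decent
lopsided = f1_score(0.95, 0.30)   # great precision, poor recall

print(f"Balanced model F1: {balanced:.2f}")
print(f"Lopsided model F1: {lopsided:.2f}")
```

The lopsided model has the higher precision but a much lower F1, which is the tradeoff the single number can hide.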
Hallucination Rate
How often does the AI confidently make things up?
Must
How to track
Run test questions where you already know the right answer. Count how many times AI invents facts, cites fake sources, or is confidently wrong.
Why it matters
AI support bot tells a customer "your warranty covers water damage" when it doesn't. One confident wrong answer can destroy trust and create legal liability.
PM action
Set a hard threshold (e.g. <1% hallucination) and don't ship until you hit it. Add guardrails: source citations, confidence scores, human escalation paths.
Latency
How long do users wait for the AI?
Must
How to track
Measure the average response time (typical user) AND the slowest 5% (worst experience). Both matter; don't track only the average.
Why it matters
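A typical response of 200ms feels great. But if 5% of queries take 8 seconds, at 2M daily queries that's 100K frustrating waits every day. Those slow queries also pull the true average well above 200ms, which is why you track both numbers.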
PM action
Set SLAs: "Average under 300ms, 95th percentile under 2s." Put latency on your live dashboard. Users don't care how smart AI is if it's slow.
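Here is a rough sketch of how the two numbers could be computed from raw response times and checked against the example SLA above. The sample latencies are invented, and the nearest-rank percentile is just one common way to define "95th percentile".

```python
from statistics import mean, median

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = (len(ordered) * pct + 99) // 100   # integer ceiling of n * pct / 100
    return ordered[max(rank, 1) - 1]

# Made-up sample: 94 fast responses and 6 very slow ones, in milliseconds.
latencies_ms = [200] * 94 + [8000] * 6

typical_ms = median(latencies_ms)
avg_ms = mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)

# SLA from the text: average under 300ms, 95th percentile under 2s.
meets_sla = avg_ms < 300 and p95_ms < 2000

print(f"Typical: {typical_ms}ms, average: {avg_ms}ms, p95: {p95_ms}ms, meets SLA: {meets_sla}")
```

The typical user sees 200ms, yet a handful of slow responses is enough to fail both halves of the SLA.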
AUC-ROC
How good is the model at telling things apart?
Good
How to track
Think of it as a grade from 0.5 to 1.0. 0.5 = the model is randomly guessing. 1.0 = perfectly separates categories. Your data team runs this.
Why it matters
Fraud detection at 0.92 = strong separation. At 0.55 = basically guessing. Best for comparing which model version performs better overall.
PM action
Use as a side-by-side comparison tool: "Model A is 0.88, Model B is 0.91, go with B." Don't worry about the math; just use the number.
Model Drift
Is the real world changing faster than your model?
Must
How to track
Your data team compares what the model trained on vs what it's seeing now. Big gap = predictions going stale.
Why it matters
Post-COVID, buying patterns shifted overnight. A demand model trained on 2019 data became dangerously inaccurate within weeks. Drift kills silently.
PM action
Set a recurring calendar check: "Is our model still accurate?" Monthly for stable domains, weekly for fast-changing ones. Have a retraining plan ready.
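One simple way to compare "trained on" vs "seeing now" is to look at how the mix of a single category has shifted. This is only a sketch of that idea, not a full drift method; the channel names and counts are hypothetical.

```python
def shares(counts):
    """Convert raw counts per category into shares that sum to 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def largest_shift(train_counts, live_counts):
    """Return (category, gap) for the category whose share changed the most."""
    train, live = shares(train_counts), shares(live_counts)
    gaps = {
        name: abs(live.get(name, 0.0) - train.get(name, 0.0))
        for name in set(train) | set(live)
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]

# Made-up order counts by channel: training period vs the last month.
train_counts = {"in_store": 700, "online": 250, "pickup": 50}
live_counts = {"in_store": 300, "online": 550, "pickup": 150}

category, gap = largest_shift(train_counts, live_counts)
print(f"Biggest shift: {category} moved by {gap:.0%} of all orders")
```

Your data team will likely use more formal statistical tests, but the question they answer is the same.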
Uptime
Is the AI available when users need it?
Must
How to track
Percentage of time the system is working. 99.9% = ~8.8 hours of downtime/year. 99.99% = ~53 minutes/year. Sounds similar, but it's not.
Why it matters
For supply chain AI, 15 minutes of downtime during peak = millions in missed decisions. Every additional "nine" in your uptime matters enormously.
PM action
Define your SLA clearly in your PRD. The difference between 99.9% and 99.99% is about 8.8 hours vs 53 minutes of downtime per year. Know what your use case demands.
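The downtime figures come from simple arithmetic on the minutes in a year. A minimal sketch, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def downtime_minutes_per_year(uptime_percent):
    """Minutes of allowed downtime per year for a given uptime percentage."""
    return (100 - uptime_percent) / 100 * MINUTES_PER_YEAR

for target in (99.9, 99.99):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime: about {minutes:.0f} minutes ({minutes / 60:.1f} hours) per year")
```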
Cost per Query
Can you afford this AI at scale?
Must
How to track
Simple math: cost of each API call × expected daily volume × 30. Include hosting, API fees, and compute overhead.
Why it matters
Premium LLM for 10K daily queries at $0.06 each = $18K/month. A fine-tuned smaller model might cost $1.8K for similar quality. 10× difference.
PM action
Build a cost model in a spreadsheet and show leadership. Many AI features die not because they don't work, but because they're too expensive at scale.
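The same cost model fits in a few lines if you prefer code to a spreadsheet. This is a sketch: the `fixed_monthly` parameter for hosting is an added assumption, and the $0.006 per-query price for the smaller model is simply what the $1.8K figure implies.

```python
def monthly_cost(cost_per_query, daily_queries, days=30, fixed_monthly=0.0):
    """Per-query cost x daily volume x days, plus any fixed hosting or compute overhead."""
    return cost_per_query * daily_queries * days + fixed_monthly

premium = monthly_cost(cost_per_query=0.06, daily_queries=10_000)
smaller = monthly_cost(cost_per_query=0.006, daily_queries=10_000)  # implied price, an assumption

print(f"Premium LLM: ${premium:,.0f}/month")
print(f"Smaller model: ${smaller:,.0f}/month")
print(f"Difference: {premium / smaller:.0f}x")
```

Rerun it with 10× the daily volume before you promise leadership the feature scales.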
Throughput
How many users can hit the AI at once?
Should
How to track
Measure queries per second at normal load AND peak load (Black Friday, viral moments, launches).
Why it matters
23 QPS normally. Launch day spikes to 200 QPS. If you didn't plan for it, the system crashes exactly when you need it most.
PM action
Rule of thumb: plan for 3× your expected peak. Ask engineering: "What happens at 5× and 10× current traffic?" Have a graceful degradation plan.
ROI
Is the AI worth the investment?
Must
How to track
(Money earned or saved − Total cost) ÷ Total cost × 100. This is the number your CFO cares about. Track from Day 1.
Why it matters
AI inventory optimization: $2M build cost, saves $140M/year. ROI = 6,900%. This number alone gets your budget approved for next year.
PM action
Even rough estimates help. "We think this saves $X/month" → track actuals → report quarterly. This is your best ammunition for headcount and resources.
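As a quick check of the formula, here is the inventory example worked through in a small sketch (the function name is just for illustration):

```python
def roi_percent(gain, total_cost):
    """(Money earned or saved - total cost) / total cost x 100."""
    return (gain - total_cost) / total_cost * 100

# Numbers from the inventory optimization example: $140M saved, $2M to build.
roi = roi_percent(gain=140_000_000, total_cost=2_000_000)
print(f"ROI: {roi:,.0f}%")
```

A project that only breaks even comes out at 0%, and a money-loser goes negative.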
Task Completion Rate
Can AI finish the job without human help?
Must
How to track
Count requests AI resolves completely without human handoff, divided by total requests. Break down by query type for better insights.
Why it matters
AI resolves 73% of support tickets solo. Each saves $12 in agent time. At 50K tickets/month = $438K in monthly savings. This is how you quantify automation.
PM action
Break down by type: "90% for password resets, 20% for billing disputes." That tells you exactly where to invest next.
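A sketch of both calculations: the per-type breakdown and the savings math. The ticket counts per type are invented to match the percentages quoted above; the savings inputs come from the support-ticket example.

```python
def completion_rate(resolved, total):
    """Share of requests the AI resolved with no human handoff."""
    return resolved / total if total else 0.0

# Hypothetical counts per query type: (resolved by AI alone, total requests).
by_type = {"password_reset": (900, 1000), "billing_dispute": (40, 200)}
rates = {name: completion_rate(r, t) for name, (r, t) in by_type.items()}

def monthly_savings(tickets, rate, saving_per_ticket):
    """Tickets per month x completion rate x agent cost saved per ticket."""
    return tickets * rate * saving_per_ticket

savings = monthly_savings(tickets=50_000, rate=0.73, saving_per_ticket=12)

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
print(f"Monthly savings: ${savings:,.0f}")
```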
User Adoption & Retention
Are people using it, and coming back?
Must
How to track
How many people try it (adoption) and how many return after 30 days (retention). Good AI features hold 25-40% at Day 30.
Why it matters
10K signups in Week 1: exciting! But D30 retention = 8%. High adoption + low retention = AI wowed initially but didn't deliver ongoing value. Leaky bucket.
PM action
Fix retention before pouring into growth. Dig into WHY people leave: inaccurate? Too slow? Not solving a real pain point?
Time to Value
How fast do users get something useful?
Should
How to track
Measure time from opening the feature → first useful result. Shorter = higher activation and adoption.
Why it matters
15 settings to configure before seeing insights = most users leave. Best AI products deliver value in under 30 seconds with smart defaults.
PM action
Map the journey from click → "aha moment." Every step you eliminate = higher activation. Smart defaults are your best friend.
Bias & Fairness
Does the AI treat all groups equally?
Must
How to track
Check if AI decisions (approvals, recommendations) are roughly equal across different user groups. Big gaps = bias problem.
Why it matters
Loan model approves 72% of one group but 51% of another. That's not just ethically wrong; it's a potential regulatory violation and a PR crisis waiting to happen.
PM action
Require a fairness audit in your launch checklist. Test across segments before shipping. Much cheaper to fix bias before launch than after headlines hit.
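A first-pass check can be as simple as comparing approval rates by group. The sketch below uses the loan example's percentages; the group labels, counts, and the 5-point review threshold are hypothetical, and a real audit would go much further.

```python
def approval_rates(decisions):
    """decisions maps group name -> (approved, total applications)."""
    return {group: approved / total for group, (approved, total) in decisions.items()}

# Counts chosen to match the 72% vs 51% loan example.
decisions = {"group_a": (720, 1000), "group_b": (510, 1000)}
rates = approval_rates(decisions)

gap = max(rates.values()) - min(rates.values())     # difference in approval rate
ratio = min(rates.values()) / max(rates.values())   # lowest rate relative to highest

MAX_GAP = 0.05  # hypothetical review threshold; set yours with legal and compliance
needs_review = gap > MAX_GAP

print(f"Gap: {gap:.0%}, ratio: {ratio:.2f}, needs review: {needs_review}")
```

A gap alone does not prove bias, but it tells you where to look before launch.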
Toxicity Rate
Is the AI saying harmful things?
Must
How to track
Test with tricky edge cases and adversarial inputs (people WILL try to break it). Count harmful, offensive, or inappropriate responses.
Why it matters
0.3% toxic rate × 100K daily messages = 300 harmful outputs per day. Any single one could become a viral screenshot and a PR disaster.
PM action
Safety filters = non-negotiable launch gate. Test adversarially. Have a moderation escalation path. Monitor daily post-launch.
Explainability
Can you explain WHY the AI decided that?
Should
How to track
For each AI decision, can you show a human-readable reason? "We flagged this because..." If you can't explain it, users won't trust it.
Why it matters
Healthcare AI recommends treatment but can't say why, so doctors rejected it (12% adoption). After adding clear explanations, adoption jumped to 67%.
PM action
Build "why" alongside "what" from day one. In regulated industries (healthcare, finance) explainability isn't optional; it's often legally required.
PII Leakage
Is private data leaking into AI outputs?
Must
How to track
Test if the AI ever reveals names, emails, phone numbers, or addresses in responses, especially if trained on real user data.
Why it matters
An LLM trained on support logs starts generating real customer names and emails. That's an instant GDPR violation with fines in the millions.
PM action
Scrub training data of PII before training. Add output filters. Run automated scans daily. This is a "shut it down immediately" severity if found live.
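An automated scan can start as a simple pattern match over AI outputs. The sketch below is deliberately rough: the two regular expressions are illustrative assumptions that catch common email and phone formats only, and they will not find names or street addresses.

```python
import re

# Illustrative patterns only; real PII detection needs a dedicated tool.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def find_pii(text):
    """Return a dict of PII type -> list of matches found in one AI response."""
    hits = {}
    for kind, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[kind] = matches
    return hits

# Made-up responses to scan.
leaky = find_pii("Reach Jane at jane.doe@example.com or 555-123-4567.")
clean = find_pii("Your order ships in 3 days.")

print(leaky)
print(clean)
```

Any non-empty result on live traffic should trigger the "shut it down" escalation, not just a log entry.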