AI Metrics Cheatsheet

01 Build Phase 4 metrics

🎯

Accuracy

Is the AI getting the right answer?

Must

How to track

Ask your data team: "Out of every 100 predictions, how many are correct?" Good baseline, but never rely on it alone.

Why it matters

A fraud model at 99.9% accuracy sounds great, until you learn it just says "not fraud" every time. If only 0.1% of transactions are fraud, it is right 99.9% of the time while catching nothing.

PM action

Ask: "What happens when the model is wrong?" If wrong = costly, pair accuracy with precision & recall. Never ship on accuracy alone.
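A minimal sketch of the trap described above. The dataset (100,000 transactions, 0.1% fraud) and the "model" that always answers "not fraud" are made-up assumptions for illustration, not a real system.

```python
# Hypothetical dataset: 100,000 transactions, 0.1% of them fraud.
labels = [1] * 100 + [0] * 99_900        # 1 = fraud, 0 = not fraud
predictions = [0] * len(labels)          # a "model" that always says "not fraud"

# Accuracy: share of all predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of real fraud cases the model actually caught.
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = fraud_caught / sum(labels)

print(f"accuracy = {accuracy:.1%}")      # accuracy = 99.9%
print(f"recall   = {recall:.1%}")        # recall   = 0.0%
```

The headline number looks excellent while the model catches zero fraud, which is why accuracy needs a partner metric.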

⚖️

Precision vs Recall

Quality of catches vs completeness of catches

Must

How to track

Precision = Of everything AI flagged, how much was truly correct? Recall = Of everything that should've been caught, how much did AI find?

Why it matters

Spam filter: High precision = almost everything it flags really is spam, so real emails rarely get blocked. High recall = almost all spam gets caught, but pushing recall up usually sweeps in some real emails too. You usually can't max both at once.

PM action

Decide which mistake is worse. Cancer screening → optimize recall (don't miss real cases). Email spam → optimize precision (don't block real emails).
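The two definitions can be sketched from three counts your data team already has. The function name and the spam-filter numbers below are hypothetical, chosen only to make the arithmetic visible.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw counts.

    true_pos:  items flagged that really were positives
    false_pos: items flagged that were not positives
    false_neg: real positives the model missed
    """
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical spam filter: 80 spam caught, 20 real emails wrongly flagged,
# 40 spam emails missed.
precision, recall = precision_recall(true_pos=80, false_pos=20, false_neg=40)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

Here 80% of what was flagged was truly spam (precision), but only about two thirds of all spam was caught (recall).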

📉

Training Loss

Is the model improving as it learns?

Should

How to track

Ask the ML team to show two lines on one chart: training loss and validation loss. Both should fall together over time.

Why it matters

If training keeps improving but validation gets worse, the model is memorizing (overfitting), not learning. Like a student who aces practice tests but fails the real exam.

PM action

When you see the gap widening, tell your team: "We need more diverse data or a simpler model." Don't wait for launch to discover this.
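A rough sketch of how the widening gap could be spotted in numbers rather than by eye. The per-epoch loss values are invented, and the rule used (validation loss rising after its best point while training loss keeps falling) is one simple heuristic, not the only way teams detect overfitting.

```python
# Hypothetical loss per training epoch (lower is better).
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32]
val_loss   = [0.95, 0.78, 0.66, 0.64, 0.69, 0.75]

# Epoch (0-indexed) where validation loss was lowest.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Overfitting signal: since the best epoch, validation got worse
# while training kept getting better.
overfitting = (
    val_loss[-1] > val_loss[best_epoch]
    and train_loss[-1] < train_loss[best_epoch]
)

print(f"best validation epoch: {best_epoch}")
print(f"overfitting signal: {overfitting}")
```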

🗄️

Data Quality

Garbage in = garbage out

Must

How to track

Check 4 things: Is data complete (no missing fields)? Consistent (same format)? Fresh (recent)? Accurate (correct labels)?

Why it matters

A recommendation engine on 2-year-old data with 30% gaps will always lose to a simple model on clean, fresh data. Better data beats better algorithms.

PM action

Before asking "can we try a better model?" ask "can we get better data?" Less exciting, almost always higher impact.
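Three of the four checks above (complete, consistent, fresh) can be roughed out in a few lines; label accuracy usually needs human review and is not covered here. The records, field names, the 365-day freshness window, and the "uppercase country code" format rule are all assumptions for the sake of the example.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"user_id": 1, "country": "US", "updated": date(2024, 5, 1)},
    {"user_id": 2, "country": None, "updated": date(2022, 1, 15)},
    {"user_id": 3, "country": "us", "updated": date(2024, 4, 20)},
    {"user_id": 4, "country": "DE", "updated": None},
]
today = date(2024, 6, 1)   # fixed date so the example is reproducible
max_age_days = 365         # assumed freshness window

def share(flags):
    """Fraction of True values in a list of booleans."""
    return sum(flags) / len(flags)

# Complete: no missing fields.
complete_rate = share([all(v is not None for v in r.values()) for r in records])

# Consistent: country present and written as an uppercase code.
consistent_rate = share(
    [r["country"] is not None and r["country"].isupper() for r in records]
)

# Fresh: updated within the assumed window.
fresh_rate = share(
    [r["updated"] is not None and (today - r["updated"]).days <= max_age_days
     for r in records]
)

print(complete_rate, consistent_rate, fresh_rate)
```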

02 Test Phase 4 metrics

🏅

F1 Score

Single grade combining precision + recall

Should

How to track

Your data team calculates this. Think of it as a combined grade from 0 (terrible) to 1 (perfect). If either precision or recall is bad, F1 drops hard.

Why it matters

Great for comparing models: "Model A = 0.78, Model B = 0.84" — simple decision. Medical AI below 0.7? Not ready for production.

PM action

Use for version comparisons. But always ask your team what precision/recall tradeoff is hiding behind the single number.
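For reference, F1 is the harmonic mean of precision and recall. The sketch below uses invented model numbers to show both points from this card: one weak side drags F1 down hard, and two models with opposite tradeoffs can share the same F1.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One bad side sinks the score even when the other is near perfect.
lopsided = f1_score(0.99, 0.10)

# Hypothetical models with opposite tradeoffs and identical F1.
f1_a = f1_score(0.95, 0.75)   # careful: few false alarms, misses more
f1_b = f1_score(0.75, 0.95)   # thorough: catches more, more false alarms

print(f"lopsided F1 = {lopsided:.2f}")
print(f"model A F1 = {f1_a:.3f}, model B F1 = {f1_b:.3f}")
```

Models A and B score the same, yet behave very differently in production, which is the tradeoff to ask about.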

👻

Hallucination Rate

How often does the AI confidently make things up?

Must

How to track

Run test questions where you already know the right answer. Count how many times AI invents facts, cites fake sources, or is confidently wrong.

Why it matters

AI support bot tells a customer "your warranty covers water damage" — when it doesn't. One confident wrong answer can destroy trust and create legal liability.

PM action

Set a hard threshold (e.g. <1% hallucination) and don't ship until you hit it. Add guardrails: source citations, confidence scores, human escalation paths.
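A minimal sketch of the launch gate, assuming reviewers have already graded each test answer as hallucinated or not. The test-set size, the count of hallucinations, and the variable names are hypothetical; the 1% threshold is the example figure from this card.

```python
# Hypothetical graded test run: True = the answer contained a hallucination.
graded_answers = [False] * 197 + [True] * 3

MAX_HALLUCINATION_RATE = 0.01   # the "<1%" example threshold

hallucination_rate = sum(graded_answers) / len(graded_answers)
ready_to_ship = hallucination_rate < MAX_HALLUCINATION_RATE

print(f"hallucination rate = {hallucination_rate:.1%}")
print(f"ready to ship: {ready_to_ship}")
```

With 3 bad answers in 200, the rate is 1.5%, so this hypothetical build would not pass the gate.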

⏱️

Latency

How long do users wait for the AI?

Must

How to track

Measure the average response time (typical user) AND the slowest 5% (worst experience). Both matter — don't just track average.

Why it matters

A typical response of 200ms feels great. But if 5% of requests take 8 seconds, at 2M daily queries that's 100K slow, frustrating responses every day.

PM action

Set SLAs: "Average under 300ms, 95th percentile under 2s." Put latency on your live dashboard. Users don't care how smart AI is if it's slow.
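The sketch below checks the example SLA against a made-up sample of 100 response times. The percentile helper uses the simple nearest-rank method; monitoring tools may interpolate differently, so treat the helper and the sample data as assumptions.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: 94 fast responses and 6 slow ones, in milliseconds.
latencies_ms = [120] * 94 + [2500] * 6

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

meets_average_sla = mean_ms < 300    # "Average under 300ms"
meets_p95_sla = p95_ms < 2000        # "95th percentile under 2s"

print(f"mean = {mean_ms:.1f}ms, p50 = {p50_ms}ms, p95 = {p95_ms}ms")
print(f"average SLA met: {meets_average_sla}, p95 SLA met: {meets_p95_sla}")
```

The average passes while the 95th percentile fails, which is exactly the case an average-only dashboard would hide.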

📊

AUC-ROC

How good is the model at telling things apart?

Good

How to track

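Think of it as a grade that in practice runs from 0.5 to 1.0. 0.5 = the model is randomly guessing. 1.0 = perfectly separates categories. (A score below 0.5 means worse than guessing, usually a sign something is wired backwards.) Your data team runs this.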

Why it matters

Fraud detection at 0.92 = strong separation. At 0.55 = basically guessing. Best for comparing which model version performs better overall.

PM action

Use as a side-by-side comparison tool: "Model A is 0.88, Model B is 0.91 — go with B." Don't worry about the math; just use the number.
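If you are curious what the number means, AUC equals the chance that a randomly picked positive case gets a higher score than a randomly picked negative one. A small sketch under that definition, with invented fraud scores (the function name is hypothetical, and real libraries compute this far more efficiently):

```python
def auc(positive_scores, negative_scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical model scores (higher = "more likely fraud").
fraud_scores = [0.9, 0.8, 0.4]
legit_scores = [0.7, 0.3, 0.2, 0.1]

model_auc = auc(fraud_scores, legit_scores)
print(f"AUC = {model_auc:.2f}")
```

Only one of the twelve pairs is ranked the wrong way round (the 0.4 fraud case sits below the 0.7 legitimate one), so the score lands near 0.92.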

03 Deploy Phase 4 metrics

📡

Model Drift

Is the real world changing faster than your model?

Must

How to track

Your data team compares what the model trained on vs what it's seeing now. Big gap = predictions going stale.

Why it matters

When COVID hit, buying patterns shifted overnight. A demand model trained on 2019 data became dangerously inaccurate within weeks. Drift kills silently.

PM action

Set a recurring calendar check: "Is our model still accurate?" Monthly for stable domains, weekly for fast-changing ones. Have a retraining plan ready.
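One common way data teams put a number on "what it trained on vs what it's seeing now" is the Population Stability Index (PSI). This is a sketch of that technique, not necessarily what your team uses. The bucket shares and the alert threshold of 0.2 are assumptions; teams pick their own cut-offs.

```python
import math

def psi(expected_shares, actual_shares, eps=1e-6):
    """Population Stability Index across matching buckets.

    Each argument is a list of proportions that sums to 1.
    eps guards against log(0) for empty buckets.
    """
    total = 0.0
    for expected, actual in zip(expected_shares, actual_shares):
        expected, actual = max(expected, eps), max(actual, eps)
        total += (actual - expected) * math.log(actual / expected)
    return total

# Hypothetical share of orders per basket-size bucket.
training_shares = [0.25, 0.25, 0.25, 0.25]   # what the model learned from
live_shares     = [0.10, 0.20, 0.30, 0.40]   # what production looks like now

ALERT_THRESHOLD = 0.2   # assumed cut-off for "investigate and consider retraining"

drift_score = psi(training_shares, live_shares)
needs_review = drift_score > ALERT_THRESHOLD

print(f"PSI = {drift_score:.3f}, needs review: {needs_review}")
```

A score of 0 means the two distributions match; the larger the score, the further live data has moved from the training data.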

🟢

Uptime

Is the AI available when users need it?

Must

How to track

% of time the system is working. 99.9% = ~9 hours downtime/year. 99.99% = ~52 minutes/year. Sounds similar — it's not.

Why it matters

For supply chain AI, 15 minutes of downtime during peak = millions in missed decisions. Every additional "nine" in your uptime matters enormously.

PM action

Define your SLA clearly in your PRD. The difference between 99.9% and 99.99% is about 9 hours vs 52 minutes of downtime per year. Know what your use case demands.
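The downtime figures on this card come from simple arithmetic on a 365-day year, sketched here (the function name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24   # 8,760 hours, ignoring leap years

def downtime_hours_per_year(uptime_percent):
    """Allowed downtime per year for a given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR

three_nines = downtime_hours_per_year(99.9)
four_nines = downtime_hours_per_year(99.99)

print(f"99.9%  -> {three_nines:.2f} hours/year")
print(f"99.99% -> {four_nines * 60:.1f} minutes/year")
```

That works out to roughly 8.8 hours for 99.9% and about 52.6 minutes for 99.99%.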

💸

Cost per Query

Can you afford this AI at scale?

Must

How to track

Simple math: cost of each API call × expected daily volume × 30. Include hosting, API fees, and compute overhead.

Why it matters

Premium LLM for 10K daily queries at $0.06 each = $18K/month. A fine-tuned smaller model might cost $1.8K for similar quality. 10× difference.

PM action

Build a cost model in a spreadsheet and show leadership. Many AI features die not because they don't work, but because they're too expensive at scale.
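Before the spreadsheet, the core formula fits in one function. The sketch reuses the card's example figures; the $0.006 price for the smaller model is inferred from the $1.8K total, and the `fixed_monthly` parameter for hosting is an added assumption.

```python
def monthly_cost(cost_per_query, daily_queries, days=30, fixed_monthly=0.0):
    """Per-query cost x daily volume x days, plus any fixed hosting cost."""
    return cost_per_query * daily_queries * days + fixed_monthly

premium_llm = monthly_cost(cost_per_query=0.06, daily_queries=10_000)
small_model = monthly_cost(cost_per_query=0.006, daily_queries=10_000)

print(f"premium LLM: ${premium_llm:,.0f}/month")
print(f"small model: ${small_model:,.0f}/month")
print(f"ratio: {premium_llm / small_model:.0f}x")
```

Rerun it with your expected volume at 10× scale to see whether the feature still pays for itself.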

📈

Throughput

How many users can hit the AI at once?

Should

How to track

Measure queries per second at normal load AND peak load (Black Friday, viral moments, launches).

Why it matters

23 QPS normally. Launch day spikes to 200 QPS. If you didn't plan for it, the system crashes exactly when you need it most.

PM action

Rule of thumb: plan for 3× your expected peak. Ask engineering: "What happens at 5× and 10× current traffic?" Have a graceful degradation plan.
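A quick sketch of the headroom check, using the card's 23 and 200 QPS figures. The measured capacity of 250 QPS is a made-up load-test result used only to show the comparison.

```python
normal_qps = 23
expected_peak_qps = 200
measured_capacity_qps = 250        # hypothetical load-test result

SAFETY_FACTOR = 3                  # rule of thumb from this card

target_capacity_qps = SAFETY_FACTOR * expected_peak_qps
spike_ratio = expected_peak_qps / normal_qps
has_headroom = measured_capacity_qps >= target_capacity_qps

print(f"launch spike is {spike_ratio:.1f}x normal traffic")
print(f"target capacity: {target_capacity_qps} QPS, headroom OK: {has_headroom}")
```

In this example the system survives the expected peak but falls well short of the 3× target, so a degradation plan is still needed.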

04 Business Impact 4 metrics

💰

ROI

Is the AI worth the investment?

Must

How to track

(Money earned or saved − Total cost) ÷ Total cost × 100. This is the number your CFO cares about. Track from Day 1.

Why it matters

AI inventory optimization: $2M build cost, saves $140M/year. ROI = 6,900%. This number alone gets your budget approved for next year.

PM action

Even rough estimates help. "We think this saves $X/month" → track actuals → report quarterly. This is your best ammunition for headcount and resources.
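The ROI formula as a tiny function, checked against the inventory example on this card (the function name is an assumption):

```python
def roi_percent(gain, total_cost):
    """(Money earned or saved - total cost) / total cost x 100."""
    return (gain - total_cost) / total_cost * 100

inventory_roi = roi_percent(gain=140_000_000, total_cost=2_000_000)
print(f"ROI = {inventory_roi:,.0f}%")   # ROI = 6,900%
```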

✅

Task Completion Rate

Can AI finish the job without human help?

Must

How to track

Count requests AI resolves completely without human handoff, divided by total requests. Break down by query type for better insights.

Why it matters

AI resolves 73% of support tickets solo. Each saves $12 in agent time. At 50K tickets/month = $438K in monthly savings. This is how you quantify automation.

PM action

Break down by type: "90% for password resets, 20% for billing disputes." That tells you exactly where to invest next.
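A sketch of the overall rate, the savings estimate, and the per-type breakdown together. The ticket counts per type are invented so that they add up to the card's example (73% of 50K tickets, $12 saved each); only those headline figures come from the card.

```python
# Hypothetical monthly tickets: type -> (resolved by AI alone, total received)
tickets = {
    "password reset":  (9_000, 10_000),
    "billing dispute": (400, 2_000),
    "order status":    (27_100, 38_000),
}
SAVINGS_PER_TICKET = 12   # dollars of agent time saved per AI-resolved ticket

resolved = sum(done for done, _ in tickets.values())
total = sum(count for _, count in tickets.values())

completion_rate = resolved / total
monthly_savings = resolved * SAVINGS_PER_TICKET

print(f"overall: {completion_rate:.0%}, savings: ${monthly_savings:,}/month")
for kind, (done, count) in tickets.items():
    print(f"  {kind}: {done / count:.0%}")
```

The breakdown shows billing disputes at 20%, which is where the next investment would go in this made-up case.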

👥

User Adoption & Retention

Are people using it — and coming back?

Must

How to track

How many people try it (adoption) and how many return after 30 days (retention). Good AI features hold 25-40% at Day 30.

Why it matters

10K signups in Week 1 — exciting! But D30 retention = 8%. High adoption + low retention = AI wowed initially but didn't deliver ongoing value. Leaky bucket.

PM action

Fix retention before pouring into growth. Dig into WHY people leave — inaccurate? Too slow? Not solving a real pain point?
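Day-30 retention is just the share of people who signed up and are still active 30 days later. A minimal sketch with made-up user IDs sized to match the example above (10K signups, 8% retained):

```python
# Hypothetical user IDs.
signed_up_week_1 = set(range(10_000))     # everyone who tried the feature
active_on_day_30 = set(range(800))        # those seen again around day 30

retained = signed_up_week_1 & active_on_day_30
d30_retention = len(retained) / len(signed_up_week_1)

print(f"adoption: {len(signed_up_week_1):,} users")
print(f"D30 retention: {d30_retention:.0%}")   # D30 retention: 8%
```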

⏳

Time to Value

How fast do users get something useful?

Should

How to track

Measure time from opening the feature → first useful result. Shorter = higher activation and adoption.

Why it matters

15 settings to configure before seeing insights = most users leave. Best AI products deliver value in under 30 seconds with smart defaults.

PM action

Map the journey from click → "aha moment." Every step you eliminate = higher activation. Smart defaults are your best friend.

05 Safety & Trust 4 metrics

⚖️

Bias & Fairness

Does the AI treat all groups equally?

Must

How to track

Check if AI decisions (approvals, recommendations) are roughly equal across different user groups. Big gaps = bias problem.

Why it matters

Loan model approves 72% of one group but 51% of another. That's not just ethically wrong — it's a regulatory violation and PR crisis waiting to happen.

PM action

Require a fairness audit in your launch checklist. Test across segments before shipping. Much cheaper to fix bias before launch than after headlines hit.
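A first-pass version of the segment check, using the loan example's 72% and 51% approval rates. The applicant counts are invented. The 0.8 ratio cut-off echoes the "four-fifths" rule of thumb from US employment guidance; whether that is the right bar for your product is an assumption to confirm with legal or compliance.

```python
# Hypothetical audit sample: group -> (approved, total applicants)
approvals = {
    "group_a": (720, 1_000),
    "group_b": (510, 1_000),
}
MIN_RATIO = 0.8   # assumed threshold, see note above

rates = {group: ok / total for group, (ok, total) in approvals.items()}
highest, lowest = max(rates.values()), min(rates.values())

approval_gap = highest - lowest        # difference in percentage points
approval_ratio = lowest / highest      # lowest rate relative to highest
flag_for_review = approval_ratio < MIN_RATIO

print(f"gap = {approval_gap:.0%}, ratio = {approval_ratio:.2f}")
print(f"flag for fairness review: {flag_for_review}")
```

A gap alone does not prove bias, but a result like this one should block launch until someone can explain it.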

🚫

Toxicity Rate

Is the AI saying harmful things?

Must

How to track

Test with tricky edge cases and adversarial inputs (people WILL try to break it). Count harmful, offensive, or inappropriate responses.

Why it matters

0.3% toxic rate × 100K daily messages = 300 harmful outputs per day. Any single one could become a viral screenshot and a PR disaster.

PM action

Safety filters = non-negotiable launch gate. Test adversarially. Have a moderation escalation path. Monitor daily post-launch.
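To turn a test result into the daily figure quoted above, scale the observed rate by expected volume. The test-set size and flagged count here are hypothetical, picked to give the card's 0.3% rate.

```python
# Hypothetical adversarial test run.
test_prompts = 1_000
flagged_as_toxic = 3

daily_messages = 100_000

toxicity_rate = flagged_as_toxic / test_prompts
expected_toxic_per_day = toxicity_rate * daily_messages

print(f"toxicity rate = {toxicity_rate:.1%}")
print(f"expected harmful outputs per day: {expected_toxic_per_day:.0f}")
```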

🔍

Explainability

Can you explain WHY the AI decided that?

Should

How to track

For each AI decision, can you show a human-readable reason? "We flagged this because..." If you can't explain it, users won't trust it.

Why it matters

Healthcare AI recommends treatment but can't say why → doctors rejected it (12% adoption). After adding clear explanations → adoption jumped to 67%.

PM action

Build "why" alongside "what" from day one. In regulated industries (healthcare, finance) explainability isn't optional; it is often a legal or regulatory requirement.

🔐

PII Leakage

Is private data leaking into AI outputs?

Must

How to track

Test if the AI ever reveals names, emails, phone numbers, or addresses in responses — especially if trained on real user data.

Why it matters

An LLM trained on support logs starts generating real customer names and emails. That's a likely GDPR violation, with potential fines in the millions.

PM action

Scrub training data of PII before training. Add output filters. Run automated scans daily. This is a "shut it down immediately" severity if found live.
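A very rough sketch of an output scan for two of the PII types named above. The regular expressions are deliberately simple assumptions that will miss some formats and raise some false alarms; names and postal addresses need more than pattern matching. The sample responses are invented.

```python
import re

# Simplified patterns, for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def find_pii(text):
    """Return every match per PII type found in one AI response."""
    return {kind: pattern.findall(text) for kind, pattern in PII_PATTERNS.items()}

leaky = "Sure! Reach Jane at jane.doe@example.com or +1 415-555-0134."
clean = "Your order 12345 ships on Tuesday."

print(find_pii(leaky))
print(find_pii(clean))
```

Running a scan like this over a daily sample of responses gives an early warning, but it complements scrubbing the training data rather than replacing it.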
