Hamel Husain & Shreya Shankar

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic).


AI & Technology Skills

Developing evaluation systems is the most critical and highest-return activity in AI product development.

"To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in."
00:00

AI evaluation is essentially a specialized form of data analytics applied to large language model outputs.

"Evals is a way to systematically measure and improve an AI application, and it really doesn't have to be scary or unapproachable at all. It really is, at its core, data analytics on your LLM applicati..."
05:49
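A minimal sketch of treating evals as data analytics, assuming a hypothetical traces.jsonl log where each line records the feature exercised and a pass/fail verdict from a grader (all field names here are illustrative):

```python
import json

import pandas as pd

# Load one logged trace per line into a DataFrame.
traces = pd.DataFrame(
    json.loads(line) for line in open("traces.jsonl", encoding="utf-8")
)

# The core analytics move: slice pass rates by feature to see where the
# application actually fails, instead of guessing.
pass_rates = (
    traces.groupby("feature")["pass"]
    .agg(pass_rate="mean", n="count")
    .sort_values("pass_rate")
)
print(pass_rates)
```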

A 'benevolent dictator' with domain expertise should lead the evaluation process to avoid committee-driven stagnation.

"You can appoint one person whose taste that you trust. It should be the person with domain expertise. Oftentimes, it is the product manager."
01:09

Automating evaluation entirely with AI without human grounding fails because AI lacks the specific product context and domain expertise.

"The top one is, 'We live in the age of AI. Can't the AI just eval it?' But it doesn't work."
00:39

When using an LLM to judge another LLM, you must validate the judge's accuracy against human-labeled data.

"LLM as a judge is something, it's like a meta eval. You have to eval that eval to make sure the LLM that's judging is doing the right thing"
47:56
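One way to run that meta-eval, sketched under the assumption that you have human pass/fail labels and judge verdicts aligned on the same traces (names are illustrative): compare them and report agreement split by class, so a judge that passes too eagerly is visible.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Raw agreement plus true-positive/true-negative rates for the judge."""
    assert len(human) == len(judge) and human
    pairs = list(zip(human, judge))
    agreement = sum(h == j for h, j in pairs) / len(pairs)
    human_pass = [j for h, j in pairs if h]
    human_fail = [j for h, j in pairs if not h]
    # TPR: of the traces humans passed, how many did the judge also pass?
    tpr = sum(human_pass) / len(human_pass) if human_pass else float("nan")
    # TNR: of the traces humans failed, how many did the judge also fail?
    tnr = sum(not j for j in human_fail) / len(human_fail) if human_fail else float("nan")
    return {"agreement": agreement, "tpr": tpr, "tnr": tnr}

# An overly lenient judge shows up as high TPR but low TNR.
print(judge_agreement(
    human=[True, True, False, False, False],
    judge=[True, True, True, False, True],
))
```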

Binary (pass/fail) evaluation scores are more actionable and reliable than Likert scales (1-5).

"You want to make it binary because we want to simplify things. We don't want, 'Hey, score this on a rating of one to five. How good is it?' That's just in most cases, that's a weasel way of not making..."
52:16
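A minimal sketch of a binary judge, assuming a hypothetical complete(prompt) -> str wrapper around whatever model API you use. The point is the output contract: a short critique followed by a hard PASS or FAIL, never a 1-5 score.

```python
JUDGE_PROMPT = """\
You are reviewing an AI assistant's answer.

Question: {question}
Answer: {answer}

Write a short critique, then on the final line output exactly PASS or FAIL.
Do not output a numeric score.
"""

def judge(question: str, answer: str, complete) -> bool:
    reply = complete(JUDGE_PROMPT.format(question=question, answer=answer))
    lines = [line for line in reply.strip().splitlines() if line.strip()]
    verdict = lines[-1].strip().upper() if lines else ""
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"judge broke the binary contract: {verdict!r}")
    return verdict == "PASS"
```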

The guests explicitly define this as a 'new skill' that is distinct from traditional software testing or general AI strategy. It involves a specific multi-step workflow (Error Analysis, Open Coding, Axial Coding).

"Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders."

Product Management Skills

In AI development, executable evaluations are replacing static PRDs as the definitive source of product requirements.

"This is the purest sense of what a product requirements document should be, is this eval judge that's telling you exactly what it should be, and it's automatic and running constantly."
01:00:56
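One way this can look in practice, sketched as a pytest suite so the requirement runs on every commit; generate_answer and the dataset path are hypothetical stand-ins for your application code and curated examples:

```python
import json

import pytest

def generate_answer(question: str) -> str:
    """Stand-in for your application's real entry point."""
    raise NotImplementedError  # replace with your app's call

with open("requirements_dataset.jsonl", encoding="utf-8") as f:
    CASES = [json.loads(line) for line in f]

# Each case encodes a requirement that would otherwise live in a PRD,
# e.g. "every answer must cite a source document".
@pytest.mark.parametrize("case", CASES)
def test_answer_meets_requirement(case):
    answer = generate_answer(case["question"])
    assert case["required_phrase"] in answer
```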

AI product requirements must be flexible and derived from real-world data rather than just hypothetical planning.

"You're never going to know what the failure modes are going to be upfront... PRDs are a great abstraction for thinking about this. It's not the end all, be all. It's going to change."
01:02:28

Effective testing requires grounding in actual observed errors rather than theoretical assumptions.

"Start with some kind of data analysis to ground what you should even test... you should start with some kind of data analysis to ground what you should even test."
10:06
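A minimal sketch of turning that grounding step into a priority list: tally the failure labels written during trace review and build evals for the most frequent modes first (the annotation format is illustrative).

```python
from collections import Counter

annotations = [
    {"trace_id": "t1", "failure": "ignored user constraint"},
    {"trace_id": "t2", "failure": "hallucinated citation"},
    {"trace_id": "t3", "failure": "hallucinated citation"},
    {"trace_id": "t4", "failure": None},  # reviewed, no problem found
]

# The most common observed failures are the first things worth testing.
counts = Counter(a["failure"] for a in annotations if a["failure"])
for failure, n in counts.most_common():
    print(f"{n:3d}  {failure}")
```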

Manual 'trace analysis' (reviewing logs) is the most effective way to understand how an AI product is actually performing.

"The first step in conquering data like this is just to write notes... You sample your data and just take a look, and it's surprising how much you learn when you do this."
17:47
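A minimal sketch of that note-taking loop, assuming a hypothetical traces.jsonl with id, input, and output fields: sample traces, read each one, and capture a free-form note to group into failure modes later.

```python
import json
import random

traces = [json.loads(line) for line in open("traces.jsonl", encoding="utf-8")]
sample = random.sample(traces, k=min(30, len(traces)))

notes = []
for trace in sample:
    print(trace["input"], "\n---\n", trace["output"])
    notes.append({"trace_id": trace["id"], "note": input("note> ")})

# Persist the notes for later open/axial coding.
with open("notes.jsonl", "w", encoding="utf-8") as f:
    f.writelines(json.dumps(note) + "\n" for note in notes)
```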

The manual review process should continue until the team stops discovering new types of failure modes.

"Keep looking at traces until you feel like you're not learning anything new... theoretical saturation."
30:30
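One way to make that stopping rule concrete, as a sketch: track when the last new failure mode appeared in your review sequence, and treat a long run of traces with nothing new as saturation. The window size here is an arbitrary illustrative choice.

```python
def reached_saturation(failure_labels: list[str], window: int = 20) -> bool:
    """True if the last `window` reviewed traces added no new failure mode.

    `failure_labels` holds one label per reviewed trace, in review order,
    with "" meaning no problem was found on that trace.
    """
    seen: set[str] = set()
    last_new = -1
    for i, label in enumerate(failure_labels):
        if label and label not in seen:
            seen.add(label)
            last_new = i
    return len(failure_labels) - 1 - last_new >= window

# New modes early, then a long flat tail -> saturated.
labels = ["A", "B", "A", "C"] + [""] * 25
print(reached_saturation(labels))  # True
```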