AI Evaluation (Evals)
AI evaluation (evals) is the emerging skill of systematically testing and measuring AI model performance. As models become products, evals become the product requirements document. The work spans error analysis, rubric creation, benchmark building, and systematic testing: it is a critical bottleneck for AI labs and a new core competency for product builders.
The Guide
Three key steps synthesized from two guest perspectives.
Treat evals as your product requirements
In AI products, the eval suite defines what the product should do. If you can't measure it, you can't improve it. Before building features, define how you'll evaluate success. The eval is the spec - it tells the model (and your team) exactly what 'good' looks like.
Featured guest perspectives
"If the model is the product, then the eval is the product requirement document."— Brendan Foody
Build systematic evaluation workflows
Develop a multi-step process: start with error analysis to understand where the model fails, use open coding to categorize failure modes, create rubrics based on those categories, and build automated tests. This systematic approach replaces gut-feel assessments with rigorous measurement.
Featured guest perspectives
"Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders."— Hamel Husain & Shreya Shankar
Invest in this as a core skill
The heads of product at major AI labs consider evals one of the most important emerging skills. This isn't traditional QA or software testing - it's a new discipline that product builders need to develop. Treat it as a first-class competency worth investing significant time in learning.
Common Mistakes
- Treating AI testing like traditional software testing
- Relying on vibes instead of systematic measurement
- Not investing in eval infrastructure early
- Evaluating only accuracy without considering other dimensions like safety, helpfulness, or style (see the sketch after this list)
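As a concrete contrast to that last mistake, the sketch below scores a single output on several dimensions at once. The dimensions are real concerns, but each toy check is a placeholder; production suites typically back these with human labels or an LLM judge.

```python
# Hypothetical multi-dimensional grader: each dimension gets its own
# score, so "better" is never reduced to a single accuracy number.
BLOCKLIST = {"guaranteed cure", "medical advice"}   # toy safety check

def score(output: str, expected: str) -> dict:
    return {
        # accuracy: does the output contain the expected answer?
        "accuracy":    float(expected.lower() in output.lower()),
        # safety: a toy blocklist stands in for a real safety classifier
        "safety":      float(not any(b in output.lower() for b in BLOCKLIST)),
        # helpfulness: crude proxy that penalizes empty or one-word replies
        "helpfulness": float(len(output.split()) >= 5),
        # style: a toy length budget stands in for a style rubric
        "style":       float(len(output.split()) <= 120),
    }

print(score("The refund takes 5 business days to process.", "5 business days"))
# {'accuracy': 1.0, 'safety': 1.0, 'helpfulness': 1.0, 'style': 1.0}
```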
Signs You're Doing It Well
- You can quantify model performance across multiple dimensions
- You have automated eval suites that run on every model change (one way to wire this up is sketched after this list)
- Your product decisions are informed by eval results, not intuition
- You can explain exactly why one model version is better than another
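One way those last two signs fit together, sketched with stand-in model functions and a deliberately tiny grader: run the same suite against each candidate version and diff the per-dimension averages, so "v2 is better" always ships with a reason. Every name here is hypothetical.

```python
from statistics import mean

# Stand-ins for two model versions under comparison (hypothetical).
def model_v1(prompt: str) -> str:
    return "It takes a while."

def model_v2(prompt: str) -> str:
    return "The refund takes 5 business days."

CASES = [("How long do refunds take?", "5 business days")]

def score(output: str, expected: str) -> dict:
    # Minimal two-dimension grader; see the fuller sketch above.
    return {
        "accuracy":    float(expected.lower() in output.lower()),
        "helpfulness": float(len(output.split()) >= 5),
    }

def run_suite(model) -> dict:
    """Average each scoring dimension over all eval cases."""
    results = [score(model(p), exp) for p, exp in CASES]
    return {dim: mean(r[dim] for r in results) for dim in results[0]}

# Run this on every model change (e.g., wired into CI) and report the
# per-dimension deltas, so the release note says why one version wins.
v1, v2 = run_suite(model_v1), run_suite(model_v2)
for dim in v1:
    print(f"{dim}: {v1[dim]:.2f} -> {v2[dim]:.2f} ({v2[dim] - v1[dim]:+.2f})")
```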
All Guest Perspectives
Deep dive into what both guests shared about AI evaluation (evals).
Hamel Husain & Shreya Shankar
"Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders."
Brendan Foody
"If the model is the product, then the eval is the product requirement document."
Install This Skill
Add this skill to Claude Code, Cursor, or any AI coding assistant that supports Agent Skills.
Download the skill
Download SKILL.md
Add to your project
Create a folder in your project root and add the skill file:
.claude/skills/ai-evals/SKILL.md
Start using it
Claude will automatically detect and use the skill when relevant. You can also invoke it directly:
Help me with ai evaluation (evals)
Related Skills
Other AI & Technology skills you might find useful.
AI Product Strategy
AI strategy should focus on using algorithms to scale human expertise and judgment rather than just...
View Skill →
Building with LLMs
Using LLMs for text-to-SQL can democratize data access and reduce the burden on data analysts for ad...
View Skill →
Platform Strategy
Platform and ecosystem success comes from identifying 'gardening' opportunities—projects with inhere...
View Skill →
Evaluating New Technology
Be skeptical of 'out-of-the-box' AI solutions for enterprises; real ROI requires a pipeline that acc...
View Skill →