AI Engineering Notes: Evaluation Methodology (Ch.3)

These are my personal summaries and observations from reading AI Engineering by Chip Huyen. They are not a comprehensive review — just the notes and takeaways that stood out to me. I highly recommend picking up the book for the full picture.


This chapter dives deep into the evaluation of AI systems. Evaluation is arguably the most critical and yet most under-invested area in production AI — and this chapter does a great job laying out the landscape.

Introduction

Many echo the view that the biggest hurdle to bringing AI applications to production is evaluation. I see plenty of “experts” writing about this on LinkedIn.

[Image: Skylar Payne's LinkedIn post about the AI Agents in Production conference, arguing that production AI systems are delivering real results, that evaluation frameworks are critical, and that the market is maturing]

Evaluation goes hand in hand with logging, tracing, and monitoring. Without these components, engineers and product owners can’t understand the quality of an AI system or improve it. Traditional NLP models relied on reference-based evaluation, computing scores such as ROUGE and BLEU against a ground-truth reference.
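To make the reference-based idea concrete, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. This is a simplification, not the actual ROUGE or BLEU implementation (those add n-grams, brevity penalties, stemming options, etc.):

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified reference-based score: F1 over unigram overlap
    between a candidate output and a ground-truth reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each candidate token counts at most as often
    # as it appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "the cat lay on the mat"))  # ≈ 0.83
```

The clipped-overlap trick is the same one BLEU uses to stop a candidate from gaming the metric by repeating a common reference word.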

Challenges of Evaluating Foundation Models

Standard Language Modeling Metrics

Entropy-Based Metrics

Perplexity on a dataset requires the logits of the model's predictions, so it is generally only computable for open-weight models.
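As a reminder of what the metric actually is, here is a minimal sketch: perplexity is the exponential of the average negative log-likelihood per token, computed from the log-probabilities the model assigned to the observed tokens (which is why you need the logits):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).
    token_logprobs: natural-log probabilities the model assigned
    to each observed token, derived from its logits."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a model that assigns probability 0.25 to every
# observed token has perplexity 4 (it is "choosing among 4 options").
print(perplexity([math.log(0.25)] * 10))  # → 4.0 (up to float rounding)
```

Lower is better: perplexity 1 means the model predicted every token with certainty.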

Key observations about perplexity:

Exact Evaluation

LLM-as-a-Judge

[Diagram: an AI Judge pipeline — (Question, Answer) fed into a prompt template encoding the criteria, producing a final prompt sent to a GPT-4 model, which outputs a score]

Benefits of the LLM Judge

How Does an LLM Judge Work?

A prompt template defines the evaluation criteria, takes a (question, answer) pair, sends the filled-in prompt to a model (e.g. GPT-4), and parses a score from its output.
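The flow above can be sketched in a few lines. The template wording, the 1–5 scale, and the score-parsing regex are all my own illustrative choices, not anything prescribed by the book; the actual call to the judge model (e.g. GPT-4) is elided, since any chat-completion API would slot in there:

```python
import re

# Hypothetical judge prompt; real templates usually spell out
# the criteria (correctness, helpfulness, style) in more detail.
JUDGE_TEMPLATE = """You are an impartial judge. Rate the answer to the
question on a scale of 1 to 5 for correctness and helpfulness.
Respond with only the number.

Question: {question}
Answer: {answer}
Score:"""

def build_judge_prompt(question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_score(response: str) -> int:
    """Extract the first digit 1-5 from the judge model's response."""
    match = re.search(r"[1-5]", response)
    if match is None:
        raise ValueError(f"unparseable judge response: {response!r}")
    return int(match.group())

prompt = build_judge_prompt("What is 2+2?", "4")
# response = call_judge_model(prompt)  # elided: send to GPT-4 or similar
print(parse_score("Score: 4"))  # → 4
```

Parsing is where these pipelines tend to break in practice, which is why constraining the judge to "respond with only the number" (or using structured output) matters.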

Limitations of the LLM Judge

How to Choose Models as a Judge?

Comparative Evaluation (Elo Rating)

A prominent example is the LMSys Chatbot Arena, where users compare outputs of different models on the same prompt (ties are allowed). It works surprisingly well in practice.

Not all questions should be answered by preference.

The metric used when pitting model A against model B is called the “win rate”: if model A is preferred in 6 of 10 matches, its win rate is 60%.
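Win rate and a classic Elo update can be sketched as below. Note this is textbook Elo with my own choice of K-factor; the arena's actual rating method may differ, but the mechanics of turning pairwise preferences into a ranking are the same:

```python
def win_rate(wins: int, matches: int) -> float:
    """Fraction of head-to-head matches won by model A."""
    return wins / matches

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a match between A and B.
    score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

print(win_rate(6, 10))            # → 0.6
print(elo_update(1000, 1000, 1))  # equal ratings: winner gains 16 points
```

Because expected scores depend on the rating gap, an upset win over a much stronger model moves ratings far more than beating a weak one, which is what makes Elo-style rankings more informative than raw win rate alone.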

Challenges of Comparative Evaluation

The Direction Forward

The direction is probably a mix of human-in-the-loop, LLM judge, and rule-based evaluation.

Some tips from production:

References