Your model scores 94.2% on MMLU. Your BLEU scores are in the 95th percentile. Your safety evals pass with flying colors. Users hate your product.
This isn't a hypothetical scenario. It's a familiar reality for many AI applications in production. While the ML community has built sophisticated evaluation frameworks for model capabilities, we've largely ignored the engineering challenge of measuring what actually matters: user experience.
The Evaluation Gap
Traditional AI evaluations optimize for technical correctness within controlled environments. They measure what models can do, not what they should do in real user contexts.
Consider these standard evaluation approaches:
Academic Benchmarks (MMLU, HellaSwag, ARC-AGI, AIME, Humanity's Last Exam)
- Measure factual knowledge, reasoning capability, and problem-solving
- Use standardized multiple-choice, completion, or reasoning tasks
- Designed for model comparison, not user experience prediction
NLP Metrics (BLEU, ROUGE, BERTScore)
- Compare generated text to reference outputs
- Focus on lexical overlap and semantic similarity
- Assume there's a "correct" output for every input
Safety Evaluations (Constitutional AI, Red Teaming)
- Test for harmful or biased outputs
- Use adversarial prompts and edge cases
- Optimize for avoiding negative outcomes, not creating positive ones
These evaluations tell us whether a model is technically competent and safe. They don't tell us whether users will find the model helpful, trustworthy, or pleasant to interact with.
The LLM Grading Systems Problem
The latest evolution in AI evaluation uses large language models to evaluate other LLMs, promising more nuanced assessment than traditional metrics. Popular platforms like LMSYS's Chatbot Arena, along with automated evaluation services, have made LLM-as-a-judge approaches standard practice.
The fundamental problem is that LLM judges often prefer responses that are comprehensive, technically accurate, and well-structured. These are exactly the qualities that can overwhelm or frustrate real users. The judge optimizes for what sounds good to another AI system, not what works for humans in context.
# LLM judge prefers this response:
judge_favorite = """
I'd be happy to assist you with that request. Based on your query,
I can provide several comprehensive approaches to address your needs,
including detailed explanations and relevant context...
"""
# Users actually prefer this response:
user_favorite = "Sure thing! Here's what you need..."
# Production data:
# Judge favorite: 67 thumbs down, 12 thumbs up
# User favorite: 8 thumbs down, 73 thumbs up
This creates a measurement system that actively selects against user preferences while appearing sophisticated and thorough.
Where Traditional Evals Fail
Problem 1: Context Collapse
Traditional evaluations strip away the context that determines whether an AI response provides actual value to users.
Take a simple example: a user asks "What is the capital of France?" A traditional evaluation measures whether the AI correctly responds "Paris" and scores it perfectly. But if the user was planning a business trip, that technically correct response provides zero value. They needed flight information, hotel recommendations, meeting venues, or local business customs.
The evaluation framework optimizes for answering the literal question rather than helping the user accomplish their underlying goal. This creates AI systems that are technically accurate but practically useless.
Problem 2: Reference Output Bias
Most evaluation frameworks assume there's a single "correct" response for any given input. This works for factual questions but completely breaks down for creative, subjective, or context-dependent tasks.
Consider evaluating email tone. A formal business email template might score highly against a reference output focused on "professional communication." But users in casual company cultures find that same email robotic and off-putting. The evaluation system rewards responses that match an arbitrary reference rather than responses that actually work for the specific user and context.
BLEU scores exemplify this problem. They measure n-gram overlap between generated text and reference text, essentially rewarding AI systems for reproducing expected patterns rather than being genuinely helpful. A response with zero BLEU score might be exactly what the user needed, while a high-scoring response might completely miss the mark.
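To make that concrete, here's a minimal sketch of the overlap logic n-gram metrics reward. It computes only unigram precision with naive tokenization (no higher-order n-grams or brevity penalty, so it is not full BLEU), and the example strings are invented:

from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate tokens that also appear in the reference."""
    def tokens(text):
        return [t.strip(".,;!?").lower() for t in text.split()]
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    matches = sum(min(n, ref[tok]) for tok, n in cand.items())
    total = sum(cand.values())
    return matches / total if total else 0.0

reference = "The capital of France is Paris."
literal = "Paris is the capital of France."
helpful = "Your flight lands at CDG at 9am; here are three hotels near your meeting venue."

print(unigram_precision(literal, reference))  # 1.0 -- full overlap with the reference
print(unigram_precision(helpful, reference))  # 0.0 -- no overlap, yet far more useful to a trip planner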
Problem 3: Static Distribution Assumptions
Traditional evaluations use fixed datasets that don't evolve with changing user needs, cultural shifts, or domain evolution. They're like taking a snapshot of user preferences in 2022 and assuming those preferences remain constant forever.
Email communication styles have shifted dramatically toward casual, direct communication. Article summaries increasingly need bullet points rather than paragraphs for mobile consumption. But evaluation datasets frozen in time continue optimizing AI systems for outdated user expectations.
This creates a growing disconnect between what evaluation systems reward and what users actually want. The AI gets better at producing 2022-style outputs while users have moved on to 2024 expectations.
Problem 4: Missing Temporal Dynamics
Evaluations typically measure single-turn performance, completely ignoring how user satisfaction changes over time and repeated interactions. This is like judging a restaurant based on one dish rather than the entire dining experience.
User trust and satisfaction have temporal dynamics that single-point measurements can't capture. A user might initially be impressed by comprehensive AI responses but become frustrated after several interactions when they realize they consistently need to extract key information from verbose outputs. Conversely, a user might initially find brief responses inadequate but develop appreciation for their efficiency over time.
Traditional evaluations miss these learning curves, adaptation patterns, and relationship dynamics that determine long-term user success with AI systems.
Problem 5: Aggregation Artifacts
Traditional metrics average performance across diverse use cases, hiding critical failure modes that affect specific user segments. An overall accuracy score of 85% might look impressive, but if it represents 95% accuracy for power users and 65% accuracy for casual users, you're failing most of your user base.
Different user segments have fundamentally different success criteria. Power users might appreciate detailed technical explanations, while casual users need simple, actionable guidance. Mobile users need concise responses that work on small screens, while desktop users can handle more comprehensive information. Enterprise users might prioritize accurate, defensible outputs, while consumer users care more about speed and ease of use.
Aggregate metrics obscure these segment-specific patterns, leading to AI systems optimized for an average user who doesn't actually exist.
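A toy calculation shows how the aggregate hides the split. The segment names come from the example above; the traffic shares are assumptions, chosen because a roughly 2:1 split toward power users is what makes 95% and 65% average out to an 85% headline number:

segment_accuracy = {"power_users": 0.95, "casual_users": 0.65}
traffic_share = {"power_users": 2 / 3, "casual_users": 1 / 3}   # assumed weights

headline = sum(segment_accuracy[s] * traffic_share[s] for s in segment_accuracy)
print(f"headline accuracy: {headline:.2f}")        # 0.85 -- looks healthy

for segment, accuracy in segment_accuracy.items():
    print(f"{segment}: {accuracy:.2f}")            # 0.95 vs 0.65 -- casual users are quietly failing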
The Cascade Effect: How Evaluation Gaps Compound Into Business Problems
The real cost of inadequate AI evaluation isn't just poor user experience—it's how negative experiences cascade through organizations, creating lasting resistance to AI adoption.
The Antibody Formation Process
A single bad AI experience, one that scored well on traditional metrics, can influence enterprise decisions months later through a progression like this:
Week 1: A marketing manager tries an AI writing assistant to draft a pricing announcement email. The AI produces grammatically perfect, professionally toned content that covers all the required topics. Traditional evaluation metrics would score this highly.
Week 2: The marketing manager tells her team the AI "doesn't understand our brand voice" because the output was too formal for their casual company culture.
Week 3: The team stops experimenting with AI writing tools entirely.
Month 2: When asked about AI tool budget allocation, marketing requests to redirect funds away from AI assistance.
Quarter 2: The company becomes generally skeptical of AI vendor pitches.
One evaluation system optimized for "professional tone" rather than "brand alignment" ended up influencing enterprise purchasing decisions two quarters later. The technical metrics suggested success while the user experience created organizational antibodies against AI adoption.
Trust Debt Accumulation
Organizations accumulate "trust debt" with AI systems. Each negative experience makes future AI adoption harder, even for unrelated use cases. Sales teams that had bad experiences with AI email generation become skeptical of AI customer analysis tools. Support teams frustrated by AI response suggestions resist AI ticket routing systems.
This trust debt compounds across teams and use cases because organizations share stories about AI failures much more readily than successes. A single bad demo can influence buying committee decisions. A frustrated developer's experience with an AI coding assistant can slow adoption of AI tools across an entire engineering organization.
Traditional evaluation systems can't detect these cascade effects because they measure immediate technical performance rather than long-term organizational confidence in AI capabilities.
The Developer Experience Poison
The most damaging cascades occur when AI tools fail developers, because developers influence technology adoption across entire organizations. When an AI coding assistant suggests technically correct but contextually inappropriate code (code that violates company security policies, ignores existing patterns, or doesn't integrate with established infrastructure), the developer loses trust in AI assistance generally.
That developer then shares their skepticism with colleagues, influences tool selection discussions, and shapes the organization's approach to AI adoption. Traditional evaluations focused on syntactic correctness miss the organizational context that determines whether AI suggestions actually help or hinder developer productivity.
The Engineering Challenge of User-Centric Metrics
Building evaluation systems that capture user experience requires solving several technical challenges that traditional ML evaluations sidestep.
Signal Sparsity and Noise: User feedback is sparse, delayed, and noisy compared to automated metrics. You might get explicit feedback on 2-5% of interactions, with significant selection bias toward extreme experiences. Traditional sampling assumptions break down when dealing with small sample sizes and high variance in user responses.
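A rough simulation makes the point. The feedback probabilities below are invented, chosen only to reflect the assumption that extreme (especially negative) experiences are far more likely to generate a vote than mediocre ones:

import random

random.seed(0)
N = 100_000
satisfied_count = 0
feedback = []

for _ in range(N):
    score = random.random()              # latent satisfaction in [0, 1]
    satisfied_count += score > 0.5
    if score < 0.1:
        p_feedback = 0.20                # assumed: bad experiences prompt the most votes
    elif score > 0.9:
        p_feedback = 0.10                # assumed: great experiences prompt some votes
    else:
        p_feedback = 0.02                # assumed: mediocre experiences rarely prompt any
    if random.random() < p_feedback:
        feedback.append(1 if score > 0.5 else 0)

print(f"true share satisfied:    {satisfied_count / N:.2f}")             # ~0.50
print(f"feedback coverage:       {len(feedback) / N:.1%}")               # ~4-5% of interactions
print(f"observed thumbs-up rate: {sum(feedback) / len(feedback):.2f}")   # ~0.39 -- skewed low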
Time-Weighted Signal Processing: User trust and satisfaction have temporal dynamics that require sophisticated analysis. Recent feedback might be more predictive of future experience than historical averages. Trust tends to erode quickly but rebuild slowly. Evaluation systems need to detect these trends early enough to take corrective action.
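One simple way to encode that recency bias is exponential decay. This is a minimal sketch: the half-life is an arbitrary placeholder, and the feedback events are made up:

import math

def time_weighted_score(events, half_life_days=14.0):
    """Average feedback scores, weighting recent events more heavily."""
    decay = math.log(2) / half_life_days
    num = den = 0.0
    for days_ago, score in events:
        weight = math.exp(-decay * days_ago)
        num += weight * score
        den += weight
    return num / den if den else None

# (days_ago, score) pairs: 1 = thumbs up, 0 = thumbs down -- illustrative history
events = [(45, 1), (30, 1), (20, 1), (7, 0), (3, 0), (1, 0)]
print(f"naive average:         {sum(s for _, s in events) / len(events):.2f}")  # 0.50
print(f"time-weighted average: {time_weighted_score(events):.2f}")              # ~0.22 -- trust is eroding right now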
Multi-Modal Signal Fusion: User experience emerges from multiple signals including explicit feedback, behavioral patterns, and contextual factors. These signals need intelligent combination. The challenge lies in weighting thumbs-up votes against session abandonment rates while accounting for user expertise levels and device constraints when interpreting feedback.
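Even a crude version of this fusion is a weighted sum over normalized signals. The signal names and weights below are assumptions for illustration, not a recommended scheme; the hard part in practice is learning those weights per segment and context:

def fuse_experience_signals(explicit_rating, task_completion_rate, abandonment_rate,
                            weights=(0.5, 0.3, 0.2)):
    """Blend explicit and behavioral signals into a single score in [0, 1]."""
    w_explicit, w_complete, w_abandon = weights
    return (w_explicit * explicit_rating              # thumbs-up ratio, survey scores
            + w_complete * task_completion_rate       # did the user finish what they started?
            + w_abandon * (1.0 - abandonment_rate))   # abandoning sessions counts against

# A user whose occasional votes look fine but who keeps abandoning sessions mid-task:
score = fuse_experience_signals(explicit_rating=0.8,
                                task_completion_rate=0.4,
                                abandonment_rate=0.6)
print(f"{score:.2f}")   # 0.60 -- noticeably weaker than the votes alone suggest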
Statistical Significance with Small Samples: Unlike traditional ML datasets with millions of examples, user feedback often has small sample sizes that require careful statistical handling. Popular benchmarking platforms generate thousands of synthetic comparisons, but real product usage might yield only dozens of explicit feedback events per user segment.
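A standard tool here is the Wilson score interval, which behaves sensibly at the sample sizes real feedback actually produces. A minimal sketch, with an invented segment of 20 feedback events:

import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% confidence interval for a proportion; stays sane at small n."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return (center - margin, center + margin)

# 14 thumbs up out of 20 feedback events in one user segment:
low, high = wilson_interval(14, 20)
print(f"raw rate: {14 / 20:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")  # wide interval -- don't over-react yet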
Real-Time Adaptation: User-centric metrics need to support real-time decision making, not just post-hoc analysis. Evaluation systems need to detect declining user satisfaction quickly enough to trigger interventions before negative experiences cascade through organizations.
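A minimal version of that early detection is a rolling comparison of recent feedback against a longer baseline. The window sizes and threshold below are arbitrary placeholders; a production system would tune them per segment:

from collections import deque

class SatisfactionMonitor:
    def __init__(self, baseline_size=200, recent_size=20, drop_threshold=0.15):
        self.baseline = deque(maxlen=baseline_size)   # long-run feedback history
        self.recent = deque(maxlen=recent_size)       # most recent feedback only
        self.drop_threshold = drop_threshold

    def record(self, score: float) -> bool:
        """Record one feedback score (0 or 1); return True if satisfaction has dropped."""
        self.baseline.append(score)
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False                              # not enough recent signal yet
        baseline_rate = sum(self.baseline) / len(self.baseline)
        recent_rate = sum(self.recent) / len(self.recent)
        return (baseline_rate - recent_rate) > self.drop_threshold

monitor = SatisfactionMonitor()
for score in [1] * 180 + [0] * 20:                    # satisfaction collapses at the end
    if monitor.record(score):
        print("alert: recent satisfaction is well below baseline")
        break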
Breaking the Cascade: Early Warning Systems
Understanding how evaluation gaps compound into organizational problems suggests a different approach to AI measurement: build systems that detect cascade risks before they propagate.
Instead of evaluating AI responses in isolation, measure them within the broader context of user workflows and organizational culture. Consider whether responses match the company's communication style, whether outputs will work in high-stakes presentations, whether non-expert users can successfully use the functionality, and whether interactions build or erode confidence in AI capabilities.
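One way to operationalize that is to attach contextual checks to each evaluated response instead of a single quality score. The field names below are hypothetical, mirroring the questions above:

from dataclasses import dataclass

@dataclass
class ContextualEval:
    response_id: str
    matches_org_voice: bool         # does it fit the company's communication style?
    presentation_ready: bool        # would it hold up in a high-stakes presentation?
    usable_by_non_experts: bool     # can a casual user act on it without help?
    builds_confidence: bool         # did the interaction leave the user more trusting of the tool?

    def cascade_safe(self) -> bool:
        """Treat any failed contextual check as a potential cascade trigger."""
        return all([self.matches_org_voice, self.presentation_ready,
                    self.usable_by_non_experts, self.builds_confidence])

evaluation = ContextualEval("resp_123", matches_org_voice=False, presentation_ready=True,
                            usable_by_non_experts=True, builds_confidence=False)
print(evaluation.cascade_safe())    # False -- technically fine, organizationally risky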
Early warning systems should track organizational network effects by:
- Identifying users with high influence on AI adoption decisions
- Monitoring whether any of those high-influence users are experiencing declining satisfaction with AI tools
- Tracking teams with stagnant adoption rates
- Detecting feedback patterns that suggest systematic problems rather than isolated issues
The goal is identifying users whose negative experiences are likely to influence broader organizational AI adoption and intervening before their skepticism spreads.
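As a sketch of what that flagging might look like, cross-reference each user's influence on adoption decisions with their recent satisfaction trend. Both inputs and the cutoffs below are hypothetical; estimating influence is its own problem:

from dataclasses import dataclass

@dataclass
class UserSignal:
    user_id: str
    influence_score: float       # assumed input, e.g. derived from role or tool-admin status
    satisfaction_trend: float    # recent minus baseline satisfaction, in [-1, 1]

def cascade_risks(users, influence_cutoff=0.7, trend_cutoff=-0.1):
    """Return high-influence users whose satisfaction is declining."""
    return [u for u in users
            if u.influence_score >= influence_cutoff and u.satisfaction_trend <= trend_cutoff]

users = [
    UserSignal("marketing_manager", influence_score=0.9, satisfaction_trend=-0.3),
    UserSignal("casual_user_42", influence_score=0.2, satisfaction_trend=-0.4),
    UserSignal("staff_engineer", influence_score=0.8, satisfaction_trend=+0.1),
]
for user in cascade_risks(users):
    print(f"intervene with {user.user_id} before skepticism spreads")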
The Path Forward
Traditional AI evaluations will remain important for measuring model capabilities, safety, and technical performance. But they're insufficient for building AI products that organizations actually want to keep using.
The future belongs to hybrid evaluation systems that combine:
- Technical rigor from traditional ML evaluation methods
- User experience focus from product analytics
- Real-time adaptation from modern observability systems
- Organizational systems thinking that accounts for cascade effects
The companies that solve this measurement problem will build AI products that don't just work correctly in controlled environments—they create positive adoption spirals that compound over time. Instead of individual users having good experiences, entire organizations develop confidence in AI capabilities.
Building these systems requires treating user experience measurement as a first-class engineering problem with the same rigor we apply to model training and safety evaluation. The technical challenges are substantial, but the business impact of solving them is even more substantial.
Every AI company that figures out user-centric evaluation gains a sustainable competitive advantage. This comes not just from better user experience, but from the positive organizational cascades that follow.
The gap between "model works in eval" and "organization trusts AI" is where the next generation of AI companies will win or lose. Traditional benchmarks can't bridge that gap. Only systems that measure and optimize for real human experience in real organizational contexts can.
The most successful AI companies of the next decade won't just have the best models. They'll have the best understanding of how AI success and failure propagate through human systems.