Evaluation Philosophy
Building a system that compares AI models sounds straightforward — send the same prompt to two models, collect a vote. But beneath that simplicity lies a set of design problems that, if left unsolved, make every result meaningless. This page explains the reasoning behind how the Arena evaluates models, and why each decision was made deliberately.

Fair Comparison Is a Design Problem
When two models answer the same prompt, the natural instinct is to assume the comparison is fair. It isn’t — not without careful design. A comparison is only valid when:

- Both models receive the exact same input, constructed identically
- Both models operate under equivalent constraints (token limits, temperature, context window)
- The user does not know which model produced which output at the time of voting
- Results are aggregated across many comparisons, not judged on a single exchange
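The constraints above can be sketched as a single immutable request object that is built once and shared by both sides. This is an illustrative Python sketch, not the Arena's actual API — `ComparisonRequest`, `build_pair`, and the parameter values are hypothetical:

```python
from dataclasses import dataclass


# Hypothetical: one canonical request is constructed once, then handed to
# both models, so neither side sees a differently built input.
@dataclass(frozen=True)
class ComparisonRequest:
    prompt: str
    max_tokens: int
    temperature: float


def build_pair(prompt: str) -> tuple[ComparisonRequest, ComparisonRequest]:
    # Identical constraints for both sides; any asymmetry introduced here
    # (different token limits, different temperatures) invalidates the vote.
    request = ComparisonRequest(prompt=prompt, max_tokens=1024, temperature=0.7)
    return request, request


a, b = build_pair("Summarize the causes of the 1929 crash.")
assert a == b  # both models receive the exact same input
```

Making the request object frozen means any per-model tweak must create a visibly different object, which makes accidental asymmetry easy to catch.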
The Blind Comparison Requirement
The most important structural decision in the Arena is full anonymity — model identities are hidden until after a vote is cast. This is not optional. It is the foundation. When users know which model they are evaluating, every cognitive bias in existence activates:

- Brand familiarity bias — a well-known model feels more authoritative, regardless of output quality
- Expectation bias — a model known for coding is scored higher even on creative writing
- Anchoring — the first model seen is used as a baseline, advantaging it regardless of quality
Prompt Bias Is Real and Systematic
Not all prompts are equal testing environments. Some prompts structurally favor certain model architectures:

- Long, multi-part reasoning tasks favor models with larger context windows
- Conversational prompts favor models optimized for chat fine-tuning
- Code generation tasks favor models trained heavily on programming corpora
- Single-word completion tasks are unreliable differentiators at any scale
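One common mitigation is to balance the prompt mix so no single category dominates a model's rating. A hedged Python sketch — the `sample_prompts` helper and the category names are illustrative, not part of the Arena:

```python
import random


def sample_prompts(
    pool: dict[str, list[str]], per_category: int, seed: int = 0
) -> list[str]:
    # Hypothetical sketch: drawing an equal number of prompts from each
    # category (e.g. reasoning, chat, code) keeps a structurally biased
    # prompt type from dominating the comparison set.
    rng = random.Random(seed)
    batch: list[str] = []
    for category, prompts in pool.items():
        batch.extend(rng.sample(prompts, min(per_category, len(prompts))))
    return batch
```

Stratified sampling like this does not remove prompt bias, but it spreads it evenly across models so the aggregate ranking is not an artifact of one task type.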
Latency ≠ Intelligence
A model that responds in 200ms and one that responds in 4 seconds are not directly comparable on quality — at least not without isolating the latency variable. This matters for two reasons.

First, latency shapes perception of quality. Users exposed to a faster response form an immediate impression of fluency and confidence, even before reading the content. This is well-documented in UX research. A slower response that is objectively better may lose the vote simply because the user had more time to find fault with it.

Second, latency is infrastructure, not model intelligence. A model’s raw capability does not change based on whether it is served on overloaded shared infrastructure or a dedicated accelerator. Conflating speed with intelligence produces a leaderboard that reflects hosting decisions more than model quality.

The Arena addresses this by:

- Measuring and surfacing Time to First Token (TTFT) and Tokens per Second (TPS) as distinct, visible metrics — never hidden inside a quality score
- Presenting both model responses in parallel rather than sequentially, so neither has a temporal advantage over the other
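Both metrics can be derived from per-token timestamps of a streaming response: TTFT is the gap between sending the request and receiving the first token, and TPS is tokens emitted per second of generation after that. An illustrative sketch — `stream_metrics` and its inputs are hypothetical, not the Arena's instrumentation:

```python
def stream_metrics(token_timestamps: list[float], request_time: float) -> dict:
    # Hypothetical sketch. TTFT: delay from request to first token.
    # TPS: tokens per second over the generation window that follows
    # the first token, so TTFT does not contaminate throughput.
    ttft = token_timestamps[0] - request_time
    generation_time = token_timestamps[-1] - token_timestamps[0]
    tps = (
        (len(token_timestamps) - 1) / generation_time
        if generation_time > 0
        else float("inf")
    )
    return {"ttft_s": ttft, "tps": tps}
```

Keeping the two numbers separate is the point: a model can have excellent TTFT (fast first byte) and poor TPS (slow generation), or the reverse, and collapsing them into one score hides which one the infrastructure is responsible for.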
Reproducibility as a Signal of System Integrity
A comparison system that cannot reproduce results consistently is not a measurement system — it is a noise generator. Reproducibility in the Arena means:

- The same prompt, sent to the same model, with the same parameters, should produce outputs within a predictable range of variation
- The ELO rating of a model should reflect a stable long-run average, not oscillate wildly due to small sample effects
- Vote data should be attributable and timestamped, enabling audits of rating drift over time
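The standard Elo update rule shows why ratings settle into a stable long-run average: each vote moves a rating by at most K points, scaled by how surprising the outcome was. A minimal Python sketch — the K-factor and 400-point scale are the conventional defaults, not necessarily the Arena's parameters:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    # Standard Elo: expected score is a logistic function of the rating gap
    # (a 400-point advantage predicts ~10:1 odds). K caps the per-vote swing,
    # so one upset cannot move a rating more than K points.
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta
```

An expected win (a highly rated model beating a low-rated one) moves the ratings only slightly, while an upset moves them more — which is exactly what damps small-sample oscillation that a raw win rate would amplify.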
Why This Matters
These are not abstract principles. Each one represents a category of failure that produces a broken leaderboard:

| If you skip… | The failure is… |
|---|---|
| Blind comparison | Users vote on brand, not output |
| Consistent inputs | Apples-to-oranges model prompting |
| Latency separation | Hosting quality masquerades as model intelligence |
| ELO over win rate | Small sample fluctuations look like stable rankings |
| Reproducibility constraints | The leaderboard drifts without explanation |
These principles inform how the Arena is designed, not just how it is documented. If you are building evaluation tooling on top of the API, the same reasoning applies to how you interpret and aggregate the data you receive.