Measuring Performance
In addition to response quality, DualMind Arena tracks model execution speed. For many applications, a slightly “worse” model might be preferred if it is significantly faster and cheaper. We track several key metrics for every generation:
1. Time to First Token (TTFT)
The time elapsed between sending your request and receiving the very first character of the response.
- Why it matters: Crucial for perceived responsiveness in chat applications.
- Typical values: < 200 ms (Excellent) to > 1000 ms (Sluggish).
2. Output Speed (Tokens/Sec)
How fast the model generates text once it starts writing.
- Why it matters: Determines how quickly a long summary or code block finishes.
- Typical values: 30 t/s (Readable) to 100+ t/s (Instant).
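
As a rough illustration of how these timings can be derived from a streamed response, here is a minimal sketch. It is not DualMind Arena's actual instrumentation: `measure_stream`, `GenerationMetrics`, and `fake_stream` are hypothetical names, and the sketch assumes each item yielded by the stream counts as one token (real streaming APIs may deliver several tokens per chunk). It also records total wall-clock latency, the third metric described below.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class GenerationMetrics:
    ttft_s: float           # time to first token, in seconds
    tokens_per_s: float     # output speed once the model has started writing
    total_latency_s: float  # wall-clock time from request to completion
    token_count: int


def measure_stream(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> GenerationMetrics:
    """Time one streamed generation.

    `start_stream` is a hypothetical callable that sends the request and
    returns an iterable yielding one token per item.
    """
    t_start = clock()
    t_first = None
    count = 0
    for _ in start_stream():
        if t_first is None:
            t_first = clock()  # first token has arrived
        count += 1
    t_end = clock()

    if t_first is None:
        raise ValueError("stream produced no tokens")

    # Output speed excludes the wait for the first token: tokens generated
    # after the first one, divided by the time elapsed since the first one.
    decode_time = t_end - t_first
    tokens_per_s = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0

    return GenerationMetrics(
        ttft_s=t_first - t_start,
        tokens_per_s=tokens_per_s,
        total_latency_s=t_end - t_start,
        token_count=count,
    )


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a real model call: a short startup delay, then tokens.
        time.sleep(0.05)
        for i in range(20):
            yield f"tok{i}"
            time.sleep(0.01)

    m = measure_stream(fake_stream)
    print(f"TTFT: {m.ttft_s * 1000:.0f} ms")
    print(f"Output speed: {m.tokens_per_s:.1f} t/s")
    print(f"Total latency: {m.total_latency_s * 1000:.0f} ms")
```

The clock is passed in as a parameter so the function can be checked with fixed timestamps. In normal use the default `time.perf_counter` is a reasonable choice because it is monotonic and intended for measuring short intervals.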
3. Total Latency
The total wall-clock time from request to completion.
The Trade-off Triangle
When evaluating models in the Arena, you’ll often notice a trade-off:
- Quality (Reasoning depth)
- Speed (Inference velocity)
- Cost (Compute resources)
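
One way to make this trade-off concrete is to keep only the models that no other model beats on all three axes at once (a Pareto filter). The sketch below is an illustration under stated assumptions, not a feature of the Arena: `ModelStats`, `dominates`, and `pareto_front` are hypothetical names, and the model names and numbers are made-up placeholders, not measured results.

```python
from typing import List, NamedTuple


class ModelStats(NamedTuple):
    name: str
    quality: float       # higher is better (any quality score)
    tokens_per_s: float  # higher is better
    cost: float          # lower is better (arbitrary price unit)


def dominates(a: ModelStats, b: ModelStats) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    no_worse = (
        a.quality >= b.quality
        and a.tokens_per_s >= b.tokens_per_s
        and a.cost <= b.cost
    )
    strictly_better = (
        a.quality > b.quality
        or a.tokens_per_s > b.tokens_per_s
        or a.cost < b.cost
    )
    return no_worse and strictly_better


def pareto_front(models: List[ModelStats]) -> List[ModelStats]:
    """Return the models that are not dominated by any other model."""
    return [m for m in models if not any(dominates(other, m) for other in models)]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    candidates = [
        ModelStats("model-a", quality=0.90, tokens_per_s=30.0, cost=10.0),
        ModelStats("model-b", quality=0.70, tokens_per_s=100.0, cost=2.0),
        ModelStats("model-c", quality=0.60, tokens_per_s=80.0, cost=3.0),
    ]
    for m in pareto_front(candidates):
        print(m.name)
```

Here `model-c` is dropped because `model-b` is better on quality, speed, and cost, while `model-a` and `model-b` both remain: one wins on quality, the other on speed and cost, which is exactly the kind of choice the triangle describes.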