Comparison Modes
DualMind Arena offers three distinct ways to evaluate models, each serving a different research purpose.

1. Random Battle (Discovery)
In Random Battle, the system selects two different models from the active pool at random. Best for:

- Unbiased Discovery: Finding hidden gems among smaller or open-source models.
- Fair Leaderboard Data: Contributing the most valuable, unbiased data to the global rankings.
- General Testing: Checking how models handle a wide variety of prompts without preconceptions.
This mode contributes the highest weight to the global Elo leaderboard because its pairings are the most statistically neutral.
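As a sketch, the random pairing described above amounts to sampling two distinct models from the active pool. The pool contents below are placeholders, not DualMind Arena's actual roster:

```python
import random

# Illustrative model pool; names are placeholders only.
ACTIVE_POOL = ["model-alpha", "model-beta", "model-gamma", "model-delta"]

def random_battle(pool):
    """Select two distinct models uniformly at random, as in Random Battle mode."""
    return random.sample(pool, 2)  # sample() guarantees the two picks differ

model_a, model_b = random_battle(ACTIVE_POOL)
```

Uniform sampling over all pairs is what makes this mode statistically neutral: every model meets every other model with equal probability over time.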
2. Topper Battle (Challenge)
In Topper Battle, the current #1 ranked model (the “King of the Hill”) is paired against a random challenger. Best for:

- Stress Testing Strategies: Specifically trying to break the top model to see if others can handle edge cases better.
- Verification: Confirming if the top-ranked model truly deserves its spot across different domains (coding, creative writing, math).
- Update Checks: Quickly checking if a newly released model can dethrone the current champion.
3. Side-by-Side (Manual Selection)
In Side-by-Side mode, you explicitly choose Model A and Model B from the dropdown menu. Best for:

- A/B Testing: Directly comparing two specific versions (e.g., Llama 3 vs. Llama 2).
- Regression Testing: Checking if a specific model response has degraded compared to another.
- Use-Case Validation: Testing two specific models known for coding strength to see which performs better in your programming language.
Voting Options
When voting, accuracy is key. You have four options:

| Vote | Meaning | Statistical Impact |
|---|---|---|
| Left is Better | Model A provided a distinctly better response | Model A wins, Model B loses |
| Right is Better | Model B provided a distinctly better response | Model B wins, Model A loses |
| Tie | Both models gave high-quality, helpful responses | Points shared (benefits lower-rated model) |
| Both Bad | Neither model followed instructions or gave a good answer | No points awarded; flagged for review |
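The outcomes in the table map naturally onto a standard Elo update, where a win scores 1, a loss 0, and a tie 0.5. This is a sketch using the textbook formula with an assumed K-factor of 32; it is not DualMind Arena's documented rating math:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def apply_vote(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """outcome_a: 1.0 = 'Left is Better', 0.0 = 'Right is Better', 0.5 = 'Tie'.
    A 'Both Bad' vote applies no update at all."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - e_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b

# A tie between unequal ratings shifts points toward the lower-rated model,
# because the underdog "overperformed" its expected score.
low, high = apply_vote(1000, 1200, 0.5)
```

This also explains the table's note that a Tie benefits the lower-rated model: its expected score against a stronger opponent is below 0.5, so scoring exactly 0.5 earns it points.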
Best Practices for Comparison
To get the most out of DualMind Arena:

- Use Challenging Prompts: Simple “Hello” prompts don’t reveal intelligence. Ask for complex reasoning, code generation, or creative nuance.
- Check Constraints: Did the model follow negative constraints? (e.g., “Write a story without using the letter ‘e’”).
- Verify Facts: For factual queries, check whether either model hallucinated incorrect information.
- Ignore Formatting: Try to look past bolding or bullet points and focus on the content quality.
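Checks like the negative-constraint example above can often be automated. This is a hypothetical helper for the letter-“e” constraint, not part of DualMind Arena:

```python
def violates_negative_constraint(response: str, forbidden_letter: str = "e") -> bool:
    """Return True if the response contains the forbidden letter (case-insensitive)."""
    return forbidden_letter.lower() in response.lower()

# A story written without the letter 'e' passes the check.
clean = violates_negative_constraint("A dog ran fast at dawn.")
# Any response containing 'e' violates the constraint.
dirty = violates_negative_constraint("The end.")
```

Mechanical checks like this keep constraint-following votes consistent, since it is easy to miss a single stray character when reading two long responses side by side.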