Product Roadmap

DualMind Arena is in active development. This page reflects our current thinking on what we are building next, organized by theme. Priorities are informed by usage patterns, platform stability, and the goal of making AI model evaluation as rigorous and useful as possible.
This roadmap reflects intent, not a delivery commitment. We ship incrementally and adjust based on what we learn.

In Progress

Expanded Model Roster

Continuous integration of new frontier models as providers release them. Our goal is zero lag between a model's public release and its availability in Arena comparisons.

Focus areas:
  • Automated model ingestion pipeline
  • Provider health monitoring and fallback
  • Model metadata standardization across providers

Category-Specific Leaderboards

The global ELO leaderboard treats all prompts equally. Category boards will segment rankings by task type, so you can see which model leads on code, which leads on reasoning, and which leads on creative writing.

Planned categories:
  • Code generation & debugging
  • Logical reasoning & math
  • Summarization & compression
  • Creative writing & tone control
  • Instruction following & structured output

Upcoming

1. Anonymous vs. Identified Voting Study

We plan to run a controlled study comparing vote distributions when model names are shown before voting vs. hidden until after. This will produce a quantified measure of brand bias, a number we intend to publish.

This is foundational to the platform's credibility: if blind and identified votes diverge significantly, it validates the core design. If they don't, we learn something equally important.
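
A divergence between blind and identified vote distributions can be checked with a standard two-proportion z-test. The sketch below uses hypothetical counts and is only an illustration of the statistics involved, not the study's actual analysis pipeline.

```python
from math import sqrt, erf

def two_proportion_z(wins_blind, n_blind, wins_named, n_named):
    """Two-sided z-test: does a model's win rate change when its name is shown?"""
    p_blind = wins_blind / n_blind
    p_named = wins_named / n_named
    # pool the two samples to estimate the standard error under the null
    pooled = (wins_blind + wins_named) / (n_blind + n_named)
    se = sqrt(pooled * (1 - pooled) * (1 / n_blind + 1 / n_named))
    z = (p_blind - p_named) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: a well-known model wins 62% of blind votes but 70% once named
z, p = two_proportion_z(620, 1000, 700, 1000)
```

With these made-up counts the gap is far too large to be sampling noise, which is exactly the kind of result the study is designed to detect or rule out.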
2. Prompt Difficulty Scoring

Not all prompts are equally useful for distinguishing model quality. A prompt like "say hello" produces nearly identical outputs from any capable model. A prompt that stress-tests reasoning or instruction-following produces a much more informative vote.

We are building an automatic difficulty scorer that estimates how discriminative a prompt is likely to be, and weights votes accordingly in ELO calculations.
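
One simple way to fold a difficulty score into rating updates is to scale the Elo K-factor by the prompt's estimated discriminativeness. The weighting scheme below is an illustrative assumption, not the production formula:

```python
def weighted_elo_update(rating_a, rating_b, score_a, weight, k=32):
    """One Elo update with the K-factor scaled by prompt difficulty.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    weight:  estimated discriminativeness in [0, 1]; a "say hello"-style
             prompt gets a weight near 0 and barely moves ratings.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * weight * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

An upset win on a hard prompt moves ratings the full K-factor's worth; the same vote on a trivial prompt is nearly ignored.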
3. Multi-Turn Conversation Comparisons

Current comparisons are single-turn — one prompt, two responses, one vote. Multi-turn comparison lets users evaluate model behavior over an extended conversation, capturing qualities like:
  • Context retention across messages
  • Consistency of tone and persona
  • Recovery from user-introduced errors
  • Willingness to update prior conclusions
This is a significantly harder UX problem than single-turn and will require redesigning the voting interface.
4. Team Workspaces

Organizations want to run structured evaluation campaigns — sets of pre-defined prompts, run against a specific set of models, with aggregated results across their team.

Team Workspaces will support:
  • Shared prompt libraries
  • Private leaderboards scoped to a workspace
  • Exportable results for reporting
  • Role-based access (admin, contributor, viewer)
5. Custom Evaluation Rubrics

Right now, every vote is binary: Model A is better, Model B is better, or it’s a tie. Custom rubrics will allow structured evaluation across multiple axes simultaneously:
  • Accuracy (factual correctness)
  • Conciseness (did it over-explain?)
  • Safety (did it refuse appropriately or refuse unnecessarily?)
  • Instruction adherence (did it follow the format specified?)
Each axis gets its own score, producing a richer evaluation profile per comparison.
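
As a sketch of what a multi-axis vote might look like (axis names taken from the list above; the data structures are hypothetical, not the planned schema):

```python
from dataclasses import dataclass

AXES = ("accuracy", "conciseness", "safety", "instruction_adherence")

@dataclass
class RubricVote:
    """One voter's per-axis scores for a single response, each in [0, 1]."""
    scores: dict

def aggregate(votes):
    """Mean score per axis across all rubric votes for one model."""
    return {axis: sum(v.scores[axis] for v in votes) / len(votes) for axis in AXES}

profile = aggregate([
    RubricVote({"accuracy": 0.9, "conciseness": 0.6, "safety": 1.0, "instruction_adherence": 0.8}),
    RubricVote({"accuracy": 0.7, "conciseness": 0.8, "safety": 1.0, "instruction_adherence": 0.6}),
])
```

The point of the per-axis profile is that a model can be strong on accuracy while weak on conciseness, a distinction a single binary vote cannot capture.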
6. Comparative Response Diffing

A tool for visually diffing two model responses to the same prompt — surfacing word choice differences, structure differences, and length differences. Useful for researchers and developers who want to understand where models diverge, not just that they do.
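
Word-level diffing of this kind can be sketched with Python's standard difflib; a real tool would also surface structural and length differences, so this helper is only illustrative:

```python
from difflib import SequenceMatcher

def word_diff(response_a, response_b):
    """Word-level spans where two responses diverge, as (op, text_a, text_b)."""
    a, b = response_a.split(), response_b.split()
    matcher = SequenceMatcher(None, a, b)
    return [
        (op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# One replace span: "quick" -> "slow"; identical regions are dropped
diff = word_diff("the quick brown fox", "the slow brown fox")
```

Keeping only the non-equal opcodes highlights exactly where the two responses diverge, which is the researcher-facing question.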

On the Horizon

These are further out but actively being thought through:
A developer-facing REST API to submit comparisons, retrieve leaderboard data, and integrate Arena results into external tooling. This is planned but not yet available: the API exists internally and will be published once authentication, rate limiting, and the developer experience meet our standards. When available, it will be documented here.

When a provider releases a new version of a model, does it actually perform better in practice? We want Arena to be able to run structured regression comparisons — same prompt sets, before and after a model update — and surface whether the new version wins, loses, or is statistically indistinguishable from the prior one.

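
"Statistically indistinguishable" can be made precise with an exact sign test over decisive votes (ties excluded). The counts below are hypothetical and the sketch shows the idea, not the planned implementation:

```python
from math import comb

def regression_sign_test(new_wins, old_wins):
    """Two-sided exact binomial test on head-to-head wins, ties excluded.

    Under the null hypothesis (the update changed nothing), each decisive
    vote is a fair coin flip between the old and new version.
    """
    n = new_wins + old_wins
    k = max(new_wins, old_wins)
    # probability of a split at least this lopsided, doubled for two sides
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical regression run: the new version wins 38 of 50 decisive votes
p = regression_sign_test(38, 12)
```

A small p-value means the new version genuinely behaves differently on that prompt set; a p-value near 1 means the two versions are indistinguishable at this sample size.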
A curated set of prompts specifically designed to probe known failure modes: hallucination risk, sycophancy, refusal calibration, and instruction conflicts. These will be run periodically across the model roster and published as a living benchmark.

As frontier models expand into voice and image understanding, so will the Arena. Evaluating a model's ability to interpret an image or the naturalness of its spoken output requires new evaluation infrastructure — but the blind comparison methodology applies directly.
Anonymized, aggregated vote data exported in structured formats for academic research. We believe the data we are collectively generating has value beyond the leaderboard — it is a large-scale record of human preference over AI outputs, with controlled experimental conditions.

Recently Shipped

  • Full blind comparison mode — model identities hidden until post-vote
  • Real-time ELO leaderboard with live ranking updates
  • Parallel response generation — both models evaluated concurrently
  • Enhanced voting interface with tie option
  • Time to First Token (TTFT) tracked and displayed per response
  • Tokens per Second (TPS) visible alongside every comparison
  • Latency separated from quality scoring so speed doesn’t bias votes
  • Token-by-token response streaming for both models simultaneously
  • Responses readable as they stream, with no waiting for generation to finish
  • Cancellation support — users can stop a generation mid-stream