Research-grade leaderboard
Model performance by benchmark family, not vibes.
Compare frontier AI systems across reasoning, long-horizon work, computer use, programming, legal, and STEM evaluations. Each score is connected to the benchmarks and caveats behind it.
Explore by category
Start with the kind of work you care about.
Category ranking
Top model comparison
Benchmark catalog
Evidence behind each category
Model profiles
Context for every row
Methodology
How to read the leaderboard
What each category measures
Categories group benchmarks by the user-visible work they test: reasoning depth, long-horizon planning, computer-use reliability, programming quality, legal analysis, and STEM problem solving.
How scores are normalized
Raw benchmark results are converted to a 0-100 scale, then averaged within each category. Benchmarks with missing results are omitted instead of being treated as failures.
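As a rough illustration of the averaging described above, here is a minimal Python sketch. The text does not say how raw results are mapped to 0-100, so the min-max scaling, the per-benchmark bounds, and the names `normalize`, `category_score`, `bench_a`, `bench_b`, and `bench_c` are all assumptions made for this example, not the leaderboard's actual implementation.

```python
from statistics import mean
from typing import Dict, Optional, Tuple


def normalize(raw: float, lo: float, hi: float) -> float:
    """Map a raw result onto 0-100 with min-max scaling (assumed method)."""
    if hi <= lo:
        raise ValueError("upper bound must be greater than lower bound")
    clamped = min(max(raw, lo), hi)  # keep out-of-range results inside the scale
    return 100.0 * (clamped - lo) / (hi - lo)


def category_score(
    results: Dict[str, Optional[float]],
    bounds: Dict[str, Tuple[float, float]],
) -> Optional[float]:
    """Average the normalized scores in one category.

    A benchmark whose result is None is skipped, so a missing result
    does not count as a zero. Returns None if nothing was reported.
    """
    scored = [
        normalize(raw, *bounds[name])
        for name, raw in results.items()
        if raw is not None
    ]
    return mean(scored) if scored else None


# Hypothetical category with one missing result.
example_results = {"bench_a": 0.5, "bench_b": None, "bench_c": 30.0}
example_bounds = {"bench_a": (0.0, 1.0), "bench_b": (0.0, 100.0), "bench_c": (0.0, 40.0)}
example_score = category_score(example_results, example_bounds)
```

In this sketch `bench_b` is left out of the average entirely, so the category score reflects only the two reported benchmarks. Treating it as a failure would instead pull the average toward zero.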
What good performance means
Scores above 90 indicate consistently strong results across a category. Scores in the 80s are competitive but can mask uneven results across the individual benchmarks in that category. Lower scores usually mean the model needs narrower task framing or stronger supervision.
Known caveats
Benchmarks can lag product updates, miss real-world workflow friction, or overrepresent public tasks. Read the confidence notes and recency metadata alongside the rank.