Research-grade leaderboard

Model performance by benchmark family, not vibes.

Compare frontier AI systems across reasoning, long-horizon work, computer use, programming, legal, and STEM evaluations. Each score links to the benchmarks and caveats behind it.

Explore by category

Start with the kind of work you care about.

Category ranking

Top model comparison


Benchmark catalog

Evidence behind each category

Model profiles

Context for every row

Methodology

How to read the leaderboard

What each category measures

Categories group benchmarks by the user-visible work they test: reasoning depth, long-horizon planning, computer-use reliability, programming quality, legal analysis, and STEM problem solving.

How scores are normalized

Raw benchmark results are converted to a 0-100 scale, then averaged within each category. A benchmark with no result for a model is left out of that model's category average rather than counted as a failure.
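The normalization step can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual pipeline: the page does not say how raw results are mapped onto 0-100, so min-max scaling against per-benchmark bounds is an assumption here, and the function names, benchmark names, and bounds are all hypothetical.

```python
from statistics import mean
from typing import Optional


def normalize(raw: float, lo: float, hi: float) -> float:
    """Scale a raw result onto 0-100 using min-max bounds.

    Assumption: min-max scaling is used for illustration only; the
    source does not specify the conversion method.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (raw - lo) / (hi - lo) * 100.0
    # Clamp so out-of-range raw values stay on the 0-100 scale.
    return max(0.0, min(100.0, scaled))


def category_score(
    results: dict[str, Optional[float]],
    bounds: dict[str, tuple[float, float]],
) -> Optional[float]:
    """Average the normalized results within one category.

    Benchmarks whose result is None are omitted from the average
    instead of being counted as 0.
    """
    normalized = [
        normalize(raw, *bounds[name])
        for name, raw in results.items()
        if raw is not None
    ]
    # A category with no results at all has no score.
    return mean(normalized) if normalized else None


# Hypothetical benchmarks and bounds, for illustration only.
bounds = {"bench_a": (0.0, 1.0), "bench_b": (0.0, 50.0), "bench_c": (0.0, 100.0)}
results = {"bench_a": 0.75, "bench_b": 25.0, "bench_c": None}

print(category_score(results, bounds))
```

Because missing benchmarks are dropped, two models in the same category may be averaged over different sets of benchmarks, which is one reason to read the per-benchmark evidence alongside the category score.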

What good performance means

Scores above 90 indicate consistently strong results across a category. Scores in the 80s are competitive, but a category average can hide wide variance between individual benchmarks. Lower scores usually mean the model needs narrower task framing or stronger supervision.

Known caveats

Benchmarks can lag product updates, miss real-world workflow friction, or overrepresent public tasks. Confidence notes and recency metadata should be read alongside the rank.