Benchmarks

ZeroSumEval: An Extensible Framework for Scaling LLM Evaluation with Inter-Model Competition

A framework that evaluates LLMs dynamically by pitting them against one another in zero-sum games and ranking models by competitive outcomes rather than fixed test sets.
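As a rough illustration of the competitive-evaluation idea, the sketch below plays round-robin zero-sum matches between models and converts win/loss outcomes into Elo-style ratings. It is not the ZeroSumEval API; `play_match`, the model names, and the rating scheme are hypothetical stand-ins.

```python
import itertools
import random
from typing import Callable, Dict, List


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update; score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


def run_tournament(models: List[str],
                   play_match: Callable[[str, str], float],
                   rounds: int = 10) -> Dict[str, float]:
    """Round-robin zero-sum tournament: every ordered pair plays `rounds` games."""
    ratings = {m: 1000.0 for m in models}
    for _ in range(rounds):
        for a, b in itertools.permutations(models, 2):
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], play_match(a, b))
    return ratings


if __name__ == "__main__":
    # Stand-in for an actual game between two LLMs (chess, debate, a security challenge, ...).
    def coin_flip_match(a: str, b: str) -> float:
        return random.choice([0.0, 0.5, 1.0])

    print(run_tournament(["model-a", "model-b", "model-c"], coin_flip_match))
```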

Humanity's Last Exam

A large-scale collaborative benchmark of expert-written questions spanning many subjects, designed to test the limits of frontier AI systems as older benchmarks saturate.

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

An empirical study showing that LLM leaderboard rankings are highly sensitive to minor evaluation choices, such as reordering multiple-choice options or reformatting prompts, which can shift model rankings significantly.
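To make that sensitivity concrete, here is a minimal sketch (hypothetical helper names, not the authors' code) that scores the same multiple-choice questions twice: once in canonical order and once with shuffled options and different choice labels. A large gap between the two accuracies is the kind of instability the paper documents.

```python
import random
from typing import Callable, Dict, List

# A question is assumed to look like:
# {"stem": str, "choices": List[str], "answer": int}  (answer = index of the gold choice)
Question = Dict[str, object]


def format_prompt(q: Question, labels: str) -> str:
    """Render a question with one labeled choice per line, e.g. 'A. ...'."""
    lines = [str(q["stem"])]
    lines += [f"{lab}. {choice}" for lab, choice in zip(labels, q["choices"])]
    return "\n".join(lines)


def accuracy(questions: List[Question],
             ask_model: Callable[[str], int],  # returns the index the model picks
             shuffle_choices: bool = False,
             labels: str = "ABCD",
             seed: int = 0) -> float:
    """Score a model under one particular presentation of the questions."""
    rng = random.Random(seed)
    correct = 0
    for q in questions:
        choices = list(q["choices"])
        answer = int(q["answer"])
        if shuffle_choices:
            order = list(range(len(choices)))
            rng.shuffle(order)
            choices = [q["choices"][i] for i in order]
            answer = order.index(int(q["answer"]))  # gold index after shuffling
        pred = ask_model(format_prompt({**q, "choices": choices}, labels))
        correct += int(pred == answer)
    return correct / len(questions)


# Usage: compare the canonical and perturbed settings for the same model.
# base = accuracy(questions, ask_model)
# perturbed = accuracy(questions, ask_model, shuffle_choices=True, labels="1234")
# print(f"accuracy shift under trivial perturbation: {base - perturbed:+.3f}")
```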