Benchmarks
ZeroSumEval: An Extensible Framework for Scaling LLM Evaluation with Inter-Model Competition
An extensible framework for scaling LLM evaluation through inter-model competition.
Humanity's Last Exam
A large-scale collaborative benchmark designed to test the limits of frontier AI systems.
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Shows that LLM leaderboard rankings can shift significantly with minor changes in evaluation methodology.