leaderboard

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Shows that LLM leaderboards are sensitive to evaluation choices: minor methodological changes can significantly shift model rankings.