When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Abstract

This paper examines the sensitivity of large language model leaderboards to evaluation choices, demonstrating that benchmark rankings can shift substantially under seemingly minor methodological decisions.

Publication
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics