This paper shows that large language model leaderboards are highly sensitive to evaluation choices: seemingly minor methodological decisions can significantly alter benchmark rankings.