This paper shows that large language model leaderboards are highly sensitive to evaluation choices: seemingly minor methodological decisions can significantly alter benchmark rankings.