Evaluation

ZeroSumEval: An Extensible Framework for Scaling LLM Evaluation with Inter-Model Competition

Evaluates LLMs by pitting them against one another in competitive, zero-sum matchups, so that benchmark difficulty scales with model capability rather than with a static dataset.

Beyond Fertility: STRR as a Metric for Multilingual Tokenization Evaluation

Proposes STRR, a metric for multilingual tokenization evaluation that captures tokenizer quality across languages where fertility-based measures fall short.

Humanity's Last Exam

A large-scale, collaboratively sourced benchmark of expert-written questions spanning many subjects, designed to test the limits of frontier AI systems.

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Surveys the LLM evaluation literature, cataloguing methodological challenges and limitations, and distills recommendations for more rigorous evaluation practice.

BenLLM-Eval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

A comprehensive evaluation of LLMs on Bengali NLP tasks, examining their potential and pitfalls for low-resource language processing.

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Shows that LLM leaderboard rankings are highly sensitive to evaluation choices, shifting significantly under minor methodological changes such as reordering multiple-choice options or altering prompt format.

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

A comprehensive evaluation of ChatGPT on 140 benchmark tasks spanning diverse fields, highlighting its strengths and weaknesses and assessing its ability to follow multiple queries within a single instruction.