language-model

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Reframing tokenization as a core modeling decision in LLMs rather than a preprocessing step, arguing for context-aware tokenizer and model co-design.

Amazon Nova 2: Multimodal Reasoning and Generation Models

A family of multimodal reasoning and generation models from Amazon AGI.

ZeroSumEval: An Extensible Framework for Scaling LLM Evaluation with Inter-Model Competition

An extensible framework for scaling LLM evaluation through inter-model competition.

Beyond Fertility: STRR as a Metric for Multilingual Tokenization Evaluation

A new metric for multilingual tokenization evaluation that goes beyond fertility-based measures.

Humanity's Last Exam

A large-scale collaborative benchmark designed to test the limits of frontier AI systems.

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

A systematic survey and critical review of how large language models are evaluated, with recommendations for more rigorous evaluation.

BenLLM-Eval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

Comprehensive evaluation of LLMs on Bengali NLP, examining potentials and pitfalls for low-resource language processing.

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Revealing the sensitivity of LLM leaderboards to evaluation choices, showing that rankings can shift significantly with minor methodological changes.

Transfer Learning for Language Model Adaptation

PhD thesis on transfer learning for language model adaptation, covering multilingual generalization and scalable training dynamics.

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

A comprehensive evaluation of ChatGPT covering 140 tasks across diverse academic fields. It highlights the model's strengths and weaknesses and reports a new ability to follow multi-query instructions, with a view to informing practical applications of ChatGPT-like models.