Reframing tokenization as a core modeling decision in LLMs rather than a preprocessing step, arguing for context-aware tokenizer and model co-design.
A family of multimodal reasoning and generation models from Amazon AGI.
An extensible framework for scaling LLM evaluation through inter-model competition.
A new metric for multilingual tokenization evaluation that goes beyond fertility-based measures.
A large-scale collaborative benchmark designed to test the limits of frontier AI systems.
A systematic survey and critical review on evaluating large language models with recommendations for more rigorous evaluation.
Comprehensive evaluation of LLMs on Bengali NLP, examining potentials and pitfalls for low-resource language processing.
Revealing the sensitivity of LLM leaderboards to evaluation choices, showing that rankings can shift significantly with minor methodological changes.
PhD thesis on transfer learning for language model adaptation, covering multilingual generalization and scalable training dynamics.
A comprehensive evaluation of ChatGPT on 140 academic tasks across diverse fields, highlighting strengths and weaknesses, examining its ability to follow multi-query instructions, and pointing toward practical applications of ChatGPT-like models.