multilingual

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Reframing tokenization as a core modeling decision in LLMs rather than a preprocessing step, and arguing for context-aware co-design of tokenizer and model.

Beyond Fertility: STRR as a Metric for Multilingual Tokenization Evaluation

A new metric for multilingual tokenization evaluation that goes beyond fertility-based measures.

BenLLM-Eval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

Comprehensive evaluation of LLMs on Bengali NLP, examining potentials and pitfalls for low-resource language processing.

Transfer Learning for Language Model Adaptation

PhD thesis on transfer learning for language model adaptation, covering multilingual generalization and scalable training dynamics.

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

Adding support for new languages to the multilingual BLOOM language model for zero-shot prompting.