Reframing tokenization as a core modeling decision in LLMs rather than a preprocessing step, and arguing for context-aware co-design of tokenizer and model.
A new metric for multilingual tokenization evaluation that goes beyond fertility-based measures.
Comprehensive evaluation of LLMs on Bengali NLP, examining potentials and pitfalls for low-resource language processing.
PhD thesis on transfer learning for language model adaptation, covering multilingual generalization and scalable training dynamics.
Adding support for new languages to the BLOOM multilingual language model for zero-shot prompting.