Navigating NLP for Underrepresented Languages: Dataset Challenges, Efficient Techniques, and Evaluations
- Fajri Koto, Postdoc, MBZUAI
B9 L4 R4225
Overview
Abstract
Democratizing NLP across numerous languages is a non-trivial task, as it encounters challenges related to data scarcity, limited computational resources, and the intricacies of multilingual and multicultural diversity. In this talk, the speaker will discuss efforts and findings in tackling these challenges. To begin, data scarcity and inconsistent metadata are common obstacles in low-resource NLP, complicating our understanding of the NLP landscape for low-resource languages. He addresses this issue by standardizing datasets and reassessing the NLP status of Indonesia, a country with 700+ languages. He then introduces the first multilingual dataset for 10 Indonesian local languages, built by directly engaging with native speakers. In a recent endeavor, he examined the optimal strategy for constructing datasets in low-resource languages and found that composing texts directly in local languages yields greater lexical diversity, and better language models, than manual translation. Next, regions with under-represented languages often face computational resource constraints. To address this, he will discuss efforts including vocabulary adaptation in language modeling and a zero-shot approach that outperforms large language models. Lastly, while multilingual and multicultural considerations are vital for multilingual models, evaluations beyond English remain limited. He will highlight the significance of datasets with local context for assessing LLMs in terms of knowledge, culture, and commonsense reasoning.
Brief Biography
Dr. Fajri Koto is a Postdoctoral Research Fellow at MBZUAI, specializing in Natural Language Processing. He completed his PhD at The University of Melbourne under the supervision of Prof. Timothy Baldwin and Dr. Jey Han Lau. His ongoing research focuses on multilingual and low-resource NLP, commonsense reasoning, language generation, and the evaluation of large language models. Before his current position, Dr. Koto worked at Amazon and Samsung R&D Institute. His research has been published in well-regarded international conferences and journals such as ACL, EMNLP, NAACL, COLING, EACL, AACL, and JAIR. His work has been recognized with the Best Paper Award in CSRR at ACL 2022, the Outstanding Paper Award at EACL 2023, and the Resource Paper Award at AACL 2023. Additionally, in 2022, Dr. Koto was invited as a keynote panelist at ACL.