Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world’s languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, ‘typologically diverse’ language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP. Moreover, we provide evidence that this affects generalizability in multilingual model evaluation, emphasizing the importance of diverse language sampling in NLP evaluation.
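As a rough illustration of what selecting a typologically diverse sample from a sampling frame can look like, the sketch below uses greedy farthest-point (max-min) selection over binary feature vectors. This is a generic technique chosen for illustration, not necessarily the framework presented in the paper; the language labels, feature vectors, and function names are all hypothetical.

```python
def hamming(a, b):
    """Fraction of feature positions on which two languages differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def maxmin_sample(features, k, seed_lang):
    """Greedily pick k languages, each maximising its minimum distance
    to the languages already selected (farthest-point selection)."""
    if k > len(features):
        raise ValueError("k exceeds the size of the sampling frame")
    selected = [seed_lang]
    while len(selected) < k:
        candidates = [lang for lang in features if lang not in selected]
        best = max(
            candidates,
            key=lambda lang: min(
                hamming(features[lang], features[s]) for s in selected
            ),
        )
        selected.append(best)
    return selected


# Hypothetical sampling frame: four made-up languages, four binary features.
frame = {
    "lang_a": (0, 0, 0, 0),
    "lang_b": (0, 0, 0, 1),
    "lang_c": (1, 1, 1, 1),
    "lang_d": (1, 1, 0, 0),
}
sample = maxmin_sample(frame, k=3, seed_lang="lang_a")
```

In this toy frame the near-duplicate of the seed (lang_b) is left out in favour of the two languages that differ most from what has already been picked.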
Wikipedia’s perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
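Two of the issues named above, one-line articles and duplicate articles, lend themselves to a simple filter. The following is a minimal sketch under stated assumptions: articles are plain strings, duplicates are detected by exact match after whitespace and case normalisation, and the function name is hypothetical. It is not the filtering pipeline used in the paper.

```python
import hashlib


def filter_articles(articles):
    """Drop articles with at most one non-empty line and exact duplicates
    (compared after whitespace and case normalisation)."""
    seen = set()
    kept = []
    for text in articles:
        non_empty_lines = [ln for ln in text.splitlines() if ln.strip()]
        if len(non_empty_lines) <= 1:
            continue  # one-line (or empty) article
        normalised = " ".join(text.split()).lower()
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of an article already kept
        seen.add(key)
        kept.append(text)
    return kept


# Hypothetical toy corpus.
corpus = ["A\nB", "single line", "A\n  B", "C\nD"]
filtered = filter_articles(corpus)
```

Near-duplicate detection (for example with shingling or MinHash) would need a different approach; this sketch only catches exact matches after normalisation.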
Conference papers
NAACL
BPE-knockout: Pruning Pre-existing BPE Tokenisers with Backwards-compatible Morphological Semi-supervision
Thomas Bauwens and Pieter Delobelle
In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) Jun 2024
Byte-pair encoding (BPE) has become the default subword tokeniser in language models (LMs), allowing the representation of an infinite space of text with a finite set of units. Yet, BPE training is unsupervised, receiving no explicit information about a language’s morphology. This results in a subword vocabulary wherein many units are a concatenation of partial morphemes, preventing their formation as tokens. This, in turn, causes consistent intra-word patterns to be displayed inconsistently to downstream models, and bloats the vocabulary, hence requiring unnecessary embedding storage. In this paper, we address this issue by identifying blameworthy BPE merges and removing the resulting subwords from the BPE vocabulary, without impeding further use of merges that relied on them. We find that our method, BPE-knockout, is effective at making BPE’s segmentation positions adhere better to derivational and compound boundaries in English, Dutch and German, and improves token-based tasks in Dutch RoBERTa models, indicating that a tokeniser’s adherence to morphology impacts downstream models. We demonstrate the latter not only by training LMs from scratch, but also by continuing the pre-training of existing LMs. This proves promising, showing that suboptimal tokenisers can be remedied whilst salvaging training cost of downstream LMs.
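To make the idea of removing a merge "without impeding further use of merges that relied on them" more concrete, here is a toy sketch. It assumes one possible mechanism: the removed merge's output token is replaced by its parts inside every later merge, so those merges become n-ary. The merge list, the function names, and the choice of which merge to remove are all hypothetical, and this should not be read as the paper's actual implementation.

```python
def apply_merges(word, merges):
    """Segment a word by applying each merge once, in priority order.
    A merge is a tuple of adjacent tokens that are fused into one token."""
    tokens = list(word)
    for parts in merges:
        n = len(parts)
        out, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + n]) == parts:
                out.append("".join(parts))
                i += n
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


def knockout(merges, bad):
    """Remove merge `bad` and rewrite merges that consumed its output token
    so that they consume its parts instead. Assumes, for simplicity, that
    only one merge produces that token string."""
    bad_token = "".join(bad)
    pruned = []
    for parts in merges:
        if parts == bad:
            continue
        rewired = []
        for p in parts:
            rewired.extend(bad if p == bad_token else (p,))
        pruned.append(tuple(rewired))
    return pruned


# Hypothetical merge list in which ("n", "a") crosses the un|able boundary.
merges = [("n", "a"), ("u", "na"), ("b", "l"), ("bl", "e"), ("una", "ble")]
pruned = knockout(merges, ("n", "a"))

before = apply_merges("nap", merges)
after = apply_merges("nap", pruned)
whole = apply_merges("unable", pruned)
```

In this toy setup the token "na" can no longer be produced on its own, while the later merges that depended on it still apply through the rewired three-part merge.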
EMNLP
What Is “Typological Diversity” in NLP?
Esther Ploeger*, Wessel Poelman*, Miryam de Lhoneux, and 1 more author
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Nov 2024
The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world’s languages. Aiming to extend this, an increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being ‘typologically diverse’. In this work, we systematically investigate NLP research that includes claims regarding ‘typological diversity’. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of language selection along several axes and find that the results vary considerably across papers. Furthermore, we show that skewed language selection can lead to overestimated multilingual performance. We recommend future work to include an operationalization of ‘typological diversity’ that empirically justifies the diversity of language samples.
EMNLP
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, and 1 more author
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Nov 2024
Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.
Workshop papers
SIGTYP
Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification
Kushal Tatariya, Heather Lent, Johannes Bjerva, and 1 more author
In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP Mar 2024
Emotion classification is a challenging task in NLP due to the inherent idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolinguistic studies have shown that Hinglish speakers switch to Hindi when expressing negative emotions and to English when expressing positive emotions. To understand if language models can learn these associations, we study the effect of language on emotion prediction across 3 PLMs on a Hinglish emotion classification dataset. Using LIME and token level language ID, we find that models do learn these associations between language choice and emotional expression. Moreover, having code-mixed data present in the pre-training can augment that learning when task-specific data is scarce. We also conclude from the misclassifications that the models may overgeneralise this heuristic to other infrequent examples where this sociolinguistic phenomenon does not apply.
SIGTYP
A Call for Consistency in Reporting Typological Diversity
Wessel Poelman*, Esther Ploeger*, Miryam de Lhoneux, and 1 more author
In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP Mar 2024
In order to draw generalizable conclusions about the performance of multilingual models across languages, it is important to evaluate on a set of languages that captures linguistic diversity. Linguistic typology is increasingly used to justify language selection, inspired by language sampling in linguistics. However, justifications for ‘typological diversity’ exhibit great variation, as there seems to be no set definition, methodology or consistent link to linguistic typology. In this work, we provide a systematic insight into how previous work in the ACL Anthology uses the term ‘typological diversity’. Our two main findings are: 1) what is meant by typologically diverse language selection is not consistent and 2) the actual typological diversity of the language sets in these papers varies greatly. We argue that, when making claims about ‘typological diversity’, an operationalization of this should be included. A systematic approach that quantifies this claim, also with respect to the number of languages used, would be even better.
MRL
Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?
Zeno Vandenbulcke, Lukas Vermeire, and Miryam de Lhoneux
In 4th Multilingual Representation Learning (MRL) workshop Mar 2024
Monolinguals make up a minority of the world’s speakers, and yet most language technologies lag behind in handling linguistic behaviours produced by bilingual and multilingual speakers. A commonly observed phenomenon in such communities is code-mixing, which is prevalent on social media, and thus requires attention in NLP research. In this work, we look into the ability of pretrained language models to handle code-mixed data, with a focus on the impact of languages present in pretraining on the downstream performance of the model as measured on the task of sentiment analysis. Ultimately, we find that the pretraining language has little effect on performance when the model sees code-mixed data during downstream finetuning. We also evaluate the models on code-mixed data in a zero-shot setting, after task-specific finetuning on a monolingual dataset. We find that this brings out differences in model performance that can be attributed to the pretraining languages. We present a thorough analysis of these findings that also looks at model performance based on the composition of participating languages in the code-mixed datasets.