Publications | LAGoM NLP

Latest preprints

QQ: A Toolkit for Language Identifiers and Metadata

Wessel Poelman, Yiyi Chen, and Miryam de Lhoneux

Feb 2026

The growing number of languages considered in multilingual NLP, including new datasets and tasks, poses challenges regarding properly and accurately reporting which languages are used and how. For example, datasets often use different language identifiers; some use BCP-47 (e.g. en_Latn), others use ISO 639-1 (en), and more linguistically oriented datasets use Glottocodes (stan1293). Mapping between identifiers is manageable for a few dozen languages, but becomes unscalable when dealing with thousands. We introduce QwanQwa, a light-weight Python toolkit for unified language metadata management. QQ integrates multiple language resources into a single interface, provides convenient normalization and mapping between language identifiers, and affords a graph-based structure that enables traversal across families, regions, writing systems, and other linguistic attributes. QQ serves both as (1) a simple "glue" library in multilingual NLP research to make working with many languages easier, and (2) as an intuitive way for exploring languages, such as finding related ones through shared scripts, regions or other metadata.
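
The identifier mapping described in the abstract above can be illustrated with a small sketch. This is not the QQ API, which is not shown on this page; the `LanguageRecord` class, the `INDEX` table and the `normalize` function are hypothetical names, and the three hard-coded records stand in for the language resources a real toolkit would load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRecord:
    """One language with several of its identifiers (hypothetical structure)."""
    name: str
    iso639_1: str
    iso639_3: str
    glottocode: str


# A tiny hand-written table; a real toolkit would load this from language resources.
RECORDS = [
    LanguageRecord("English", "en", "eng", "stan1293"),
    LanguageRecord("Dutch", "nl", "nld", "dutc1256"),
    LanguageRecord("German", "de", "deu", "stan1295"),
]

# Index every known identifier so that any of them resolves to the same record.
INDEX = {}
for record in RECORDS:
    for identifier in (record.iso639_1, record.iso639_3, record.glottocode):
        INDEX[identifier] = record


def normalize(tag: str) -> LanguageRecord:
    """Resolve tags such as 'en', 'eng', 'en_Latn' or 'stan1293' to one record."""
    # Keep only the language subtag of tags like 'en_Latn' or 'en-Latn'.
    primary = tag.replace("_", "-").split("-")[0].lower()
    return INDEX[primary]  # raises KeyError for unknown identifiers


print(normalize("en_Latn").glottocode)
print(normalize("stan1293").iso639_3)
```

A flat lookup table like this works for a handful of languages; the graph-based structure mentioned in the abstract would additionally link records through families, regions and scripts.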

arXiv
How Good Is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

Kushal Tatariya*, Artur Kulmizev*, Wessel Poelman, and 6 more authors

Nov 2024

Wikipedia’s perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
@misc{tatariya2024how,
  title = {How {{Good}} Is {{Your Wikipedia}}? Auditing Data Quality for Low-resource and Multilingual NLP},
  author = {Tatariya*, Kushal and Kulmizev*, Artur and Poelman, Wessel and Ploeger, Esther and Bollmann, Marcel and Bjerva, Johannes and Luo, Jiaming and Lent, Heather and de Lhoneux, Miryam},
  year = {2024},
  month = nov,
  number = {arXiv:2411.05527},
  eprint = {2411.05527},
  publisher = {arXiv},
  urldate = {2024-11-13},
  keywords = {Computer Science - Computation and Language},
  primaryclass = {cs},
}

Conference papers

EACL
Form and Meaning in Intrinsic Multilingual Evaluations

Wessel Poelman and Miryam de Lhoneux

In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics Mar 2026

Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.
@inproceedings{poelman-de-lhoneux-2026-form,
  title = {Form and Meaning in Intrinsic Multilingual Evaluations},
  author = {Poelman, Wessel and de Lhoneux, Miryam},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics},
  month = mar,
  year = {2026},
  address = {Rabat, Morocco},
  publisher = {Association for Computational Linguistics},
}
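
For readers unfamiliar with the two metrics named in the abstract above, the following sketch shows their standard definitions. It assumes per-token natural-log probabilities are already available from some model; the function names are illustrative and the paper's own six metrics and evaluation code are not reproduced here.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-probability per token (natural log)."""
    negative_log_likelihood = -sum(token_log_probs)
    return math.exp(negative_log_likelihood / len(token_log_probs))


def bits_per_character(token_log_probs, text):
    """Total negative log-likelihood in bits, divided by the number of characters."""
    negative_log_likelihood_bits = -sum(token_log_probs) / math.log(2)
    return negative_log_likelihood_bits / len(text)


# Hypothetical example: four tokens, each assigned probability 0.5,
# covering a text of eight characters.
log_probs = [math.log(0.5)] * 4
print(perplexity(log_probs))
print(bits_per_character(log_probs, "abcdefgh"))
```

Perplexity normalises by token count and bits-per-character by character count, so the same total likelihood yields different scores when parallel sentences differ in tokenisation or length.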
EACL (Findings)
Typologically Informed Parameter Aggregation

Stef Accou and Wessel Poelman

In Findings of the Association for Computational Linguistics: EACL 2026 Mar 2026

Massively multilingual language models enable cross-lingual generalization but underperform on low-resource and unseen languages. While adapter-based fine-tuning offers a parameter-efficient solution, training language-specific adapters at scale remains costly. We introduce Typologically Informed Parameter Aggregation (TIPA), a training-free method that constructs proxy language adapters by aggregating existing ones, weighted by typological similarity. Integrated into the MAD-X framework, these proxies enable zero-shot cross-lingual transfer without additional training. We evaluate TIPA on five NLP tasks and over 230 languages. TIPA consistently outperforms or matches baselines such as English-only fine-tuning or selecting the typologically closest language adapter. We see the largest gains for languages lacking dedicated adapters. Our results demonstrate that typologically informed aggregation provides a viable alternative to language-specific modules without any training needed.
@inproceedings{accou-poelman-2026-typologically,
  title = {Typologically Informed Parameter Aggregation},
  author = {Accou, Stef and Poelman, Wessel},
  booktitle = {Findings of the Association for Computational Linguistics: EACL 2026},
  month = mar,
  year = {2026},
  address = {Rabat, Morocco},
  publisher = {Association for Computational Linguistics},
}
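
The idea of aggregating existing adapters weighted by typological similarity, as summarised in the abstract above, can be sketched as a normalised weighted average. This is an assumption-laden illustration, not the TIPA implementation: the abstract does not state how similarities are computed or how weights are normalised, and `aggregate_adapters` and the toy values are hypothetical.

```python
def aggregate_adapters(adapters, similarities):
    """Build a proxy adapter as a similarity-weighted average of existing adapters.

    adapters: dict mapping a language to a flat list of parameter values.
    similarities: dict mapping the same languages to non-negative similarity
        scores with respect to the target language (assumed to be given).
    """
    total = sum(similarities[language] for language in adapters)
    if total <= 0:
        raise ValueError("similarity scores must sum to a positive value")

    size = len(next(iter(adapters.values())))
    proxy = [0.0] * size
    for language, parameters in adapters.items():
        weight = similarities[language] / total  # weights sum to one
        for index, value in enumerate(parameters):
            proxy[index] += weight * value
    return proxy


# Toy example with two source adapters of two parameters each.
adapters = {"lang_a": [1.0, 0.0], "lang_b": [0.0, 1.0]}
similarities = {"lang_a": 3.0, "lang_b": 1.0}
print(aggregate_adapters(adapters, similarities))  # [0.75, 0.25]
```

No gradient step is involved, which matches the training-free framing in the abstract; everything else about the sketch is a simplification.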

NoDaLiDa
The Roles of English in Evaluating Multilingual Language Models

Wessel Poelman and Miryam de Lhoneux

In The Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies Mar 2025

Multilingual natural language processing is getting increased attention, with numerous models, benchmarks, and methods being released for many languages. English is often used in multilingual evaluation to prompt language models (LMs), mainly to overcome the lack of instruction tuning data in other languages. In this position paper, we lay out two roles of English in multilingual LM evaluations: as an interface and as a natural language. We argue that these roles have different goals: task performance versus language understanding. This discrepancy is highlighted with examples from datasets and evaluation setups. Numerous works explicitly use English as an interface to boost task performance. We recommend to move away from this imprecise method and instead focus on furthering language understanding.
@inproceedings{poelman2024roles,
  title = {The {{Roles}} of {{English}} in {{Evaluating Multilingual Language Models}}},
  author = {Poelman, Wessel and {de Lhoneux}, Miryam},
  booktitle = {The {{Joint}} 25th {{Nordic Conference}} on {{Computational Linguistics}} and 11th {{Baltic Conference}} on {{Human Language Technologies}}},
  year = {2025},
  month = mar,
}
ACL
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model

Thomas Bauwens, David Kaczér, and Miryam de Lhoneux

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2025

Stochastically sampling word segmentations from a subword tokeniser, also called subword regularisation, is a known way to increase robustness of language models to out-of-distribution inputs, such as text containing spelling errors. Recent work has observed that usual augmentations that make popular deterministic subword tokenisers stochastic still cause only a handful of all possible segmentations to be sampled. It has been proposed to uniformly sample across these instead, through rejection sampling of paths in an unweighted segmentation graph. In this paper, we argue that uniformly random segmentation in turn skews the distributions of certain segmentational properties (e.g. token lengths and amount of tokens produced) away from uniformity, which still ends up hiding meaningfully diverse tokenisations. We propose an alternative uniform sampler using the same segmentation graph, but weighted by counting the paths through it. Our sampling algorithm, GRaMPa, provides hyperparameters allowing sampled tokenisations to skew towards fewer, longer tokens. Furthermore, GRaMPa is single-pass, guaranteeing significantly better computational complexity than previous approaches relying on rejection sampling. We show experimentally that language models trained with GRaMPa outperform existing regularising tokenisers in a data-scarce setting on token-level tasks such as dependency parsing, especially with spelling errors present.
@inproceedings{bauwens-etal-2025-grampa,
  title = {{GR}a{MP}a: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting {M}arkov Model},
  author = {Bauwens, Thomas and Kacz{\'e}r, David and {de Lhoneux}, Miryam},
  editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = jul,
  year = {2025},
  address = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.acl-long.1180/},
  pages = {24228--24257},
  isbn = {979-8-89176-251-0},
}
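
The abstract above contrasts rejection sampling with sampling from a segmentation graph weighted by path counts. The sketch below shows only the generic path-counting idea: it samples uniformly over all segmentations of a string in a single pass and does not include GRaMPa's skewing hyperparameters or its actual Markov model. The function names and the toy vocabulary are hypothetical.

```python
import random


def count_paths(text, vocab):
    """paths[i] = number of ways to segment text[i:] into vocabulary tokens."""
    n = len(text)
    paths = [0] * (n + 1)
    paths[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for start in range(n - 1, -1, -1):
        paths[start] = sum(
            paths[end]
            for end in range(start + 1, n + 1)
            if text[start:end] in vocab
        )
    return paths


def sample_segmentation(text, vocab, rng):
    """Draw one segmentation uniformly at random, without rejection."""
    paths = count_paths(text, vocab)
    if paths[0] == 0:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, start = [], 0
    while start < len(text):
        ends = [
            end
            for end in range(start + 1, len(text) + 1)
            if text[start:end] in vocab and paths[end] > 0
        ]
        # Choosing each edge in proportion to the number of paths behind it
        # makes every complete segmentation equally likely.
        end = rng.choices(ends, weights=[paths[e] for e in ends])[0]
        tokens.append(text[start:end])
        start = end
    return tokens


vocab = {"a", "b", "c", "ab", "bc", "abc"}
print(count_paths("abc", vocab)[0])  # 4 segmentations
print(sample_segmentation("abc", vocab, random.Random(0)))
```

Because each step is weighted by the remaining path count, no sampled path is ever discarded, which is the property that distinguishes this family of samplers from rejection sampling.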
ACL (Findings)
Supervised and Unsupervised Probing of Shortcut Learning: Case Study on the Emergence and Evolution of Syntactic Heuristics in BERT

Elke Vandermeerschen and Miryam de Lhoneux

In Findings of the Association for Computational Linguistics: ACL 2025 Jul 2025

Contemporary language models (LMs) such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2023), and GPT-4 (OpenAI, 2023) have exhibited remarkable capabilities, effectively addressing long-standing challenges in the field. However, these models rely on shortcut learning, using a decision rule that relies on superficial cues that are spuriously correlated with the labels (Geirhos et al., 2020). In this research, we focus on the reliance on a specific type of shortcuts, namely syntactic heuristics, in BERT when performing Natural Language Inference (NLI), a representative task in Natural Language Understanding (Jeretic et al., 2020). By making use of two probing methods, one supervised, one unsupervised, we investigate where these shortcuts emerge, how they evolve and how they impact the latent knowledge of the LM. Our findings reveal that syntactic heuristics are absent in pretrained models but emerge and evolve as the model is finetuned with datasets of increasing size. The adoption of these shortcuts varies across different hidden layers, with specific layers closer to the output contributing more to this phenomenon. Despite the model’s reliance on shortcuts during inference, it retains information relevant to the task, and our supervised and unsupervised probes process this information differently.
@inproceedings{vandermeerschen-de-lhoneux-2025-supervised,
  title = {Supervised and Unsupervised Probing of Shortcut Learning: Case Study on the Emergence and Evolution of Syntactic Heuristics in {BERT}},
  author = {Vandermeerschen, Elke and {de Lhoneux}, Miryam},
  editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
  month = jul,
  year = {2025},
  address = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.findings-acl.499/},
  pages = {9592--9604},
  isbn = {979-8-89176-256-5},
}
EMNLP
Confounding Factors in Relating Model Performance to Morphology

Wessel Poelman, Thomas Bauwens, and Miryam de Lhoneux

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing Nov 2025

The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline necessities to reliably answer whether, and how, morphology relates to language modeling.
@inproceedings{poelman-etal-2025-confounding,
  title = {Confounding Factors in Relating Model Performance to Morphology},
  author = {Poelman, Wessel and Bauwens, Thomas and de Lhoneux, Miryam},
  editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  month = nov,
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.emnlp-main.369/},
  pages = {7273--7298},
  isbn = {979-8-89176-332-6},
}
AACL
On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Kushal Tatariya, Wessel Poelman, and Miryam de Lhoneux

In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics Dec 2025

Language model architectures are predominately first created for English and afterwards applied to other languages. This can lead to problems for languages that are structurally different from English. We study one specific architectural choice: positional encodings. We do this through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis states there exists a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice-versa. Positional encodings are a direct target to investigate the implications of this hypothesis in relation to language modelling. We pre-train and evaluate three monolingual model variants with absolute, relative and no position encodings for seven typologically diverse languages and evaluate on four downstream tasks. We fail to find a consistent trend with various proxies for morphological complexity and word order flexibility. Our work shows choice of tasks, languages, and metrics are essential for drawing stable conclusions.
@inproceedings{tatariya-etal-2025-interplay,
  title = {On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility},
  author = {Tatariya, Kushal and Poelman, Wessel and de Lhoneux, Miryam},
  editor = {Inui, Kentaro and Sakti, Sakriani and Wang, Haofen and Wong, Derek F. and Bhattacharyya, Pushpak and Banerjee, Biplab and Ekbal, Asif and Chakraborty, Tanmoy and Singh, Dhirendra Pratap},
  booktitle = {Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
  month = dec,
  year = {2025},
  address = {Mumbai, India},
  publisher = {The Asian Federation of Natural Language Processing and The Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.ijcnlp-long.95/},
  pages = {1761--1778},
  isbn = {979-8-89176-298-5},
}

NAACL
BPE-knockout: Pruning Pre-existing BPE Tokenisers with Backwards-compatible Morphological Semi-supervision

Thomas Bauwens and Pieter Delobelle

In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) Jun 2024

Byte-pair encoding (BPE) has become the default subword tokeniser in language models (LMs), allowing the representation of an infinite space of text with a finite set of units. Yet, BPE training is unsupervised, receiving no explicit information about a language’s morphology. This results in a subword vocabulary wherein many units are a concatenation of partial morphemes, preventing their formation as tokens. This, in turn, causes consistent intra-word patterns to be displayed inconsistently to downstream models, and bloats the vocabulary, hence requiring unnecessary embedding storage. In this paper, we address this issue by identifying blameworthy BPE merges and removing the resulting subwords from the BPE vocabulary, without impeding further use of merges that relied on them. We find that our method, BPE-knockout, is effective at making BPE’s segmentation positions adhere better to derivational and compound boundaries in English, Dutch and German, and improves token-based tasks in Dutch RoBERTa models, indicating that a tokeniser’s adherence to morphology impacts downstream models. We demonstrate the latter not only by training LMs from scratch, but also by continuing the pre-training of existing LMs. This proves promising, showing that suboptimal tokenisers can be remedied whilst salvaging training cost of downstream LMs.
@inproceedings{bauwens-delobelle-2024-bpe,
  title = {{BPE}-knockout: Pruning Pre-existing {BPE} Tokenisers with Backwards-compatible Morphological Semi-supervision},
  author = {Bauwens, Thomas and Delobelle, Pieter},
  booktitle = {Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
  month = jun,
  year = {2024},
  address = {Mexico City, Mexico},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2024.naacl-long.324},
  pages = {5810--5832},
}
EMNLP
What Is “Typological Diversity” in NLP?

Esther Ploeger*, Wessel Poelman*, Miryam de Lhoneux, and 1 more author

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Nov 2024

The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world’s languages. Aiming to extend this, an increasing number of papers aspires to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being ‘typologically diverse’. In this work, we systematically investigate NLP research that includes claims regarding ‘typological diversity’. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of language selection along several axes and find that the results vary considerably across papers. Furthermore, we show that skewed language selection can lead to overestimated multilingual performance. We recommend future work to include an operationalization of ‘typological diversity’ that empirically justifies the diversity of language samples.
@inproceedings{ploeger2024what,
  title = {What Is {{``Typological Diversity''}} in {{NLP}}?},
  author = {Ploeger*, Esther and Poelman*, Wessel and {de Lhoneux}, Miryam and Bjerva, Johannes},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year = {2024},
  month = nov,
  publisher = {Association for Computational Linguistics},
}

EMNLP

Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, and 1 more author

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Nov 2024

@inproceedings{tatariya2024pixology,
  title = {Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models},
  author = {Tatariya, Kushal and Araujo, Vladimir and Bauwens, Thomas and {de Lhoneux}, Miryam},
  year = {2024},
  publisher = {Association for Computational Linguistics},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
}

COLM

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

François Remy, Pieter Delobelle, Hayastan Avetisyan, and 3 more authors

In First Conference on Language Modeling Nov 2024

@inproceedings{remy-delobelle2024transtokenization,
  title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of {LLM}s for Low-Resource {NLP}},
  author = {Remy, Fran{\c{c}}ois and Delobelle, Pieter and Avetisyan, Hayastan and Khabibullina, Alfiya and {de Lhoneux}, Miryam and Demeester, Thomas},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=sBxvoDhvao},
}

Journal papers

CL

A Principled Framework for Evaluating on Typologically Diverse Languages

Esther Ploeger, Wessel Poelman, Andreas Holck Høeg-Petersen, and 3 more authors

Computational Linguistics Oct 2025

Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world’s languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, ‘typologically diverse’ language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP. Moreover, we provide evidence that this affects generalizability in multilingual model evaluation, emphasizing the importance of diverse language sampling in NLP evaluation.

TACL

CreoleVal: Multilingual Multitask Benchmarks for Creoles

Heather Lent, Kushal Tatariya, Raj Dabre, and 18 more authors

Transactions of the Association for Computational Linguistics Sep 2024

Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.

Workshop papers

MRL
Type and Complexity Signals in Multilingual Question Representations

Robin Kokot and Wessel Poelman

In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025) Nov 2025

This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions well in explicitly marked languages and structural complexity prediction, while neural probes lead on individual metrics. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce availability of pre-trained linguistic information.
@inproceedings{kokot-poelman-2025-type,
  title = {Type and Complexity Signals in Multilingual Question Representations},
  author = {Kokot, Robin and Poelman, Wessel},
  editor = {Adelani, David Ifeoluwa and Arnett, Catherine and Ataman, Duygu and Chang, Tyler A. and Gonen, Hila and Raja, Rahul and Schmidt, Fabian and Stap, David and Wang, Jiayi},
  booktitle = {Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)},
  month = nov,
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.mrl-main.28/},
  pages = {411--425},
  isbn = {979-8-89176-345-6},
}

SIGTYP
Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Kushal Tatariya, Heather Lent, Johannes Bjerva, and 1 more author

In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP Mar 2024

Emotion classification is a challenging task in NLP due to the inherent idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolinguistic studies have shown that Hinglish speakers switch to Hindi when expressing negative emotions and to English when expressing positive emotions. To understand if language models can learn these associations, we study the effect of language on emotion prediction across 3 PLMs on a Hinglish emotion classification dataset. Using LIME and token level language ID, we find that models do learn these associations between language choice and emotional expression. Moreover, having code-mixed data present in the pre-training can augment that learning when task-specific data is scarce. We also conclude from the misclassifications that the models may overgeneralise this heuristic to other infrequent examples where this sociolinguistic phenomenon does not apply.
@inproceedings{tatariya_sociolinguistically_2024,
  address = {St. Julian's, Malta},
  title = {Sociolinguistically {Informed} {Interpretability}: {A} {Case} {Study} on {Hinglish} {Emotion} {Classification}},
  shorttitle = {Sociolinguistically {Informed} {Interpretability}},
  url = {https://aclanthology.org/2024.sigtyp-1.9},
  urldate = {2024-03-22},
  booktitle = {Proceedings of the 6th {Workshop} on {Research} in {Computational} {Linguistic} {Typology} and {Multilingual} {NLP}},
  publisher = {Association for Computational Linguistics},
  author = {Tatariya, Kushal and Lent, Heather and Bjerva, Johannes and {de Lhoneux}, Miryam},
  editor = {Hahn, Michael and Sorokin, Alexey and Kumar, Ritesh and Shcherbakov, Andreas and Otmakhova, Yulia and Yang, Jinrui and Serikov, Oleg and Rani, Priya and Ponti, Edoardo M. and Muradoğlu, Saliha and Gao, Rena and Cotterell, Ryan and Vylomova, Ekaterina},
  month = mar,
  year = {2024},
  pages = {66--74},
}
SIGTYP
A Call for Consistency in Reporting Typological Diversity

Wessel Poelman*, Esther Ploeger*, Miryam de Lhoneux, and 1 more author

In Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP Mar 2024

In order to draw generalizable conclusions about the performance of multilingual models across languages, it is important to evaluate on a set of languages that captures linguistic diversity. Linguistic typology is increasingly used to justify language selection, inspired by language sampling in linguistics. However, justifications for ‘typological diversity’ exhibit great variation, as there seems to be no set definition, methodology or consistent link to linguistic typology. In this work, we provide a systematic insight into how previous work in the ACL Anthology uses the term ‘typological diversity’. Our two main findings are: 1) what is meant by typologically diverse language selection is not consistent and 2) the actual typological diversity of the language sets in these papers varies greatly. We argue that, when making claims about ‘typological diversity’, an operationalization of this should be included. A systematic approach that quantifies this claim, also with respect to the number of languages used, would be even better.
@inproceedings{poelman2024call,
  title = {A {{Call}} for {{Consistency}} in {{Reporting Typological Diversity}}},
  booktitle = {Proceedings of the 6th {{Workshop}} on {{Research}} in {{Computational Linguistic Typology}} and {{Multilingual NLP}}},
  author = {Poelman*, Wessel and Ploeger*, Esther and {de Lhoneux}, Miryam and Bjerva, Johannes},
  editor = {Hahn, Michael and Sorokin, Alexey and Kumar, Ritesh and Shcherbakov, Andreas and Otmakhova, Yulia and Yang, Jinrui and Serikov, Oleg and Rani, Priya and Ponti, Edoardo M. and Murado{\u g}lu, Saliha and Gao, Rena and Cotterell, Ryan and Vylomova, Ekaterina},
  year = {2024},
  month = mar,
  pages = {75--77},
  publisher = {Association for Computational Linguistics},
  address = {St. Julian's, Malta},
  url = {https://aclanthology.org/2024.sigtyp-1.10},
  urldate = {2024-04-22},
}

MRL

Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?

Zeno Vandenbulcke, Lukas Vermeire, and Miryam de Lhoneux

In 4th Multilingual Representation Learning (MRL) workshop Mar 2024

@inproceedings{vandenbulcke2024recipe,
  title = {Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?},
  author = {Vandenbulcke, Zeno and Vermeire, Lukas and {de Lhoneux}, Miryam},
  year = {2024},
  booktitle = {4th Multilingual Representation Learning (MRL) workshop},
}

HumEval

Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French

Ayla Rigouts Terryn and Miryam de Lhoneux

In Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024 May 2024

@inproceedings{rigouts-terryn-de-lhoneux-2024-exploratory,
  title = {Exploratory Study on the Impact of {E}nglish Bias of Generative Large Language Models in {D}utch and {F}rench},
  author = {Rigouts Terryn, Ayla and {de Lhoneux}, Miryam},
  editor = {Balloccu, Simone and Belz, Anya and Huidrom, Rudali and Reiter, Ehud and Sedoc, Joao and Thomson, Craig},
  booktitle = {Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024},
  month = may,
  year = {2024},
  address = {Torino, Italia},
  publisher = {ELRA and ICCL},
  url = {https://aclanthology.org/2024.humeval-1.2},
  pages = {12--27},
}

WASSA
Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter?

Kushal Tatariya, Heather Lent, and Miryam de Lhoneux

In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis Jul 2023

Monolinguals make up a minority of the world’s speakers, and yet most language technologies lag behind in handling linguistic behaviours produced by bilingual and multilingual speakers. A commonly observed phenomenon in such communities is code-mixing, which is prevalent on social media, and thus requires attention in NLP research. In this work, we look into the ability of pretrained language models to handle code-mixed data, with a focus on the impact of languages present in pretraining on the downstream performance of the model as measured on the task of sentiment analysis. Ultimately, we find that the pretraining language has little effect on performance when the model sees code-mixed data during downstream finetuning. We also evaluate the models on code-mixed data in a zero-shot setting, after task-specific finetuning on a monolingual dataset. We find that this brings out differences in model performance that can be attributed to the pretraining languages. We present a thorough analysis of these findings that also looks at model performance based on the composition of participating languages in the code-mixed datasets.
@inproceedings{tatariya_transfer_2023,
  address = {Toronto, Canada},
  title = {Transfer {Learning} for {Code}-{Mixed} {Data}: {Do} {Pretraining} {Languages} {Matter}?},
  shorttitle = {Transfer {Learning} for {Code}-{Mixed} {Data}},
  url = {https://aclanthology.org/2023.wassa-1.32},
  doi = {10.18653/v1/2023.wassa-1.32},
  urldate = {2023-12-27},
  booktitle = {Proceedings of the 13th {Workshop} on {Computational} {Approaches} to {Subjectivity}, {Sentiment}, \& {Social} {Media} {Analysis}},
  publisher = {Association for Computational Linguistics},
  author = {Tatariya, Kushal and Lent, Heather and {de Lhoneux}, Miryam},
  editor = {Barnes, Jeremy and De Clercq, Orphée and Klinger, Roman},
  month = jul,
  year = {2023},
  pages = {365--378},
}