Events

Language Similarity in Machine Translation: A Typological Perspective

Speaker:

drs. Esther Ploeger

Date:

2024-08-01 at 13:00–14:00

Location:

Department of Computer Science (Celestijnenlaan 200A) -- Java (room 5.152)

Abstract:

State-of-the-art performance in machine translation is currently achieved by training models on ever-larger amounts of text. While this data-driven approach has substantially increased translation quality in some cases, languages for which little such data is available often remain underserved. Moreover, the approach is increasingly computationally expensive. Linguistic typology offers potential for mitigating these issues: typologists have systematically analysed the similarities and differences between many of the world’s languages, resulting in comprehensive linguistic descriptions and databases. This talk will address challenges and opportunities in applying what we know about languages to make translation models more efficient and more accurate in low-resource settings.
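
To make the "databases" part concrete: resources such as WALS and URIEL expose typological features in machine-readable form. The sketch below is an illustration only, not the speaker's method; it assumes the lang2vec Python package, and the choice of languages and the overlap count are illustrative assumptions.

```python
# A minimal sketch (not from the talk) of querying a typological database.
# Assumes the lang2vec package, which exposes URIEL/WALS-derived feature
# vectors; language codes are ISO 639-3.
import lang2vec.lang2vec as l2v

# Fetch WALS-derived syntactic features for a high-resource language
# (English) and a lower-resource one (Yoruba) -- illustrative choices.
features = l2v.get_features(["eng", "yor"], "syntax_wals")

eng, yor = features["eng"], features["yor"]

# Missing values are encoded as "--"; counting agreement on the features
# known for both languages gives a crude similarity signal.
shared = sum(
    1 for a, b in zip(eng, yor)
    if a != "--" and b != "--" and a == b
)
print(f"Shared known syntactic feature values: {shared}/{len(eng)}")
```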

Bio:

Esther Ploeger is a PhD student at the Department of Computer Science at Aalborg University, Copenhagen. Her research focuses on leveraging knowledge about language and cross-linguistic tendencies (linguistic typology) in practical natural language processing applications, such as machine translation. Prior to this, she obtained a BSc and an MSc in Information Science at the University of Groningen in the Netherlands. → Website

When Language Models Meet Words

Speaker:

Dr. Yuval Pinter
Senior Lecturer at the Department of Computer Science of Ben-Gurion University

Date:

2023-09-27 at 11:00–12:00

Location:

Department of Computer Science (Celestijnenlaan 200A) -- Java (room 5.152)

Abstract:

Over the last few years, deep neural models have taken over the field of natural language processing (NLP), bringing great improvements on many of its sequence-level tasks. But the end-to-end nature of these models makes it hard to tell whether the way they represent individual words aligns with how language builds itself from the bottom up, how lexical changes in register and domain affect the untested aspects of such representations, or which phenomena can be modeled by units smaller than the word. In this talk, I will present NYTWIT, a dataset created to challenge large language models (LLMs) at the lexical level: it tasks them with identifying the processes that lead to the formation of novel English words, and with segmenting and recovering the specific subclass of lexical blends, demonstrating the ways in which subword-tokenized LLMs fail to analyze such words. I will then present Nakdimon, a lightweight Hebrew diacritizer that avoids tokenization artifacts by working at the character level alone, and SaGe, a subword tokenizer that incorporates context into the vocabulary-creation objective.
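
As a rough illustration of the tokenization issue the abstract raises, the sketch below (not part of the talk) shows how an off-the-shelf subword tokenizer segments a novel lexical blend. It assumes the Hugging Face transformers package; GPT-2's BPE tokenizer and the example words stand in, as assumptions, for any subword-based LLM.

```python
# Illustrative only (not from the talk): a frequency-driven BPE vocabulary
# has no reason to split a novel blend along its morphological seam.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "glamping" blends "glamorous" + "camping"; compare how the tokenizer
# segments the blend versus its source words.
for word in ["camping", "glamorous", "glamping"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {pieces}")
```

Whatever segmentation the tokenizer produces for the blend, it is determined by corpus statistics rather than by the blend's source words, which is one way such models can fail to analyze novel formations.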

Bio:

Yuval Pinter is a Senior Lecturer in the Department of Computer Science at Ben-Gurion University of the Negev, focusing on natural language processing as PI of the MeLeL lab. Yuval received his PhD from the Georgia Institute of Technology's School of Interactive Computing as a Bloomberg Data Science PhD Fellow. Before that, he worked as a Research Engineer at Yahoo Labs and as a Computational Linguist at Ginger Software, and obtained an MA in Linguistics and a BSc in CS and Mathematics, both from Tel Aviv University. Yuval blogs (in Hebrew) about language matters on Dagesh Kal. → Website