When Language Models Meet Words

Speaker:

Dr. Yuval Pinter
Senior Lecturer at the Department of Computer Science of Ben-Gurion University

Date:

2023-09-27 at 11:00–12:00

Location:

Department of Computer Science (Celestijnlaan 200A) -- Java (room 5.152)

Abstract:

Over the last few years, deep neural models have taken over the field of natural language processing (NLP), delivering great improvements on many of its sequence-level tasks. But the end-to-end nature of these models makes it hard to figure out whether the way they represent individual words aligns with how language builds itself from the bottom up, how lexical changes in register and domain can affect the untested aspects of such representations, or which phenomena can be modeled by units smaller than the word. In this talk, I will present NYTWIT, a dataset created to challenge large language models (LLMs) at the lexical level. It tasks them with identifying the processes that lead to the formation of novel English words, as well as with segmenting and recovering the specific subclass of lexical blends, demonstrating the ways in which subword-tokenized LLMs fail to analyze them. I will then present Nakdimon, a lightweight Hebrew diacritizer that avoids tokenization artifacts by working at the character level alone; and SaGe, a subword tokenizer that incorporates context into the vocabulary creation objective.

Bio:

Yuval Pinter is a Senior Lecturer in the Department of Computer Science at Ben-Gurion University of the Negev, focusing on natural language processing as PI of the MeLeL lab. Yuval got his PhD at the Georgia Institute of Technology School of Interactive Computing as a Bloomberg Data Science PhD Fellow. Prior to this, he worked as a Research Engineer at Yahoo Labs and as a Computational Linguist at Ginger Software, and obtained an MA in Linguistics and a BSc in CS and Mathematics, both from Tel Aviv University. Yuval blogs (in Hebrew) about language matters on Dagesh Kal.