projects
SLALLM
Large language models are highly effective in widely spoken languages, but they struggle on languages that are underrepresented in their training corpora. In many cases, these languages lack large-scale corpora, so alternative training approaches are necessary.
In the Second Language Acquisition for LLMs project (SLALLM), we aim to develop a flexible training framework for LLMs that draws on insights from human language acquisition. Specifically, we use language learning materials such as textbooks and online courses. We are designing an iterative framework where the model is trained on one lesson at a time, combining synthetic data generation, LLM interplay, and training techniques such as reinforcement learning from human feedback (RLHF).
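As a rough illustration of the lesson-by-lesson loop, the sketch below trains on one lesson at a time with synthetic data expansion. The names (Lesson, generate_synthetic_examples, fine_tune) are placeholders, not the project's actual API.

```python
# Illustrative sketch of lesson-by-lesson training (hypothetical names,
# not the project's actual implementation).
from dataclasses import dataclass

@dataclass
class Lesson:
    vocabulary: list[str]
    example_sentences: list[str]

def generate_synthetic_examples(lesson: Lesson, n: int = 32) -> list[str]:
    """Placeholder: expand a lesson into synthetic training sentences,
    e.g. by prompting a stronger LLM with the lesson's vocabulary."""
    return lesson.example_sentences[:n]

def fine_tune(model, examples: list[str]):
    """Placeholder: one supervised fine-tuning (or RLHF) pass on the new data."""
    return model

def train_curriculum(model, lessons: list[Lesson]):
    # Train on one lesson at a time, so later lessons build on earlier ones.
    for lesson in lessons:
        synthetic = generate_synthetic_examples(lesson)
        model = fine_tune(model, synthetic)
    return model
```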
Finite-State Distillation
Finite-state transducers (FSTs) remain practical for high-performance string-to-string tasks (inflection, transliteration), but constructing FSTs by hand (or algorithmically) is difficult.
However, recurrent neural networks (RNNs), while not finite-state machines, have a close correspondence to FSTs. We can use trained RNNs, together with the state clustering algorithm of Giles (1991), to induce FSTs from datasets, even when the transductions are noisy or irregular.
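As a rough sketch of the state clustering idea, the code below runs an RNN over aligned input/output strings, clusters the visited hidden states with k-means, and reads off transitions between clusters. A randomly initialized RNN stands in for a trained one, and the whole example is illustrative: it simplifies the Giles-style algorithm rather than reproducing it.

```python
# Sketch of FST induction via hidden-state clustering (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 16, 5
W_in = rng.normal(size=(VOCAB, HIDDEN))   # stand-in for trained input weights
W_hh = rng.normal(size=(HIDDEN, HIDDEN))  # stand-in for trained recurrent weights

def rnn_step(h, symbol):
    return np.tanh(h @ W_hh + W_in[symbol])

def extract_fst(input_strings, output_strings, n_states=4):
    # 1. Run the RNN over each input string, recording the hidden states
    #    before and after every symbol, along with the aligned output symbol.
    records = []  # (prev_hidden, input_symbol, output_symbol, next_hidden)
    for xs, ys in zip(input_strings, output_strings):
        h = np.zeros(HIDDEN)
        for x, y in zip(xs, ys):
            h_next = rnn_step(h, x)
            records.append((h, x, y, h_next))
            h = h_next
    # 2. Cluster all visited hidden states into a small set of discrete states.
    states = np.array([r[0] for r in records] + [r[3] for r in records])
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(states)
    # 3. Read off transitions: (source state, input symbol) -> (target state, output symbol).
    n = len(records)
    fst = {}
    for i, (_, x, y, _) in enumerate(records):
        fst[(labels[i], x)] = (labels[n + i], y)
    return fst

# Toy aligned transduction data: map each symbol i to (i + 1) mod VOCAB.
inputs = [[0, 1, 2], [3, 4, 0], [2, 2, 1]]
outputs = [[(s + 1) % VOCAB for s in xs] for xs in inputs]
print(extract_fst(inputs, outputs))
```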
Automated Language Documentation
Automatic production of interlinear glossed text (IGT) has the potential to impact documentary linguistic workflows and to serve as an intermediate representation for NLP tasks (Ramos et al., 2024).
We study various methods for automatic IGT production, including pretrained multilingual models ("GlossLM", Ginn et al., 2024) and in-context learning with LLMs ("Can we teach", Ginn et al., 2024). Currently, we are developing PolyGloss, a single multilingual model for aligned segmentation and interlinear glossing.
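As an illustration of the in-context learning setup, the snippet below assembles a few-shot glossing prompt from IGT examples. The line markers and prompt wording are assumptions for illustration, not the exact format used in the papers.

```python
# Hedged sketch of a few-shot prompt for interlinear glossing.
def build_igt_prompt(examples, target_sentence):
    """examples: list of (transcription, gloss, translation) triples."""
    parts = ["Provide an interlinear gloss for the final sentence."]
    for transcription, gloss, translation in examples:
        parts.append(f"\\t {transcription}\n\\g {gloss}\n\\l {translation}")
    # Leave the gloss line of the target sentence for the LLM to complete.
    parts.append(f"\\t {target_sentence}\n\\g")
    return "\n\n".join(parts)

examples = [
    ("los gatos duermen", "DET.PL cat.PL sleep.3PL", "the cats sleep"),
]
print(build_igt_prompt(examples, "el perro duerme"))
```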