Michael Ginn
/maɪkəl dʒɪn/
I'm a fourth-year Ph.D. student at the University of Colorado studying natural language processing, machine learning, and occasionally linguistics, supervised by Prof. Alexis Palmer and Prof. Mans Hulden.
My Research Journey
I started my Ph.D. with a software engineering background, an interest in language (especially rare languages), and very little understanding of what machine learning research actually entailed. I've ended up primarily working on NLP for endangered and rare languages, and I'm grateful it turned out this way, for several reasons.
First, it is an area where I truly believe there is an opportunity to do some good in the world. Of course, we have to be careful, since NLP research for language revitalization historically has not lived up to its promises and can sometimes be harmful. Nonetheless, I firmly believe that by developing practical, usable, open-source techniques that work for any language, we at least give speakers of endangered languages tools they can use in pursuit of revitalization.
Second, as someone who had no ML experience prior to my program, it has been an area where I feel I can make a real impact, on both fundamental and applied problems. There is considerably less investment in the area (from either industry or academia) and therefore far less research activity than in the hottest NLP fields. Moreover, much of the research that is conducted consists of small monolingual studies that do not necessarily generalize to other languages. My goal, then, has been to apply cutting-edge techniques from multilingual research to find approaches that work broadly across languages. Furthermore, since these languages rarely have much language technology, it's a rare opportunity for researchers to interact directly with tools and software that real users use. With the BELT platform and the GlossLM user studies, we had to consider not just fundamental ML science, but also the human-computer interaction issues that prevented real-world usage.
Finally, working on endangered and rare languages has allowed me to engage with a huge variety of techniques and areas of NLP/ML. Early on, I ran small experiments on topics like out-of-domain generalization, static embeddings, and data augmentation/synthetic data. I trained baselines for the first interlinear glossing shared task, which helped develop my fundamentals in training and evaluating models. With the advance of LLMs, I ran an early study on in-context learning and retrieval for rare languages. Then, I tried to apply the Bitter Lesson to language documentation by training a multilingual glossing model on as much data as we could find. This project, and the follow-up PolyGloss model, gave me an appreciation for the engineering that goes into large-scale ML: distributed training, mixed precision, quantization, LoRA adapters, and so on.
Along the way, I've trained hundreds of thousands of small models on tiny datasets, which has helped me build strong intuitions about hyperparameters, loss curves, etc. I built an even stronger intuition for recurrent nets by converting them into automata. I've learned some good lessons about doing ML research:
- Log everything, even when you don't think you'll need it
- Be proactive about organizing experiments
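The first lesson is more a habit than a framework. One minimal way to practice it (this `log_run` helper and the paths are illustrative sketches, not from any particular project of mine) is to append every run's config and metrics to a JSONL file, even for throwaway experiments:

```python
import json
import time
from pathlib import Path


def log_run(log_dir, config, metrics):
    """Append one experiment record (timestamp + config + metrics) to a JSONL log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": time.time(), "config": config, "metrics": metrics}
    # One JSON object per line, so the log is append-only and easy to grep/load later.
    with open(log_dir / "runs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Log every run, even the ones you think you'll never look at again.
record = log_run(
    "logs/glossing",
    config={"lr": 3e-4, "hidden_size": 256, "seed": 0},
    metrics={"dev_accuracy": 0.87, "train_loss": 0.42},
)
```

Append-only JSONL keeps old runs intact by design, which is exactly what "log everything" asks for: the cost of a few extra lines on disk is nothing compared to re-running an experiment you forgot to record.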