11 december 2025 - how to train your low-resource model
I've trained something like 50,000 small ML models so far in my Ph.D. Often I'm working with datasets at the absolute lowest scale, with as few as 100 examples for a difficult task. My first claim is that even in these settings it is very possible to train a non-random model, and likely even a pretty decent one.
The first obvious step is to pick a nice strong base model--there's basically no reason to ever start from nothing. I am fond of the now-unpopular encoder-decoder transformers (e.g. T5), which have a really nice theoretical interpretation. If your language uses an odd script, make sure your base model had some pretraining exposure to it, or just romanize.
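As a crude fallback, Unicode decomposition gets you a rough romanization for scripts whose characters decompose to Latin base letters plus diacritics (a minimal sketch using only the standard library; for genuinely non-Latin scripts you'd want a proper transliteration tool):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Crude romanization: decompose each character (NFD) and drop
    the combining marks. Only helps when the base letters are Latin."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Câfé niño"))  # prints "Cafe nino"
```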
I tend to find that synthetic data is not worth it for seq2seq tasks (and possibly not for classification tasks either). In many of the experiments I've seen, the same improvement could have been achieved by training the model better. This fits with the view that augmentation is a form of regularization, but it's also easy to drastically shift your training distribution.
Hyperparameters matter far, far more in the very low-resource regime. While a million-example dataset will probably work reasonably well across a lot of different settings, this is not true at 100 examples. If you have the compute, run massive sweeps; otherwise you'll need pretty good intuitions. Definitely optimize the learning rate, the batch size (probably large relative to the training set), model width and depth, and some regularization hparams (dropout, weight decay). Low-capacity models (small embedding dimension, few layers and attention heads, strong regularization) will be harder to train but should generalize well.
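A sweep over those hparams can be as simple as random search. A minimal sketch (the search space below is illustrative, not a recommendation, and `objective` stands in for a hypothetical train-then-evaluate-on-dev function):

```python
import random

# Illustrative search space -- tune the ranges to your own setup.
SPACE = {
    "lr": [3e-5, 1e-4, 3e-4, 1e-3],
    "batch_size": [16, 32, 64],   # large relative to ~100 examples
    "d_model": [64, 128, 256],
    "num_layers": [1, 2, 3],
    "dropout": [0.1, 0.3, 0.5],
    "weight_decay": [0.0, 0.01, 0.1],
}

def sample_config(rng):
    """Draw one random configuration from the space."""
    return {k: rng.choice(v) for k, v in SPACE.items()}

def run_sweep(objective, n_trials=50, seed=0):
    """Random search: sample configs, keep the best dev score."""
    rng = random.Random(seed)
    best_score, best_cfg = float("-inf"), None
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = objective(cfg)  # hypothetical: train + score on dev
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

At equal budget, random search tends to beat a grid because it doesn't waste trials on unimportant dimensions.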
Don't trust noisy metrics. If you plot the metric of interest (accuracy, BLEU, etc.) alongside the loss, you will often see that while the loss improves smoothly, the target metric jumps around wildly. Be cautious about over-interpreting small metric differences, since these could just be noise. Use multiple runs, cross-validation, etc. to reduce variance.
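One cheap way to check whether a small metric gap is real rather than noise is a paired bootstrap over per-example scores (e.g. 0/1 accuracies or sentence-level BLEU). A sketch, with the resample count chosen arbitrarily:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Resample the test set with replacement and count how often
    system A's total beats system B's. Values near 0.5 suggest the
    observed gap is noise; values near 1.0 suggest A really is better."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```

On a 100-example test set this matters a lot: a single flipped prediction moves accuracy by a full point.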
Some architecture and training tricks are easy enough to implement and worth it (tied embedding/unembedding, RoPE, beam search, aggressive gradient clipping); others maybe aren't (teacher forcing). Most of all, keep it simple! Formulate the inputs and outputs well so that the model has to learn as little as possible.
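Tying the embedding and unembedding, for instance, just means reusing the input embedding matrix (transposed) as the output projection, which halves those parameters; with a small vocabulary and model width that's a large fraction of the whole network. A minimal numpy sketch (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 1000, 64

# One shared matrix serves as both input embedding and output projection.
E = rng.normal(0.0, 0.02, size=(vocab, d_model))

def embed(token_ids):
    """Look up input embeddings: (seq,) -> (seq, d_model)."""
    return E[token_ids]

def logits(hidden):
    """Tied unembedding: project back with E's transpose,
    (seq, d_model) -> (seq, vocab)."""
    return hidden @ E.T

h = embed(np.array([1, 2, 3]))   # pretend this passed through the model body
out = logits(h)                  # shape (3, 1000)
```

An untied head would need a second vocab-by-d_model matrix; here that would be another 64,000 parameters.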
mg