I’m excited to be joining the Fall 2020 cohort of OpenAI’s Scholars program. I’ll be writing semi-regularly as a log of what I’m learning and thinking about.
I’m glad to be part of the Scholars program because I find learning in a group motivating and useful, especially now that everyone is more isolated. It has also been helpful to ask questions of, and learn from, people who have already been thinking about my research interests.
One of my high-level goals for this experience is to develop “taste”, or an “aesthetic”, for deep learning research.
For the past two weeks, I’ve been reading about generalization and language models. I’ve also been working on reimplementing the smaller transformers from the Scaling Laws for Neural Language Models paper.
Scaling Laws for Neural Language Models
The paper uses a decoder-only transformer for most of its experiments, in addition to LSTM models and the Universal Transformer. For now, I’ll focus on reproducing the smaller-scale experiments with the transformer architecture. To better understand the decoder-only architecture, I read the original GPT paper. It’s surprising to remember that this paper is only ~2 years old. I plan to train initially on datasets available via HuggingFace’s Datasets library, and look into this WebText scraper later.
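To make "decoder-only" concrete for myself, here is a minimal sketch of single-head causal self-attention in NumPy. It is not the paper's implementation: the function name, the weight matrices, and the toy sizes are all my own assumptions, and it leaves out multiple heads, residual connections, layer norm, and the feed-forward block. The only point is the mask that stops each position from attending to later ones.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (seq_len, d_model) input.

    Hypothetical helper for illustration; returns (output, attention_weights).
    """
    seq_len = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Scaled dot-product scores between every pair of positions.
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights

# Toy sizes and random weights, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

The attention weights come out lower-triangular, so the first token can only attend to itself and its output is just its own value vector.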
I found these resources really useful for understanding and implementing the transformer architecture.
- The Illustrated Transformer – Jay Alammar
- Illustrated Guide to Transformers
- The Annotated Transformer
Coincidentally, Jared Kaplan, one of the paper’s authors, gave a talk on scaling laws this past Wednesday. The slides and the video from the talk can be accessed here on the Physics ∩ ML website.
My mentor suggested these other relevant language model papers to read:
- Transformer XL
- Universal Transformer
- RL from human feedback
- Critical batch size paper
I’ve also been thinking about model generalization this week. Some questions on my mind:
- What are the differences between generalization and memorization for some of these larger models with smaller datasets?
- What is the minimum amount of data required to generalize?
- What are other factors that allow models to generalize quickly?
- Are there similar scaling law-esque properties for model generalization?
- What does it look like for a model to generalize well on out-of-distribution data?
Some papers I’ve been reading about generalization: