Most machine learning tutorials are geared toward well-defined datasets that fit in the memory of most machines. These datasets are great for benchmarking new algorithms and for learning. However, newer SOTA models have many more parameters, and they train in an effectively infinite data regime.
I ran into quite a few bugs while setting up an experiment with OpenWebText2, a clone of WebText containing over 40GB of text. In this blog post, I want to share some differences to consider when working in an infinite data regime and how to avoid common bugs.
Working in an infinite data regime means you won't have overfitting issues. You won't need to worry about reserving enough samples for training while holding enough out for testing and evaluation. Instead of setting a maximum number of epochs, you'll set a maximum number of steps, since you shouldn't need to see every sample (aka an entire epoch) to reach the lowest possible loss.
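A step-based loop over an infinite stream can be capped with `itertools.islice`; here is a minimal sketch, where `infinite_batches` and `MAX_STEPS` are hypothetical stand-ins for a real data pipeline and training budget:

```python
import itertools

def infinite_batches():
    """Hypothetical stand-in for a streaming pipeline: yields batches forever."""
    step = 0
    while True:
        yield step  # a real pipeline would yield a tensor batch here
        step += 1

MAX_STEPS = 1000  # assumed step budget; tune for your model

# islice caps the infinite stream at MAX_STEPS batches,
# so the loop terminates without ever defining an "epoch".
steps_run = 0
for batch in itertools.islice(infinite_batches(), MAX_STEPS):
    # train_step(model, batch)  # hypothetical training step
    steps_run += 1

print(steps_run)  # → 1000
```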
In an infinite data regime, it makes sense to prepare, tokenize, and batch the data on the fly. In contrast, a finite, indexable dataset can be transformed up front and often fits entirely in a GPU's memory. Understanding how to work in an infinite data regime will only become more critical for machine learning researchers and practitioners.
Below is how I use PyTorch's IterableDataset to stream from multiple files and create batches. You can use this dataset class with PyTorch's DataLoader class. It's important to remember that all shuffling and batching should be handled within your IterableDataset; set batch_size to None on your DataLoader to let it know that your dataset is already batching.
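A minimal sketch of such a dataset is below. The file paths, buffer-based shuffling, and batch size are illustrative assumptions, not the post's exact implementation:

```python
import random
from torch.utils.data import IterableDataset, DataLoader

class StreamingTextDataset(IterableDataset):
    """Stream lines from multiple files, shuffle with a buffer, batch internally."""

    def __init__(self, file_paths, batch_size=8, buffer_size=1000):
        self.file_paths = file_paths
        self.batch_size = batch_size
        self.buffer_size = buffer_size

    def _lines(self):
        # Stream raw lines file by file; nothing is loaded into memory at once.
        for path in self.file_paths:
            with open(path) as f:
                for line in f:
                    yield line.rstrip("\n")

    def _shuffled(self, lines):
        # Approximate shuffling: keep a buffer and yield random elements from it.
        buffer = []
        for line in lines:
            buffer.append(line)
            if len(buffer) >= self.buffer_size:
                yield buffer.pop(random.randrange(len(buffer)))
        while buffer:
            yield buffer.pop(random.randrange(len(buffer)))

    def __iter__(self):
        # Batching happens here, inside the dataset, not in the DataLoader.
        batch = []
        for line in self._shuffled(self._lines()):
            batch.append(line)
            if len(batch) == self.batch_size:
                yield batch
                batch = []

# batch_size=None tells the DataLoader the dataset already yields batches:
# loader = DataLoader(StreamingTextDataset(paths), batch_size=None)
```

Note that this sketch drops a trailing partial batch; in an infinite data regime the stream never ends, so that case rarely matters in practice.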
To handle the multiple transformations from raw text to a batched output, I use a generator for each transformation in the process.
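The idea can be sketched with two toy stages chained together; the whitespace tokenizer and stage names here are illustrative stand-ins for real transformations:

```python
def tokenize(texts):
    # Toy stage: whitespace split stands in for a real tokenizer (e.g. BPE).
    for text in texts:
        yield text.split()

def to_batches(examples, batch_size):
    # Toy stage: group tokenized examples into batches of batch_size.
    batch = []
    for ex in examples:
        batch.append(ex)
        if len(batch) == batch_size:
            yield batch
            batch = []

# Each stage consumes the previous generator lazily; nothing is materialized.
raw = iter(["a b c", "d e", "f g h i", "j"])
pipeline = to_batches(tokenize(raw), batch_size=2)
print(next(pipeline))  # → [['a', 'b', 'c'], ['d', 'e']]
```

Because every stage is a generator, only one batch's worth of data is in flight at a time, which is what makes streaming a 40GB corpus feasible.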