HuggingFace dataloading from remote storage

Posted on Sat 06 May 2023 in code
Updated: Sat 06 May 2023 • 2 min read

HuggingFace is an open-source ecosystem of libraries that has made transformer-based architectures accessible to many programmers through its APIs. Being familiar with its technologies is therefore a good idea for anyone who wants to implement training and inference pipelines faster while still handling huge models and large amounts of data efficiently.

As we are dealing with a bunch of new technologies and methods, it's natural to face problems along the way. In my case, the problem was loading data that lives on remote (cloud) storage.

I couldn't use the ImageFolder/AudioFolder classes (which are so handy since they read all the files properly before you even notice) because they don't allow us to use data directories outside of our local filesystem. Therefore, I had to implement my own way to fetch the data and organize it appropriately.
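
As a reference, here is a minimal sketch of what that can look like: a map-style PyTorch dataset that lists and reads files from a remote bucket through fsspec. The RemoteAudioDataset name, the bucket path, and the decode_audio helper are placeholders I made up for illustration, not the exact code I used.

import io

import fsspec
import torch

class RemoteAudioDataset(torch.utils.data.Dataset):
    """Map-style dataset that reads raw files from remote storage on the fly."""

    def __init__(self, remote_dir="s3://my-bucket/train-audio"):  # hypothetical bucket
        # Resolve the filesystem (s3, gcs, azure, ...) from the URL prefix
        self.fs, root = fsspec.core.url_to_fs(remote_dir)
        self.paths = self.fs.ls(root)  # one entry per remote file

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Every access downloads one file over the network -- this is the slow part
        with self.fs.open(self.paths[idx], "rb") as f:
            raw = f.read()
        waveform = decode_audio(io.BytesIO(raw))  # hypothetical decoding helper
        return {"input_values": waveform}

Iterated naively, every __getitem__ call pays a full network round trip, which is exactly where the slowdown discussed below comes from.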

When we fetch data from a cloud service on the fly, reading each file can take a while depending on its size, which slows down the training process a lot (since reading usually happens as each batch is requested). But how do we overcome that? Depending on the task you're training your model for, you may not be able to compress the files, so you have to face the problem directly.

Considering that we have some CPUs available, a very common way to attack this problem is to let multiple workers fetch the data in parallel. Take a look at the snippets below:

# Using native PyTorch
import torch

loader = torch.utils.data.DataLoader(my_dataset, num_workers=16)

Or

# Using the HuggingFace Trainer: dataloader_num_workers is set on TrainingArguments
import transformers

args = transformers.TrainingArguments(output_dir="out", dataloader_num_workers=16)
trainer = transformers.Trainer(model=model, args=args,
                               train_dataset=train_dataset, eval_dataset=eval_dataset)

But once you start to iterate over the DataLoader batches, you will quickly find out that it's super slow (sometimes it is, in fact, stuck). The problem is that when the multiple workers try to access data on the cloud storage platform, they can't fetch it and keep hanging: fsspec keeps a global IO thread and asyncio event loop, and the copies the forked workers inherit are no longer usable.

Some people suggest fixing this behavior by using a different multiprocessing start method, such as spawn or forkserver, but since I was using the HuggingFace Trainer, I was afraid that this could interfere with the data-parallel training steps. The most straightforward fix is to set up a "worker init function" that resets fsspec's global state and pass it to the dataloader when instantiating it:

import fsspec.asyn
import torch

def worker_init(worker_id):
    # Reset fsspec's global IO thread and event loop so each forked worker
    # lazily creates fresh ones instead of reusing the parent's stale copies.
    fsspec.asyn.iothread[0] = None
    fsspec.asyn.loop[0] = None

loader = torch.utils.data.DataLoader(
    my_dataset,
    num_workers=16,
    worker_init_fn=worker_init,
)

Cool, but how can I still take advantage of all the useful features of the HuggingFace Trainer, such as the already-integrated logging platforms, the different callback methods, and the tons of ready-to-go metrics to evaluate our models?

To do so, you will need to create a custom trainer by inheriting from the transformers Trainer class and then override the methods responsible for creating the dataloaders (get_train_dataloader and get_eval_dataloader) so that they use the worker init function defined above, as in the sketch below.
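
Here is a rough sketch of what that subclass could look like (not my exact code): it assumes a map-style train_dataset, reuses the worker_init function from the snippet above, and RemoteStorageTrainer is a made-up name.

from torch.utils.data import DataLoader
from transformers import Trainer

class RemoteStorageTrainer(Trainer):
    # Only the dataloader construction changes; every other Trainer feature
    # (logging, callbacks, metrics) stays intact.

    def get_train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            collate_fn=self.data_collator,
            num_workers=self.args.dataloader_num_workers,
            shuffle=True,
            worker_init_fn=worker_init,
        )

    def get_eval_dataloader(self, eval_dataset=None):
        dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        return DataLoader(
            dataset,
            batch_size=self.args.eval_batch_size,
            collate_fn=self.data_collator,
            num_workers=self.args.dataloader_num_workers,
            worker_init_fn=worker_init,
        )

From there, RemoteStorageTrainer(model=model, args=args, train_dataset=..., eval_dataset=...) can be used exactly like a regular Trainer.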

