Padding, DataCollators and DataLoaders #17854
Unanswered · vikigenius asked this question in code help: NLP / ASR / TTS
It's hard to have a good grasp of how various libraries and their components interact.
Here are my requirements.
Right now I use the datasets library and its map function to pre-tokenize the data and convert it into torch format, like `ds.map(...).with_format("torch")`.
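Roughly, the current setup looks like this (a minimal sketch; the dataset, checkpoint, and column names are placeholders, not my actual code):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = load_dataset("imdb", split="train")

def tokenize(batch):
    # Padding happens here, once, at preprocessing time:
    # each map() batch is padded to its own longest sequence.
    return tokenizer(batch["text"], truncation=True, padding=True)

ds = ds.map(tokenize, batched=True).with_format("torch")
```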
But the issue is that if I then construct the DataLoader with shuffle on, the original padding is useless and you get batches with an unequal number of tokens per example.
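Concretely, this is the failure mode I mean (assuming the `ds` from the sketch above):

```python
from torch.utils.data import DataLoader

loader = DataLoader(ds, batch_size=32, shuffle=True)
# Shuffling mixes examples that were padded in different map() batches, so
# their lengths differ and the default collate cannot stack them; this
# typically raises "RuntimeError: stack expects each tensor to be equal size".
batch = next(iter(loader))
```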
Maybe I can address this with a different data collator? What is the purpose of a data collator, and can I use it to do the padding? The Transformers library warns that with a fast tokenizer it is much faster to pad in the original tokenizer call than to tokenize first and pad separately.
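For instance, would something along these lines be the intended use (a sketch assuming DataCollatorWithPadding and the tokenizer from above; tokenization no longer pads, and the raw text column is dropped so the collator only sees token fields)?

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)  # no padding here

ds = load_dataset("imdb", split="train").map(
    tokenize, batched=True, remove_columns=["text"]
)

# Pads each batch to the length of its longest sequence, at load time.
collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

loader = DataLoader(ds, batch_size=32, shuffle=True, collate_fn=collator)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # (32, longest-in-this-batch)
```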
What is the most efficient way of building this pipeline in Lightning, keeping DDP scenarios in mind?
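And for Lightning, is something like this the right shape (a sketch with illustrative names; my understanding is that the Trainer injects a DistributedSampler automatically under DDP, so the DataLoader itself would not need changes)?

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

class TextDataModule(pl.LightningDataModule):
    def __init__(self, tokenizer, train_ds, batch_size=32, num_workers=4):
        super().__init__()
        self.collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
        self.train_ds = train_ds  # tokenized but unpadded, as above
        self.batch_size = batch_size
        self.num_workers = num_workers

    def train_dataloader(self):
        # Dynamic per-batch padding happens in the collator; under DDP,
        # Lightning wraps this loader with a DistributedSampler for us.
        return DataLoader(
            self.train_ds,
            batch_size=self.batch_size,
            shuffle=True,
            collate_fn=self.collator,
            num_workers=self.num_workers,
        )
```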