Padding, DataCollators and DataLoaders #17854
Unanswered · vikigenius asked this question in code help: NLP / ASR / TTS
It's hard to have a good grasp of how various libraries and their components interact.
Here are my requirements.
Right now I use the datasets library and its map function to pre-tokenize the data and convert it into torch format, like `ds.map(...).with_format("torch")`.
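Roughly, the current setup looks like this (a minimal sketch; the dataset, checkpoint, and column names are placeholders, not my actual code):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = load_dataset("imdb", split="train")

def tokenize(batch):
    # Padding happens here, once, at preprocessing time:
    # each map() batch is padded to its own longest sequence.
    return tokenizer(batch["text"], truncation=True, padding=True)

ds = ds.map(tokenize, batched=True).with_format("torch")
```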
But the issue is that if I then construct the DataLoader with shuffle on, the original padding is useless and you get batches with an unequal number of tokens per example.
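Concretely, this is the failure mode I mean (assuming the `ds` from the sketch above):

```python
from torch.utils.data import DataLoader

loader = DataLoader(ds, batch_size=32, shuffle=True)
# Shuffling mixes examples that were padded in different map() batches, so
# their lengths differ and the default collate cannot stack them; this
# typically raises "RuntimeError: stack expects each tensor to be equal size".
batch = next(iter(loader))
```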
Maybe I can address this with a different data collator? What is the purpose of a data collator, and can I use it to do the padding? The Transformers library warns that with a fast tokenizer it is much faster to pad in the original tokenizer call than to tokenize first and pad separately.
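For instance, would something along these lines be the intended use (a sketch assuming DataCollatorWithPadding and the tokenizer from above; tokenization no longer pads, and the raw text column is dropped so the collator only sees token fields)?

```python
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)  # no padding here

ds = load_dataset("imdb", split="train").map(
    tokenize, batched=True, remove_columns=["text"]
)

# Pads each batch to the length of its longest sequence, at load time.
collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

loader = DataLoader(ds, batch_size=32, shuffle=True, collate_fn=collator)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # (32, longest-in-this-batch)
```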
What is the most efficient way of building this pipeline in Lightning, keeping DDP scenarios in mind?
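And for Lightning, is something like this the right shape (a sketch with illustrative names; my understanding is that the Trainer injects a DistributedSampler automatically under DDP, so the DataLoader itself would not need changes)?

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

class TextDataModule(pl.LightningDataModule):
    def __init__(self, tokenizer, train_ds, batch_size=32, num_workers=4):
        super().__init__()
        self.collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
        self.train_ds = train_ds  # tokenized but unpadded, as above
        self.batch_size = batch_size
        self.num_workers = num_workers

    def train_dataloader(self):
        # Dynamic per-batch padding happens in the collator; under DDP,
        # Lightning wraps this loader with a DistributedSampler for us.
        return DataLoader(
            self.train_ds,
            batch_size=self.batch_size,
            shuffle=True,
            collate_fn=self.collator,
            num_workers=self.num_workers,
        )
```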