
Parallel Dataloader failing when using num_workers > 0 #1161

Open
aosakwe opened this issue May 16, 2024 · 1 comment

Comments

@aosakwe

aosakwe commented May 16, 2024

Hi,

I am trying to increase the number of workers used by the dataloader but have been encountering issues. I saw issues #625 and #626, which included the warning message, but I cannot find an example vignette showing how to properly implement the parallel dataloader. Would it be possible to have a brief example of this?

@dfalbel
Member

dfalbel commented May 16, 2024

When torch creates a parallel dataloader (num_workers > 0) it will create new R processes using callr and then copy the dataset you passed in into each of those processes. It will then run .getitem() in each of these processes.

Problems can arise when copying the dataset into those processes, for example:

  • if the dataset contains torch_tensors as attributes. Torch tensors are not serializable with saveRDS(), so it's hard to reliably move them between processes. The alternative in this case is to not have any dataset attribute that is a tensor.
  • the dataset has very large attributes. If your dataset has very large attributes, they will be copied into each process, potentially using a lot of memory.
  • the dataset has other kinds of attributes that cannot be copied with saveRDS(), e.g. connections, XML objects, or anything backed by a pointer.
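To illustrate the first point, here is a minimal sketch (the dataset name and fields are hypothetical) of a worker-safe dataset: it stores its data as plain R objects, which serialize fine with saveRDS(), and only creates tensors inside .getitem(), which runs in the worker process:

```r
library(torch)

# Hypothetical dataset that avoids tensor attributes: data is kept as a
# plain R matrix and vector, both of which can be copied to worker
# processes. Tensors are created on demand inside .getitem().
safe_ds <- dataset(
  name = "safe_dataset",
  initialize = function(n = 1000) {
    self$x <- matrix(rnorm(n * 10), nrow = n)  # plain R matrix, serializable
    self$y <- sample(0:1, n, replace = TRUE)   # plain R vector, serializable
  },
  .getitem = function(i) {
    list(
      x = torch_tensor(self$x[i, ]),  # tensor created in the worker
      y = torch_tensor(self$y[i])
    )
  },
  .length = function() {
    nrow(self$x)
  }
)()

# Safe to use with parallel workers, since copying the dataset only
# involves base R objects.
dl <- dataloader(safe_ds, batch_size = 32, num_workers = 2)
```

If the same data were stored as a torch_tensor attribute in initialize(), the copy to each worker process could fail or behave unreliably.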

Here's a small example running the mnist dataset in parallel:

library(torch)
library(torchvision)

# Download MNIST and convert each image to a tensor on access
dir <- "~/Downloads/mnist2"
train_ds <- mnist_dataset(
  dir,
  download = TRUE,
  transform = transform_to_tensor
)

# Four worker processes each run .getitem() on their own copy of the dataset
train_dl <- dataloader(train_ds, batch_size = 128, shuffle = TRUE, num_workers = 4)
d <- coro::collect(train_dl)
