
On a large GPU cluster, DynamicBucketingSampler.__next__ spends a lot of time #1399

Open
shushanxingzhe opened this issue Oct 9, 2024 · 1 comment

shushanxingzhe commented Oct 9, 2024

@pzelasko When I use DynamicBucketingSampler on a cluster with 600 GPUs, the loop in __next__ at

for _ in range(self.world_size):

wastes a lot of time, since iterating over a world_size of 600 is slow. Could you please give me any advice on how to reduce that time?

pzelasko (Collaborator) commented Oct 9, 2024

I suggest either moving to the Lhotse Shar format (see the tutorial in the examples directory), or sharding your manifest into many small chunks and using CutSet.from_files with the random seed set to "trng", calling .repeat() on the CutSet (which makes it infinite), and then manually overriding the rank to 0 and the world size to 1 in the sampler on every GPU. Finally, you can wrap both the sampler and the dataset into IterableDatasetWrapper (although with non-Shar data it may not be needed). This makes the order of data iteration different on each dataloading worker instead of trying to deduplicate across ranks. In practice it works just as well, but you need to count training steps instead of epochs.
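
For concreteness, here is a minimal sketch of that recipe, assuming manifest shards matching a hypothetical path data/cuts/cuts-*.jsonl.gz, K2SpeechRecognitionDataset as a placeholder dataset, and that the exact keyword names (e.g. the seed argument of CutSet.from_files and the import path of IterableDatasetWrapper) match your Lhotse version:

from glob import glob

from torch.utils.data import DataLoader

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

# Read many small manifest shards; seed="trng" draws a true-random seed, so each
# dataloading worker iterates the shards in a different order. .repeat() makes the
# stream infinite, so training is counted in steps rather than epochs.
cuts = CutSet.from_files(sorted(glob("data/cuts/cuts-*.jsonl.gz")), seed="trng").repeat()

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=600.0,  # seconds of audio per batch; tune for your GPUs
    shuffle=True,
    # Every GPU pretends to be rank 0 of a world of size 1, so __next__ no longer
    # loops over all 600 ranks to deduplicate; dedup is replaced by randomized order.
    rank=0,
    world_size=1,
)

dataset = K2SpeechRecognitionDataset()  # placeholder; use your own dataset class

# Wrapping dataset + sampler in an iterable dataset runs the sampler inside each
# dataloader worker process (optional for non-Shar data, as noted above).
dloader = DataLoader(
    IterableDatasetWrapper(dataset=dataset, sampler=sampler),
    batch_size=None,
    num_workers=4,
)

With rank=0 and world_size=1, each GPU (and each dataloader worker) draws batches independently from its own randomly ordered stream, so no per-rank deduplication loop runs in __next__.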
