
Training low perf #58

Open
bhack opened this issue Apr 10, 2024 · 10 comments

Comments

bhack commented Apr 10, 2024

Can I ask you for some details about your performance numbers for base-model training? How much time does a forward and backward pass take? How much time does the dataloader take?

I find it very hard to get even a minimally decent GPU load, even when reading the data from a local SSD. I've also tested with a setup similar to the one in your paper, with A100 GPUs.

Have you tested it with PyTorch 2.x?

@hkchengrex (Owner)

I didn't time each component specifically. For pre-training/main training, each iteration took around 0.28/0.66 seconds.
The GPU load should be near 99% most of the time -- which means the GPU shouldn't have to wait for the dataloader at all. Sometimes the CPUs can be a bottleneck, but that really depends on the hardware. I have tested this with PyTorch 2.0 and didn't observe any significant slowdown.
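
If you want to check whether the dataloader is the bottleneck on your side, a rough timing split like the one below should work (a minimal sketch, not this repo's actual training loop; `model`, `optimizer`, and the batch format are placeholders):

```python
import time
import torch

def measure(loader, model, optimizer, device="cuda", max_iters=100):
    """Split per-iteration time into dataloader wait vs. forward+backward compute."""
    data_t, compute_t = 0.0, 0.0
    end = time.perf_counter()
    for it, batch in enumerate(loader):
        data_t += time.perf_counter() - end          # time spent waiting for data

        batch = batch.to(device, non_blocking=True)  # placeholder: depends on the batch format
        torch.cuda.synchronize()
        start = time.perf_counter()

        loss = model(batch).mean()                   # placeholder forward pass returning a loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        torch.cuda.synchronize()                     # flush queued GPU work before timing
        compute_t += time.perf_counter() - start
        end = time.perf_counter()
        if it + 1 >= max_iters:
            break
    print(f"data wait: {data_t / (it + 1):.3f}s/iter, compute: {compute_t / (it + 1):.3f}s/iter")
```

If most of the per-iteration time shows up as data wait, the dataloader (CPU-side decoding and augmentation) is the limit rather than the GPU.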

bhack commented Apr 10, 2024

Thanks, it's important to have a reference point. In my tests we are barely under 20% GPU occupancy... I am investigating.

bhack commented Apr 12, 2024

I've tested both the original and the compiled code with the latest stable PyTorch and with a PyTorch nightly, on A100 and H100, with different numbers of workers, different numbers of GPUs, a larger batch size, a local SSD, larger images (e.g. DAVIS full-res), and larger crops to fill the memory.

In any of these configurations I've achieved a decent GPU load with the base model.

@hkchengrex (Owner)

You have (all good then)? Or you haven't...?

bhack commented Apr 12, 2024

No, even in the best combination the load is always around 20%.

@hkchengrex (Owner)

I see. I think there is a typo in your previous comment.
How is the CPU usage (like, with top)?

bhack commented Apr 12, 2024

It is quite high... of course it depends on the number of workers. E.g., the H100 instance has 207 cores; with 98 workers and batch size 32 we see an average CPU load of 50-55%.

hkchengrex commented Apr 13, 2024

I just tried with the latest code and PyTorch (small model). This is on a different machine and I had to increase the number of workers in the pre-training stage to 32. I couldn't get it to 90%+ utilization on average, but it is a lot better than 20%. With this utilization the avg_time is similar -- 0.283/0.801 for pre-training/main training after warm-up. The pre-training stage is more CPU-intensive and has lower GPU utilization.

For reference, below are the screenshots during pre-training and main training respectively. It is likely that with better GPUs like H100, the CPUs would need to work extra hard to keep the GPUs fed but in any case, they should not be slower than the 0.283/0.801 avg_time.

Pre-training:
[Screenshot from 2024-04-13 14-26-56]

Main training:
[Screenshot from 2024-04-13 14-18-47]
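
If a stage turns out to be CPU-bound, the standard DataLoader knobs are the first thing to check. Below is a generic sketch using stock torch.utils.data.DataLoader arguments (the values, and the config keys this repo actually exposes, may differ):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # whichever Dataset the training stage uses
    batch_size=16,
    num_workers=32,           # raise until CPU cores or RAM become the limit
    pin_memory=True,          # faster host-to-GPU copies with non_blocking=True
    persistent_workers=True,  # keep worker processes alive across epochs
    prefetch_factor=4,        # batches prefetched per worker (default is 2)
    drop_last=True,
)
```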

@hkchengrex (Owner)

Are you getting "good" avg_time?

bhack commented Apr 13, 2024

Currently I am testing only the main_training stage.
With the extra RAM on the H100 I've increased the batch size to 32 and num_workers to 64, but I've also tested up to 128 workers with 8 GPUs.
To also check the balance between file transfer, processing, and network load, I've tried using DAVIS full-res instead of DAVIS 480p, doubling the crop_size.

With batch_size 32 and the doubled crop_size we get avg_time ~3.2.
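
To see where those ~3.2 s per iteration actually go, torch.profiler can break a few iterations down. A minimal sketch, where `loader` and `train_step` stand in for the actual dataloader and training step:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=5, warmup=5, active=10),
) as prof:
    for it, batch in enumerate(loader):
        train_step(batch)   # placeholder: forward + backward + optimizer step
        prof.step()
        if it >= 20:
            break

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```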
