
Training low perf #58

Open
bhack opened this issue Apr 10, 2024 · 10 comments

Comments

bhack commented Apr 10, 2024

Can I ask you for some details about your performance numbers for base-model training? How much time does a forward and backward pass take? How much time does the dataloader take?

I find it very hard to get even a minimally decent GPU load, even when reading the data from a local SSD. I've also tested with a setup similar to the one in your paper, with A100 GPUs.

Have you tested it with PyTorch 2.x?

@hkchengrex (Owner)

I didn't time each component specifically. For pre-training/main training, each iteration took around 0.28/0.66 seconds.
The GPU load should be near 99% most of the time -- which means the GPU shouldn't have to wait for the dataloader at all. Sometimes the CPUs can be a bottleneck, but that really depends on the hardware. I have tested this with PyTorch 2.0 and didn't observe any significant slowdown.
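
If you want to check whether the dataloader is the bottleneck on your side, a rough timing split like the one below should work (a minimal sketch, not this repo's actual training loop; `model`, `optimizer`, and the batch format are placeholders):

```python
import time
import torch

def measure(loader, model, optimizer, device="cuda", max_iters=100):
    """Split per-iteration time into dataloader wait vs. forward+backward compute."""
    data_t, compute_t = 0.0, 0.0
    end = time.perf_counter()
    for it, batch in enumerate(loader):
        data_t += time.perf_counter() - end          # time spent waiting for data

        batch = batch.to(device, non_blocking=True)  # placeholder: depends on the batch format
        torch.cuda.synchronize()
        start = time.perf_counter()

        loss = model(batch).mean()                   # placeholder forward pass returning a loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        torch.cuda.synchronize()                     # flush queued GPU work before timing
        compute_t += time.perf_counter() - start
        end = time.perf_counter()
        if it + 1 >= max_iters:
            break
    print(f"data wait: {data_t / (it + 1):.3f}s/iter, compute: {compute_t / (it + 1):.3f}s/iter")
```

If most of the per-iteration time shows up as data wait, the dataloader (CPU-side decoding and augmentation) is the limit rather than the GPU.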

bhack commented Apr 10, 2024

Thanks, it's important to have a reference point. In my tests we are barely under 20% GPU occupancy... I am investigating.

bhack commented Apr 12, 2024

I've tested both the original and the compiled code with the latest stable PyTorch and with a PyTorch nightly, on A100 and H100, with different numbers of workers, different numbers of GPUs, a larger batch size, a local SSD, larger images (e.g. DAVIS full-res), and larger crops to fill the memory.

In any of these configurations I've achieved a decent GPU load with the base model.

@hkchengrex (Owner)

You have (all good then)? Or you haven't...?

bhack commented Apr 12, 2024

No, even in the best combination the load is always around 20%.

@hkchengrex (Owner)

I see. I think there is a typo in your previous comment.
How is the CPU usage (like, with top)?

bhack commented Apr 12, 2024

It is quite high... of course it depends on the number of workers. E.g., the H100 instance has 207 cores; with 98 workers and batch size 32 we see an average CPU load of 50-55%.

hkchengrex commented Apr 13, 2024

I just tried with the latest code and PyTorch (small model). This is on a different machine and I had to increase the number of workers in the pre-training stage to 32. I couldn't get it to 90%+ utilization on average, but it is a lot better than 20%. With this utilization the avg_time is similar -- 0.283/0.801 for pre-training/main training after warm-up. The pre-training stage is more CPU-intensive and has lower GPU utilization.

For reference, below are the screenshots during pre-training and main training respectively. It is likely that with better GPUs like H100, the CPUs would need to work extra hard to keep the GPUs fed but in any case, they should not be slower than the 0.283/0.801 avg_time.

Pre-training:
[Screenshot from 2024-04-13 14-26-56]

Main training:
[Screenshot from 2024-04-13 14-18-47]
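
If a stage turns out to be CPU-bound, the standard DataLoader knobs are the first thing to check. Below is a generic sketch using stock torch.utils.data.DataLoader arguments (the values, and the config keys this repo actually exposes, may differ):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # whichever Dataset the training stage uses
    batch_size=16,
    num_workers=32,           # raise until CPU cores or RAM become the limit
    pin_memory=True,          # faster host-to-GPU copies with non_blocking=True
    persistent_workers=True,  # keep worker processes alive across epochs
    prefetch_factor=4,        # batches prefetched per worker (default is 2)
    drop_last=True,
)
```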

@hkchengrex (Owner)

Are you getting "good" avg_time?

bhack commented Apr 13, 2024

Currently I am testing only the main_training stage.
With the extra RAM on the H100 I've increased the batch size to 32 and num_workers to 64, but I've also tested up to 128 workers with 8 GPUs.
To also check the balance between file transfer, processing, and network load, I've tried using DAVIS full-res instead of DAVIS 480p, doubling the crop_size.

With batch_size 32 and the doubled crop_size we get avg_time ~3.2.
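
To see where those ~3.2 s per iteration actually go, torch.profiler can break a few iterations down. A minimal sketch, where `loader` and `train_step` stand in for the actual dataloader and training step:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=5, warmup=5, active=10),
) as prof:
    for it, batch in enumerate(loader):
        train_step(batch)   # placeholder: forward + backward + optimizer step
        prof.step()
        if it >= 20:
            break

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```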
