Is it correct to set up fsdp for a machine (V100) that does not support bf16? #274

xmc-andy · 2023-09-14T03:52:20Z

compute_environment: LOCAL_MACHINE
distributed_type: no
downcast_bf16: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 20687

Luodian · 2023-09-14T04:10:32Z

Yes, it seems correct!
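For readers checking their own hardware, here is a minimal sketch of how the `mixed_precision` value could be chosen from the GPU's CUDA compute capability. It assumes that native bf16 needs compute capability 8.0 (Ampere) or newer, while the V100 reports 7.0. The helper name `pick_mixed_precision` is hypothetical and not part of accelerate.

```python
def pick_mixed_precision(compute_capability_major: int) -> str:
    """Return a value suitable for the `mixed_precision` config field."""
    # Assumption: bf16 is natively supported from compute capability 8.0 (Ampere) on.
    if compute_capability_major >= 8:
        return "bf16"
    # Older GPUs such as the V100 (compute capability 7.0) fall back to fp16.
    return "fp16"


# The V100 reports major version 7, the A100 reports major version 8.
v100_choice = pick_mixed_precision(7)
a100_choice = pick_mixed_precision(8)
print(v100_choice, a100_choice)
```

With PyTorch installed, the major version can be read from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple.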

xmc-andy · 2023-09-14T04:14:19Z

OK, thank you. I also want to ask about another problem: the main thread's memory usage is higher than the other threads', and it overflows. How should I solve this? Do you have any suggestions?

…

Luodian · 2023-09-14T06:11:32Z

I think you can refer to this link to see if it helps:

https://github.com/huggingface/accelerate/blob/6b3e559926afc4b9a127eb7762fc523ea0ea656a/src/accelerate/big_modeling.py#L514

I know that you may be able to set device_map="balanced_low_0" to decrease GPU memory usage on rank 0 (rank 0 performs the gather operations, and sometimes other parameters are moved to rank 0, which can lead to OOM).
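As a rough sketch of the same idea, one could also cap GPU 0 explicitly through a `max_memory` mapping (device index to a size string such as "8GiB"), which is the format accelerate and transformers accept next to `device_map`. The helper `build_max_memory` and the memory figures below are hypothetical, chosen only for illustration on 16 GB cards; they are not taken from this thread.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int, rank0_gib: int) -> dict:
    """Build a max_memory-style mapping that leaves headroom on GPU 0."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Start with the same cap on every GPU...
    budget = {gpu: f"{per_gpu_gib}GiB" for gpu in range(num_gpus)}
    # ...then lower the cap on GPU 0, which also has to hold gathered tensors.
    budget[0] = f"{rank0_gib}GiB"
    return budget


# Illustrative numbers only: three GPUs, with GPU 0 capped lower than the rest.
max_memory = build_max_memory(num_gpus=3, per_gpu_gib=14, rank0_gib=8)
print(max_memory)

# Hypothetical usage with a Hugging Face model class (not executed here):
# model = SomeModel.from_pretrained(path, device_map="balanced_low_0")
# or, with the explicit cap built above:
# model = SomeModel.from_pretrained(path, device_map="auto", max_memory=max_memory)
```

Whether this avoids the OOM depends on the model size and on how much memory activations and gradients need during training, so treat the caps as a starting point to tune.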

Luodian · 2023-09-14T06:12:15Z

I have seen some code do this before, but I haven't used it myself, so you may want to look into the device_map mechanism and how to set it. You are welcome to share your experience with us to help more users tackle this problem on V100 GPUs~

xmc-andy · 2023-09-14T06:34:04Z

Thank you for the suggestions, I will try them.

xmc-andy · 2023-09-14T12:44:46Z

I tried setting device_map to 'auto', 'balanced', 'balanced_low_0' and 'sequential' in turn. Unfortunately, it still runs out of memory on 3 V100s (unfrozen ViT). In comparison, I think balanced_low_0 might work if I had enough cards; I will try it further if I get 4 V100s.
