Fine-tuning fails after installation from source #393
Comments
@devon-research Can you please install from source? It works on my end. BTW, we did some refactoring recently, so it would be great to pull the latest first. We are planning a release soon. The HSDP/device_mesh support was added recently and is not present in the binaries yet.
Hi! It seems that a solution has been provided for this issue and there has not been a follow-up conversation for a long time. I will close this issue for now; feel free to reopen it if you have any questions!
System Info
I use the base Docker image pytorch/pytorch. I then run
Information
Code to reproduce the bug
Error logs
Other notes
Note that running `python -c "import torch; print(torch.__version__)"` yields `2.1.2+cu118`. Furthermore, the output of `pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e llama-recipes[tests,auditnlg,vllm]` involves uninstalling the latest PyTorch version (2.2.1) from the base image and installing an older version.

My understanding from the relevant PyTorch release notes is that the `device_mesh` abstraction (which is the cause of the original error above) was introduced into `torch.distributed` only in PyTorch 2.2. However, the `requirements.txt` here in `llama-recipes` only specifies `torch>=2.0.1`.
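Since the mismatch comes down to comparing an installed torch version against the 2.2 cutoff, the check can be sketched with the standard library alone (`parse_version` and `supports_device_mesh` below are illustrative helpers, not part of llama-recipes or PyTorch):

```python
def parse_version(v: str) -> tuple:
    # Drop a local-version suffix such as "+cu118"; keep the numeric parts only
    base = v.split("+")[0]
    return tuple(int(p) for p in base.split(".")[:3])

def supports_device_mesh(torch_version: str) -> bool:
    # torch.distributed's device_mesh abstraction first shipped in PyTorch 2.2
    return parse_version(torch_version) >= (2, 2, 0)

print(supports_device_mesh("2.1.2+cu118"))  # -> False: the version pip resolved here
print(supports_device_mesh("2.2.1"))        # -> True: the version the base image had
```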
Unfortunately, simply changing the requirement to `torch>=2.2` results in an error when installing `llama-recipes`:

This error does not occur if the only change I make is to revert `2.2` to `2.0.1` in the `requirements.txt` file.

Workaround
A workaround is to simply run
after installing `llama-recipes`.
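After applying the workaround, a quick sanity check confirms whether the reinstalled torch actually exposes the missing abstraction; this sketch merely probes the import and does not assume torch is installed:

```python
# Probe for the PyTorch >= 2.2 device_mesh module without crashing on older installs
try:
    from torch.distributed.device_mesh import init_device_mesh  # noqa: F401
    status = "device_mesh available"
except ImportError:
    status = "device_mesh missing; reinstall torch>=2.2"
print(status)
```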