
Symlink snapshot_download files from cache #2284

Closed
joecummings opened this issue May 20, 2024 · 6 comments

Comments

@joecummings

joecummings commented May 20, 2024

It appears that symlinking as a default was removed from the huggingface-cli download in #2223.

The Problem

A user downloads a model with from_pretrained and then tries to use torchtune, which requires the model checkpoints to be available in an easily accessible directory structure. Under the hood, torchtune calls snapshot_download with a local_dir. After #2223, this copies the files from ~/.cache/huggingface/hub to local_dir, resulting in 2X the disk space being used.
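For context, this is roughly the call made under the hood, as a minimal sketch (the repo id and target directory are placeholders):

```python
from huggingface_hub import snapshot_download

# Before #2223 the files placed in local_dir could be symlinks into the
# shared cache; after #2223 they are real copies, so they end up both under
# ~/.cache/huggingface/hub and under local_dir.
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",  # placeholder repo id
    local_dir="/tmp/Meta-Llama-3-8B",      # placeholder target directory
)
```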

The Ask

Is there any way to utilize symlinking capabilities so users have a way to reduce disk usage?


Obviously, my example and interest lie in the torchtune use case, but I think this applies to anyone trying to use the huggingface-cli download command in conjunction with the from_pretrained API.

cc: @Wauplin

@Wauplin
Contributor

Wauplin commented May 21, 2024

Is there any way to utilize symlinking capabilities so users have a way to reduce disk usage?

Hi @joecummings, no, there is currently no way to do that, and we most likely won't support it in the future. Usage of downloading to a local dir with symlinks into the real cache was quite low compared to the drawbacks it had (especially the massive confusion for users not knowing where the files are actually cached, but also problems on Windows, on shared clusters, on mounted volumes, etc.). In the end we decided to completely separate the "cached process" that shares the cache directory between libraries from the "local process" that is designed to be managed by the user, including potential duplication. I'm sorry that this affects torchtune users, and I hope we can find a suitable solution, either to:

  • use cache_dir instead of local_dir in torchtune? (a minimal sketch of this option follows the list)
  • encourage users to download to a local dir and then load from torchtune/transformers? Transformers supports loading from a directory.
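Here is a minimal sketch of the first option, assuming torchtune can consume the directory path returned by snapshot_download (the repo id is a placeholder):

```python
from huggingface_hub import snapshot_download

# Without local_dir, files are stored only in the shared cache
# (~/.cache/huggingface/hub by default, or cache_dir if given), and the
# returned path points at the cached snapshot directory.
snapshot_path = snapshot_download(repo_id="meta-llama/Meta-Llama-3-8B")
print(snapshot_path)
# e.g. .../hub/models--meta-llama--Meta-Llama-3-8B/snapshots/<commit-hash>
```

torchtune could then read config.json and the *.safetensors files directly from snapshot_path, with no duplication.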

Note that file duplication was already silently happening before (on Windows, shared clusters, mounted volumes, etc.). The difference now is that the duplication is explicit in all use cases, including common Unix usage.

@joecummings
Author

Thanks for your response @Wauplin, and for laying out several possible options. For now, I think we want to avoid loading with transformers, as we have our own model definitions. For the first suggestion, do the checkpoint files exist in the cache_dir as binary files, or as readable safetensors and config JSON files?

@Wauplin
Contributor

Wauplin commented May 24, 2024

For the first suggestion, do the checkpoint files exist in the cache_dir as binary files, or as readable safetensors and config JSON files?

I'm not sure I understand this question. Safetensors files are binary files (as opposed to JSON files, which are UTF-8 text files). Whether you download to the cache or to a local directory, you will get the exact same files. In one case they are shared with other libraries; in the latter they live solely in a directory managed by the user.
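One way to check this, as a sketch using huggingface_hub's cache-scanning helper (the repo id is a placeholder):

```python
from huggingface_hub import scan_cache_dir

# List the files of each cached revision of a repo; the snapshot directory
# holds the same config.json / *.safetensors files you would get in local_dir.
for repo in scan_cache_dir().repos:
    if repo.repo_id == "meta-llama/Meta-Llama-3-8B":  # placeholder repo id
        for revision in repo.revisions:
            print(revision.snapshot_path)
            for f in revision.files:
                print("  ", f.file_name, f.size_on_disk, "bytes")
```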

@joecummings
Author

I'm not sure I understand this question. Safetensors files are binary files (as opposed to JSON files, which are UTF-8 text files). Whether you download to the cache or to a local directory, you will get the exact same files. In one case they are shared with other libraries; in the latter they live solely in a directory managed by the user.

Yep, I misspoke - thanks for the clarification! The point I was trying to understand was whether all the same files exist in the cache as in the local_dir, and it appears the answer is yes.

@Wauplin
Contributor

Wauplin commented May 24, 2024

Yes, I confirm!

@joecummings
Author

Closing for now to clean up your Issues - thanks for the help!
