Relationship between dvc commands (push, pull, add), file size and number of files. #10337
-
The following table shows the relationship between the processing time of dvc commands and the number and size of files.
Best regards.
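The replies below attribute much of the difference to fixed per-file overhead. That effect is easy to reproduce outside of dvc: the following self-contained sketch (not dvc code; the file counts and sizes are arbitrary) hashes the same total number of bytes split into many small files versus one big file.

```python
import hashlib
import os
import tempfile
import time

def md5_file(path, chunk=1 << 20):
    """Hash a file the way dvc does by default (streamed md5)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def make_files(root, count, size):
    """Create `count` files of `size` random bytes each."""
    paths = []
    for i in range(count):
        p = os.path.join(root, f"f{i:05d}.bin")
        with open(p, "wb") as f:
            f.write(os.urandom(size))
        paths.append(p)
    return paths

with tempfile.TemporaryDirectory() as root:
    # Same total bytes, split two ways: 1000 x 4 KiB vs 1 x 4000 KiB.
    small = make_files(root, 1000, 4096)
    big = make_files(root, 1, 1000 * 4096)

    t0 = time.perf_counter()
    for p in small:
        md5_file(p)
    t_small = time.perf_counter() - t0

    t0 = time.perf_counter()
    md5_file(big[0])
    t_big = time.perf_counter() - t0

    print(f"1000 small files: {t_small:.3f}s, 1 big file: {t_big:.3f}s")
```

On most machines the small-file run is noticeably slower even though both hash the same number of bytes, because each file pays for an `open`, `stat`, and `close`.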
-
Downloading lots of small files is always going to be slower than downloading big files; that is probably what you are seeing in the benchmarks. It also depends on which remote you are using: for s3, azure, and google storage, dvc uploads files asynchronously in batches, while for other filesystems it uses a multithreaded executor, which can have higher overhead. There may also be some overhead in building the index, a sqlite database where dvc keeps a record of files.
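The batched asynchronous transfer mentioned above can be sketched with `asyncio` and a semaphore. This is not dvc's actual transfer code (the real object-store clients come from fsspec-based filesystems); the latency and batch size here are invented to show the shape of the approach:

```python
import asyncio

async def upload_one(path, sem):
    # Stand-in for a real object-store PUT request; the sleep simulates
    # a fixed network round trip per file.
    async with sem:
        await asyncio.sleep(0.01)
        return path

async def upload_all(paths, batch_size=16):
    # Bounded concurrency: at most `batch_size` requests in flight at once,
    # so many small uploads overlap instead of running one after another.
    sem = asyncio.Semaphore(batch_size)
    return await asyncio.gather(*(upload_one(p, sem) for p in paths))

done = asyncio.run(upload_all([f"file{i}" for i in range(100)]))
print(len(done))  # 100
```

With bounded concurrency the per-request latency is amortized across the batch, which is why object-store remotes handle many small files better than a naive sequential loop would.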
Regarding hashing: dvc keeps those hashes in another sqlite database, which is used to avoid recomputing hashes for files that have not changed.
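A minimal sketch of such a hash cache, assuming a `(path, mtime, size)` key; dvc's real schema and invalidation logic differ, but the idea is the same: an unchanged file is never re-read.

```python
import hashlib
import os
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hashes (path TEXT, mtime INT, size INT, md5 TEXT)")

def get_hash(db, path):
    """Return the file's md5, reusing a cached value when possible."""
    st = os.stat(path)
    key = (path, st.st_mtime_ns, st.st_size)
    row = db.execute(
        "SELECT md5 FROM hashes WHERE path=? AND mtime=? AND size=?", key
    ).fetchone()
    if row:
        return row[0]  # cache hit: no file read needed
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    db.execute("INSERT OR REPLACE INTO hashes VALUES (?,?,?,?)", (*key, md5))
    return md5
```

The second call for an unchanged file costs only a `stat` and one indexed lookup, which is why re-running `dvc add` or `dvc status` on an unmodified dataset is much faster than the first run.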
On the relink step, dvc might try reflinking (which is the default), or symlinking/hardlinking. In all of these cases, the time may depend on the number of files. If you really want to benchmark, you may be interested in https://github.com/skshetry/dvc-data-rs.
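The relink fallback described above can be sketched as follows. This is a simplified, hypothetical version of dvc's checkout linking, not its actual code: the `FICLONE` ioctl constant is Linux-specific, and the fallback order here (reflink, then hardlink, then symlink, then copy) is just one reasonable ordering.

```python
import os
import shutil

def relink(cache_path, workspace_path):
    """Link a cached file into the workspace, cheapest strategy first."""
    # 1. reflink: a copy-on-write clone, O(1) regardless of file size,
    #    but only on filesystems that support it (btrfs, XFS, APFS, ...).
    try:
        import fcntl
        FICLONE = 0x40049409  # Linux ioctl number for a CoW clone
        with open(cache_path, "rb") as src, open(workspace_path, "wb") as dst:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        return "reflink"
    except (OSError, ImportError):
        if os.path.exists(workspace_path):
            os.unlink(workspace_path)  # remove the empty file we opened
    # 2. hardlink: free, but cache and workspace must share a filesystem.
    try:
        os.link(cache_path, workspace_path)
        return "hardlink"
    except OSError:
        pass
    # 3. symlink: also free, but the workspace file becomes a pointer.
    try:
        os.symlink(cache_path, workspace_path)
        return "symlink"
    except OSError:
        pass
    # 4. copy: always works, but costs full file size in time and space.
    shutil.copy2(cache_path, workspace_path)
    return "copy"
```

Whichever strategy wins, the cost is paid once per file, so a checkout of many small files is dominated by per-file syscalls rather than by bytes moved, and a fallback to plain copying makes it scale with total size as well.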
-
@Shin-ichi-Takayama we are also building DVCx, which will work upstream of DVC and will be tailored to managing a lot of (unstructured) files, curating datasets, etc. It would be great to learn more about your scenarios, and if you are interested we can show you the current version of the new product.