Relationship between dvc commands (push, pull, add), file size and number of files. #10337
-
The following table shows the relationship between the processing time of dvc commands and the number and size of files.
Best regards.
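The replies below attribute much of the difference to fixed per-file overhead. That effect is easy to reproduce outside of dvc: the following self-contained sketch (not dvc code; the file counts and sizes are arbitrary) hashes the same total number of bytes split into many small files versus one big file.

```python
import hashlib
import os
import tempfile
import time

def md5_file(path, chunk=1 << 20):
    """Hash a file the way dvc does by default (streamed md5)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def make_files(root, count, size):
    """Create `count` files of `size` random bytes each."""
    paths = []
    for i in range(count):
        p = os.path.join(root, f"f{i:05d}.bin")
        with open(p, "wb") as f:
            f.write(os.urandom(size))
        paths.append(p)
    return paths

with tempfile.TemporaryDirectory() as root:
    # Same total bytes, split two ways: 1000 x 4 KiB vs 1 x 4000 KiB.
    small = make_files(root, 1000, 4096)
    big = make_files(root, 1, 1000 * 4096)

    t0 = time.perf_counter()
    for p in small:
        md5_file(p)
    t_small = time.perf_counter() - t0

    t0 = time.perf_counter()
    md5_file(big[0])
    t_big = time.perf_counter() - t0

    print(f"1000 small files: {t_small:.3f}s, 1 big file: {t_big:.3f}s")
```

On most machines the small-file run is noticeably slower even though both hash the same number of bytes, because each file pays for an `open`, `stat`, and `close`.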
-
Downloading lots of small files is always going to be slower than downloading big files; that is probably what you are seeing in the benchmarks. It also depends on which remote you are using: for s3, azure, and google storage, dvc uploads files asynchronously in batches, while for other filesystems it uses a multithreaded executor, which can have higher overhead. There may also be some overhead in building the index, a sqlite database where dvc keeps a record of files.
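The batched asynchronous transfer mentioned above can be sketched with `asyncio` and a semaphore. This is not dvc's actual transfer code (the real object-store clients come from fsspec-based filesystems); the latency and batch size here are invented to show the shape of the approach:

```python
import asyncio

async def upload_one(path, sem):
    # Stand-in for a real object-store PUT request; the sleep simulates
    # a fixed network round trip per file.
    async with sem:
        await asyncio.sleep(0.01)
        return path

async def upload_all(paths, batch_size=16):
    # Bounded concurrency: at most `batch_size` requests in flight at once,
    # so many small uploads overlap instead of running one after another.
    sem = asyncio.Semaphore(batch_size)
    return await asyncio.gather(*(upload_one(p, sem) for p in paths))

done = asyncio.run(upload_all([f"file{i}" for i in range(100)]))
print(len(done))  # 100
```

With bounded concurrency the per-request latency is amortized across the batch, which is why object-store remotes handle many small files better than a naive sequential loop would.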
Regarding hashing: dvc keeps those hashes in another sqlite database, which is used to avoid recomputing hashes for files that have not changed.
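A minimal sketch of such a hash cache, assuming a `(path, mtime, size)` key; dvc's real schema and invalidation logic differ, but the idea is the same: an unchanged file is never re-read.

```python
import hashlib
import os
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hashes (path TEXT, mtime INT, size INT, md5 TEXT)")

def get_hash(db, path):
    """Return the file's md5, reusing a cached value when possible."""
    st = os.stat(path)
    key = (path, st.st_mtime_ns, st.st_size)
    row = db.execute(
        "SELECT md5 FROM hashes WHERE path=? AND mtime=? AND size=?", key
    ).fetchone()
    if row:
        return row[0]  # cache hit: no file read needed
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    db.execute("INSERT OR REPLACE INTO hashes VALUES (?,?,?,?)", (*key, md5))
    return md5
```

The second call for an unchanged file costs only a `stat` and one indexed lookup, which is why re-running `dvc add` or `dvc status` on an unmodified dataset is much faster than the first run.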
On the relink step, dvc might try reflinking (which is the default), or symlinking/hardlinking. In all of these cases, the time may depend on the number of files. If you really want to benchmark, you may be interested in https://github.com/skshetry/dvc-data-rs.
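The relink fallback described above can be sketched as follows. This is a simplified, hypothetical version of dvc's checkout linking, not its actual code: the `FICLONE` ioctl constant is Linux-specific, and the fallback order here (reflink, then hardlink, then symlink, then copy) is just one reasonable ordering.

```python
import os
import shutil

def relink(cache_path, workspace_path):
    """Link a cached file into the workspace, cheapest strategy first."""
    # 1. reflink: a copy-on-write clone, O(1) regardless of file size,
    #    but only on filesystems that support it (btrfs, XFS, APFS, ...).
    try:
        import fcntl
        FICLONE = 0x40049409  # Linux ioctl number for a CoW clone
        with open(cache_path, "rb") as src, open(workspace_path, "wb") as dst:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        return "reflink"
    except (OSError, ImportError):
        if os.path.exists(workspace_path):
            os.unlink(workspace_path)  # remove the empty file we opened
    # 2. hardlink: free, but cache and workspace must share a filesystem.
    try:
        os.link(cache_path, workspace_path)
        return "hardlink"
    except OSError:
        pass
    # 3. symlink: also free, but the workspace file becomes a pointer.
    try:
        os.symlink(cache_path, workspace_path)
        return "symlink"
    except OSError:
        pass
    # 4. copy: always works, but costs full file size in time and space.
    shutil.copy2(cache_path, workspace_path)
    return "copy"
```

Whichever strategy wins, the cost is paid once per file, so a checkout of many small files is dominated by per-file syscalls rather than by bytes moved, and a fallback to plain copying makes it scale with total size as well.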
-
@Shin-ichi-Takayama we are also building DVCx, which will work upstream of DVC and will be tailored to managing a lot of (unstructured) files, curating datasets, etc. It would be great to learn more about your scenarios, and if you are interested we can show you the current version of the new product.