Video metadata processing no longer writes temp files #293
base: main
Conversation
Should be merged after #288. Metadata-finding subsamplers (FFProbeSubsampler and CutDetectionSubsampler) no longer take byte streams, write them to a temp file, then operate on that temp file. Instead, we can pass a filepath directly to these subsamplers, and they will extract metadata without performing additional I/O. In the next pull request I will do the same for the video processing subsamplers. This pull request has been tested with my usual workflow and reproduces the expected results.
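To illustrate the change described above, here is a minimal sketch of the before/after flow. The function names (`probe_metadata`, `old_flow`, `new_flow`) are hypothetical stand-ins, not the actual video2dataset API; the point is only the removal of the intermediate temp-file write.

```python
import os
import tempfile

def probe_metadata(filepath):
    # Stand-in for an FFProbeSubsampler-style metadata extraction step
    # (hypothetical; the real subsampler shells out to ffprobe).
    return {"path": filepath, "size_bytes": os.path.getsize(filepath)}

def old_flow(video_bytes):
    # Before: byte stream -> temp file -> probe -> cleanup (extra I/O).
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        f.write(video_bytes)
        tmp_path = f.name
    try:
        return probe_metadata(tmp_path)
    finally:
        os.remove(tmp_path)

def new_flow(filepath):
    # After: the subsampler receives a filepath directly,
    # so no temporary write/read pair is needed.
    return probe_metadata(filepath)
```

Both flows produce the same metadata; `new_flow` simply skips the write/delete round trip, which is the read/write saving discussed below.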
Can you rebase, please?
@MattUnderscoreZhang what speed difference do you observe? Sharing wandb links would be helpful.
I'm showing my noob status here, but I've never really used wandb. Could we maybe schedule a call sometime where you show me how to generate the links you're looking for? Also, I don't expect much speedup at this stage anyway. I still need to perform a temporary write at the beginning of sample processing, since I haven't touched the dataloaders yet. That is, it would be nice if the dataloaders passed filepaths rather than byte streams, but that will have to be a later change. As it stands, I think I'm currently only saving a single read/write. The major savings should come in the next pull request, when I update the actual video processing subsamplers.
I think it's quite important to check the speed for this kind of major change.
Here's a check of this branch vs. main on WandB.
@iejMac do you have some numbers on how many vid/s you reached on webvid? @MattUnderscoreZhang how many workers are you using?
I'm using 5 processes with 2 threads each, 100 samples per shard.
@rom1504 It says at the bottom of this: https://github.com/iejMac/video2dataset/blob/main/dataset_examples/WebVid.md
That seems to be for a download config with no video processing. My changes would not have any effect in that use case.
@MattUnderscoreZhang ok, let's try to run with the same settings as in that example. Also, it would be helpful to increase the number of processes and threads; 5 and 2 are too low to catch problems.
I tried replicating a run with the exact config used in the linked example: 16 processes with 16 threads each, with 1000 samples per shard. I ran on a vast.ai instance with a webvid results_2M_val dataset.

It's good we ran this test, because unfortunately it looks like this branch is definitely buggy. Something about the threading and multiprocessing is causing the run to freeze. Looking at old commits and comparing to commit c6f3ed2 (the one right before my first commit), I see that the download worker refactor commit also has a threading problem (even with the threading_fix branch applied). The commit right before, e1b5d89, is fine. The speed comparison for this commit matches the older commit.

For now I recommend rolling back the download worker refactor commit. Fixing the threading issue will take some debugging, and I don't think I have the capacity for it right now. You can close this pull request if you want, and I'll come back and review this later.
This would need a rebase.