The entirety of WebVid can be downloaded very easily using the scripts/configs provided in video2dataset/examples. You don't even need a complex distribution strategy: multiprocessing is enough to download all 10M samples in a timely manner on a single machine.
config:
```yaml
subsampling: {}
reading:
    yt_args:
        download_size: 360              # target video height in pixels (360p)
        download_audio_rate: 44100      # audio sampling rate in Hz
        yt_metadata_args: null          # don't collect extra yt-dlp metadata
    timeout: 60                         # max seconds per sample download
    sampler: null
storage:
    number_sample_per_shard: 1000       # samples per output tar shard
    oom_shard_count: 5                  # zero-pad shard names to 5 digits
    captions_are_subtitles: False       # captions are whole-video, not timed subtitles
distribution:
    processes_count: 16                 # one worker process per core
    thread_count: 16                    # download threads per process
    subjob_size: 1000                   # shards handled per subjob
    distributor: "multiprocessing"
```
script:
```bash
#!/bin/bash

# Fetch the WebVid-10M training metadata CSV (skip if already present)
wget -nc http://www.robots.ox.ac.uk/~maxbain/webvid/results_10M_train.csv

video2dataset --url_list="results_10M_train.csv" \
    --input_format="csv" \
    --output_format="webdataset" \
    --output_folder="dataset" \
    --url_col="contentUrl" \
    --caption_col="name" \
    --save_additional_columns='[videoid,page_idx,page_dir,duration]' \
    --enable_wandb=True \
    --config="path/to/config.yaml"
```
On a single 16-core EC2 instance (c6i.4xlarge), the entirety of WebVid (10M samples) can be downloaded in ~12h: at 230 videos/s (14.4 videos/s/core), 10,000,000 samples / 230 videos/s ≈ 43,500s ≈ 12h, at a network throughput of about 420 Mb/s. This can of course be sped up proportionally by utilizing more nodes via the Spark or Slurm distributors. The cost to download WebVid comes out to ~$0.68/h * 12h = $8.16.
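Once the shards are written, a quick way to sanity-check the output is to stream a few of them back with the webdataset library. A minimal sketch, assuming the default shard naming from `oom_shard_count: 5` and that samples are stored with `mp4`/`txt`/`json` keys as video2dataset's webdataset writer produces:

```python
import webdataset as wds

# With oom_shard_count: 5, shards are named 00000.tar, 00001.tar, ...
ds = wds.WebDataset("dataset/{00000..00009}.tar")

for sample in ds:
    # Each sample bundles the video bytes, the caption, and a JSON
    # metadata record (including the extra columns saved above).
    print(sample["__key__"], len(sample["mp4"]), sample["txt"].decode()[:60])
    break
```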