WebVid.md

All of WebVid can be downloaded easily with the scripts and configs provided in video2dataset/examples. No complex distribution strategy is needed: multiprocessing on a single machine is enough to download all 10M samples in a timely manner.

Create the config:

subsampling: {}

reading:
    yt_args:
        download_size: 360
        download_audio_rate: 44100
        yt_metadata_args: null
    timeout: 60
    sampler: null

storage:
    number_sample_per_shard: 1000
    oom_shard_count: 5
    captions_are_subtitles: False

distribution:
    processes_count: 16
    thread_count: 16
    subjob_size: 1000
    distributor: "multiprocessing"
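
As a rough sanity check on the storage settings above, the following sketch estimates how many shards the download would produce. It assumes the approximate 10M sample count quoted in this document (not an exact row count), and it assumes that `oom_shard_count` is the number of digits used in shard names, as in img2dataset; the helper names `shard_count` and `fits_in_digits` are hypothetical and not part of video2dataset.

```python
def shard_count(total_samples: int, samples_per_shard: int) -> int:
    """Number of shards needed to hold total_samples (the last shard may be partial)."""
    if samples_per_shard <= 0:
        raise ValueError("samples_per_shard must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return (total_samples + samples_per_shard - 1) // samples_per_shard


def fits_in_digits(shards: int, digits: int) -> bool:
    """True if shard indices 0..shards-1 can be written with `digits` digits.

    Assumption: oom_shard_count controls the zero-padded width of shard names.
    """
    return shards <= 10 ** digits


# ~10M samples (approximate figure from this document), 1000 samples per shard.
shards = shard_count(10_000_000, 1000)
print(shards)                     # 10000
print(fits_in_digits(shards, 5))  # True
```

With `number_sample_per_shard: 1000`, about 10,000 shards are expected, which leaves headroom under a five-digit shard index if that assumption holds.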

Download WebVid:

#!/bin/bash

wget -nc http://www.robots.ox.ac.uk/~maxbain/webvid/results_10M_train.csv

video2dataset --url_list="results_10M_train.csv" \
        --input_format="csv" \
        --output_format="webdataset" \
        --output_folder="dataset" \
        --url_col="contentUrl" \
        --caption_col="name" \
        --save_additional_columns='[videoid,page_idx,page_dir,duration]' \
        --enable_wandb=True \
        --config="path/to/config.yaml"
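
Before starting a multi-hour download, it may be worth confirming that the CSV actually contains the columns named in the command above. The sketch below reads only the header row using the standard library; the function name `missing_columns` is hypothetical and not part of video2dataset, and the column list is simply copied from the `--url_col`, `--caption_col` and `--save_additional_columns` arguments.

```python
import csv

# Columns referenced by the video2dataset command in this document.
REQUIRED_COLUMNS = ["contentUrl", "name", "videoid", "page_idx", "page_dir", "duration"]


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(f), [])
    return [col for col in required if col not in header]
```

Calling `missing_columns("results_10M_train.csv")` after the `wget` step should return an empty list if the file has the expected layout; any names it returns would need to be fixed in the command before launching.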

Performance

On a single 16-core EC2 instance (c6i.4xlarge), the entirety of WebVid (10M samples) can be downloaded in ~12h. This corresponds to about 230 videos/s (14.4 videos/s/core), or 420 Mb/s. Processing time can be reduced further, roughly in proportion, by using more nodes via Spark or Slurm distribution. The cost to download WebVid is ~ $0.68/h * 12h = $8.16.
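
The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in this section (10M samples, 12 hours, 16 cores, $0.68 per hour); the function name `estimate` and the `nodes` parameter are hypothetical, and the multi-node case assumes ideal linear scaling, which real runs may not reach.

```python
def estimate(samples, hours, cores, price_per_hour, nodes=1):
    """Back-of-the-envelope throughput and cost for a download run.

    Assumes ideal linear scaling across nodes: wall-clock time shrinks by
    1/nodes while total instance-hours (and so total cost) stay the same.
    """
    wall_clock_hours = hours / nodes
    videos_per_second = samples / (wall_clock_hours * 3600)
    return {
        "wall_clock_hours": wall_clock_hours,
        "videos_per_second": videos_per_second,
        "videos_per_second_per_core": videos_per_second / (cores * nodes),
        "total_cost_usd": round(price_per_hour * wall_clock_hours * nodes, 2),
    }


single = estimate(samples=10_000_000, hours=12, cores=16, price_per_hour=0.68)
print(f"{single['videos_per_second']:.0f} videos/s")  # 231 videos/s
print(f"${single['total_cost_usd']:.2f}")             # $8.16
```

The results agree with the rounded numbers in the text: roughly 230 videos/s overall, a little over 14 videos/s per core, and a total of $8.16.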