Both the preprocessor and dataloaders are modified versions of the `clip-inference` and `embedding-reader` libraries, respectively. Many thanks to @rom1504 for his amazing work :)
By default, the encoders are left uninstalled to keep the base installation lightweight. You can install the necessary libraries for each encoder using the commands shown below.
CLIP requires both the `PIL` library and the `clip` library to be installed. You can install the latest version of `clip` using the command below:

pip install git+https://github.com/openai/CLIP.git

You can install all CLIP requirements in one command using `pip install -r requirements-clip.txt`.
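If you want to verify the install, a minimal sanity check along the lines of the sketch below (using the standard `clip` API; the model name and image path are placeholders, not ClipCap defaults) should run without errors:

```python
# Quick sanity check that PIL and CLIP are installed correctly.
# "ViT-B/32" and "example.jpg" are placeholders, not ClipCap defaults.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)

print(features.shape)  # torch.Size([1, 512]) for ViT-B/32
```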
CLAP requires the `clap` branch of the LAION fork of `open_clip` to be installed. It can be installed like so:

pip install git+https://github.com/LAION-AI/CLAP.git@clap

CLAP also requires a few audio libraries for the audio transforms, which can be installed with the command below:

pip install torchaudio h5py torchlibrosa PySoundFile

You can install all CLAP requirements in one command using `pip install -r requirements-clap.txt`.
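To confirm the audio dependencies are importable (note that `PySoundFile` provides the `soundfile` module), a quick check like this sketch should work:

```python
# Quick import check for the CLAP audio dependencies.
import torchaudio
import h5py
import torchlibrosa
import soundfile  # installed by the PySoundFile package

for module in (torchaudio, h5py, torchlibrosa, soundfile):
    print(module.__name__, getattr(module, "__version__", "unknown"))
```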
Feel free to submit an issue and/or PR to help ClipCap easily support other encoders. This repo has been designed to be as modular and reproducible as possible, so implementing your own encoders in ClipCap is straightforward. You can take a look at the sample encoder script as a guide when creating your pull request.
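As a rough illustration only (the class and method names below are hypothetical; the sample encoder script defines the actual interface ClipCap expects), a custom encoder usually boils down to a module that maps a batch of media to one embedding per sample:

```python
# Hypothetical sketch of a custom encoder wrapper. Names and signatures are
# illustrative; see the repo's sample encoder script for the real interface.
import torch
import torch.nn as nn


class MyEncoder(nn.Module):
    """Wraps a pretrained backbone so its embeddings can be consumed by ClipCap."""

    def __init__(self, backbone: nn.Module, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        self.embed_dim = embed_dim

    @torch.no_grad()
    def forward(self, media: torch.Tensor) -> torch.Tensor:
        # Expected to return one embedding vector per sample: (batch, embed_dim).
        return self.backbone(media)
```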
Once all requirements (including `clipcap`) have been installed, you can run the preprocessor using the command below:

python -m clipcap.preprocess --help

Running the command with `--help` provides extensive documentation for each individual argument. Alternatively, you can view the `preprocess/args.py` and `encoder/args.py` files.
Below is an example of preprocessing a webdataset with a pretrained CLAP model. The webdataset contains FLAC audio files under the key `flac`, and the captions are stored under the key `text` inside json files under the key `json`, giving a final `wds-caption-key` of `json/text`.
python3.8 -m clipcap.preprocess \
    --input-dataset "./datasets/clotho/train/{0..7}.tar" \
    --output-folder "./preprocessed/clotho/train/" \
    --input-format "webdataset" \
    --batch-size 256 \
    --device "cuda:0" \
    --write-batch-size 10000 \
    --wds-media-key "flac" \
    --wds-caption-key "json/text" \
    --wds-samples-per-file 512 \
    --encoder-model-name "clap"
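Since the dataloaders are based on embedding-reader, the preprocessed output can usually be inspected with that library. The sketch below assumes the embeddings are written in embedding-reader's `npy` layout; check the contents of your `--output-folder` and adjust the path and `file_format` to match what was actually written.

```python
# Sketch: inspect preprocessed embeddings with embedding-reader.
# The folder path and file_format are assumptions; adjust them to match
# what the preprocessor wrote to --output-folder.
from embedding_reader import EmbeddingReader

reader = EmbeddingReader(
    embeddings_folder="./preprocessed/clotho/train/",
    file_format="npy",
)
print("embedding count:", reader.count)
print("embedding dimension:", reader.dimension)

for embeddings, metadata in reader(batch_size=1024, start=0, end=reader.count):
    print(embeddings.shape)
    break
```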