Large Scale Image Captioning With Dataflow


This repo prepares a Dataflow job for large scale image captioning using BLIP and CLIP to generate and rank image captions.

Dataflow is a fully managed streaming analytics service, based on Apache Beam, that minimizes latency, processing time, and cost through autoscaling and batch processing.

The model licenses can be found in the BLIP and CLIP projects.


  • Creates image captions using BLIP.
  • Ranks captions and uses the top one.
  • Parallelizes a job with many images across multiple workers.
  • Saves image/caption pairs as JSONL in Hugging Face's datasets format.
  • Can run a small subsample in a local environment before deploying the Dataflow job.
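
The ranking step can be pictured as scoring every candidate caption against the image and keeping the best one. The sketch below is only an illustration, not this repo's code: `rank_captions` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever similarity score CLIP produces for an image/caption pair. The scores in the example are made up.

```python
from typing import Callable, List, Sequence, Tuple


def rank_captions(
    captions: Sequence[str],
    score_fn: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Return (caption, score) pairs sorted from highest to lowest score."""
    if not captions:
        raise ValueError("need at least one candidate caption")
    scored = [(caption, score_fn(caption)) for caption in captions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy scores standing in for CLIP image-text similarity (made-up numbers).
toy_scores = {
    "a cat on a sofa": 0.12,
    "a dog running on a beach": 0.31,
    "an animal outdoors": 0.24,
}

ranked = rank_captions(list(toy_scores), lambda caption: toy_scores[caption])
top_caption = ranked[0][0]  # the caption that would be kept
print(top_caption)
```

Passing the scorer in as a function keeps the ranking logic independent of the model, so the same code can be exercised locally with toy scores before any weights are loaded.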


  1. Clone the repo if you haven't. Navigate to the image-captioning-dataflow folder.

  2. Install Python 3.8 and dependencies.

    conda create -n py38 python=3.8
    conda activate py38
    pip install -r requirements.txt
  3. Install BLIP, download the weights and save the state dict. Change <your-blip-location> to the absolute path of the folder where BLIP was cloned.

    git clone
    export PYTHONPATH=$PYTHONPATH:<your-blip-location>/BLIP
    gdown '*_base_caption.pth'
  4. Copy BLIP configs

    mkdir configs
    cp BLIP/configs/med_config.json configs/
  5. Download the CLIP weights.

    git lfs install
    git clone
  6. Create a dataset.txt file. First, upload the images you want to caption to Google Cloud Storage. For example, I created a bucket jfacevedo-demos-datasets with a folder me and uploaded all my images to that folder. Then upload the output file, dataset.txt, into the same folder as the images. Test this with a few images first before using the full image dataset.

    export BUCKET_ID="jfacevedo-demos-datasets"
    export PREFIX="me"
    gsutil cp dataset.txt gs://$BUCKET_ID/$PREFIX/
  7. Next, copy the weights to a local directory, /captioning. The Dataflow job itself won't use these local files, but they are needed to deploy the job and to test the pipeline locally before deploying.

    chmod 755 clip-vit-base-patch32/
    sudo mkdir /captioning/
    sudo chmod 755 /captioning/
    sudo cp -r clip-vit-base-patch32/ /captioning/
  8. Test the pipeline locally. This works without GPUs but takes longer.

    python --dataset-filename gs://$BUCKET_ID/$PREFIX/dataset.txt --output-filename gs://$BUCKET_ID/$PREFIX/metadata.jsonl

    If we look at the output, Beam has sharded it into multiple files, which improves performance when the workload runs in parallel. You can join the files as follows.

    gsutil compose \
    gs://${BUCKET_ID}/$PREFIX/metadata* \
  9. We'll be using a custom container to run our Dataflow job. Build and push the container. Make sure you set PROJECT_ID to yours.

    export PROJECT_ID=<project-id>
    docker build . -t$PROJECT_ID/dataflow-captioning:latest
    docker push$PROJECT_ID/dataflow-captioning:latest
  10. Run the Dataflow job. First, you'll need a service account with the Dataflow Admin, Dataflow Worker and Compute Network User roles. You can either use the default service account or create a new one. If you are on the default network that comes with your project, you can omit --subnetwork. If you're using the default service account, you can omit --service_account_email. In the following snippet, I'm using a custom service account and a VPC network. If you're using the same --temp_location as the command below, make sure to create a bucket named $PROJECT_ID-bucket.

    This job uses a T4 GPU.

    python \
    --dataset-filename gs://$BUCKET_ID/$PREFIX/dataset.txt \
    --output-filename gs://$BUCKET_ID/$PREFIX/metadata.jsonl \
    --runner=DataflowRunner \
    --project=$PROJECT_ID \
    --region=us-central1 \
    --job_name=captioning \
    --temp_location=gs://$PROJECT_ID-bucket/ \$PROJECT_ID/dataflow-captioning:latest \
    --machine_type=n1-standard-16 \
    --experiment="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver" \
    --experiment=use_runner_v2 \
    --disk_size_gb=200 \
    --subnetwork=$PROJECT_ID/regions/us-central1/subnetworks/jfacevedo-demo-subnet \
    --service_account_email=vertex-ai@$ \
  11. You can view the job's progress through the Dataflow console.

    Don't forget to consolidate the sharded files into one before using the output for training, for example, with Stable Diffusion.
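
If the shards have been copied to a local folder, one way to consolidate them is a short script like the one below. It is a sketch under stated assumptions: the function name `merge_jsonl_shards` and the local paths are hypothetical, and it assumes each shard holds one JSON object per line.

```python
import glob
import json
import os


def merge_jsonl_shards(shard_glob: str, merged_path: str) -> int:
    """Concatenate JSONL shards matching shard_glob into merged_path.

    Returns the number of records written. Blank lines are skipped, and a
    malformed line raises json.JSONDecodeError instead of being copied.
    """
    merged_abs = os.path.abspath(merged_path)
    # Resolve the shard list first, and never read the output file itself,
    # even if it happens to match the glob.
    shard_paths = sorted(
        path for path in glob.glob(shard_glob)
        if os.path.abspath(path) != merged_abs
    )
    written = 0
    with open(merged_path, "w", encoding="utf-8") as merged:
        for path in shard_paths:
            with open(path, encoding="utf-8") as shard:
                for line in shard:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # validate before copying
                    merged.write(line + "\n")
                    written += 1
    return written
```

For example, `merge_jsonl_shards("shards/metadata*", "metadata.jsonl")` would write every record from the local `shards/` folder into a single `metadata.jsonl`. Validating each line up front makes a truncated shard fail loudly here rather than later during training.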

Running with multiple GPUs

If you're deploying this with multiple workers and multiple GPUs, check that your project's quota allows for it, or the job won't be able to run efficiently. Leaving max_num_workers at its default value of 100 will surely saturate the single GPU in the job above for large workloads. Instead, set the value to something reasonable and increase the number of GPUs. For example:

python \
--dataset-filename gs://$BUCKET_ID/$PREFIX/dataset.txt \
--output-filename gs://$BUCKET_ID/$PREFIX/metadata.jsonl \
--runner=DataflowRunner \
--project=$PROJECT_ID \
--region=us-central1 \
--job_name=captioning \
--temp_location=gs://$PROJECT_ID-bucket/ \$PROJECT_ID/dataflow-captioning:latest \
--machine_type=n1-standard-16 \
--experiment="worker_accelerator=type:nvidia-tesla-t4;count:4;install-nvidia-driver" \
--experiment=use_runner_v2 \
--disk_size_gb=200 \
--subnetwork=$PROJECT_ID/regions/us-central1/subnetworks/jfacevedo-demo-subnet \
--service_account_email=vertex-ai@$ \
--sdk_location=container \
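
When checking quota, it can help to work out the peak GPU demand first: the `count` in `worker_accelerator` applies to each worker, so a job can request up to `max_num_workers` times `count` GPUs. The sketch below is illustrative only. The helper names and the quota figure are assumptions, and `--max_num_workers` is the standard Beam worker option, which the command above does not set explicitly.

```python
from typing import List


def accelerator_experiment(gpu_type: str, gpus_per_worker: int) -> str:
    """Build the worker_accelerator experiment string used in the command."""
    return (
        f"worker_accelerator=type:{gpu_type};"
        f"count:{gpus_per_worker};install-nvidia-driver"
    )


def peak_gpu_demand(max_num_workers: int, gpus_per_worker: int) -> int:
    """GPUs requested if autoscaling reaches the worker cap."""
    return max_num_workers * gpus_per_worker


def scaling_flags(
    max_num_workers: int,
    gpus_per_worker: int,
    gpu_quota: int,  # assumed regional quota for this GPU type
    gpu_type: str = "nvidia-tesla-t4",
) -> List[str]:
    """Return the two scaling flags, or fail if they would exceed quota."""
    demand = peak_gpu_demand(max_num_workers, gpus_per_worker)
    if demand > gpu_quota:
        raise ValueError(
            f"up to {demand} GPUs requested but quota is {gpu_quota}"
        )
    return [
        f"--max_num_workers={max_num_workers}",
        f"--experiment={accelerator_experiment(gpu_type, gpus_per_worker)}",
    ]


# Hypothetical numbers: 4 workers with 4 T4s each against a quota of 16.
flags = scaling_flags(max_num_workers=4, gpus_per_worker=4, gpu_quota=16)
print(" ".join(flags))
```

With the default cap of 100 workers and 4 GPUs each, the same check would flag a potential request for 400 GPUs, which is the kind of mismatch worth catching before submitting the job.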