bq_streaming_inserts

This sample builds a BigQuery table to store data for a sample rides application. Once the table is ready, run the Python application to populate it with sample data.

The Python application can be run in the following ways, depending on your needs:

  1. Standalone mode: if you need fewer than 1 million records, standalone mode works just fine.
  2. As a GKE job: if you want to ingest millions of records, running the application as a GKE job is the best option.

The following steps build the BigQuery table and run the Python application as a job on a GKE cluster.

Prerequisites:

  1. You have Editor access to a Google Cloud project.
  2. You have installed and configured the gcloud CLI to point to the above project.
  3. You have created a service account key file with BigQuery Editor permissions.
  4. Store this key file as bq-editor.json.
  5. Your Python environment is set up with all the dependencies from requirements.txt installed (see the command after this list).
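
For example, the dependencies can typically be installed with pip (assuming pip targets the same Python 3 environment you will run the application with):

pip3 install -r requirements.txt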

Step 1: Clone this repository to your local machine using the following command.

git clone https://github.com/dhaval-d/bq_streaming_inserts.git

Step 2: Go to the above directory and run the following command to create a BigQuery table.

bq mk --table \
  --schema rides.json \
  --time_partitioning_field insert_date \
  --description "Table with sample rides data" \
  [YOUR_DATASET_NAME].rides
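
The table schema is read from rides.json in this repository. Its exact fields are defined in that file, but a bq JSON schema for a table like this generally looks like the sketch below; only insert_date is known for certain here (it is the partitioning field above), and the other field names are illustrative assumptions:

[
  {"name": "ride_id",     "type": "STRING",    "mode": "NULLABLE"},
  {"name": "pickup_zone", "type": "STRING",    "mode": "NULLABLE"},
  {"name": "fare",        "type": "FLOAT",     "mode": "NULLABLE"},
  {"name": "insert_date", "type": "TIMESTAMP", "mode": "NULLABLE"}
]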

Step 3: Run the following command to set GOOGLE_APPLICATION_CREDENTIALS to point to your service account key file.
export GOOGLE_APPLICATION_CREDENTIALS=bq-editor.json

Step 4: Run the following command to run the Python application in your local environment.
python3 app.py \
--project [YOUR_GCP_PROJECT_NAME] \
--dataset [YOUR_DATASET_NAME] \
--table rides \
--batch_size 1 \
--total_batches 1
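
For reference, app.py streams generated rows into BigQuery in batches controlled by --batch_size and --total_batches. The repository's actual implementation may differ, but a minimal sketch of this streaming-insert pattern, assuming the google-cloud-bigquery client library and illustrative field names, looks like this:

# Hypothetical sketch of the streaming-insert loop in app.py.
# Assumes the google-cloud-bigquery client library; field names are illustrative.
import argparse
import datetime
import uuid

from google.cloud import bigquery


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--table", required=True)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--total_batches", type=int, default=1)
    args = parser.parse_args()

    client = bigquery.Client(project=args.project)
    table_id = f"{args.project}.{args.dataset}.{args.table}"

    for _ in range(args.total_batches):
        # Build one batch of rows; insert_date matches the partitioning field
        # created in Step 2, the other field is a placeholder.
        rows = [
            {
                "ride_id": str(uuid.uuid4()),
                "insert_date": datetime.datetime.utcnow().isoformat(),
            }
            for _ in range(args.batch_size)
        ]
        # Streaming insert; returns a list of per-row errors (empty on success).
        errors = client.insert_rows_json(table_id, rows)
        if errors:
            print(f"Encountered errors: {errors}")


if __name__ == "__main__":
    main()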


Step 5: Change the Dockerfile CMD line (line 13) to point to your project and BigQuery dataset.
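
After editing, line 13 might look roughly like the example below (an exec-form CMD; the flag values are placeholders and the actual flags in the repository's Dockerfile may differ):

CMD ["python3", "app.py", "--project", "[YOUR_GCP_PROJECT_NAME]", "--dataset", "[YOUR_DATASET_NAME]", "--table", "rides", "--batch_size", "500", "--total_batches", "100"]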

Then build a docker container by using the following command.

docker build -t gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1 .

Step 6: Make sure you can see your container image using the following command.

docker images

Step 7: Run the following Docker command to run your application as a container in your local environment (for testing purposes). Note that GOOGLE_APPLICATION_CREDENTIALS must be an absolute path on your machine for the volume mount below to work.

docker run --name bq_streaming \
  -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/bq-editor.json \
  -v $GOOGLE_APPLICATION_CREDENTIALS:/tmp/keys/bq-editor.json:ro \
  gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1


Step 8: Configure Docker to authenticate with your GCP project using the following command.

gcloud auth configure-docker

Step 9: Push your Docker image to the Google Container Registry in your GCP project.

docker push gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1

Step 10: Create a GKE cluster using the following command.

gcloud container clusters create demo-cluster --num-nodes=2

Step 11: Once the cluster is up and running, use the following command to check the status of its nodes.

kubectl get nodes

Step 12: Change the args: line in the deployment.yaml file to refer to your project and dataset. You can also change the completions and parallelism parameters in the file based on how many records you are trying to generate.
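
The actual manifest is in deployment.yaml; the sketch below assumes a standard Kubernetes Job and uses placeholder names and values (bq-streaming-job, the flag values, and the counts are illustrative) to show where those parameters live:

apiVersion: batch/v1
kind: Job
metadata:
  name: bq-streaming-job
spec:
  completions: 10     # total pods that must run to completion
  parallelism: 5      # pods running at the same time
  template:
    spec:
      containers:
      - name: bq-streaming
        image: gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1
        # With no ENTRYPOINT in the image, these args replace the Dockerfile CMD.
        args: ["python3", "app.py",
               "--project", "[YOUR_GCP_PROJECT_NAME]",
               "--dataset", "[YOUR_DATASET_NAME]",
               "--table", "rides",
               "--batch_size", "500",
               "--total_batches", "100"]
      restartPolicy: Never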

Step 13: Once your deployment.yaml is updated, run the following command to start your GKE job.

kubectl apply -f deployment.yaml

Step 14: Go to the GKE console and check the status of your job. Also, go to the BigQuery console and verify that the job is populating records.
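
You can also check progress from the command line; for example (the job name comes from your deployment.yaml, so bq-streaming-job below is just the placeholder used in the sketch above):

kubectl get jobs
kubectl logs job/bq-streaming-job
bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM [YOUR_DATASET_NAME].rides'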