GoogleCloudPlatform/dataflow-pubsub-dedup

Deduplication with Cloud PubSub and Cloud Dataflow on Google Cloud Platform

This is the source code that accompanies the solution: Deduplication of messages with Cloud PubSub and Cloud Dataflow. This sample code demonstrates three approaches for deduplication:

  • PubSubIO: com.google.examples.dfdedup.DedupWithPubSubIO
  • Distinct transform: com.google.examples.dfdedup.DedupWithDistinct
  • Custom state based deduplication: com.google.examples.dfdedup.DedupWithStateAndGC
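The idea behind the third approach, state-based deduplication with garbage collection, can be sketched in plain Java without any pipeline framework. This is only an illustration of the general technique, not the code in DedupWithStateAndGC: the class name SeenIdCache, its methods, and the fixed time-to-live are all assumptions made for the example. It remembers each message ID the first time it is seen, reports later occurrences as duplicates, and discards IDs once they are older than the time-to-live so that state does not grow without bound.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of this repository): remembers message IDs for a fixed TTL. */
class SeenIdCache {
    private final long ttlMillis;
    // Message ID -> timestamp (ms) at which the ID was first seen.
    private final Map<String, Long> firstSeen = new HashMap<>();

    SeenIdCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /**
     * Returns true if the ID has not been seen within the TTL (the message should be emitted),
     * false if it is a duplicate (the message should be dropped).
     */
    boolean firstTime(String messageId, long nowMillis) {
        gc(nowMillis);
        return firstSeen.putIfAbsent(messageId, nowMillis) == null;
    }

    /** Garbage collection: forget IDs whose first sighting is at least ttlMillis old. */
    private void gc(long nowMillis) {
        firstSeen.values().removeIf(seenAt -> nowMillis - seenAt >= ttlMillis);
    }

    /** Number of IDs currently held in state. */
    int size() {
        return firstSeen.size();
    }
}
```

The time-to-live is the trade-off to tune: a longer value catches duplicates that arrive further apart but keeps more state, while a shorter value frees state sooner and lets late duplicates through.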

End to end pipeline

You can run the following end to end pipeline to explore deduplication behavior across all three approaches:

[Figure: End to end flow]

Setting up resources

NOTE: If you're new to GCP, please see the quickstarts for Cloud PubSub, BigQuery, and Cloud Dataflow.

BigQuery

Use the schema files under bqschemas/ to create the BigQuery tables.

Cloud PubSub

Running the Python-based data generator
