Is there a way we can run a DML query (INSERT/MERGE) via the BigQuery Spark connector? #575

Open
spadhi7 opened this issue Mar 27, 2022 · 8 comments
Labels: enhancement (New feature or request)

spadhi7 commented Mar 27, 2022

No description provided.

davidrabinowitz (Member) commented

INSERT can be run by creating a DataFrame and saving it to BigQuery.
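
For example, something along these lines should work (a minimal sketch; the table, dataset, and bucket names are placeholders, and the spark-bigquery connector jar is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-insert").getOrCreate()

# Rows to "INSERT"; in practice this comes from your source data.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Appending the DataFrame is the connector's equivalent of an INSERT.
(df.write.format("bigquery")
    .option("table", "my_dataset.my_table")          # placeholder target table
    .option("temporaryGcsBucket", "my-temp-bucket")  # needed for the indirect write path
    .mode("append")
    .save())
```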

MERGE is not supported at the moment - I'd appreciate seeing a use case.

davidrabinowitz self-assigned this Mar 27, 2022

spadhi7 (Author) commented Mar 27, 2022

We are trying to do CDC from a traditional RDBMS to BigQuery via Kafka: Source -> Kafka -> Spark Structured Streaming -> BigQuery.
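
For context, the sink stage looks roughly like the sketch below (the topic, schema, broker, and table names are placeholders). Appends work fine; the problem is applying the updates and deletes from the change feed, which is where MERGE would come in:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("cdc-to-bq").getOrCreate()

# Hypothetical change-event schema; ours carries an op field (I/U/D).
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("op", StringType()),
])

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "cdc-topic")                  # placeholder
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

# The connector can append the stream, but that only covers inserts;
# the updates and deletes in the change feed have nowhere to go.
query = (events.writeStream.format("bigquery")
    .option("table", "my_dataset.my_table")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .option("checkpointLocation", "gs://my-temp-bucket/checkpoints/cdc")
    .start())
query.awaitTermination()
```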

joydeepml commented Sep 13, 2022

@davidrabinowitz MERGE support in the Spark connector would be greatly appreciated. It would help avoid running a separate job using a SQL MERGE statement: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement

The use case, as mentioned by @spadhi7, is to apply the changes in a source table to a table in BigQuery by utilising the change feed, which provides the inserts, updates, and deletes in the source table.

AFAIK, merge is supported in Spark only for the Delta format:
https://docs.delta.io/latest/delta-update.html#language-python

It would be great if the BigQuery connector supported that too.
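
For reference, this is roughly what the Delta merge API from the link above looks like (paths, keys, and the op column are placeholders):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-merge").getOrCreate()

changes = spark.read.parquet("/tmp/changes")       # placeholder change feed
target = DeltaTable.forPath(spark, "/tmp/target")  # placeholder Delta table

(target.alias("t")
    .merge(changes.alias("s"), "t.id = s.id")
    .whenMatchedDelete(condition="s.op = 'D'")  # apply source deletes
    .whenMatchedUpdateAll()                     # apply source updates
    .whenNotMatchedInsertAll()                  # apply source inserts
    .execute())
```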

davidrabinowitz (Member) commented

Thanks for the suggestion - I agree this is a great idea. However, at the moment we try to use only Spark APIs, without any proprietary APIs on our end. We will review this and see what is the best way to implement the merge functionality.

nicodds commented Mar 29, 2023

It would be a great feature!

In my current situation, I need to update a table with a daily delta that may originate either from new inserts or from updates to existing records. To keep the BigQuery table up to date, I have to upload the changes to a staging table and then launch a separate merge query. It would be optimal if that could be done directly in the Spark connector.
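
Concretely, the workaround looks something like the sketch below (the dataset, table, bucket, and column names are placeholders, and the MERGE condition depends on your keys):

```python
from google.cloud import bigquery
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-merge").getOrCreate()
delta_df = spark.read.parquet("gs://my-bucket/daily_delta/")  # placeholder delta source

# Step 1: upload today's changes to a staging table via the connector.
(delta_df.write.format("bigquery")
    .option("table", "my_dataset.staging_table")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())

# Step 2: launch the separate merge query against the target table.
client = bigquery.Client()
client.query("""
    MERGE `my_dataset.target_table` t
    USING `my_dataset.staging_table` s
    ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET name = s.name, updated_at = s.updated_at
    WHEN NOT MATCHED THEN
      INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at)
""").result()
```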

I agree, though, that this feature would be hard to implement, since it would add complexity to the load job.

h5chauhan commented Apr 5, 2023

This would be similar to Delta.io. It would be nice if the connector could support it.

khaledh commented Apr 6, 2023

As @nicodds mentioned, we're using the same approach: write the new changes to a temp table using the connector, run a BQ SQL query to do the merge, and finally drop the temp table. It would be nice to do this directly instead.

I wonder if the connector can implement this feature in a way similar to how Iceberg does it: https://iceberg.apache.org/docs/latest/spark-writes/#merge-into
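
Something along these lines (Iceberg's current syntax, with placeholder catalog/table/view names) would replace the temp-table dance entirely:

```python
from pyspark.sql import SparkSession

# Assumes Iceberg's runtime jar and SQL extensions are configured.
spark = (SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate())

# `updates` would be a temp view over the change feed; names are placeholders.
spark.sql("""
    MERGE INTO catalog.db.target t
    USING updates s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```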

ajaybiswal commented

Hi, just wanted to know if the merge feature is available now, as I couldn't find anything in the docs.
