Spark observable metrics not working when writing to BigQuery tables #1040

Open
ylashin opened this issue Aug 9, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@ylashin

ylashin commented Aug 9, 2023

I have a DataFrame to write to a BQ table using the direct write method. I would like to collect some metrics during the write operation. The write operation succeeds, but extracting the metric value gets stuck and never returns.

Here is the workflow.

Run a Spark shell as follows:

export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS-FILE-PATH>

spark-shell \
    --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.1 \
    --jars /root/gcs-connector-latest-hadoop2.jar \
    --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

Then run the snippet below inside it.

import org.apache.spark.sql.functions.{count,lit}
import org.apache.spark.sql.Observation

val df = spark.range(100)
val obs = Observation("record-count-observable")
val df_wrapped = df.observe(obs, count(lit(1)).alias("count"))
df_wrapped.write.format("bigquery").option("table", "<TABLE-NAME>").option("writeMethod", "direct").mode("overwrite").save()
obs.get("count")

The last statement, obs.get("count"), just hangs forever and never returns.

If we change the write operation to write Parquet to GCS instead, the last statement returns fine and yields 100, which is the record count.

df_wrapped.write.parquet("gs://<SOME-GCS-PATH>")
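Until the hang itself is addressed, one defensive pattern while diagnosing is to wrap the blocking call in a Future with a timeout so the shell is not stuck indefinitely. The sketch below is purely illustrative: blockingGet is a hypothetical stand-in for the hanging obs.get("count") call, simulated here with a sleep, so it runs without Spark.

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for the blocking obs.get("count") call;
// here we simulate a call that never returns.
def blockingGet(): Long = { Thread.sleep(Long.MaxValue); 0L }

// Wait at most 5 seconds instead of blocking the shell forever.
val result: Option[Long] =
  try Some(Await.result(Future(blockingGet()), 5.seconds))
  catch { case _: TimeoutException => None }

println(result)  // None: the call timed out rather than hanging
```

This does not recover the metric; it only bounds how long the shell waits before giving up.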

@davidrabinowitz
Member

Please try the com.google.cloud.spark:spark-3.3-bigquery:0.32.2 package.
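For reference, trying that artifact would look like the following launch command: a sketch of the original spark-shell invocation with only the --packages coordinate swapped, keeping the placeholder paths from the original report.

```shell
export GOOGLE_APPLICATION_CREDENTIALS=<CREDENTIALS-FILE-PATH>

spark-shell \
    --packages com.google.cloud.spark:spark-3.3-bigquery:0.32.2 \
    --jars /root/gcs-connector-latest-hadoop2.jar \
    --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```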

@ylashin
Author

ylashin commented Aug 9, 2023

Hi @davidrabinowitz

Thank you for the prompt response. I tried the latest version, 0.32.2, but the issue persists.

(screenshot omitted)

I tried starting the Spark shell once with --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2 and once with com.google.cloud.spark:spark-3.3-bigquery:0.32.2; both behave the same.

@davidrabinowitz
Member

This is a recent Spark 3.3 feature. It seems it is not supported out of the box; we'll look into how to support it.

@isha97 isha97 added the enhancement New feature or request label Feb 16, 2024