
Spark String -> BigQuery Datetime field fails to load with Indirect load method #1323

Closed
jster1357 opened this issue Dec 10, 2024 · 2 comments
@jster1357

Connector versions tested: 0.39.1, 0.41.0
Spark version: 3.5.1

When working with datetime fields, the data needs to be serialized as a string or converted to a timestamp, since Spark has no native datetime (wall-clock) data type.

If the datetime field in your Spark dataframe is serialized as a string, the load to BigQuery fails with the indirect method. This is not the case with the direct load method, which uses the Storage Write API.

Sample Data:

import random
from datetime import datetime, timedelta

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", TimestampType(), True)
])

data = []
for _ in range(10):
    random_string = ''.join(random.choice('abcdefghijklmnopqrstuvwxyz') for _ in range(10))
    random_datetime = datetime.now() - timedelta(days=random.randint(0, 365))
    data.append((random_string, random_datetime))

df = spark.createDataFrame(data, schema)

My BQ Table:

create or replace table demo_data.datetime_test (
  a string, 
  b datetime
);

Error:

Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Provided Schema does not match Table [projectid]:demo_data.datetime_test. Field b has changed type from DATETIME to STRING

I dumped the data I was loading to a Parquet file and attempted to load it directly with the bq load tool. I received a similar error there as well, which made me think the issue lies in the BigQuery load job rather than in the connector itself.

Provided Schema does not match Table [projectid]:demo_data.datetime_test. Field b has changed type from DATETIME to STRING

I'd consider this a bug, given that one load method works and the other does not.

Is the solution here just to convert the column to TIMESTAMP_NTZ in the dataframe? That seems to work with both direct and indirect load methods.

@davidrabinowitz
Member

Do you use the spark-bigquery-with-dependencies or the spark-3.5-bigquery connector?

@jster1357
Author

I tested with s8s (Dataproc Serverless, interactive), and I set the connector version using the property dataproc.sparkBqConnector.version.
