[BUG] Custom expectation takes too long to execute when the {batch} parameter is used in a custom query, but runs much faster when the table name is passed to the query instead
#10757 · Open · Erua-chijioke opened this issue on Dec 10, 2024 · 0 comments
Describe the bug
Using Great Expectations version 1.2.4, I created a custom expectation using SQL, and in the query I used the {batch} parameter as shown in the documentation: https://docs.greatexpectations.io/docs/core/customize_expectations/use_sql_to_define_a_custom_expectation. In the checkpoint run I passed the batch parameter, which is a dataframe with a very large number of rows (around 30 million), and it took 17 minutes to run the checkpoint and return the validation result. When I executed the same custom query directly with Spark on Databricks, it took less than 1 minute.
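For comparison, this is a minimal sketch of the direct Spark SQL run on Databricks, outside Great Expectations (`xxx` is a placeholder for the real table name, matching the repro script below):

```python
# Direct Spark SQL run of the same aggregation, outside Great Expectations.
# "xxx" is a placeholder table name, as in the repro script below.
result = spark.sql("""
    SELECT CrayonCompanyIdRef, SUM(accountingcurrencyamount) AS total_amount
    FROM xxx
    WHERE accountingdate >= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 1, 1)
      AND accountingdate <= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 12, 31)
    GROUP BY CrayonCompanyIdRef
    HAVING SUM(accountingcurrencyamount) NOT BETWEEN -1 AND 1
""")
result.show()  # finishes in under a minute on the same ~30M-row table
```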
To Reproduce
Create the data source, asset, batch definition, and expectation suite:
```python
import great_expectations as gx
from datetime import datetime

# Source data: ensure this has a large number of rows (up to ~30 million)
# to reproduce the slowdown
df = spark.sql("SELECT * FROM xxx")

context = gx.get_context(project_root_dir="/dbfs/xxx")

data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
batch_definition_name = "my_batch_definition"
suite_name = "d365_enriched_generaljournaltransaction_expectation_suite"
validation_definition_name = "d365_enriched_generaljournaltransaction_validation_definition"
checkpoint_name = "d365_enriched_generaljournaltransaction_checkpoint"

data_source = context.data_sources.add_or_update_spark(name=data_source_name)

try:
    data_asset = data_source.add_dataframe_asset(name=data_asset_name)
except Exception as e:
    data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

try:
    batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)
except Exception as e:
    batch_definition = data_asset.get_batch_definition(batch_definition_name)

try:
    suite = gx.ExpectationSuite(name=suite_name)
    suite = context.suites.add(suite)
except Exception as e:
    suite = context.suites.get(name=suite_name)

# Now let's create a custom expectation using SQL
class ExpectValidLineItemSum(gx.expectations.UnexpectedRowsExpectation):
    unexpected_rows_query: str = (
        """SELECT CrayonCompanyIdRef, SUM(accountingcurrencyamount) AS total_amount
        FROM {batch}
        WHERE accountingdate >= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 1, 1)
        AND accountingdate <= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 12, 31)
        GROUP BY CrayonCompanyIdRef
        HAVING SUM(accountingcurrencyamount) NOT BETWEEN -1 AND 1"""
    )
    description: str = "Line items should have a valid sum between -1 and +1"

expectation = ExpectValidLineItemSum()

try:
    suite.add_expectation(expectation)
except Exception as e:
    print("Expectation already exists in the suite")

try:
    validation_definition = gx.ValidationDefinition(
        data=batch_definition, suite=suite, name=validation_definition_name
    )
    validation_definition = context.validation_definitions.add(validation_definition)
except Exception as e:
    validation_definition = context.validation_definitions.get(validation_definition_name)

validation_definitions = [validation_definition]

try:
    checkpoint = gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=validation_definitions,
        actions=[
            # This Action updates the Data Docs static website with the
            # Validation Results after the Checkpoint is run.
            gx.checkpoint.UpdateDataDocsAction(name="update_all_data_docs"),
        ],
        result_format={"result_format": "COMPLETE"},
    )
    context.checkpoints.add(checkpoint)
except Exception as e:
    checkpoint = context.checkpoints.get(checkpoint_name)

batch_parameters = {"dataframe": df}
run_id = gx.RunIdentifier(
    run_name=f"validation_checkpoint_for_{batch_definition_name}",
    run_time=datetime.now(),
)
validation_result = checkpoint.run(batch_parameters=batch_parameters, run_id=run_id)
```
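To quantify the slowdown, a simple stdlib timing wrapper around the checkpoint run can be used (a sketch; `success` should be available on the `CheckpointResult` returned by `checkpoint.run` in GX 1.x):

```python
import time

start = time.monotonic()
validation_result = checkpoint.run(batch_parameters=batch_parameters, run_id=run_id)
elapsed = time.monotonic() - start

# With {batch} in the query this reports ~17 minutes; the same query run
# directly on Spark finishes in well under a minute.
print(f"success={validation_result.success}, elapsed={elapsed / 60:.1f} min")
```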
The image attached below shows the run taking 17 minutes.
Expected behavior
It should not take that long to run a single expectation on a dataframe. When I ran the same custom query on Spark using the table name instead of {batch} (see the comparison sketch above), everything completed in less than 5 seconds. An image is attached for reference.
Environment (please complete the following information):
- Operating System: Windows
- Great Expectations Version: 1.2.4
- Data Source: Spark Dataframe
- Cloud environment: Databricks