
[BUG] Custom expectation takes too long to execute when the {batch} parameter is used in a custom query, but runs much faster when the table name is passed to the query instead #10757

Open
Erua-chijioke opened this issue Dec 10, 2024 · 0 comments


Describe the bug
Using Great Expectations version 1.2.4, I created a custom expectation using SQL. In the query I used the {batch} parameter as shown in the documentation: https://docs.greatexpectations.io/docs/core/customize_expectations/use_sql_to_define_a_custom_expectation
On the checkpoint I passed the batch parameter, which is a dataframe with a very large number of rows, and it took 17 minutes to run the checkpoint and return the validation result. When I executed the same custom query directly with Spark on Databricks, it took less than 1 minute.

To Reproduce
Create the data source, data asset, batch definition, and expectation suite:
import great_expectations as gx
from datetime import datetime

# `spark` is the ambient SparkSession on Databricks
df = spark.sql("SELECT * FROM xxx")  # ensure this has a large number of rows, up to ~30 million
context = gx.get_context(project_root_dir="/dbfs/xxx")

data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
batch_definition_name = "my_batch_definition"
suite_name = "d365_enriched_generaljournaltransaction_expectation_suite"
validation_definition_name="d365_enriched_generaljournaltransaction_validation_definition"
checkpoint_name="d365_enriched_generaljournaltransaction_checkpoint"

data_source = context.data_sources.add_or_update_spark(name=data_source_name)
try:
    data_asset = data_source.add_dataframe_asset(name=data_asset_name)
except Exception:
    data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

try:
    batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)
except Exception:
    batch_definition = data_asset.get_batch_definition(batch_definition_name)

try:
    suite = gx.ExpectationSuite(name=suite_name)
    suite = context.suites.add(suite)
except Exception:
    suite = context.suites.get(name=suite_name)

# Now let's create a custom expectation using SQL
class ExpectValidLineItemSum(gx.expectations.UnexpectedRowsExpectation):
    unexpected_rows_query: str = (
        """SELECT CrayonCompanyIdRef, SUM(accountingcurrencyamount) AS total_amount
        FROM {batch}
        WHERE accountingdate >= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 1, 1)
          AND accountingdate <= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 12, 31)
        GROUP BY CrayonCompanyIdRef
        HAVING SUM(accountingcurrencyamount) NOT BETWEEN -1 AND 1"""
    )
    description: str = "Line items should have a valid sum between -1 and +1"

expectation = ExpectValidLineItemSum()

try:
    suite.add_expectation(expectation)
except Exception:
    print("Expectation already exists in the suite")

try:
    validation_definition = gx.ValidationDefinition(
        data=batch_definition, suite=suite, name=validation_definition_name
    )
    validation_definition = context.validation_definitions.add(validation_definition)
except Exception:
    validation_definition = context.validation_definitions.get(validation_definition_name)

validation_definitions = [validation_definition]

try:
    checkpoint = gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=validation_definitions,
        actions=[
            # This Action updates the Data Docs static website with the
            # Validation Results after the Checkpoint is run.
            gx.checkpoint.UpdateDataDocsAction(name="update_all_data_docs"),
        ],
        result_format={"result_format": "COMPLETE"},
    )
    context.checkpoints.add(checkpoint)
except Exception:
    checkpoint = context.checkpoints.get(checkpoint_name)

batch_parameters = {"dataframe": df}

run_id = gx.RunIdentifier(run_name=f"validation_checkpoint_for_{batch_definition_name}", run_time=datetime.now())

validation_result = checkpoint.run(batch_parameters=batch_parameters, run_id=run_id)

(Screenshot attached showing the checkpoint run taking 17 minutes.)

Expected behavior
Running a single expectation on a dataframe should not take that long. When I ran the same custom query directly in Spark, using the table name instead of {batch}, everything finished in less than 5 seconds. A screenshot is attached for your perusal.
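For reference, the fast variant described above can be sketched as follows: the expectation's query with the {batch} placeholder replaced by a concrete table name (here `xxx`, the placeholder table name from the repro, not a real table). This is the form that runs in seconds when passed straight to `spark.sql`:

```python
# Sketch: the same query the expectation runs, but with {batch} swapped
# for a concrete table name ("xxx" is a placeholder from the repro above).
QUERY_TEMPLATE = """SELECT CrayonCompanyIdRef, SUM(accountingcurrencyamount) AS total_amount
FROM {batch}
WHERE accountingdate >= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 1, 1)
  AND accountingdate <= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 12, 31)
GROUP BY CrayonCompanyIdRef
HAVING SUM(accountingcurrencyamount) NOT BETWEEN -1 AND 1"""

# Substitute the table name where GX would normally inject the batch subquery.
direct_query = QUERY_TEMPLATE.format(batch="xxx")

# On Databricks this runs directly and quickly:
# spark.sql(direct_query).show()
print(direct_query.splitlines()[1])  # FROM xxx
```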

Environment (please complete the following information):

  • Operating System: Windows
  • Great Expectations Version: 1.2.4
  • Data Source: Spark Dataframe
  • Cloud environment: Databricks

