[BUG] Custom expectation takes too long to execute when the {batch} parameter is used in a custom query, but runs much faster when the table name is passed to the query instead
#10757 · Open · Erua-chijioke opened this issue on Dec 10, 2024 · 0 comments
Describe the bug
Using Great Expectations version 1.2.4, I created a custom expectation using SQL, and in the query I used the {batch} parameter as shown in the documentation: https://docs.greatexpectations.io/docs/core/customize_expectations/use_sql_to_define_a_custom_expectation. In the checkpoint run I passed the batch parameter, which is a dataframe with a very large number of rows (around 30 million), and it took 17 minutes to run the checkpoint and return the validation result. When I executed the same custom query directly with Spark on Databricks, it took less than 1 minute.
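For comparison, this is a minimal sketch of the direct Spark SQL run on Databricks, outside Great Expectations (`xxx` is a placeholder for the real table name, matching the repro script below):

```python
# Direct Spark SQL run of the same aggregation, outside Great Expectations.
# "xxx" is a placeholder table name, as in the repro script below.
result = spark.sql("""
    SELECT CrayonCompanyIdRef, SUM(accountingcurrencyamount) AS total_amount
    FROM xxx
    WHERE accountingdate >= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 1, 1)
      AND accountingdate <= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 12, 31)
    GROUP BY CrayonCompanyIdRef
    HAVING SUM(accountingcurrencyamount) NOT BETWEEN -1 AND 1
""")
result.show()  # finishes in under a minute on the same ~30M-row table
```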
To Reproduce
Create the data source, asset, batch definition, and expectation suite:
```python
import great_expectations as gx
from datetime import datetime

# Source data: ensure this has a large number of rows (up to ~30 million)
# to reproduce the slowdown
df = spark.sql("SELECT * FROM xxx")

context = gx.get_context(project_root_dir="/dbfs/xxx")

data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
batch_definition_name = "my_batch_definition"
suite_name = "d365_enriched_generaljournaltransaction_expectation_suite"
validation_definition_name = "d365_enriched_generaljournaltransaction_validation_definition"
checkpoint_name = "d365_enriched_generaljournaltransaction_checkpoint"

data_source = context.data_sources.add_or_update_spark(name=data_source_name)

try:
    data_asset = data_source.add_dataframe_asset(name=data_asset_name)
except Exception as e:
    data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

try:
    batch_definition = data_asset.add_batch_definition_whole_dataframe(batch_definition_name)
except Exception as e:
    batch_definition = data_asset.get_batch_definition(batch_definition_name)

try:
    suite = gx.ExpectationSuite(name=suite_name)
    suite = context.suites.add(suite)
except Exception as e:
    suite = context.suites.get(name=suite_name)

# Now let's create a custom expectation using SQL
class ExpectValidLineItemSum(gx.expectations.UnexpectedRowsExpectation):
    unexpected_rows_query: str = (
        """SELECT CrayonCompanyIdRef, SUM(accountingcurrencyamount) AS total_amount
        FROM {batch}
        WHERE accountingdate >= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 1, 1)
        AND accountingdate <= MAKE_DATE(YEAR(CURRENT_DATE) - 1, 12, 31)
        GROUP BY CrayonCompanyIdRef
        HAVING SUM(accountingcurrencyamount) NOT BETWEEN -1 AND 1"""
    )
    description: str = "Line items should have a valid sum between -1 and +1"

expectation = ExpectValidLineItemSum()

try:
    suite.add_expectation(expectation)
except Exception as e:
    print("Expectation already exists in the suite")

try:
    validation_definition = gx.ValidationDefinition(
        data=batch_definition, suite=suite, name=validation_definition_name
    )
    validation_definition = context.validation_definitions.add(validation_definition)
except Exception as e:
    validation_definition = context.validation_definitions.get(validation_definition_name)

validation_definitions = [validation_definition]

try:
    checkpoint = gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=validation_definitions,
        actions=[
            # This Action updates the Data Docs static website with the
            # Validation Results after the Checkpoint is run.
            gx.checkpoint.UpdateDataDocsAction(name="update_all_data_docs"),
        ],
        result_format={"result_format": "COMPLETE"},
    )
    context.checkpoints.add(checkpoint)
except Exception as e:
    checkpoint = context.checkpoints.get(checkpoint_name)

batch_parameters = {"dataframe": df}
run_id = gx.RunIdentifier(
    run_name=f"validation_checkpoint_for_{batch_definition_name}",
    run_time=datetime.now(),
)
validation_result = checkpoint.run(batch_parameters=batch_parameters, run_id=run_id)
```
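To quantify the slowdown, a simple stdlib timing wrapper around the checkpoint run can be used (a sketch; `success` should be available on the `CheckpointResult` returned by `checkpoint.run` in GX 1.x):

```python
import time

start = time.monotonic()
validation_result = checkpoint.run(batch_parameters=batch_parameters, run_id=run_id)
elapsed = time.monotonic() - start

# With {batch} in the query this reports ~17 minutes; the same query run
# directly on Spark finishes in well under a minute.
print(f"success={validation_result.success}, elapsed={elapsed / 60:.1f} min")
```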
The image attached below shows the run taking 17 minutes.
Expected behavior
It should not take that long to run a single expectation on a dataframe. When I ran the same custom query on Spark using the table name instead of {batch} (see the comparison sketch above), everything completed in less than 5 seconds. An image is attached for reference.
Environment (please complete the following information):
- Operating System: Windows
- Great Expectations Version: 1.2.4
- Data Source: Spark Dataframe
- Cloud environment: Databricks