Describe the bug

Hi Team,

I validated a column using the expect_column_values_to_be_in_set expectation and I have some doubts about the output result. First, I don't see unexpected_percent_total in the documentation.

The description of unexpected_percent_nonmissing is: "The percent of unexpected values in a column, excluding rows that have no value for that column." But in the example below, unexpected_percent_nonmissing should be 211608 / 921750 * 100 = 22.95720097640358, not 65.95724785397692 (I don't see where this value comes from).

As I understand it, unexpected_percent_total should be (unexpected_count + missing_count) / element_count * 100 = 88.15101708, where unexpected_count is the total count of unexpected values in a column (excluding NULL values). Then unexpected_percent should be 22.9572009764035.

Please correct me if I'm missing something. Here is where I see these metrics being calculated.
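For reference, here is a minimal sketch of the two formulas as I read them. The function names are mine, and the unexpected_percent_total formula is my assumption from the docs, not taken from the GX source:

```python
def unexpected_percent_nonmissing(unexpected_count: int, nonmissing_count: int) -> float:
    # Per the docs: percent of unexpected values among rows that have a value.
    return unexpected_count / nonmissing_count * 100


def unexpected_percent_total(unexpected_count: int, missing_count: int, element_count: int) -> float:
    # My assumption: unexpected plus missing rows, relative to all rows.
    return (unexpected_count + missing_count) / element_count * 100


# Counts from my run: 211608 unexpected values out of 921750 non-missing rows.
print(unexpected_percent_nonmissing(211608, 921750))  # ≈ 22.95720097640358
```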
To Reproduce

```python
import importlib
from json import loads

import great_expectations as ge

# Assumed to be defined in the surrounding Databricks notebook/session:
# spark, catalog, schema, table_name, table_key_list


def test_expectation():
    context = ge.get_context()
    data_name = f"{catalog}.{schema}.{table_name}"
    # Set up context
    data_source = context.data_sources.add_spark(name=data_name)
    data_asset = data_source.add_dataframe_asset(name=table_name)
    batch_parameters = {"dataframe": spark.read.table(f"{catalog}.{schema}.{table_name}")}
    expectation_suite = context.suites.add(ge.core.expectation_suite.ExpectationSuite(name=f"s-{data_name}"))
    # Get the expectation class
    gx_expectation = my_import(expectation_name, args)
    # Add the expectation to the suite
    expectation_suite.add_expectation(gx_expectation)
    batch_definition = data_asset.add_batch_definition_whole_dataframe("batch_definition")
    validation_definition = context.validation_definitions.add(
        ge.core.validation_definition.ValidationDefinition(
            name=f"vd-{data_name}",
            data=batch_definition,
            suite=expectation_suite,
        )
    )
    checkpoint = context.checkpoints.add(
        ge.Checkpoint(
            name=f"cp-{data_name}",
            validation_definitions=[validation_definition],
            result_format={
                "result_format": "COMPLETE",
                "unexpected_index_column_names": table_key_list,
                "partial_unexpected_count": 0,
            },
        )
    )
    result = checkpoint.run(batch_parameters=batch_parameters)
    return result


def my_import(name: str, arguments: dict) -> object:
    """Import a specific Expectation class from the GX expectations module.

    name -- name of the Expectation class to be imported
    arguments -- dictionary with the arguments to be passed to the Expectation class
    """
    module = importlib.import_module("great_expectations.expectations")
    return getattr(module, name)(**arguments)


expectation_name = "ExpectColumnValuesToBeInSet"
arguments = '{"column": "employee_category", "value_set": ["100090", "100240", "100020", "100070"]}'
args = loads(arguments)
```
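To see the exact numbers GX reports, I print each expectation's result dict from the checkpoint run. This is a sketch based on my understanding of the 1.x CheckpointResult structure (run_results mapping to suite-level validation results); adjust the attribute access if it differs in your version:

```python
result = test_expectation()

# Each expectation's `result` dict should carry the metrics in question.
for suite_result in result.run_results.values():
    for expectation_result in suite_result.results:
        r = expectation_result.result
        print({key: r.get(key) for key in (
            "element_count",
            "missing_count",
            "unexpected_count",
            "unexpected_percent",
            "unexpected_percent_total",
            "unexpected_percent_nonmissing",
        )})
```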
Environment (please complete the following information):
- Operating System: Windows
- Great Expectations Version: 1.2
- Data Source: PySpark Databricks
- Cloud environment: Azure Databricks