Validation Result #10778

Open

victorgrcp opened this issue Dec 13, 2024 · 0 comments
Describe the bug
Hi Team,

I validated a column using the expect_column_values_to_be_in_set expectation and I have some doubts about the output. First, I don't see unexpected_percent_total in the documentation.

The description of unexpected_percent_nonmissing is: "The percent of unexpected values in a column, excluding rows that have no value for that column." But in the example below I would expect unexpected_percent_nonmissing to be 211608 / 921750 * 100 = 22.95720097640358, not 65.95724785397692 (I don't see where this value comes from).

To my understanding, unexpected_percent_total should be (unexpected_count + missing_count) / element_count * 100 = 88.15101708, where unexpected_count is the total count of unexpected values in the column (excluding NULL values). Then unexpected_percent should be 22.9572009764035.

Please correct me if I'm missing something.

"result": {
        "element_count": 921750,
        "unexpected_count": 211608,
        "unexpected_percent": 65.95724785397692,
        "partial_unexpected_list": [],
        "unexpected_index_column_names": ["assignment_key","assignment_id"],
        "missing_count": 600924,
        "missing_percent": 65.19381611065907,
        "unexpected_percent_total": 22.95720097640358,
        "unexpected_percent_nonmissing": 65.95724785397692,        
        "unexpected_list": ["100020",...], ...

Here is where I see these metrics being calculated; a quick sanity check of the arithmetic is sketched below.
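
For reference, this is the arithmetic I get if I assume (as that code seems to do) that unexpected_percent_nonmissing divides unexpected_count by the non-missing count, while unexpected_percent_total divides it by element_count. Plugging in the counts from the result above reproduces both reported percentages:

element_count = 921750
unexpected_count = 211608
missing_count = 600924

nonmissing_count = element_count - missing_count  # 320826

# percent of unexpected values among non-missing rows
print(unexpected_count / nonmissing_count * 100)  # 65.95724785397692
# percent of unexpected values among all rows
print(unexpected_count / element_count * 100)     # 22.95720097640358
# percent of missing rows
print(missing_count / element_count * 100)        # 65.19381611065907

Note that in the result above unexpected_percent equals unexpected_percent_nonmissing, not unexpected_percent_total.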

To Reproduce

import importlib
from json import loads

import great_expectations as ge

# catalog, schema, table_name, table_key_list and the Databricks `spark` session
# are defined earlier in the notebook
def test_expectation():
    context = ge.get_context()
    data_name = f"{catalog}.{schema}.{table_name}"
    
    # Set up context
    data_source = context.data_sources.add_spark(name=data_name)
    data_asset = data_source.add_dataframe_asset(name=table_name)
    
    batch_parameters = {"dataframe": spark.read.table(f"{catalog}.{schema}.{table_name}")}
    
    expectation_suite = context.suites.add(ge.core.expectation_suite.ExpectationSuite(name=f"s-{data_name}"))
    
    # Get the expectation class
    gx_expectation = my_import(expectation_name, args)
    # Add the expectation to the suite
    expectation_suite.add_expectation(gx_expectation)
    
    batch_definition = data_asset.add_batch_definition_whole_dataframe("batch_definition")
    validation_definition = context.validation_definitions.add(
            ge.core.validation_definition.ValidationDefinition(
                name=f"vd-{data_name}",
                data=batch_definition,
                suite=expectation_suite,
            )
    )
    checkpoint = context.checkpoints.add(
        ge.Checkpoint(
            name=f"cp-{data_name}",
            validation_definitions=[validation_definition],
            result_format={
                "result_format": "COMPLETE",
                "unexpected_index_column_names": table_key_list,
                "partial_unexpected_count": 0,
            },
        )
    )
    result = checkpoint.run(batch_parameters=batch_parameters)
    return result

def my_import(name: str, arguments: dict) -> object:
    """Import a specific Expectation Class from GX Expectation module
    name -- name of the Expectation Class to be imported
    arguments -- dictionary with the arguments to be passed to the Expectation Class
    """
    module = importlib.import_module("great_expectations.expectations")
    return getattr(module, name)(**arguments)

expectation_name = "ExpectColumnValuesToBeInSet"
arguments = '{"column": "employee_category", "value_set": ["100090", "100240", "100020","100070"]}'
args = loads(arguments)
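
A minimal sketch of how I run this and pull out the metric block quoted above (the run_results / results attributes are from my reading of the 1.x CheckpointResult API, so treat this as a sketch rather than exact code):

checkpoint_result = test_expectation()

# run_results maps each validation definition to its suite-level result;
# each of those holds the per-expectation results whose "result" dict is shown above
for suite_result in checkpoint_result.run_results.values():
    for expectation_result in suite_result.results:
        print(expectation_result.result)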

Environment (please complete the following information):

  • Operating System: Windows
  • Great Expectations Version: 1.2
  • Data Source: PySpark Databricks
  • Cloud environment: Azure Databricks