Validation Result #10778

Open

victorgrcp opened this issue Dec 13, 2024 · 0 comments
Describe the bug
Hi Team,

I validated a column using the expect_column_values_to_be_in_set expectation and I have some doubts about the output. First, I don't see unexpected_percent_total in the documentation.

The description of unexpected_percent_nonmissing is: "The percent of unexpected values in a column, excluding rows that have no value for that column." But in the example below I would expect unexpected_percent_nonmissing to be 211608 / 921750 * 100 = 22.95720097640358, not 65.95724785397692 (I don't see where this value comes from).

To my understanding, unexpected_percent_total should be (unexpected_count + missing_count) / element_count * 100 = 88.15101708, where unexpected_count is the total count of unexpected values in the column (excluding NULL values). Then unexpected_percent should be 22.9572009764035.

Please correct me if I'm missing something.

"result": {
        "element_count": 921750,
        "unexpected_count": 211608,
        "unexpected_percent": 65.95724785397692,
        "partial_unexpected_list": [],
        "unexpected_index_column_names": ["assignment_key","assignment_id"],
        "missing_count": 600924,
        "missing_percent": 65.19381611065907,
        "unexpected_percent_total": 22.95720097640358,
        "unexpected_percent_nonmissing": 65.95724785397692,        
        "unexpected_list": ["100020",...], ...

Here is where I see these metrics being calculated; a quick sanity check of the arithmetic is sketched below.
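
For reference, this is the arithmetic I get if I assume (as that code seems to do) that unexpected_percent_nonmissing divides unexpected_count by the non-missing count, while unexpected_percent_total divides it by element_count. Plugging in the counts from the result above reproduces both reported percentages:

element_count = 921750
unexpected_count = 211608
missing_count = 600924

nonmissing_count = element_count - missing_count  # 320826

# percent of unexpected values among non-missing rows
print(unexpected_count / nonmissing_count * 100)  # 65.95724785397692
# percent of unexpected values among all rows
print(unexpected_count / element_count * 100)     # 22.95720097640358
# percent of missing rows
print(missing_count / element_count * 100)        # 65.19381611065907

Note that in the result above unexpected_percent equals unexpected_percent_nonmissing, not unexpected_percent_total.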

To Reproduce

import importlib
from json import loads

import great_expectations as ge

# catalog, schema, table_name, table_key_list and the Databricks `spark` session
# are defined earlier in the notebook
def test_expectation():
    context = ge.get_context()
    data_name = f"{catalog}.{schema}.{table_name}"
    
    # Set up context
    data_source = context.data_sources.add_spark(name=data_name)
    data_asset = data_source.add_dataframe_asset(name=table_name)
    
    batch_parameters = {"dataframe": spark.read.table(f"{catalog}.{schema}.{table_name}")}
    
    expectation_suite = context.suites.add(ge.core.expectation_suite.ExpectationSuite(name=f"s-{data_name}"))
    
    # Get the expectation class
    gx_expectation = my_import(expectation_name, args)
    # Add the expectation to the suite
    expectation_suite.add_expectation(gx_expectation)
    
    batch_definition = data_asset.add_batch_definition_whole_dataframe("batch_definition")
    validation_definition = context.validation_definitions.add(
            ge.core.validation_definition.ValidationDefinition(
                name=f"vd-{data_name}",
                data=batch_definition,
                suite=expectation_suite,
            )
    )
    checkpoint = context.checkpoints.add(
        ge.Checkpoint(
            name=f"cp-{data_name}",
            validation_definitions=[validation_definition],
            result_format={
                "result_format": "COMPLETE",
                "unexpected_index_column_names": table_key_list,
                "partial_unexpected_count": 0,
            },
        )
    )
    result = checkpoint.run(batch_parameters=batch_parameters)
    return result

def my_import(name: str, arguments: dict) -> object:
    """Import a specific Expectation Class from GX Expectation module
    name -- name of the Expectation Class to be imported
    arguments -- dictionary with the arguments to be passed to the Expectation Class
    """
    module = importlib.import_module("great_expectations.expectations")
    return getattr(module, name)(**arguments)

expectation_name = "ExpectColumnValuesToBeInSet"
arguments = '{"column": "employee_category", "value_set": ["100090", "100240", "100020","100070"]}'
args = loads(arguments)
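
A minimal sketch of how I run this and pull out the metric block quoted above (the run_results / results attributes are from my reading of the 1.x CheckpointResult API, so treat this as a sketch rather than exact code):

checkpoint_result = test_expectation()

# run_results maps each validation definition to its suite-level result;
# each of those holds the per-expectation results whose "result" dict is shown above
for suite_result in checkpoint_result.run_results.values():
    for expectation_result in suite_result.results:
        print(expectation_result.result)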

Environment (please complete the following information):

  • Operating System: Windows
  • Great Expectations Version: 1.2
  • Data Source: PySpark Databricks
  • Cloud environment: Azure Databricks