[MAINTENANCE] Enforce mandatory docstrings for public API decorated objects #10799

Open: wants to merge 18 commits into develop
Conversation

cdkini (Member) commented Dec 19, 2024

Refer to ADR 0005: public objects that are rendered in our docs should always have docstrings.
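The enforcement idea can be sketched roughly like this (an illustrative stand-in, not this PR's actual implementation; the `public_api` decorator and registry below are hypothetical simplifications of the real mechanism):

```python
import inspect

# Hypothetical registry; Great Expectations' real @public_api decorator
# works differently. This only illustrates the enforcement idea.
_PUBLIC_REGISTRY = []


def public_api(obj):
    """Stand-in decorator marking an object as part of the public API."""
    _PUBLIC_REGISTRY.append(obj)
    return obj


def missing_docstrings():
    """Names of registered public objects that lack a docstring."""
    return [
        obj.__qualname__
        for obj in _PUBLIC_REGISTRY
        if not (inspect.getdoc(obj) or "").strip()
    ]


@public_api
def documented():
    """This object is rendered in the docs, so it has a docstring."""


@public_api
def undocumented():  # would be flagged by the mandatory-docstring check
    pass


print(missing_docstrings())  # ['undocumented']
```

A check like this can run in CI so that a missing docstring on any decorated public object fails the build rather than silently shipping an empty docs page.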

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, visit our community resources.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!


netlify bot commented Dec 19, 2024

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit ce89ded
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/67658d19b2c0210008c5b306
😎 Deploy Preview https://deploy-preview-10799.docs.greatexpectations.io


codecov bot commented Dec 19, 2024

❌ 4 Tests Failed:

Tests completed: 5764 · Failed: 4 · Passed: 5760 · Skipped: 278
View the top 3 failed tests by shortest run time
tests.datasource.fluent.test_schemas::test_vcs_schemas_match[pandas_gcs:PandasGoogleCloudStorageDatasource]
Stack Traces | 0.205s run time
fluent_ds_or_asset_model = <class 'great_expectations.datasource.fluent.pandas_google_cloud_storage_datasource.PandasGoogleCloudStorageDatasource'>
schema_dir = PosixPath('.../datasource/fluent/schemas')

    @pytest.mark.skipif(
        min_supported_python() < PYTHON_VERSION,
        reason=f"_sort_any_of() keys needs to be fixed for py {PYTHON_VERSION}",
    )
    @pytest.mark.timeout(
        2.0  # this is marked as unit so that it will run on different versions of python
    )
    @pytest.mark.unit
    @pytest.mark.parametrize(
        ["fluent_ds_or_asset_model", "schema_dir"],
        [pytest.param(t[0], t[1], id=t[2]) for t in _models_and_schema_dirs()],
    )
    def test_vcs_schemas_match(  # noqa: C901
        fluent_ds_or_asset_model: Type[Datasource | DataAsset], schema_dir: pathlib.Path
    ):
        """
        Test that json schemas for each DataSource and DataAsset match the current schema
        under version control.
    
        This is important because some of these classes are generated dynamically and
        monitoring these schemas files can clue us into problems that could otherwise go
        unnoticed.
    
        If this test is failing run `invoke schema --sync` to update schemas and commit the
        changes.
    
        Note: if the installed version of pandas doesn't match the one used in the standard
        test pipeline, the test will be marked a `xfail` because the schemas will not match.
        """
    
        def _sort_any_of(d: dict) -> str:
            if items := d.get("items"):
                _sort_any_of(items)
            if ref := d.get("$ref"):
                return ref
            if type_ := d.get("type"):
                return type_
            # return any string for sorting
            return "z"
    
        def _sort_lists(schema_as_dict: dict) -> None:
            """Sometimes "required" and "anyOf" lists come unsorted, causing misleading assertion failures; this corrects the issue.
    
            Args:
                schema_as_dict: source dictionary (will be modified "in-situ")
    
            """  # noqa: E501
            key: str
            value: Any
    
            for key, value in schema_as_dict.items():
                if key == "required":
                    schema_as_dict[key] = sorted(value)
                elif key == "anyOf":
                    if isinstance(value, list):
                        for val in value:
                            if isinstance(value, dict):
                                schema_as_dict[key] = sorted(val, key=_sort_any_of)
                    else:
                        schema_as_dict[key] = sorted(value, key=_sort_any_of)
    
                if isinstance(value, dict):
                    _sort_lists(schema_as_dict=value)
    
        if "Pandas" in str(schema_dir) and _PANDAS_SCHEMA_VERSION != PANDAS_VERSION:
            pytest.xfail(reason=f"schema generated with pandas {_PANDAS_SCHEMA_VERSION}")
    
        print(f"python version: {sys.version.split()[0]}")
        print(f"pandas version: {PANDAS_VERSION}")
    
        schema_path = schema_dir.joinpath(f"{fluent_ds_or_asset_model.__name__}.json")
        print(schema_path)
    
        json_str = schema_path.read_text().rstrip()
    
        schema_as_dict = json.loads(json_str)
        _sort_lists(schema_as_dict=schema_as_dict)
        # we have tuples in our schema, which are mutated to lists when dumped to json
        # dump and reload the schema dict to ensure we are comparing
        fluent_ds_or_asset_model_as_dict = json.loads(fluent_ds_or_asset_model.schema_json())
        _sort_lists(schema_as_dict=fluent_ds_or_asset_model_as_dict)
    
        if "Excel" in str(schema_path):
            pytest.xfail(reason="Sorting of nested anyOf key")
    
>       assert (
            schema_as_dict == fluent_ds_or_asset_model_as_dict
        ), "Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version."
E       AssertionError: Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version.
E       assert equals failed
E               'required': ['key'],
E               'title': 'Sorter',
E               'type': 'object',
E             },
E           },
E       -   'description': '--Public API--',
E       +   'description': '--Public API--\nPandasGoogleCloudStorageDatasource is a PandasDatasource that uses Google Cloud Storage as a\ndata store.',
E           'properties': {
E             'assets': {
E               'default': [],
E               'items': {
E                 '$ref': '#/definitions/FileDataAsset',

.../datasource/fluent/test_schemas.py:135: AssertionError
tests.datasource.fluent.test_schemas::test_vcs_schemas_match[pandas_s3:PandasS3Datasource]
Stack Traces | 0.213s run time
fluent_ds_or_asset_model = <class 'great_expectations.datasource.fluent.pandas_s3_datasource.PandasS3Datasource'>
schema_dir = PosixPath('.../datasource/fluent/schemas')

    (test source identical to test_vcs_schemas_match in the first failure above)
    
>       assert (
            schema_as_dict == fluent_ds_or_asset_model_as_dict
        ), "Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version."
E       AssertionError: Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version.
E       assert equals failed
E               'required': ['key'],
E               'title': 'Sorter',
E               'type': 'object',
E             },
E           },
E       -   'description': '--Public API--',
E       +   'description': '--Public API--\nPandasS3Datasource is a PandasDatasource that uses Amazon S3 as a data store.',
E           'properties': {
E             'assets': {
E               'default': [],
E               'items': {
E                 '$ref': '#/definitions/FileDataAsset',

.../datasource/fluent/test_schemas.py:135: AssertionError
tests.datasource.fluent.test_schemas::test_vcs_schemas_match[pandas_abs:PandasAzureBlobStorageDatasource]
Stack Traces | 0.244s run time
fluent_ds_or_asset_model = <class 'great_expectations.datasource.fluent.pandas_azure_blob_storage_datasource.PandasAzureBlobStorageDatasource'>
schema_dir = PosixPath('.../datasource/fluent/schemas')

    (test source identical to test_vcs_schemas_match in the first failure above)
    
>       assert (
            schema_as_dict == fluent_ds_or_asset_model_as_dict
        ), "Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version."
E       AssertionError: Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version.
E       assert equals failed
E               'required': ['key'],
E               'title': 'Sorter',
E               'type': 'object',
E             },
E           },
E       -   'description': '--Public API--',
E       +   'description': '--Public API--\nPandasAzureBlobStorageDatasource is a PandasDatasource that uses Azure Blob Storage as a\ndata store.',
E           'properties': {
E             'assets': {
E               'default': [],
E               'items': {
E                 '$ref': '#/definitions/FileDataAsset',

.../datasource/fluent/test_schemas.py:135: AssertionError


klavavej (Contributor) left a comment


I'm not approving since I'm reviewing only the copy (and not the new test).

But none of my copy feedback is severe enough to be a blocker on its own, so I'm also not requesting changes.

The vast majority of my comments are capitalization changes to abide by the style guide, though there are some more substantive notes. I was short on time before vacation, so I didn't go as deep in this review as I would have liked. But I'll be iterating on this content as part of the API reference improvements, so I think that's OK.

@@ -295,6 +295,24 @@ def run(
expectation_parameters: SuiteParameterDict | None = None,
run_id: RunIdentifier | None = None,
) -> CheckpointResult:
"""
Runs the Checkpoint's underlying validation definitions and actions.

Suggested change
Runs the Checkpoint's underlying validation definitions and actions.
Runs the Checkpoint's underlying Validation Definitions and Actions.

Comment on lines +302 to +303
batch_parameters: Parameters to be used when loading the batch.
expectation_parameters: Parameters to be used when validating the batch.

Suggested change
batch_parameters: Parameters to be used when loading the batch.
expectation_parameters: Parameters to be used when validating the batch.
batch_parameters: Parameters to be used when loading the Batch.
expectation_parameters: Parameters to be used when validating the Batch.

Comment on lines +310 to +314
CheckpointRunWithoutValidationDefinitionError: If the Checkpoint is run without any
validation definitions.
CheckpointNotAddedError: If the Checkpoint has not been added to the store.
CheckpointNotFreshError: If the Checkpoint has been modified since it was last added
to the store.

Suggested change
CheckpointRunWithoutValidationDefinitionError: If the Checkpoint is run without any
validation definitions.
CheckpointNotAddedError: If the Checkpoint has not been added to the store.
CheckpointNotFreshError: If the Checkpoint has been modified since it was last added
to the store.
CheckpointRunWithoutValidationDefinitionError: If the Checkpoint is run without any
Validation Definitions.
CheckpointNotAddedError: If the Checkpoint has not been added to the Store.
CheckpointNotFreshError: If the Checkpoint has been modified since it was last added
to the Store.

Comment on lines +302 to +303
batch_parameters: Parameters to be used when loading the batch.
expectation_parameters: Parameters to be used when validating the batch.

Would it be accurate/helpful to reuse the more robust batch_parameters and expectation_parameters definitions from the ValidationDefinition page here?

batch_parameters
The dictionary of parameters necessary for selecting the correct batch to run the validation on. The keys are strings that are determined by the BatchDefinition used to instantiate this ValidationDefinition. For example:

  • whole table -> None
  • yearly -> year
  • monthly -> year, month
  • daily -> year, month, day

expectation_parameters – A dictionary of parameter values for any Expectations using parameterized values (the $PARAMETER syntax). The keys are the parameter names and the values are the values to be used for this validation run.
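For illustration, the key structure described above could look like this (all values here are hypothetical; the actual keys are dictated by the BatchDefinition in use):

```python
# Hypothetical batch_parameters for the partitioning schemes listed above;
# the keys come from the BatchDefinition, the values select one batch.
whole_table_parameters = None
yearly_parameters = {"year": 2024}
monthly_parameters = {"year": 2024, "month": 12}
daily_parameters = {"year": 2024, "month": 12, "day": 19}

# expectation_parameters maps $PARAMETER names to concrete values for
# this validation run (parameter names here are made up for the example).
expectation_parameters = {"min_value": 0, "max_value": 100}
```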

Returns:
A CheckpointResult object containing the results of the run.

Raises:

I noticed that we have a few different patterns in "Raises" sections throughout the API reference.

Examples:
https://docs.greatexpectations.io/docs/reference/api/core/ExpectationValidationResult_class - starts with "Raised if ..."

InvalidCacheValueError – Raised if the result does not pass validation.

https://docs.greatexpectations.io/docs/reference/api/ExpectationSuite_class - does not have a conditional lead-in

KeyError – Expectation not found in suite.

https://docs.greatexpectations.io/docs/reference/api/core/factory/CheckpointFactory_class - does not separate error name and definition with a dash

DataContextError if Checkpoint doesn't exist –

It would be great if these were consistently phrased but I think that's outside the scope of this PR. I'll start tracking this as something for me to address as part of the content-focused API docs improvements. I'd love to take a sweep through for consistent phrasing and then put together a style guide for engineering to help with consistency going forward. (If an API docs string style guide already exists and I'm just overlooking it, please point me to it!)
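As one possible target pattern, assuming the style guide settles on Google-style sections with a "Raised if ..." lead-in (the error class and store below are simplified stand-ins, not GX's actual code):

```python
class DataContextError(Exception):
    """Simplified stand-in for great_expectations' DataContextError."""


_CHECKPOINT_STORE = {"nightly_checkpoint": "..."}  # hypothetical contents


def get_checkpoint(name: str):
    """Retrieve a Checkpoint by name.

    Raises:
        DataContextError: Raised if no Checkpoint with this name exists.
    """
    try:
        return _CHECKPOINT_STORE[name]
    except KeyError:
        raise DataContextError(f"Checkpoint {name!r} doesn't exist") from None
```

When rendered by Sphinx, a docstring in this shape typically produces the "DataContextError – Raised if ..." form, matching the first example cited above.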

name: The name of the batch definition to be added
column: The column name on which to partition the asset by day
sort_ascending: Boolean to indicate whether to sort ascending (default) or descending.
When running a validation, we default to running the last batch definition

Suggested change
When running a validation, we default to running the last batch definition
When running a validation, we default to running the last Batch Definition

"""An asset made from a SQL query

Args:
query: The query to be used to contruct the underlying data asset

Suggested change
query: The query to be used to contruct the underlying data asset
query: The query to be used to construct the underlying Data Asset

@@ -170,6 +170,17 @@ def add_table_asset(
schema_name: Optional[str] = None,
batch_metadata: Optional[BatchMetadata] = None,
) -> SqliteTableAsset:
"""Adds a table asset to this SQLite datasource

I expected this to show up on https://deploy-preview-10799.docs.greatexpectations.io/docs/reference/api/datasource/fluent/SqliteDatasource_class but am not seeing it there. Instead I see

Adds a table asset to this datasource.

Is it possible that this SQLite-specific text is being overridden by a generic description?
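One plausible mechanism for the override is easy to reproduce in miniature: if the rendered method is generated dynamically (for example, wrapped without `functools.wraps`) or documented from the base class, introspection sees the generic text. This is a hypothetical reconstruction of the symptom, not GX's actual code:

```python
class Datasource:
    def add_table_asset(self):
        """Adds a table asset to this datasource."""


class SqliteDatasource(Datasource):
    def add_table_asset(self):
        """Adds a table asset to this SQLite datasource"""


# The subclass-specific docstring exists...
assert "SQLite" in SqliteDatasource.add_table_asset.__doc__


# ...but a dynamic wrapper that forgets functools.wraps drops it, which
# would leave doc tooling falling back to the generic base-class text:
def add_signature(fn):
    def wrapper(self, *args, **kwargs):
        return fn(self, *args, **kwargs)
    return wrapper  # wrapper.__doc__ is None


SqliteDatasource.add_table_asset = add_signature(
    SqliteDatasource.add_table_asset
)
print(SqliteDatasource.add_table_asset.__doc__)  # None

# Applying functools.wraps(fn) to wrapper would preserve the
# SQLite-specific docstring instead.
```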

Comment on lines 203 to +204
) -> SqliteQueryAsset:
"""Adds a table asset to this SQLite datasource
klavavej (Contributor) commented Dec 20, 2024

Similar to another comment, I'm not seeing this where I expect. But regardless, I wonder if there's a mismatch here, since line 203 is about query assets but line 204 is about table assets.

@@ -1270,7 +1270,6 @@ def _get_result_format(
result_format = configuration_result_format
return result_format

@public_api

What's the reason for removing this? I ask because I see validate_configuration still mentioned elsewhere on the page:

Expectation classes ... may provide implementations of:

  1. validate_configuration, which should raise an error if the configuration will not be usable for the Expectation
