[MAINTENANCE] Enforce mandatory docstrings for public API decorated objects #10799

Open: wants to merge 18 commits into develop
Conversation

cdkini (Member) commented Dec 19, 2024

Refer to ADR 0005: public objects that are rendered in our docs should always have docstrings.
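The enforcement idea can be sketched roughly like this (an illustrative stand-in, not this PR's actual implementation; the `public_api` decorator and registry below are hypothetical simplifications of the real mechanism):

```python
import inspect

# Hypothetical registry; Great Expectations' real @public_api decorator
# works differently. This only illustrates the enforcement idea.
_PUBLIC_REGISTRY = []


def public_api(obj):
    """Stand-in decorator marking an object as part of the public API."""
    _PUBLIC_REGISTRY.append(obj)
    return obj


def missing_docstrings():
    """Names of registered public objects that lack a docstring."""
    return [
        obj.__qualname__
        for obj in _PUBLIC_REGISTRY
        if not (inspect.getdoc(obj) or "").strip()
    ]


@public_api
def documented():
    """This object is rendered in the docs, so it has a docstring."""


@public_api
def undocumented():  # would be flagged by the mandatory-docstring check
    pass


print(missing_docstrings())  # ['undocumented']
```

A check like this can run in CI so that a missing docstring on any decorated public object fails the build rather than silently shipping an empty docs page.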

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, visit our community resources.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!


netlify bot commented Dec 19, 2024

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit ce89ded
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/67658d19b2c0210008c5b306
😎 Deploy Preview https://deploy-preview-10799.docs.greatexpectations.io


codecov bot commented Dec 19, 2024

❌ 4 Tests Failed:

Tests completed: 5764 · Failed: 4 · Passed: 5760 · Skipped: 278
View the top 3 failed tests by shortest run time
tests.datasource.fluent.test_schemas::test_vcs_schemas_match[pandas_gcs:PandasGoogleCloudStorageDatasource]
Stack Traces | 0.205s run time
fluent_ds_or_asset_model = <class 'great_expectations.datasource.fluent.pandas_google_cloud_storage_datasource.PandasGoogleCloudStorageDatasource'>
schema_dir = PosixPath('.../datasource/fluent/schemas')

    @pytest.mark.skipif(
        min_supported_python() < PYTHON_VERSION,
        reason=f"_sort_any_of() keys needs to be fixed for py {PYTHON_VERSION}",
    )
    @pytest.mark.timeout(
        2.0  # this is marked as unit so that it will run on different versions of python
    )
    @pytest.mark.unit
    @pytest.mark.parametrize(
        ["fluent_ds_or_asset_model", "schema_dir"],
        [pytest.param(t[0], t[1], id=t[2]) for t in _models_and_schema_dirs()],
    )
    def test_vcs_schemas_match(  # noqa: C901
        fluent_ds_or_asset_model: Type[Datasource | DataAsset], schema_dir: pathlib.Path
    ):
        """
        Test that json schemas for each DataSource and DataAsset match the current schema
        under version control.
    
        This is important because some of these classes are generated dynamically and
        monitoring these schemas files can clue us into problems that could otherwise go
        unnoticed.
    
        If this test is failing run `invoke schema --sync` to update schemas and commit the
        changes.
    
        Note: if the installed version of pandas doesn't match the one used in the standard
        test pipeline, the test will be marked a `xfail` because the schemas will not match.
        """
    
        def _sort_any_of(d: dict) -> str:
            if items := d.get("items"):
                _sort_any_of(items)
            if ref := d.get("$ref"):
                return ref
            if type_ := d.get("type"):
                return type_
            # return any string for sorting
            return "z"
    
        def _sort_lists(schema_as_dict: dict) -> None:
            """Sometimes "required" and "anyOf" lists come unsorted, causing misleading assertion failures; this corrects the issue.
    
            Args:
                schema_as_dict: source dictionary (will be modified "in-situ")
    
            """  # noqa: E501
            key: str
            value: Any
    
            for key, value in schema_as_dict.items():
                if key == "required":
                    schema_as_dict[key] = sorted(value)
                elif key == "anyOf":
                    if isinstance(value, list):
                        for val in value:
                            if isinstance(value, dict):
                                schema_as_dict[key] = sorted(val, key=_sort_any_of)
                    else:
                        schema_as_dict[key] = sorted(value, key=_sort_any_of)
    
                if isinstance(value, dict):
                    _sort_lists(schema_as_dict=value)
    
        if "Pandas" in str(schema_dir) and _PANDAS_SCHEMA_VERSION != PANDAS_VERSION:
            pytest.xfail(reason=f"schema generated with pandas {_PANDAS_SCHEMA_VERSION}")
    
        print(f"python version: {sys.version.split()[0]}")
        print(f"pandas version: {PANDAS_VERSION}")
    
        schema_path = schema_dir.joinpath(f"{fluent_ds_or_asset_model.__name__}.json")
        print(schema_path)
    
        json_str = schema_path.read_text().rstrip()
    
        schema_as_dict = json.loads(json_str)
        _sort_lists(schema_as_dict=schema_as_dict)
        # we have tuples in our schema, which are mutated to lists when dumped to json
        # dump and reload the schema dict to ensure we are comparing
        fluent_ds_or_asset_model_as_dict = json.loads(fluent_ds_or_asset_model.schema_json())
        _sort_lists(schema_as_dict=fluent_ds_or_asset_model_as_dict)
    
        if "Excel" in str(schema_path):
            pytest.xfail(reason="Sorting of nested anyOf key")
    
>       assert (
            schema_as_dict == fluent_ds_or_asset_model_as_dict
        ), "Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version."
E       AssertionError: Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version.
E       assert equals failed
E               'required': ['key'],
E               'title': 'Sorter',
E               'type': 'object',
E             },
E           },
E       -   'description': '--Public API--',
E       +   'description': '--Public API--\nPandasGoogleCloudStorageDatasource is a PandasDatasource that uses Google Cloud Storage as a\ndata store.',
E           'properties': {
E             'assets': {
E               'default': [],
E               'items': {
E                 '$ref': '#/definitions/FileDataAsset',

.../datasource/fluent/test_schemas.py:135: AssertionError
tests.datasource.fluent.test_schemas::test_vcs_schemas_match[pandas_s3:PandasS3Datasource]
Stack Traces | 0.213s run time
fluent_ds_or_asset_model = <class 'great_expectations.datasource.fluent.pandas_s3_datasource.PandasS3Datasource'>
schema_dir = PosixPath('.../datasource/fluent/schemas')

    (test source identical to test_vcs_schemas_match in the first failure above)
    
>       assert (
            schema_as_dict == fluent_ds_or_asset_model_as_dict
        ), "Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version."
E       AssertionError: Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version.
E       assert equals failed
E               'required': ['key'],
E               'title': 'Sorter',
E               'type': 'object',
E             },
E           },
E       -   'description': '--Public API--',
E       +   'description': '--Public API--\nPandasS3Datasource is a PandasDatasource that uses Amazon S3 as a data store.',
E           'properties': {
E             'assets': {
E               'default': [],
E               'items': {
E                 '$ref': '#/definitions/FileDataAsset',

.../datasource/fluent/test_schemas.py:135: AssertionError
tests.datasource.fluent.test_schemas::test_vcs_schemas_match[pandas_abs:PandasAzureBlobStorageDatasource]
Stack Traces | 0.244s run time
fluent_ds_or_asset_model = <class 'great_expectations.datasource.fluent.pandas_azure_blob_storage_datasource.PandasAzureBlobStorageDatasource'>
schema_dir = PosixPath('.../datasource/fluent/schemas')

    (test source identical to test_vcs_schemas_match in the first failure above)
    
>       assert (
            schema_as_dict == fluent_ds_or_asset_model_as_dict
        ), "Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version."
E       AssertionError: Schemas are out of sync. Run `invoke schema --sync`. Also check your pandas version.
E       assert equals failed
E               'required': ['key'],
E               'title': 'Sorter',
E               'type': 'object',
E             },
E           },
E       -   'description': '--Public API--',
E       +   'description': '--Public API--\nPandasAzureBlobStorageDatasource is a PandasDatasource that uses Azure Blob Storage as a\ndata store.',
E           'properties': {
E             'assets': {
E               'default': [],
E               'items': {
E                 '$ref': '#/definitions/FileDataAsset',

.../datasource/fluent/test_schemas.py:135: AssertionError


klavavej (Contributor) left a comment


I'm not approving since I'm reviewing only the copy (and not the new test).

But none of my copy feedback is severe enough to be a blocker on its own, so I'm also not requesting changes.

The vast majority of my comments are capitalization changes to abide by the style guide, though there are some more substantive notes. I was short on time before vacation, so I didn't go as deep in this review as I would have liked. But I'll be iterating on this content as part of the API reference improvements, so I think that's OK.

@@ -295,6 +295,24 @@ def run(
expectation_parameters: SuiteParameterDict | None = None,
run_id: RunIdentifier | None = None,
) -> CheckpointResult:
"""
Runs the Checkpoint's underlying validation definitions and actions.

Suggested change
Runs the Checkpoint's underlying validation definitions and actions.
Runs the Checkpoint's underlying Validation Definitions and Actions.

Comment on lines +302 to +303
batch_parameters: Parameters to be used when loading the batch.
expectation_parameters: Parameters to be used when validating the batch.

Suggested change
batch_parameters: Parameters to be used when loading the batch.
expectation_parameters: Parameters to be used when validating the batch.
batch_parameters: Parameters to be used when loading the Batch.
expectation_parameters: Parameters to be used when validating the Batch.

Comment on lines +310 to +314
CheckpointRunWithoutValidationDefinitionError: If the Checkpoint is run without any
validation definitions.
CheckpointNotAddedError: If the Checkpoint has not been added to the store.
CheckpointNotFreshError: If the Checkpoint has been modified since it was last added
to the store.

Suggested change
CheckpointRunWithoutValidationDefinitionError: If the Checkpoint is run without any
validation definitions.
CheckpointNotAddedError: If the Checkpoint has not been added to the store.
CheckpointNotFreshError: If the Checkpoint has been modified since it was last added
to the store.
CheckpointRunWithoutValidationDefinitionError: If the Checkpoint is run without any
Validation Definitions.
CheckpointNotAddedError: If the Checkpoint has not been added to the Store.
CheckpointNotFreshError: If the Checkpoint has been modified since it was last added
to the Store.

Comment on lines +302 to +303
batch_parameters: Parameters to be used when loading the batch.
expectation_parameters: Parameters to be used when validating the batch.

Would it be accurate/helpful to reuse the more robust batch_parameters and expectation_parameters definitions from the ValidationDefinition page here?

batch_parameters
The dictionary of parameters necessary for selecting the correct batch to run the validation on. The keys are strings that are determined by the BatchDefinition used to instantiate this ValidationDefinition. For example:

  • whole table -> None
  • yearly -> year
  • monthly -> year, month
  • daily -> year, month, day

expectation_parameters – A dictionary of parameter values for any Expectations using parameterized values (the $PARAMETER syntax). The keys are the parameter names and the values are the values to be used for this validation run.
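For illustration, the key structure described above could look like this (all values here are hypothetical; the actual keys are dictated by the BatchDefinition in use):

```python
# Hypothetical batch_parameters for the partitioning schemes listed above;
# the keys come from the BatchDefinition, the values select one batch.
whole_table_parameters = None
yearly_parameters = {"year": 2024}
monthly_parameters = {"year": 2024, "month": 12}
daily_parameters = {"year": 2024, "month": 12, "day": 19}

# expectation_parameters maps $PARAMETER names to concrete values for
# this validation run (parameter names here are made up for the example).
expectation_parameters = {"min_value": 0, "max_value": 100}
```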

Returns:
A CheckpointResult object containing the results of the run.

Raises:

I noticed that we have a few different patterns in "Raises" sections throughout the API reference.

Examples:
https://docs.greatexpectations.io/docs/reference/api/core/ExpectationValidationResult_class - starts with "Raised if ..."

InvalidCacheValueError – Raised if the result does not pass validation.

https://docs.greatexpectations.io/docs/reference/api/ExpectationSuite_class - does not have a conditional lead-in

KeyError – Expectation not found in suite.

https://docs.greatexpectations.io/docs/reference/api/core/factory/CheckpointFactory_class - does not separate error name and definition with a dash

DataContextError if Checkpoint doesn't exist –

It would be great if these were consistently phrased but I think that's outside the scope of this PR. I'll start tracking this as something for me to address as part of the content-focused API docs improvements. I'd love to take a sweep through for consistent phrasing and then put together a style guide for engineering to help with consistency going forward. (If an API docs string style guide already exists and I'm just overlooking it, please point me to it!)
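As one possible target pattern, assuming the style guide settles on Google-style sections with a "Raised if ..." lead-in (the error class and store below are simplified stand-ins, not GX's actual code):

```python
class DataContextError(Exception):
    """Simplified stand-in for great_expectations' DataContextError."""


_CHECKPOINT_STORE = {"nightly_checkpoint": "..."}  # hypothetical contents


def get_checkpoint(name: str):
    """Retrieve a Checkpoint by name.

    Raises:
        DataContextError: Raised if no Checkpoint with this name exists.
    """
    try:
        return _CHECKPOINT_STORE[name]
    except KeyError:
        raise DataContextError(f"Checkpoint {name!r} doesn't exist") from None
```

When rendered by Sphinx, a docstring in this shape typically produces the "DataContextError – Raised if ..." form, matching the first example cited above.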

name: The name of the batch definition to be added
column: The column name on which to partition the asset by day
sort_ascending: Boolean to indicate whether to sort ascending (default) or descending.
When running a validation, we default to running the last batch definition

Suggested change
When running a validation, we default to running the last batch definition
When running a validation, we default to running the last Batch Definition

"""An asset made from a SQL query

Args:
query: The query to be used to contruct the underlying data asset

Suggested change
query: The query to be used to contruct the underlying data asset
query: The query to be used to construct the underlying Data Asset

@@ -170,6 +170,17 @@ def add_table_asset(
schema_name: Optional[str] = None,
batch_metadata: Optional[BatchMetadata] = None,
) -> SqliteTableAsset:
"""Adds a table asset to this SQLite datasource

I expected this to show up on https://deploy-preview-10799.docs.greatexpectations.io/docs/reference/api/datasource/fluent/SqliteDatasource_class but am not seeing it there. Instead I see

Adds a table asset to this datasource.

Is it possible that this SQLite-specific text is being overridden by a generic description?
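One plausible mechanism for the override is easy to reproduce in miniature: if the rendered method is generated dynamically (for example, wrapped without `functools.wraps`) or documented from the base class, introspection sees the generic text. This is a hypothetical reconstruction of the symptom, not GX's actual code:

```python
class Datasource:
    def add_table_asset(self):
        """Adds a table asset to this datasource."""


class SqliteDatasource(Datasource):
    def add_table_asset(self):
        """Adds a table asset to this SQLite datasource"""


# The subclass-specific docstring exists...
assert "SQLite" in SqliteDatasource.add_table_asset.__doc__


# ...but a dynamic wrapper that forgets functools.wraps drops it, which
# would leave doc tooling falling back to the generic base-class text:
def add_signature(fn):
    def wrapper(self, *args, **kwargs):
        return fn(self, *args, **kwargs)
    return wrapper  # wrapper.__doc__ is None


SqliteDatasource.add_table_asset = add_signature(
    SqliteDatasource.add_table_asset
)
print(SqliteDatasource.add_table_asset.__doc__)  # None

# Applying functools.wraps(fn) to wrapper would preserve the
# SQLite-specific docstring instead.
```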

Comment on lines 203 to +204
) -> SqliteQueryAsset:
"""Adds a table asset to this SQLite datasource
klavavej (Contributor) commented Dec 20, 2024

Similar to another comment, I'm not seeing this where I expect. But regardless, I wonder if there's a mismatch here, since line 203 is about query assets but line 204 is about table assets.

@@ -1270,7 +1270,6 @@ def _get_result_format(
result_format = configuration_result_format
return result_format

@public_api

What's the reason for removing this? I ask because I see validate_configuration still mentioned elsewhere on the page:

Expectation classes ... may provide implementations of:

  1. validate_configuration, which should raise an error if the configuration will not be usable for the Expectation
