added custom issue manager class for detecting identifier columns #1120

Open
wants to merge 5 commits into
base: master

@MaxJoas MaxJoas commented May 9, 2024

Summary

🎯 Purpose: This PR solves #923 by adding an issue manager class that finds identifier columns.

📜 Example Usage:

import numpy as np
import pandas as pd
from cleanlab import Datalab
from cleanlab.datalab.internal.issue_manager_factory import register
from cleanlab.datalab.internal.issue_manager.identifier_column import IdentifierColumnIssueManager

# Register the custom issue manager so Datalab recognizes "identifier_column"
register(IdentifierColumnIssueManager)

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 30, 10],
    'C': [2, 12, 13, 14, 15],
    'D': [4, 17, 18, 19, 20],
    'E': [21, 22, 23, 24, 25]
}
df = pd.DataFrame(data)
lab = Datalab(data=df, label_name="D")

# Run only the new issue type, passing the feature matrix to it
lab.find_issues(issue_types={"identifier_column": {"features": df.values}})
lab.info

Impact

No other areas are impacted. Datalab can now find a new issue type, identifier_column.

Screenshots
See the code snippet above.

Testing

Ran the existing tests and added test_identifier_column.py in the last commit.
The new tests mainly cover sequential and non-sequential arrays, including arrays with and without duplicates and sequences that do not start at 0.

@pytest.mark.parametrize(
    "arr, expected_output",
    [
        (np.array([1, 2, 3, 4, 5]), True),
        (np.array([1, 1, 2, 2, 3, 3, 5]), False),
        (np.array([1, 1, 3, 4, 5, 8, 10]), False),
        (np.array([0, 0, 0, 0, 0, 0, 0]), False),
        (np.array([4, 5, 5, 6, 7, 8, 9, 10]), True),
        (np.array([1, 3, 4, 4, 5, 6, 7, -1]), False),
        (np.array([2, 1, 3, 5, 6, 4]), True),
        (np.array([-1, -3, -2, -4, 0]), True),
        (np.array([]), False),
        (np.array([0, 0, 0]), False),
    ],
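
The expected outputs in the cases above are consistent with one simple rule: a column looks like an identifier when its distinct values, once sorted, form an unbroken run of consecutive integers with at least two distinct values. The following is a minimal sketch of that rule, not the implementation in this PR. The function name `looks_like_identifier` is hypothetical, and it uses plain Python so it accepts lists as well as NumPy arrays.

```python
from typing import Iterable


def looks_like_identifier(values: Iterable[int]) -> bool:
    """Hypothetical helper: True if the distinct values form an unbroken integer run."""
    # Duplicates are ignored; only the set of distinct values matters.
    unique_sorted = sorted(set(values))
    # An empty column or a constant column is not treated as an identifier.
    if len(unique_sorted) < 2:
        return False
    # Every pair of neighbouring distinct values must differ by exactly 1.
    return all(b - a == 1 for a, b in zip(unique_sorted, unique_sorted[1:]))
```

Under this assumed rule, the starting value and the order of the rows do not matter, which is why shuffled and negative sequences in the parametrized cases are still expected to return True.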

Unaddressed Cases
Non-integer and non-numeric columns are not handled. I think the issue manager will just raise a general error for them, as with other issue types.
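
One possible way to handle these cases, sketched here only as an assumption and not as part of this PR, is to check that a column contains integers before running the identifier check and to skip the column otherwise. The helper name `is_integer_column` is hypothetical.

```python
import numbers
from typing import Any, Iterable


def is_integer_column(values: Iterable[Any]) -> bool:
    """Hypothetical guard: True only if every value is an integer (bools excluded)."""
    # numbers.Integral covers Python ints and NumPy integer scalars.
    # bool is a subclass of int, so it is excluded explicitly.
    # Note: an empty column passes vacuously; callers may want to reject it separately.
    return all(
        isinstance(v, numbers.Integral) and not isinstance(v, bool)
        for v in values
    )
```

Columns that fail this guard, such as floats or strings, could then be skipped or reported with a clearer message instead of surfacing a generic error.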

Links to Relevant Issues or Conversations

#923
#947

🔗 What Git or Slack items (Issues, threads, etc) that are specifically related to
this work? Please link them here.
None that I know of.

References

None

Reviewer Notes

None

CLAassistant commented May 9, 2024

CLA assistant check
All committers have signed the CLA.

codecov bot commented May 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.06%. Comparing base (81af417) to head (b27be24).

❗ Current head b27be24 differs from pull request most recent head 2377fc4. Consider uploading reports for the commit 2377fc4 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1120      +/-   ##
==========================================
- Coverage   96.09%   96.06%   -0.03%     
==========================================
  Files          76       77       +1     
  Lines        6088     6128      +40     
  Branches     1081     1089       +8     
==========================================
+ Hits         5850     5887      +37     
- Misses        142      144       +2     
- Partials       96       97       +1     


@jwmueller jwmueller requested a review from elisno May 21, 2024 01:44