adding an spurious_correlation as new issue type #872

01PrathamS · 2023-10-13T19:40:23Z

I am submitting my first PR towards cleanlab, I might be a bit inexperienced but I am dedicated to following your instructions.

Add _spurious_correlations() as private instance method in datalab.py file with short and rely on helper function
helper functions are live in new file : cleanlab/datalab/internal/spurious_correlation.py
I'm still learning how to add unit tests, but I've been studying the code and directories to figure out the process.

I'm writing this to request a review of my initial pull request, your constructive feedback and guidance are really valuable for me.

Thank you, @jwmueller

codecov · 2023-10-13T19:50:36Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (dee32ad) 96.78% compared to head (f35b3c1) 96.69%.
Report is 8 commits behind head on master.

❗ Current head f35b3c1 differs from pull request most recent head 1ef0397. Consider uploading reports for the commit 1ef0397 to get more accurate results

Files	Patch %	Lines
cleanlab/datalab/datalab.py	88.88%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #872      +/-   ##
==========================================
- Coverage   96.78%   96.69%   -0.10%     
==========================================
  Files          70       68       -2     
  Lines        5544     5362     -182     
  Branches      945      925      -20     
==========================================
- Hits         5366     5185     -181     
- Misses         89       90       +1     
+ Partials       89       87       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jwmueller · 2023-10-13T20:16:11Z

Thanks for your contribution @01PrathamS!

We will need you to sign the CLA: #872 (comment)
before we can review it, thanks!

01PrathamS · 2023-10-13T20:20:40Z

hey @jwmueller I've signed the CLA for issue,looking forward to review. thanks!

jwmueller · 2023-10-13T22:20:03Z

Addresses: #860

tataganesh · 2023-10-13T22:20:47Z

@01PrathamS If you could structure the description of the PR based on the Pull Request Template, it would be really helpful for the reviewers. You do not have to fill all the sections though, just relevant information would do. You can use this PR as a starting point for the description. Thank you!

jwmueller · 2023-10-13T22:21:19Z

cleanlab/datalab/datalab.py

@@ -303,6 +303,19 @@ def find_issues(
 f"\nAudit complete. {self.data_issues.issue_summary['num_issues'].sum()} issues found in the dataset."
 )

+ def _spurious_correlations(self) -> pd.DataFrame: 


Please include an end-to-end unit test of this function. You should actually create a toy dataset that suffers from a spurious correlation (say have 10 tiny images at varying levels of darkness, and make the label related to how dark they are). And then verify that this code detects this spurious correlation. Likewise your same unit test should verify that the other spurious correlation scores (those unrelated to dark, light) do NOT give low scores for this same dataset.

For now you can just add the new unit test at the bottom of here:
https://github.com/cleanlab/cleanlab/blob/master/tests/datalab/test_datalab.py

Thank you for the suggestion @jwmueller, to include an end-to-end unit test. I'd like to ensure I create a comprehensive test that verifies the detection of spurious correlations effectively. However, I'm not entirely sure how to set up such a test, especially with a toy dataset. Could you please provide an example test or point me to any resources that might be helpful in creating this unit test?

You can generally follow the structure of any of the existing unit tests. I wouldn't worry too much about the precise code structure you use, we can help you refactor the code properly. Instead I would focus on ensuring the test runs quickly (toy dataset is small enough) but still tests the key logic -- namely that this code is actually able to detect an image property that is highly correlated with the labels and that this code does not return false positives for image properties that have no relationship with the labels.

An example you could follow is: test_find_issues_with_pred_probs

and just change the dataset being used and add a final line: lab._spurious_correlations() near the end of the test and then check its results.

Thank you for the guidance, @jwmueller. I appreciate your clear explanation of what the unit test should achieve. I understand the high-level structure and the need to ensure it runs efficiently with a small toy dataset. However, I'm currently facing a bit of a roadblock when it comes to translating this into code. i saw at example code and other test code as well but couldn't figure out how to get it done through code.

Which part is confusing to code specifically?

We can provide you some skeleton code or further pointers for that part, if you can write out your remaining specific questions.

i created dataset 'light_score = [0.11, 0.43, 0.96, 0.28, 0.23, 0.21, 0.63, 0.40, 0.19, 0.93]
dark_score = [0.98, 0.57, 0.28, 0.97, 0.91, 0.95, 0.57, 0.60, 0.87, 0.34]
label = [0, 1, 2, 0, 0, 0, 1, 1, 0, 2]
issues = pd.DataFrame({'dark_score': dark_score,
'light_score': light_score,
'labels': label})

issue_summary = pd.DataFrame({'issue_type': ['dark', 'light'],
'num_issues': [10,0]})' and it gets me result

'' image_property label_prediction_error
0 dark 0.3
1 light 0.3''

but when i tested it on mnist dataset by taking this as refarance 'https://docs.cleanlab.ai/master/tutorials/image.html' it gives output as

''image_property label_prediction_error
0 outlier 0.836867
1 near_duplicate 0.843817
2 low_information 0.743633
3 dark 0.855317''

I made a mistake by accidentally deleting the 'spurious_correlations' branch from my local machine. To rectify this, I have created a new branch named 'spurious_correlations_' and submitted a new pull request. I apologize for any inconvenience as i am doing this first time and will ensure to be more careful in the future

I'd prefer not to work in a new PR, given I have left a lot of feedback on this one.

You should be able to get the branch back on your local machine by doing:

git checkout --track origin/spurious_correlations git pull

(with git here pointed at your own fork). It should be good practice for you to get the branch back on your local machine, and resume work on the original PR if you can

jwmueller · 2023-10-13T22:24:01Z