
Adding spurious_correlation as a new issue type #872

Open

wants to merge 26 commits into base: master
Conversation

01PrathamS
Contributor

I am submitting my first PR to cleanlab. I might be a bit inexperienced, but I am dedicated to following your instructions.

Added _spurious_correlations() as a private instance method in datalab.py; it is kept short and relies on helper functions.
The helper functions live in a new file: cleanlab/datalab/internal/spurious_correlation.py
I'm still learning how to add unit tests, but I've been studying the code and directories to figure out the process.

I'm writing to request a review of my initial pull request; your constructive feedback and guidance are really valuable to me.

Thank you, @jwmueller

@codecov

codecov bot commented Oct 13, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (dee32ad) 96.78% compared to head (f35b3c1) 96.69%.
Report is 8 commits behind head on master.

❗ Current head f35b3c1 differs from pull request most recent head 1ef0397. Consider uploading reports for the commit 1ef0397 to get more accurate results

Files Patch % Lines
cleanlab/datalab/datalab.py 88.88% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #872      +/-   ##
==========================================
- Coverage   96.78%   96.69%   -0.10%     
==========================================
  Files          70       68       -2     
  Lines        5544     5362     -182     
  Branches      945      925      -20     
==========================================
- Hits         5366     5185     -181     
- Misses         89       90       +1     
+ Partials       89       87       -2     


@jwmueller
Member

Thanks for your contribution @01PrathamS!

We will need you to sign the CLA: #872 (comment)
before we can review it, thanks!

@01PrathamS
Contributor Author

Hey @jwmueller, I've signed the CLA for this issue; looking forward to the review. Thanks!

@jwmueller
Member

Addresses: #860

@tataganesh
Contributor

tataganesh commented Oct 13, 2023

@01PrathamS If you could structure the description of the PR based on the Pull Request Template, it would be really helpful for the reviewers. You do not have to fill all the sections though, just relevant information would do. You can use this PR as a starting point for the description. Thank you!

@@ -303,6 +303,19 @@ def find_issues(
f"\nAudit complete. {self.data_issues.issue_summary['num_issues'].sum()} issues found in the dataset."
)

def _spurious_correlations(self) -> pd.DataFrame:
Member

Please include an end-to-end unit test of this function. You should actually create a toy dataset that suffers from a spurious correlation (say have 10 tiny images at varying levels of darkness, and make the label related to how dark they are). And then verify that this code detects this spurious correlation. Likewise your same unit test should verify that the other spurious correlation scores (those unrelated to dark, light) do NOT give low scores for this same dataset.

For now you can just add the new unit test at the bottom of here:
https://github.com/cleanlab/cleanlab/blob/master/tests/datalab/test_datalab.py
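A hedged sketch of such a toy dataset (the array shapes, pixel values, and score names are assumptions for illustration, not the PR's code): darkness is planted to track the label, while an unrelated score is not.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 5)  # 10 tiny images, two classes

# label 0 -> dark images, label 1 -> light images (the planted correlation)
base = np.where(labels == 0, 30, 220)[:, None, None]
images = (base + rng.integers(0, 10, size=(10, 8, 8))).astype(np.uint8)

# A dark_score-like property (higher = darker) tracks the label...
dark_score = 1.0 - images.mean(axis=(1, 2)) / 255.0
corr_dark = abs(np.corrcoef(dark_score, labels)[0, 1])

# ...while an unrelated property should not.
random_score = rng.random(10)
corr_random = abs(np.corrcoef(random_score, labels)[0, 1])
```

The unit test would then assert that only the planted (darkness) property is flagged as spuriously correlated, and that unrelated properties are not.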

Contributor Author


Thank you for the suggestion @jwmueller, to include an end-to-end unit test. I'd like to ensure I create a comprehensive test that verifies the detection of spurious correlations effectively. However, I'm not entirely sure how to set up such a test, especially with a toy dataset. Could you please provide an example test or point me to any resources that might be helpful in creating this unit test?

Member


You can generally follow the structure of any of the existing unit tests. I wouldn't worry too much about the precise code structure you use, we can help you refactor the code properly. Instead I would focus on ensuring the test runs quickly (toy dataset is small enough) but still tests the key logic -- namely that this code is actually able to detect an image property that is highly correlated with the labels and that this code does not return false positives for image properties that have no relationship with the labels.

An example you could follow is: test_find_issues_with_pred_probs

and just change the dataset being used and add a final line: lab._spurious_correlations() near the end of the test and then check its results.

Contributor Author


Thank you for the guidance, @jwmueller. I appreciate your clear explanation of what the unit test should achieve. I understand the high-level structure and the need to ensure it runs efficiently with a small toy dataset. However, I'm currently facing a roadblock translating this into code. I looked at the example code and other test code as well, but couldn't figure out how to get it done in code.

Member


Which part is confusing to code specifically?

We can provide you some skeleton code or further pointers for that part, if you can write out your remaining specific questions.

Contributor Author


I created this dataset:

light_score = [0.11, 0.43, 0.96, 0.28, 0.23, 0.21, 0.63, 0.40, 0.19, 0.93]
dark_score = [0.98, 0.57, 0.28, 0.97, 0.91, 0.95, 0.57, 0.60, 0.87, 0.34]
label = [0, 1, 2, 0, 0, 0, 1, 1, 0, 2]
issues = pd.DataFrame({'dark_score': dark_score,
                       'light_score': light_score,
                       'labels': label})
issue_summary = pd.DataFrame({'issue_type': ['dark', 'light'],
                              'num_issues': [10, 0]})

and it gets me this result:

  image_property  label_prediction_error
0           dark                     0.3
1          light                     0.3

but when I tested it on the MNIST dataset, taking https://docs.cleanlab.ai/master/tutorials/image.html as a reference, it gives this output:

''image_property label_prediction_error
0 outlier 0.836867
1 near_duplicate 0.843817
2 low_information 0.743633
3 dark 0.855317''

I made a mistake by accidentally deleting the 'spurious_correlations' branch from my local machine. To rectify this, I created a new branch named 'spurious_correlations_' and submitted a new pull request. I apologize for any inconvenience; I am doing this for the first time and will be more careful in the future.

Member


I'd prefer not to work in a new PR, given I have left a lot of feedback on this one.

You should be able to get the branch back on your local machine by doing:

git checkout --track origin/spurious_correlations
git pull

(with git here pointed at your own fork). It should be good practice for you to get the branch back on your local machine and resume work on the original PR if you can.


from datalab import DataLab

class SpuriousCorrelations:
Member

@jwmueller jwmueller Oct 13, 2023


Let's define this class in a different (new) file.

My suggestion is:

cleanlab/datalab/internal/spurious_correlation.py

data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
return data_score

def calculate_spurious_correlation(self, property_of_interest, baseline_accuracy):
Member

@jwmueller jwmueller Oct 13, 2023


please add mypy typing information for all arguments, as well as the return type

Member


You'll need to get the type check that runs in our CI:
https://github.com/cleanlab/cleanlab/actions/runs/6512420947/job/17690017396?pr=872

to pass eventually, by adding the appropriate typing information everywhere
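For illustration, fully annotated signatures could look like this sketch (the argument names follow the diff; the exact types are assumptions):

```python
import pandas as pd

class SpuriousCorrelations:
    # Hypothetical annotations for the mypy CI check; the precise types
    # are assumptions about the PR's implementation.
    def calculate_spurious_correlation(
        self, property_of_interest: str, baseline_accuracy: float, cv_folds: int = 5
    ) -> float:
        ...

    def spurious_correlations(self) -> pd.DataFrame:
        ...
```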

X = self.issues[property_of_interest].values.reshape(-1, 1)
y = self.labels
classifier = GaussianNB()
cv_accuracies = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
Member


let's make cv_folds = 5 an optional argument of calculate_spurious_correlation() and set cv = cv_folds here.
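Sketched as a standalone function (the function name and toy data are illustrative assumptions; the GaussianNB/cross_val_score usage mirrors the diff):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def cross_validated_accuracy(scores: np.ndarray, labels: np.ndarray, cv_folds: int = 5) -> float:
    # cv is now driven by the optional cv_folds argument instead of a hard-coded 5
    X = scores.reshape(-1, 1)
    accuracies = cross_val_score(GaussianNB(), X, labels, cv=cv_folds, scoring="accuracy")
    return float(accuracies.mean())

rng = np.random.default_rng(0)
labels = np.array([0, 1] * 10)
scores = labels + rng.normal(0, 0.1, size=20)  # a property that tracks the label
acc = cross_validated_accuracy(scores, labels, cv_folds=4)
```

Callers that are happy with the old behavior keep the default of 5; tests can lower it for tiny datasets.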

for property_of_interest in image_properties:
S = self.calculate_spurious_correlation(property_of_interest, baseline_accuracy)
property_scores[f'{property_of_interest}'] = S
data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
Member


Suggested change
data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'label_prediction_error'])


def spurious_correlations(self) -> pd.DataFrame:
baseline_accuracy = np.bincount(self.labels).max() / len(self.labels)
image_properties = ["near_duplicate_score", "blurry_score", "light_score", "low_information_score", "dark_score", "grayscale_score", "odd_aspect_ratio_score", "odd_size_score"]
Member


before you loop over these, you should restrict to only the subset of these that is present in: self.issues

In some cases, not all of these properties will have been previously computed, in which case we should just not compute the spurious correlation for those properties that were not computed already
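One way to sketch that restriction (the column names here are illustrative):

```python
import pandas as pd

image_properties = [
    "near_duplicate_score", "blurry_score", "light_score", "low_information_score",
    "dark_score", "grayscale_score", "odd_aspect_ratio_score", "odd_size_score",
]

# Pretend only two properties were computed in this Datalab run.
issues = pd.DataFrame({"light_score": [0.1, 0.8], "dark_score": [0.9, 0.1]})

# Only loop over properties actually present in self.issues.
available_properties = [p for p in image_properties if p in issues.columns]
```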

from sklearn.naive_bayes import GaussianNB
from statistics import mode
import warnings
warnings.filterwarnings('ignore')
Member


delete this line, we should not be suppressing warnings in our codebase.

If there are warnings being printed, please explain why

@01PrathamS
Contributor Author

@01PrathamS If you could structure the description of the PR based on the Pull Request Template, it would be really helpful for the reviewers. You do not have to fill all the sections though, just relevant information would do. You can use this PR as a starting point for the description. Thank you!

Thank you for your feedback, @tataganesh. I'll make sure to structure the description of the PR based on the Pull Request Template to provide the relevant information. I appreciate your guidance, and I'll use the linked PR as a starting point for the description. Thanks again!

@01PrathamS
Contributor Author

Hello @jwmueller, please check out my submitted PR; the new issue type spurious_correlation changes have been added.

@jwmueller
Member

Hello @jwmueller, please check out my submitted PR; the new issue type spurious_correlation changes have been added.

So this PR is now up-to-date and I should close the other one?

I'll take a look after you confirm that, thanks

@01PrathamS
Contributor Author

Done, @jwmueller

Comment on lines 1120 to 1131
dark_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
issue_summary = pd.DataFrame({
    'issue_type': ['dark'],
    'num_issues': 10,
})

data = {
    'issues': pd.DataFrame({'dark_score': dark_scores}),
    'labels': labels,
    'issue_summary': issue_summary,
}
Member

@jwmueller jwmueller Oct 18, 2023


This is not how the test should be. The test should be:

  1. construct a toy image dataset (say with 10-20 images)

  2. run datalab.find_issues(..., image_key = ...) on this toy dataset, following the tutorial here:
    https://docs.cleanlab.ai/master/tutorials/image.html

  3. run corrs = datalab._spurious_correlations() after that.

  4. verify the results in corrs are as expected via assert statements.

Make sure in your toy dataset there is some image-property that is correlated with your class labels.

Member


you can see an easy way to construct a toy image here:
https://github.com/cleanlab/cleanvision/blob/main/tests/test_image_property_helpers.py

using:

from PIL import Image
img = Image.new("RGB", (200, 200), (255, 0, 0))

Contributor Author

@01PrathamS 01PrathamS Oct 19, 2023


Thank you @jwmueller for your previous responses to my questions. I've made multiple attempts to convert my custom dataset to Datalab, but I'm still facing difficulties, particularly with a non-standard dataset. I've already raised the issue in the help channel on Slack: https://cleanlab-community.slack.com/archives/C031BGERG3Z/p1697512220170169

How can I convert this to Datalab?

import datasets
from PIL import Image
images = [
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164)),
    Image.new("RGB", (164, 164), (255, 255, 255)),
    Image.new("RGB", (164, 164), (255, 255, 255))
]
labels = ['dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'light', 'light']
data = {
    "image": [img for img in images],
    "label": labels
}
dataset = datasets.Dataset.from_dict(data)

Contributor Author


Hello @jwmueller, can you please review this and help me write the condition for the assert statement?

from PIL import Image
from datasets import Dataset
from cleanlab import Datalab
import numpy as np

class TestSpuriousCorrelation:
    def create_data(self):
        images = [Image.new("RGB", (32, 32), (25, 25, 25))] * 5 + \
                 [Image.new("RGB", (32, 32), (255, 255, 255))] * 5

        rand_images = (np.random.rand(40, 32, 32, 3) * 255).astype(np.uint8)
        images = images + [Image.fromarray(img) for img in rand_images]

        labels = np.array([0] * 5 + [1] * 5 + [2] * 40)
        data = {
            "image": images,
            "label": labels
        }
        dataset = Dataset.from_dict(data)
        lab = Datalab(data=dataset, image_key="image", label_name="label")
        features = np.array([np.array(img).flatten() for img in images])
        return lab, features

    def test_spurious_correlation(self):
        imagelab, features = self.create_data()
        imagelab.find_issues(features=features)
        corrs = imagelab._spurious_correlations()
        assert corrs

Member

@jwmueller jwmueller Oct 28, 2023


which property of the image here is the one correlated with the labels? Hard for me to tell just from your code.

If it is say the dark score property, then your assert should verify that corrs[dark] > corrs[i] for all other image properties i.

A stronger assert would verify that corrs[dark] is sufficiently high (above some threshold), while all other corrs entries are sufficiently low (below some threshold). Note I'm assuming high corrs corresponds to high correlation here.
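On a stand-in result frame (made-up numbers, and assuming high score = strong correlation as noted above), the two forms of the assert could be sketched as:

```python
import pandas as pd

# Stand-in for datalab._spurious_correlations() output; values are invented.
corrs = pd.DataFrame(
    {"score": [0.92, 0.18, 0.11]},
    index=pd.Index(["dark", "light", "blurry"], name="image_property"),
)

# Weak form: the planted property beats every other image property.
others = corrs.drop(index="dark")["score"]
assert (corrs.loc["dark", "score"] > others).all()

# Stronger form: planted property above a threshold, the rest below one.
assert corrs.loc["dark", "score"] > 0.8
assert (others < 0.5).all()
```

The thresholds (0.8, 0.5) are placeholders that a real test would tune against the actual scoring scale.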

@jwmueller jwmueller requested a review from elisno October 26, 2023 06:56
Let Datalab worry about passing the correct arguments into the constructor.

Generalize some of the methods. Add docstrings.
The test file includes a simple initializer test, and a property-based test on the scoring method used.
Additionally, added a random feature generator that allows you to configure a correlation coefficient.
@elisno
Member

elisno commented Nov 1, 2023

The type-check error in the CI

cleanlab/datalab/internal/issue_manager/noniid.py:217: error: Argument 4 to "_select_features_and_setup_knn" of "NonIIDIssueManager" has incompatible type "str | bool | None"; expected "bool"  [arg-type]

is unrelated, but should be fixed on master. Wait for CI to run there before merging master into this PR branch.

The test still needs to check the case when baseline_accuracy == 1, when the eps value comes into play.
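A sketch of that edge case (the function name, formula, clipping to [0, 1], and the eps default are guesses at the PR's scoring logic, with a low score meaning the property predicts the labels well):

```python
def relative_room_for_improvement(
    baseline_accuracy: float, mean_accuracy: float, eps: float = 1e-8
) -> float:
    # Error remaining relative to the majority-class baseline: 0 means the
    # property predicts the labels perfectly, 1 means no improvement.
    numerator = max(1.0 - mean_accuracy, 0.0)
    denominator = max(1.0 - baseline_accuracy, eps)
    return min(numerator / denominator, 1.0)

# Edge case from the note above: every label is the majority class, so
# baseline_accuracy == 1 and eps keeps the division well-defined.
edge_score = relative_room_for_improvement(baseline_accuracy=1.0, mean_accuracy=1.0)
```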
@pytest.fixture
def lab(self):
# Load dataset, sample for speed
dataset = load_dataset("fashion_mnist", split="train").shuffle(seed=0).select(range(30))
Contributor

@tataganesh tataganesh Nov 9, 2023


We can consider using the streaming option of load_dataset to avoid downloading the entire Fashion-MNIST Dataset. The test should then run faster. E.g.

from datasets import Dataset
iterable_dataset = load_dataset("fashion_mnist", split="train", streaming=True).take(30)
dataset = Dataset.from_list([x for x in iterable_dataset])

Member


any unit test like this needs to be marked as slow

Member


should ideally replace this with a toy dataset. Do not need a real image dataset to test this basic code

properties = _issue_summary["issue_type"].values.tolist()

# Ensure only properties present in both datalab and imagelab are considered.
if self._imagelab:
Contributor


What are we checking here? _issue_summary was obtained using self._imagelab, so is this condition required?

relative_room_for_improvement
"""
baseline_accuracy = self._get_baseline()
assert (
Contributor


Can this be part of __post_init__, along with other asserts for properties_of_interest?

self.properties_of_interest is not None
), "properties_of_interest must be set, but is None."
property_scores = {
str(property_of_interest): self.calculate_spurious_correlation(
Contributor


Suggested change
str(property_of_interest): self.calculate_spurious_correlation(
property_of_interest: self.calculate_spurious_correlation(

str check is already performed in __post_init__.

if mean_accuracy == 1:
assert (
score == 0
), f"score: {score} is not 0, baseline_acc uracy: {baseline_accuracy}, mean_accuracy: {mean_accuracy}"
Contributor


Suggested change
), f"score: {score} is not 0, baseline_acc uracy: {baseline_accuracy}, mean_accuracy: {mean_accuracy}"
), f"score: {score} is not 0, baseline_accuracy: {baseline_accuracy}, mean_accuracy: {mean_accuracy}"


# Check that the scores are replicable
scores = data_scores["label_prediction_error"].tolist()
np.testing.assert_almost_equal(scores, [0.100, 0.420], decimal=3)
Contributor


If higher score indicates higher correlation, blurry_correlated should have a higher score than dark_uncorrelated?

@01PrathamS 01PrathamS reopened this Dec 16, 2023