adding a spurious_correlation as a new issue type #872
base: master
Conversation
Codecov Report

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #872      +/-   ##
==========================================
- Coverage   96.78%   96.69%    -0.10%
==========================================
  Files          70       68        -2
  Lines        5544     5362      -182
  Branches      945      925       -20
==========================================
- Hits         5366     5185      -181
- Misses         89       90        +1
+ Partials       89       87        -2

☔ View full report in Codecov by Sentry.
Thanks for your contribution @01PrathamS! We will need you to sign the CLA: #872 (comment)

hey @jwmueller I've signed the CLA for the issue, looking forward to the review. thanks!

Addresses: #860

@01PrathamS If you could structure the description of the PR based on the Pull Request Template, it would be really helpful for the reviewers. You do not have to fill all the sections though, just relevant information would do. You can use this PR as a starting point for the description. Thank you!
cleanlab/datalab/datalab.py
Outdated
@@ -303,6 +303,19 @@ def find_issues(
            f"\nAudit complete. {self.data_issues.issue_summary['num_issues'].sum()} issues found in the dataset."
        )

    def _spurious_correlations(self) -> pd.DataFrame:
Please include an end-to-end unit test of this function. You should actually create a toy dataset that suffers from a spurious correlation (say have 10 tiny images at varying levels of darkness, and make the label related to how dark they are). And then verify that this code detects this spurious correlation. Likewise your same unit test should verify that the other spurious correlation scores (those unrelated to dark, light) do NOT give low scores for this same dataset.
For now you can just add the new unit test at the bottom of here:
https://github.com/cleanlab/cleanlab/blob/master/tests/datalab/test_datalab.py
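(For illustration only, not part of the PR: a minimal sketch of such a toy dataset, assuming the datasets.Dataset.from_dict pattern used later in this thread; all sizes, colors, and names here are made up.)

import numpy as np
from PIL import Image
from datasets import Dataset

# 10 tiny solid-color images at varying levels of darkness; the label is determined
# entirely by how dark the image is, giving a spurious "dark" correlation to detect.
brightness = np.linspace(20, 235, 10).astype(int)
images = [Image.new("RGB", (32, 32), (int(b), int(b), int(b))) for b in brightness]
labels = [1] * 5 + [0] * 5  # dark half -> class 1, light half -> class 0

toy_dataset = Dataset.from_dict({"image": images, "label": labels})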
Thank you for the suggestion @jwmueller, to include an end-to-end unit test. I'd like to ensure I create a comprehensive test that verifies the detection of spurious correlations effectively. However, I'm not entirely sure how to set up such a test, especially with a toy dataset. Could you please provide an example test or point me to any resources that might be helpful in creating this unit test?
You can generally follow the structure of any of the existing unit tests. I wouldn't worry too much about the precise code structure you use; we can help you refactor the code properly. Instead I would focus on ensuring the test runs quickly (toy dataset is small enough) but still tests the key logic -- namely that this code is actually able to detect an image property that is highly correlated with the labels, and that this code does not return false positives for image properties that have no relationship with the labels.

An example you could follow is test_find_issues_with_pred_probs: just change the dataset being used, add a final line lab._spurious_correlations() near the end of the test, and then check its results.
Thank you for the guidance, @jwmueller. I appreciate your clear explanation of what the unit test should achieve. I understand the high-level structure and the need to ensure it runs efficiently with a small toy dataset. However, I'm currently facing a bit of a roadblock when it comes to translating this into code. I looked at the example code and the other test code as well, but couldn't figure out how to get it done in code.
Which part is confusing to code specifically?
We can provide you some skeleton code or further pointers for that part, if you can write out your remaining specific questions.
I created this dataset:

light_score = [0.11, 0.43, 0.96, 0.28, 0.23, 0.21, 0.63, 0.40, 0.19, 0.93]
dark_score = [0.98, 0.57, 0.28, 0.97, 0.91, 0.95, 0.57, 0.60, 0.87, 0.34]
label = [0, 1, 2, 0, 0, 0, 1, 1, 0, 2]
issues = pd.DataFrame({'dark_score': dark_score,
                       'light_score': light_score,
                       'labels': label})
issue_summary = pd.DataFrame({'issue_type': ['dark', 'light'],
                              'num_issues': [10, 0]})

and it gives me this result:

   image_property  label_prediction_error
0            dark                     0.3
1           light                     0.3

but when I tested it on the MNIST dataset, taking this tutorial as reference (https://docs.cleanlab.ai/master/tutorials/image.html), it gives this output:

    image_property  label_prediction_error
0          outlier                0.836867
1   near_duplicate                0.843817
2  low_information                0.743633
3             dark                0.855317
I made a mistake by accidentally deleting the 'spurious_correlations' branch from my local machine. To rectify this, I have created a new branch named 'spurious_correlations_' and submitted a new pull request. I apologize for any inconvenience; as I am doing this for the first time, I will make sure to be more careful in the future.
I'd prefer not to work in a new PR, given I have left a lot of feedback on this one.
You should be able to get the branch back on your local machine by doing:
git checkout --track origin/spurious_correlations
git pull
(with git here pointed at your own fork). It should be good practice for you to get the branch back on your local machine, and resume work on the original PR if you can.
from datalab import DataLab


class SpuriousCorrelations:
Let's define this class in a different (new) file.
My suggestion is:
cleanlab/datalab/internal/spurious_correlation.py
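(A hypothetical skeleton of what that new file could contain is sketched below; the constructor arguments and method names are assumptions taken from the diff under review, not the final implementation.)

# cleanlab/datalab/internal/spurious_correlation.py -- hypothetical skeleton only
import numpy as np
import pandas as pd


class SpuriousCorrelations:
    """Scores how well each precomputed image-property column predicts the labels."""

    def __init__(self, issues: pd.DataFrame, labels: np.ndarray) -> None:
        self.issues = issues  # per-example property scores, e.g. "dark_score", "blurry_score"
        self.labels = labels  # class labels aligned with the rows of `issues`

    def spurious_correlations(self) -> pd.DataFrame:
        """Return one row per image property with its correlation score."""
        raise NotImplementedError  # the looped scoring from the PR's diff would live here

    def calculate_spurious_correlation(
        self, property_of_interest: str, baseline_accuracy: float
    ) -> float:
        """Cross-validated predictability of the labels from a single property."""
        raise NotImplementedError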
        data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
        return data_score

    def calculate_spurious_correlation(self, property_of_interest, baseline_accuracy):
please add mypy typing information for all arguments, as well as the return type
You'll need to get the type check that runs in our CI:
https://github.com/cleanlab/cleanlab/actions/runs/6512420947/job/17690017396?pr=872
to pass eventually, by adding the appropriate typing information everywhere
        X = self.issues[property_of_interest].values.reshape(-1, 1)
        y = self.labels
        classifier = GaussianNB()
        cv_accuracies = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
let's make cv_folds = 5 an optional argument of calculate_spurious_correlation() and set cv = cv_folds here.
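(A sketch of that change, applied to the method fragment shown above; the typing requested earlier is included, and only cv_folds is new. This is illustrative, not the merged code.)

def calculate_spurious_correlation(
    self, property_of_interest: str, baseline_accuracy: float, cv_folds: int = 5
) -> float:
    X = self.issues[property_of_interest].values.reshape(-1, 1)
    y = self.labels
    classifier = GaussianNB()
    cv_accuracies = cross_val_score(classifier, X, y, cv=cv_folds, scoring='accuracy')
    ...  # combine cv_accuracies with baseline_accuracy into the final score, as in the PR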
        for property_of_interest in image_properties:
            S = self.calculate_spurious_correlation(property_of_interest, baseline_accuracy)
            property_scores[f'{property_of_interest}'] = S
        data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
Suggested change:
-         data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'Overall_score'])
+         data_score = pd.DataFrame(list(property_scores.items()), columns=['image_property', 'label_prediction_error'])
    def spurious_correlations(self) -> pd.DataFrame:
        baseline_accuracy = np.bincount(self.labels).argmax() / len(self.labels)
        image_properties = ["near_duplicate_score", "blurry_score", "light_score", "low_information_score", "dark_score", "grayscale_score", "odd_aspect_ratio_score", "odd_size_score"]
before you loop over these, you should restrict to only the subset that is present in self.issues. In some cases, not all of these properties will have been previously computed, in which case we should simply not compute the spurious correlation for the properties that were not computed already.
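(Illustrative fragment of that restriction, assuming self.issues is a DataFrame whose columns hold the computed property scores; variable names follow the diff above.)

        # Only score the properties whose columns were actually computed into self.issues.
        available_properties = [p for p in image_properties if p in self.issues.columns]
        for property_of_interest in available_properties:
            S = self.calculate_spurious_correlation(property_of_interest, baseline_accuracy)
            property_scores[property_of_interest] = S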
from sklearn.naive_bayes import GaussianNB
from statistics import mode
import warnings
warnings.filterwarnings('ignore')
delete this line; we should not be suppressing warnings in our codebase. If there are warnings being printed, please explain why.
"Thank you for your feedback, @tataganesh . I'll make sure to structure the description of the PR based on the Pull Request Template to provide relevant information. I appreciate your guidance, and I'll use this PR as a starting point for the description. Thanks again!" |
hello @jwmueller, please check out my submitted PR; the new issue type spurious_correlation changes have been added.

So this PR is now up-to-date and I close the other one? I'll take a look after you confirm that, thanks.

Done, @jwmueller
tests/datalab/test_datalab.py
Outdated
dark_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
issue_summary = pd.DataFrame({
    'issue_type': ['dark'],
    'num_issues': 10,
})

data = {
    'issues': pd.DataFrame({'dark_score': dark_scores}),
    'labels': labels,
    'issue_summary': issue_summary,
}
this is not how the test should be. The test should:

- construct a toy image dataset (say with 10-20 images)
- run datalab.find_issues(..., image_key=...) on this toy dataset, following the tutorial here: https://docs.cleanlab.ai/master/tutorials/image.html
- run corrs = datalab._spurious_correlations() after that
- verify the results in corrs are as expected via assert statements

Make sure in your toy dataset there is some image-property that is correlated with your class labels.
you can see an easy way to construct a toy image here:
https://github.com/cleanlab/cleanvision/blob/main/tests/test_image_property_helpers.py
using:

from PIL import Image
img = Image.new("RGB", (200, 200), (255, 0, 0))
Thank you @jwmueller for your previous responses to my question. I've made multiple attempts to convert my custom dataset to Datalab, but I'm still facing difficulties, particularly with non-standard datasets. I've already raised the issue in the help channel on Slack: https://cleanlab-community.slack.com/archives/C031BGERG3Z/p1697512220170169

How can I convert this to a Datalab?
import datasets
from PIL import Image
images = [
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164)),
Image.new("RGB", (164, 164), (255, 255, 255)),
Image.new("RGB", (164, 164), (255, 255, 255))
]
labels = ['dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'dark', 'light', 'light']  # one label per image (12 total)
data = {
"image": [img for img in images],
"label": labels
}
dataset = datasets.Dataset.from_dict(data)
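(Not an authoritative answer, but following the image tutorial linked earlier in this thread, one way to wrap such a dataset looks roughly like the snippet below; the label_name/image_key arguments mirror the usage discussed above.)

from cleanlab import Datalab

# "label" and "image" are the column names defined in the dataset above.
lab = Datalab(data=dataset, label_name="label", image_key="image")
lab.find_issues()  # computes the image-property scores needed before checking spurious correlations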
hello @jwmueller, can you please review this and help me write the condition for the assert statement?
class TestSpuriousCorrelation:
    def create_data(self):
        images = [Image.new("RGB", (32, 32), (25, 25, 25))] * 5 + \
                 [Image.new("RGB", (32, 32), (255, 255, 255))] * 5
        rand_images = (np.random.rand(40, 32, 32, 3) * 255).astype(np.uint8)
        images = images + [Image.fromarray(img) for img in rand_images]
        labels = np.array([0] * 5 + [1] * 5 + [2] * 40)
        data = {
            "image": images,
            "label": labels
        }
        dataset = Dataset.from_dict(data)
        lab = Datalab(data=dataset, image_key="image", label_name="label")
        features = np.array([np.array(img).flatten() for img in images])
        return lab, features

    def test_spurious_correlation(self):
        imagelab, features = self.create_data()
        imagelab.find_issues(features=features)
        corrs = imagelab._spurious_correlations()
        assert corrs
which property of the image here is the one correlated with the labels? Hard for me to tell just from your code. If it is, say, the dark score property, then your assert should verify that corrs[dark] > corrs[i] for all other image properties i. A stronger assert would verify that corrs[dark] is sufficiently high (above some threshold), while all other corrs entries are sufficiently low (below some threshold). Note I'm assuming high corrs corresponds to high correlation here.
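(A hedged sketch of such asserts is shown below; the column names follow the outputs printed earlier in this thread, the thresholds are purely illustrative, and the inequalities flip if the reported score is an error rather than a correlation.)

scores = corrs.set_index("image_property")["label_prediction_error"]
dark = scores["dark"]
others = scores.drop("dark")

# The correlated property should stand out from every other property...
assert (dark > others).all()
# ...and, more strongly, clear an absolute threshold while the rest stay low.
assert dark > 0.8
assert (others < 0.5).all()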
I wrote complete code to address this: https://colab.research.google.com/drive/1Pvu7QES5AgUQjTClwJAiJeHkKlhqW4sW#scrollTo=t9XAO03pRnfH&uniqifier=1
Let Datalab worry about passing the correct arguments into the constructor. Generalize some of the methods. Add docstrings.
The test file includes a simple initializer test, and a property-based test on the scoring method used.
Additionally, added a random feature generator that allows you to configure a correlation coefficient.
…rame to be scored.
The type-check error in the CI is unrelated, but should be fixed on master. Wait for CI to run there before merging master into this PR branch.
The test still needs to check the case when baseline_accuracy == 1, when the eps value comes into play.
    @pytest.fixture
    def lab(self):
        # Load dataset, sample for speed
        dataset = load_dataset("fashion_mnist", split="train").shuffle(seed=0).select(range(30))
We can consider using the streaming option of load_dataset to avoid downloading the entire Fashion-MNIST dataset. The test should then run faster. E.g.

from datasets import Dataset, load_dataset

iterable_dataset = load_dataset("fashion_mnist", split="train", streaming=True).take(30)
dataset = Dataset.from_list([x for x in iterable_dataset])
any unit test like this needs to be marked as slow
should ideally replace this with a toy dataset. We do not need a real image dataset to test this basic code.
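(A hedged sketch of such a toy-dataset fixture; the image sizes, colors, and Datalab arguments are assumptions consistent with the rest of this thread, and the imports follow the test file.)

    @pytest.fixture
    def lab(self):
        # 10 tiny solid-color images whose darkness determines the label -- nothing to download.
        images = [Image.new("RGB", (32, 32), (v, v, v)) for v in range(0, 250, 25)]
        labels = [1] * 5 + [0] * 5
        dataset = Dataset.from_dict({"image": images, "label": labels})
        return Datalab(data=dataset, label_name="label", image_key="image")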
        properties = _issue_summary["issue_type"].values.tolist()

        # Ensure only properties present in both datalab and imagelab are considered.
        if self._imagelab:
What are we checking here? _issue_summary was obtained using self._imagelab, so is this condition required?
        relative_room_for_improvement
        """
        baseline_accuracy = self._get_baseline()
        assert (
Can this be part of __post_init__, along with other asserts for properties_of_interest?
            self.properties_of_interest is not None
        ), "properties_of_interest must be set, but is None."
        property_scores = {
            str(property_of_interest): self.calculate_spurious_correlation(
Suggested change:
-             str(property_of_interest): self.calculate_spurious_correlation(
+             property_of_interest: self.calculate_spurious_correlation(

The str check is already performed in __post_init__.
    if mean_accuracy == 1:
        assert (
            score == 0
        ), f"score: {score} is not 0, baseline_acc uracy: {baseline_accuracy}, mean_accuracy: {mean_accuracy}"
), f"score: {score} is not 0, baseline_acc uracy: {baseline_accuracy}, mean_accuracy: {mean_accuracy}" | |
), f"score: {score} is not 0, baseline_accuracy: {baseline_accuracy}, mean_accuracy: {mean_accuracy}" |
|
    # Check that the scores are replicable
    scores = data_scores["label_prediction_error"].tolist()
    np.testing.assert_almost_equal(scores, [0.100, 0.420], decimal=3)
If higher score indicates higher correlation, blurry_correlated should have a higher score than dark_uncorrelated?
I am submitting my first PR to cleanlab. I might be a bit inexperienced, but I am dedicated to following your instructions. I'm writing this to request a review of my initial pull request; your constructive feedback and guidance are really valuable to me.

Thank you, @jwmueller