Add table of issue type info and relevant column name descriptions #1100
base: master
Allows referring to sections by their title.
…red arguments Includes small notes for edge cases.
These doctests are not executed by CI at the moment, but the usage examples show where the relevant cleanlab columns are found.
@@ -56,6 +59,80 @@ To handle mislabeled examples, you can either filter out the data with label iss

Learn more about the method used to detect label issues in our paper: `Confident Learning: Estimating Uncertainty in Dataset Labels <https://arxiv.org/abs/1911.00068>`_

.. testsetup:: *
This testsetup block will be executed for all doctest blocks (.. testcode::); the doctests just won't run until we set them up in CI.
These testsetup cells are not visible in the docs.
@@ -46,6 +46,7 @@
"sphinx.ext.napoleon",
"nbsphinx",
"sphinx.ext.autodoc",
"sphinx.ext.autosectionlabel",
This is added to allow us to link to local section headings.
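For context, a minimal sketch of how this looks in a Sphinx `conf.py`. Only the `extensions` entries come from the diff above; the `autosectionlabel_prefix_document` setting is an assumption added for illustration and is not part of this PR.

```python
# conf.py (sketch): only the lines relevant to section-title references.
extensions = [
    "sphinx.ext.napoleon",
    "nbsphinx",
    "sphinx.ext.autodoc",
    "sphinx.ext.autosectionlabel",  # generates a label for every section title
]

# Assumption, not in this PR: prefix each generated label with its document
# path, so two pages that share a section title do not produce clashing labels.
autosectionlabel_prefix_document = True
```

With the prefix setting enabled, a section would be referenced as `:ref:`path/to/doc:Section Title``; without it, the title alone is the label.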
A numeric column with scores between 0 and 1.
A smaller value for an example indicates that it is less common or typical in the dataset, suggesting that it is more likely to be an outlier.

If most of the nearest-neighbors of an example are exact-duplicates, then the outlier score of the example is set to 1.0.
^ this statement won't always be right, eg pred_probs based outlier
A column of lists of integers, where each list contains the indices of examples that belong to the same set of near-duplicates (not including the example itself).
Each set represents a group of examples that are extremely similar to each other, relative to the rest of the dataset.
The examples in each set may be exactly duplicated or have very similar feature representations.
Suggested change:
A column of lists of integers. The i-th list contains the indices of examples that are considered near-duplicates of example i (not including example i).
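To make the shape of this column concrete, here is a small standard-library sketch. The helper `near_duplicate_sets` and its fixed distance `threshold` are hypothetical and chosen only for illustration; this is not cleanlab's implementation, whose criterion is relative to the rest of the dataset.

```python
import math


def near_duplicate_sets(points, threshold):
    """For each example i, list the indices j != i lying within `threshold` of it."""
    sets = []
    for i, p in enumerate(points):
        neighbors = [
            j
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= threshold
        ]
        sets.append(neighbors)
    return sets


# Toy data: examples 0 and 3 are exact duplicates, example 1 is very close to both.
points = [(0.0, 0.0), (0.0, 0.01), (5.0, 5.0), (0.0, 0.0)]
sets = near_duplicate_sets(points, threshold=0.05)
print(sets)  # [[1, 3], [0, 3], [], [0, 1]]
```

Note that the i-th list never contains i itself, and an example with no near-duplicates gets an empty list.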
A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
The distance is calculated based on the provided `features` or `knn_graph`.
A smaller distance indicates that the example is more similar to its nearest neighbor.
Examples that are (near) duplicates have smaller distances to their nearest neighbors compared to other examples in the dataset.
Exact duplicates ideally have a distance of 0 to their nearest neighbor. However, due to floating point precision, especially when using certain distance metrics like Euclidean distance, this might not always be the case.
Suggested change:
A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
The distance is calculated based on the provided `features` or `knn_graph`, and is directly related to the `near_duplicate_score`.
A smaller distance indicates that the example is more similar to its nearest neighbor in the dataset.
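As an illustration of what such a column holds, the sketch below computes each example's distance to its nearest neighbor with a brute-force search. The helper name `distance_to_nearest_neighbor` and the choice of Euclidean distance are assumptions for this example; it does not reproduce how cleanlab derives the column from `features` or `knn_graph`.

```python
import math


def distance_to_nearest_neighbor(points):
    """Return, for each point, the Euclidean distance to its closest other point.

    Requires at least two points; brute force, so only suitable for tiny inputs.
    """
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


# Toy data: examples 0 and 3 are exact duplicates, example 2 is far from the rest.
points = [(0.0, 0.0), (0.0, 0.01), (5.0, 5.0), (0.0, 0.0)]
distances = distance_to_nearest_neighbor(points)

# Compare with a tolerance rather than exact equality, since computed
# distances are floating point values.
for i, d in enumerate(distances):
    print(i, round(d, 4))
```

In this toy data the exact duplicates get a distance of 0, the near-duplicate gets a small distance, and the isolated example gets a large one.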
Addresses #1081
This PR improves the Datalab Issue Types guide.
It adds a table for the different tasks that Datalab supports. For each issue type, it lists the key to pass in the issue_types argument of Datalab.find_issues(), the relevant columns in the Datalab.issues dataframe, and which arguments of Datalab.find_issues() are required to successfully run the issue check.
Here's a screenshot of the table (still WIP):