Add table of issue type info and relevant column name descriptions #1100

elisno · 2024-04-15T23:10:02Z

Addresses #1081

This PR improves the Datalab Issue Types guide.

It adds a table for the different tasks that Datalab supports. For each task, the table lists:

- The names of the issues it can find.
- Whether each issue is searched for by default (otherwise, it is accessible via the `issue_types` argument of `Datalab.find_issues()`).
- A collection of relevant columns in the `Datalab.issues` dataframe.
- The types of inputs to `Datalab.find_issues()` that are required to successfully run the issue check.

Here's a screenshot of the table (still WIP):


elisno · 2024-04-17T18:56:57Z

docs/source/cleanlab/datalab/guide/issue_type_description.rst

@@ -56,6 +59,80 @@ To handle mislabeled examples, you can either filter out the data with label iss

 Learn more about the method used to detect label issues in our paper: `Confident Learning: Estimating Uncertainty in Dataset Labels <https://arxiv.org/abs/1911.00068>`_

+.. testsetup:: *


This testsetup block applies to all doctest blocks (`.. testcode::`). The doctests themselves won't run until we set them up in CI.

These testsetup cells are not visible in the docs.

elisno · 2024-04-17T19:01:22Z

docs/source/conf.py

@@ -46,6 +46,7 @@
 "sphinx.ext.napoleon",
 "nbsphinx",
 "sphinx.ext.autodoc",
+ "sphinx.ext.autosectionlabel",


This is added to allow us to link to local section headings.
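As a rough illustration of the change above, the relevant part of `conf.py` might look like the sketch below. Only the four entries visible in the diff are shown; the real list may contain more. The `autosectionlabel_prefix_document` setting is an assumption and not part of this PR. It is a standard option of the extension that prefixes each label with its document path, which helps when the same heading appears in several files.

```python
# docs/source/conf.py (excerpt, limited to the entries visible in the diff)
extensions = [
    "sphinx.ext.napoleon",
    "nbsphinx",
    "sphinx.ext.autodoc",
    "sphinx.ext.autosectionlabel",  # added in this PR: lets :ref: target section titles
]

# Assumption, NOT part of this PR: namespace labels by document path so that
# identical headings in different files do not produce duplicate-label warnings.
autosectionlabel_prefix_document = True
```

With the prefix option enabled, a section is referenced as `:ref:`path/to/document:Section Title``; without it, the bare section title is used as the label.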

jwmueller · 2024-05-30T17:57:04Z

docs/source/cleanlab/datalab/guide/issue_type_description.rst

+A numeric column with scores between 0 and 1. 
+A smaller value for an example indicates that it is less common or typical in the dataset, suggesting that it is more likely to be an outlier.
+
+If most of the nearest-neighbors of an example are exact-duplicates, then the outlier score of the example is set to 1.0.


Suggested change

-If most of the nearest-neighbors of an example are exact-duplicates, then the outlier score of the example is set to 1.0.

^ This statement won't always be right, e.g. for outliers detected from pred_probs.

jwmueller · 2024-05-30T18:00:27Z

docs/source/cleanlab/datalab/guide/issue_type_description.rst

+A column of lists of integers, where each list contains the indices of examples that belong to the same set of near-duplicates (not including the example itself).
+Each set represents a group of examples that are extremely similar to each other, relative to the rest of the dataset.
+The examples in each set may be exactly duplicated or have very similar feature representations.


Suggested change

-A column of lists of integers, where each list contains the indices of examples that belong to the same set of near-duplicates (not including the example itself).
-Each set represents a group of examples that are extremely similar to each other, relative to the rest of the dataset.
-The examples in each set may be exactly duplicated or have very similar feature representations.
+A column of lists of integers. The i-th list contains the indices of examples that are considered near-duplicates of example i (not including example i).
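To make the "i-th list" wording concrete, here is a minimal pure-Python sketch of what such a column could contain. It is not Datalab's implementation: the helper `near_duplicate_sets`, the Euclidean metric, and the fixed `threshold` are all assumptions chosen for illustration.

```python
from math import dist


def near_duplicate_sets(points, threshold):
    """For each example i, collect the indices j != i whose distance to i is at most threshold."""
    sets = []
    for i, p in enumerate(points):
        neighbors = [
            j for j, q in enumerate(points)
            if j != i and dist(p, q) <= threshold
        ]
        sets.append(neighbors)
    return sets


# Examples 0 and 1 are exact duplicates, example 2 is very close to both,
# and example 3 is far from everything else.
points = [(0.0, 0.0), (0.0, 0.0), (0.01, 0.0), (5.0, 5.0)]
sets = near_duplicate_sets(points, threshold=0.1)
print(sets)  # [[1, 2], [0, 2], [0, 1], []]
```

Note that example i never appears in its own list, and an example with no near-duplicates gets an empty list.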

jwmueller · 2024-05-30T18:01:42Z

docs/source/cleanlab/datalab/guide/issue_type_description.rst

+A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
+The distance is calculated based on the provided `features` or `knn_graph`.
+A smaller distance indicates that the example is more similar to its nearest neighbor.
+Examples that are (near) duplicates have smaller distances to their nearest neighbors compared to other examples in the dataset.
+Exact duplicates ideally have a distance of 0 to their nearest neighbor. However, due to floating point precision, especially when using certain distance metrics like Euclidean distance, this might not always be the case.


Suggested change

-A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
-The distance is calculated based on the provided `features` or `knn_graph`.
-A smaller distance indicates that the example is more similar to its nearest neighbor.
-Examples that are (near) duplicates have smaller distances to their nearest neighbors compared to other examples in the dataset.
-Exact duplicates ideally have a distance of 0 to their nearest neighbor. However, due to floating point precision, especially when using certain distance metrics like Euclidean distance, this might not always be the case.
+A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
+The distance is calculated based on the provided `features` or `knn_graph`, and is directly related to the `near_duplicate_score`.
+A smaller distance indicates that the example is more similar to its nearest neighbor in the dataset.
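The column described here can be illustrated with a small pure-Python sketch. It assumes Euclidean distance over raw feature tuples and a brute-force search; Datalab computes this from the provided `features` or `knn_graph`, so the helper below is hypothetical and only shows the meaning of the values.

```python
from math import dist


def distance_to_nearest_neighbor(points):
    """Return, for each example, the distance to its closest other example.

    Assumes at least two examples; brute force, for illustration only.
    """
    return [
        min(dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


# Examples 0 and 1 are exact duplicates; example 2 is 5 units away from both.
points = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
distances = distance_to_nearest_neighbor(points)
print(distances)  # [0.0, 0.0, 5.0]
```

The exact duplicates get a distance of 0, while the isolated example gets a larger value, matching the reading that smaller distances mean greater similarity to the nearest neighbor.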

elisno added 3 commits April 15, 2024 20:50

Add sphinx.ext.autosectionlabel extension

f9809c0

Allows to refer to sections by their title.

Add tables listing issue names, flag default issues and specify requi…

d3492d7

…red arguments Includes small notes for edge cases.

Switch to list-table and add descriptions of relevant columns

aac10e1

elisno marked this pull request as draft April 15, 2024 23:28

elisno added 5 commits April 17, 2024 18:38

Add descriptions for cleanlab columns in the Datalab issue type guide

afa5fef

Add doctest cells in issue_type_descriptions.

03d13fa

These doctests are not executed by CI at this moment, but the usage example show where the relevant cleanlab columns are found.

Update references in table

f3efc23

change text

f763b76

Move table to separate rst file

eec8b3b

elisno marked this pull request as ready for review April 17, 2024 18:54

elisno commented Apr 17, 2024


jwmueller reviewed May 30, 2024
