Add table of issue type info and relevant column name descriptions #1100

Open
wants to merge 8 commits into
base: master

@elisno (Member) commented Apr 15, 2024

Addresses #1081

This PR improves the Datalab Issue Types guide.

It adds a table covering the different tasks that Datalab supports. For each task, the table lists:

  • The names of the issues Datalab can find.
  • Whether each issue is checked by default (otherwise it is accessible via the issue_types argument of Datalab.find_issues()).
  • The relevant columns in the Datalab.issues dataframe.
  • The types of inputs to Datalab.find_issues() that are required to run the issue check successfully.

Here's a screenshot of the table (still WIP):
[image: screenshot of the table]

@elisno elisno marked this pull request as draft April 15, 2024 23:28
@elisno elisno marked this pull request as ready for review April 17, 2024 18:54
@@ -56,6 +59,80 @@ To handle mislabeled examples, you can either filter out the data with label iss

Learn more about the method used to detect label issues in our paper: `Confident Learning: Estimating Uncertainty in Dataset Labels <https://arxiv.org/abs/1911.00068>`_

.. testsetup:: *
Member Author

This testsetup block will be executed for all doctest blocks (`.. testcode::`). The doctests themselves just won't run until we set them up in CI.

Member Author

These testsetup cells are not visible in the docs.

@@ -46,6 +46,7 @@
"sphinx.ext.napoleon",
"nbsphinx",
"sphinx.ext.autodoc",
"sphinx.ext.autosectionlabel",
Member Author

This is added to allow us to link to local section headings.
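
For context, a minimal sketch of how this extension is typically wired up in a Sphinx `conf.py`. The `extensions` entries are the ones shown in the diff above; `autosectionlabel_prefix_document` is an optional Sphinx setting that this PR does not necessarily use, so treat it as an assumption:

```python
# conf.py (sketch): only the entries visible in the diff hunk are listed here.
extensions = [
    "sphinx.ext.napoleon",
    "nbsphinx",
    "sphinx.ext.autodoc",
    "sphinx.ext.autosectionlabel",  # generates a label for every section heading
]

# Assumption, not confirmed by this PR: prefix each label with its document
# path so that identical headings in different files do not collide.
autosectionlabel_prefix_document = True
```

With the prefix enabled, a heading is referenced from rST as ``:ref:`path/to/doc:Section Title```; without it, ``:ref:`Section Title``` is enough, but duplicate headings across files produce warnings.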

A numeric column with scores between 0 and 1.
A smaller value for an example indicates that it is less common or typical in the dataset, suggesting that it is more likely to be an outlier.

If most of the nearest-neighbors of an example are exact-duplicates, then the outlier score of the example is set to 1.0.
Member

Suggested change
- If most of the nearest-neighbors of an example are exact-duplicates, then the outlier score of the example is set to 1.0.

Member

^ this statement won't always be right, e.g. for pred_probs-based outlier detection

Comment on lines +256 to +258
A column of lists of integers, where each list contains the indices of examples that belong to the same set of near-duplicates (not including the example itself).
Each set represents a group of examples that are extremely similar to each other, relative to the rest of the dataset.
The examples in each set may be exactly duplicated or have very similar feature representations.
Member

Suggested change
- A column of lists of integers, where each list contains the indices of examples that belong to the same set of near-duplicates (not including the example itself).
- Each set represents a group of examples that are extremely similar to each other, relative to the rest of the dataset.
- The examples in each set may be exactly duplicated or have very similar feature representations.
+ A column of lists of integers. The i-th list contains the indices of examples that are considered near-duplicates of example i (not including example i).
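
To make the suggested wording concrete, here is a rough sketch of what such a column could look like. It is not Datalab's implementation: the helper `near_duplicate_sets` and the fixed absolute `threshold` are hypothetical, chosen only to illustrate that the i-th list holds the neighbors of example i and never i itself.

```python
import math


def near_duplicate_sets(features, threshold):
    """Hypothetical helper: for each example i, collect the indices j != i
    whose Euclidean distance to example i is at most `threshold`."""
    n = len(features)
    sets = []
    for i in range(n):
        neighbors = [
            j
            for j in range(n)
            if j != i and math.dist(features[i], features[j]) <= threshold
        ]
        sets.append(neighbors)
    return sets


# Toy data: examples 0 and 3 are exact duplicates, example 1 is very close
# to both, and example 2 is far away from everything else.
features = [
    (0.0, 0.0),
    (0.0, 0.01),
    (5.0, 5.0),
    (0.0, 0.0),
]
sets = near_duplicate_sets(features, threshold=0.05)
print(sets)  # [[1, 3], [0, 3], [], [0, 1]]
```

An example with no near-duplicates gets an empty list, which is why the column is described per example rather than per group.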

Comment on lines +263 to +267
A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
The distance is calculated based on the provided `features` or `knn_graph`.
A smaller distance indicates that the example is more similar to its nearest neighbor.
Examples that are (near) duplicates have smaller distances to their nearest neighbors compared to other examples in the dataset.
Exact duplicates ideally have a distance of 0 to their nearest neighbor. However, due to floating point precision, especially when using certain distance metrics like Euclidean distance, this might not always be the case.
Member

Suggested change
- A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
- The distance is calculated based on the provided `features` or `knn_graph`.
- A smaller distance indicates that the example is more similar to its nearest neighbor.
- Examples that are (near) duplicates have smaller distances to their nearest neighbors compared to other examples in the dataset.
- Exact duplicates ideally have a distance of 0 to their nearest neighbor. However, due to floating point precision, especially when using certain distance metrics like Euclidean distance, this might not always be the case.
+ A numeric column that represents the distance between each example and its nearest neighbor in the dataset.
+ The distance is calculated based on the provided `features` or `knn_graph`, and is directly related to the `near_duplicate_score`.
+ A smaller distance indicates that the example is more similar to its nearest neighbor in the dataset.
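
As an illustration of the column described here, a brute-force sketch that computes each example's distance to its nearest neighbor from raw feature vectors. The function name is hypothetical and Euclidean distance is an assumption; Datalab itself works from the provided `features` or `knn_graph` and may use a different metric.

```python
import math


def distance_to_nearest_neighbor(features):
    """Hypothetical helper: Euclidean distance from each example to the
    closest other example (brute force, fine for a small illustration)."""
    distances = []
    for i, x in enumerate(features):
        nearest = min(
            math.dist(x, y) for j, y in enumerate(features) if j != i
        )
        distances.append(nearest)
    return distances


# Examples 0 and 2 are exact duplicates; examples 1 and 3 are one unit apart.
features = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (3.0, 5.0)]
distances = distance_to_nearest_neighbor(features)
print(distances)  # [0.0, 1.0, 0.0, 1.0]
```

The exact duplicates get a distance of 0 here, while the merely similar pair gets a small positive distance.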
