
Commit

Minor text corrections
JeniaJitsev committed Sep 7, 2024
1 parent e8b9485 commit 1db7a0a
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion blog/relaion-5b.md
@@ -25,7 +25,7 @@ Open datasets necessary for open science and for reproducible studies of foundat

At LAION, we are dedicated to building safe and legally compliant datasets and tools to advance research and promote widespread accessibility of AI for academia and technology. However, while contributing important solutions necessary for basic and applied research in machine learning at larger scales, we are aware that, as a non-profit research organization with limited resources, we cannot single-handedly rectify all publicly available online information. We play a significant role, but not the only one, and we build alliances with people and organizations that possess strong expertise and skills in handling large-scale dataset composition and the pipelines necessary to perform it together.

-We take full accountability for the accuracy of our publications, whether datasets, models, or tools. Prior to releasing LAION-400M and LAION-5B to the public, we implemented and refined filters to eliminate various problematic content. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-8 describe the specific measures we took for filtering CSAM related material. However, the findings from David Thiel (Stanford Internet Observatory, 19.12.2023) revealed that some links pointing to illegal content still slipped through our filters into LAION-5B text-links to images dataset, which led us to [promptly withdraw LAION-5B from circulation for the necessary safety revision](https://laion.ai/notes/laion-maintenance/).
+We take full accountability for the accuracy of our publications, whether datasets, models, or tools. Prior to releasing LAION-400M and LAION-5B to the public, we implemented and refined filters to eliminate various problematic content. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-9 describe the specific measures we took for filtering CSAM related material. However, the findings from David Thiel (Stanford Internet Observatory, 19.12.2023) revealed that some links pointing to illegal content still slipped through our filters into LAION-5B text-links to images dataset, which led us to [promptly withdraw LAION-5B from circulation for the necessary safety revision](https://laion.ai/notes/laion-maintenance/).

Regarding datasets, we believe an open approach is the most effective and safest one: in addition to securing reproducibility, it empowers anyone to inspect what’s inside, allowing validation and scientific progress to be carried out jointly by the broad community, which can continually check and improve the dataset as an important artifact in a transparent manner. As with any open-source project, open datasets should be subject to continuous scrutiny by the broad community, in a common effort to make them better and better. We thus very much appreciate the effort David Thiel from the Stanford Internet Observatory undertook to look closely at LAION-5B, and we are grateful to all partner organizations for working with us on making it a better, safer dataset for the research community to use.

2 changes: 1 addition & 1 deletion notes/laion-maintenance.md
@@ -9,7 +9,7 @@ There have been reports in the press about the results of a research project at

LAION is a non-profit organization that provides datasets, tools and models for the advancement of machine learning research. We are committed to open public education and the environmentally safe use of resources through the reuse of existing datasets and models.

-LAION datasets (more than 5.85 billion entries) are sourced from the freely available Common Crawl web index and offer only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-8 describe the specific measures we took for filtering CSAM related material.
+LAION datasets (more than 5.85 billion entries) are sourced from the freely available Common Crawl web index and offer only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-9 describe the specific measures we took for filtering CSAM related material.

LAION collaborates with universities, researchers and NGOs to improve these filters and is currently working with the [Internet Watch Foundation (IWF)](https://www.iwf.org.uk/) to identify and remove content suspected of violating laws. LAION invites the Stanford researchers to join its community to improve our datasets and to develop efficient filters for detecting harmful content.

