From 1db7a0a264548f5d1608a968e5b11c088a4b41eb Mon Sep 17 00:00:00 2001
From: Jenia Jitsev
Date: Sat, 7 Sep 2024 20:42:17 +0200
Subject: [PATCH] Minor text corrections

---
 blog/relaion-5b.md         | 2 +-
 notes/laion-maintenance.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/blog/relaion-5b.md b/blog/relaion-5b.md
index 6cac6792..f5022ecf 100644
--- a/blog/relaion-5b.md
+++ b/blog/relaion-5b.md
@@ -25,7 +25,7 @@ Open datasets necessary for open science and for reproducible studies of foundat
 At LAION, we are dedicated to building safe and legally compliant datasets and tools to advance research and promote widespread accessibility of AI for academia and technology. However, while contributing to important solutions necessary for basic and applied research in machine learning at larger scales, we are aware that we as a non-profit research organization with limited resources cannot single-handedly rectify all publicly available online information. We play a significant role, but not the entirety of it, building alliances with people and organizations that possess strong expertise and skills in handling large-scale dataset composition and pipelines necessary to perform it together.
 
-We take full accountability for the accuracy of our publications, whether datasets, models, or tools. Prior to releasing LAION-400M and LAION-5B to the public, we implemented and refined filters to eliminate various problematic content. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-8 describe the specific measures we took for filtering CSAM related material. However, the findings from David Thiel (Stanford Internet Observatory, 19.12.2023) revealed that some links pointing to illegal content still slipped through our filters into LAION-5B text-links to images dataset, which led us to [promptly withdraw LAION-5B from circulation for the necessary safety revision](https://laion.ai/notes/laion-maintenance/).
+We take full accountability for the accuracy of our publications, whether datasets, models, or tools. Prior to releasing LAION-400M and LAION-5B to the public, we implemented and refined filters to eliminate various problematic content. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-9 describe the specific measures we took for filtering CSAM related material. However, the findings from David Thiel (Stanford Internet Observatory, 19.12.2023) revealed that some links pointing to illegal content still slipped through our filters into LAION-5B text-links to images dataset, which led us to [promptly withdraw LAION-5B from circulation for the necessary safety revision](https://laion.ai/notes/laion-maintenance/).
 
 Regarding datasets, we believe an open approach is the most effective and safe one, because in addition to securing reproducibility, it also empowers anyone to inspect and see what’s inside, allowing for validation and for scientific progress executed together by the broad community, continually checking and improving the dataset as important artifact in a transparent manner. We think as with any open-source project, also open datasets should be subject to continuous scrutiny by the broad community, in a common effort to make open datasets better and better.
 We thus appreciate very much the effort David Thiel from the Stanford Internet Observatory undertook to look closely at LAION 5B and are grateful to all partner organizations for working with us on making it a better, safer dataset for the research community to use.
diff --git a/notes/laion-maintenance.md b/notes/laion-maintenance.md
index f4eee3e2..5930c03f 100644
--- a/notes/laion-maintenance.md
+++ b/notes/laion-maintenance.md
@@ -9,7 +9,7 @@ There have been reports in the press about the results of a research project at
 LAION is a non-profit organization that provides datasets, tools and models for the advancement of machine learning research. We are committed to open public education and the environmentally safe use of resources through the reuse of existing datasets and models.
 
-LAION datasets (more than 5.85 billion entries) are sourced from the freely available Common Crawl web index and offer only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-8 describe the specific measures we took for filtering CSAM related material.
+LAION datasets (more than 5.85 billion entries) are sourced from the freely available Common Crawl web index and offer only links to content on the public web, with no images. We developed and published our own rigorous filters to detect and remove illegal content from LAION datasets before releasing them. [See our original announcement from 20.08.2021](https://laion.ai/blog/laion-400-open-dataset/#filtering-out-unsuitable-image-text-pairs), where points 6-9 describe the specific measures we took for filtering CSAM related material.
 
 LAION collaborates with universities, researchers and NGOs to improve these filters and are currently working with the [Internet Watch Foundation (IWF)](https://www.iwf.org.uk/) to identify and remove content suspected of violating laws. LAION invites the Stanford researchers to join its Community to improve our datasets and to develop efficient filters for detecting harmful content.