Sources of Open AI models must also be open? #148

iperdomo · 2022-11-14T09:26:48Z

Do the sources of an Open AI model also need to be open in order to comply with the standard?

A couple of real examples:

Voice recordings from children used for training a speech-to-text (STT) model. In this case, a laborious and complex consent process with parents (and/or guardians) must be in place to release the sources used in the training.
Models where several datasets, including Google Trends (search) and NASA datasets, are used and there are no open alternatives to those.

christer-io · 2022-11-26T10:01:33Z

We need to be very clear on this. Models must be under an open license.

jwflory · 2023-01-27T10:46:38Z

Just curious. Are we talking about data or are we talking about models? I don't think we can have a conversation about models being under an open license if the data they are trained on is proprietary or limited access. I think AI models fall into a gray area with the current DPG Standard because models are, in many ways, an intersection of software and data. You can't have an AI model with just one; you need both ingredients.

jstclair2019 · 2023-02-16T12:49:50Z

I think from the software side AI/ML still fits within the DPG Standard. While I agree with @jwflory about the "grey area", I envision a submitter presenting a candidate DPG that is an open source software library. Even if it's trained on proprietary data sets, HOW the software libraries work and how they are updated (especially hard dependencies like another third-party library) can still be open to review.

iperdomo · 2023-02-28T08:51:38Z

We have another candidate that uses datasets behind a login wall and a custom dataset license

DPGAlliance/publicgoods-candidates#1363

We use MIMIC-III. As MIMIC-III requires the CITI training program in order to use it, we refer users to the link

Source: https://github.com/onefact/ClinicalBERT/blob/main/README.md#datasets

The license for MIMIC-III can be found at: https://physionet.org/content/mimiciii/1.4/ - PhysioNet Credentialed Health Data License 1.5.0

jaanli · 2023-04-21T13:38:37Z

Thank you for this discussion! Echoing @iperdomo -- there is a moral and ethical dilemma here.

For example, I have Estonian citizenship, and am a visiting professor at University of Tartu, where I teach students and faculty how to use ClinicalBERT (https://arxiv.org/abs/1904.05342). The course I teach: https://courses.cs.ut.ee/2023/chatGPT/spring

Some students and faculty have re-trained this ClinicalBERT in the Estonian language to help the department of health move to value-based care for the health system there.

I will try to articulate the moral and ethical issue that follows from requiring that protected health information be open source (e.g. using these license types: https://github.com/DPGAlliance/publicgoods-candidates/blob/main/help-center/licenses.md#data):

(1) countries like Estonia have homogeneous genetic ancestry

(2) countries like Estonia also have few people of color -- e.g. this reference states there were 414 people of African descent or Black Europeans in 2011 (https://www.enar-eu.org/wp-content/uploads/estonia_fact_sheet_briefing_final.pdf).

(3) Doctors' lack of awareness of diseases with genetic etiology, such as sickle cell anemia or polycystic ovarian syndrome, can cause death.

(4) Doctors CAN use tools like ClinicalBERT (https://arxiv.org/abs/1904.05342) to help make decisions about patients for whom they cannot access data due to legal, moral, and ethical standards (for example, a doctor in Estonia cannot expect a hospital in New York City to share data on Black, Black European, or African-American patients in the electronic health record -- and this could violate several laws).

(5) We have support from the National Institutes of Health, who are conducting the largest longitudinal study in the history of the United States, to solve this problem: https://drive.google.com/open?id=1Si323cuMQp68ilgsymi_KZu1FXqXFT3c&authuser=jaan%40onefact.org&usp=drive_fs -- this letter states that ClinicalBERT can be trained on the entirety of the researchallofus.org electronic health record. This health record contains up to 78,040 people who self-report their race/ethnicity as Black, African American or African. This is over 100 times more Black people than are in Estonia.

Not using this data risks leaving under-represented groups in many countries with medical decisions (made by themselves or by their care teams, hospital systems, health systems, and governments) that are not based on the most information possible, as delivered by open source software and AI (ClinicalBERT is Apache 2.0-licensed).

In my experience working across several large academic medical centers and hospitals in the United States and Europe, scenarios like this are exceedingly common, and open source AI tools like ClinicalBERT are one way to share knowledge for medical professionals and health systems to deliver valuable care to their population.

Does this use case make sense? And is it clear how the requirement that datasets be open source can lead to a situation where people do not have access to potentially life-saving care algorithms, in the case that they happen to be under-represented in the health system and clinical data repository their care team uses to inform decision-making?

Happy to clarify, cite, discuss any of the above. My email is [email protected] if easier.

kjetilk · 2023-04-21T23:22:18Z

While admittedly having not had time to read recent discussions in detail, I just wanted to point out the work that is going on within the Debian Project on this topic: https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst

ricardomiron · 2023-04-23T23:42:52Z

At the DPGA we are currently hosting a community of practice (CoP) on AI systems as DPGs. This CoP will run for the next couple of months and one of its focus areas will be to inform the DPG standard to enable a well-grounded assessment and vetting process of AI-based solutions that are submitted to the DPG Registry.

Some of the things we will be discussing include (but are not limited to):

What an open AI model/system means as a stand-alone DPG category, and what its core components are.
The requirement to have the sources and/or training datasets open, and the challenges this raises.
Best practices, open standards, and required documentation for AI.
Other privacy, security, and ethical considerations.

We hope that at the end of this CoP, we can reach a final conclusion on this issue (#148) and also have a stronger understanding of the requirements for AI models/systems as DPGs, so feel free to share any thoughts or resources that can help guide this conversation.

jstclair2019 · 2023-04-24T15:12:08Z

Thanks @ricardomiron. I'd like to know if CoP participation is still open. I've been involved in open source efforts for vulnerability management in AI that I think need to be included for consideration as well.

ricardomiron · 2023-05-02T05:01:32Z

@jstclair2019 the CoP has already started but I'll send you a DM with more details.

Also, to be more specific these are the current requirements to consider open AI models for the DPG vetting process:

The model itself (source code for model creation, training, optimization, etc.) must be under an OSI-approved open-source software license.
The data sources should be explicitly mentioned, and the training datasets must be publicly available under an approved open data license.
Training of the model is expected to be carried out using only these open datasets, alongside proper documentation for reproducibility.
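
To make the three requirements above concrete, here is a minimal sketch of how a candidate's metadata might be screened against them. It is not the DPGA's vetting tooling: the `Candidate` and `Dataset` types, the `check_candidate` function, and both license sets are hypothetical and only illustrative. The authoritative lists are the OSI-approved licenses and the open data licenses accepted by the DPG Standard. The example data reflects what was said earlier in this thread about ClinicalBERT (Apache 2.0 code, MIMIC-III behind credentialed access).

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Illustrative subsets only (assumption): consult the OSI list and the
# DPG Standard's approved open data licenses for the real sets.
OSI_APPROVED = {"Apache-2.0", "MIT", "BSD-3-Clause", "GPL-3.0-only"}
OPEN_DATA = {"CC-BY-4.0", "CC0-1.0", "ODbL-1.0"}


@dataclass
class Dataset:
    name: str
    license_id: str           # SPDX-style identifier, or free text if none exists
    publicly_available: bool  # False if behind a login wall or credentialing


@dataclass
class Candidate:
    name: str
    code_license: str
    datasets: list[Dataset] = field(default_factory=list)


def check_candidate(c: Candidate) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    # Requirement 1: model source code under an approved open-source license.
    if c.code_license not in OSI_APPROVED:
        problems.append(f"code license {c.code_license!r} is not in the approved set")
    # Requirement 2: data sources must be explicitly mentioned...
    if not c.datasets:
        problems.append("no data sources are listed")
    for d in c.datasets:
        # ...publicly available...
        if not d.publicly_available:
            problems.append(f"dataset {d.name!r} is not publicly available")
        # ...and under an approved open data license.
        if d.license_id not in OPEN_DATA:
            problems.append(f"dataset {d.name!r} license {d.license_id!r} is not an approved open data license")
    return problems


clinicalbert = Candidate(
    name="ClinicalBERT",
    code_license="Apache-2.0",
    datasets=[
        Dataset(
            name="MIMIC-III",
            license_id="PhysioNet Credentialed Health Data License 1.5.0",
            publicly_available=False,
        )
    ],
)

for problem in check_candidate(clinicalbert):
    print(problem)
```

Checking every listed dataset is how this sketch approximates the third requirement (training only on open datasets). It cannot verify that the list is complete or that the reproducibility documentation exists, so those would still need manual review.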

nathanbaleeta changed the title ~~Sources of Open AI models must be also open?~~ Sources of Open AI models must also be open? Nov 14, 2022

This was referenced Sep 11, 2023

Add DPG: ZIConnect AI Platform (10522) DPGAlliance/publicgoods-candidates#1615

Open

Add DPG: Angaza Personalized Learning Recommendation Engine (10407) DPGAlliance/publicgoods-candidates#1625

Open

jaanli mentioned this issue Jun 2, 2024

support for wasmedge models? endomorphosis/ipfs_transformers#1

Open
