Stay updated with the latest in Responsible AI. Subscribe to the Responsible AI newsletter for weekly updates on new papers and more.
Welcome to the Responsible AI Paper Summaries repository! Here, you'll find concise summaries of key papers in various areas of responsible AI.
This repository provides brief summaries of AI/ML papers in the following areas:
- Explainability and Interpretability
- Fairness and Biases
- Privacy
- Security
- Safety
- Accountability
- Human Control and Interaction
- Legal and Ethical Guidelines
-
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions - arXiv, 2024. This paper from OpenAI introduces an instruction hierarchy to train LLMs to prioritize privileged instructions(system messages) over lower-level ones (user messages and third-party inputs), enhancing their robustness against adversaries.
-
Taxonomy of Risks Posed by Language Models - FAccT ’22. This paper develops a comprehensive taxonomy of ethical and social risks associated with large-scale language models (LMs). It identifies twenty-one risks and categorizes them into six risk areas to guide responsible innovation and mitigation strategies.
-
Why Should I Trust You? Explaining the Predictions of Any Classifier - KDD 2016. This paper introduces LIME (Local Interpretable Model-agnostic Explanations), a technique to explain the predictions of any classifier in a faithful and interpretable manner by learning an interpretable model locally around the prediction.
Explainability and Interpretability
-
Why Should I Trust You? Explaining the Predictions of Any Classifier - KDD 2016. This paper introduces LIME (Local Interpretable Model-agnostic Explanations), a technique to explain the predictions of any classifier in a faithful and interpretable manner by learning an interpretable model locally around the prediction.
-
A Nutritional Label for Rankings - SIGMOD’18. Provides a web-based application called Ranking Facts that generates a "nutritional label" for rankings to enhance transparency, fairness, and stability.
-
A Unified Approach to Interpreting Model Predictions - NIPS 2017. Introduces SHAP (SHapley Additive exPlanations), a unified framework for interpreting model predictions by assigning each feature an importance value for a particular prediction, integrating six existing methods into a single, cohesive approach.
Fairness and Biases
- Taxonomy of Risks Posed by Language Models - FAccT ’22. This paper develops a comprehensive taxonomy of ethical and social risks associated with large-scale language models (LMs). It identifies twenty-one risks and categorizes them into six risk areas to guide responsible innovation and mitigation strategies.
Privacy
Security
Safety
-
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions - arXiv, 2024. This paper from OpenAI introduces an instruction hierarchy to train LLMs to prioritize privileged instructions(system messages) over lower-level ones (user messages and third-party inputs), enhancing their robustness against adversaries.
-
Taxonomy of Risks Posed by Language Models - FAccT ’22. This paper develops a comprehensive taxonomy of ethical and social risks associated with large-scale language models (LMs). It identifies twenty-one risks and categorizes them into six risk areas to guide responsible innovation and mitigation strategies.
-
To Believe or Not to Believe Your LLM - arXiv, 2024. This paper explores uncertainty quantification in LLMs to detect hallucinations by distinguishing epistemic from aleatoric uncertainties using an information-theoretic metric.
-
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models - arXiv, 2024. CARES evaluates the trustworthiness of medical vision language models (Med-LVLMs) across five dimensions: trustfulness, fairness, safety, privacy, and robustness.
-
Air Gap: Protecting Privacy-Conscious Conversational Agents - arXiv 2024. A paper from Google proposes AirGapAgent to prevent data leakage from LLMs, ensuring privacy in adversarial contexts.
Accountability
Human Control and Interaction
- Towards a Science of Human-AI Decision Making: A Survey of Empirical Studies - FAccT '23. This survey reviews over 100 empirical studies to understand and improve human-AI decision-making, emphasizing the need for unified research frameworks.
Legal and Ethical Guidelines
Each summary is stored in the relevant subfolder within the summaries/
directory. You can browse through the summaries to quickly understand the main points of various papers.
We welcome contributions! Please read our CONTRIBUTING.md file for more details on how to contribute.