This repository contains the dataset, code, and outputs related to the thesis "Interpreting Multi-Word Expressions in Text-to-Image Generation: A Cross-Attention Approach Using Stable Diffusion." The work explores how generative models like Stable Diffusion 2.1 handle Multi-Word Expressions (MWEs) and uses interpretability tools such as DAAM (Diffusion Attention Attribution Maps) to analyze model attention.
## Table of Contents

- Introduction
- Dataset Description
- Code Description
- Outputs
- Setup and Usage
- Dependencies
- Acknowledgments
- License
## Introduction

Multi-Word Expressions (MWEs) encompass idioms, collocations, and phrases that convey meanings beyond their literal components. They are central to natural language but pose challenges for generative models due to their context-dependent and often abstract nature. This repository:
- Explores the semantic and visual representation of MWEs.
- Leverages Stable Diffusion 2.1 for generating images from text prompts.
- Uses DAAM to analyze and visualize attention distribution across the generated images.
## Dataset Description

This dataset is an extended version of the MWE-CWI dataset. It includes:
- MWEs: Multi-word expressions such as "spill the beans" and "hard pressed."
- Context: Sentences providing linguistic and situational context for each MWE.
- Generated Prompts: Text prompts crafted for Stable Diffusion to represent each MWE.
- Annotations:
  - Supersenses: High-level semantic categories for the MWEs.
  - Complexity scores: Binary and probabilistic labels indicating difficulty for native and non-native speakers.
This dataset plays a pivotal role in:
- Generating accurate textual prompts.
- Evaluating semantic alignment between MWEs and their visual representations.
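For orientation, here is a minimal sketch of loading and inspecting the dataset with pandas; the file name and column names are illustrative assumptions rather than the repository's exact schema.

```python
import pandas as pd

# Hypothetical file and column names; the CSV shipped with this repository may differ.
df = pd.read_csv("mwe_cwi_extended.csv")

# Each row pairs an MWE with its sentence context, the engineered prompt, and annotations.
columns = ["expression", "context", "generated_prompt", "supersense", "complexity_prob"]
print(df[columns].head())
```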
## Code Description

The Jupyter notebook contains all the code for the following steps:
**Prompt Engineering:**
- Integrates MWEs naturally into descriptive and contextually rich prompts.
- Ensures prompts stay within Stable Diffusion’s token limit.
- Example: "A close-up of a person nervously spilling beans onto the floor, symbolizing the accidental revealing of a secret. The atmosphere is tense, and the background is a modern office."
**Image Generation:**
- Uses Stable Diffusion 2.1 to generate high-quality images based on the crafted prompts.
- Encodes prompts using CLIP and processes them through latent diffusion steps.
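A minimal sketch of this step with the Hugging Face diffusers pipeline, which performs the CLIP text encoding and the latent diffusion loop internally; the sampler settings shown are common defaults, not necessarily the ones used in the thesis.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion 2.1 in half precision so it fits on a single Colab GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "A close-up of a person nervously spilling beans onto the floor, symbolizing "
    "the accidental revealing of a secret."
)

# The pipeline encodes the prompt with CLIP and denoises latents over 50 steps.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("spill_the_beans.png")
```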
**DAAM Visualizations:**
- Generates heatmaps to visualize cross-attention scores.
- Maps heatmaps onto the image to highlight the model’s focus areas.
- Example outputs include heatmaps for "spill the beans" and "hard pressed."
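A minimal sketch of the tracing step, assuming the daam package's trace interface and reusing the `pipe` and `prompt` from the generation sketch above; exact calls may vary between daam versions.

```python
from daam import trace
from matplotlib import pyplot as plt

# Record cross-attention while the pipeline runs, then aggregate it into a global heatmap.
with trace(pipe) as tc:
    output = pipe(prompt, num_inference_steps=50)
    image = output.images[0]
    heat_map = tc.compute_global_heat_map()

# Project the heatmap for one component word of the MWE (e.g. "beans") onto the image.
word_map = heat_map.compute_word_heat_map("beans")
word_map.plot_overlay(image)
plt.savefig("daam_beans_overlay.png")
```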
**Evaluation Framework:**
- Metrics:
  - Intersection over Union (IoU): Measures alignment between heatmaps and annotated ground-truth masks.
  - Coverage: Quantifies how much attention is focused on relevant regions.
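One way these metrics might be computed from a word-level heatmap and a binary ground-truth mask; the function names, the 0.5 binarization threshold, and the NumPy-array inputs are illustrative rather than the thesis's exact implementation.

```python
import numpy as np

def iou(heatmap: np.ndarray, mask: np.ndarray, threshold: float = 0.5) -> float:
    """Intersection over Union between a binarized heatmap and a ground-truth mask."""
    pred = heatmap >= threshold * heatmap.max()
    gt = mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def coverage(heatmap: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of the total attention mass that falls inside the relevant region."""
    total = heatmap.sum()
    return float(heatmap[mask.astype(bool)].sum() / total) if total else 0.0
```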
## Outputs

**Visual Comparison: "President Barack Obama"**

Prompt details:
- Genre: News
- Complexity score (probabilistic): 0.45
- Expression: "President Barack Obama"
- Context: "Outgoing US President Barack Obama authorised the move in response to Russian intervention in Ukraine in 2014, in which Crimea was annexed."
- Generated Prompt: "Outgoing US President Barack Obama stands in a solemn setting, authorizing a strategic response to Russia's 2014 intervention in Ukraine, where Crimea was annexed. The scene captures Obama's decisive expression amid a backdrop of geopolitical tension."
**Visual Comparison: "Spill the Beans"**

This example captures the idiomatic expression "spill the beans." The original image shows a person nervously spilling beans, symbolizing the revealing of a secret. The DAAM overlay highlights the model's attention to the beans and to the context of the action.
## Dependencies

- torch: PyTorch for Stable Diffusion.
- transformers: Hugging Face library for model loading.
- cv2: OpenCV for image processing.
- matplotlib: Visualization of heatmaps and outputs.
- pandas: Dataset handling.
- Google Colab (recommended): Provides a GPU-powered environment for running the notebook.
## Acknowledgments

This repository relies on the following resources:
- MWE-CWI Dataset: A publicly available dataset for identifying and evaluating multi-word expressions (MWEs).
- Stable Diffusion: A cutting-edge latent diffusion model by Stability AI and Hugging Face.
- DAAM: A visualization tool for attention attribution in diffusion models.
## License

This repository is licensed under the MIT License. See the LICENSE file for details.
## Setup and Usage

Clone the repository and open the notebook (Google Colab with a GPU runtime is recommended):

git clone https://github.com/your-username/MWE-Stable-Diffusion-Interpretation.git
cd MWE-Stable-Diffusion-Interpretation