Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

Link
Abstract

Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. However, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. This approach modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. Nevertheless, little is known about the semantic meanings that these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information, which can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention mechanisms in diffusion models. Furthermore, based on our findings, we propose a simplified, yet more stable and efficient, tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets. Source code and datasets are available at https://github.com/alibaba/EasyNLP/tree/master/diffusion/FreePromptEditing.

Synth

Problem:: Existing tuning-free Text-guided Image Editing (TIE) methods modify the attention layers of diffusion models, yet the roles and semantic meanings of cross-attention and self-attention remain poorly understood / Methods based on cross-attention modification are unstable and prone to editing failures

Solution:: A probing analysis identifies the roles of the attention maps (cross-attention: category/semantic information; self-attention: spatial/structural information) / Based on this analysis, Free-Prompt-Editing (FPE) is proposed, which forgoes cross-attention modification and replaces only the self-attention maps with those of the source image / DDIM Inversion is used for real-image editing
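The DDIM Inversion mentioned above runs the deterministic DDIM update backwards to recover a noise latent from a real image, which can then be denoised while injecting attention maps. A minimal NumPy sketch of the idea, where `alphas_bar` is a made-up toy schedule and `eps_model` is a stand-in linear noise predictor (both are illustrative assumptions, not the paper's implementation); as with real DDIM inversion, reconstruction is only approximate because the predictor is queried at slightly different states in each direction:

```python
import numpy as np

alphas_bar = np.linspace(0.9999, 0.1, 50)  # toy cumulative-alpha schedule

def eps_model(x, t):
    # stand-in for the UNet noise predictor; a simple deterministic function
    return 0.1 * x

def ddim_step(x, t_from, t_to):
    # one deterministic DDIM update; t_to > t_from inverts, t_to < t_from denoises
    a_from, a_to = alphas_bar[t_from], alphas_bar[t_to]
    eps = eps_model(x, t_from)
    x0_pred = (x - np.sqrt(1 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1 - a_to) * eps

x0 = np.ones((4, 4))          # stand-in for a clean image latent
x = x0
for t in range(len(alphas_bar) - 1):        # inversion: image -> noise latent
    x = ddim_step(x, t, t + 1)
for t in range(len(alphas_bar) - 1, 0, -1): # denoising: noise latent -> image
    x = ddim_step(x, t, t - 1)
# x is now an approximate reconstruction of x0
```

In the actual FPE pipeline, the denoising loop would additionally swap in the inverted pass's self-attention maps at each step.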

Novelty:: Systematic probing analysis of the roles of attention maps in Stable Diffusion, with results presented / Demonstrates that cross-attention map modification is a cause of editing failures / Proposes FPE, a simple and stable TIE framework that modifies only self-attention maps

Note:: In plain text-to-image generation rather than editing, if the cross-attention map is replaced with one extracted from a null-text prompt, could the desired image still be generated?

Summary

Motivation

file-20250421152941813.png|500

Analysis on Cross and Self-Attention

Method

file-20250421152639025.png|925

  1. Generated-image editing algorithm:
    • Sample Gaussian noise with the same random seed for both passes
    • Replace the self-attention maps during the source and target image generation processes
  2. Real-image editing algorithm:
    • Extract the noise latent from the real image via DDIM Inversion
    • Inject the real image's self-attention maps during target image generation
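The self-attention replacement at the core of both algorithms above can be sketched with a toy single-head attention layer: compute the source pass's attention map, then reuse it verbatim in the target pass so the target output inherits the source layout while taking its values from the target features. This is a minimal NumPy sketch with random toy features; the function and variable names are illustrative, not from the paper's code:

```python
import numpy as np

def self_attention(q, k, v, attn_override=None):
    """Single-head self-attention; optionally inject an external attention map."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # softmax numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    if attn_override is not None:
        # FPE-style swap: the target pass reuses the source pass's
        # self-attention map, preserving the source geometry/shape
        attn = attn_override
    return attn @ v, attn

rng = np.random.default_rng(0)
n_tok, d = 16, 8  # toy sizes; SD self-attention acts on many spatial tokens
q_s, k_s, v_s = rng.normal(size=(3, n_tok, d))  # source-pass features
q_t, k_t, v_t = rng.normal(size=(3, n_tok, d))  # target-pass features

_, attn_src = self_attention(q_s, k_s, v_s)                   # source pass
out_tgt, attn_used = self_attention(q_t, k_t, v_t, attn_src)  # target pass
```

In a real implementation this swap would be installed as a hook on selected self-attention layers of the UNet, applied at each denoising step; cross-attention is left untouched.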

Method Validation