Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

#text-to-image-diffusion #cross-attention #self-attention #diffusion-models #free-prompt-editing #attention-map-analysis #text-guided-image-editing

Link

https://ieeexplore.ieee.org/document/10655224

Abstract

Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. However, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. This approach modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. Nevertheless, little is known about the semantic meanings that these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information, which can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention mechanisms in diffusion models. Furthermore, based on our findings, we propose a simplified, yet more stable and efficient, tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets. Source code and datasets are available at https://github.com/alibaba/EasyNLP/tree/master/diffusion/FreePromptEditing.

Synth

Problem:: Existing tuning-free Text-guided Image Editing (TIE) methods modify the attention layers of a diffusion model, but the roles and semantic meaning of Cross-Attention and Self-Attention are poorly understood / Methods that modify Cross-Attention are unstable and can fail to edit

Solution:: Probing analysis clarifies the role of each attention map (Cross-Attention: category/semantic information; Self-Attention: spatial/structural information) / Based on that analysis, Free-Prompt-Editing (FPE) leaves Cross-Attention untouched and replaces only the Self-Attention maps with those of the source image / DDIM Inversion is used when editing real images

Novelty:: Systematic probing analysis of the roles of attention maps in Stable Diffusion, with results / Demonstrates that modifying Cross-Attention maps is a cause of editing failures / Proposes FPE, a simple and stable TIE framework that modifies only Self-Attention maps

Note:: Outside the editing setting, in plain text-to-image generation, if the Cross-Attention maps were replaced with ones extracted from a null-text prompt, could the desired image still be generated?

Summary

Motivation

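Text-to-Image Synthesis (TIS) models such as Stable Diffusion have become very popular for text-based image generation
In domain-specific scenarios, however, tuning-free Text-guided Image Editing (TIE) matters more
- TIE modifies objects or attributes in an image by manipulating feature components in the attention layers during generation
- Yet little is known about the semantic meanings the attention layers have learned, or which parts contribute to a successful edit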

Analysis on Cross and Self-Attention

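Two main analysis approaches:
1. Probing analysis: experiments that test whether attention maps contain semantic information
  - Train a two-layer MLP classifier to analyze what information the attention maps contain
  - Uses a prompt dataset built from color adjectives and animal nouns
2. Attention-map replacement experiments: while editing with the target prompt, replace the attention maps in various layers with those from the source prompt
  - Cross-Attention map replacement vs. Self-Attention map replacement
  - Replacing a specific layer range vs. replacing all layers
Key findings:
- Cross-Attention maps contain category information, so replacing them can make the edit fail
  - Classification experiment: classifiers trained on Cross-Attention maps score high → they carry a lot of semantic information
  - Replacement experiment: replacing Cross-Attention is likely to fail (first row of the figure above)
- Self-Attention maps play an important role in preserving the image's spatial information
  - Classification experiment: classifiers trained on Self-Attention maps score low (category classification is poor; color classification fails entirely) → they carry structural information
  - Replacement experiment: replacing Self-Attention loses the original structure (second row of the figure above)
  - Replacing the Self-Attention maps of layers 4-14 gives the best balance between structure preservation and successful editing
- Tokens unrelated to the edited part of the prompt can also carry meaningful information
  - With 'A brown car' as target and 'A blue car' as source, the Cross-Attention map for the source's 'car' token contains some of the source's color information
  - However, when the information for 'brown' is replaced with that for 'blue', taking 'car' from the source does not make the color edit fail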
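
A minimal sketch of the probing idea: train a two-layer MLP on feature vectors and read held-out accuracy as a measure of how much label information the features carry. This is not the authors' code. The features here are synthetic stand-ins for flattened attention maps, and the helper names (`train_probe`, `accuracy`, `probe_accuracy`), hidden size, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import numpy as np

def train_probe(x, y, n_classes, hidden=32, lr=0.1, epochs=500, seed=0):
    """Two-layer MLP (ReLU hidden layer, softmax output), full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.1, (x.shape[1], hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, (hidden, n_classes)); b2 = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        h = np.maximum(x @ w1 + b1, 0.0)
        logits = h @ w2 + b2
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
        g_logits = (p - onehot) / len(x)                  # cross-entropy gradient
        g_h = (g_logits @ w2.T) * (h > 0)                 # backprop through ReLU
        w2 -= lr * h.T @ g_logits; b2 -= lr * g_logits.sum(0)
        w1 -= lr * x.T @ g_h;      b1 -= lr * g_h.sum(0)
    return w1, b1, w2, b2

def accuracy(params, x, y):
    w1, b1, w2, b2 = params
    pred = (np.maximum(x @ w1 + b1, 0.0) @ w2 + b2).argmax(axis=1)
    return float((pred == y).mean())

# Synthetic stand-ins (assumption): one feature set depends on the label, one does not.
rng = np.random.default_rng(0)
n_classes, dim, n = 4, 16, 400
labels = rng.integers(0, n_classes, size=2 * n)
class_means = rng.normal(0.0, 2.0, (n_classes, dim))
informative = class_means[labels] + rng.normal(0.0, 1.0, (2 * n, dim))
uninformative = rng.normal(0.0, 1.0, (2 * n, dim))

def probe_accuracy(features):
    """Standardize on the training half, train the probe, score on the held-out half."""
    mu, sd = features[:n].mean(0), features[:n].std(0) + 1e-8
    z = (features - mu) / sd
    return accuracy(train_probe(z[:n], labels[:n], n_classes), z[n:], labels[n:])

acc_informative = probe_accuracy(informative)
acc_uninformative = probe_accuracy(uninformative)
print(acc_informative, acc_uninformative)
```

High held-out accuracy means the label is decodable from the features, which is the sense in which the note reads high probe accuracy on Cross-Attention maps as "contains category information"; accuracy near chance is read as the information being absent.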

Method

Based on these findings, the Free-Prompt-Editing (FPE) algorithm is proposed:
- Use only the source's Self-Attention maps: $z_{t-1}^{*} \leftarrow \mathrm{DM}(z_{t}^{*}, P_{dst}, t)\{M_{self}^{*} \leftarrow M_{self}\}$
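
The replacement $M_{self}^{*} \leftarrow M_{self}$ can be illustrated on a single attention layer. The sketch below is a toy single-head self-attention in NumPy, not Stable Diffusion's actual layer; the function name `self_attention`, the `injected_map` parameter, and the random weights and features are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, injected_map=None):
    """Single-head self-attention; returns (output, attention_map_used).

    If injected_map is given, it replaces the map computed from x,
    while the values V are still computed from x.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    if injected_map is not None:
        attn = injected_map
    return attn @ v, attn

rng = np.random.default_rng(0)
tokens, dim = 6, 8
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
src_feats = rng.normal(size=(tokens, dim))   # features from the source pass (stand-in)
dst_feats = rng.normal(size=(tokens, dim))   # features from the target pass (stand-in)

_, m_self = self_attention(src_feats, wq, wk, wv)                   # record M_self
out_edit, m_used = self_attention(dst_feats, wq, wk, wv, m_self)    # inject into target pass
out_plain, _ = self_attention(dst_feats, wq, wk, wv)                # target pass without injection
```

The edited output mixes the target pass's values with the source pass's token-to-token layout, which matches the note's reading that Self-Attention maps carry structure while the content comes from the target prompt.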

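Editing a generated image:
- Sample Gaussian noise with the same random seed
- Generate the source and target images together, replacing the target's Self-Attention maps with the source's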
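
The two steps above can be sketched as a paired loop: both passes start from the same seeded noise, the source pass records its Self-Attention maps at each step, and the target pass reuses them in a chosen layer range. Everything here is a toy stand-in: `toy_denoise_step`, the 16-layer stack, the additive prompt conditioning, and the update rule are assumptions, not the real U-Net or scheduler. Only the layer range 4-14 comes from the note.

```python
import numpy as np

N_LAYERS, N_STEPS = 16, 5
INJECT_LAYERS = range(4, 15)  # layers 4-14, the range the note reports as the best trade-off

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def toy_denoise_step(z, prompt_vec, weights, store=None, inject=None):
    """Hypothetical stand-in for one DM(z_t, P, t) call: a stack of self-attention layers.

    store:  dict that receives the map used at each layer of this pass.
    inject: dict of maps from another pass, reused for layers in INJECT_LAYERS.
    """
    h = z + prompt_vec  # crude conditioning (assumption; real models use cross-attention)
    for layer, w in enumerate(weights):
        q, k, v = h @ w[0], h @ w[1], h @ w[2]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        if inject is not None and layer in INJECT_LAYERS:
            attn = inject[layer]
        if store is not None:
            store[layer] = attn
        h = h + 0.1 * (attn @ v)
    return z - 0.1 * h  # toy update, not a real scheduler step

rng = np.random.default_rng(0)
tokens, dim = 6, 8
weights = [rng.normal(size=(3, dim, dim)) / np.sqrt(dim) for _ in range(N_LAYERS)]
p_src, p_dst = rng.normal(size=dim), rng.normal(size=dim)  # stand-ins for prompt embeddings

z_T = np.random.default_rng(42).normal(size=(tokens, dim))  # same seed => same initial noise
z_src, z_edit = z_T.copy(), z_T.copy()
for t in range(N_STEPS):
    src_maps, used = {}, {}
    z_src = toy_denoise_step(z_src, p_src, weights, store=src_maps)
    z_edit = toy_denoise_step(z_edit, p_dst, weights, store=used, inject=src_maps)
```

After the loop, `used` and `src_maps` hold the maps from the final step: they coincide inside `INJECT_LAYERS` and differ elsewhere, since the target pass computes its own maps outside that range.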
Editing a real image:
- Use DDIM Inversion to recover the noise from the real image
- Inject the real image's Self-Attention maps while generating the target image
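
A minimal sketch of the deterministic DDIM update and its use for inversion. The update formula is the standard deterministic DDIM step; the noise schedule and the constant noise predictor `toy_eps` are assumptions. Because the toy predictor ignores its input, the round trip below is exact; with a real network the prediction changes between steps, so inversion is only approximate.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM move between two cumulative-alpha levels.

    a_from > a_to adds noise (inversion); a_from < a_to removes it (sampling).
    """
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

rng = np.random.default_rng(0)
alphas = np.linspace(0.999, 0.05, 11)   # toy cumulative alphas; index 0 is (almost) clean
EPS_CONST = rng.normal(size=(4, 4))

def toy_eps(x, step):
    """Hypothetical noise predictor; a real one would be the U-Net conditioned on a prompt."""
    return EPS_CONST

x0 = rng.normal(size=(4, 4))            # stand-in for the real image's latent

# Inversion: walk from the clean latent towards noise.
z = x0.copy()
for i in range(len(alphas) - 1):
    z = ddim_step(z, toy_eps(z, i), alphas[i], alphas[i + 1])
z_T = z

# Sampling: walk back from the recovered noise to the latent.
x = z_T.copy()
for i in range(len(alphas) - 1, 0, -1):
    x = ddim_step(x, toy_eps(x, i), alphas[i], alphas[i - 1])
x_rec = x
```

In the note's real-image setting, the inverted latent plays the role of the seeded Gaussian noise, and the Self-Attention maps recorded while reconstructing the real image are the ones injected into the target pass.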

Method Validation

Comparison against existing methods on several datasets
- Car-fake-edit, ImageNet-fake-edit, Car-real-edit, ImageNet-real-edit
- Wild-TI2I and ImageNet-R-TI2I benchmarks
- Evaluated with CLIP Score (CS) and CLIP Directional Similarity (CDS)
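
A short sketch of how the two metrics are commonly computed from embeddings. It assumes the usual definitions: CS as the cosine similarity between the edited image's embedding and the target text's embedding, and CDS as the cosine similarity between the image-space change and the text-space change. Scaling conventions vary between papers, and the 3-dimensional vectors below are hand-made stand-ins, not real CLIP outputs.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_score(img_edit, txt_dst):
    """CS: how well the edited image matches the target prompt."""
    return cosine(img_edit, txt_dst)

def clip_directional_similarity(img_src, img_edit, txt_src, txt_dst):
    """CDS: does the image change point the same way as the prompt change?"""
    return cosine(img_edit - img_src, txt_dst - txt_src)

# Hand-made stand-in embeddings (assumption): axis 0 ~ "blue car", axis 1 ~ "brown car".
txt_src, txt_dst = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
img_src = np.array([1.0, 0.0, 0.0])
img_good = np.array([0.0, 1.0, 0.0])      # edit moved along the prompt direction
img_copy = np.array([1.0, 0.0, 0.1])      # edit mostly reproduced the source

cds_good = clip_directional_similarity(img_src, img_good, txt_src, txt_dst)
cds_copy = clip_directional_similarity(img_src, img_copy, txt_src, txt_dst)
cs_good = clip_score(img_good, txt_dst)
```

An edit that reproduces the source scores low on CDS even when the image itself looks plausible, so CDS is the metric that would penalize a result that copies the original color.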
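FPE vs. P2P
- When changing the color of real cars, P2P tends to reproduce the original color, while FPE changes the color successfully
- For category changes, P2P gives incomplete transformations, while FPE achieves more complete ones
- FPE records higher CS and CDS on all datasets
Comparison with other SOTA methods
- Compared with SDEdit, DiffEdit, Pix2Pix-Zero, MasaCtrl, InstructPix2Pix, and others
- Preserves the original structure better and edits effectively even with complex image backgrounds
- Also better in computational cost (FPE: 6.30 s/image vs. PnP: 335.65 s/image)
Applicable to various TIS models
- Also works with other Stable Diffusion-based models such as Realistic-V2, Deliberate, and Anything-V4
- Supports a range of edits (changing age, hairstyle, background, category, etc.)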