Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective

Link
Abstract

Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work content generation. To address this, unlearning methods have been developed to erase the involved concepts from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw that allows the erased concepts to be restored. This erasure trustworthiness problem needs probing, but previous methods are sub-optimal from two perspectives: (1) Lack of transferability: some methods operate in a white-box setting, requiring access to the unlearned model, and the learned adversarial inputs often fail to transfer to other unlearned models for concept restoration; (2) Limited attack: prompt-level methods struggle to restore narrow concepts, such as celebrity identity, from unlearned models. Therefore, this paper leverages the transferability of adversarial attacks to probe unlearning robustness in a black-box setting. This challenging scenario assumes that the unlearning method is unknown and the unlearned model is inaccessible for optimization, requiring the attack to transfer across different unlearned models. Specifically, we employ an adversarial search strategy to find an adversarial embedding that transfers across different unlearned models. This strategy adopts the original Stable Diffusion model as a surrogate and iteratively erases and searches for embeddings, enabling it to find an embedding that restores the target concept under different unlearning methods. Extensive experiments demonstrate the transferability of the searched adversarial embedding across several state-of-the-art unlearning methods and its effectiveness for different levels of concepts.

Synth

Problem:: Existing unlearning methods only shift the text-to-image mapping while preserving the visual content / Prior probing methods lack transferability and require a white-box setting / Prompt-level methods fail to restore narrow concepts such as celebrity identity

Solution:: Iteratively find and erase concept-related regions, then use an embedding from a region whose mapping existing unlearning methods failed to alter

Novelty:: Works even in a black-box setting / Finds that the limitation of existing unlearning methods lies in low-density regions (regions close to the concept are high-density)

Note:: The attack likely works well because all unlearned models in the experiments are Stable Diffusion based; in a truly black-box setting even the base Stable Diffusion model should be unknown, though that seems impractical / The limitation of existing unlearning is that it erases related concepts only in the text space

Summary

Motivation

file-20250414235818683.png

In other words, manipulation in the text space using representative concept tokens is easily blocked by unlearning methods → instead, find regions that still generate images of the concept but are only weakly connected to language.

Method

file-20250414235958951.png
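The paper describes the search only at a high level: use the original Stable Diffusion model as a surrogate, find an embedding that regenerates the concept, erase that region from the surrogate, and search again, so each round is pushed into a region the unlearning methods did not alter. A toy 2-D sketch of this erase-and-search dynamic, where a mixture of Gaussians stands in for the concept score over embedding space (all modes, step sizes, and thresholds are illustrative assumptions, not the paper's actual optimization over Stable Diffusion):

```python
import numpy as np

# Toy stand-in for the surrogate model's generative space:
# three modes are embedding regions that all render the target concept.
MODES = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])

def concept_score(e):
    """Proxy for 'how strongly embedding e regenerates the concept'."""
    return float(sum(np.exp(-np.sum((e - m) ** 2) / 4.0) for m in MODES))

def objective(e, erased, repel=2.0):
    """Concept score minus a penalty near embeddings already erased from
    the surrogate, forcing each round to discover a *new* region."""
    pen = sum(np.exp(-np.sum((e - p) ** 2) / 4.0) for p in erased)
    return concept_score(e) - repel * pen

def objective_grad(e, erased, repel=2.0):
    g = np.zeros_like(e)
    for m in MODES:  # pull toward concept modes
        g += np.exp(-np.sum((e - m) ** 2) / 4.0) * (-(e - m) / 2.0)
    for p in erased:  # push away from already-erased embeddings
        g -= repel * np.exp(-np.sum((e - p) ** 2) / 4.0) * (-(e - p) / 2.0)
    return g

def search_round(erased, steps=200, lr=0.05):
    """One search round: coarse grid scan, then gradient-ascent refinement."""
    grid = np.array([[x, y] for x in range(-5, 6) for y in range(-5, 6)], float)
    e = max(grid, key=lambda c: objective(c, erased)).copy()
    for _ in range(steps):
        e += lr * objective_grad(e, erased)
    return e

erased, found = [], []
for _ in range(3):        # iterate: search an embedding, then re-erase it
    e = search_round(erased)
    found.append(e)
    erased.append(e)      # the surrogate now 'unlearns' this region as well

for e in found:
    print(np.round(e, 2), round(concept_score(e), 3))
```

Each round lands near a different mode, mirroring the paper's observation that erasing the text-to-image mapping leaves other embedding regions that still produce the concept; the real attack performs this search in the text-encoder embedding space of Stable Diffusion rather than in 2-D.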

Method Validation

Embedding Visualization

file-20250415000654177.png|675

Object Concept Restoration

Artistic Style Restoration

NSFW Content Restoration

Celebrity Identity Restoration

Ablation Study