Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Link
Abstract

Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks, such as the potential generation of harmful content and copyright violations. The techniques of machine unlearning, also known as concept erasing, have been developed to address these risks. However, these techniques remain vulnerable to adversarial prompt attacks, which can prompt DMs post-unlearning to regenerate undesired images containing concepts (such as nudity) meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in the robust unlearning framework referred to as AdvUnlearn. However, achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality post-unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, optimizing the trade-off between concept erasure robustness and model utility in AdvUnlearn. Moreover, we identify the text encoder as a more suitable module for robustification compared to UNet, ensuring unlearning effectiveness. Furthermore, the acquired text encoder can serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform extensive experiments to demonstrate the robustness advantage of AdvUnlearn across various DM unlearning scenarios, including the erasure of nudity, objects, and style concepts. In addition to robustness, AdvUnlearn also achieves a balanced trade-off with model utility. To our knowledge, this is the first work to systematically explore robust DM unlearning through AT, setting it apart from existing methods that overlook robustness in concept erasing. Codes are available at https://github.com/OPTML-Group/AdvUnlearn.

Synth

Problem:: Existing unlearning methods for diffusion models are vulnerable to adversarial prompt attacks / It is difficult to balance adversarial robustness against image generation quality

Solution:: Proposes AdvUnlearn, a framework that integrates adversarial training (AT) into the unlearning process / Preserves image generation quality via a utility-retaining regularization
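The min-max structure behind this solution can be sketched in a toy form: an inner attack perturbs the prompt embedding to best undo the erasure, and the outer step minimizes the erasure loss at that adversarial prompt plus a utility-retaining term on a retain prompt. Everything here is an illustrative assumption, not the paper's implementation: a linear map `W` stands in for the trainable text encoder, a frozen linear map `U` for the diffusion backbone, and quadratic losses replace the real denoising objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical names and shapes):
W = 0.1 * rng.normal(size=(16, 16))   # trainable "text encoder"
U = 0.1 * rng.normal(size=(8, 16))    # frozen "UNet" surrogate
x_c = rng.normal(size=16)             # prompt containing the concept to erase
x_r = rng.normal(size=16)             # retain-set prompt (utility regularizer)
y_r = U @ W @ x_r                     # pre-unlearning behavior to preserve

def unlearn_loss(W, x):
    # Erasure objective: drive the model output for the concept prompt to zero.
    out = U @ W @ x
    return float(out @ out)

def attack(W, x, steps=5, step_size=0.1):
    # Inner maximization: FGSM-style ascent over a prompt perturbation,
    # a simplified embedding-space stand-in for adversarial prompt attacks.
    delta = np.zeros_like(x)
    A = U @ W
    for _ in range(steps):
        grad = 2 * A.T @ (A @ (x + delta))   # d/d(delta) of ||A(x+delta)||^2
        delta += step_size * np.sign(grad)   # ascent: adversary fights erasure
    return delta

lam, lr = 0.5, 0.01
loss_before = unlearn_loss(W, x_c)
for _ in range(50):
    delta = attack(W, x_c)                   # re-attack at every outer step
    xa = x_c + delta
    out_c = U @ W @ xa
    out_r = U @ W @ x_r
    # Outer minimization: erase at the adversarial prompt, plus the
    # utility-retaining regularizer keeping retain-prompt outputs unchanged.
    grad_W = 2 * U.T @ np.outer(out_c, xa) \
           + 2 * lam * U.T @ np.outer(out_r - y_r, x_r)
    W -= lr * grad_W
loss_after = unlearn_loss(W, x_c)
```

Re-running the attack at every outer step is what makes this adversarial training rather than plain unlearning, and the `lam`-weighted retain term mirrors the utility-retaining regularization that keeps generation quality from collapsing.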

Novelty:: Optimizing the text encoder is more effective than optimizing the UNet in terms of both ASR and FID

Note:: Erasing nudity is more effective when more text-encoder layers are trained → is nudity a higher-level concept in the linguistic sense than style/object, and does it therefore demand more capacity? / The trained text encoder is effective across multiple DMs → does it transfer because the text encoder's embedding space was partitioned from the DM's perspective? / In ESD, training only the text encoder raises ASR dramatically but severely degrades FID → the text condition only enters through cross-attention (CA), so does the degradation propagate to self-attention (SA), which renders the overall image? Or is it that the latent diffusion model itself depends heavily on the text embedding, so corrupting that embedding space is what causes the damage?

Summary

Motivation

Even ESD, after unlearning, is defenseless against adversarial prompt attacks

Method

Method Validation