Circumventing Concept Erasure Methods For Text-To-Image Generative Models

Link
Abstract

Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine seven recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.

Synth

Problem:: Need to verify whether "concept erasure" methods are actually effective

Solution:: Proposes Concept Inversion (CI) / learns special word embeddings without changing the model's weights / extends Textual Inversion with a tailored approach for each concept erasure method

Novelty:: Demonstrates that all seven concept erasure methods can be circumvented / shows that concepts are not actually removed and the methods amount to input filtering

Note:: Defenses such as SLD, which dynamically modify the embedding space at each timestep, can still be circumvented, but they raise the practical barrier to doing so
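The core mechanism behind Concept Inversion is Textual Inversion: freeze every model weight and optimize only a new pseudo-token embedding against a reconstruction objective. The toy sketch below illustrates that optimization pattern with a tiny stand-in network; the model, loss target, and dimensions are all placeholders (in the actual method they would be Stable Diffusion's text encoder, UNet, and the diffusion denoising loss over images of the erased concept), not the paper's code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the sanitized text-to-image model (hypothetical; the real
# setup uses Stable Diffusion's CLIP text encoder + UNet).
torch.manual_seed(0)
dim = 16
frozen_model = nn.Linear(dim, dim)
for p in frozen_model.parameters():
    p.requires_grad_(False)            # model weights are never updated

w_before = frozen_model.weight.clone() # snapshot to verify weights stay fixed

# A single new embedding vector plays the role of the learned pseudo-token
# (e.g. "<erased-concept>"). Only this vector receives gradients.
pseudo_token = torch.randn(dim, requires_grad=True)
optimizer = torch.optim.Adam([pseudo_token], lr=1e-2)

# Stand-in for the denoising target derived from example images of the
# "erased" concept.
target = torch.randn(dim)

init_loss = ((frozen_model(pseudo_token) - target) ** 2).mean().item()
for _ in range(200):
    optimizer.zero_grad()
    loss = ((frozen_model(pseudo_token) - target) ** 2).mean()
    loss.backward()
    optimizer.step()
final_loss = loss.item()
```

After training, conditioning the frozen model on the learned pseudo-token recovers the targeted behavior even though its weights are byte-for-byte unchanged, which is exactly the property the paper exploits against sanitized models.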

Summary

Motivation

Method

Method Validation