Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

Link
Abstract

Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed, inappropriate image prompts (I2P), containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.

Synth

Problem:: Text-to-image models trained on massive datasets generate inappropriate content / existing safeguards are easily circumvented

Solution:: Extends classifier-free guidance to suppress the direction of inappropriate content / proposes the I2P benchmark for evaluating inappropriate content generation

Novelty:: Suppresses inappropriate content by leveraging knowledge the model has already acquired / implements a safety mechanism without degrading image quality

Note:: Requires "inappropriate concepts" to be predefined in natural language → since words act within the context of the whole sentence, generation quality may degrade for ambiguous prompts
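The extended classifier-free guidance mentioned in the Solution field can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: the function names, the scalar hyperparameters (`guidance_scale`, `safety_scale`, `threshold`), and the element-wise masking rule are assumptions chosen to show the core idea of subtracting a masked "unsafe concept" direction from the usual guidance term.

```python
import numpy as np

def sld_guidance(eps_uncond, eps_prompt, eps_safety,
                 guidance_scale=7.5, safety_scale=1.0, threshold=0.0):
    """Simplified sketch of SLD-style safety guidance.

    eps_uncond : unconditional noise estimate eps(z_t)
    eps_prompt : prompt-conditioned estimate eps(z_t, c_p)
    eps_safety : estimate conditioned on the unsafe concept eps(z_t, c_S)
    """
    # Element-wise mask: intervene only where the prompt-conditioned
    # estimate moves toward the unsafe concept (difference below threshold).
    mu = np.where(eps_prompt - eps_safety < threshold, safety_scale, 0.0)
    # Safety term: masked direction from the unconditional estimate
    # toward the unsafe concept.
    gamma = mu * (eps_safety - eps_uncond)
    # Classifier-free guidance with the safety term subtracted.
    return eps_uncond + guidance_scale * (eps_prompt - eps_uncond - gamma)
```

Where the mask is zero everywhere, this reduces to vanilla classifier-free guidance, which is why image quality is preserved for benign prompts: the safety term only activates in regions of the latent that drift toward the unsafe concept.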

Summary

Motivation

Method

file-20250407162446771.png|600

Inappropriate Image Prompts (I2P)

Method Validation

Inappropriate Content Generation by Stable Diffusion

Effectiveness of SLD