Multi-layer Learnable Attention Mask for Multimodal Tasks

#transformer-architecture #learnable-attention-mask #multimodal-encoders

May 22, 2025 05:47 AM

May 22, 2025 07:34 AM

Link

Abstract

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address the challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the LAM, exemplifying its ability to enhance model performance while mitigating redundant computations. This pioneering approach presents a significant advancement in enhancing the understanding of complex scenarios, such as in movie understanding.

Synth

Problem:: Transformer self-attention loses effectiveness in multimodal settings because token granularity varies and long sequences are computationally expensive / Although not all tokens are equally important, computer vision lacks mechanisms that adjust for this dynamically

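Solution:: Proposes the Learnable Attention Mask (LAM), which looks at the entire input sequence, prioritizes important tokens, and regulates the attention map accordingly / Because each Transformer layer handles a different aspect of the information, extends it to a multi-layer LAM that dynamically learns a mask suited to each layer's context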

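Novelty:: A dynamically learned attention mask conditioned on the entire input sequence globally captures complex associations among multimodal tokens / An ablation study shows that LAM's gains come from its selective-attention ability rather than from the added parameters alone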

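Note:: Applying a mask does not reduce the number of tokens that are computed, so the compute cost does not decrease / An ideally trained self-attention would work without LAM, but LAM can be seen as an inductive bias that makes the model easier to train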

Summary

Motivation

(a) shows temporally aligned video tokens and audio tokens from a movie scene.

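The Transformer's self-attention mechanism is effective at computing local associations between tokens, but several shortcomings appear, particularly when processing tokens from different modalities → different modalities carry information at different granularities, which can cause problems.
- A specific audio token such as "Joanna's shouts" in (a) can be associated with several video tokens in the scene → hard to handle effectively with self-attention, which mainly captures local token-to-token associations (see the self-attention map in (b))
- These associations go beyond simple temporal adjacency: each token in one modality can relate to multiple tokens in another modality, and even to token sub-sequences, which adds complexity.
In addition, longer token sequences provide richer information, but the computational demand of the attention mechanism grows with input token length, which limits how many tokens can be processed effectively.
Not all tokens in a complex input sequence are equally important → proposes an approach that prioritizes important tokens through a global view of the entire sequence
- Dynamically updated masking mechanisms have proven effective in prior work, but they remain relatively under-explored in computer vision, so the paper analyzes the effect of dynamic token masking across a variety of vision tasks.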

Method

Learnable Attention Mask (LAM)

The LAM module takes the entire token sequence $T$ (single-modality or multimodal) as input and outputs a mask $M$; it is built mainly as a feedforward network (FFN) of linear layers.
For self-attention, the mask $M$ has size $L_{t} \times L_{t}$ ($L_{t}$: input sequence length) and is expressed as:
- $X (T) \to M$
For cross-attention, the mask $M$ has size $L_{q} \times L_{k}$ ($L_{q}$: query length, $L_{k}$: key length), and the dot product of the query $Q$ and key $K$ is used as input:
- $X (Q K^{T}) \to M$
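A minimal NumPy sketch of the cross-attention case may help fix the shapes. The sizes, the flattening of $Q K^{T}$ into a vector, and the single linear map standing in for the LAM are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L_q, L_k, d = 3, 5, 8                 # hypothetical query length, key length, width
Q = rng.normal(size=(L_q, d))
K = rng.normal(size=(L_k, d))

scores = Q @ K.T                      # (L_q, L_k): the LAM input for cross-attention

# Stand-in for the LAM: one linear map on the flattened scores (assumption).
W = rng.normal(0.0, 0.1, size=(L_q * L_k, L_q * L_k))
b = np.zeros(L_q * L_k)
M = (W @ scores.reshape(-1) + b).reshape(L_q, L_k)   # mask has the same shape as Q K^T
```

The point is only that the mask matches the score matrix entry for entry, so each query-key pair gets its own mask value.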
The forward pass of the LAM module is defined as follows ($L$: total number of layers in the LAM):
- $h_{1} = \mathrm{ReLU}(W_{1} x + b_{1})$
- $h_{i} = \mathrm{ReLU}(W_{i} h_{i-1} + b_{i})$, for $i \in \{2, 3, \ldots, L-1\}$
- $M = W_{L} h_{L-1} + b_{L}$
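The recurrence above can be sketched in NumPy as a plain stack of linear layers with ReLU on every layer except the last. The helper names `init_lam` and `lam_forward`, the layer widths, the random initialization, and the choice to flatten $T$ into a single vector $x$ are hypothetical; the note does not say how $x$ is built from $T$.

```python
import numpy as np

def init_lam(sizes, seed=0):
    """Create (W, b) pairs for layer widths sizes = [d_in, h_1, ..., d_out]."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def lam_forward(x, params):
    """h_1 = ReLU(W_1 x + b_1), ..., M = W_L h_{L-1} + b_L (no ReLU on the last layer)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)   # hidden layers use ReLU
    W_L, b_L = params[-1]
    return W_L @ h + b_L                 # final layer is purely linear

L_t, d = 4, 8                            # hypothetical sequence length and token width
T = np.random.default_rng(1).normal(size=(L_t, d))
x = T.reshape(-1)                        # assumption: flatten the token sequence
params = init_lam([L_t * d, 16, L_t * L_t])   # L = 2 layers in this toy setup
M = lam_forward(x, params).reshape(L_t, L_t)  # self-attention mask, L_t x L_t
```

Because the last layer has no activation, the mask entries are unbounded and can be negative; whether the paper normalizes them further is not stated here.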
The generated mask can be applied across the entire Transformer layer stack, or scaled individually for each layer.
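One way to picture the two options is a single shared mask reused by every layer, optionally multiplied by a per-layer scalar. This section does not state how $M$ enters the attention computation, so the sketch below assumes element-wise multiplication with the scaled scores before softmax (additive masking would be another option). The function `masked_self_attention`, the `layer_scales` vector, and the omission of the Q/K/V projections are illustrative choices, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(X, M, scale=1.0):
    """Single-head attention with Q = K = V = X (projections omitted for brevity).
    Assumption: the mask rescales the scores element-wise before softmax."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)           # the full L_t x L_t matrix is still computed
    return softmax(scores * (scale * M)) @ X

rng = np.random.default_rng(0)
L_t, d, n_layers = 4, 8, 3                  # hypothetical sizes
X = rng.normal(size=(L_t, d))
M = rng.normal(size=(L_t, L_t))             # stand-in for a LAM output
layer_scales = np.ones(n_layers)            # hypothetical per-layer scalars; learnable in practice
for s in layer_scales:
    X = masked_self_attention(X, M, scale=s)
```

Setting every entry of `layer_scales` to the same value recovers the shared-mask case. Note that the score matrix keeps its full size in every layer, which matches the Note above that masking alone does not lower the compute cost.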