Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Link
Abstract

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.

Synth

Problem:: The Degeneration-of-Thought (DoT) problem in LLM self-reflection / once an LLM becomes confident in its initial answer, it cannot generate novel thoughts through reflection even when that answer is wrong / self-correction fails due to biased perception, resistance to change, and limited external feedback

Solution:: Proposes the Multi-Agent Debate (MAD) framework / multiple agents argue in a "tit for tat" state to induce divergent thinking / a judge ends the debate at the right moment via an adaptive break and extracts the final answer / meta prompts control the level of constructive confrontation
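The debate loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `affirmative`, `negative`, and `judge` are hypothetical stand-ins for LLM calls (the real framework prompts actual models; see the linked repository).

```python
# Sketch of the MAD loop: two debaters argue tit for tat, and a judge
# either stops the debate early (adaptive break, "discriminative mode")
# or is forced to extract a final answer when rounds run out.

def run_debate(affirmative, negative, judge, question, max_rounds=3):
    """Debate until the judge accepts an answer or the round limit hits.

    affirmative/negative: fn(question, history) -> argument string
    judge: fn(question, history) -> (done: bool, answer: str)
    """
    history = []
    for _ in range(max_rounds):
        history.append(("affirmative", affirmative(question, history)))
        history.append(("negative", negative(question, history)))
        done, answer = judge(question, history)  # discriminative mode
        if done:                                 # adaptive break
            return answer
    _, answer = judge(question, history)         # extractive mode: force a verdict
    return answer
```

In the paper, the meta prompt additionally tunes how adversarial the debaters are; in this sketch that knob would live inside the stand-in agent functions.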

Novelty:: First to define the DoT problem and demonstrate it empirically / a creative approach that transplants the human debate mechanism into LLMs / shows that GPT-3.5-Turbo + MAD can outperform GPT-4 / finds that performance is best when all agents use the same LLM, and uncovers judge bias when they do not

Note:: Contains many practical takeaways for multi-agent setups (role assignment by model type and capability, effects of the debate's direction, effects of debate length, etc.)

Summary

Motivation

Limitations of current LLMs and the problems with self-reflection

The Degeneration-of-Thought (DoT) problem, first defined in this paper

file-20250523205305594.png|550

The three main causes of DoT

Method

Core idea of the Multi-Agent Debate (MAD) framework

file-20250523205343355.png|775

Problem: When circle A (radius r) rolls once around circle B (radius 3r), how many total rotations does it make?
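This is the coin rotation paradox: the intuitive answer, 3 (the ratio of circumferences), is wrong. Rolling around the outside, circle A's center travels a circle of radius 3r + r = 4r, so A turns (R + r) / r = 4 times. A quick check (the helper function is ours, for illustration):

```python
# Rolling a circle of radius r once around the OUTSIDE of a circle of
# radius R (no slipping): the rolling circle's center travels a circle
# of radius R + r, so the total number of rotations is (R + r) / r.

def rotations_rolling_outside(R, r):
    """Total rotations of a radius-r circle rolled around a radius-R circle."""
    return (R + r) / r

r = 1.0
print(rotations_rolling_outside(3 * r, r))  # 4.0, not the intuitive 3.0
```

The extra rotation comes from the center itself completing one revolution around B, which is exactly the step the "intuitive" circumference-ratio answer omits.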

The three components of MAD

Experimental validation

Common MT (Commonsense Machine Translation): the challenge of commonsense-aware translation

Counter-Intuitive AR: the pitfalls of counter-intuitive arithmetic reasoning

Further analysis