Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models

Hallucination detection based on the overthinking behavior of Vision Language Models

CVPR Findings 2026
1 Australian Institute for Machine Learning, The University of Adelaide
2 Hanoi University of Science and Technology
Equal Contribution   *Lead Author
Teaser: Overthinking causes hallucination in VLMs

First paper to analyze the layer-wise thought process of VLMs and identify "overthinking" as a key factor in hallucination, leading to a novel hallucination detection method with significant performance improvements.

Abstract

Background: Vision-Language Models (VLMs) often produce responses that are irrelevant to the visual content; these responses are called hallucinations.

Objective: This study analyses the layer-wise thought processes of VLMs to understand and detect hallucinations.

Analysis Results: By observing the layer-wise tokens via Logit-Lens, we find that models repeatedly revise object hypotheses across layers before committing to an incorrect answer, a behavior we term "overthinking".

Method: We introduce the Overthinking Score, a metric measuring how many competing hypotheses the model entertains and how unstable these hypotheses are across layers. Light-weight binary classifiers are trained using this score to detect hallucinations.

Results: The Overthinking Score significantly improves hallucination detection, reaching 78.95% F1 on MSCOCO and 71.58% F1 on AMBER.

Exploring the VLM's Thought Process

Inconsistent thought process or overthinking: VLMs explore multiple object hypotheses across layers, often producing irrelevant object tokens. We term this phenomenon "overthinking". In the first example given below, the VLM consistently thinks about the correct object "cat" in the intermediate and late layers, resulting in a correct answer. In the second example, the VLM entertains multiple hypotheses (apple, sink, lemon and soap) across layers, leading to an incorrect final token "dish".
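
The contrast between the two examples can be made concrete with a minimal sketch. It assumes the Overthinking Score is the number of unique top-1 tokens across decoder layers, as defined in the feature list further down. The per-layer traces and the switch count are hypothetical illustrations, not values or features reported by the authors.

```python
def overthinking_stats(layer_tokens):
    """Return (unique_count, switch_count) for a list of per-layer top-1 tokens.

    unique_count follows the page's definition of the Overthinking Score.
    switch_count (how often the top token changes between adjacent layers)
    is an illustrative extra, not a feature named on this page.
    """
    unique_count = len(set(layer_tokens))
    switch_count = sum(
        1 for prev, cur in zip(layer_tokens, layer_tokens[1:]) if prev != cur
    )
    return unique_count, switch_count


# Hypothetical traces, loosely modeled on the two examples in the text.
stable_trace = ["cat", "cat", "cat", "cat", "cat"]
unstable_trace = ["apple", "sink", "lemon", "soap", "dish"]

print(overthinking_stats(stable_trace))    # (1, 0)
print(overthinking_stats(unstable_trace))  # (5, 4)
```

Under these assumptions, a stable trace scores 1 and a trace that keeps revising its hypothesis scores higher, which is the signal the detector relies on.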

Presence of confounders triggers overthinking: Semantically co-occurring objects (confounders) in the VLM's layer-wise thought chain cause it to entertain multiple hypotheses, leading to overthinking and hallucination. In the second example below, the confounders "sink" and "soap" cause the VLM to entertain multiple hypotheses across layers, leading to the semantically related but incorrect answer "dish".

An example illustrating the overthinking behavior and confounding propagation in VLMs.

How We Detect Hallucinations

An overview of the hallucination detection method
  1. Prefix Prompting: We use prefix prompting to obtain the next predicted token. Initially, the model is prompted with the query "Describe this image.".
  2. Logit Lens: We apply the logit lens technique to observe the layer-wise thought process of the VLM, tracking how object hypotheses evolve across layers.
  3. Feature Extraction: We extract the following features from the layer-wise thought process:
    • Overthinking Score: A metric defined as the number of unique tokens emitted across the decoder layers.
    • Layer-wise Entropy: A measure of the uncertainty or diversity of the VLM's object hypotheses across layers.
    • Next token to Image Attention: For each layer, we measure the average attention from the next predicted token to the image features.
    • Next token to Text Attention: For each layer, we measure the average attention from the next predicted token to the text features.
  4. Detection Model: We train the following light-weight binary classifiers using the extracted features to detect hallucinations in VLM outputs.
    • Logistic Regression
    • Gradient Boosting (GB)
    • Multi-layer Perceptron (MLP)
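
Steps 2 and 3 can be sketched in a few lines of plain Python. The logit lens projects each layer's hidden state through the model's output (unembedding) matrix to get a per-layer token distribution. The sketch below uses a toy vocabulary, toy hidden states and a toy unembedding matrix, all hypothetical, and it omits the final normalization layer that real models apply before unembedding. Reading "Layer-wise Entropy" as the Shannon entropy of each layer's token distribution is an assumption, not the authors' stated definition.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def logit_lens(hidden_states, unembed, vocab):
    """For each layer's hidden state, project onto the vocabulary and
    return (top-1 token, entropy of the token distribution)."""
    trace = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        top = max(range(len(vocab)), key=lambda i: probs[i])
        trace.append((vocab[top], entropy(probs)))
    return trace


# Toy setup (hypothetical): 3-token vocabulary, 2-dimensional hidden states.
vocab = ["cat", "dog", "sink"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per vocab token
hidden_states = [[0.0, 2.0], [1.5, 1.0], [3.0, 0.0]]  # one vector per layer

trace = logit_lens(hidden_states, unembed, vocab)
tokens = [tok for tok, _ in trace]
entropies = [ent for _, ent in trace]
overthinking_score = len(set(tokens))  # unique top-1 tokens across layers

print(tokens, overthinking_score)
```

In a real pipeline the hidden states would come from the VLM's decoder layers at the position of the next predicted token. The resulting scores, entropies and attention averages would then form the feature vector passed to the binary classifier.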

Performance Evaluation

We evaluate hallucination detection performance on the MSCOCO benchmark across three VLMs (LLaVA-1.5, Gemma-3, Qwen3-VL), and assess out-of-distribution (OOD) generalisation on the AMBER benchmark using LLaVA-1.5. Averaged over the three VLMs, our method outperforms prior approaches on every metric, with the Gradient Boosting variant achieving the best overall F1: 78.95% on MSCOCO (LLaVA-1.5) and 71.58% on AMBER. Bold denotes the best result; underline denotes the second best.

Object hallucination detection performance (in %) on MSCOCO dataset measured using AUC, AP and F1 across three different VLMs.

Method           LLaVA-1.5              Gemma-3                Qwen3-VL               Avg.
                 AUC    AP     F1       AUC    AP     F1       AUC    AP     F1       AUC    AP     F1
SVAR             85.12  50.68  69.35    74.11  32.75  47.84    75.56  35.69  50.20    78.26  39.71  55.80
HalLoc           80.38  53.61  73.68    79.27  49.96  67.11    83.85  59.50  74.75    81.17  54.36  71.85
MetaToken (LR)   85.41  52.59  72.88    73.91  28.78  48.27    79.49  27.60  47.27    79.60  36.32  56.14
MetaToken (GB)   88.95  61.03  75.95    77.23  34.75  67.15    84.21  38.06  74.43    83.46  44.61  72.51
MetaToken (MLP)  86.81  55.72  73.89    78.40  33.63  60.03    86.29  47.00  68.03    83.83  45.45  67.32
Ours (LR)        86.85  59.16  72.88    76.68  31.20  67.53    77.85  37.31  55.90    80.46  42.56  65.44
Ours (GB)        89.66  63.04  78.95    85.59  59.89  74.54    86.65  61.69  74.43    87.30  61.54  75.97
Ours (MLP)       89.73  63.81  75.37    85.38  56.84  72.07    86.89  53.72  71.15    87.33  58.12  72.86
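
The Avg. column is consistent with an unweighted mean over the three VLMs. A quick check on three rows of F1 scores copied from the table illustrates this; the helper name `mean_f1` is hypothetical.

```python
# F1 scores (in %) copied from the table, in the order
# LLaVA-1.5, Gemma-3, Qwen3-VL.
f1_per_model = {
    "HalLoc": [73.68, 67.11, 74.75],
    "MetaToken (GB)": [75.95, 67.15, 74.43],
    "Ours (GB)": [78.95, 74.54, 74.43],
}


def mean_f1(scores):
    """Unweighted mean, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)


averages = {name: mean_f1(scores) for name, scores in f1_per_model.items()}
print(averages)
```

The computed means match the listed averages of 71.85, 72.51 and 75.97, so the roughly 3.5-point F1 gap between Ours (GB) and the strongest baseline is an average over models rather than a per-model result.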

Qualitative comparison of hallucination detection performance with current methods such as SVAR and MetaToken.
