MMPerspective

Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

¹University of Rochester, ²Carnegie Mellon University

Introduction

Perspective has long served as a cornerstone for representing three-dimensional reality on two-dimensional surfaces, enabling humans to infer spatial structure and depth from flat images. This capability is central to artistic creation, scientific visualization, and machine perception. Prior work on detecting vanishing points and key lines typically relies on precise geometric models or specialized datasets, and it struggles to capture perspective-related semantics or to generalize to broader tasks. Recent multimodal large language models (MLLMs) such as GPT-4o and Gemini exhibit strong visual perception capabilities, yet their understanding of perspective remains largely untested. While these models excel at high-level vision-language tasks, existing benchmarks rarely evaluate their geometric reasoning abilities, so it remains unclear whether MLLMs can identify vanishing points, understand line convergence, reason about spatial relationships, or maintain consistent interpretations across viewpoints. Our key contributions are:
  1. We introduce MMPerspective, the first dedicated benchmark for evaluating perspective understanding in MLLMs, spanning 10 tasks across three dimensions and comprising 2,711 instances with 5,083 question-answer (QA) pairs (an illustrative evaluation sketch follows this list).
  2. We conduct a comprehensive evaluation of 43 representative MLLMs and reveal key limitations in perspective perception, reasoning, and robustness.
  3. We offer new insights into current model bottlenecks and provide guidance toward building geometry-aware, spatially grounded multimodal systems.
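
To make the QA format concrete, below is a minimal sketch of how a multiple-choice evaluation loop over such QA pairs might look. The file name, the field names, and the ask_model callable are hypothetical placeholders for illustration, not MMPerspective's actual data schema or API.

    import json
    from collections import defaultdict

    def evaluate(qa_path, ask_model):
        """Score a model on multiple-choice QA pairs grouped by task.

        `qa_path` and the field names below are assumptions for illustration;
        the released benchmark may organize its QA pairs differently.
        """
        with open(qa_path) as f:
            qa_pairs = json.load(f)  # e.g. [{"task", "image", "question", "options", "answer"}, ...]

        correct, total = defaultdict(int), defaultdict(int)
        for qa in qa_pairs:
            # `ask_model` is assumed to return a single option letter such as "B".
            prediction = ask_model(qa["image"], qa["question"], qa["options"])
            total[qa["task"]] += 1
            correct[qa["task"]] += prediction.strip().upper() == qa["answer"]

        # Per-task accuracy in percent.
        return {task: 100.0 * correct[task] / total[task] for task in total}

Per-task accuracies of this form correspond to the task columns in the leaderboard below.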

🏆 Leaderboard

By default, the leaderboard is sorted by the Overall score. To sort by another metric, click the corresponding cell.

All values are percentages (higher is better). Columns: Model; Perspective Perception tasks (VPP, CLP, VAP, LDP); Perspective Reasoning tasks (PTR, LRR, OVR, PTS, VPC); aggregate accuracies over perception and reasoning (P Acc, R Acc, Overall); and Perspective Robustness (P'Robust). A sketch of how the aggregate columns can be reproduced follows the table.

Model VPP CLP VAP LDP PTR LRR OVR PTS VPC P Acc R Acc Overall P'Robust
MLLMs: < 7B
InternVL2.5-2B 47.4 22.8 13.0 65.3 62.2 31.8 16.6 30.0 50.0 37.1 38.1 37.7 46.5
Qwen2.5-VL-3B 27.6 22.8 56.8 55.1 32.3 32.5 15.9 39.4 44.7 40.6 33.0 36.3 6.4
InternVL2.5-4B 32.1 26.0 59.3 64.2 28.2 30.5 10.7 37.1 36.8 45.4 28.7 36.1 20.6
InternVL3-2B 22.4 28.5 50.0 44.6 43.1 31.1 34.4 25.4 43.0 36.4 35.4 35.8 23.9
InternVL2-4B 26.9 12.2 54.3 60.4 18.0 40.4 18.8 24.4 45.6 38.4 29.4 33.4 7.9
Qwen2-VL-2B 12.2 19.5 49.4 35.8 23.3 24.5 28.9 32.9 47.4 29.2 31.4 30.4 4.7
InternVL3-1B 19.9 13.0 53.7 20.7 16.3 8.6 23.7 21.6 47.4 26.8 23.5 25.0 13.8
InternVL2-1B 20.5 20.3 15.4 24.2 24.1 11.3 24.0 22.1 44.7 20.1 25.2 23.0 6.7
LLaVA-OV-1B 13.5 14.6 35.8 24.2 15.2 19.2 19.5 22.1 40.4 22.0 23.3 22.7 7.8
InternVL2-2B 26.9 26.0 3.1 36.8 18.8 12.6 23.1 21.1 34.2 23.2 22.0 22.5 12.3
InternVL2.5-1B 14.7 23.6 0.6 33.0 20.1 11.3 13.3 34.7 45.6 18.0 25.0 21.9 18.2
MLLMs: 7B - 9B
InternVL2.5-8B 38.5 17.9 53.1 75.4 40.8 48.3 34.7 24.9 67.5 46.2 43.3 44.6 22.3
Qwen2.5-VL-7B 35.3 29.3 70.4 73.7 42.4 44.4 32.1 28.6 44.7 52.1 38.5 44.5 15.3
Qwen2-VL-7B 34.6 25.2 63.0 64.2 57.1 49.0 27.3 31.0 46.5 46.7 42.2 44.2 25.5
InternVL3-9B 37.2 33.3 63.0 77.5 30.7 53.0 27.9 23.9 43.9 52.8 35.9 43.4 7.3
InternVL3-8B 42.3 27.6 67.9 81.8 38.1 46.4 20.8 23.9 32.5 54.9 32.3 42.4 15.9
LLaVA-OV-7B 34.0 33.3 51.2 57.9 44.9 53.0 19.8 35.2 49.1 44.1 40.4 42.0 15.9
Eagle-X4-8B 39.1 17.1 46.9 47.7 65.3 37.1 18.2 32.9 68.4 37.7 44.4 41.4 55.3
InternVL2-8B 33.3 19.5 59.3 73.3 27.1 36.4 42.5 22.1 48.2 46.4 35.3 40.2 7.9
LLaVA-Next-m-7B 35.9 21.1 35.2 50.5 17.7 37.7 15.6 27.2 46.5 35.7 28.9 31.9 16.4
Eagle-X5-7B 25.0 26.0 24.7 34.7 22.1 46.4 15.6 20.7 42.1 27.6 29.4 28.6 15.9
LLaVA-Next-v-7B 16.7 20.3 40.7 39.6 16.3 44.4 19.8 16.4 7.0 29.3 20.8 24.6 16.4
MLLMs: 10B - 30B
InternVL2.5-26B 41.7 35.0 55.6 81.8 65.5 46.4 43.5 34.3 46.5 53.5 47.2 50.0 33.7
InternVL3-14B 39.1 26.0 73.5 73.3 36.5 34.4 54.5 28.2 54.4 53.0 41.6 46.7 13.5
InternVL2-26B 28.2 35.0 61.1 74.0 50.7 41.7 28.9 28.6 43.0 49.6 38.6 43.5 26.5
Eagle-X4-13B 42.3 26.8 41.4 44.6 65.8 20.5 28.2 31.0 57.9 38.8 40.7 39.8 53.8
LLaVA-Next-13B 7.7 17.1 54.3 34.7 66.7 24.5 13.0 26.8 43.9 28.5 35.0 32.1 51.1
MLLMs: 30B - 70B
InternVL2.5-38B 46.8 36.6 67.9 89.5 58.4 51.7 38.3 44.1 44.7 60.2 47.5 53.1 19.1
InternVL3-38B 45.5 35.0 71.0 90.9 37.3 43.0 56.8 37.6 43.0 60.6 43.5 51.1 9.1
Qwen2.5-VL-32B 35.9 22.8 68.5 73.7 62.0 37.7 33.8 35.2 45.6 50.2 42.9 46.1 25.5
Eagle-X5-34B 36.5 28.5 60.5 79.6 19.5 51.0 24.0 39.0 63.2 51.3 39.3 44.6 16.0
InternVL2-40B 26.3 22.0 66.0 76.1 43.2 55.0 27.3 25.8 47.4 47.6 39.7 43.2 12.6
MLLMs: > 70B
InternVL3-78B 43.6 39.8 69.8 89.1 55.9 57.6 40.3 38.0 42.1 60.6 46.8 52.9 25.5
InternVL2.5-72B 47.4 30.1 67.3 89.5 65.2 53.6 41.9 32.4 37.7 58.6 46.2 51.7 29.7
Qwen2.5-VL-72B 41.7 31.7 67.9 82.1 65.3 38.4 39.9 39.0 38.6 55.8 44.3 49.4 24.3
Qwen2-VL-72B 34.6 18.7 70.4 82.5 68.8 52.3 38.6 35.2 42.1 51.5 47.4 49.2 25.0
LLaVA-OV-72B 25.6 26.0 75.9 81.1 81.4 55.6 22.4 28.2 31.6 52.2 43.8 47.5 53.1
LLaVA-Next-72B 21.8 21.1 66.0 32.3 65.7 49.7 22.4 27.2 30.7 35.3 39.1 37.4 33.2
InternVL2-72B 26.9 18.7 57.4 56.8 56.1 47.0 24.7 24.4 7.9 40.0 32.0 35.6 22.9
MLLMs: Proprietary
Gemini-2-flash (CoT) 69.2 49.6 72.8 87.4 78.7 32.5 40.9 39.9 43.9 69.8 47.2 57.2 45.9
GPT-4o (CoT) 45.5 46.3 70.4 88.8 81.4 47.0 34.4 37.6 34.2 62.7 46.9 54.0 49.9
Gemini-2-flash 64.7 35.0 73.5 87.0 71.3 34.4 29.9 40.8 41.2 65.0 43.5 53.1 30.7
GPT-4o 42.9 35.0 66.0 86.0 82.0 41.7 29.9 33.8 32.5 57.5 44.0 50.0 49.9
Gemini-1.5-flash (CoT) 30.1 28.5 66.7 79.3 51.0 39.7 20.1 31.5 35.1 51.1 35.5 42.4 15.3
GPT-4o-mini 35.3 24.4 43.2 71.6 43.1 29.8 14.6 31.0 45.6 43.6 32.8 37.6 10.8
Gemini-1.5-flash 26.9 25.2 59.3 70.5 26.4 27.8 18.2 26.8 22.8 45.5 24.4 33.8 10.6
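
The aggregate columns are consistent with simple unweighted means of the per-task scores: P Acc over the four perception tasks, R Acc over the five reasoning tasks, and Overall over all nine (e.g., for InternVL2.5-2B, (47.4+22.8+13.0+65.3)/4 ≈ 37.1). Below is a short sketch of that aggregation; it is inferred from the reported numbers rather than taken from the official scoring code, and P'Robust is reported separately and not covered by this sketch.

    PERCEPTION_TASKS = ["VPP", "CLP", "VAP", "LDP"]
    REASONING_TASKS = ["PTR", "LRR", "OVR", "PTS", "VPC"]

    def aggregate(scores):
        """Compute P Acc, R Acc, and Overall as unweighted means of per-task scores.

        This reproduces the leaderboard aggregates for the rows checked, but it is
        an inferred formula, not the benchmark's official evaluation script.
        """
        p = [scores[t] for t in PERCEPTION_TASKS]
        r = [scores[t] for t in REASONING_TASKS]
        return {
            "P Acc": round(sum(p) / len(p), 1),
            "R Acc": round(sum(r) / len(r), 1),
            "Overall": round(sum(p + r) / (len(p) + len(r)), 1),
        }

    # Example: the InternVL2.5-2B row from the table above.
    print(aggregate({"VPP": 47.4, "CLP": 22.8, "VAP": 13.0, "LDP": 65.3,
                     "PTR": 62.2, "LRR": 31.8, "OVR": 16.6, "PTS": 30.0, "VPC": 50.0}))
    # -> {'P Acc': 37.1, 'R Acc': 38.1, 'Overall': 37.7}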

Benchmark

📊 Statistics & Analysis

Top MLLMs' performance on MMPerspective across 10 tasks spanning three dimensions of perspective understanding: perception, reasoning, and robustness.

Question category hierarchy: question types in the MMPerspective benchmark for evaluating MLLMs.

Difficulty distribution across task categories. A question answered correctly by more than 60% of the evaluated MLLMs is labeled "Easy"; one answered correctly by fewer than 10% is labeled "Super Hard."
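
A minimal sketch of that labeling rule, assuming one correctness flag per evaluated model for a given question; the input format and the "Medium" bucket name are assumptions for illustration.

    def difficulty_label(correct_flags):
        """Label a question by the fraction of evaluated MLLMs that answer it correctly."""
        rate = sum(correct_flags) / len(correct_flags)
        if rate > 0.60:          # answered correctly by more than 60% of models
            return "Easy"
        if rate < 0.10:          # answered correctly by fewer than 10% of models
            return "Super Hard"
        return "Medium"          # assumed name for the in-between bucket

    # Example: 5 of the 43 evaluated models answer correctly (~11.6%), so neither threshold applies.
    print(difficulty_label([True] * 5 + [False] * 38))  # -> Medium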

🧪 Experiments





🔭 Visualization Results





Citation


      @article{tang2025mmperspective,
        title = {MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness},
        author = {Tang, Yunlong and Liu, Pinxin and Feng, Mingqian and Tan, Zhangyun and Mao, Rui and Huang, Chao and Bi, Jing and Xiao, Yunzhong and Liang, Susan and Hua, Hang and Vosoughi, Ali and Song, Luchuan and Zhang, Zeliang and Xu, Chenliang},
        journal = {},
        year = {2025}
      }