VidComposition

Can MLLMs Analyze Compositions in Compiled Videos?

¹University of Rochester, ²Arizona State University

Introduction

The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension and lack a detailed assessment of video composition understanding: the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1,706 multiple-choice questions, covering compositional aspects such as camera movement, camera angle, shot size, narrative structure, and character actions and emotions. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities, highlighting the limitations of current MLLMs in understanding complex, compiled video compositions and offering insights into areas for further improvement. Our key contributions are:
  1. We introduce VidComposition, a novel, human-annotated, high-quality benchmark for evaluating fine-grained video composition understanding in MLLMs.
  2. We comprehensively evaluate 33 MLLMs for video understanding with VidComposition. The results show the challenging nature of VidComposition and the substantial gap between MLLMs' and humans' capabilities in video composition understanding.
  3. We systematically analyze the critical factors that influence the performance of MLLMs, providing potential directions for model improvement and future advancements.
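To illustrate how accuracy on a multiple-choice benchmark of this kind is typically computed, the sketch below scores a model's raw replies against per-question answer keys and aggregates overall and per-category accuracy. The file names and field names (`id`, `category`, `answer`) are hypothetical placeholders for illustration, not the released VidComposition format.

```python
# Minimal scoring sketch for a multiple-choice video benchmark.
# Assumptions (not the official VidComposition format): questions are a JSON list
# with "id", "category", and "answer" (an option letter A-D); model replies are a
# dict mapping question id -> raw text output.
import json
import re
from collections import defaultdict


def extract_choice(reply: str) -> str | None:
    """Return the first standalone option letter (A-D) found in a reply."""
    match = re.search(r"\b([A-D])\b", reply.strip().upper())
    return match.group(1) if match else None


def score(questions: list[dict], replies: dict[str, str]) -> dict[str, float]:
    """Compute overall and per-category accuracy in percent."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        pred = extract_choice(replies.get(q["id"], ""))
        for key in ("Overall", q["category"]):
            total[key] += 1
            correct[key] += int(pred == q["answer"])
    return {k: 100.0 * correct[k] / total[k] for k in total}


if __name__ == "__main__":
    with open("questions.json") as f:          # hypothetical file name
        questions = json.load(f)
    with open("model_replies.json") as f:      # hypothetical file name
        replies = json.load(f)
    for category, acc in score(questions, replies).items():
        print(f"{category}: {acc:.2f}%")
```

Free-form MLLM outputs often require more robust answer extraction than a single regular expression; the snippet only demonstrates the aggregation behind scores like those reported in the leaderboard below.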

🏆 Leaderboard

By default, this leaderboard is sorted by overall results. To view other sorted results, please click on the corresponding cell. Colored rows indicate closed-source models/APIs.

| # | Model | Organization | LLM Params | Date | Overall (%) | CA (%) | CU (%) | NU (%) | SP (%) | MA (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | LLaVA-OneVision-72B | ByteDance & NTU S-Lab | 72B | 2024/08/06 | 63.31 | 61.18 | 76.67 | 78.32 | 33.77 | 69.42 |
| 🥈 | InternVL2-40B | Shanghai AI Lab | 34B | 2024/07/04 | 60.73 | 54.95 | 69.44 | 65.47 | 56.07 | 70.25 |
| 🥉 | InternVL2-76B | Shanghai AI Lab | 70B | 2024/07/04 | 58.73 | 51.92 | 72.78 | 64.84 | 52.79 | 63.64 |
| 4 | Qwen2-VL-72B | Alibaba | 72B | 2024/09/19 | 58.68 | 50.48 | 60.00 | 71.16 | 60.00 | 46.28 |
| 5 | Video-LLaMA2-72B | DAMO Academy & Alibaba Group | 72B | 2024/08/14 | 58.62 | 54.15 | 71.67 | 65.68 | 48.52 | 59.50 |
| 6 | InternVL2-8B | Shanghai AI Lab | 7B | 2024/07/14 | 54.63 | 57.03 | 62.78 | 53.68 | 45.57 | 56.20 |
| 7 | GPT-4o | OpenAI | - | 2024/05/13 | 52.93 | 45.37 | 57.22 | 65.68 | 39.67 | 68.60 |
| 8 | Gemini-1.5-Flash | Google | - | 2024/09/24 | 52.40 | 42.65 | 65.00 | 62.74 | 42.95 | 66.94 |
| 9 | VILA-1.5-40B | NVIDIA & MIT | 34B | 2024/05 | 51.23 | 47.28 | 62.78 | 55.79 | 39.02 | 66.94 |
| 10 | GPT-4o mini | OpenAI | - | 2024/07/18 | 50.23 | 44.41 | 53.89 | 60.21 | 38.69 | 64.46 |
| 11 | Gemini-1.5-Pro | Google | - | 2024/09/24 | 49.36 | 45.37 | 67.22 | 41.89 | 48.52 | 74.38 |
| 12 | Qwen2-VL-7B | Alibaba | 7B | 2024/08/30 | 49.30 | 34.35 | 56.67 | 61.05 | 57.38 | 48.76 |
| 13 | Oryx-7B | Tsinghua University & Tencent & S-Lab, NTU | 7B | 2024/09/20 | 48.77 | 48.40 | 67.22 | 41.47 | 50.49 | 47.11 |
| 14 | Gemini-1.5-Flash-8B | Google | 8B | 2024/10/03 | 48.59 | 48.56 | 48.33 | 52.63 | 39.02 | 57.02 |
| 15 | Video-LLaMA2.1 | Shanghai AI Lab & CUHK & SenseTime | 7B | 2024/10/15 | 47.77 | 39.94 | 60.56 | 52.84 | 46.23 | 52.89 |
| 16 | VideoChat2 | Shanghai AI Lab | - | 2024/05/22 | 47.13 | 38.98 | 60.00 | 45.47 | 56.07 | 53.72 |
| 17 | InternVL2-26B | Shanghai AI Lab | 20B | 2024/07/04 | 46.42 | 40.10 | 63.89 | 51.16 | 38.36 | 54.55 |
| 18 | LongVA | NTU S-Lab | 7B | 2024/06/24 | 43.73 | 37.54 | 51.67 | 41.89 | 49.51 | 56.20 |
| 19 | MiniCPM-V2.6 | OpenBMB | 8B | 2024/08/06 | 42.50 | 38.18 | 60.00 | 40.84 | 38.36 | 55.37 |
| 20 | InternVL2-4B | Shanghai AI Lab | 3.8B | 2024/07/04 | 41.68 | 32.11 | 51.67 | 44.00 | 48.85 | 48.76 |
| 21 | Video-LLaMA2.1-AV | DAMO Academy & Alibaba | 7B | 2024/10/22 | 41.50 | 37.06 | 58.89 | 38.11 | 39.67 | 56.20 |
| 22 | VILA-1.5-8B | NVIDIA | 8B | 2024/05 | 40.21 | 36.26 | 51.67 | 33.68 | 45.25 | 56.20 |
| 23 | GPT-4-turbo | OpenAI | - | 2023/11/06 | 39.86 | 31.79 | 46.67 | 45.47 | 37.70 | 54.55 |
| 24 | LongLLaVA | Amazon | 7B | 2024/09/09 | 38.45 | 30.83 | 47.78 | 40.21 | 43.61 | 43.80 |
| 25 | Chat-UniVi-v1.5 | PKU | 7B | 2024/04 | 28.02 | 28.91 | 31.11 | 25.26 | 23.93 | 39.67 |
| 26 | Kangaroo | Meituan & UCAS | 8B | 2024/07/17 | 37.10 | 31.79 | 51.67 | 29.05 | 53.44 | 33.06 |
| 27 | InternVL2-2B | Shanghai AI Lab | 1.8B | 2024/07/04 | 36.75 | 24.28 | 54.44 | 35.16 | 47.54 | 53.72 |
| 28 | LongVILA | NVIDIA | 8B | 2024/08 | 36.46 | 33.55 | 49.44 | 28.84 | 37.05 | 60.33 |
| 29 | AuroraCap | University of Washington | 7B | 2024/10/10 | 36.28 | 38.98 | 45.00 | 28.63 | 32.13 | 49.59 |
| 30 | Qwen2-VL-2B | Alibaba | 2B | 2024/08/30 | 36.17 | 25.08 | 47.22 | 37.05 | 47.54 | 44.63 |
| 31 | Video-LLaMA2-7B | DAMO Academy & Alibaba | 7B | 2024/06/03 | 34.35 | 25.88 | 47.78 | 28.84 | 45.90 | 50.41 |
| 32 | VILA-1.5-3B | NVIDIA | 3B | 2024/05 | 31.95 | 30.03 | 42.78 | 18.32 | 42.95 | 51.24 |
| 33 | Video-LLaVA | PKU | 7B | 2024/04/09 | 31.07 | 30.03 | 34.44 | 24.84 | 36.39 | 42.15 |
| 34 | Chat-UniVi | PKU | 7B | 2023/09/28 | 28.02 | 28.91 | 31.11 | 25.26 | 23.93 | 39.67 |
| 35 | InternVL2-1B | Shanghai AI Lab | 0.5B | 2024/07/04 | 26.61 | 24.12 | 28.33 | 26.53 | 29.84 | 28.93 |
| 36 | RANDOM | - | - | 2024/10/28 | 25.28 | 25.70 | 24.28 | 25.12 | 24.62 | 26.61 |

CA: Cinematography Analysis; CU: Character Understanding; NU: Narrative Understanding; SP: Scene Perception; MA: Making Analysis.

Benchmark

📊 Statistics & Analysis

Top MLLMs' performance on VidComposition across 15 tasks spanning five aspects of video composition understanding: Cinematography Analysis, Character Understanding, Narrative Understanding, Scene Perception, and Making Analysis.

Question Category Hierarchy: Question Types in the VidComposition Benchmark for Evaluating MLLMs.

The difficulty distribution across these five categories: if a question is answered correctly by more than 60% of MLLMs, it is labeled "Easy"; if it is answered correctly by fewer than 10% of MLLMs, it is labeled "Super Hard."
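This labeling rule can be written out directly. The sketch below assumes a per-question count of how many of the 33 evaluated MLLMs answered correctly; the name of the intermediate band is an assumption, since this page only defines "Easy" and "Super Hard."

```python
def difficulty_label(num_correct_models: int, num_models: int = 33) -> str:
    """Label a question by the fraction of evaluated MLLMs that answer it correctly."""
    ratio = num_correct_models / num_models
    if ratio > 0.60:
        return "Easy"
    if ratio < 0.10:
        return "Super Hard"
    return "Medium"  # assumption: intermediate label not specified on this page


# Example: 3 of 33 models correct -> fraction below 10% -> "Super Hard"
print(difficulty_label(3))
```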

🧪 Experiments





🔭 Visualization Results





Citation


      @article{tang2024vidcompostion,
        title = {VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?},
        author = {Tang, Yunlong and Guo, Junjia and Hua, Hang and Liang, Susan and Feng, Mingqian and Li, Xinyang and Mao, Rui and Huang, Chao and Bi, Jing and Zhang, Zeliang and Fazli, Pooyan and Xu, Chenliang},
        journal = {arXiv preprint arXiv:2411.10979},
        year = {2024}
      }