Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

¹University of Rochester, ²Sony Group Corporation, ³MIT-IBM Watson AI Lab

Introduction

Understanding small, transient textual cues in videos is crucial for text-rich video reasoning. Current video QA models often rely on single-pass perception, which can lead to hallucinations on complex visual-textual content. We introduce Video-R4, a video reasoning Large Multimodal Model (LMM) that performs visual rumination: an iterative process that mimics how humans inspect visual information by selecting frames, zooming into informative regions, re-encoding the retrieved pixels, and updating the reasoning state, forming a closed-loop read-retrieve-refocus-reinforce cycle for grounded video reasoning. We develop two datasets (Video-R4-CoT-17k and Video-R4-RL-30k) and a multi-stage rumination learning framework that progressively finetunes a 7B LMM with Supervised Fine-Tuning (SFT) and GRPO-based Reinforcement Learning (RL). Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes strongly to multi-page document QA, slides QA, and generic video QA, demonstrating the effectiveness of iterative rumination for pixel-grounded multimodal reasoning.
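
As a rough illustration of this loop, here is a minimal Python sketch; every `model.*` helper below (init_state, answer_ready, select_frames, zoom_region, encode, update_state, answer) is a hypothetical placeholder, not the actual Video-R4 interface.

```python
# Minimal sketch of the read-retrieve-refocus-reinforce loop; the model.* methods
# are hypothetical placeholders, not the actual Video-R4 API.
def ruminate(video_frames, question, model, max_steps=8):
    state = model.init_state(video_frames, question)              # read: initial pass over the video
    for _ in range(max_steps):
        if model.answer_ready(state):                             # stop once the reasoning state suffices
            break
        frames = model.select_frames(state, video_frames)         # retrieve: pick informative frames
        regions = [model.zoom_region(state, f) for f in frames]   # refocus: crop text-rich regions
        evidence = [model.encode(r) for r in regions]             # re-encode the retrieved pixels
        state = model.update_state(state, evidence)               # reinforce: update the reasoning state
    return model.answer(state)
```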

Video-R4-7B achieves state-of-the-art performance on text-rich video understanding and is comparable to same-size LMMs on general video QA benchmarks.

🔬 Method

📊 Data Curation Pipeline


The data curation pipeline creates two datasets: Video-R4-CoT-17k (for supervised rumination practice) and Video-R4-RL-30k (for reinforcement learning). The pipeline enriches the source training data with OCR and object detection results, then applies rule-based matching, trajectory synthesis, and quality control to produce high-quality training examples.

🎯 Reward Design

Our reward function combines four components: the original reward \(R\) (e.g., answer correctness and format), Diversity Reward \(R_{\text{div}}\), Representativeness Reward \(R_{\text{rep}}\), and Curiosity Reward \(R_{\text{cur}}\). The overall reward is:

\[R' = R + \lambda_{\text{div}} R_{\text{div}} + \lambda_{\text{rep}} R_{\text{rep}} + \lambda_{\text{cur}} R_{\text{cur}}\]
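
For concreteness, the combination can be computed as below; the default coefficient values are illustrative placeholders, not the paper's tuned settings.

```python
def total_reward(r_base, r_div, r_rep, r_cur,
                 lam_div=0.1, lam_rep=0.1, lam_cur=0.1):
    # R' = R + lambda_div * R_div + lambda_rep * R_rep + lambda_cur * R_cur
    # The lambda defaults above are illustrative, not the paper's tuned values.
    return r_base + lam_div * r_div + lam_rep * r_rep + lam_cur * r_cur
```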

Diversity Reward

Following prior unsupervised summarization objectives, we encourage the selected regions to be mutually dissimilar in feature space. Let \(V\) denote the set of features of the input frames and \(\hat{V} = \hat{V}^{f} \cup \hat{V}^{r}\) the set of features of the selected frames (\(\hat{V}^{f}\)) and regions (\(\hat{V}^{r}\)). We define a diversity reward that encourages the policy to avoid redundant region selections:

\[R_{\mathrm{div}}(\hat{V}^{r}) = \frac{1}{|\hat{V}^{r}|\,(|\hat{V}^{r}| - 1)} \sum_{i=1}^{|\hat{V}^{r}|} \sum_{\substack{j=1 \\ j \ne i}}^{|\hat{V}^{r}|} d(v_i, v_{j})\]

where \(d(\cdot,\cdot)\) denotes the cosine distance: \[d(v_i, v_{j}) = 1 - \frac{v_i^{\top} v_{j}}{\|v_i\|_2 \|v_{j}\|_2}.\] This objective averages the pairwise distances between all selected region features, so the reward depends on the orientation of the features rather than their magnitude.
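
A minimal NumPy sketch of this diversity term, assuming the selected region features are stacked into an (n, d) array:

```python
import numpy as np

def diversity_reward(region_feats: np.ndarray) -> float:
    """Average pairwise cosine distance over selected region features (shape: n x d)."""
    n = region_feats.shape[0]
    if n < 2:
        return 0.0  # no pair to compare
    normed = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    cos_dist = 1.0 - normed @ normed.T             # pairwise cosine distances, zeros on the diagonal
    return float(cos_dist.sum() / (n * (n - 1)))   # average over all ordered pairs with i != j
```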

Representativeness Reward

To ensure the selected frames \(\hat{V}^{f}\) remain informative, we encourage them to represent the global video frame set \(V\):

\[R_{\mathrm{rep}}(V,\hat{V}^f) = \exp\!\left(-\frac{1}{|V|} \sum_{i=1}^{|V|} \min_{v_j \in \hat{V}^{f[-1]}} \| v_i - v_{j} \|_2\right)\]

where \(\|\cdot\|_2\) denotes the Euclidean distance in the feature space and \(\hat{V}^{f[-1]}\) is the set of frame features selected in the last clipping operation. This reward measures how well the selected frames cover the entire video in feature space.
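
A corresponding NumPy sketch of the representativeness term, assuming feature matrices for all input frames and for the frames kept by the last clipping operation:

```python
import numpy as np

def representativeness_reward(all_frame_feats: np.ndarray,
                              selected_frame_feats: np.ndarray) -> float:
    """exp(-mean over all frames of the distance to the nearest selected frame).

    all_frame_feats: (N, d) features of the input frames V.
    selected_frame_feats: (M, d) features of the frames kept by the last clipping operation.
    """
    diff = all_frame_feats[:, None, :] - selected_frame_feats[None, :, :]  # (N, M, d)
    dists = np.linalg.norm(diff, axis=-1)                                  # (N, M) Euclidean distances
    return float(np.exp(-dists.min(axis=1).mean()))
```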

Curiosity Reward

To balance exploration and prevent overuse of visual operations, we incorporate a curiosity reward:

\[R_{\text{cur}}(\hat{V}_i) = \alpha \left( H - \frac{1}{K}\sum_{j=1}^{K} \mathbb{I}\big[|\hat{V}_j| > 0\big] \right)_+ \cdot \mathbb{I}\big[|\hat{V}_i| > 0\big] - \beta \left(\,|\hat{V}_i| - N\,\right)_+\]

where \(K\) is the number of rollouts, \((\cdot)_+\) is the ReLU function, \(\mathbb{I}[\cdot]\) is the indicator function, \(\alpha\) and \(\beta\) are coefficients, \(H\) is a utilization threshold, and \(N\) is the number of visual selections allowed before the penalty applies. The first term encourages exploration when visual operations are under-utilized across rollouts, while the second term penalizes excessive calls to prevent over-reliance on visual operations.
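
A small sketch of this curiosity term for a group of rollouts; the values of alpha, beta, H, and N below are placeholders rather than the paper's settings.

```python
def curiosity_reward(num_selected, i, alpha=0.1, beta=0.1, H=0.5, N=4):
    """Curiosity term for rollout i.

    num_selected[j] is |V_hat_j|, the number of frames/regions selected in rollout j;
    alpha, beta, H, and N are illustrative placeholder values.
    """
    K = len(num_selected)
    usage_rate = sum(1 for n in num_selected if n > 0) / K        # fraction of rollouts using visual ops
    bonus = alpha * max(H - usage_rate, 0.0) * (1.0 if num_selected[i] > 0 else 0.0)
    penalty = beta * max(num_selected[i] - N, 0)                  # penalize calls beyond the budget N
    return bonus - penalty
```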

✈️ Training Framework


Multi-stage rumination training framework: (1) Deliberate Rumination Practice (DRP) SFT, (2) RL after DRP-SFT, (3) Compositional Rumination Practice (CRP) SFT, (4) RL after CRP-SFT. The framework uses GRPO-based reinforcement learning with rewards for accuracy, diversity, representativeness, and curiosity.
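
A compact sketch of this four-stage schedule; the trainer functions (run_sft, run_grpo) and the split of the CoT data into deliberate and compositional subsets are illustrative assumptions, not the released training code.

```python
# Illustrative four-stage schedule matching the caption above; run_sft / run_grpo
# are placeholder trainer functions, and the cot_data keys are assumed names.
def train_video_r4(model, cot_data, rl_data, run_sft, run_grpo):
    model = run_sft(model, cot_data["deliberate"])       # Stage 1: DRP-SFT
    model = run_grpo(model, rl_data)                     # Stage 2: RL after DRP-SFT
    model = run_sft(model, cot_data["compositional"])    # Stage 3: CRP-SFT
    model = run_grpo(model, rl_data)                     # Stage 4: RL after CRP-SFT
    return model
```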

📊 Experiment Results

πŸ† Performance on M4-ViteVQA

Performance comparison on the test set of the M4-ViteVQA dataset. Best non-human scores are in bold. All LMM-based models have 7B or 8B parameters. All compared RL-based models are built on Qwen2.5-VL. Human performance is taken from the original dataset paper.

| Models | LMM-Based | Visual-Grounded | RL-Based | Task 1 - Split 1 Acc. (%) | Task 1 - Split 1 ANLS (%) | Task 1 - Split 2 Acc. (%) | Task 1 - Split 2 ANLS (%) | Task 2 Acc. (%) | Task 2 ANLS (%) |
|---|---|---|---|---|---|---|---|---|---|
| JustAsk | ✗ | ✗ | ✗ | 10.05 | 14.10 | 5.47 | 8.60 | 3.60 | 6.70 |
| All-in-one-B | ✗ | ✗ | ✗ | 10.87 | 14.80 | 5.66 | 7.80 | 3.28 | 4.60 |
| Video-LLaVA-7B | ✓ | ✗ | ✗ | 15.43 | 17.15 | 11.19 | 12.02 | 9.38 | 11.80 |
| T5-ViteVQA | ✗ | ✗ | ✗ | 22.17 | 29.10 | 16.68 | 23.80 | 9.29 | 13.60 |
| VideoLLaMA2-7B | ✓ | ✗ | ✗ | 20.76 | 23.55 | 18.33 | 20.45 | 16.54 | 21.08 |
| Qwen2-VL-7B | ✓ | ✗ | ✗ | 35.22 | 45.84 | 27.25 | 38.45 | 21.23 | 28.79 |
| TEA-L | ✗ | ✓ | ✗ | 34.78 | 43.71 | 28.43 | 38.13 | 18.83 | 28.90 |
| NVILA-8B | ✓ | ✗ | ✗ | 37.73 | 47.23 | 30.10 | 41.52 | 22.89 | 30.34 |
| GAT-L | ✗ | ✓ | ✗ | 38.30 | 48.23 | 30.90 | 41.81 | 22.13 | 30.75 |
| Video-R1-7B | ✓ | ✗ | ✓ | 37.10 | 48.25 | 33.67 | 44.94 | 43.16 | 53.37 |
| Qwen2.5-VL | ✓ | ✗ | ✗ | 26.53 | 44.91 | 24.34 | 39.60 | 32.81 | 50.82 |
| Pixel-Reasoner | ✓ | ✓ | ✓ | 52.91 | 61.44 | 48.88 | 58.23 | 58.97 | 65.32 |
| Video-R4-7B (ours) | ✓ | ✓ | ✓ | **56.17** | **65.22** | **52.69** | **61.89** | **64.21** | **69.99** |
| Human | -- | -- | -- | 85.27 | 89.30 | 78.41 | 82.80 | 82.26 | 85.10 |

🕹️ Generalization to Other Tasks

Fine-tuned only on Video-R4-CoT-17k and Video-R4-RL-30k, Video-R4 demonstrates strong generalization, effectively handling not only general video QA but also multi-page document QA and slides QA without further dataset-specific training.

Results on general video QA benchmarks

| Models | MVBench | Video-MME | Video-MMMU |
|---|---|---|---|
| Video-LLaVA-7B | 42.9 | 39.9 | -- |
| VideoLLaMA2-7B | 54.6 | 46.6 | -- |
| Qwen2.5-VL-7B | 57.4 | 53.1 | 47.8 |
| Video-R1-7B | 62.7 | 57.4 | 49.8 |
| Pixel-Reasoner | 65.4 | 54.6 | 47.7 |
| Video-R4-7B (ours) | 64.5 | 54.5 | 52.2 |

Results on the validation set of the MP-DocVQA dataset

| Models | Zero-Shot | Acc. (%) | ANLS (%) |
|---|---|---|---|
| LayoutLMv3 | ✗ | 38.47 | 45.38 |
| Big-Bird | ✗ | 41.06 | 49.29 |
| Hi-VT5 (w/o train) | ✓ | 42.10 | 58.64 |
| Longformer | ✗ | 43.91 | 52.87 |
| Hi-VT5 | ✗ | 48.28 | 62.01 |
| Video-R4-7B (ours) | ✓ | 53.21 | 62.22 |

Results on the test set of SlideVQA

| Models | Zero-Shot | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|---|
| Q-only | ✓ | 9.4 | 11.4 | 10.7 | 13.5 |
| UniVL | ✓ | 8.8 | 12.1 | 10.6 | 14.1 |
| PreasM | ✗ | 36.3 | 41.9 | 30.7 | 38.2 |
| T5 | ✗ | 35.2 | 41.3 | 29.3 | 37.9 |
| T5 + zlay | ✗ | 36.9 | 43.2 | 31.0 | 39.7 |
| LayoutT5 | ✗ | 38.9 | 44.8 | 31.7 | 39.9 |
| LayoutLMv2 | ✗ | 26.5 | 33.4 | 21.4 | 29.3 |
| FiD | ✗ | 37.6 | 42.9 | 30.4 | 38.9 |
| FiD + zlay | ✗ | 38.1 | 43.3 | 30.6 | 38.9 |
| M3D | ✗ | 41.3 | 47.1 | 33.5 | 41.7 |
| Video-R4-7B (ours) | ✓ | 49.5 | 56.0 | 43.0 | 52.2 |
| Human | -- | -- | -- | 89.8 | 93.0 |

📺 Visualizations

Visualizations of the iterative visual rumination process.

🙏 Acknowledgments

This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for their insightful discussions. We also thank the authors of the following projects for their contributions: M4-ViteVQA, SlideVQA, MP-DocVQA, Open-R1, OpenRLHF, Ray, Qwen2.5-VL, Video-R1, Pixel-Reasoner, DeepSeek-R1, MVBench, Video-MME, and Video-MMMU.

📃 Citation


      @article{tang2025videor4,
        title={Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination},
        author={Tang, Yolo Yunlong and Shimada, Daiki and Hua, Hang and Huang, Chao and Bi, Jing and Feris, Rogerio and Xu, Chenliang},
        journal={arXiv preprint arXiv:2511.17490},
        year={2025}
      }