Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

¹University of Rochester, ²Sony Group Corporation, ³MIT-IBM Watson AI Lab

Introduction

Understanding small, transient textual cues in videos is crucial for text-rich video reasoning. Current video QA models often rely on single-pass perception, which can lead to hallucinations on complex visual-textual content. We introduce Video-R4, a video reasoning Large Multimodal Model (LMM) that performs visual rumination: an iterative process that mimics how humans inspect visual information by selecting frames, zooming into informative regions, re-encoding the retrieved pixels, and updating the reasoning state, forming a closed-loop read-retrieve-refocus-reinforce cycle for grounded video reasoning. We develop two datasets (Video-R4-CoT-17k and Video-R4-RL-30k) and a multi-stage rumination learning framework that progressively finetunes a 7B LMM with Supervised Fine-Tuning (SFT) and GRPO-based Reinforcement Learning (RL). Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes strongly to multi-page document QA, slides QA, and generic video QA, demonstrating the effectiveness of iterative rumination for pixel-grounded multimodal reasoning.
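
As a rough illustration of this loop, here is a minimal Python sketch; every `model.*` helper below (init_state, answer_ready, select_frames, zoom_region, encode, update_state, answer) is a hypothetical placeholder, not the actual Video-R4 interface.

```python
# Minimal sketch of the read-retrieve-refocus-reinforce loop; the model.* methods
# are hypothetical placeholders, not the actual Video-R4 API.
def ruminate(video_frames, question, model, max_steps=8):
    state = model.init_state(video_frames, question)              # read: initial pass over the video
    for _ in range(max_steps):
        if model.answer_ready(state):                             # stop once the reasoning state suffices
            break
        frames = model.select_frames(state, video_frames)         # retrieve: pick informative frames
        regions = [model.zoom_region(state, f) for f in frames]   # refocus: crop text-rich regions
        evidence = [model.encode(r) for r in regions]             # re-encode the retrieved pixels
        state = model.update_state(state, evidence)               # reinforce: update the reasoning state
    return model.answer(state)
```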

Video-R4-7B achieves state-of-the-art performance on text-rich video understanding and is comparable to same-size LMMs on general video QA benchmarks.

🔬 Method

📊 Data Curation Pipeline


The data curation pipeline creates two datasets: Video-R4-CoT-17k (for supervised rumination practice) and Video-R4-RL-30k (for reinforcement learning). The pipeline enriches the source training data with OCR and object detection results, then applies rule-based matching, trajectory synthesis, and quality control to produce high-quality training examples.

🎯 Reward Design

Our reward function combines four components: the original reward \(R\) (e.g., answer correctness and format), Diversity Reward \(R_{\text{div}}\), Representativeness Reward \(R_{\text{rep}}\), and Curiosity Reward \(R_{\text{cur}}\). The overall reward is:

\[R' = R + \lambda_{\text{div}} R_{\text{div}} + \lambda_{\text{rep}} R_{\text{rep}} + \lambda_{\text{cur}} R_{\text{cur}}\]
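
For concreteness, the combination can be computed as below; the default coefficient values are illustrative placeholders, not the paper's tuned settings.

```python
def total_reward(r_base, r_div, r_rep, r_cur,
                 lam_div=0.1, lam_rep=0.1, lam_cur=0.1):
    # R' = R + lambda_div * R_div + lambda_rep * R_rep + lambda_cur * R_cur
    # The lambda defaults above are illustrative, not the paper's tuned values.
    return r_base + lam_div * r_div + lam_rep * r_rep + lam_cur * r_cur
```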

Diversity Reward

Following prior unsupervised summarization objectives, we encourage the selected regions to be mutually dissimilar in feature space. Let \(V\) denote the set of features of the input frames and \(\hat{V} = \hat{V}^{f} \cup \hat{V}^{r}\) the set of features of the selected frames (\(\hat{V}^{f}\)) and regions (\(\hat{V}^{r}\)). We define a diversity reward that encourages the policy to avoid redundant region selections:

\[R_{\mathrm{div}}(\hat{V}^{r}) = \frac{1}{|\hat{V}^{r}|\,(|\hat{V}^{r}| - 1)} \sum_{i=1}^{|\hat{V}^{r}|} \sum_{\substack{j=1 \\ j \ne i}}^{|\hat{V}^{r}|} d(v_i, v_{j})\]

where \(d(\cdot,\cdot)\) denotes the cosine distance: \[d(v_i, v_{j}) = 1 - \frac{v_i^{\top} v_{j}}{\|v_i\|_2 \|v_{j}\|_2}.\] This objective averages the pairwise distances between all selected region features, so the reward depends on the orientation of the features rather than their magnitude.
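
A minimal NumPy sketch of this diversity term, assuming the selected region features are stacked into an (n, d) array:

```python
import numpy as np

def diversity_reward(region_feats: np.ndarray) -> float:
    """Average pairwise cosine distance over selected region features (shape: n x d)."""
    n = region_feats.shape[0]
    if n < 2:
        return 0.0  # no pair to compare
    normed = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    cos_dist = 1.0 - normed @ normed.T             # pairwise cosine distances, zeros on the diagonal
    return float(cos_dist.sum() / (n * (n - 1)))   # average over all ordered pairs with i != j
```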

Representativeness Reward

To ensure the selected frames \(\hat{V}^{f}\) remain informative, we encourage them to represent the global video frame set \(V\):

\[R_{\mathrm{rep}}(V,\hat{V}^f) = \exp\!\left(-\frac{1}{|V|} \sum_{i=1}^{|V|} \min_{v_j \in \hat{V}^{f[-1]}} \| v_i - v_{j} \|_2\right)\]

where \(\|\cdot\|_2\) denotes the Euclidean distance in the feature space and \(\hat{V}^{f[-1]}\) is the set of frame features selected in the last clipping operation. This reward measures how well the selected frames cover the entire video in feature space.
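
A corresponding NumPy sketch of the representativeness term, assuming feature matrices for all input frames and for the frames kept by the last clipping operation:

```python
import numpy as np

def representativeness_reward(all_frame_feats: np.ndarray,
                              selected_frame_feats: np.ndarray) -> float:
    """exp(-mean over all frames of the distance to the nearest selected frame).

    all_frame_feats: (N, d) features of the input frames V.
    selected_frame_feats: (M, d) features of the frames kept by the last clipping operation.
    """
    diff = all_frame_feats[:, None, :] - selected_frame_feats[None, :, :]  # (N, M, d)
    dists = np.linalg.norm(diff, axis=-1)                                  # (N, M) Euclidean distances
    return float(np.exp(-dists.min(axis=1).mean()))
```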

Curiosity Reward

To balance exploration and prevent overuse of visual operations, we incorporate a curiosity reward:

\[R_{\text{cur}}(\hat{V}_i) = \alpha \left( H - \frac{1}{K}\sum_{j=1}^{K} \mathbb{I}\big[|\hat{V}_j| > 0\big] \right)_+ \cdot \mathbb{I}\big[|\hat{V}_i| > 0\big] - \beta \left(\,|\hat{V}_i| - N\,\right)_+\]

where \(K\) is the number of rollouts, \((\cdot)_+\) is the ReLU function, \(\mathbb{I}[\cdot]\) is the indicator function, \(\alpha\) and \(\beta\) are coefficients, \(H\) is a utilization threshold, and \(N\) is the number of visual selections allowed before the penalty applies. The first term encourages exploration when visual operations are under-utilized across rollouts, while the second term penalizes excessive calls to prevent over-reliance on visual operations.
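
A small sketch of this curiosity term for a group of rollouts; the values of alpha, beta, H, and N below are placeholders rather than the paper's settings.

```python
def curiosity_reward(num_selected, i, alpha=0.1, beta=0.1, H=0.5, N=4):
    """Curiosity term for rollout i.

    num_selected[j] is |V_hat_j|, the number of frames/regions selected in rollout j;
    alpha, beta, H, and N are illustrative placeholder values.
    """
    K = len(num_selected)
    usage_rate = sum(1 for n in num_selected if n > 0) / K        # fraction of rollouts using visual ops
    bonus = alpha * max(H - usage_rate, 0.0) * (1.0 if num_selected[i] > 0 else 0.0)
    penalty = beta * max(num_selected[i] - N, 0)                  # penalize calls beyond the budget N
    return bonus - penalty
```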

✈️ Training Framework


Multi-stage rumination training framework: (1) Deliberate Rumination Practice (DRP) SFT, (2) RL after DRP-SFT, (3) Compositional Rumination Practice (CRP) SFT, (4) RL after CRP-SFT. The framework uses GRPO-based reinforcement learning with rewards for accuracy, diversity, representativeness, and curiosity.
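
A compact sketch of this four-stage schedule; the trainer functions (run_sft, run_grpo) and the split of the CoT data into deliberate and compositional subsets are illustrative assumptions, not the released training code.

```python
# Illustrative four-stage schedule matching the caption above; run_sft / run_grpo
# are placeholder trainer functions, and the cot_data keys are assumed names.
def train_video_r4(model, cot_data, rl_data, run_sft, run_grpo):
    model = run_sft(model, cot_data["deliberate"])       # Stage 1: DRP-SFT
    model = run_grpo(model, rl_data)                     # Stage 2: RL after DRP-SFT
    model = run_sft(model, cot_data["compositional"])    # Stage 3: CRP-SFT
    model = run_grpo(model, rl_data)                     # Stage 4: RL after CRP-SFT
    return model
```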

📊 Experiment Results

πŸ† Performance on M4-ViteVQA

Performance comparison on the test set of the M4-ViteVQA dataset. Best non-human scores are in bold. All LMM-based models have 7B or 8B parameters. All compared RL-based models are built on Qwen2.5-VL. Human performance is taken from the original dataset paper.

| Models | LMM-Based | Visual-Grounded | RL-Based | Task 1 - Split 1 Acc. (%) | Task 1 - Split 1 ANLS (%) | Task 1 - Split 2 Acc. (%) | Task 1 - Split 2 ANLS (%) | Task 2 Acc. (%) | Task 2 ANLS (%) |
|---|---|---|---|---|---|---|---|---|---|
| JustAsk | ✗ | ✗ | ✗ | 10.05 | 14.10 | 5.47 | 8.60 | 3.60 | 6.70 |
| All-in-one-B | ✗ | ✗ | ✗ | 10.87 | 14.80 | 5.66 | 7.80 | 3.28 | 4.60 |
| Video-LLaVA-7B | ✓ | ✗ | ✗ | 15.43 | 17.15 | 11.19 | 12.02 | 9.38 | 11.80 |
| T5-ViteVQA | ✗ | ✗ | ✗ | 22.17 | 29.10 | 16.68 | 23.80 | 9.29 | 13.60 |
| VideoLLaMA2-7B | ✓ | ✗ | ✗ | 20.76 | 23.55 | 18.33 | 20.45 | 16.54 | 21.08 |
| Qwen2-VL-7B | ✓ | ✗ | ✗ | 35.22 | 45.84 | 27.25 | 38.45 | 21.23 | 28.79 |
| TEA-L | ✗ | ✓ | ✗ | 34.78 | 43.71 | 28.43 | 38.13 | 18.83 | 28.90 |
| NVILA-8B | ✓ | ✗ | ✗ | 37.73 | 47.23 | 30.10 | 41.52 | 22.89 | 30.34 |
| GAT-L | ✗ | ✓ | ✗ | 38.30 | 48.23 | 30.90 | 41.81 | 22.13 | 30.75 |
| Video-R1-7B | ✓ | ✗ | ✓ | 37.10 | 48.25 | 33.67 | 44.94 | 43.16 | 53.37 |
| Qwen2.5-VL | ✓ | ✗ | ✗ | 26.53 | 44.91 | 24.34 | 39.60 | 32.81 | 50.82 |
| Pixel-Reasoner | ✓ | ✓ | ✓ | 52.91 | 61.44 | 48.88 | 58.23 | 58.97 | 65.32 |
| Video-R4-7B (ours) | ✓ | ✓ | ✓ | **56.17** | **65.22** | **52.69** | **61.89** | **64.21** | **69.99** |
| Human | -- | -- | -- | 85.27 | 89.30 | 78.41 | 82.80 | 82.26 | 85.10 |

🕹️ Generalization to Other Tasks

Fine-tuned only on Video-R4-CoT-17k and Video-R4-RL-30k, Video-R4 demonstrates strong generalization, effectively handling not only general video QA but also multi-page document QA and slides QA without further dataset-specific training.

Results on general video QA benchmarks

| Models | MVBench | Video-MME | Video-MMMU |
|---|---|---|---|
| Video-LLaVA-7B | 42.9 | 39.9 | -- |
| VideoLLaMA2-7B | 54.6 | 46.6 | -- |
| Qwen2.5-VL-7B | 57.4 | 53.1 | 47.8 |
| Video-R1-7B | 62.7 | 57.4 | 49.8 |
| Pixel-Reasoner | 65.4 | 54.6 | 47.7 |
| Video-R4-7B (ours) | 64.5 | 54.5 | 52.2 |

Results on the validation set of the MP-DocVQA dataset

| Models | Zero-Shot | Acc. (%) | ANLS (%) |
|---|---|---|---|
| LayoutLMv3 | ✗ | 38.47 | 45.38 |
| Big-Bird | ✗ | 41.06 | 49.29 |
| Hi-VT5 (w/o train) | ✓ | 42.10 | 58.64 |
| Longformer | ✗ | 43.91 | 52.87 |
| Hi-VT5 | ✗ | 48.28 | 62.01 |
| Video-R4-7B (ours) | ✓ | 53.21 | 62.22 |

Results on the test set of SlideVQA

| Models | Zero-Shot | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|---|
| Q-only | ✓ | 9.4 | 11.4 | 10.7 | 13.5 |
| UniVL | ✓ | 8.8 | 12.1 | 10.6 | 14.1 |
| PreasM | ✗ | 36.3 | 41.9 | 30.7 | 38.2 |
| T5 | ✗ | 35.2 | 41.3 | 29.3 | 37.9 |
| T5 + zlay | ✗ | 36.9 | 43.2 | 31.0 | 39.7 |
| LayoutT5 | ✗ | 38.9 | 44.8 | 31.7 | 39.9 |
| LayoutLMv2 | ✗ | 26.5 | 33.4 | 21.4 | 29.3 |
| FiD | ✗ | 37.6 | 42.9 | 30.4 | 38.9 |
| FiD + zlay | ✗ | 38.1 | 43.3 | 30.6 | 38.9 |
| M3D | ✗ | 41.3 | 47.1 | 33.5 | 41.7 |
| Video-R4-7B (ours) | ✓ | 49.5 | 56.0 | 43.0 | 52.2 |
| Human | -- | -- | -- | 89.8 | 93.0 |

📺 Visualizations

Visualizations of the iterative visual rumination process.

🙏 Acknowledgments

This work was supported by Sony Group Corporation. We would like to thank Sayaka Nakamura and Jerry Jun Yokono for their insightful discussions. We also thank the authors of the following projects for their contributions: M4-ViteVQA, SlideVQA, MP-DocVQA, Open-R1, OpenRLHF, Ray, Qwen2.5-VL, Video-R1, Pixel-Reasoner, DeepSeek-R1, MVBench, Video-MME, and Video-MMMU.

📃 Citation


      @article{tang2025videor4,
        title={Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination},
        author={Tang, Yolo Yunlong and Shimada, Daiki and Hua, Hang and Huang, Chao and Bi, Jing and Feris, Rogerio and Xu, Chenliang},
        journal={arXiv preprint arXiv:2511.17490},
        year={2025}
      }