Yolo Y. Tang

Hi there / 你好 / こんにちは / 안녕 / Ciallo ～(∠・ω< )⌒★

Yolo is a final-year Ph.D. candidate at the University of Rochester (UR), advised by Prof. Chenliang Xu, working on LMMs/Agents × Video Understanding. She earned her M.S. from UR in 2025 en route to her Ph.D. and received her B.Eng. from SUSTech in 2023. She has interned at Amazon, ByteDance, and Tencent.

* Equal Contribution | † Corresponding Author

TCSVT
Video Understanding with Large Language Models: A Survey

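Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu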

IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. The survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers can visit the repository at https://github.com/yunlong10/awesome-llms-for-video-understanding.
@article{vidllmsurvey,
  title     = {Video Understanding with Large Language Models: A Survey},
  author    = {Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Liu, Pinxin and Feng, Mingqian and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
  journal   = {IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)},
  year      = {2025},
  publisher = {IEEE},
}
AAAI
V2Xum-LLM: Cross-modal Video Summarization with Temporal Prompt Instruction Tuning

Hang Hua*, Yunlong Tang*, Chenliang Xu, and Jiebo Luo

In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025

@inproceedings{hua2024v2xum,
  title     = {V2Xum-LLM: Cross-modal Video Summarization with Temporal Prompt Instruction Tuning},
  author    = {Hua, Hang and Tang, Yunlong and Xu, Chenliang and Luo, Jiebo},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  volume    = {39},
  number    = {4},
  pages     = {3599--3607},
  year      = {2025},
  doi       = {10.1609/aaai.v39i4.32374},
}
AAAI
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, and Chenliang Xu

In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, their temporal understanding capacity in video-language tasks can be obtained. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to localize audio-visual events in videos temporally. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
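The abstract above names random temporal scaling and permutation as steps in building pseudo-untrimmed videos. The following is only a rough sketch of that idea, not the PU-VALOR pipeline itself: it assumes the clips have already been grouped by the event-based clustering step, and the names `Clip`, `Segment`, `build_pseudo_untrimmed`, and `scale_range` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Clip:
    caption: str      # coarse caption attached to the trimmed clip
    duration: float   # clip length in seconds


@dataclass
class Segment:
    caption: str
    start: float      # start time inside the composed video, in seconds
    end: float        # end time inside the composed video, in seconds


def build_pseudo_untrimmed(clips, rng, scale_range=(0.5, 2.0)):
    """Compose clips into one timeline with per-event time spans.

    Assumption: `clips` already belong to one cluster of related events.
    `scale_range` is an illustrative choice, not a value from the paper.
    """
    order = list(clips)
    rng.shuffle(order)                      # permutation of the clip order
    segments = []
    t = 0.0
    for clip in order:
        scale = rng.uniform(*scale_range)   # random temporal scaling
        length = clip.duration * scale
        segments.append(Segment(clip.caption, t, t + length))
        t += length                         # next clip starts where this one ends
    return segments


if __name__ == "__main__":
    demo = [Clip("a dog barks", 4.0), Clip("a door slams", 2.0), Clip("rain falls", 6.0)]
    for seg in build_pseudo_untrimmed(demo, random.Random(0)):
        print(f"{seg.start:6.2f} - {seg.end:6.2f}  {seg.caption}")
```

Because every segment boundary is known by construction, each composed timeline comes with temporal annotations for free, which is the property the abstract relies on.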
@inproceedings{tang2024avicuna,
  title     = {Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding},
  author    = {Tang, Yunlong and Shimada, Daiki and Bi, Jing and Feng, Mingqian and Hua, Hang and Xu, Chenliang},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  volume    = {39},
  number    = {7},
  pages     = {7293--7301},
  year      = {2025},
  doi       = {10.1609/aaai.v39i7.32784},
}
CVPR
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Yunlong Tang*, Junjia Guo*, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, and Chenliang Xu

In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025

@inproceedings{tang2024vidcompostion,
  title     = {VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?},
  author    = {Tang, Yunlong and Guo, Junjia and Hua, Hang and Liang, Susan and Feng, Mingqian and Li, Xinyang and Mao, Rui and Huang, Chao and Bi, Jing and Zhang, Zeliang and Fazli, Pooyan and Xu, Chenliang},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2025},
  pages     = {8490--8500},
}
AAAI
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

Yunlong Tang, Gen Zhan, Li Yang, Yiting Liao, and Chenliang Xu

In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025

Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model to enhance video saliency prediction. Specifically, we introduce a novel prompting method, VSOR-CoT (Video Salient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that can be sufficiently leveraged by the diffusion model to decode the saliency maps for the given video accurately. Extensive experiments show the effectiveness of VSOR-CoT in improving the performance of video saliency prediction. The proposed CaRDiff performs better than state-of-the-art models on the MVS dataset and demonstrates cross-dataset capabilities on the DHF1k dataset through zero-shot evaluation.
@inproceedings{tang2024cardiff,
  title     = {CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion},
  author    = {Tang, Yunlong and Zhan, Gen and Yang, Li and Liao, Yiting and Xu, Chenliang},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  volume    = {39},
  number    = {7},
  pages     = {7302--7310},
  year      = {2025},
  doi       = {10.1609/aaai.v39i7.32785},
}
AAAI Demo
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting

Yunlong Tang, Jing Bi, Chao Huang, Susan Liang, Daiki Shimada, Hang Hua, Yunzhong Xiao, Yizhi Song, Pinxin Liu, Mingqian Feng, Junjia Guo, Zhuo Liu, Luchuan Song, Ali Vosoughi, Jinxi He, Liu He, Zeliang Zhang, Jiebo Luo, and Chenliang Xu

Best Demo Award Runner-up at AAAI Demonstration Program, 2026

@inproceedings{tang2025catv,
  title     = {Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting},
  author    = {Tang, Yunlong and Bi, Jing and Huang, Chao and Liang, Susan and Shimada, Daiki and Hua, Hang and Xiao, Yunzhong and Song, Yizhi and Liu, Pinxin and Feng, Mingqian and Guo, Junjia and Liu, Zhuo and Song, Luchuan and Vosoughi, Ali and He, Jinxi and He, Liu and Zhang, Zeliang and Luo, Jiebo and Xu, Chenliang},
  booktitle = {AAAI Demonstration Program},
  note      = {Best Demo Award Runner-up},
  year      = {2026},
}
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

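Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Junhua Huang, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, and Chenliang Xu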

arXiv preprint arXiv:2510.05034, 2025

Video understanding represents the most challenging frontier in computer vision. It requires models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://arxiv.org/abs/2510.05034
@article{tang2025videolmm_posttraining,
  title   = {Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models},
  author  = {Tang, Yolo Yunlong and Bi, Jing and Liu, Pinxin and Pan, Zhenyu and Tan, Zhangyun and Shen, Qianxiang and Liu, Jiani and Hua, Hang and Guo, Junjia and Xiao, Yunzhong and Huang, Chao and Wang, Zhiyuan and Liang, Susan and Liu, Xinyi and Song, Yizhi and Huang, Junhua and Zhong, Jia-Xing and Li, Bozheng and Qi, Daiqing and Zeng, Ziyun and Vosoughi, Ali and Song, Luchuan and Zhang, Zeliang and Shimada, Daiki and Liu, Han and Luo, Jiebo and Xu, Chenliang},
  journal = {arXiv preprint arXiv:2510.05034},
  year    = {2025},
}
NeurIPS D&B
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, and Chenliang Xu

The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025

@inproceedings{tang2025mmperspective,
  title     = {MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness},
  author    = {Tang, Yunlong and Liu, Pinxin and Tan, Zhangyun and Feng, Mingqian and Mao, Rui and Huang, Chao and Bi, Jing and Xiao, Yunzhong and Liang, Susan and Hua, Hang and Vosoughi, Ali and Song, Luchuan and Zhang, Zeliang and Xu, Chenliang},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track},
  year      = {2025},
}
CVPR Findings
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, and Chenliang Xu

Computer Vision and Pattern Recognition Conference (CVPR) Findings, 2026

@inproceedings{tang2025videor4,
  title     = {Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination},
  author    = {Tang, Yolo Yunlong and Shimada, Daiki and Hua, Hang and Huang, Chao and Bi, Jing and Feris, Rogerio and Xu, Chenliang},
  booktitle = {Computer Vision and Pattern Recognition Conference (CVPR) Findings},
  year      = {2026},
}
CVPR
Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning

Yolo Yunlong Tang, Chao Huang, Susan Liang, Jing Bi, Yicheng Wang, Daiki Shimada, and Chenliang Xu

Computer Vision and Pattern Recognition Conference (CVPR), 2026

Streaming dense video captioning requires real-time processing of continuous visual input while determining precisely when and what to caption. Current approaches primarily focus on designing complex external memory mechanisms, failing to leverage Large Multimodal Models’ (LMMs) inherent long-context capabilities. Moreover, existing methods employing threshold-based caption triggering face a severe Threshold-Gated Discrepancy (TGD) problem, a training-inference mismatch arising from data imbalance, where models predominantly predict silence tokens, requiring thresholds that vary drastically across videos with extremely narrow effective ranges. We introduce Takusen, an asynchronous temporal modeling two-agent framework comprising a Small Multimodal Model (SMM) as an Oracle agent and an LMM as a Listener agent. The Oracle agent processes sparse video inputs at an accelerated rate to detect event boundaries, while the Listener agent processes dense inputs to generate accurate captions when prompted by the Oracle’s signals. This architecture eliminates threshold dependencies by fundamentally changing how silence/generation decisions are made, resolving the TGD problem. To enhance robustness against boundary prediction instabilities, we integrate uniformly distributed fixed decoding points with Oracle-predicted boundaries. Experiments on ActivityNet Captions and YouCook2 datasets demonstrate that Takusen achieves state-of-the-art performance with a simpler and more efficient design that balances temporal sensitivity with descriptive accuracy.
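The abstract above mentions combining uniformly distributed fixed decoding points with Oracle-predicted boundaries. One way this merge could look is sketched below; it is an illustration under stated assumptions, not the Takusen implementation. The function name `merge_decoding_points` and the `min_gap` de-duplication rule are hypothetical and do not come from the paper.

```python
def merge_decoding_points(duration, num_fixed, oracle_boundaries, min_gap=1.0):
    """Return sorted timestamps (seconds) at which a caption would be decoded.

    duration:          video length in seconds
    num_fixed:         number of evenly spaced fallback decoding points
    oracle_boundaries: event-boundary timestamps predicted by the Oracle agent
    min_gap:           hypothetical rule that drops a point falling closer
                       than `min_gap` seconds to the previously kept point
    """
    # Evenly spaced fixed points, with the last one at the end of the video.
    fixed = [duration * (i + 1) / num_fixed for i in range(num_fixed)]
    # Keep only Oracle boundaries that fall inside the video.
    predicted = [b for b in oracle_boundaries if 0 < b <= duration]

    merged = []
    for t in sorted(set(fixed) | set(predicted)):
        if not merged or t - merged[-1] >= min_gap:
            merged.append(t)
    return merged


if __name__ == "__main__":
    points = merge_decoding_points(60.0, 3, [12.5, 20.3, 41.0])
    print(points)
```

The fixed points act as a safety net: even if the Oracle misses or misplaces a boundary, the Listener is still prompted at regular intervals.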
@inproceedings{tang2026takusen,
  title     = {Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning},
  author    = {Tang, Yolo Yunlong and Huang, Chao and Liang, Susan and Bi, Jing and Wang, Yicheng and Shimada, Daiki and Xu, Chenliang},
  booktitle = {Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2026},
}
