Yunlong (Yolo) Tang
[ Yolo Y. Tang • /ˈjoʊloʊ tɛn/ • 芸珑 • she/her/hers ]
I’m a Ph.D. student at the University of Rochester, advised by Prof. Chenliang Xu, working on LMMs/Agents for Video Understanding.
I obtained my B.Eng. from SUSTech in 2023, supervised by Prof. Feng Zheng. I’ve interned at Amazon, ByteDance, and Tencent.
Please feel free to contact me if you’re interested in collaboration!
Selected Research
* Equal Contribution | † Corresponding Author
- TCSVT
Video Understanding with Large Language Models: A Survey
Yunlong Tang*, Jing Bi*, Siting Xu*, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, and 10 more authors
IEEE Transactions on Circuits and Systems for Video Technology, 2025
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. We further identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. In addition, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs, and explores their expansive applications across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/awesome-llms-for-video-understanding.
@article{vidllmsurvey, title = {Video Understanding with Large Language Models: A Survey 🔥🔥🔥}, author = {Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Liu, Pinxin and Feng, Mingqian and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang}, journal = {IEEE Transactions on Circuits and Systems for Video Technology}, year = {2025}, publisher = {IEEE}, }
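For readers unfamiliar with the taxonomy above, here is a minimal, illustrative sketch of the "Video Embedder x LLM" pattern (with the LLM acting as text decoder): frame features are projected into the LLM's token-embedding space and prepended to the text prompt. The module name, connector design, and dimensions are my own assumptions for illustration, not the implementation of any specific Vid-LLM covered by the survey.

```python
# Hedged sketch of the "Video Embedder x LLM" pattern: project per-frame visual
# features into the LLM embedding space so they can be consumed as "video tokens".
# All names and sizes are illustrative assumptions, not a specific model's API.
import torch
import torch.nn as nn

class VideoToLLMProjector(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=4096):
        super().__init__()
        # A simple linear connector; real systems may use Q-Formers, MLPs, etc.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features):       # (num_frames, vision_dim)
        return self.proj(frame_features)     # (num_frames, llm_dim)

# Toy usage: 8 frames of precomputed features become 8 video tokens that would
# be concatenated with text token embeddings before being fed to the LLM.
frames = torch.randn(8, 768)
video_tokens = VideoToLLMProjector()(frames)
print(video_tokens.shape)  # torch.Size([8, 4096])
```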
- NeurIPS
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Yunlong Tang, Pinxin Liu, Mingqian Feng, Zhangyun Tan, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, and 4 more authors
The 39th Annual Conference on Neural Information Processing Systems, 2025
Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs’ understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
@article{tang2025mmperspective, title = {MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness}, author = {Tang, Yunlong and Liu, Pinxin and Feng, Mingqian and Tan, Zhangyun and Mao, Rui and Huang, Chao and Bi, Jing and Xiao, Yunzhong and Liang, Susan and Hua, Hang and Vosoughi, Ali and Song, Luchuan and Zhang, Zeliang and Xu, Chenliang}, journal = {The 39th Annual Conference on Neural Information Processing Systems}, year = {2025}, }
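To make the evaluation setup above concrete, below is a small, hypothetical sketch of how a benchmark of this kind can be scored: iterate over question-answer pairs, query a model, and aggregate per-dimension accuracy. The `query_mllm` callable and the field names are assumptions for illustration only; they are not MMPerspective's actual API or data schema.

```python
# Hypothetical benchmark-scoring loop: per-dimension accuracy over QA pairs.
from collections import defaultdict

def evaluate(samples, query_mllm):
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:  # s: dict with "image", "question", "answer", "dimension"
        pred = query_mllm(s["image"], s["question"]).strip().lower()
        total[s["dimension"]] += 1
        if pred == s["answer"].strip().lower():
            correct[s["dimension"]] += 1
    return {d: correct[d] / total[d] for d in total}

# Toy usage with a dummy model that always answers "a":
samples = [{"image": None, "question": "How many vanishing points are visible?",
            "answer": "A", "dimension": "Perception"}]
print(evaluate(samples, lambda img, q: "a"))   # {'Perception': 1.0}
```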
- CVPR
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Yunlong Tang*, Junjia Guo*, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, and 2 more authors
In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025
@inproceedings{tang2024vidcompostion, title = {VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?}, author = {Tang, Yunlong and Guo, Junjia and Hua, Hang and Liang, Susan and Feng, Mingqian and Li, Xinyang and Mao, Rui and Huang, Chao and Bi, Jing and Zhang, Zeliang and Fazli, Pooyan and Xu, Chenliang}, booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference}, year = {2025}, pages = {8490-8500}, }
- AAAI
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, and Chenliang Xu†
In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs on temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, they can acquire temporal understanding capacity for video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR through a method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
@inproceedings{tang2024avicuna, title = {Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding}, author = {Tang, Yunlong and Shimada, Daiki and Bi, Jing and Feng, Mingqian and Hua, Hang and Xu, Chenliang}, booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence}, volume = {39}, number = {7}, pages = {7293-7301}, year = {2025}, doi = {10.1609/aaai.v39i7.32784}, }
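As a rough illustration of the pseudo-untrimmed idea above: take trimmed, caption-annotated clips, randomly rescale their durations, permute them, and concatenate them into one longer "untrimmed" video while recording each clip's resulting start/end as a temporal annotation. The sketch below is a simplification under my own assumptions (the event-based clustering step is omitted, and the scaling range and data fields are made up); it is not the paper's released code.

```python
# Hedged sketch of pseudo-untrimmed video construction: scale, permute, concatenate,
# and record per-event time spans. Clustering and exact parameters are omitted.
import random

def build_pseudo_untrimmed(clips, scale_range=(0.5, 2.0), seed=0):
    """clips: list of dicts like {"id": ..., "duration": float, "caption": str}"""
    rng = random.Random(seed)
    order = clips[:]
    rng.shuffle(order)                                        # permutation step
    t, annotations = 0.0, []
    for clip in order:
        dur = clip["duration"] * rng.uniform(*scale_range)    # random temporal scaling
        annotations.append({"caption": clip["caption"],
                            "start": round(t, 2), "end": round(t + dur, 2)})
        t += dur
    return annotations                                        # time-aligned event annotations

clips = [{"id": 1, "duration": 4.0, "caption": "a dog barks"},
         {"id": 2, "duration": 6.0, "caption": "a man plays guitar"}]
print(build_pseudo_untrimmed(clips))
```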
- AAAI
V2Xum-LLM: Cross-modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua*, Yunlong Tang*, Chenliang Xu, and Jiebo Luo†
In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
@inproceedings{hua2024v2xum, title = {V2Xum-LLM: Cross-modal Video Summarization with Temporal Prompt Instruction Tuning}, author = {Hua, Hang and Tang, Yunlong and Xu, Chenliang and Luo, Jiebo}, booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence}, volume = {39}, number = {4}, pages = {3599-3607}, year = {2025}, doi = {10.1609/aaai.v39i4.32374}, }
- AAAI
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
Yunlong Tang, Gen Zhan, Li Yang, Yiting Liao, and Chenliang Xu†
In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates this process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model to enhance video saliency prediction. Specifically, we introduce a novel prompting method, VSOR-CoT (Video Salient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that the diffusion model can leverage to accurately decode saliency maps for the given video. Extensive experiments show the effectiveness of VSOR-CoT in improving the performance of video saliency prediction. The proposed CaRDiff performs better than state-of-the-art models on the MVS dataset and demonstrates cross-dataset capabilities on the DHF1k dataset through zero-shot evaluation.
@inproceedings{tang2024cardiff, title = {CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion}, author = {Tang, Yunlong and Zhan, Gen and Yang, Li and Liao, Yiting and Xu, Chenliang}, booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence}, year = {2025}, doi = {10.1609/aaai.v39i7.32785}, volume = {39}, number = {7}, pages = {7302-7310}, }
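To clarify how the stages described above fit together, here is a high-level, hypothetical sketch of a CaRDiff-style pipeline: an MLLM with a grounding module produces ranked salient objects (VSOR-CoT), the rankings are rasterized into a ranking map, and a diffusion model conditions on that map to decode the saliency map. Every function below is a placeholder I made up for illustration, not the released implementation.

```python
# Hypothetical CaRDiff-style pipeline: caption & rank -> ranking map -> diffusion decode.
def caption_and_rank(frames, mllm):
    # VSOR-CoT step: caption the video, then list salient objects with ranks and boxes.
    return mllm(frames)            # e.g. [{"rank": 1, "box": (x0, y0, x1, y1)}, ...]

def ranking_map(objects, height, width):
    # Rasterize rankings: higher-ranked objects get larger values inside their boxes.
    grid = [[0.0] * width for _ in range(height)]
    for obj in sorted(objects, key=lambda o: o["rank"], reverse=True):
        x0, y0, x1, y1 = obj["box"]
        val = 1.0 / obj["rank"]    # rank 1 -> 1.0, rank 2 -> 0.5, ...
        for y in range(y0, y1):
            for x in range(x0, x1):
                grid[y][x] = val
    return grid

def predict_saliency(frames, mllm, diffusion, h=64, w=64):
    cond = ranking_map(caption_and_rank(frames, mllm), h, w)
    return diffusion(frames, cond)  # the diffusion model decodes the saliency map

# Toy usage with stand-in callables:
toy_mllm = lambda frames: [{"rank": 1, "box": (2, 2, 6, 6)}]
toy_diffusion = lambda frames, cond: cond        # identity stand-in
saliency = predict_saliency(None, toy_mllm, toy_diffusion, h=8, w=8)
print(saliency[3][3])  # 1.0 inside the top-ranked box
```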
- ICCVW
Generative AI for Cel-Animation: A Survey
Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, and 7 more authors
ICCV AISTORY Workshop; arXiv preprint arXiv:2501.06250, 2025
The traditional celluloid (cel) animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of cel-animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, issues such as maintaining visual consistency, ensuring stylistic coherence, and addressing ethical considerations continue to pose challenges. Furthermore, this paper discusses future directions and explores potential advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation
@article{tang2025ai4anime, title = {Generative AI for Cel-Animation: A Survey 🔥🔥🔥}, author = {Tang, Yunlong and Guo, Junjia and Liu, Pinxin and Wang, Zhiyuan and Hua, Hang and Zhong, Jia-Xing and Xiao, Yunzhong and Huang, Chao and Song, Luchuan and Liang, Susan and Song, Yizhi and He, Liu and Bi, Jing and Feng, Mingqian and Li, Xinyang and Zhang, Zeliang and Xu, Chenliang}, journal = {ICCV AISTORY Workshop; arXiv preprint arXiv:2501.06250}, year = {2025}, }
- ICMC
LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad
Siting Xu*, Yunlong Tang*, and Feng Zheng†
In Proceedings of the International Computer Music Conference, 2023
Launchpad is a popular musical instrument for live performance and music production. In this paper, we propose LaunchpadGPT, a language model that can generate music visualization designs for the Launchpad. We train the model on a large-scale dataset of music visualization designs and demonstrate its effectiveness in generating diverse and creative designs. We also develop a web-based platform that allows users to interact with the model and generate music visualization designs in real time. Our code is available at https://github.com/yunlong10/LaunchpadGPT.
@inproceedings{xu2023launchpadgpt, title = {LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad}, author = {Xu, Siting and Tang, Yunlong and Zheng, Feng}, booktitle = {Proceedings of the International Computer Music Conference}, year = {2023}, pages = {213-217}, }
Misc.
Fun Facts
- 🩵 Preferred names: Yolo (en) or 芸珑 (zh).
- 🍭 I’m into ACG, J-pop, singing, and cosplay.
- 😝 Maybe I’m just cosplaying as a researcher—who knows?