Projects | Yunlong (Yolo) Tang

work

Vid-LLM Survey

[TCSVT] 🔥 Video Understanding with Large Language Models: A Survey

VidComposition

[CVPR 2025] 🏆 See how top MLLMs understand video compositions.

Scaling Concept

We use pretrained text-guided diffusion models to scale concepts up or down in images and audio.

MMComposition

Benchmarking the compositionality capabilities of VLMs 🤯

CAT＼(=^‥^)✏️

Caption-Anything is a versatile image processing tool that combines the capabilities of Segment Anything, Visual Captioning, and ChatGPT. Our solution generates descriptive captions for any object within an image, offering a range of language styles to accommodate diverse user preferences.

CaptionAnything in Video (CAT-V)

CAT-V is a training-free framework that enables fine-grained object-centric video captioning through spatiotemporal visual prompting and chain-of-thought reasoning.

fun

GenAI for Cel-Animation

[ICCVW 2025] 🎨 A Comprehensive Survey on GenAI for Cel-Animation.

MMPerspective

[NeurIPS 2025] Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness.

LaunchpadGPT

[ICMC 2023] 🎵 LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad