work

Vid-LLM Survey [TCSVT] 🔥
Video Understanding with Large Language Models: A Survey

VidComposition [CVPR 2025] 🏆
See how Top MLLMs understand video compositions.

Scaling Concept
We use pretrained text-guided diffusion models to scale up/down concepts in image/audio.

MMComposition
Benchmarking the compositionality capabilities of VLMs 🤯

CAT(=^‥^)✏️
Caption-Anything (CAT) is a versatile image processing tool that combines the capabilities of Segment Anything, Visual Captioning, and ChatGPT.

CaptionAnything in Video (CAT-V)
CAT-V is a training-free framework that enables fine-grained object-centric video captioning through spatiotemporal visual prompting and chain-of-thought reasoning.

fun

GenAI for Cel-Animation [ICCVW 2025] 🎨
A Comprehensive Survey on GenAI for Cel-Animation.

MMPerspective
Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness is provided in this project.

LaunchpadGPT [ICMC 2023] 🎵
LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad