work

- Vid-LLM Survey 🔥: Video Understanding with Large Language Models: A Survey
- VidComposition [CVPR 2025] 🏆: See how Top MLLMs understand video compositions.
- Scaling Concept: We use pretrained text-guided diffusion models to scale up/down concepts in image/audio.
- MMComposition: Benchmarking the compositionality capabilities of VLMs 🤯
- CAT\(=^‥^)✏️: Caption-Anything (CAT) is a versatile image processing tool that combines the capabilities of Segment Anything, Visual Captioning, and ChatGPT.
- CAT-V: CAT-V is a training-free framework that enables fine-grained object-centric video captioning through spatiotemporal visual prompting and chain-of-thought reasoning.

fun

- GenAI for Cel-Animation: A Comprehensive Survey on GenAI for Cel-Animation.