work

VidComposition 🏆
See how top MLLMs understand video compositions.

Vid-LLM Survey 🔥
Video Understanding with Large Language Models: A Survey

Scaling Concept
We use pretrained text-guided diffusion models to scale concepts up or down in images and audio.

MMComposition
Benchmarking the compositionality capabilities of VLMs 🤯

CAT\(=^‥^)✏️
Caption-Anything (CAT) is a versatile image processing tool that combines the capabilities of Segment Anything, Visual Captioning, and ChatGPT.