CaptionAnything in Video (CAT-V)
CAT-V is a training-free framework that enables fine-grained object-centric video captioning through spatiotemporal visual prompting and chain-of-thought reasoning.
CAT-V is a training-free framework that enables fine-grained object-centric video captioning through spatiotemporal visual prompting and chain-of-thought reasoning.