ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
Paper: arXiv:2412.09754
This is the official baseline implementation for the ViCaS dataset, presented in the paper *ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation*.
For details on setting up the model, refer to the Video-LLaVA-Seg GitHub repo.
For details on downloading the dataset and running the benchmark evaluation, refer to the ViCaS GitHub repo.