Publications/Reports

FigCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Addressed the lack of fine-grained and syntactic information in CLIP’s representations by adapting CLIP on holistic, multidimensional, and densely annotated video-text data, using a lightweight adaptation strategy with LoRA adapters (a minimal sketch of this setup follows below).

Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

Paper (under review)
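The entry above mentions lightweight adaptation of CLIP with LoRA adapters. As an illustration only, here is a minimal sketch of how such an adaptation could be wired up with Hugging Face `transformers` and `peft`; the checkpoint, rank, scaling, and target modules are assumptions, not the paper's actual configuration.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): injecting
# LoRA adapters into a pretrained CLIP model so that only the small low-rank
# matrices are updated during fine-tuning on video-text data.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Target the attention query/value projections in both the vision and text
# towers; the rank and scaling below are hypothetical choices.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# The base CLIP weights stay frozen; only a small fraction of parameters train.
model.print_trainable_parameters()
```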

Unsupervised Audio Visual Lecture Segmentation

Proposed an approach to video lecture segmentation that splits lectures into bite-sized topics. First learned lecture-clip representations that leverage visual, textual, and OCR cues via a self-supervised pretext task of matching lecture narrations with their temporally aligned visual content (see the sketch below), then used these representations to temporally segment lectures with the TW-FINCH clustering algorithm. Introduced AVLectures, a large-scale dataset of 86 courses with over 2,350 lectures covering various STEM subjects from MIT OpenCourseWare, used for pre-training, fine-tuning, and evaluating segmentation performance.

Darshan Singh S, Anchit Gupta, C.V. Jawahar, and Makarand Tapaswi

Winter Conference on Applications of Computer Vision (WACV), 2023

Paper / Code (GitHub)
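As an illustration of the pretext task described above, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) alignment loss between clip and narration embeddings; the function name, batch layout, and temperature are assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions, not the paper's implementation) of the pretext
# task: align lecture-clip features with temporally matched narration features
# via a symmetric InfoNCE loss, contrasting each pair against the batch.
import torch
import torch.nn.functional as F

def alignment_loss(clip_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """clip_emb, text_emb: (B, D) embeddings of B temporally aligned pairs."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = clip_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched narration-clip pairs lie on the diagonal of the logits matrix.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```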