Hi! I am a predoctoral researcher at Google DeepMind, India, in the Foundational Research unit. I work on multimodal, multicultural video understanding in Gemini. I primarily work with Partha Talukdar, Shachi Dave, Arsha Nagrani, Tobias Weyand, Anelia Angelova, and Cordelia Schmid. It feels like a dream to be working with so many amazing people!

Prior to this, I spent five months at FastCode AI as a Machine Learning Engineer, where I worked on Time-Series Foundation Models and Diffusion Models with Arjun Jain.

Previously, I was an MS student in the CVIT group at IIIT Hyderabad, advised by Prof. C. V. Jawahar and Prof. Makarand Tapaswi, and I also worked closely with Prof. Vineet Gandhi. My work there was in multimodal learning (jointly learning from vision and language modalities). Before that, I was an Engineer at Mercedes Benz Research & Development India.

I am broadly interested in problems related to computer vision, natural language processing, and multimodal representation learning (especially using self-supervision).

CV / Google Scholar / GitHub / LinkedIn

News


Oct, 2025 : Our paper on fine-grained image captioning using self-retrieval was recognized as a top-10% paper by TMLR and has been invited for presentation at ICLR 2026! Yay! See you in Rio de Janeiro, I guess :)

Oct, 2025 : Excited to share our new paper - Rethinking Cross-lingual Gaps from a Statistical Viewpoint. My first work (of many?) from DeepMind!

Feb, 2025 : Super excited to announce that our paper VELOCITI was accepted to CVPR 2025! See you all in Nashville!

Dec, 2024 : Served as a reviewer (and Emergency Reviewer!) for CVPR 2025.

Nov, 2024 : Excited to join Google DeepMind as a Pre-Doctoral Researcher in the Languages team with Partha Talukdar!

See all news

Publications


Rethinking Cross-lingual Gaps from a Statistical Viewpoint

We propose an alternative view of the cross-lingual gap in LLMs, hypothesizing that the variance of responses in the target language (not just differences in latent representations) is the primary cause of the drop in performance. We are the first to formalize this gap using a bias-variance decomposition, and we provide extensive experimental evidence to support our hypothesis. Our key finding is that this variance-driven gap can be significantly reduced: simple inference-time interventions, including a specific prompt instruction to control variance, improve target-language accuracy by 20-25%.
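For intuition, here is the textbook bias-variance decomposition that the summary alludes to, written for a fixed input with a noise-free label; the cross-lingual notation below is only an illustrative sketch, not the exact formalization from the paper.

```latex
% Standard bias-variance decomposition of expected squared error for a fixed
% input x with a noise-free label y (illustrative; not the paper's exact setup).
\[
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(y - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
\]
% Viewing the cross-lingual gap as the difference in error between a target
% language t and a source language s, the hypothesis is that the variance
% term dominates:
\[
\mathrm{Err}_t - \mathrm{Err}_s
  = \left(\mathrm{Bias}_t^2 - \mathrm{Bias}_s^2\right)
  + \underbrace{\left(\mathrm{Var}_t - \mathrm{Var}_s\right)}_{\text{hypothesized dominant term}}
\]
```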

Vihari Piratla, Purvam Jain, Darshan Singh S, Trevor Cohn, Partha Talukdar

Paper (under review)

VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment

We propose VELOCITI, a new benchmark to evaluate the compositional reasoning abilities of Video-LLMs. We introduce StrictVLE (Strict Video-Language Entailment), an evaluation method that offers a stricter test by requiring models to correctly classify both positive and negative captions. Our key finding is that even state-of-the-art models like Gemini 1.5 Pro (49.3%) perform far below human accuracy (93.0%), and in particular struggle to associate agents with their actions, one of the most fundamental reasoning skills.

Darshana S*, Varun Gupta*, Darshan Singh S*, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi

Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Project Page / Paper

No Detail Left Behind - Revisiting Self-Retrieval for Fine-Grained Image Captioning

A findings-rich paper that systematically improves captioning systems across all fronts: data, training, and evaluation. We design (1) a post-training recipe for self-retrieval fine-tuning with REINFORCE, and (2) a synthetic framework for visually boosting captioning datasets. Together they enable captioners to generate fine-grained, succinct descriptions while reducing hallucinations. Using our training recipe, ClipCap, a 200M-parameter simplification of modern MLLMs, outperforms state-of-the-art open-source MLLMs on fine-grained visual discrimination.
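To make the training recipe concrete, here is a minimal sketch of self-retrieval fine-tuning with REINFORCE. The `captioner` and `clip_scores` interfaces, the 0/1 retrieval reward, and the mean-reward baseline are assumptions for illustration, not the exact recipe from the paper.

```python
# Minimal sketch of self-retrieval fine-tuning with REINFORCE (illustrative;
# the captioner/retriever interfaces and reward shaping are assumptions).
import torch

def self_retrieval_reinforce_step(captioner, clip_scores, images, optimizer):
    """One policy-gradient step: reward a caption by how well it retrieves
    its own image from the batch (self-retrieval)."""
    # Sample one caption per image; keep the sequence log-probabilities.
    captions, log_probs = captioner.sample(images)            # log_probs: (B,)

    # Score every caption against every image with a frozen retriever (e.g. CLIP).
    sims = clip_scores(captions, images)                      # (B, B) similarity matrix

    # Reward = 1 if a caption ranks its own image first among all batch images.
    targets = torch.arange(len(images), device=sims.device)
    reward = (sims.argmax(dim=1) == targets).float()

    # Mean-reward baseline reduces the variance of the gradient estimate.
    advantage = reward - reward.mean()

    # REINFORCE: maximize expected reward -> minimize -advantage * log_prob.
    loss = -(advantage.detach() * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```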

Manu Gaur, Darshan Singh S, Makarand Tapaswi

Transactions on Machine Learning Research (TMLR), 2024 (top 10%)

Project Page / Paper

Detect, Describe, Discriminate - Moving Beyond VQA for MLLM Evaluation

TL;DR: It is easier for MLLMs to select an answer from multiple choices during VQA than to generate it independently. We evaluate MLLMs' visual capabilities through self-retrieval within highly similar image pairs, revealing that current models struggle to identify fine-grained visual differences, with open-source models failing to outperform random guessing.

Manu Gaur, Darshan Singh S, Makarand Tapaswi

ECCV EVAL-FoMo Workshop, 2024

Project Page / Paper

FigCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Addressed the lack of fine-grained and syntactic information in CLIP's representations by adapting CLIP on holistic, multidimensional, and densely annotated video-text data, using a lightweight adaptation strategy with LoRA adapters.
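As an illustration of the lightweight adaptation, here is a minimal sketch of a LoRA adapter wrapped around a frozen linear layer; the rank, scaling, and placement are assumptions, not FigCLIP's actual configuration.

```python
# Minimal LoRA adapter sketch: frozen base weight plus a trainable low-rank
# update (rank/alpha and where it is inserted are illustrative assumptions).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # frozen CLIP weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # trainable
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # trainable
        nn.init.zeros_(self.lora_b.weight)            # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path + scaled low-rank trainable update.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```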

Darshan Singh, Zeeshan Khan, Makarand Tapaswi

Paper (under review)

Unsupervised Audio Visual Lecture Segmentation

We propose a video lecture segmentation approach that splits lectures into bite-sized topics. We first learn lecture-clip representations by leveraging visual, textual, and OCR cues through a self-supervised pretext task of matching lecture narrations with temporally aligned visual content, and then use these representations to temporally segment lectures with the TW-FINCH algorithm. We also introduce AVLectures, a large-scale dataset of 86 courses with over 2,350 lectures covering various STEM subjects from MIT OpenCourseWare, which we use for pre-training, fine-tuning, and evaluating segmentation performance.
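For concreteness, below is a minimal sketch of the kind of contrastive matching objective the pretext task describes; the symmetric InfoNCE form and the temperature are illustrative assumptions, not necessarily the exact loss used in the paper.

```python
# Sketch of a narration-clip matching objective: pull each lecture clip's
# visual features toward its temporally aligned narration, push apart
# mismatched pairs (illustrative; not the paper's exact formulation).
import torch
import torch.nn.functional as F

def narration_clip_matching_loss(visual_emb, narration_emb, temperature=0.07):
    """visual_emb, narration_emb: (B, D) embeddings of temporally aligned pairs."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(narration_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric contrastive loss: clip -> narration and narration -> clip.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```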

Darshan Singh, Anchit Gupta, C.V. Jawahar and Makarand Tapaswi

Winter Conference on Applications of Computer Vision (WACV), 2023

Paper / Code (GitHub)


See all publications