Teach Computers to Connect Videos and Text without Labeled Data - VideoCLIP

A groundbreaking way to do self-supervision on videos and text. It's like the BERT moment for video-text understanding. #videoclip #contrastivelearning #videotransformer

0:00 - Intro
3:31 - Retrieval augmented training
5:07 - Video and text encoding
8:48 - Contrastive loss
12:09 - Zero-shot transfer to end tasks
14:05 - Experiment results
18:09 - What did we learn

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches.
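To make the objective concrete, here is a minimal PyTorch sketch of the symmetric InfoNCE-style contrastive loss the abstract describes: each video clip is pulled toward its temporally overlapping text and pushed away from the other texts in the batch, which in VideoCLIP include hard negatives gathered by nearest-neighbor retrieval. The function name, temperature value, and embedding shapes below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def videoclip_contrastive_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    video_emb, text_emb: (batch, dim) pooled clip/caption embeddings,
    where row i of each tensor comes from a temporally overlapping
    video-text pair. The remaining rows in the batch serve as negatives;
    VideoCLIP constructs the batch so that retrieved nearest-neighbor
    clips supply hard negatives.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are positives.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the video-to-text and text-to-video cross-entropy terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Example usage with random stand-in features:
video_emb = torch.randn(32, 512)  # e.g. pooled video transformer features
text_emb = torch.randn(32, 512)   # e.g. pooled text transformer features
loss = videoclip_contrastive_loss(video_emb, text_emb)
```

Because both directions of the loss share one similarity matrix, every in-batch example doubles as a negative for both modalities, which is what makes the retrieval-built batches (and their hard negatives) effective.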