This video explains a new approach to Visually supervise Language models that achieves performance gains on Language-Only tasks like the GLUE benchmark and SQuAD question answering. This is done by constructing a token-image matching (vokens) and classifying corresponding tokens with a a weakly supervised loss function.
Thanks for watching! Please Subscribe!
Paper Links:
Vokenization:
ImageBERT:
VilBERT: