#222 Multimodal Models Part 1 (as part of the IIT Delhi course on Large Language Models (LLMs))

00:46:17
https://www.youtube.com/watch?v=R9YHeF_Uli0

Summary

TLDR: This session, delivered by Manish Gupta, a principal applied scientist at Microsoft, explores multimodal models focused on vision-and-language tasks such as visual question answering, visual commonsense reasoning, and caption-based image retrieval. It covers multimodal understanding through the integration of BERT with Vision Transformer models to process images and text. Tools such as convolutional neural networks and Vision Transformers (ViT) are analyzed for encoding images. Models such as VisualBERT, ViLBERT, and CLIP are examined to explain how they handle multimodal tasks through joint encoding and contrastive pre-training. Gupta also covers methods for visually rich document understanding with LayoutLM and the extension of these tasks to video with models such as VideoCLIP. Recent research on models that bind several modalities, for example ImageBind, is also discussed.

Key takeaways

  • 📌 Focus on vision-and-language understanding.
  • 🖼️ Use of Vision Transformers to encode images.
  • 💡 VisualBERT and ViLBERT for multimodality.
  • 🔄 Use of a contrastive loss in CLIP.
  • 🗂️ LayoutLM for document understanding.
  • 🎥 Adapting the techniques to video and text.
  • 🔍 Exploring ImageBind with several modalities.

Timeline

  • 00:00:00 - 00:05:00

    Welcome to this session on multimodal models, part 1, focused on vision-and-language tasks, with an emphasis on multimodal understanding rather than generation. Popular tasks include visual question answering and visual commonsense reasoning, where objects detected in images are related to text.

  • 00:05:00 - 00:10:00

    Introduction to Vision Transformers, used to encode images by splitting them into fixed-size patches and adding positional embeddings before passing them through a Transformer encoder. These vision models can be used for various classification tasks (a minimal ViT sketch follows the timeline).

  • 00:10:00 - 00:15:00

    Presentation of VisualBERT, a pre-trained multimodal model that integrates BERT for text and uses vision models for the image. It is pre-trained on captioned images with objective functions such as masked language modeling and image-text alignment prediction (an input-construction sketch follows the timeline).

  • 00:15:00 - 00:20:00

    Introduction of ViLBERT, a two-tower architecture that processes text and image separately before fusing them through co-attention Transformer layers for aligned modeling. It leverages the Conceptual Captions data for pre-training (a co-attention sketch follows the timeline).

  • 00:20:00 - 00:25:00

    Presentation of the CLIP model, which uses a contrastive loss for pre-training on very large image-text datasets collected from the web. CLIP shows excellent performance on computer vision tasks, even without task-specific supervised training (contrastive-loss and zero-shot sketches follow the timeline).

  • 00:25:00 - 00:30:00

    Multimodal models are also used for visually rich document understanding with LayoutLM, which processes scanned documents to extract key-value pairs or answer questions grounded in the document (a 2D position-embedding sketch follows the timeline).

  • 00:30:00 - 00:35:00

    Discussion of the extension to video tasks, where a video is treated as a sequence of image frames, and of models that encode such videos to enable tasks like text-to-video retrieval or video question answering (a VideoCLIP-style sketch follows the timeline).

  • 00:35:00 - 00:40:00

    Introduction to ImageBind, an attempt to bind six different modalities, including images, text, audio, and more, enabling richer multimodal understanding and generation without requiring fully aligned data across all modalities.

  • 00:40:00 - 00:46:17

    Wrap-up of the session, underlining the importance of multimodal modeling in various settings, introducing recent models, and encouraging further exploration of research in this exciting area.
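
Code sketches (referenced from the timeline above). Each of the following is a minimal, hedged PyTorch illustration of an idea described in the talk, not the original implementation of any paper; class names, dimensions, and hyperparameters are assumptions chosen for readability. First, the ViT input pipeline: split the image into fixed-size patches, linearly embed them, prepend a [CLS] token, add positional embeddings, and feed the result to a standard Transformer encoder with an MLP head on [CLS].

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style encoder: patchify -> linear embed -> +[CLS] -> +pos -> Transformer."""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patchify and linearly embed in one step: a conv with stride = patch size
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, images):                        # images: (B, 3, H, W)
        x = self.to_patches(images)                   # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)              # (B, N, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                           # standard, unmodified Transformer encoder
        return self.mlp_head(x[:, 0])                 # classify from the [CLS] token

logits = TinyViT(depth=2)(torch.randn(2, 3, 224, 224))  # (2, 1000); depth=2 only to keep the demo light
```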
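
Next, a sketch of how a VisualBERT-style single-tower input could be assembled: word-piece embeddings for the caption and projected Faster R-CNN region features for the image, each summed with segment and position embeddings and concatenated into one sequence for a shared Transformer encoder (over which the masked-language-modeling and image-text alignment objectives are applied). The 2048-dimensional region feature size is an assumption.

```python
import torch
import torch.nn as nn

class SingleTowerVLInput(nn.Module):
    """Builds the combined text+image token sequence for a VisualBERT-style encoder."""
    def __init__(self, vocab_size=30522, dim=768, region_feat_dim=2048, max_len=512):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, dim)
        self.region_proj = nn.Linear(region_feat_dim, dim)   # detected-object features -> model dim
        self.segment_embed = nn.Embedding(2, dim)             # 0 = text segment, 1 = image segment
        self.pos_embed = nn.Embedding(max_len, dim)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, Lt) word-piece ids; region_feats: (B, Lv, 2048) object-region features
        B, Lt = token_ids.shape
        Lv = region_feats.size(1)
        text = self.tok_embed(token_ids) + self.segment_embed(torch.zeros_like(token_ids))
        image = self.region_proj(region_feats) + self.segment_embed(
            torch.ones(B, Lv, dtype=torch.long))
        seq = torch.cat([text, image], dim=1)                 # (B, Lt + Lv, dim)
        pos = torch.arange(Lt + Lv).unsqueeze(0)              # shared 1D positions
        return seq + self.pos_embed(pos)                      # ready for a standard Transformer encoder

seq = SingleTowerVLInput()(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))  # (2, 52, 768)
```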
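
The co-attention idea in ViLBERT's co-Transformer layers, sketched with standard multi-head attention: each stream keeps its own queries but takes keys and values from the other stream. Layer norms and feed-forward sublayers are omitted for brevity; this illustrates the mechanism, not the paper's code.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attention step: the text stream attends over image features and vice versa."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):             # txt: (B, Lt, dim), img: (B, Lv, dim)
        # Linguistic stream: queries from text, keys/values from the visual stream
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        # Visual stream: queries from image, keys/values from the linguistic stream
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        return txt + txt_out, img + img_out   # residual connections

txt, img = CoAttentionBlock()(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
```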
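
The CLIP-style contrastive objective over a batch of N real image-text pairs: compute the N x N matrix of cosine similarities between pooled image and text embeddings, then push the N diagonal (matched) entries up and the N^2 - N off-diagonal entries down via a symmetric cross-entropy. The temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, d) pooled embeddings from the image and text encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (N, N) scaled cosine similarities
    targets = torch.arange(logits.size(0))             # matched pair i <-> i sits on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(20, 512), torch.randn(20, 512))  # batch of 20 -> 400 similarities
```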
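
Zero-shot classification with a CLIP-like model at inference time: embed each candidate class label with the text encoder, embed the image with the image encoder, and pick the label with the highest cosine similarity. Here encode_image and encode_text are placeholder callables standing in for the two pre-trained encoders, not a specific library API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    # class_names can be brand new at test time; no fine-tuning is involved
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = F.normalize(encode_text(prompts), dim=-1)     # (C, d), one embedding per class label
    img = F.normalize(encode_image(image), dim=-1)      # (1, d)
    sims = img @ txt.t()                                 # (1, C) cosine similarities
    return class_names[sims.argmax(dim=-1).item()]       # highest-similarity label wins
```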
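
A sketch of LayoutLM-style 2D position embeddings: every token carries its bounding box on the page, and embeddings of the box coordinates (plus width and height) are summed with the usual token, 1D position, and segment embeddings; the [CLS] token gets an all-zeros box, as mentioned in the talk. The coordinate grid size is an assumption.

```python
import torch
import torch.nn as nn

class Pos2DEmbedding(nn.Module):
    """2D position embedding for document tokens, derived from their bounding boxes."""
    def __init__(self, dim=768, max_coord=1024):
        super().__init__()
        self.x_embed = nn.Embedding(max_coord, dim)   # shared table for x-min and x-max
        self.y_embed = nn.Embedding(max_coord, dim)   # shared table for y-min and y-max
        self.w_embed = nn.Embedding(max_coord, dim)
        self.h_embed = nn.Embedding(max_coord, dim)

    def forward(self, boxes):
        # boxes: (B, L, 4) long tensor of (x0, y0, x1, y1) on a 0..max_coord-1 grid
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.x_embed(x0) + self.x_embed(x1)
                + self.y_embed(y0) + self.y_embed(y1)
                + self.w_embed((x1 - x0).clamp(min=0))
                + self.h_embed((y1 - y0).clamp(min=0)))

boxes = torch.randint(0, 512, (2, 10, 4))   # the [CLS] token would use an all-zeros box
pos2d = Pos2DEmbedding()(boxes)             # (2, 10, 768), added to the other embeddings
```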
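
Finally, the VideoCLIP-style recipe for bringing video into the same space as text: encode sampled frames with a frozen pre-trained CNN, project the per-frame features with a small trainable MLP into the text model's embedding dimension, and reuse the same contrastive loss over (carefully aligned) video-text pairs. The ResNet-50 stand-in, projection sizes, and mean pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

frame_cnn = models.resnet50(weights=None)    # stand-in for a frozen pre-trained CNN
frame_cnn.fc = nn.Identity()                 # keep the 2048-d pooled frame features
frame_cnn.eval()
for p in frame_cnn.parameters():
    p.requires_grad = False                  # the CNN stays frozen

project = nn.Sequential(nn.Linear(2048, 768), nn.GELU(), nn.Linear(768, 768))  # trainable projection

def encode_video(frames):                    # frames: (T, 3, 224, 224) sampled from the clip
    with torch.no_grad():
        feats = frame_cnn(frames)            # (T, 2048) per-frame features
    tokens = project(feats)                  # (T, 768) "video tokens" in the text model's space
    return tokens.mean(dim=0, keepdim=True)  # simple pooled video embedding, (1, 768)

video_emb = encode_video(torch.randn(8, 3, 224, 224))
# video_emb can now be paired with a pooled text embedding in the same contrastive loss shown above.
```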


Frequently asked questions

  • What is the main focus of this session?

    The session focuses on multimodal understanding of visual and textual content.

  • Which multimodal tasks are discussed in the presentation?

    The tasks include visual question answering, visual commonsense reasoning, caption-based image retrieval, and multimodal fake news detection.

Subtitles (en)
  • 00:00:02
    hi welcome to this session on multimodal
  • 00:00:05
    models part one which is a part of ELL881
  • 00:00:09
    and AIL821 course on large language
  • 00:00:12
    models introduction recent advances at
  • 00:00:15
    IIT Delhi um my name is Manish Gupta and
  • 00:00:18
    I'm a principal applied scientist at
  • 00:00:20
    Microsoft uh so let's get started with
  • 00:00:23
    the session today okay um this is a
  • 00:00:26
    session on multimodal models right um
  • 00:00:30
    multimodal models can actually mean
  • 00:00:31
    several modalities this session is going
  • 00:00:34
    to be primarily focused on vision and
  • 00:00:36
    language tasks also note that this
  • 00:00:39
    session is more on multimodal
  • 00:00:41
    understanding and less on generation to
  • 00:00:43
    begin with of course in the part two
  • 00:00:45
    I'll start talking about generation uh
  • 00:00:48
    text generation specifically where the
  • 00:00:49
    input could be multimodal but in this
  • 00:00:51
    session I'm going to focus more on um
  • 00:00:55
    you know multimodal content
  • 00:00:57
    understanding right so what you see on
  • 00:00:59
    the slide here are various vision and
  • 00:01:02
    language tasks there are several
  • 00:01:04
    scenarios where uh information uh could
  • 00:01:08
    be multimodal in nature um specifically
  • 00:01:11
    text and images right so uh the most
  • 00:01:14
    popular multimodal task I would say is
  • 00:01:16
    visual question answering where the
  • 00:01:18
    input is an image and also a question uh
  • 00:01:22
    along with that image so is there
  • 00:01:24
    something to cut the vegetables with and
  • 00:01:26
    this task is actually called as visual
  • 00:01:27
    question answering where given the image
  • 00:01:29
    and the question the idea is to select
  • 00:01:32
    from a large set of answers right an
  • 00:01:35
    understanding task is about selecting
  • 00:01:36
    from a large set of answers right uh but
  • 00:01:39
    a generation task would be to actually
  • 00:01:41
    generate an answer text another related
  • 00:01:44
    task is VCR or visual Common Sense
  • 00:01:46
    reasoning where uh you know given this
  • 00:01:49
    image of a social situation right and a
  • 00:01:52
    question why is person four pointing at
  • 00:01:55
    person one the idea is to be able to
  • 00:01:57
    figure out from these four options which
  • 00:02:00
    is the most correct option most uh you
  • 00:02:02
    know accurate answer right and then uh
  • 00:02:05
    as an extension to the task people have
  • 00:02:07
    also proposed um you know uh this this
  • 00:02:09
    second order task where uh given an
  • 00:02:12
    image and given the question and an
  • 00:02:14
    answer that was chosen what is a
  • 00:02:16
    rational why is it correct right why is
  • 00:02:18
    the answer correct right so both of them
  • 00:02:21
    involve choosing one of the four
  • 00:02:23
    possible options right here is yet
  • 00:02:26
    another task where then where given an
  • 00:02:29
    image and uh uh you see uh given a
  • 00:02:32
    rectangle you basically want to choose a
  • 00:02:35
    piece of text saying hey well this is
  • 00:02:36
    guy in yellow dribbling ball right
  • 00:02:38
    referring Expressions task yet another
  • 00:02:41
    task is about uh caption based image
  • 00:02:43
    retrieval where you know you have a
  • 00:02:45
    particular caption let's say uh this
  • 00:02:47
    text caption and then you have a large
  • 00:02:49
    bunch of images and you want to retrieve
  • 00:02:51
    the most relevant images okay so the
  • 00:02:53
    idea is that here of course you want to
  • 00:02:54
    basically encode the images and the text
  • 00:02:57
    uh in a in a similar in a you know uh in
  • 00:03:00
    a in a uh single space so that you could
  • 00:03:03
    actually compute similarity between them
  • 00:03:05
    seamlessly and then do good image
  • 00:03:06
    retrieval right so all of them basically
  • 00:03:09
    involve encoding the image and encoding
  • 00:03:11
    the text uh in some form um sometimes
  • 00:03:14
    jointly sometimes uh individually but
  • 00:03:16
    then at some level try to compute
  • 00:03:18
    similarity okay so now what I would do
  • 00:03:21
    is to basically slowly start talking
  • 00:03:22
    about the kind of models people have
  • 00:03:24
    been using to solve these kinds of
  • 00:03:26
    vision and language tasks of course
  • 00:03:27
    there are many other tasks by the way
  • 00:03:29
    right so for example there's also
  • 00:03:30
    multimodal fake uh tweet detection so
  • 00:03:33
    where the Tweet could be multimodal in
  • 00:03:35
    nature there's an image there's a text
  • 00:03:36
    and you want to figure out if this
  • 00:03:38
    combined multimodal tweet is fake or not
  • 00:03:40
    uh similarly there are other tasks like
  • 00:03:43
    uh like like hate speech detection so
  • 00:03:44
    there could be a multimodal document and
  • 00:03:47
    you want to figure out let's say a blog
  • 00:03:49
    or again a tweet right and you want to
  • 00:03:50
    figure out if combined overall including
  • 00:03:53
    the image in the text is it hateful or
  • 00:03:55
    not right so so given all of this I'll
  • 00:03:58
    start talking about models that people
  • 00:04:00
    have built around trying to solve uh
  • 00:04:02
    these these multimodal tasks so
  • 00:04:04
    basically how do you do multimodal
  • 00:04:05
    understanding right now uh to be able to
  • 00:04:08
    do multimodal understanding you need to
  • 00:04:09
    encode both the images both the image as
  • 00:04:12
    well as the text now text understanding
  • 00:04:14
    of course you have done over multiple
  • 00:04:16
    lectures in this in this course uh let
  • 00:04:18
    me actually talk about how do you how do
  • 00:04:20
    you encode images nicely right so far uh
  • 00:04:24
    you know or or um until 2020 let's say
  • 00:04:27
    people used to use uh convolution neural
  • 00:04:29
    network to encode images people still
  • 00:04:31
    use them but what has become popular
  • 00:04:33
    also is to basically use these Vision
  • 00:04:34
    Transformers to encode images okay so
  • 00:04:37
    the idea is that given a particular
  • 00:04:39
    image of this kind you're going to first
  • 00:04:41
    split it into fixed size patches for
  • 00:04:43
    example this guy is basically split into
  • 00:04:45
    fixed size n patches and uh these
  • 00:04:48
    patches are linearly embedded so
  • 00:04:50
    essentially uh you have some sort of a
  • 00:04:52
    projection which sort of embeds them
  • 00:04:53
    linearly so you have a embedding
  • 00:04:55
    computed per patch and then you also add
  • 00:04:58
    positional embeddings uh to this to each
  • 00:05:00
    of those patch embeddings and you feed
  • 00:05:02
    them the resultant vectors along with
  • 00:05:04
    the CLS token to a Transformer encoder
  • 00:05:07
    right so that's basically uh the way
  • 00:05:09
    these Vision Transformers work so ViT
  • 00:05:11
    or the vision Transformer model
  • 00:05:13
    basically takes an image splits it into fixed
  • 00:05:15
    size patches linearly embeds them adds a
  • 00:05:17
    position embedding and passes it as
  • 00:05:19
    input to the uh Transformer encoder
  • 00:05:21
    along with the CLS token on the CLS
  • 00:05:23
    token you could of course attach a MLP
  • 00:05:25
    head and use it for various
  • 00:05:26
    classification purposes for example in
  • 00:05:28
    this particular case maybe you want to
  • 00:05:29
    do thousand class classification so you
  • 00:05:32
    basically taken the image and you passed
  • 00:05:34
    it to through through this Transformer
  • 00:05:36
    encoder and you're going to attach a
  • 00:05:38
    thousand sized output
  • 00:05:41
    layer right thousand neurons in the
  • 00:05:43
    output okay um now uh the ViT model uh and
  • 00:05:48
    the transform encoder is a standard
  • 00:05:49
    encoder you make no changes
  • 00:05:50
    essentially you just have the multi-head
  • 00:05:52
    self attention you have the MLP feed
  • 00:05:54
    forward layer and so on right um the ViT
  • 00:05:57
    model comes in three different sizes 12
  • 00:05:59
    layer 24 layer 32 layer base large and
  • 00:06:01
    huge sizes with different hidden
  • 00:06:03
    dimensions and number of attention heads
  • 00:06:05
    and so on right the largest one is
  • 00:06:08
    basically 632 million parameters this
  • 00:06:10
    model was pre-trained using obvious data
  • 00:06:12
    sets ImageNet-1K ImageNet-21K and JFT
  • 00:06:17
    right um see these models basically uh
  • 00:06:19
    differ in terms of not just the base
  • 00:06:21
    large and huge the sizes but they also
  • 00:06:23
    differ in terms of the patch IM patch
  • 00:06:26
    sizes that they take as input so for
  • 00:06:28
    example ViT-L/16 basically takes 16 cross
  • 00:06:31
    16 input patch size so of course smaller
  • 00:06:34
    patch size would mean larger sequence
  • 00:06:36
    length but there is basically a
  • 00:06:38
    trade-off between uh and then larger
  • 00:06:40
    sequence length would mean latency
  • 00:06:41
    higher latency and so on uh so but it
  • 00:06:44
    could also mean higher accuracy in that
  • 00:06:45
    senses so that's that and what what
  • 00:06:47
    people have shown is that Vision
  • 00:06:48
    Transformers match or exceed residual
  • 00:06:51
    networks on many many image
  • 00:06:52
    classification data sets and therefore
  • 00:06:55
    people have started using transform
  • 00:06:57
    models Vision transform models for
  • 00:06:58
    encoding images as well okay now you
  • 00:07:01
    would be thinking hey where is
  • 00:07:02
    multimodality here but this was really
  • 00:07:04
    important to build there's no
  • 00:07:05
    multimodality here so far it was
  • 00:07:07
    important to build so as to bring this
  • 00:07:09
    notion that yes you could use transform
  • 00:07:11
    models even to encode images right and
  • 00:07:13
    now here is our first multimodal model
  • 00:07:15
    it's called VisualBERT right so the
  • 00:07:17
    model is actually called VisualBERT and
  • 00:07:18
    as you can imagine it sort of is going
  • 00:07:20
    to integrate BERT for text and you know
  • 00:07:24
    Vision also um as using using transform
  • 00:07:28
    model itself okay so the idea is uh uh
  • 00:07:31
    that um uh you have to somehow uh
  • 00:07:34
    understand how to pre-train this model
  • 00:07:35
    so VisualBERT is a pre-trained
  • 00:07:37
    multimodal model now if you remember
  • 00:07:39
    BERT how was it pre-trained the
  • 00:07:41
    pre-training goal was to be able to
  • 00:07:42
    learn a task agnostic language model
  • 00:07:45
    such that it understands English like a
  • 00:07:47
    second grade third grade kid okay and
  • 00:07:50
    the way it was done pre-training data
  • 00:07:52
    is is always obtained in a
  • 00:07:53
    self-supervised manner in the sense that
  • 00:07:56
    there should be large amounts of this
  • 00:07:57
    data and no extra human labeling
  • 00:07:59
    required right now when you think about
  • 00:08:01
    a multimodal pre-training what you want
  • 00:08:03
    is a model which can basically
  • 00:08:05
    understand how to link U you know uh
  • 00:08:08
    portions of those images or patches in
  • 00:08:10
    those images with text words right so
  • 00:08:12
    essentially patches in images with words
  • 00:08:15
    right so it should basically not just so
  • 00:08:17
    so uh unlike BERT which basically just
  • 00:08:19
    needs to understand relationship between
  • 00:08:21
    various words uh VisualBERT needs to
  • 00:08:24
    understand the relationship between
  • 00:08:25
    these text words and visual patches of
  • 00:08:28
    images okay how how do you enable that
  • 00:08:30
    what kind of pre-training data would you use
  • 00:08:32
    so what folks realized is that there's a
  • 00:08:34
    whole bunch of image captioning data
  • 00:08:36
    which is available and therefore they
  • 00:08:37
    said that hey we'll use image captioning
  • 00:08:39
    data image text pair so as to be able to
  • 00:08:41
    pre-train this model so you have images
  • 00:08:43
    and you have a corresponding caption
  • 00:08:45
    associated with it so VisualBERT
  • 00:08:47
    basically leveraged the MS COCO data
  • 00:08:49
    which is basically 120,000 images along
  • 00:08:52
    with each with five different captions
  • 00:08:54
    leading to a data set of 600,000 image
  • 00:08:57
    text pairs right and they basically use
  • 00:09:00
    the standard so they use the standard
  • 00:09:02
    transform encoder model uh to be able uh
  • 00:09:04
    to train um this this pre-training to do
  • 00:09:07
    this pre-training okay so uh the way
  • 00:09:10
    they did this did this is basically to
  • 00:09:13
    uh add um you see there's a there's a
  • 00:09:16
    CLS token and then there is the text
  • 00:09:17
    caption which is going in as input and
  • 00:09:19
    there is of course the image uh pieces
  • 00:09:21
    that are going in as input okay now U
  • 00:09:24
    let's let's understand this step by step
  • 00:09:26
    so essentially um uh just like one would
  • 00:09:29
    do masked language modeling in BERT here they
  • 00:09:32
    also mask text so some text token so
  • 00:09:35
    that's obvious and then text and images
  • 00:09:37
    are separated by a separator token which
  • 00:09:38
    is also obvious now what are these image
  • 00:09:40
    pieces well these image pieces basically
  • 00:09:43
    um are not tiles or patches as you
  • 00:09:45
    observed on the previous slide but well
  • 00:09:47
    in their case they actually used an
  • 00:09:49
    object detection model called as faster
  • 00:09:50
    R-CNN so as to basically take this image
  • 00:09:53
    and divide it into different patches uh
  • 00:09:55
    such that each patch actually captures a
  • 00:09:58
    relevant object like a cap or a ball or
  • 00:10:00
    a tennis racket or a shirt and so on
  • 00:10:02
    okay so they basically take objects uh
  • 00:10:05
    which are returned by this Faster R-CNN
  • 00:10:07
    model uh and pass them as inputs in you
  • 00:10:10
    know as as input image tokens so what
  • 00:10:12
    are your image words so to say well or
  • 00:10:15
    image tokens they're basically objects
  • 00:10:16
    as detected by the Faster R-CNN model Faster
  • 00:10:19
    R-CNN object detector okay um uh now to
  • 00:10:22
    encode these things they have like three
  • 00:10:24
    different kinds of U um you know things
  • 00:10:26
    that go as input at every position there
  • 00:10:28
    is of course a position embedding that
  • 00:10:30
    is obvious in any Transformer model
  • 00:10:31
    there's also segment embedding which
  • 00:10:33
    basically tells you whether this is a
  • 00:10:34
    text segment or an image segment and
  • 00:10:36
    there's also a token or image embedding
  • 00:10:37
    so of course you know there need to be
  • 00:10:39
    some text features or image features
  • 00:10:41
    that need to be passed for tokens
  • 00:10:42
    basically it's the token embedding and
  • 00:10:44
    for images it basically comprises the
  • 00:10:46
    features from Faster R-CNN and so on okay
  • 00:10:49
    so all of this is combined and then fed
  • 00:10:51
    to the standard Transformer model uh
  • 00:10:54
    Transformer encoder model and um uh then
  • 00:10:57
    you basically train pre-train this
  • 00:10:59
    Transformer model using two different
  • 00:11:01
    objective functions we'll talk about
  • 00:11:02
    these two objective functions next uh
  • 00:11:05
    the first one is very simple it's just
  • 00:11:06
    the masked language modeling so masked language
  • 00:11:08
    modeling remember no masked image modeling so
  • 00:11:10
    on the image side you do not really hide
  • 00:11:12
    any of those images but on the text side
  • 00:11:14
    you mask these text pieces and the idea
  • 00:11:16
    is that just like in BERT uh you know
  • 00:11:19
    masked language modeling aims to be able
  • 00:11:21
    to uh predict the masked word at the
  • 00:11:23
    output at the same position by
  • 00:11:26
    leveraging knowledge uh or borrowing
  • 00:11:28
    knowledge uh from other tokens unmasked
  • 00:11:30
    tokens right in the same way in visual
  • 00:11:33
    BERT the idea is that at the same
  • 00:11:35
    position you know the model should be
  • 00:11:37
    able to guess the hidden word or the
  • 00:11:39
    masked word um um you know by leveraging
  • 00:11:43
    uh Knowledge from the unmasked uh text
  • 00:11:46
    words and also of course all the image
  • 00:11:48
    tokens because no image tokens are masked
  • 00:11:51
    so you know the accuracy here for masked
  • 00:11:53
    language modeling should be ideally
  • 00:11:54
    higher you know because now the masked word
  • 00:11:57
    let's say the tennis racket uh tennis can
  • 00:12:00
    can possibly be guessed not just based
  • 00:12:02
    on Racket and ball and so on but also
  • 00:12:05
    based on what the model sees in the
  • 00:12:07
    image in that senses right so this
  • 00:12:09
    knowledge should be able to help um you
  • 00:12:11
    know improve the masked language
  • 00:12:12
    modeling accuracy on the way making the
  • 00:12:14
    model learn how to um relate uh this
  • 00:12:17
    particular image with the word called
  • 00:12:20
    tennis okay so that's that the second
  • 00:12:23
    task is and and by the way these these
  • 00:12:25
    must you know this objective one is
  • 00:12:27
    going to be computed masked language
  • 00:12:29
    modeling objective is going to be
  • 00:12:30
    computed only on those tokens uh where
  • 00:12:33
    the text words or text tokens were masked
  • 00:12:35
    okay now on the other hand let's talk
  • 00:12:37
    about this objective two it's basically
  • 00:12:40
    the sentence image prediction task the
  • 00:12:42
    idea is that um you see in a particular
  • 00:12:44
    batch of samples you would give image
  • 00:12:47
    text pairs but half of those pairs are
  • 00:12:49
    going to be positive pairs half of them
  • 00:12:50
    are going to be negative pairs now what
  • 00:12:52
    is a positive pair positive pair
  • 00:12:53
    basically means that the caption is
  • 00:12:55
    linked with the image itself and you
  • 00:12:58
    know which is which means that is
  • 00:12:59
    similar it is relevant for the image
  • 00:13:02
    right and then the negative pair would
  • 00:13:03
    be this caption is linked with the
  • 00:13:05
    negative image you know irrelevant image
  • 00:13:07
    to the caption okay for the sample
  • 00:13:09
    positive and negative pairs create a
  • 00:13:10
    batch and then you know the objective
  • 00:13:12
    two is basically all about figuring out
  • 00:13:14
    whether uh the attent whether the you
  • 00:13:17
    know uh the the MLP head out there um is
  • 00:13:21
    is able to predict correctly whether
  • 00:13:22
    this is a positive pair or a negative
  • 00:13:23
    pair right so that's basically that um
  • 00:13:26
    now the interesting part about visual
  • 00:13:28
    BERT is that uh you could basically now
  • 00:13:31
    look at the attention weights of some
  • 00:13:34
    selected heads uh at the at the output
  • 00:13:36
    layer at the last layer and uh then by
  • 00:13:39
    looking at those attention weights you
  • 00:13:41
    can visualize them and try to see uh
  • 00:13:44
    remember in self attention you have like
  • 00:13:46
    now words text words paying attention to
  • 00:13:48
    uh image uh pieces and now you can try
  • 00:13:51
    to see you know how are they correlated
  • 00:13:53
    okay so let me look at layer 11 and uh
  • 00:13:57
    you see uh this this heat map is
  • 00:13:59
    basically drawn by showing um uh
  • 00:14:02
    attention between text words you see
  • 00:14:05
    there and image tokens you see there
  • 00:14:07
    right so five image five image tokens
  • 00:14:09
    you know uh corresponding to man which
  • 00:14:11
    are also highlighted in this image by
  • 00:14:13
    the way so the red red one is man
  • 00:14:15
    the you know bluish one is shirt and you
  • 00:14:18
    know the bluish kind of stuff is
  • 00:14:20
    sidewalk and so on right so what do You
  • 00:14:22
    observe is that uh so if I look at the
  • 00:14:25
    word man right fortunately it has very
  • 00:14:28
    high you know um attention weight for
  • 00:14:32
    for the for the for the for the box man
  • 00:14:34
    in that sense for the image piece man
  • 00:14:36
    okay and that's very useful and nice
  • 00:14:38
    because it sort of nicely tells us that
  • 00:14:40
    the model is actually learning to
  • 00:14:42
    correlate pieces in the image with the
  • 00:14:44
    tokens in the text okay uh so and so on
  • 00:14:46
    you can actually pause the video here
  • 00:14:47
    and essentially observe that this holds
  • 00:14:49
    also for other kinds of things like
  • 00:14:51
    sidewalk pedestrians shirt and so on
  • 00:14:53
    okay now after this people tried to
  • 00:14:56
    improve the architecture they came up
  • 00:14:57
    with this architecture new model called
  • 00:14:59
    as ViLBERT okay by the way you can also
  • 00:15:01
    look at these papers most of my slides
  • 00:15:02
    actually have citations for these papers
  • 00:15:04
    at the at the bottom okay um so ViLBERT
  • 00:15:08
    basically believes in a two Tower model
  • 00:15:10
    unlike VisualBERT VisualBERT was a
  • 00:15:12
    single tower model the concatenation of
  • 00:15:14
    text and image modalities sort of happened
  • 00:15:16
    right in the right in the first layer
  • 00:15:17
    right in the zeroth layer in that senses
  • 00:15:19
    okay but ViLBERT basically believes in
  • 00:15:21
    processing text separately using a few
  • 00:15:24
    few layers Transformer layers as you see
  • 00:15:26
    them here and processing the image
  • 00:15:28
    separately using a few layers now these
  • 00:15:29
    layers could be you know um uh
  • 00:15:32
    Transformer layers and so on so um so uh
  • 00:15:35
    notice that the text stream in ViLBERT
  • 00:15:38
    actually has much more processing before
  • 00:15:39
    interacting with the visual features but
  • 00:15:41
    as I was saying well it's a two Tower
  • 00:15:42
    model where the text and image are
  • 00:15:45
    processed separately in their own
  • 00:15:46
    pipelines but then there is also a
  • 00:15:48
    fusion which happens where core
  • 00:15:50
    Transformer layers or Co attention based
  • 00:15:52
    Transformer layers basically try to fuse
  • 00:15:54
    the information across both the
  • 00:15:55
    pipelines okay uh and then there are
  • 00:15:57
    other Transformer layers further which
  • 00:15:59
    basically do individual processing
  • 00:16:00
    separately okay now the interesting part
  • 00:16:03
    is that this linguistic stream is
  • 00:16:05
    basically BERT-based and for for you
  • 00:16:09
    know um uh BERT base and then you know
  • 00:16:12
    uh again the for for the visual stream
  • 00:16:14
    essentially you use uh Faster R-CNN to
  • 00:16:17
    essentially figure out those image
  • 00:16:19
    patches and so on okay this Faster R-CNN
  • 00:16:21
    both in VisualBERT and ViLBERT is
  • 00:16:23
    pre-trained on the Visual Genome data set okay
  • 00:16:25
    okay so now how is this Co attention how
  • 00:16:27
    do these Co Transformer layers work work
  • 00:16:29
    right so a standard Transformer layer it
  • 00:16:31
    has the typical standard self attention
  • 00:16:32
    and feed forward right with the qkv the
  • 00:16:35
    query keys and values coming from a
  • 00:16:37
    single stream right but in ViLBERT you
  • 00:16:39
    have two different streams the visual
  • 00:16:41
    stream and then the um the the
  • 00:16:42
    linguistic stream right uh and what you
  • 00:16:45
    do uh for transferring information
  • 00:16:48
    across the two is to basically use uh
  • 00:16:50
    the query of the same modality but the
  • 00:16:53
    you know the keys and the values coming
  • 00:16:55
    from the other modality from the
  • 00:16:56
    linguistic stream in this particular
  • 00:16:57
    example right in this particular case
  • 00:16:59
    and in the other case in the linguistic
  • 00:17:00
    stream again you're going to use the
  • 00:17:01
    query from the linguistic stream but
  • 00:17:03
    you're going to use the keys and the
  • 00:17:04
    values from the visual stream so as to
  • 00:17:05
    essentially do this cross-pollination of
  • 00:17:08
    information um in the in the attenion
  • 00:17:10
    layer right and then of course you do
  • 00:17:11
    this typical standard feed forward with
  • 00:17:13
    all those residual connections add and
  • 00:17:14
    normalization and so on okay so that's
  • 00:17:17
    how ViLBERT works now another
  • 00:17:19
    interesting part about ViLBERT is that
  • 00:17:21
    rather than depending on manually
  • 00:17:23
    labeled Ms Coco data they actually
  • 00:17:25
    depended on Conceptual Captions data this
  • 00:17:27
    data set of conceptual captions is
  • 00:17:29
    basically obtained in an automated
  • 00:17:31
    manner by scraping things from the web
  • 00:17:33
    so the idea is that on the web there are
  • 00:17:35
    Wikipedia Pages news articles and many
  • 00:17:37
    other many other web pages where you
  • 00:17:39
    have an image and underneath that
  • 00:17:40
    there's a caption okay there are also
  • 00:17:43
    images with very nice alt tags
  • 00:17:45
    associated with them these serve as
  • 00:17:47
    really good sources for caption
  • 00:17:49
    information along with images and that
  • 00:17:51
    is what the ViLBERT guys basically leverage um
  • 00:17:54
    so as to uh essentially U you know um
  • 00:17:58
    create the visual uh grounding data
  • 00:18:00
    right uh visual image text pairs
  • 00:18:02
    essentially and then then they pre-train
  • 00:18:03
    the ViLBERT model based on that
  • 00:18:06
    now in the ViLBERT model again there are two
  • 00:18:08
    pre-training loss functions so just like
  • 00:18:10
    in VisualBERT there is a multimodal
  • 00:18:11
    alignment prediction function to
  • 00:18:13
    basically just predict whether this
  • 00:18:14
    image and text are aligned with each
  • 00:18:16
    other or not and that's basically the
  • 00:18:17
    same as VisualBERT so there's no no
  • 00:18:18
    difference in that sense except the
  • 00:18:19
    change in the name right but then on the
  • 00:18:22
    other hand the masked language modeling is
  • 00:18:23
    actually now extended to to to become
  • 00:18:25
    masked multimodal learning okay the
  • 00:18:28
    interesting part part is that you know
  • 00:18:30
    rather than just masking out text you
  • 00:18:31
    can actually also mask out image
  • 00:18:33
    pieces uh in 2019 you know there was no
  • 00:18:37
    good technology to basically generate
  • 00:18:39
    back image pieces themselves in the same
  • 00:18:43
    position but what you could do or rather
  • 00:18:45
    what ViLBERT does is to basically
  • 00:18:46
    generate a distribution over a set of
  • 00:18:48
    class labels so of course you know
  • 00:18:50
    because you're using Faster R-CNN you know what
  • 00:18:52
    particular object this particular
  • 00:18:54
    position indicates the image piece at
  • 00:18:56
    this position indicates and then the
  • 00:18:58
    idea is that
  • 00:18:59
    ViLBERT uh the goal or the objective is
  • 00:19:01
    such that the ViLBERT model is motivated
  • 00:19:04
    to learn the right pred right
  • 00:19:06
    distribution over those objects uh at
  • 00:19:09
    the at the output at the same position
  • 00:19:11
    okay if it learns great else there's a
  • 00:19:12
    backpropagation cross-entropy loss
  • 00:19:14
    back propagated right so that's
  • 00:19:15
    basically masked multimodal learning a great
  • 00:19:17
    extension from just that masked language
  • 00:19:19
    modeling as done in visual
  • 00:19:21
    BERT okay so that's great now the idea
  • 00:19:25
    behind ViLBERT and VisualBERT so far is
  • 00:19:27
    that you have imaged is that you have
  • 00:19:29
    image text Pairs and you could do a nice
  • 00:19:32
    uh uh you know modeling of both the
  • 00:19:35
    modalities together and come up with
  • 00:19:37
    these interesting embeddings in that
  • 00:19:39
    senses and uh hopefully it gives you
  • 00:19:41
    good accuracies however uh over time
  • 00:19:45
    people have moved to using contrastive
  • 00:19:46
    training contrastive loss Based training
  • 00:19:49
    and that is what the CLIP model is also
  • 00:19:50
    famous for as it says it's contrastive
  • 00:19:53
    language image pre-training okay so the
  • 00:19:56
    way the CLIP model works is that it's also
  • 00:19:58
    two Tower model in that senses and uh
  • 00:20:01
    then it has a contrastive loss right at
  • 00:20:02
    the very end in that senses okay so uh
  • 00:20:06
    you use uh uh you know I mean of course
  • 00:20:09
    you have a text caption and you have an
  • 00:20:10
    image as well so use the text caption is
  • 00:20:12
    to basically pre-train a text encoder or
  • 00:20:15
    you know you use a pre-trained text
  • 00:20:17
    encoder and you you use it to encode the
  • 00:20:19
    text in that senses you use a pre-trained
  • 00:20:21
    image encoder to encode the images and
  • 00:20:23
    then so in their particular case in fact
  • 00:20:25
    they they essentially used a 12-layer
  • 00:20:28
    Transformer for the text encoder and
  • 00:20:30
    for the image encoder well they actually
  • 00:20:31
    experimented with quite a few so there
  • 00:20:34
    are five ResNets they still believed in
  • 00:20:35
    convolutional neural networks and you know
  • 00:20:38
    uh three different uh ViT models so ViT
  • 00:20:40
    base and ViT large with different patch
  • 00:20:42
    sizes 32 16 and 14 as you see okay uh
  • 00:20:46
    and then what do they do they basically
  • 00:20:48
    pre-train this with a contrastive uh loss
  • 00:20:50
    that I'm going to explain very soon
  • 00:20:52
    using 400 million web image text image
  • 00:20:55
    text pairs okay so the data set is
  • 00:20:57
    called Web image text and basically 400
  • 00:20:59
    million image Comm text pairs okay uh so
  • 00:21:02
    you see I mean VisualBERT was on 600k
  • 00:21:04
    image text pairs ViLBERT on 3 million
  • 00:21:06
    uh you know CLIP is basically on 400
  • 00:21:08
    million and the interesting part is it
  • 00:21:09
    also uses contrastive losses to do
  • 00:21:11
    pre-training okay now the way this
  • 00:21:13
    pre-training works is that uh you have
  • 00:21:16
    image you have text tokens and you have
  • 00:21:18
    embeddings for each of those text tokens
  • 00:21:20
    at different positions from the text
  • 00:21:21
    encoder just 12 Transformer you also
  • 00:21:23
    have image tokens and you have a
  • 00:21:24
    representation for each of those uh for
  • 00:21:26
    various image pieces right so what you
  • 00:21:29
    do is basically you try to compare uh
  • 00:21:31
    these and and by by the way by the way
  • 00:21:34
    you know from a from a from a batch of n
  • 00:21:37
    instances let's say if I have a batch of
  • 00:21:38
    n samples so consider n real pairs
  • 00:21:42
    real image text pairs n samples in a
  • 00:21:44
    batch okay what you're going to do is to
  • 00:21:47
    basically take a pooled text embedding a
  • 00:21:49
    pooled image embedding and you're going
  • 00:21:50
    to compute cosine similarity between them
  • 00:21:53
    so I1 dot T1 I1 dot T2 so given a batch
  • 00:21:56
    let's say batch of 20 you're basically
  • 00:21:57
    going to end up with 400 similarities
  • 00:21:59
    because you have a 20 um 20 image um
  • 00:22:03
    embeddings and 20 text embeddings you
  • 00:22:04
    get like 400 different similarities of
  • 00:22:06
    course what you want what you what you
  • 00:22:08
    know is that there are only 20 pairs so
  • 00:22:10
    therefore there are only 20 positive
  • 00:22:12
    pairs right and if you really did all of
  • 00:22:14
    this you know 20 cross 20 kind of
  • 00:22:16
    computation you have like 400 minus 20
  • 00:22:19
    380 negative pairs what you want to do
  • 00:22:21
    is to maximize the cosine similarity of
  • 00:22:23
    the image and text embeddings of in real
  • 00:22:24
    pairs versus minimizing cosine
  • 00:22:27
    similarity of the embeddings of the
  • 00:22:28
    incorrect pairings right so 380 incorrect
  • 00:22:31
    pairings so you want to maximize those
  • 00:22:33
    20 the diagonal right and minimize the
  • 00:22:35
    similarity um for for those remaining
  • 00:22:37
    380 which are negative right so this is
  • 00:22:41
    what gives basically awesome uh accuracy
  • 00:22:43
    values in fact CLIP was tested on 30
  • 00:22:45
    plus computer vision tasks like OCR
  • 00:22:47
    action recognition videos and so on so
  • 00:22:48
    forth and they basically found CLIP to
  • 00:22:50
    be really doing very well even in a zero
  • 00:22:52
    zero shot manner okay uh uh it was
  • 00:22:57
    better than uh
  • 00:22:58
    it was it was it was found to be better
  • 00:23:00
    than even fully supervised baselines
  • 00:23:02
    okay so here are more details about CLIP
  • 00:23:04
    so essentially as you notice here uh we
  • 00:23:07
    have uh you know uh we have uh so
  • 00:23:11
    essentially you can actually use CLIP
  • 00:23:13
    even for zero-shot classes and for
  • 00:23:15
    classification problems which basically
  • 00:23:17
    involve new classes at test time okay so
  • 00:23:20
    for example what you could do uh is that
  • 00:23:23
    you can take an image and uh let's say
  • 00:23:25
    you have some new class labels right at
  • 00:23:27
    test time and you to figure out if this
  • 00:23:29
    new test time class label holds good for
  • 00:23:31
    this image or not okay all you need to
  • 00:23:33
    do is to basically take that class label
  • 00:23:34
    pass it through a text encoder and the
  • 00:23:36
    text encoder learns a text embedding and
  • 00:23:38
    then and this is at inference time by
  • 00:23:39
    the way right so basically you take the
  • 00:23:41
    image pass through the image encoder get
  • 00:23:42
    an image embedding and just try to
  • 00:23:44
    compute the similarities whichever has
  • 00:23:45
    the highest similarity is the one that
  • 00:23:47
    you actually predict as the as the right
  • 00:23:49
    caption or the right text uh class label
  • 00:23:51
    for this particular image okay that's
  • 00:23:54
    that now the interesting part so what
  • 00:23:56
    they did was to compare a zero-shot
  • 00:23:57
    CLIP with the ResNet-50 supervised
  • 00:24:00
    model across several the several of
  • 00:24:02
    these data sets so notice CLIP is zero
  • 00:24:05
    shot it's not fine tuned on any of these
  • 00:24:07
    data sets but ResNet is not zero-shot I
  • 00:24:10
    mean it's actually you take the
  • 00:24:11
    pre-trained ResNet and fine tune it on
  • 00:24:13
    the training set of these data sets and
  • 00:24:15
    what they observed is that among these
  • 00:24:17
    data sets you know several of these data
  • 00:24:19
    sets CLIP actually gives you a positive
  • 00:24:21
    Improvement significantly positive
  • 00:24:22
    improvements compared to uh compared to
  • 00:24:25
    resnet okay uh here are a few examples
  • 00:24:28
    based on how CLIP performs so here's an
  • 00:24:30
    example from food 101 data set um you
  • 00:24:33
    know uh nicely predicts that this
  • 00:24:36
    particular food is not you know any of
  • 00:24:38
    those but guacamole right and then you
  • 00:24:41
    can also use it for other kinds of
  • 00:24:42
    classification problems like classifying
  • 00:24:44
    uh you know what setting is this is it a
  • 00:24:45
    television Studio Podium indoor
  • 00:24:47
    conference room lecture room control
  • 00:24:49
    room and so on you could also basically
  • 00:24:50
    try to figure out what particular object
  • 00:24:52
    is highlighted in the image or you could
  • 00:24:54
    basically use it for classifying uh the
  • 00:24:56
    land use type so whether it is a
  • 00:24:58
    permanent crop land pasture land you know
  • 00:25:01
    highway or road or Ocean or you know
  • 00:25:03
    shrubland and so on so forth okay so that's
  • 00:25:07
    CLIP okay um well now similar kind of
  • 00:25:10
    models have also been trained and used
  • 00:25:12
    for uh doing document understanding a
  • 00:25:15
    visually Rich document understanding all
  • 00:25:16
    right so these are scans of various
  • 00:25:19
    kinds of documents so for example what
  • 00:25:21
    you see here is essentially um uh some
  • 00:25:24
    sort of key value pairs highlighted in
  • 00:25:26
    this interesting clear clearance sheet
  • 00:25:29
    as such but you could also have scans
  • 00:25:31
    for invoices and so on okay uh the
  • 00:25:33
    interesting part is that uh using uh a
  • 00:25:36
    model popularly called as layout LM uh
  • 00:25:38
    of course I'll also talk a little bit
  • 00:25:39
    about on the next few slides right uh
  • 00:25:42
    what one could do is to nicely extract
  • 00:25:43
    the key value pairs from this document
  • 00:25:45
    okay one can also basically do question
  • 00:25:47
    answering on these documents so uh here
  • 00:25:49
    is a postcard scan and you could
  • 00:25:52
    basically then ask questions like
  • 00:25:54
    mention the ZIP code written and then it
  • 00:25:56
    can nicely figure out that the ZIP code
  • 00:25:57
    is that you can also ask it for the date
  • 00:25:59
    on the seal at the top and nicely
  • 00:26:01
    figures out the seal and so on so forth
  • 00:26:04
    the the date on the seal okay you could
  • 00:26:06
    also use it for legal contract
  • 00:26:07
    Management in that senses that given
  • 00:26:09
    document scans you could basically ask
  • 00:26:11
    it to highlight what are the important
  • 00:26:12
    legal uh phrases that I must be uh
  • 00:26:15
    paying attention to or extract just key
  • 00:26:17
    value pairs so basically which parties
  • 00:26:19
    signed the document or when was it
  • 00:26:20
    signed and so on okay uh of course if
  • 00:26:23
    you have U lots of documents on users
  • 00:26:27
    one drive or Google Drive accounts you
  • 00:26:28
    could try to build an app which can
  • 00:26:30
    classify those documents or categorize
  • 00:26:32
    them into popular categories like uh
  • 00:26:35
    like you know personal identification
  • 00:26:36
    documents like passport and pan cards
  • 00:26:38
    and so on uh while uh while another
  • 00:26:41
    category could just be all kinds of
  • 00:26:42
    invoices utility bills and so on right
  • 00:26:45
    you could of course also use these kinds
  • 00:26:46
    of models to do U recognition over
  • 00:26:48
    Walmart receipts or any other
  • 00:26:49
    Supermarket receipts in that sensus and
  • 00:26:52
    the main model for doing this kind of
  • 00:26:54
    visually rich document processing a
  • 00:26:56
    very popular model is layout LM and you
  • 00:26:58
    know there are of course various
  • 00:27:00
    versions layout LM V1 V2 there's also a
  • 00:27:02
    layout llm in that senses you know
  • 00:27:04
    motivate you folks to go ahead and look
  • 00:27:06
    at it later but U what is interesting is
  • 00:27:10
    uh that it basically uses transform
  • 00:27:12
    models okay in the particular case they
  • 00:27:13
    used a Transformer called as UniLMv2 to
  • 00:27:16
    to initialize and of course uh then they
  • 00:27:19
    basically took domain specific data
  • 00:27:20
    layout visually You Know Rich layout
  • 00:27:24
    data and they basically tried to train
  • 00:27:26
    this model using um using document
  • 00:27:29
    specific loss functions as well okay um
  • 00:27:33
    so very broadly what they do is to take
  • 00:27:34
    the document and uh they mask out
  • 00:27:37
    certain lines on this document so they
  • 00:27:39
    hide out certain lines I'll call them
  • 00:27:41
    hidden out lines in that senses okay um
  • 00:27:44
    uh then they basically also um you know
  • 00:27:47
    uh so this this hidden out uh you know
  • 00:27:50
    image is divided into different parts
  • 00:27:52
    and then encode it using a visual
  • 00:27:54
    encoder right on the other hand you take
  • 00:27:57
    uh so these
  • 00:27:58
    you basically take uh uh the document
  • 00:28:01
    and then uh you take the lines from the
  • 00:28:04
    document and uh essentially for the
  • 00:28:06
    lines which are not covered you
  • 00:28:09
    basically have each for each line you
  • 00:28:11
    basically have this notion whether it is
  • 00:28:13
    covered or not you Bas well when they
  • 00:28:15
    when they hide out they don't hide out
  • 00:28:17
    partial lines they hide out an entire
  • 00:28:18
    line and so on okay so you have a
  • 00:28:20
    covered line and a non-covered line okay
  • 00:28:23
    and uh uh what you do is to basically
  • 00:28:25
    take the document and um you know you
  • 00:28:27
    have an OCR PDF parser which gives you
  • 00:28:29
    text so that's how you get these text
  • 00:28:31
    tokens so you basically have the text
  • 00:28:32
    tokens and you can of course do masked
  • 00:28:34
    language modeling so therefore some
  • 00:28:35
    tokens are masked as well remember the
  • 00:28:37
    hiding part is different from masking
  • 00:28:38
    part okay of course you can mask out
  • 00:28:41
    text tokens so so the OCR is actually done
  • 00:28:43
    on the on the non-hidden version of the
  • 00:28:45
    document so that the OCR quality is good
  • 00:28:47
    in that senses right but then you have
  • 00:28:49
    this information which line is hidden or
  • 00:28:50
    not okay now as you see the Transformer
  • 00:28:53
    is being passed four different things so
  • 00:28:54
    of course the first one is basically
  • 00:28:55
    segment embeddings whether it is
  • 00:28:57
    basically process ing image tokens
  • 00:28:59
    versus is it processing um you know text
  • 00:29:02
    tokens and then on the text tokens also
  • 00:29:04
    you could basically say whether it is a
  • 00:29:05
    masked token or a non-masked token yellow or
  • 00:29:08
    blue all uh of course you pass a 1D
  • 00:29:11
    position embedding so essentially you
  • 00:29:12
    must pass some position some notion of
  • 00:29:14
    position right you also pass two
  • 00:29:16
    dimensional position embeddings for the
  • 00:29:17
    box so for example for the text tokens
  • 00:29:19
    essentially sorry for the for the image
  • 00:29:21
    tokens you have a box so essentially uh
  • 00:29:24
    you know uh X ywh and uh you can
  • 00:29:27
    actually also um uh encode width and
  • 00:29:29
    height of the box as part of the um uh
  • 00:29:32
    2D position embeddings for the um uh and
  • 00:29:35
    and then for the CLS token you actually
  • 00:29:37
    um you know pad with a box with all the
  • 00:29:40
    six things xmin x max y Min y Max and
  • 00:29:43
    height and width initialized to all
  • 00:29:44
    zeros set to all zeros right okay uh and
  • 00:29:48
    U uh so that's that so that's how you
  • 00:29:49
    basically have 2D position embeddings then
  • 00:29:51
    you also have the text and uh visual
  • 00:29:53
    embedding so um essentially um you use
  • 00:29:57
    um you know Mask R-CNN embeddings uh
  • 00:30:00
    because you're using that as the visual
  • 00:30:02
    encoder here right U and uh for the text
  • 00:30:05
    you basically just use the uh use the
  • 00:30:07
    standard text embeddings okay uh now the
  • 00:30:10
    Transformer yes they experiment with the
  • 00:30:11
    two different models base size and large
  • 00:30:13
    size 12 layers and 24 layers and uh
  • 00:30:16
    basically which means 200 million or 426
  • 00:30:18
    million parameters okay uh so now you
  • 00:30:21
    know pre-training objectives so there are
  • 00:30:22
    three different pre-training objectives masked
  • 00:30:24
    visual language modeling right which is
  • 00:30:26
    the typical uh you know you can mask the
  • 00:30:28
    images part or the text part and you can
  • 00:30:29
    try to um uh I think in their particular
  • 00:30:32
    case I think they just hid the uh the
  • 00:30:34
    text part so therefore visual language
  • 00:30:36
    modeling so if you mask these text
  • 00:30:38
    tokens you will basically try to predict
  • 00:30:39
    them what is the text right text image
  • 00:30:42
    alignment so um essentially uh they are
  • 00:30:44
    just predicting whether a particular
  • 00:30:47
    token belongs to the covered class or
  • 00:30:49
    the not covered class so remember you
  • 00:30:51
    basically covered some lines so this
  • 00:30:53
    text token belongs to the covered class
  • 00:30:55
    covered class versus these ones belong
  • 00:30:56
    to not covered class okay and then
  • 00:30:59
    lastly you have text image matching so
  • 00:31:01
    essentially you know whether this
  • 00:31:02
    particular image and the text match with
  • 00:31:04
    each other or not like just like the
  • 00:31:05
    VisualBERT and the ViLBERT models okay those
  • 00:31:08
    are three classes now of course the the
  • 00:31:09
    pre-training data they obtain like 11 million
  • 00:31:11
    scan documents and they use the text OCR
  • 00:31:14
    Microsoft read API for the OCR part okay
  • 00:31:17
    and that's how layout LM V2 basically
  • 00:31:19
    does an awesome job U and and
  • 00:31:21
    essentially uh is used to pre-train a model
  • 00:31:24
    which uh does very awesome uh visually
  • 00:31:26
    rich document processing
  • 00:31:28
    okay so so far I've talked about text
  • 00:31:30
    and images now let me quickly talk about
  • 00:31:32
    video tasks um uh multimodality could
  • 00:31:35
    also mean you know doing things about
  • 00:31:37
    video and text so for example text video
  • 00:31:40
    retrieval given a text and a collection
  • 00:31:42
    of videos find the relevant ones now
  • 00:31:44
    this requires text embedding and video
  • 00:31:45
    video embedding both of them together in
  • 00:31:47
    same space okay multiple choice video
  • 00:31:50
    question answering so again given a
  • 00:31:51
    video and a question and multiple
  • 00:31:52
    candidate answers you want to choose
  • 00:31:54
    which is the best one uh you know it's
  • 00:31:55
    analogous to image question uh visual
  • 00:31:58
    question answering which typically just
  • 00:31:59
    relates with an image in that senses
  • 00:32:01
    okay and then you could also have other
  • 00:32:03
    kinds of tasks like action segmentation
  • 00:32:04
    action step localization and so on uh
  • 00:32:07
    where you basically have an action which
  • 00:32:09
    is described in text and then you have a
  • 00:32:10
    video and you want to figure out where
  • 00:32:12
    the action is in that sensus also called
  • 00:32:14
    as moment detection in that senses okay
  • 00:32:17
    okay so um how do you do this now the
  • 00:32:21
    ideas are pretty similar uh you you see
  • 00:32:23
    I mean if you really think about it what
  • 00:32:25
    is a video video is a sequence of image
  • 00:32:27
    frames okay so in some ways if you
  • 00:32:29
    basically uh are thinking about an image
  • 00:32:32
    as a 3D bit map where an image has a
  • 00:32:35
    height and a width and basically the
  • 00:32:37
    depth is basically just three three
  • 00:32:39
    because you have to incorporate three
  • 00:32:40
    channels RGB red green and blue right a
  • 00:32:43
    video with 100 frames can be thought of again
  • 00:32:46
    as a 3D uh bit map or a 3D uh Cube where
  • 00:32:50
    you have of course the height and the
  • 00:32:51
    width but you also have a depth which is
  • 00:32:53
    basically 3 * 100 if there are 100
  • 00:32:55
    frames in the video okay so you should
  • 00:32:57
    really you could really think about a
  • 00:32:59
    video as um a three-dimensional cube in
  • 00:33:02
    that senses right and in that senses you
  • 00:33:04
    could basically then use your 3D
  • 00:33:07
    convolution neural networks to encode
  • 00:33:09
    this video or you could actually also
  • 00:33:11
    use latest advances you know uh in
  • 00:33:14
    transform models is to be able to encode
  • 00:33:15
    this video of course you know um I mean
  • 00:33:19
    um so as I mentioned video is basically
  • 00:33:21
    a whole bunch of image frames but there
  • 00:33:23
    is also sequence to them and 3D CNN help
  • 00:33:25
    you sort of uh Ensure that that sequence
  • 00:33:28
    is also respected when you're trying to
  • 00:33:30
    encode the video right but again this
  • 00:33:31
    session is not on video encoding so I'm
  • 00:33:33
    not really going to go into deep details
  • 00:33:34
    about how do you encode videos there's
  • 00:33:36
    so many interesting models you know all
  • 00:33:38
    the way starting from um you know very
  • 00:33:40
    old models like i3d uh inflated uh 3D
  • 00:33:43
    models and 3D ConvNets and so on to the
  • 00:33:46
    more recent ones but the idea is let's
  • 00:33:48
    say that you have a video encoder right
  • 00:33:50
    and you have a text encoder so and you
  • 00:33:52
    could basically do the same contrastive
  • 00:33:54
    loss kind of training the same noise you
  • 00:33:56
    know um uh um the typical popular uh
  • 00:34:01
    noise contrastive estimation loss
  • 00:34:02
    basically can be used also for doing
  • 00:34:05
    something called as VideoCLIP okay just
  • 00:34:07
    like you have the clip for image and
  • 00:34:09
    text pairs you could also have a video
  • 00:34:10
    clip which basically tries to do contast
  • 00:34:11
    of learning uh with the uh with with
  • 00:34:14
    video and text pairs okay now the idea
  • 00:34:17
    is that where do you get these video and
  • 00:34:18
    text pairs right so you could of course
  • 00:34:20
    basically make use of transcripts so you
  • 00:34:21
    have video and the visual information
  • 00:34:23
    and you have transcript and you could
  • 00:34:24
    make use of the two to align them but
  • 00:34:27
    one has to be a little little cautious
  • 00:34:28
    about this because you know typically if
  • 00:34:30
    uh uh let's say even in this lecture
  • 00:34:32
    video I started off saying that hey in
  • 00:34:34
    this video I would talk about U you know
  • 00:34:36
    multimodel models but when I talked
  • 00:34:39
    about that visually on the slide you
  • 00:34:41
    couldn't see any multimodel for that
  • 00:34:43
    matter right similarly if I if if I'm
  • 00:34:45
    making a recipe video I'm going to say
  • 00:34:47
    that hey I'm going to basically teach
  • 00:34:48
    you how to cook Cho B right and at that
  • 00:34:51
    time on the slide there's no CH B at all
  • 00:34:54
    right I mean CH B come much later okay
  • 00:34:57
    or Essen you know um the idea is that
  • 00:35:00
    the speech and and what you see on the
  • 00:35:03
    video may not be completely aligned
  • 00:35:04
    always and therefore you have to be
  • 00:35:06
    little cautious about how do you align
  • 00:35:08
    and how do you get those positive pairs
  • 00:35:10
    versus the negative pairs but otherwise
  • 00:35:12
    more or less the contrastive estimation
  • 00:35:13
    contrastive lws and so on work the same
  • 00:35:15
    way uh and in fact in their particular
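    One simple way to operationalize this alignment caveat is to pair each transcript segment with a video clip that merely overlaps it in time, rather than forcing exact timestamp alignment, and to treat the other pairs in the batch as negatives. The sketch below is a hypothetical illustration of that idea, not VideoCLIP's actual sampling code.

```python
import random

def sample_pairs(transcript_segments, clip_len=10.0, jitter=5.0):
    """Pair each timestamped transcript segment with an overlapping video clip.

    transcript_segments: list of dicts like {"start": 12.0, "end": 17.5, "text": "..."}
    Returns (video_span, text) tuples; the spans overlap the speech but are not
    forced to coincide with it exactly, tolerating loose speech/visual alignment.
    """
    pairs = []
    for seg in transcript_segments:
        center = 0.5 * (seg["start"] + seg["end"]) + random.uniform(-jitter, jitter)
        start = max(0.0, center - clip_len / 2)
        pairs.append(((start, start + clip_len), seg["text"]))
    return pairs

# In contrastive training, each (clip, text) pair is a positive; clips and texts
# from other pairs in the same batch serve as negatives.
```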
  • 00:35:18
    In fact, in the particular case of VideoCLIP, they use BERT-base uncased for both the video side and the text side; that is, they use a Transformer model to encode the video as well. The way they did that was to use a frozen, pretrained CNN to encode the image frames, and then they projected those video tokens into the dimensionality and embedding space that BERT-base expects by training an MLP projection layer. They pretrained this on the HowTo100M dataset, and that is how VideoCLIP was pretrained.
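    A minimal sketch of that projection step, assuming PyTorch; the dimensions and module names are illustrative rather than the released VideoCLIP code.

```python
import torch
import torch.nn as nn

class VideoTokenProjector(nn.Module):
    """Project frozen per-frame CNN features into the input space of a text-style
    Transformer encoder (e.g. the 768-dim space BERT-base uses)."""

    def __init__(self, cnn_dim=512, hidden_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(cnn_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, frame_features):       # (batch, num_frames, cnn_dim)
        return self.proj(frame_features)     # (batch, num_frames, hidden_dim)

# frame_features would come from a frozen, pretrained frame/video CNN;
# only the projector (and the Transformer on top of it) are trained.
frame_features = torch.randn(2, 32, 512)
video_tokens = VideoTokenProjector()(frame_features)
print(video_tokens.shape)                    # torch.Size([2, 32, 768])
```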
  • 00:35:55
    Next, moving towards more and more modalities, let me talk about the ImageBind model. So far we have covered multiple models. I started off with simple Vision Transformers and said they can be used for encoding images. Then I talked about ViLBERT and VisualBERT, which can be used for multimodal tasks involving images and text, and we also talked about CLIP in the same theme. Then I talked about VideoCLIP, which again extends this to two modalities, but this time video and text rather than image and text. Now the obvious question is: can I include more modalities? There are so many tasks that involve multiple modalities. ImageBind is a model that tries to extend this kind of approach to six different modalities.
  • 00:37:00
    These modalities are images, text, audio, depth, thermal, and inertial measurement unit (IMU) data. Images, text, and audio are obvious. What is a depth image? A depth image tells you how far each pixel is from the camera: white means a pixel is very close to the camera, black means it is very far away. You can also bring in a thermal modality. You may have heard of FLIR images; they are used a lot in infrared imaging, for example to check electrical circuits for faults. FLIR images use infrared cameras to record, in effect, a temperature at every pixel, which is why they are called thermal images. You can also have IMU (inertial measurement unit) data, which is more like time-series sensor data. For example, if you are building a driverless-car application, you might not want to rely only on the camera input; you might also want to use data coming from several sensors inside the car in order to decide, say, whether to press the brake or not. So these are multiple modalities of data, and while a given application may not require processing all of them, some of them become important.
  • 00:38:31
    So the idea is that it would be great to learn a single model that can process all of these modalities, and here is an inspiring statement of why this could be useful and how to do it. Many applications require a combination of these modalities, but the challenge is that there is no dataset aligned across all of them. Although I might want to build an application that uses thermal images together with sensor data, I may simply not have aligned data for that pair.
  • 00:39:06
    What is really interesting, though, is that image binds it all. An image of a beach can remind us of the sound of the waves (audio), the texture of the sand, a breeze, or even inspire a poem (text). Different modalities can all be linked to images in some way, and that is what is shown here. If you can just get image-text, image-depth, image-heat-map, image-audio, and image-IMU pairs, you can avoid the problem of needing pairwise data across all possible modality combinations. With six different modalities you would otherwise need 6 choose 2, i.e. 15, different kinds of paired data: image-text, image-audio, text-audio, text-IMU, and so on. But if you go via images, you can sidestep this, and that is exactly what ImageBind banks on. They make use of a whole lot of image data paired with one other modality at a time, so as to train a really strong multimodal model, and that model then lets them do not just multimodal understanding but also multimodal generation. Multimodal generation is of course a topic for another lecture, and we will talk about it later.
  • 00:40:25
    But here are some examples. Cross-modal retrieval: if you pass in this audio, which is the crackle of a fire, you can retrieve images or videos that actually show a crackling fire, and you can also retrieve depth images related to a fireplace, as you can see in the figure. You can likewise retrieve text snippets that all talk about fire. Remember, this is not because the text "crackle of a fire" was used as the search query; it is the audio, the sound of a crackling fire, that retrieves the text. It is not speech-to-text either: the words "crackle of a fire" were never spoken in the audio; all that was there was the sound of the fire itself.
  • 00:41:18
    Of course, the model can also be used for embedding-space arithmetic. You can take this image of a crane, a bird, add the sound of waves, and retrieve or generate images of the same bird at sea or on the shore. You can also use this kind of model for audio-to-image generation: given an audio clip, generate an image; barking audio generates a dog, train audio generates a train image, and so on.
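    Both examples reduce to simple vector operations in the shared embedding space. Here is a minimal sketch, assuming hypothetical per-modality encoders that all output vectors in the same space; none of the names below are ImageBind's actual API.

```python
import torch
import torch.nn.functional as F

def embed(encoder, x):
    return F.normalize(encoder(x), dim=-1)

def retrieve(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery items most similar to the query."""
    sims = gallery_embs @ query_emb          # cosine similarity (all vectors normalized)
    return sims.topk(k).indices

# Cross-modal retrieval: an audio query against an image gallery.
# q = embed(audio_encoder, fire_crackle_audio)
# top_images = retrieve(q, image_gallery_embs)

# Embedding-space arithmetic: bird image + wave sound -> "bird near the sea".
# combined = F.normalize(embed(image_encoder, bird_image)
#                        + embed(audio_encoder, wave_audio), dim=-1)
# top_images = retrieve(combined, image_gallery_embs)
```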
  • 00:41:52
    So how is the ImageBind model trained? As I mentioned, the model is trained using several datasets, also listed here, that relate the visual modality to each of the other modalities: for example, video and audio from the AudioSet dataset, image-depth pairs from another dataset, image-thermal from yet another, video-IMU data, image-text data, and so on. The model uses large deep neural networks to encode the image on one side and the other modality M on the other, and it uses the same InfoNCE contrastive loss. Essentially, just like CLIP or VideoCLIP, they use a symmetric version of the InfoNCE (noise-contrastive estimation) loss, taking both directions: image against the other modality, and the other modality against the image. The loss ensures that positive pairs have a higher similarity than negative pairs, and, as you can see, the denominator contains both.
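    The formula shown on the slide is not reproduced in the transcript; as a reconstruction of the standard InfoNCE form it refers to (not a quote from the paper), with q_i the image embedding, k_i the embedding of the paired sample in modality M, k_j the in-batch negatives, and τ a temperature:

```latex
L_{I,M} = -\log \frac{\exp(q_i^{\top} k_i / \tau)}
                     {\exp(q_i^{\top} k_i / \tau) + \sum_{j \neq i} \exp(q_i^{\top} k_j / \tau)},
\qquad
L = L_{I,M} + L_{M,I}
```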
  • 00:43:10
    For the image encoder they used the ViT-Huge model, with about 630 million parameters, and for the text encoder they used a roughly 302-million-parameter text encoder from OpenCLIP. As far as I remember, they froze the text encoder during training and trained the encoders for the other modalities. They use the same encoder for images and videos, where videos are simply treated as multi-frame images.
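    A minimal sketch of that training setup, with placeholder encoder objects: the pretrained towers are kept frozen while a new modality encoder is optimized with a symmetric contrastive loss like the one above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def freeze(module: nn.Module):
    for p in module.parameters():
        p.requires_grad = False

# image_encoder, text_encoder: pretrained CLIP-style towers (placeholders).
# audio_encoder: the new modality tower being "bound" to the image space.
# freeze(image_encoder); freeze(text_encoder)

def info_nce(img_emb, mod_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, other-modality) embeddings."""
    img = F.normalize(img_emb, dim=-1)
    mod = F.normalize(mod_emb, dim=-1)
    logits = img @ mod.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)    # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```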
  • 00:43:46
    So, more or less, that is what I had for you in this session. Quickly summarizing: I talked about a whole range of models. I first motivated why multimodal modeling is important by discussing various vision-and-language tasks such as visual question answering, visual commonsense reasoning, referring expressions, caption-based image retrieval, multimodal hate-speech detection, multimodal fake-news detection, and so on. Then we talked about Vision Transformers, a way to use Transformers to encode images. Then we covered three multimodal models, namely VisualBERT, ViLBERT, and CLIP, which, in that order, have been pretrained on larger and larger collections of image-text pairs. Next, I talked about visually rich document understanding using LayoutLM, LayoutLMv2 in particular. Then we moved to video tasks, extending multimodality to video-and-text combinations, and accordingly talked about VideoCLIP, an obvious extension of the standard CLIP model.
  • 00:45:06
    Lastly, I talked about ImageBind. By the way, ImageBind deals with six different modalities, but there are other models that can handle even more: if I remember correctly, there is a model called TextBind, there is Meta-Transformer, and there is yet another model called Composable Diffusion. If you search for these models, you will see that people have tried to extend this idea to many more modalities, not just six; some of those papers handle ten or more.
  • 00:45:38
    So hopefully this session motivates you to read more of those papers on multimodal modeling, and potentially also to do research in this area. I am of course always excited to work in this area, so if you want to do more research here, feel free to reach out to me at these coordinates. You will also find a whole bunch of videos around multimodality on my YouTube channel, so feel free to check those out as well. Thanks so much, and I am happy to take questions.
Tags
  • multimodal models
  • vision and language
  • VisualBERT
  • ViLBERT
  • CLIP
  • Vision Transformers