#222 Multimodal Models Part1 (as part of IIT Delhi course on Large Language Models (LLMs))
Summary
TLDR: This session, led by Manish Gupta, an applied scientist at Microsoft, explores multimodal models focused on vision-and-language tasks such as visual question answering, visual commonsense reasoning, and caption-based image retrieval. It covers multimodal understanding by integrating BERT for text with visual Transformer models for images. Tools such as convolutional neural networks and Vision Transformers (ViT) are discussed for encoding images. Models such as VisualBERT, ViLBERT, and CLIP are examined to explain how they handle multimodal tasks through joint encoding and contrastive pre-training. Gupta also covers visually rich document understanding with LayoutLM and the extension of these tasks to video with models like VideoCLIP. Recent research on models that bind several modalities, for example ImageBind, is also discussed.
Takeaways
- 📌 Focus on vision-and-language understanding.
- 🖼️ Use of Vision Transformers to encode images.
- 💡 VisualBERT and ViLBERT for multimodality.
- 🔄 Use of a contrastive loss in CLIP.
- 🗂️ LayoutLM for document understanding.
- 🎥 Adaptation of the techniques to video and text.
- 🔍 Exploration of ImageBind across several modalities.
Timeline
- 00:00:00 - 00:05:00
Welcome to this session on multimodal models, part 1, focused on vision-and-language tasks, with the emphasis on understanding rather than multimodal generation. Popular tasks include visual question answering and visual commonsense reasoning, where objects detected in images are associated with text.
- 00:05:00 - 00:10:00
Introduction to Vision Transformers, which encode images by splitting them into fixed-size patches and adding positional embeddings before passing them through a Transformer encoder. These vision models can be used for various classification tasks.
- 00:10:00 - 00:15:00
Presentation of VisualBERT, a pre-trained multimodal model that integrates BERT for the text and uses vision models for the image. It is pre-trained on captioned images, with objective functions such as text masking and image-text alignment prediction.
- 00:15:00 - 00:20:00
Introduction of ViLBERT, a two-tower architecture that processes text and image separately before fusing them through co-attentional Transformer layers for aligned modeling. It leverages the Conceptual Captions data for pre-training.
- 00:20:00 - 00:25:00
Presentation of the CLIP model, which uses a contrastive loss for pre-training on vast image-text datasets collected from the web. CLIP shows excellent performance on computer vision tasks, even without direct supervised training on them.
- 00:25:00 - 00:30:00
Multimodal models are also used for visually rich document understanding with LayoutLM, which processes scanned documents to extract key-value pairs or answer questions about the document.
- 00:30:00 - 00:35:00
Discussion of the extension to video tasks, where a video is treated as a sequence of image frames, and of models that encode such videos, enabling tasks like text-video retrieval and video question answering.
- 00:35:00 - 00:40:00
Introduction to ImageBind, an attempt to bind six different modalities, including images, text, audio and more, enabling richer multimodal understanding and generation without requiring globally aligned data.
- 00:40:00 - 00:46:17
Session summary highlighting the importance of multimodal modeling in various settings, introducing recent models, and encouraging further exploration of research in this exciting area.
Video Q&A
What is the main focus of this session?
The session focuses on multimodal understanding of visual and textual content.
Which multimodal tasks are discussed in the presentation?
The tasks include visual question answering, visual commonsense reasoning, caption-based image retrieval, and multimodal fake news detection.
Transcript
- 00:00:02 Hi, welcome to this session on multimodal models, part one, which is part of the ELL 881 and AIL 821 course on Large Language Models: Introduction and Recent Advances at IIT Delhi. My name is Manish Gupta, and I'm a Principal Applied Scientist at Microsoft. Let's get started with today's session. Multimodal models can involve several modalities; this session is primarily focused on vision-and-language tasks. Also note that this session is more about multimodal understanding and less about generation. In part two I'll start talking about generation, specifically text generation where the input can be multimodal, but in this session I'm going to focus on multimodal content understanding.
- 00:00:57 What you see on the slide are various vision-and-language tasks. There are several scenarios where information is multimodal in nature, specifically text and images. The most popular multimodal task, I would say, is visual question answering, where the input is an image along with a question, such as "Is there something to cut the vegetables with?" Given the image and the question, the idea is to select from a large set of answers; an understanding task selects from a large set of answers, whereas a generation task would actually generate the answer text. Another related task is VCR, visual commonsense reasoning, where, given an image of a social situation and a question like "Why is person 4 pointing at person 1?", the idea is to figure out which of the four given options is the most accurate answer. As an extension, people have also proposed a second-order task: given the image, the question, and the answer that was chosen, what is the rationale, i.e., why is the answer correct? Both of these involve choosing one of four possible options. Here is yet another task, referring expressions: given an image and a rectangle, you want to choose a matching piece of text, say "guy in yellow dribbling ball". Yet another task is caption-based image retrieval, where you have a particular text caption and a large collection of images, and you want to retrieve the most relevant images. The idea is to encode the images and the text in a single space so that you can compute similarity between them seamlessly and do good image retrieval. So all of these tasks involve encoding the image and the text in some form, sometimes jointly, sometimes individually, but at some level you try to compute similarity.
- 00:03:18 Now I'll slowly start talking about the kinds of models people have been using to solve these vision-and-language tasks. There are many other tasks as well, by the way. For example, there is multimodal fake tweet detection, where a tweet can be multimodal in nature: there is an image and there is text, and you want to figure out whether the combined multimodal tweet is fake or not. Similarly, there are tasks like hate speech detection: given a multimodal document, say a blog or again a tweet, you want to figure out whether, overall, including both the image and the text, it is hateful or not. Given all of this, I'll start talking about the models people have built to solve these multimodal tasks, that is, how you do multimodal understanding.
- 00:04:05 To do multimodal understanding you need to encode both the image and the text. Text understanding you have of course covered over multiple lectures in this course, so let me talk about how to encode images nicely. Until around 2020, people mostly used convolutional neural networks to encode images; people still use them, but it has also become popular to use Vision Transformers. The idea is that, given an image, you first split it into, say, N fixed-size patches; these patches are linearly embedded, i.e., a projection computes an embedding per patch; you then add positional embeddings to each of those patch embeddings and feed the resulting vectors, along with a CLS token, to a Transformer encoder. That is how Vision Transformers work: ViT, the Vision Transformer model, takes an image, splits it into fixed-size patches, linearly embeds them, adds position embeddings, and passes the result as input to the Transformer encoder along with the CLS token. On the CLS token you can attach an MLP head and use it for various classification purposes; for example, for a 1000-class classification problem you would attach an output layer with 1000 neurons.
- 00:05:43 The Transformer encoder in ViT is a standard encoder with no changes: you have multi-head self-attention, the MLP feed-forward layer, and so on. The ViT model comes in three sizes, base, large and huge, with 12, 24 and 32 layers respectively, and with different hidden dimensions and numbers of attention heads. The largest one has about 632 million parameters. The model was pre-trained on the usual datasets: ImageNet-1k, ImageNet-21k, and JFT. These models differ not only in size (base, large, huge) but also in the patch size they take as input; for example, ViT-L/16 takes 16×16 input patches. A smaller patch size means a larger sequence length, so there is a trade-off: a larger sequence length means higher latency, but it can also mean higher accuracy. What people have shown is that Vision Transformers match or exceed residual networks on many image classification datasets, and therefore people have started using Vision Transformer models for encoding images as well.
- 00:07:01 Now you might be thinking, where is the multimodality here? There is no multimodality so far, but this was important to build the notion that yes, you can use Transformer models even to encode images.
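As a concrete illustration of the patch-embedding step just described, here is a minimal PyTorch-style sketch (not from the lecture; the image_size=224, patch_size=16 and dim=768 defaults are assumptions mirroring ViT-Base/16):

```python
import torch
import torch.nn as nn

class ViTEmbedder(nn.Module):
    """Minimal sketch of the ViT input pipeline: patchify -> linear projection
    -> prepend CLS token -> add positional embeddings."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2        # 14 * 14 = 196
        patch_dim = in_channels * patch_size * patch_size    # 3 * 16 * 16 = 768
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, dim)                # linear patch embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images):                               # images: (B, 3, H, W)
        B, C, H, W = images.shape
        p = self.patch_size
        # split into non-overlapping p x p patches and flatten each patch
        patches = images.unfold(2, p, p).unfold(3, p, p)     # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.proj(patches)                          # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)             # prepend CLS
        return tokens + self.pos_embed                       # ready for the encoder

# Usage: feed the output into a standard Transformer encoder and attach an MLP
# head on the CLS position for classification.
x = ViTEmbedder()(torch.randn(2, 3, 224, 224))               # -> (2, 197, 768)
```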
- 00:07:13 Now here is our first multimodal model; it is called VisualBERT. As you can imagine, it integrates BERT for the text and handles vision using a Transformer model as well. The question is how to pre-train such a model; VisualBERT is a pre-trained multimodal model. If you remember how BERT was pre-trained, the goal was to learn a task-agnostic language model that understands English roughly like a second- or third-grade kid, and the pre-training data is always obtained in a self-supervised manner, in the sense that there should be large amounts of it and no extra human labeling required. When you think about multimodal pre-training, what you want is a model that understands how to link portions or patches of images with text words. So unlike BERT, which only needs to understand relationships between words, VisualBERT needs to understand the relationship between text words and visual patches of images. How do you enable that, and what kind of pre-training data would you use? What folks realized is that there is a whole bunch of image captioning data available, so they used image-text pairs from captioning data to pre-train the model: you have images, each with a corresponding caption. VisualBERT leveraged the MS COCO data, roughly 120,000 images, each with five different captions, leading to a dataset of about 600,000 image-text pairs, and they used the standard Transformer encoder model to do this pre-training.
- 00:09:07 The way they did this is as follows: there is a CLS token, then the text caption goes in as input, and then the image pieces go in as input. Let's understand this step by step. Just like one would do masked language modeling in BERT, here they also mask some text tokens, which is obvious, and text and images are separated by a separator token, which is also obvious. Now, what are these image pieces? They are not tiles or patches as on the previous slide; in their case they actually used an object detection model, Faster R-CNN, to take the image and divide it into regions such that each region captures a relevant object, like a cap, a ball, a tennis racket, a shirt, and so on. So the "image words", or image tokens, are the objects detected by the Faster R-CNN object detector. To encode these, three different kinds of embeddings go in at every position: a position embedding, which is standard in any Transformer model; a segment embedding, which tells you whether this is a text segment or an image segment; and a token or image embedding, i.e., the token embedding for text tokens and the Faster R-CNN features for image regions. All of this is combined and fed to the standard Transformer encoder model, and you pre-train this Transformer using two different objective functions, which we'll talk about next.
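To make that input layout concrete, here is a small illustrative sketch of how such a joint sequence could be assembled (not the authors' code; the embedding-table names, the 2048-dimensional Faster R-CNN feature size and the 36-region example are assumptions):

```python
import torch
import torch.nn as nn

class VisualBERTInput(nn.Module):
    """Sketch: build [CLS] text [SEP] region... input embeddings for a
    VisualBERT-style single-stream encoder."""
    def __init__(self, vocab_size=30522, dim=768, region_feat_dim=2048, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)        # text token embeddings
        self.seg_emb = nn.Embedding(2, dim)                  # 0 = text, 1 = image
        self.pos_emb = nn.Embedding(max_len, dim)            # position embeddings
        self.region_proj = nn.Linear(region_feat_dim, dim)   # detector features -> dim

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word-piece ids (with [CLS]/[SEP]/[MASK] already inserted)
        # region_feats: (B, R, region_feat_dim) pooled features of detected objects
        B, T = token_ids.shape
        R = region_feats.shape[1]
        text = self.tok_emb(token_ids) + self.seg_emb(torch.zeros_like(token_ids))
        img = self.region_proj(region_feats) + self.seg_emb(
            torch.ones(B, R, dtype=torch.long))
        seq = torch.cat([text, img], dim=1)                  # (B, T + R, dim)
        pos = torch.arange(T + R).unsqueeze(0).expand(B, -1)
        return seq + self.pos_emb(pos)                       # fed to a BERT-style encoder

emb = VisualBERTInput()(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(emb.shape)   # torch.Size([2, 52, 768])
```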
- 00:11:05 The first objective is very simple: it is just masked language modeling. Note that it is masked language modeling, not masked image modeling; on the image side you do not hide any of the image regions, but on the text side you mask text tokens. Just like in BERT, masked language modeling aims to predict the masked word at the output, at the same position, by borrowing knowledge from the other, unmasked tokens. In the same way, in VisualBERT the idea is that at the same position the model should be able to guess the hidden or masked word by leveraging knowledge from the unmasked text words and also, of course, from all the image tokens, because no image tokens are masked. So the accuracy of masked language modeling should ideally be higher here, because a masked word like "tennis" can now be guessed not just from "racket" and "ball" and so on, but also from what the model sees in the image. This knowledge should help improve masked language modeling accuracy and, along the way, make the model learn how to relate this particular image with the word "tennis". By the way, this first objective, the masked language modeling objective, is computed only at those positions where text tokens were masked.
Objective two, on the other hand, is the sentence-image prediction task. The idea is that in a particular batch of samples you provide image-text pairs, but half of those pairs are positive pairs and half are negative pairs. A positive pair means the caption is linked with its own image, i.e., it is relevant to the image; a negative pair means the caption is linked with an irrelevant image. You sample positive and negative pairs to create a batch, and objective two is all about whether the MLP head is able to predict correctly whether a given pair is positive or negative.
The interesting part about VisualBERT is that you can now look at the attention weights of selected heads in the last layer, visualize them, and see how text words attend to image regions. Look at layer 11: this heat map is drawn by showing attention between the text words and the image tokens; there are five image tokens, corresponding to "man", "shirt", "sidewalk", and so on, which are also highlighted in the image (the red box is the man, the bluish one is the shirt, another bluish region is the sidewalk). What you observe is that if you look at the word "man", fortunately it has a very high attention weight for the image region containing the man, which is very useful and nice, because it tells us that the model is actually learning to correlate pieces of the image with tokens in the text. You can pause the video and observe that this also holds for other things like "sidewalk", "pedestrians", "shirt", and so on.
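The two pre-training objectives described above could be wired up roughly as follows (a hedged sketch, not the released VisualBERT code; `encoder`, `mlm_head` and `align_head` are assumed stand-ins for the joint Transformer encoder and its output heads):

```python
import torch.nn.functional as F

def visualbert_pretraining_loss(encoder, mlm_head, align_head,
                                inputs, mlm_labels, alignment_labels):
    """Objective 1: masked LM, computed only at masked text positions
    (mlm_labels = -100 everywhere else, including all image positions).
    Objective 2: binary sentence-image alignment prediction from the CLS output."""
    hidden = encoder(inputs)                      # (B, T+R, dim) joint encoder output
    vocab_size = mlm_head.out_features

    # Objective 1: predict the original token at masked text positions.
    mlm_logits = mlm_head(hidden)                 # (B, T+R, vocab_size)
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size),
                               mlm_labels.view(-1), ignore_index=-100)

    # Objective 2: was this caption paired with its own image (1) or a random one (0)?
    align_logits = align_head(hidden[:, 0])       # prediction from the CLS position
    align_loss = F.cross_entropy(align_logits, alignment_labels)
    return mlm_loss + align_loss
```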
- 00:14:53 After this, people tried to improve the architecture and came up with a new model called ViLBERT. By the way, you can also look up these papers; most of my slides have citations for them at the bottom. ViLBERT believes in a two-tower model, unlike VisualBERT, which was a single-tower model where the concatenation of the text and image modalities happened right in the first, zeroth layer. ViLBERT processes the text separately using a few Transformer layers, as you see here, and processes the image separately using a few layers as well. Notice that the text stream in ViLBERT actually has much more processing before interacting with the visual features, but, as I was saying, it is a two-tower model where text and image are processed separately in their own pipelines. Then there is a fusion step, where co-attentional Transformer layers try to fuse the information across both pipelines, followed by further Transformer layers that again do individual processing separately. The interesting part is that the linguistic stream is BERT-base, and for the visual stream you again use Faster R-CNN to extract the image regions. This Faster R-CNN, both in VisualBERT and in ViLBERT, is pre-trained on the Visual Genome dataset.
So how do these co-attentional Transformer layers work? A standard Transformer layer has the typical self-attention and feed-forward blocks, with the queries, keys and values (Q, K, V) coming from a single stream. But in ViLBERT you have two streams, the visual stream and the linguistic stream, and what you do to transfer information across the two is to use the query from the same modality but the keys and values from the other modality: in the visual stream, queries come from the visual stream while keys and values come from the linguistic stream, and in the linguistic stream, queries come from the linguistic stream while keys and values come from the visual stream. This cross-pollination of information happens in the attention layer, and then of course you have the standard feed-forward block with the residual connections, add-and-normalize, and so on. That is how ViLBERT works.
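Here is a compact sketch of that co-attention idea (a simplified illustration, not the ViLBERT implementation; it shows a single co-attention block and omits the feed-forward sublayers):

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Sketch of a ViLBERT-style co-attentional layer: each stream queries
    the *other* stream's keys and values."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, visual, text):
        # visual: (B, R, dim) region features; text: (B, T, dim) token features
        # Q from visual, K/V from text:
        v_out, _ = self.vis_attends_txt(query=visual, key=text, value=text)
        # Q from text, K/V from visual:
        t_out, _ = self.txt_attends_vis(query=text, key=visual, value=visual)
        # residual + layer norm (feed-forward sublayers omitted for brevity)
        return self.norm_v(visual + v_out), self.norm_t(text + t_out)

v, t = CoAttentionBlock()(torch.randn(2, 36, 768), torch.randn(2, 20, 768))
print(v.shape, t.shape)   # torch.Size([2, 36, 768]) torch.Size([2, 20, 768])
```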
Another interesting part about ViLBERT is that rather than depending on manually labeled MS COCO data, they depended on the Conceptual Captions data. This dataset of conceptual captions is obtained in an automated manner by scraping the web: there are Wikipedia pages, news articles and many other web pages where you have an image with a caption underneath it, and there are also images with very descriptive alt tags. These serve as really good sources of caption information along with images, and that is what the ViLBERT authors leveraged to create the visual grounding data, essentially image-text pairs, on which they pre-trained the ViLBERT model.
In the ViLBERT model there are again two pre-training loss functions. Just like in VisualBERT, there is a multimodal alignment prediction objective that simply predicts whether the image and the text are aligned with each other or not; that is basically the same as in VisualBERT, with no difference apart from the name. The masked language modeling, on the other hand, is now extended to become masked multimodal learning. The interesting part is that rather than just masking out text, you can also mask out image regions. In 2019 there was no good technology to generate back the image pieces themselves at the same position, but what ViLBERT does instead is generate a distribution over a set of class labels: because you are using Faster R-CNN, you know which object the image region at a given position indicates, and the objective is set up so that the ViLBERT model is encouraged to learn the right distribution over those object classes at the output, at the same position. If it learns it, great; otherwise a cross-entropy-style loss is back-propagated. That is masked multimodal learning, a nice extension of the plain masked language modeling done in VisualBERT.
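A hedged sketch of that masked-region objective (an illustration only; the ViLBERT paper phrases this as matching the detector's class distribution, and the head shape, the masking convention and the ~1601-class Visual Genome label space are assumptions here):

```python
import torch
import torch.nn.functional as F

def masked_region_loss(region_hidden, detector_probs, masked_mask, cls_head):
    """region_hidden: (B, R, dim) encoder outputs at image-region positions
    detector_probs:  (B, R, C) class distribution from Faster R-CNN (C ~ 1601 classes)
    masked_mask:     (B, R) bool, True where the region feature was masked out
    cls_head:        nn.Linear(dim, C) predicting a distribution over object classes."""
    logits = cls_head(region_hidden)                       # (B, R, C)
    log_q = F.log_softmax(logits, dim=-1)
    # match the detector's distribution at masked positions (KL / soft cross-entropy)
    per_region = F.kl_div(log_q, detector_probs, reduction="none").sum(-1)  # (B, R)
    return (per_region * masked_mask.float()).sum() / masked_mask.float().sum().clamp(min=1)
```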
- 00:19:21 So far, the idea behind ViLBERT and VisualBERT is that you have image-text pairs, you can model both modalities together nicely, come up with interesting embeddings, and hopefully get good accuracies. Over time, however, people have moved to contrastive training, i.e., contrastive-loss-based training, and that is what the CLIP model is famous for; as the name says, it is Contrastive Language-Image Pre-training. CLIP is also a two-tower model, with a contrastive loss right at the very end. You have a text caption and an image: you use a pre-trained text encoder to encode the text and a pre-trained image encoder to encode the image. In their particular case they used a 12-layer Transformer for the text encoder, and for the image encoder they experimented with quite a few options: five ResNets (they still believed in convolutional neural networks) and three different ViT models, ViT-Base and ViT-Large with different patch sizes of 32, 16 and 14. They pre-trained this with a contrastive loss, which I will explain very soon, using 400 million web image-text pairs; the dataset is called WebImageText. So VisualBERT was trained on about 600k image-text pairs, ViLBERT on about 3 million, and CLIP on 400 million, and the interesting part is that it also uses a contrastive loss for pre-training.
The way this pre-training works is as follows. You have text tokens, with an embedding for each token at each position from the text encoder (the 12-layer Transformer), and you have image tokens, with a representation for each image piece. From a batch of N samples, consider N real image-text pairs. You take a pooled text embedding and a pooled image embedding and compute the cosine similarity between them: I1·T1, I1·T2, and so on. Given a batch of, say, 20, you end up with 400 similarities, because you have 20 image embeddings and 20 text embeddings. Of course, you know there are only 20 real pairs, so there are only 20 positive pairs, and if you do the full 20×20 computation you get 400 − 20 = 380 negative pairs. What you want is to maximize the cosine similarity of the image and text embeddings of the real pairs, the 20 on the diagonal, while minimizing the cosine similarity of the embeddings of the 380 incorrect pairings. This is what gives CLIP really good accuracy: in fact, CLIP was tested on 30+ computer vision tasks, like OCR, action recognition in videos, and so on, and they found CLIP to do very well even in a zero-shot manner; it was found to be better than even fully supervised baselines.
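The symmetric contrastive objective described above can be sketched as follows (a minimal version, assuming pooled and already-projected embeddings; the temperature value is the common 1/0.07 initialization, used here only as an example):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """image_emb, text_emb: (N, d) pooled embeddings for N real image-text pairs.
    Maximizes similarity on the diagonal (real pairs) and minimizes it elsewhere."""
    image_emb = F.normalize(image_emb, dim=-1)        # cosine similarity = dot product
    text_emb = F.normalize(text_emb, dim=-1)          # of L2-normalized vectors
    logits = logit_scale * image_emb @ text_emb.t()   # (N, N) similarity matrix
    targets = torch.arange(len(logits))               # correct match is the diagonal
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2

# Example: a batch of 20 pairs gives a 20x20 matrix with 20 positives, 380 negatives.
loss = clip_contrastive_loss(torch.randn(20, 512), torch.randn(20, 512),
                             logit_scale=torch.tensor(14.3))
```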
- 00:23:02 Here are more details about CLIP. You can actually use CLIP even for zero-shot classification, i.e., for classification problems that involve new classes at test time. For example, you take an image, and say you have some new class labels at test time and you want to figure out whether a given test-time class label holds for this image or not. All you need to do is take that class label, pass it through the text encoder to get a text embedding (this is at inference time, by the way), take the image, pass it through the image encoder to get an image embedding, and compute the similarities; whichever label has the highest similarity is the one you predict as the right caption or class label for that image.
Now the interesting part: what they did was compare zero-shot CLIP with a supervised ResNet-50 model across several datasets. Notice that CLIP is zero-shot, not fine-tuned on any of these datasets, whereas the ResNet is not zero-shot: you take the pre-trained ResNet and fine-tune it on the training set of each dataset. What they observed is that on several of these datasets CLIP actually gives significantly positive improvements compared to the ResNet.
Here are a few examples of how CLIP performs. Here is an example from the Food-101 dataset, where it nicely predicts that this particular food is not any of the other options but guacamole. You can also use it for other kinds of classification problems, like classifying what setting this is: a television studio, a podium indoor, a conference room, a lecture room, a control room, and so on. You can also try to figure out which object is highlighted in the image, or use it for classifying the land-use type: whether it is permanent crop land, pasture land, highway or road, ocean, shrubland, and so on. So that's CLIP.
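As an illustration of that zero-shot recipe, here is a short sketch using OpenAI's open-source `clip` package (the prompt template, label set and image path are placeholder assumptions, not taken from the lecture):

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical labels for a land-use style classification; any label set works.
labels = ["permanent crop land", "pasture land", "highway or road", "ocean"]
texts = clip.tokenize([f"a satellite photo of {l}" for l in labels]).to(device)
image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    img_emb = model.encode_image(image)               # (1, 512)
    txt_emb = model.encode_text(texts)                # (4, 512)
    img_emb /= img_emb.norm(dim=-1, keepdim=True)
    txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

print(labels[probs.argmax().item()])                  # highest-similarity label wins
```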
- 00:25:10 Now, similar kinds of models have also been trained and used for document understanding, i.e., visually rich document understanding. These are scans of various kinds of documents: for example, what you see here is a clearance sheet with some key-value pairs highlighted, but you could also have scans of invoices and so on. The interesting part is that using a model popularly called LayoutLM, which I'll talk about a little over the next few slides, one can nicely extract the key-value pairs from such a document. One can also do question answering on these documents: here is a postcard scan, and you can ask questions like "Mention the ZIP code written", and it nicely figures out the ZIP code; you can also ask for the date on the seal at the top, and it nicely figures out the date on the seal. You could also use it for legal contract management: given document scans, you could ask it to highlight the important legal phrases you must pay attention to, or extract key-value pairs such as which parties signed the document and when it was signed. Of course, if you have lots of documents in users' OneDrive or Google Drive accounts, you could build an app that classifies those documents into popular categories, such as personal identification documents like passports and PAN cards, while another category could be all kinds of invoices, utility bills, and so on. You could also use these kinds of models for recognition over Walmart receipts or any other supermarket receipts. The main, very popular model for this kind of visually rich document processing is LayoutLM, and there are various versions: LayoutLM v1, v2, and there is also a LayoutLLM; I encourage you to go and look at them later.
What is interesting is that it also uses Transformer models. In this particular case they used a Transformer called UniLMv2 for initialization, and then they took domain-specific, visually rich layout data and trained the model using document-specific loss functions as well. Very broadly, what they do is take the document and hide certain lines on it; call them covered lines. This covered image is divided into different parts and encoded using a visual encoder. On the other side, you take the lines of the document, and for each line you have the notion of whether it is covered or not; when they hide lines, they do not hide partial lines, they hide an entire line, so you have covered lines and non-covered lines. You also take the document and run an OCR/PDF parser, which gives you the text; that is how you get the text tokens, and on the text tokens you can of course do masked language modeling, so some tokens are masked as well. Remember, the hiding (covering) part is different from the masking part; the OCR is actually done on the non-hidden version of the document so that the OCR quality is good, but you retain the information about which lines are hidden.
As you see, the Transformer is passed four different things. First, segment embeddings, indicating whether a position is processing image tokens or text tokens, and, for the text tokens, whether a token is masked or not (yellow or blue). Second, a 1D position embedding, because you must pass some notion of position. Third, 2D position embeddings for the bounding box: for the image tokens you have a box, essentially x, y, width and height, and you can encode x-min, x-max, y-min, y-max plus width and height as part of the 2D position embeddings; for the CLS token you pad with a box in which all six values are set to zero. Fourth, the text and visual embeddings: they use Mask R-CNN features, because that is the visual encoder here, and for the text they just use the standard text embeddings. For the Transformer, they experiment with two model sizes, base and large, with 12 and 24 layers, which means about 200 million and 426 million parameters respectively.
There are three pre-training objectives. The first is masked visual-language modeling, the typical objective where you mask part of the input and try to predict it; in their particular case I think they only masked the text part, so if you mask these text tokens, you try to predict what the text is. The second is text-image alignment: they predict whether a particular token belongs to a covered line or a non-covered line; remember that some lines were covered, so this text token belongs to the covered class while those belong to the not-covered class. The third is text-image matching, i.e., whether this particular image and text match with each other or not, just like in the VisualBERT and ViLBERT models. For the pre-training data, they obtained about 11 million scanned documents and used the Microsoft Read API for the OCR part. That is how LayoutLMv2 does an awesome job and is used to pre-train a model that does very good visually rich document processing.
- 00:31:28okay so so far I've talked about text
- 00:31:30and images now let me quickly talk about
- 00:31:32video tasks um uh multimodality could
- 00:31:35also mean you know doing things about
- 00:31:37video and text so for example text video
- 00:31:40retrieval given a text and a collection
- 00:31:42of videos find the relevant ones now
- 00:31:44this requires text embedding and video
- 00:31:45video embedding both of them together in
- 00:31:47same space okay multiple choice video
- 00:31:50question answering so again given a
- 00:31:51video and a question and multiple
- 00:31:52candidate answers you want to choose
- 00:31:54which is the best one uh you know it's
- 00:31:55analogous to image question uh visual
- 00:31:58question answering which typically just
- 00:31:59relates with an image in that senses
- 00:32:01okay and then you could also have other
- 00:32:03kinds of tasks like action segmentation
- 00:32:04action step localization and so on uh
- 00:32:07where you basically have an action which
- 00:32:09is described in text and then you have a
- 00:32:10video and you want to figure out where
- 00:32:12the action is in that sensus also called
- 00:32:14as moment detection in that senses okay
- 00:32:17okay so um how do you do this now the
- 00:32:21ideas are pretty similar uh you you see
- 00:32:23I mean if you really think about it what
- 00:32:25is a video video is a sequence of image
- 00:32:27frames okay so in some ways if you
- 00:32:29basically uh are thinking about an image
- 00:32:32as a 3D bit map where an image has a
- 00:32:35height and a width and basically the
- 00:32:37depth is basically just three three
- 00:32:39because you have to incorporate three
- 00:32:40channels RGB red green and blue right a
- 00:32:43video with 100 frames can be Tau again
- 00:32:46as a 3D uh bit map or a 3D uh Cube where
- 00:32:50you have of course the height and the
- 00:32:51width but you also have a depth which is
- 00:32:53basically 3 * 100 if there are 100
- 00:32:55frames in the video okay so you should
- 00:32:57really you could really think about a
- 00:32:59video as um a three-dimensional cube in
- 00:33:02that senses right and in that senses you
- 00:33:04could basically then use your 3D
- 00:33:07convolution neural networks to encode
- 00:33:09this video or you could actually also
- 00:33:11use latest advances you know uh in
- 00:33:14transform models is to be able to encode
- 00:33:15this video of course you know um I mean
- 00:33:19um so as I mentioned video is basically
- 00:33:21a whole bunch of image frames but there
- 00:33:23is also sequence to them and 3D CNN help
- 00:33:25you sort of uh Ensure that that sequence
- 00:33:28is also respected when you're trying to
- 00:33:30encode the video right but again this
- 00:33:31session is not on video encoding so I'm
- 00:33:33not really going to go into deep details
- 00:33:34about how do you encode videos there's
- 00:33:36so many interesting models you know all
- 00:33:38the way starting from um you know very
- 00:33:40old models like i3d uh inflated uh 3D
- 00:33:43models and 3D Comins and so on to the
- 00:33:46more recent ones but the idea is let's
- 00:33:48say that you have a video encoder right
- 00:33:50and you have a text encoder so and you
- 00:33:52could basically do the same contrastive
- 00:33:54loss kind of training the same noise you
- 00:33:56know um uh um the typical popular uh
- 00:34:01noise contrastive estimation loss
- 00:34:02basically can be used also for doing
- 00:34:05something called as video clip okay just
- 00:34:07like you have the clip for image and
- 00:34:09text pairs you could also have a video
- 00:34:10clip which basically tries to do contast
- 00:34:11of learning uh with the uh with with
- 00:34:14video and text pairs okay now the idea
- 00:34:17is that where do you get these video and
- 00:34:18text pairs right so you could of course
- 00:34:20basically make use of transcripts so you
- 00:34:21have video and the visual information
- 00:34:23and you have transcript and you could
- 00:34:24make use of the two to align them but
- 00:34:27one has to be a little little cautious
- 00:34:28about this because you know typically if
- 00:34:30uh uh let's say even in this lecture
- 00:34:32video I started off saying that hey in
- 00:34:34this video I would talk about U you know
- 00:34:36multimodel models but when I talked
- 00:34:39about that visually on the slide you
- 00:34:41couldn't see any multimodel for that
- 00:34:43matter right similarly if I if if I'm
- 00:34:45making a recipe video I'm going to say
- 00:34:47that hey I'm going to basically teach
- 00:34:48you how to cook Cho B right and at that
- 00:34:51time on the slide there's no CH B at all
- 00:34:54right I mean CH B come much later okay
- 00:34:57or Essen you know um the idea is that
- 00:35:00the speech and and what you see on the
- 00:35:03video may not be completely aligned
- 00:35:04always and therefore you have to be
- 00:35:06little cautious about how do you align
- 00:35:08and how do you get those positive pairs
- 00:35:10versus the negative pairs but otherwise
- 00:35:12more or less the contrastive estimation
- 00:35:13contrastive lws and so on work the same
- 00:35:15way uh and in fact in their particular
- 00:35:18case in video clip they basically use
- 00:35:20the same the six layer uh I mean they
- 00:35:22use the bird base in uncased for both
- 00:35:24the video and text they just use the
- 00:35:25Transformer model for encoding the video
- 00:35:27as well okay um so uh uh I mean and the
- 00:35:31way they did that was to basically use a
- 00:35:33frozen pre-rain CNN so it's to
- 00:35:35essentially encode the image frames and
- 00:35:36then they projected those video tokens
- 00:35:37to the to the to the to the to the size
- 00:35:40that size and the dimensionality and the
- 00:35:42space that bird base desires by doing an
- 00:35:44MLP projection layer by training MLP
- 00:35:46projection layer okay uh so that's that
- 00:35:49they pretend on how to 100 million data
- 00:35:51set and that's basically um how the
- 00:35:54pretend video
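A minimal sketch of that video-token pipeline (an illustration under stated assumptions: `frame_cnn` stands in for the frozen pre-trained CNN, and the 512-dimensional feature size is arbitrary):

```python
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Sketch: frozen frame-level CNN features -> trainable MLP projection ->
    token embeddings that a BERT-style encoder can consume alongside text."""
    def __init__(self, frame_cnn, cnn_dim=512, bert_dim=768):
        super().__init__()
        self.frame_cnn = frame_cnn.eval()               # frozen pre-trained CNN
        for p in self.frame_cnn.parameters():
            p.requires_grad = False
        self.proj = nn.Sequential(                      # trainable projection to BERT's space
            nn.Linear(cnn_dim, bert_dim), nn.GELU(), nn.Linear(bert_dim, bert_dim))

    def forward(self, frames):                          # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.frame_cnn(frames.flatten(0, 1))   # (B*T, cnn_dim)
        return self.proj(feats.view(B, T, -1))             # (B, T, bert_dim) video "tokens"

# The resulting video tokens and the text tokens are then encoded and trained with
# an NCE-style contrastive loss over (video, transcript) pairs, much as in CLIP.
```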
Next, moving towards more and more modalities, let me talk about the ImageBind model. So far we have talked about multiple models: I started with simple Vision Transformers and said they can be used for encoding images; then I talked about VisualBERT and ViLBERT and said they can be used for multimodal tasks involving images and text; we also talked about CLIP in the same theme; then I talked about VideoCLIP and said you can extend this to two modalities that are not image and text this time, but video and text. The obvious question is: can I include more modalities? There are so many tasks with multiple modalities. ImageBind is a model that tries to extend this kind of approach to six different modalities: images, text, audio, depth, thermal, and inertial measurement unit (IMU) data. Images, text and audio are obvious. What is a depth image? A depth image tells you how far away from the camera each pixel is: white means very close to the camera, black means very far away. You can also bring in a thermal modality; you might have heard about FLIR images, which are used a lot in infrared imaging, for example for electrical circuits, where you want to figure out whether there is a fault or not. FLIR images use infrared cameras to record, in some sense, the temperature at every pixel, which is why they are called thermal images. You can also have IMU (inertial measurement unit) data, which is more like time-series sensor data. For instance, if you are building a driverless-car application, you might not just want to use the input from the camera, but also data coming from several sensors inside the car, to be able to make a decision such as whether to press the brake or not. So there are multiple modalities of data, and a given application may not require processing all the modalities, but some of them become important.
- 00:38:31image B so so therefore the the idea is
- 00:38:33that it is a great idea to basically
- 00:38:35learn a model which can process all the
- 00:38:37modalities uh right uh and you know here
- 00:38:39is an inspiring statement why this could
- 00:38:41be useful um and uh how to do this right
- 00:38:45so of course many many applications
- 00:38:47require a combination of these
- 00:38:48modalities the challenge though is that
- 00:38:50there is no data set across all of these
- 00:38:52modalities right so although I might
- 00:38:54want to basically uh compare or or built
- 00:38:57a application which be basically you
- 00:38:59know uses thermal images along with the
- 00:39:02sensor data unfortunately I might not
- 00:39:04have aligned data there okay but what is
- 00:39:06 What is really interesting is that images bind it all. An image of a beach can remind us of the sound of the waves (audio), the texture of the sand, a breeze, or even inspire a poem (text). Different modalities can all be linked to images in some way, and that is also what is shown in this figure.
- 00:39:27 So if you can just collect image-text, image-depth, image-heatmap (thermal), image-audio, and image-IMU pairs, maybe you can avoid needing pairwise data across all possible modality combinations. With six modalities you would otherwise need a separate aligned dataset for every pair of modalities (6 choose 2 = 15 combinations): image-text data, image-audio data, text-audio data, text-IMU data, and so on. But if you always go via images, you can sidestep that problem, and that is exactly what ImageBind banks on. They use a large amount of image data paired with one other modality at a time to train a really strong multimodal model, and that model then supports not just multimodal understanding but also multimodal generation. Multimodal generation is of course a topic for another lecture, and we will talk about it later.
- 00:40:25 Here are some examples. First, cross-modal retrieval: if you pass in an audio clip of the crackle of a fire, you can retrieve images or videos that actually show a crackling fire, or retrieve depth images that relate to a fireplace, as you can see from the figure. You can also retrieve text snippets that are all talking about fire and crackling. Remember, this works not because the text "crackle of a fire" was used as the search query, but because the audio itself, the sound of a crackling fire, retrieves the text. And this is not speech-to-text: those words were never spoken in the audio clip; all it contains is the sound of the fire.
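To make this concrete, here is a minimal sketch (not ImageBind's actual API) of how cross-modal retrieval works once every modality lives in one shared embedding space: embed the audio query, embed the image candidates, and rank by cosine similarity. The `embed_audio` and `embed_image` functions are hypothetical stand-ins for the trained modality encoders.

```python
import numpy as np

def cosine_sim(query, gallery):
    # Cosine similarity between one query vector and a matrix of candidate vectors.
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

def retrieve(audio_query, candidate_images, embed_audio, embed_image, top_k=5):
    """Rank candidate images for an audio query in the shared embedding space.

    embed_audio / embed_image are placeholders for modality-specific encoders;
    each must map its input into the same joint embedding space.
    """
    q = embed_audio(audio_query)                                      # (d,)
    gallery = np.stack([embed_image(img) for img in candidate_images])  # (N, d)
    scores = cosine_sim(q, gallery)                                   # (N,)
    return np.argsort(-scores)[:top_k]                                # best-matching indices
```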
- 00:41:20 The same joint embedding space can also be used for embedding-space arithmetic. You could take an image of a crane (a bird), add the sound of waves, and then retrieve or generate images showing that same bird in the sea or on the shore. You could also use this kind of model for audio-to-image generation: given barking audio, generate a dog; given train audio, generate a train image; and so on.
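The embedding-space arithmetic he describes can be sketched in the same spirit: normalize the two embeddings, take a weighted sum, and renormalize, so the result can be used as a retrieval query or as conditioning for a generator. This is an illustrative assumption about how such arithmetic is typically done, not the paper's exact recipe.

```python
import numpy as np

def combine_embeddings(e_image, e_audio, w=0.5):
    # Weighted sum of two unit-normalized embeddings, renormalized so the
    # result can be compared against other embeddings via cosine similarity.
    e_image = e_image / np.linalg.norm(e_image)
    e_audio = e_audio / np.linalg.norm(e_audio)
    mixed = w * e_image + (1.0 - w) * e_audio
    return mixed / np.linalg.norm(mixed)
```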
- 00:41:52 How is the ImageBind model trained? As I mentioned, it is trained using several datasets, also listed here, that relate the visual modality to the other modalities: video and audio from a dataset called AudioSet, image-depth pairs from another dataset, image-thermal pairs from yet another, video-IMU data, image-text data, and so on. The model uses large deep neural networks to encode the image and any other modality M, and it is trained with the InfoNCE contrastive loss. Essentially, just like VideoCLIP or CLIP, it uses a symmetric version of the InfoNCE (noise-contrastive estimation) loss, applied in both directions: image versus the other modality and the other modality versus the image. The loss ensures that positive pairs get higher similarity than negative pairs, with both appearing in the denominator.
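The formula he points to is on the slide rather than in the transcript; for reference, the standard symmetric InfoNCE objective from the contrastive-learning literature has the form below, assuming q_i is the embedding of image i, k_i the embedding of its paired sample in the other modality M, and tau a temperature. Summing the two directions gives the symmetric version he mentions.

```latex
L_{I,M} = -\log \frac{\exp\!\left(q_i^{\top} k_i / \tau\right)}
                     {\exp\!\left(q_i^{\top} k_i / \tau\right) + \sum_{j \neq i} \exp\!\left(q_i^{\top} k_j / \tau\right)},
\qquad
L = L_{I,M} + L_{M,I}
```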
- 00:43:10 For the image encoder they used the ViT-Huge model, about 630 million parameters, and for the text encoder they used a 302-million-parameter encoder from OpenCLIP. As far as I remember, they froze the text encoder during training and trained the encoders for the other modalities. They use the same encoder for images and videos, where videos are simply treated as multi-frame images.
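To make the training recipe concrete, here is a minimal PyTorch-style sketch of one contrastive training step under the assumptions just described: a frozen pretrained image encoder, a trainable encoder for the other modality, in-batch negatives, and a symmetric InfoNCE loss. The encoder objects and the temperature value are placeholders, not ImageBind's actual code.

```python
import torch
import torch.nn.functional as F

def imagebind_style_step(image_encoder, modality_encoder, images, other, tau=0.07):
    """One contrastive step aligning another modality to the image embedding space.

    image_encoder is assumed frozen (e.g. a pretrained ViT); modality_encoder is trained.
    The i-th (image, other-modality) pair is the positive; every other item in the
    batch acts as a negative, and the loss is applied symmetrically in both directions.
    """
    with torch.no_grad():
        q = F.normalize(image_encoder(images), dim=-1)        # (B, d), frozen branch
    k = F.normalize(modality_encoder(other), dim=-1)          # (B, d), trainable branch

    logits = q @ k.t() / tau                                   # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)         # diagonal entries are positives
    loss = 0.5 * (F.cross_entropy(logits, targets) +           # image -> other modality
                  F.cross_entropy(logits.t(), targets))        # other modality -> image
    return loss
```

In practice one would iterate over the (image, X) paired datasets, such as video-audio pairs from AudioSet, compute this loss, and backpropagate only through the trainable modality encoder.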
- 00:43:46 So more or less, that is what I had for this session. Quickly summarizing: I talked about a whole bunch of models. I first motivated why multimodal modeling is important by discussing various vision-and-language tasks such as visual question answering, visual commonsense reasoning, referring expressions, caption-based image retrieval, multimodal hate-speech detection, multimodal fake-news detection, and so on. Then we talked about Vision Transformers, which use Transformers to encode images. Next we covered three models for joint image-text encoding, namely VisualBERT, ViLBERT, and CLIP, which, in that order, have been pre-trained on larger and larger collections of image-text pairs. After that I talked about visually rich document understanding using LayoutLM, in particular LayoutLMv2. Then we moved on to video tasks, extending multimodality to video-text combinations, and discussed VideoCLIP, an obvious extension of the standard CLIP model.
- 00:45:06 Lastly, I talked about ImageBind. By the way, ImageBind deals with six different modalities, but there are other models that can handle even more. If I remember correctly, there is a model called TextBind, there is also a model called Meta-Transformer, and there is yet another called Composable Diffusion. If you search for these models, you will see that people have tried to extend this to many more modalities, not just six; I think some of those papers go to ten or more modalities.
- 00:45:38 Hopefully this session motivates you to read more of these papers on multimodal modeling, and potentially also to do research in this area. I am of course always excited to work in this area, so if you want to do more research on it, feel free to reach out to me at these coordinates. You will also find a whole bunch of videos around multimodality on my YouTube channel, so feel free to check those out as well. Thanks so much, and I am happy to take questions.
- multimodal models
- vision and language
- VisualBERT
- ViLBERT
- CLIP
- Vision Transformers