#222 Multimodal Models Part1 (as part of IIT Delhi course on Large Language Models (LLMs))
- 00:00:02hi welcome to this session on multimodal
- 00:00:05models part one which is a part of the ELL881
- 00:00:09and AIL821 course on large language
- 00:00:12models introduction and recent advances at
- 00:00:15IIT Delhi um my name is Manish Gupta and
- 00:00:18I'm a principal applied scientist at
- 00:00:20Microsoft uh so let's get started with
- 00:00:23the session today okay um this is a
- 00:00:26session on multimodal models right um
- 00:00:30multimodal models can actually mean
- 00:00:31several modalities this session is going
- 00:00:34to be primarily focused on vision and
- 00:00:36language tasks also note that this
- 00:00:39session is more on multimodal
- 00:00:41understanding and less on generation to
- 00:00:43begin with of course in the part two
- 00:00:45I'll start talking about generation uh
- 00:00:48text generation specifically where the
- 00:00:49input could be multimodal but in this
- 00:00:51session I'm going to focus more on um
- 00:00:55you know multimodal content
- 00:00:57understanding right so what you see on
- 00:00:59the slide here are various vision and
- 00:01:02language tasks there are several
- 00:01:04scenarios where uh information uh could
- 00:01:08be multimodal in nature um specifically
- 00:01:11text and images right so uh the most
- 00:01:14popular multimodal task I would say is
- 00:01:16visual question answering where the
- 00:01:18input is an image and also a question uh
- 00:01:22along with that image so is there
- 00:01:24something to cut the vegetables with and
- 00:01:26this task is actually called as visual
- 00:01:27question answering where given the image
- 00:01:29and the question the idea is to select
- 00:01:32from a large set of answers right an
- 00:01:35understanding task is about selecting
- 00:01:36from a large set of answers right uh but
- 00:01:39a generation task would be to actually
- 00:01:41generate an answer text another related
- 00:01:44task is VCR or Visual Commonsense
- 00:01:46Reasoning where uh you know given this
- 00:01:49image of a social situation right and a
- 00:01:52question why is person 4 pointing at
- 00:01:55person 1 the idea is to be able to
- 00:01:57figure out from these four options which
- 00:02:00is the most correct option most uh you
- 00:02:02know accurate answer right and then uh
- 00:02:05as an extension to the task people have
- 00:02:07also proposed um you know uh this this
- 00:02:09second order task where uh given an
- 00:02:12image and given the question and an
- 00:02:14answer that was chosen what is a
- 00:02:16rationale why is it correct right why is
- 00:02:18the answer correct right so both of them
- 00:02:21involve choosing one of the four
- 00:02:23possible options right here is yet
- 00:02:26another task where given an
- 00:02:29image and uh you see uh given a
- 00:02:32rectangle you basically want to choose a
- 00:02:35piece of text saying hey well this is
- 00:02:36guy in yellow dribbling ball right the
- 00:02:38referring expressions task yet another
- 00:02:41task is about uh caption based image
- 00:02:43retrieval where you know you have a
- 00:02:45particular caption let's say uh this
- 00:02:47text caption and then you have a large
- 00:02:49bunch of images and you want to retrieve
- 00:02:51the most relevant images okay so the
- 00:02:53idea is that here of course you want to
- 00:02:54basically encode the images and the text
- 00:02:57uh in a similar you know uh in
- 00:03:00a single space so that you could
- 00:03:03actually compute similarity between them
- 00:03:05seamlessly and then do good image
- 00:03:06retrieval right so all of them basically
- 00:03:09involve encoding the image and encoding
- 00:03:11the text uh in some form um sometimes
- 00:03:14jointly sometimes uh individually but
- 00:03:16then at some level try to compute
- 00:03:18similarity okay so now what I would do
- 00:03:21is to basically slowly start talking
- 00:03:22about the kind of models people have
- 00:03:24been using to solve these kinds of
- 00:03:26vision and language tasks of course
- 00:03:27there are many other tasks by the way
- 00:03:29right so for example there's also
- 00:03:30multimodal fake uh tweet detection so
- 00:03:33where the Tweet could be multimodal in
- 00:03:35nature there's an image there's a text
- 00:03:36and you want to figure out if this
- 00:03:38combined multimodal tweet is fake or not
- 00:03:40uh similarly there are other tasks like
- 00:03:43uh like hate speech detection so
- 00:03:44there could be a multimodal document and
- 00:03:47you want to figure out let's say a blog
- 00:03:49or again a tweet right and you want to
- 00:03:50figure out if combined overall including
- 00:03:53the image and the text is it hateful or
- 00:03:55not right so so given all of this I'll
- 00:03:58start talking about models that people
- 00:04:00have built around trying to solve uh
- 00:04:02these these multimodal tasks so
- 00:04:04basically how do you do multimodal
- 00:04:05understanding right now uh to be able to
- 00:04:08do multimodal understanding you need to
- 00:04:09encode both the image as
- 00:04:12well as the text now text understanding
- 00:04:14of course you have done over multiple
- 00:04:16lectures in this course uh let
- 00:04:18me actually talk about how do
- 00:04:20you encode images nicely right so far uh
- 00:04:24you know or um until 2020 let's say
- 00:04:27people used to use uh convolutional neural
- 00:04:29networks to encode images people still
- 00:04:31use them but what has become popular
- 00:04:33also is to basically use these Vision
- 00:04:34Transformers to encode images okay so
- 00:04:37the idea is that given a particular
- 00:04:39image of this kind you're going to first
- 00:04:41split it into fixed size patches for
- 00:04:43example this guy is basically split into
- 00:04:45fixed size n patches and uh these
- 00:04:48patches are linearly embedded so
- 00:04:50essentially uh you have some sort of a
- 00:04:52projection which sort of embeds them
- 00:04:53linearly so you have an embedding
- 00:04:55computed per patch and then you also add
- 00:04:58positional embeddings uh to each
- 00:05:00of those patch embeddings and you feed
- 00:05:02the resultant vectors along with
- 00:05:04the CLS token to a Transformer encoder
- 00:05:07right so that's basically uh the way
- 00:05:09these Vision Transformers work so ViT
- 00:05:11or the Vision Transformer model
- 00:05:13basically takes an image splits it into fixed
- 00:05:15size patches linearly embeds them adds a
- 00:05:17position embedding and passes it as
- 00:05:19input to the uh Transformer encoder
- 00:05:21along with the CLS token on the CLS
- 00:05:23token you could of course attach an MLP
- 00:05:25head and use it for various
- 00:05:26classification purposes for example in
- 00:05:28this particular case maybe you want to
- 00:05:29do thousand class classification so you
- 00:05:32basically take the image and you pass
- 00:05:34it through this Transformer
- 00:05:36encoder and you're going to attach a
- 00:05:38thousand sized
- 00:05:41output layer right thousand neurons in the
- 00:05:43output okay um now in the ViT model
- 00:05:48the Transformer encoder is a standard
- 00:05:49encoder you make no changes
- 00:05:50essentially you just have the multi-head
- 00:05:52self attention you have the MLP feed
- 00:05:54forward layer and so on right um the ViT
- 00:05:57model comes in three different sizes 12
- 00:05:59layer 24 layer 32 layer base large and
- 00:06:01huge sizes with different hidden
- 00:06:03dimensions and number of attention heads
- 00:06:05and so on right the largest one is
- 00:06:08basically 632 million parameters this
- 00:06:10model was pre-trained using obvious data
- 00:06:12sets ImageNet-1k ImageNet-21k and JFT
- 00:06:17right um see these models basically uh
- 00:06:19differ in terms of not just the base
- 00:06:21large and huge the sizes but they also
- 00:06:23differ in terms of the patch
- 00:06:26sizes that they take as input so for
- 00:06:28example ViT-L/16 basically takes 16 x
- 00:06:3116 input patch size so of course smaller
- 00:06:34patch size would mean larger sequence
- 00:06:36length but there is basically a
- 00:06:38trade-off uh larger
- 00:06:40sequence length would mean
- 00:06:41higher latency and so on uh but it
- 00:06:44could also mean higher accuracy in that
- 00:06:45sense so that's that and what
- 00:06:47people have shown is that Vision
- 00:06:48Transformers match or exceed residual
- 00:06:51networks on many many image
- 00:06:52classification data sets and therefore
- 00:06:55people have started using Transformer
- 00:06:57models Vision Transformer models for
- 00:06:58encoding images as well okay now you
- 00:07:01would be thinking hey where is
- 00:07:02multimodality here but this was really
- 00:07:04important to build there's no
- 00:07:05multimodality here so far it was
- 00:07:07important to build so as to bring this
- 00:07:09notion that yes you could use Transformer
- 00:07:11models even to encode images right and
- 00:07:13now here is our first multimodal model
- 00:07:15it's called VisualBERT right so the
- 00:07:17model is actually called VisualBERT and
- 00:07:18as you can imagine it sort of is going
- 00:07:20to integrate BERT for text and you know
- 00:07:24vision also um using the Transformer
- 00:07:28model itself okay so the idea is uh
- 00:07:31that um uh you have to somehow uh
- 00:07:34understand how to pre-train this model
- 00:07:35so VisualBERT is a pre-trained
- 00:07:37multimodal model now if you remember
- 00:07:39BERT how was it pre-trained the
- 00:07:41pre-training goal was to be able to
- 00:07:42learn a task agnostic language model
- 00:07:45such that it understands English like a
- 00:07:47second grade third grade kid okay and
- 00:07:50the way it was done pre-training data
- 00:07:52is always obtained in a
- 00:07:53self-supervised manner in the sense that
- 00:07:56there should be large amounts of this
- 00:07:57data and no extra human labeling
- 00:07:59required right now when you think about
- 00:08:01a multimodal pre-training what you want
- 00:08:03is a model which can basically
- 00:08:05understand how to link you know uh
- 00:08:08portions of those images or patches in
- 00:08:10those images with text words right so
- 00:08:12essentially patches in images with words
- 00:08:15right so
- 00:08:17uh unlike BERT which basically just
- 00:08:19needs to understand the relationship between
- 00:08:21various words uh VisualBERT needs to
- 00:08:24understand the relationship between
- 00:08:25these text words and visual patches of
- 00:08:28images okay how do you enable that
- 00:08:30what kind of pre-training data would you use
- 00:08:32so what folks realized is that there's a
- 00:08:34whole bunch of image captioning data
- 00:08:36which is available and therefore they
- 00:08:37said that hey we'll use image captioning
- 00:08:39data image text pairs so as to be able to
- 00:08:41pre-train this model so you have images
- 00:08:43and you have a corresponding caption
- 00:08:45associated with it so VisualBERT
- 00:08:47basically leveraged the MS COCO data
- 00:08:49which is basically 120,000 images
- 00:08:52each with five different captions
- 00:08:54leading to a data set of 600,000 image
- 00:08:57text pairs right and they basically use
- 00:09:00the standard
- 00:09:02Transformer encoder model uh to be able
- 00:09:04to do
- 00:09:07this pre-training okay so uh the way
- 00:09:10they did this is basically to
- 00:09:13uh add um you see there's a
- 00:09:16CLS token and then there is the text
- 00:09:17caption which is going in as input and
- 00:09:19there is of course the image uh pieces
- 00:09:21that are going in as input okay now
- 00:09:24let's understand this step by step
- 00:09:26so essentially um uh just like one would
- 00:09:29do masked language modeling in BERT here they
- 00:09:32also mask some text tokens so
- 00:09:35that's obvious and then text and images
- 00:09:37are separated by a separator token which
- 00:09:38is also obvious now what are these image
- 00:09:40pieces well these image pieces basically
- 00:09:43um are not tiles or patches as you
- 00:09:45observed on the previous slide but well
- 00:09:47in their case they actually used an
- 00:09:49object detection model called as Faster
- 00:09:50R-CNN so as to basically take this image
- 00:09:53and divide it into different patches uh
- 00:09:55such that each patch actually captures a
- 00:09:58relevant object like a cap or a ball or
- 00:10:00a tennis racket or a shirt and so on
- 00:10:02okay so they basically take objects uh
- 00:10:05which are returned by this Faster R-CNN
- 00:10:07model uh and pass them as inputs you
- 00:10:10know as input image tokens so what
- 00:10:12are your image words so to say well or
- 00:10:15image tokens they're basically objects
- 00:10:16as detected by the
- 00:10:19Faster R-CNN object detector okay um uh now to
- 00:10:22encode these things they have like three
- 00:10:24different kinds of um you know things
- 00:10:26that go as input at every position there
- 00:10:28is of course a position embedding that
- 00:10:30is obvious in any Transformer model
- 00:10:31there's also segment embedding which
- 00:10:33basically tells you whether this is a
- 00:10:34text segment or an image segment and
- 00:10:36there's also a token or image embedding
- 00:10:37so of course you know there need to be
- 00:10:39some text features or image features
- 00:10:41that need to be passed for tokens
- 00:10:42basically it's the token embedding and
- 00:10:44for images it basically comprises the
- 00:10:46features from Faster R-CNN and so on okay
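
A minimal sketch of the three-way input combination just described (content embedding plus segment embedding plus position embedding at every position). Everything here is illustrative: the hidden size, the random stand-in vectors, and the helper names `random_vector`, `add_vectors` and `build_inputs` are assumptions and not VisualBERT's actual code, and the region features are assumed to be already projected to the hidden size.

```python
import random

HIDDEN = 8  # hypothetical hidden size, kept tiny for readability


def random_vector(rng, dim=HIDDEN):
    """Stand-in for a learned embedding vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def build_inputs(text_tokens, region_features, rng):
    """Return one summed input vector per position: text tokens first, then image regions."""
    segment_table = {"text": random_vector(rng), "image": random_vector(rng)}
    token_table = {}  # token string -> embedding, filled lazily
    sequence = [("text", tok) for tok in text_tokens]
    sequence += [("image", feat) for feat in region_features]
    position_table = [random_vector(rng) for _ in sequence]

    inputs = []
    for position, (kind, item) in enumerate(sequence):
        if kind == "text":
            if item not in token_table:
                token_table[item] = random_vector(rng)
            content = token_table[item]
        else:
            content = item  # region feature, assumed already projected to HIDDEN
        inputs.append(add_vectors(content, segment_table[kind], position_table[position]))
    return inputs


rng = random.Random(0)
caption = ["[CLS]", "a", "man", "[MASK]", "tennis", "[SEP]"]
regions = [random_vector(rng) for _ in range(3)]  # pretend object detector outputs
inputs = build_inputs(caption, regions, rng)
print(len(inputs), len(inputs[0]))  # 9 positions, each of size HIDDEN
```

The resulting list of vectors is what would be fed to the Transformer encoder; in a real model the three tables are learned parameters rather than random values.
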
- 00:10:49so all of this is combined and then fed
- 00:10:51to the standard Transformer model uh
- 00:10:54Transformer encoder model and um uh then
- 00:10:57you basically train pre-train this
- 00:10:59Transformer model using two different
- 00:11:01objective functions we'll talk about
- 00:11:02these two objective functions next uh
- 00:11:05the first one is very simple it's just
- 00:11:06the masked language modeling so masked language
- 00:11:08modeling remember no masked image modeling so
- 00:11:10on the image side you do not really hide
- 00:11:12any of those images but on the text side
- 00:11:14you mask these text pieces and the idea
- 00:11:16is that just like in BERT uh you know
- 00:11:19masked language modeling aims to be able
- 00:11:21to uh predict the masked word at the
- 00:11:23output at the same position by
- 00:11:26leveraging knowledge uh or borrowing
- 00:11:28knowledge uh from other tokens unmasked
- 00:11:30tokens right in the same way in
- 00:11:33VisualBERT the idea is that at the same
- 00:11:35position you know the model should be
- 00:11:37able to guess the hidden word or the
- 00:11:39masked word um you know by leveraging
- 00:11:43uh knowledge from the unmasked uh text
- 00:11:46words and also of course all the image
- 00:11:48tokens because no image tokens are masked
- 00:11:51so you know the accuracy here for masked
- 00:11:53language modeling should be ideally
- 00:11:54higher you know because now the masked word
- 00:11:57let's say tennis in tennis racket uh tennis can
- 00:12:00possibly be guessed not just based
- 00:12:02on racket and ball and so on but also
- 00:12:05based on what the model sees in the
- 00:12:07image in that sense right so this
- 00:12:09knowledge should be able to help um you
- 00:12:11know improve the masked language
- 00:12:12modeling accuracy on the way making the
- 00:12:14model learn how to um relate uh this
- 00:12:17particular image with the word called
- 00:12:20tennis okay so that's that the second
- 00:12:23task is and by the way
- 00:12:25you know this objective one the
- 00:12:27masked language
- 00:12:29modeling objective is going to be
- 00:12:30computed only on those tokens uh where
- 00:12:33the text words or text tokens were masked
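
A small sketch of the point just made, that the masked language modeling loss is averaged only over the masked text positions. The function name `masked_lm_loss`, the three-word vocabulary and the probability values are made up for illustration; a real model would produce the probabilities with a softmax over its full vocabulary.

```python
import math


def masked_lm_loss(predicted_probs, target_ids, masked_positions):
    """Average cross-entropy over the masked positions only; other positions are ignored."""
    losses = []
    for pos in masked_positions:
        prob_of_true_token = predicted_probs[pos][target_ids[pos]]
        losses.append(-math.log(prob_of_true_token))
    return sum(losses) / len(losses)


# Toy example: 3 text positions, vocabulary of 3 tokens, only position 1 was masked.
predicted_probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
]
target_ids = [0, 1, 2]
masked_positions = [1]
loss = masked_lm_loss(predicted_probs, target_ids, masked_positions)
print(round(loss, 4))  # 0.3567, i.e. -ln(0.7)
```

Positions 0 and 2 contribute nothing to the loss even though the model also emits a distribution there.
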
- 00:12:35okay now on the other hand let's talk
- 00:12:37about this objective two it's basically
- 00:12:40the sentence image prediction task the
- 00:12:42idea is that um you see in a particular
- 00:12:44batch of samples you would give image
- 00:12:47text pairs but half of those pairs are
- 00:12:49going to be positive pairs half of them
- 00:12:50are going to be negative pairs now what
- 00:12:52is a positive pair positive pair
- 00:12:53basically means that the caption is
- 00:12:55linked with the image itself and you
- 00:12:58know which means that it is
- 00:12:59similar it is relevant for the image
- 00:13:02right and then the negative pair would
- 00:13:03be this caption is linked with the
- 00:13:05negative image you know irrelevant image
- 00:13:07to the caption okay so the sampled
- 00:13:09positive and negative pairs create a
- 00:13:10batch and then you know the objective
- 00:13:12two is basically all about figuring out
- 00:13:14whether the you
- 00:13:17know uh the MLP head out there um
- 00:13:21is able to predict correctly whether
- 00:13:22this is a positive pair or a negative
- 00:13:23pair right so that's basically that um
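
One way to picture the batch construction for this sentence image prediction objective is the sketch below: every other caption keeps its own image (label 1) and the rest are paired with a different image (label 0). The alternating rule, the function name `make_sentence_image_batch` and the toy image ids are assumptions for illustration; the lecture only says that half the pairs are positive and half negative.

```python
import random


def make_sentence_image_batch(pairs, rng):
    """pairs: list of (image_id, caption). Returns (image_id, caption, label) triples."""
    batch = []
    for index, (image_id, caption) in enumerate(pairs):
        if index % 2 == 0:
            batch.append((image_id, caption, 1))  # positive: caption with its own image
        else:
            other_images = sorted({img for img, _ in pairs if img != image_id})
            batch.append((rng.choice(other_images), caption, 0))  # negative: mismatched image
    return batch


pairs = [
    ("img_a", "a man playing tennis"),
    ("img_b", "a dog on a sidewalk"),
    ("img_c", "a bowl of guacamole"),
    ("img_d", "two people in a lecture room"),
]
batch = make_sentence_image_batch(pairs, random.Random(0))
for image_id, caption, label in batch:
    print(label, image_id, caption)
```

A binary classifier head on top of the encoder would then be trained to predict the label for each triple.
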
- 00:13:26now the interesting part about
- 00:13:28VisualBERT is that uh you could basically now
- 00:13:31look at the attention weights of some
- 00:13:34selected heads uh at the output
- 00:13:36layer at the last layer and uh then by
- 00:13:39looking at those attention weights you
- 00:13:41can visualize them and try to see uh
- 00:13:44remember in self attention you have like
- 00:13:46now words text words paying attention to
- 00:13:48uh image uh pieces and now you can try
- 00:13:51to see you know how are they correlated
- 00:13:53okay so let me look at layer 11 and uh
- 00:13:57you see uh this heat map is
- 00:13:59basically drawn by showing um uh
- 00:14:02attention between text words you see
- 00:14:05there and image tokens you see there
- 00:14:07right so five image tokens
- 00:14:09you know uh corresponding to man which
- 00:14:11are also highlighted in this image by
- 00:14:13the way so the red one is man
- 00:14:15the you know bluish one is shirt and you
- 00:14:18know the bluish kind of stuff is
- 00:14:20sidewalk and so on right so what you
- 00:14:22observe is that uh if I look at the
- 00:14:25word man right fortunately it has very
- 00:14:28high you know um attention weight
- 00:14:32for the box man
- 00:14:34in that sense for the image piece man
- 00:14:36okay and that's very useful and nice
- 00:14:38because it sort of nicely tells us that
- 00:14:40the model is actually learning to
- 00:14:42correlate pieces in the image with the
- 00:14:44tokens in the text okay uh so and so on
- 00:14:46you can actually pause the video here
- 00:14:47and essentially observe that this holds
- 00:14:49also for other kinds of things like
- 00:14:51sidewalk pedestrians shirt and so on
- 00:14:53okay now after this people tried to
- 00:14:56improve the architecture they came up
- 00:14:57with this new architecture a new model called
- 00:14:59as ViLBERT okay by the way you can also
- 00:15:01look at these papers most of my slides
- 00:15:02actually have citations for these papers
- 00:15:04at the bottom okay um so ViLBERT
- 00:15:08basically believes in a two Tower model
- 00:15:10unlike VisualBERT VisualBERT was a
- 00:15:12single tower model the concatenation of
- 00:15:14text and image modalities sort of happened
- 00:15:16right in the first layer
- 00:15:17right in the zeroth layer in that sense
- 00:15:19okay but ViLBERT basically believes in
- 00:15:21processing text separately using a few
- 00:15:24Transformer layers as you see
- 00:15:26them here and processing the image
- 00:15:28separately using a few layers now these
- 00:15:29layers could be you know um uh
- 00:15:32Transformer layers and so on so uh
- 00:15:35notice that the text stream in ViLBERT
- 00:15:38actually has much more processing before
- 00:15:39interacting with the visual features but
- 00:15:41as I was saying well it's a two Tower
- 00:15:42model where the text and image are
- 00:15:45processed separately in their own
- 00:15:46pipelines but then there is also a
- 00:15:48fusion which happens where co-Transformer
- 00:15:50layers or co-attention based
- 00:15:52Transformer layers basically try to fuse
- 00:15:54the information across both the
- 00:15:55pipelines okay uh and then there are
- 00:15:57other Transformer layers further which
- 00:15:59basically do individual processing
- 00:16:00separately okay now the interesting part
- 00:16:03is that this linguistic stream is
- 00:16:05basically BERT based you
- 00:16:09know um uh BERT-base and then you know
- 00:16:12uh again for the visual stream
- 00:16:14essentially you use uh Faster R-CNN to
- 00:16:17essentially figure out those image
- 00:16:19patches and so on okay this Faster R-CNN
- 00:16:21both in VisualBERT and ViLBERT is
- 00:16:23pre-trained on the Visual Genome data set okay
- 00:16:25okay so now how does this co-attention how
- 00:16:27do these co-Transformer layers work
- 00:16:29right so a standard Transformer layer it
- 00:16:31has the typical standard self attention
- 00:16:32and feed forward right with the qkv the
- 00:16:35query keys and values coming from a
- 00:16:37single stream right but in ViLBERT you
- 00:16:39have two different streams the visual
- 00:16:41stream and then the
- 00:16:42linguistic stream right uh and what you
- 00:16:45do uh for transferring information
- 00:16:48across the two is to basically use uh
- 00:16:50the query of the same modality but the
- 00:16:53you know the keys and the values coming
- 00:16:55from the other modality from the
- 00:16:56linguistic stream in this particular
- 00:16:57example right in this particular case
- 00:16:59and in the other case in the linguistic
- 00:17:00stream again you're going to use the
- 00:17:01query from the linguistic stream but
- 00:17:03you're going to use the keys and the
- 00:17:04values from the visual stream so as to
- 00:17:05essentially do this cross-pollination of
- 00:17:08information um in the attention
- 00:17:10layer right and then of course you do
- 00:17:11this typical standard feed forward with
- 00:17:13all those residual connections add and
- 00:17:14normalization and so on okay so that's
- 00:17:17how ViLBERT works now another
- 00:17:19interesting part about ViLBERT is that
- 00:17:21rather than depending on manually
- 00:17:23labeled MS COCO data they actually
- 00:17:25depended on Conceptual Captions data this
- 00:17:27data set of Conceptual Captions is
- 00:17:29basically obtained in an automated
- 00:17:31manner by scraping things from the web
- 00:17:33so the idea is that on the web there are
- 00:17:35Wikipedia Pages news articles and many
- 00:17:37other web pages where you
- 00:17:39have an image and underneath that
- 00:17:40there's a caption okay there are also
- 00:17:43images with very nice alt tags
- 00:17:45associated with them these serve as
- 00:17:47really good sources for caption
- 00:17:49information along with images and that
- 00:17:51is what the ViLBERT guys basically leverage um
- 00:17:54so as to uh essentially you know um
- 00:17:58create the visual uh grounding data
- 00:18:00right uh visual image text pairs
- 00:18:02essentially and then they pre-train
- 00:18:03the ViLBERT model based on that
- 00:18:06now in the ViLBERT model again there are two
- 00:18:08pre-training loss functions so just like
- 00:18:10in VisualBERT there is a multimodal
- 00:18:11alignment prediction function to
- 00:18:13basically just predict whether this
- 00:18:14image and text are aligned with each
- 00:18:16other or not and that's basically the
- 00:18:17same as VisualBERT so there's no
- 00:18:18difference in that sense except the
- 00:18:19change in the name right but then on the
- 00:18:22other hand the masked language modeling is
- 00:18:23actually now extended to become
- 00:18:25masked multimodal learning okay the
- 00:18:28interesting part is that you know
- 00:18:30rather than just masking out text you
- 00:18:31can actually also mask out image
- 00:18:33pieces uh in 2019 you know there was no
- 00:18:37good technology to basically generate
- 00:18:39back image pieces themselves in the same
- 00:18:43position but what you could do or rather
- 00:18:45what ViLBERT does is to basically
- 00:18:46generate a distribution over a set of
- 00:18:48class labels so of course you know
- 00:18:50because you're using Faster R-CNN you know what
- 00:18:52particular object this particular
- 00:18:54position indicates the image piece at
- 00:18:56this position indicates and then the
- 00:18:58idea is that
- 00:18:59the goal or the objective is
- 00:19:01such that the ViLBERT model is motivated
- 00:19:04to learn the right
- 00:19:06distribution over those objects uh at
- 00:19:09the output at the same position
- 00:19:11okay if it learns great else there's a
- 00:19:12cross-entropy loss
- 00:19:14back propagated right so that's
- 00:19:15basically masked multimodal learning a great
- 00:19:17extension from just the masked language
- 00:19:19modeling as done in
- 00:19:21VisualBERT okay so that's great now the idea
- 00:19:25behind ViLBERT and VisualBERT so far is
- 00:19:27that you have
- 00:19:29image text pairs and you could do a nice
- 00:19:32uh you know modeling of both the
- 00:19:35modalities together and come up with
- 00:19:37these interesting embeddings in that
- 00:19:39sense and uh hopefully it gives you
- 00:19:41good accuracies however uh over time
- 00:19:45people have moved to using contrastive
- 00:19:46training contrastive loss based training
- 00:19:49and that is what the CLIP model is also
- 00:19:50famous for as it says it's contrastive
- 00:19:53language image pre-training okay so the
- 00:19:56way the CLIP model works is that it's also
- 00:19:58a two tower model in that sense and uh
- 00:20:01then it has a contrastive loss right at
- 00:20:02the very end in that sense okay so uh
- 00:20:06you use uh you know I mean of course
- 00:20:09you have a text caption and you have an
- 00:20:10image as well so you use the text caption
- 00:20:12to basically train a text encoder or
- 00:20:15you know you use a text
- 00:20:17encoder and you use it to encode the
- 00:20:19text in that sense you use an
- 00:20:21image encoder to encode the images and
- 00:20:23then so in their particular case in fact
- 00:20:25they essentially used a 12 layer
- 00:20:28Transformer for the text encoder and
- 00:20:30for the image encoder well they actually
- 00:20:31experimented with quite a few so there
- 00:20:34are five ResNets they still believed in
- 00:20:35convolutional neural networks and you know
- 00:20:38uh three different uh ViT models so ViT
- 00:20:40base and ViT large with different patch
- 00:20:42sizes 32 16 and 14 as you see okay uh
- 00:20:46and then what do they do they basically
- 00:20:48pre-train this with a contrastive uh loss
- 00:20:50that I'm going to explain very soon
- 00:20:52using 400 million web image
- 00:20:55text pairs okay so the data set is
- 00:20:57called WebImageText and is basically 400
- 00:20:59million image comma text pairs okay uh so
- 00:21:02you see I mean VisualBERT was on 600k
- 00:21:04image text pairs ViLBERT on 3 million
- 00:21:06uh you know CLIP is basically on 400
- 00:21:08million and the interesting part is it
- 00:21:09also uses contrastive loss to do
- 00:21:11pre-training okay now the way this
- 00:21:13pre-training works is that uh you have
- 00:21:16you have text tokens and you have
- 00:21:18embeddings for each of those text tokens
- 00:21:20at different positions from the text
- 00:21:21encoder the 12 layer Transformer you also
- 00:21:23have image tokens and you have a
- 00:21:24representation for each of those uh for
- 00:21:26various image pieces right so what you
- 00:21:29do is basically you try to compare uh
- 00:21:31these and by the way
- 00:21:34you know from a batch of N
- 00:21:37instances let's say if I have a batch of
- 00:21:38N samples so consider N real pairs
- 00:21:42real image text pairs N samples in a
- 00:21:44batch okay what you're going to do is to
- 00:21:47basically take a pooled text embedding a
- 00:21:49pooled image embedding and you're going
- 00:21:50to compute cosine similarity between them
- 00:21:53so I1 dot T1 I1 dot T2 so given a batch
- 00:21:56let's say batch of 20 you're basically
- 00:21:57going to end up with 400 similarities
- 00:21:59because you have 20 image
- 00:22:03embeddings and 20 text embeddings you
- 00:22:04get like 400 different similarities of
- 00:22:06course what you
- 00:22:08know is that there are only 20 real pairs so
- 00:22:10therefore there are only 20 positive
- 00:22:12pairs right and if you really did all of
- 00:22:14this you know 20 cross 20 kind of
- 00:22:16computation you have like 400 minus 20
- 00:22:19380 negative pairs what you want to do
- 00:22:21is to maximize the cosine similarity of
- 00:22:23the image and text embeddings of the N real
- 00:22:24pairs while minimizing cosine
- 00:22:27similarity of the embeddings of the
- 00:22:28incorrect pairings right so 380 incorrect
- 00:22:31pairings so you want to maximize those
- 00:22:3320 the diagonal right and minimize the
- 00:22:35similarity um for for those remaining
- 00:22:37380 which are negative right so this is
- 00:22:41what gives basically awesome uh accuracy
- 00:22:43values in fact CLIP was tested on 30
- 00:22:45plus computer vision tasks like OCR
- 00:22:47action recognition in videos and so on so
- 00:22:48forth and they basically found CLIP to
- 00:22:50be really doing very well even in a
- 00:22:52zero shot manner okay uh
- 00:22:57on several data sets
- 00:22:58it was found to be better
- 00:23:00than even fully supervised baselines
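
The N x N similarity grid described above can be sketched in plain Python as follows. The symmetric cross-entropy form (one pass over rows, one over columns, with the diagonal as the correct class) follows the pseudocode in the CLIP paper; the fixed temperature of 0.07, the toy 2-dimensional embeddings and the function names are assumptions for illustration, and a real implementation would use batched tensor operations and a learned temperature.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embs, text_embs):
    """Entry [i][j] is the cosine similarity between image i and text j."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts] for img in images]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for row_index, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[row_index]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image-to-text plus text-to-image, averaged."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[2.0, 0.0], [0.0, 3.0]]   # text i points the same way as image i
texts_swapped = [[0.0, 3.0], [2.0, 0.0]]   # captions assigned to the wrong images
aligned_loss = clip_style_loss(images, texts_aligned)
swapped_loss = clip_style_loss(images, texts_swapped)
print(aligned_loss < swapped_loss)  # True
```

Minimizing this loss pushes the diagonal similarities up and the off-diagonal ones down, which is the maximize the real pairs, minimize the rest behaviour described in the lecture.
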
- 00:23:02okay so here are more details about CLIP
- 00:23:04so essentially as you notice here uh
- 00:23:07you know uh so
- 00:23:11essentially you can actually use CLIP
- 00:23:13even for zero shot classification for
- 00:23:15classification problems which basically
- 00:23:17involve new classes at test time okay so
- 00:23:20for example what you could do uh is that
- 00:23:23you can take an image and uh let's say
- 00:23:25you have some new class labels right at
- 00:23:27test time and you want to figure out if this
- 00:23:29new test time class label holds good for
- 00:23:31this image or not okay all you need to
- 00:23:33do is to basically take that class label
- 00:23:34pass it through a text encoder and the
- 00:23:36text encoder produces a text embedding
- 00:23:38and this is at inference time by
- 00:23:39the way right so basically you take the
- 00:23:41image pass through the image encoder get
- 00:23:42an image embedding and just try to
- 00:23:44compute the similarities whichever has
- 00:23:45the highest similarity is the one that
- 00:23:47you actually predict as the as the right
- 00:23:49caption or the right text uh class label
- 00:23:51for this particular image okay that's
- 00:23:54that now the interesting part so what
- 00:23:56they did was to compare zero shot
- 00:23:57CLIP with the ResNet-50 supervised
- 00:24:00model across several of
- 00:24:02these data sets so notice CLIP is zero
- 00:24:05shot it's not fine tuned on any of these
- 00:24:07data sets but ResNet is not zero shot I
- 00:24:10mean you actually take the
- 00:24:11pre-trained rest net and fine tune it on
- 00:24:13the training set of these data sets and
- 00:24:15what they observed is that among these
- 00:24:17data sets you know several of these data
- 00:24:19sets CLIP actually gives you a positive
- 00:24:21improvement significantly positive
- 00:24:22improvements compared to uh
- 00:24:25ResNet okay uh here are a few examples
- 00:24:28based on how CLIP performs so here's an
- 00:24:30example from the Food-101 data set um you
- 00:24:33know uh it nicely predicts that this
- 00:24:36particular food is not you know any of
- 00:24:38those but guacamole right and then you
- 00:24:41can also use it for other kinds of
- 00:24:42classification problems like classifying
- 00:24:44uh you know what setting is this is it a
- 00:24:45television studio podium indoor
- 00:24:47conference room lecture room control
- 00:24:49room and so on you could also basically
- 00:24:50try to figure out what particular object
- 00:24:52is highlighted in the image or you could
- 00:24:54basically use it for classifying uh the
- 00:24:56land type so whether it is
- 00:24:58permanent crop land pasture land you know
- 00:25:01highway or road or ocean or you know
- 00:25:03shrubland and so on so forth okay so that's
- 00:25:07CLIP okay um well now similar kind of
- 00:25:10models have also been trained and used
- 00:25:12for uh doing document understanding uh
- 00:25:15visually rich document understanding all
- 00:25:16right so these are scans of various
- 00:25:19kinds of documents so for example what
- 00:25:21you see here is essentially um uh some
- 00:25:24sort of key value pairs highlighted in
- 00:25:26this interesting clearance sheet
- 00:25:29as such but you could also have scans
- 00:25:31for invoices and so on okay uh the
- 00:25:33interesting part is that uh using uh a
- 00:25:36model popularly called LayoutLM uh
- 00:25:38which of course I'll also talk a little bit
- 00:25:39about on the next few slides right uh
- 00:25:42what one could do is to nicely extract
- 00:25:43the key value pairs from this document
- 00:25:45okay one can also basically do question
- 00:25:47answering on these documents so uh here
- 00:25:49is a postcard scan and you could
- 00:25:52basically then ask questions like
- 00:25:54mention the ZIP code written and then it
- 00:25:56can nicely figure out that the ZIP code
- 00:25:57is that you can also ask it for the date
- 00:25:59on the seal at the top and it nicely
- 00:26:01figures out the seal and so on so forth
- 00:26:04the date on the seal okay you could
- 00:26:06also use it for legal contract
- 00:26:07management in the sense that given
- 00:26:09document scans you could basically ask
- 00:26:11it to highlight what are the important
- 00:26:12legal uh phrases that I must be uh
- 00:26:15paying attention to or extract just key
- 00:26:17value pairs so basically which parties
- 00:26:19signed the document or when was it
- 00:26:20signed and so on okay uh of course if
- 00:26:23you have uh lots of documents on users'
- 00:26:27OneDrive or Google Drive accounts you
- 00:26:28could try to build an app which can
- 00:26:30classify those documents or categorize
- 00:26:32them into popular categories like uh
- 00:26:35like you know personal identification
- 00:26:36documents like passport and PAN cards
- 00:26:38and so on uh while uh while another
- 00:26:41category could just be all kinds of
- 00:26:42invoices utility bills and so on right
- 00:26:45you could of course also use these kinds
- 00:26:46of models to do uh recognition over
- 00:26:48Walmart receipts or any other
- 00:26:49supermarket receipts in that sense and
- 00:26:52the main model for doing this kind of
- 00:26:54visually rich document processing a
- 00:26:56very popular model is LayoutLM and you
- 00:26:58know there are of course various
- 00:27:00versions LayoutLM v1 v2 there's also a
- 00:27:02LayoutLLM in that sense you know I'd
- 00:27:04motivate you folks to go ahead and look
- 00:27:06at it later but uh what is interesting is
- 00:27:10uh that it basically uses Transformer
- 00:27:12models okay in the particular case they
- 00:27:13used a Transformer called UniLMv2
- 00:27:16to initialize and of course uh then they
- 00:27:19basically took domain specific data
- 00:27:20you know visually rich layout
- 00:27:24data and they basically tried to train
- 00:27:26this model using um document
- 00:27:29specific loss functions as well okay um
- 00:27:33so very broadly what they do is to take
- 00:27:34the document and uh they mask out
- 00:27:37certain lines on this document so they
- 00:27:39hide out certain lines I'll call them
- 00:27:41hidden out lines in that senses okay um
- 00:27:44uh then they basically also um you know
- 00:27:47uh so this this hidden out uh you know
- 00:27:50image is divided into different parts
- 00:27:52and then encoded using a visual
- 00:27:54encoder right on the other hand you take
- 00:27:57uh so these
- 00:27:58you basically take uh uh the document
- 00:28:01and then uh you take the lines from the
- 00:28:04document and uh essentially for the
- 00:28:06lines which are not covered you
- 00:28:09basically have for each line you
- 00:28:11basically have this notion whether it is
- 00:28:13covered or not well when they
- 00:28:15when they hide out they don't hide out
- 00:28:17partial lines they hide out an entire
- 00:28:18line and so on okay so you have a
- 00:28:20covered line and a non-covered line okay
- 00:28:23and uh uh what you do is to basically
- 00:28:25take the document and um you know you
- 00:28:27have an OCR or PDF parser which gives you
- 00:28:29text so that's how you get these text
- 00:28:31tokens so you basically have the text
- 00:28:32tokens and you can of course do masked
- 00:28:34language modeling so therefore some
- 00:28:35tokens are masked as well remember the
- 00:28:37hiding part is different from the masking
- 00:28:38part okay of course you can mask out
- 00:28:41text tokens so the OCR is actually done
- 00:28:43on the on the non-hidden version of the
- 00:28:45document so that the OCR quality is good
- 00:28:47in that senses right but then you have
- 00:28:49this information which line is hidden or
- 00:28:50not okay now as you see the Transformer
- 00:28:53is being passed four different things so
- 00:28:54of course the first one is basically
- 00:28:55segment embeddings whether it is
- 00:28:57basically processing image tokens
- 00:28:59versus is it processing um you know text
- 00:29:02tokens and then on the text tokens also
- 00:29:04you could basically say whether it is a
- 00:29:05masked token or a non-masked token yellow or
- 00:29:08blue all right uh of course you pass a 1D
- 00:29:11position embedding so essentially you
- 00:29:12must pass some position some notion of
- 00:29:14position right you also pass two
- 00:29:16dimensional position embeddings for the
- 00:29:17box so for example for the text tokens
- 00:29:19essentially sorry for the image
- 00:29:21tokens you have a box so essentially uh
- 00:29:24you know uh x y w h and uh you can
- 00:29:27actually also um uh encode width and
- 00:29:29height of the box as part of the um uh
- 00:29:322D position embeddings uh and
- 00:29:35then for the CLS token you actually
- 00:29:37um you know pad with a box with all the
- 00:29:40six things x min x max y min y max and
- 00:29:43height and width
- 00:29:44set to all zeros right okay
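
The box features behind the 2D position embeddings can be sketched roughly like this. It is an illustrative sketch only: the function name `box_features` is hypothetical, and rescaling coordinates to a 0 to 1000 grid is an assumption that the lecture does not state. In a real model each of the six integers would index a learned embedding table.

```python
def box_features(x_min, y_min, x_max, y_max, page_width, page_height, scale=1000):
    """Turn a pixel-space bounding box into six integer layout features.

    Assumption: coordinates are rescaled to a 0..scale grid so that pages of
    different sizes share the same coordinate vocabulary.
    """
    def rescale(value, size):
        return int(scale * value / size)

    xmin, xmax = rescale(x_min, page_width), rescale(x_max, page_width)
    ymin, ymax = rescale(y_min, page_height), rescale(y_max, page_height)
    return {
        "x_min": xmin, "x_max": xmax,
        "y_min": ymin, "y_max": ymax,
        "width": xmax - xmin, "height": ymax - ymin,
    }

# The CLS token has no location on the page, so all six values are zero.
CLS_BOX = {"x_min": 0, "x_max": 0, "y_min": 0, "y_max": 0, "width": 0, "height": 0}

token_box = box_features(100, 50, 300, 80, page_width=1000, page_height=500)
print(token_box)
# {'x_min': 100, 'x_max': 300, 'y_min': 100, 'y_max': 160, 'width': 200, 'height': 60}
```
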
- 00:29:48uh so that's that so that's how you
- 00:29:49basically have 2D position embeddings then
- 00:29:51you also have the text and uh visual
- 00:29:53embedding so um essentially um you use
- 00:29:57um you know Mask R-CNN embeddings uh
- 00:30:00because you're using that as the visual
- 00:30:02encoder here right uh and uh for the text
- 00:30:05you basically just use the uh use the
- 00:30:07standard text embeddings okay uh now the
- 00:30:10Transformer yes they experiment with the
- 00:30:11two different models base size and large
- 00:30:13size 12 layers and 24 layers and uh
- 00:30:16basically which means 200 million or 426
- 00:30:18million parameters okay uh so now you
- 00:30:21know pre-training objectives so there are
- 00:30:22three different pre-training objectives masked
- 00:30:24visual language modeling right which is
- 00:30:26the typical uh you know you can mask the
- 00:30:28images part or the text part and you can
- 00:30:29try to um uh I think in their particular
- 00:30:32case I think they just masked the uh the
- 00:30:34text part so therefore visual language
- 00:30:36modeling so if you mask these text
- 00:30:38tokens you will basically try to predict
- 00:30:39them what is the text right text image
- 00:30:42alignment so um essentially uh they are
- 00:30:44just predicting whether a particular
- 00:30:47token belongs to the covered class or
- 00:30:49the not covered class so remember you
- 00:30:51basically covered some lines so this
- 00:30:53text token belongs to the covered class
- 00:30:55versus these ones belong
- 00:30:56to not covered class okay and then
- 00:30:59lastly you have text image matching so
- 00:31:01essentially you know whether this
- 00:31:02particular image and the text match with
- 00:31:04each other or not just like the
- 00:31:05VisualBERT and the ViLBERT models okay those
- 00:31:08are the three objectives now of course for the
- 00:31:09pre-training data they obtained like 11 million
- 00:31:11scanned documents and they used the
- 00:31:14Microsoft Read API for the OCR part okay
- 00:31:17and that's how LayoutLMv2 basically
- 00:31:19does an awesome job uh and
- 00:31:21essentially uh is used to pre-train a model
- 00:31:24which uh does very awesome uh visually
- 00:31:26rich document processing
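
The difference between hiding lines (for text-image alignment) and masking tokens (for masked visual-language modeling) can be made concrete with a small sketch. The helper names, the `[MASK]` string and the masking probability are all assumptions chosen for illustration; this shows only how the training labels could be built, not the model.

```python
import random

def text_image_alignment_labels(token_line_ids, covered_lines):
    """Label each text token 1 if its line was covered in the image, else 0."""
    return [1 if line_id in covered_lines else 0 for line_id in token_line_ids]

def mask_text_tokens(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Randomly replace tokens with a mask symbol; targets hold the original tokens."""
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(token)      # the model must predict this token
        else:
            masked.append(token)
            targets.append(None)       # no prediction needed at this position
    return masked, targets

# Tokens come from OCR on the original (uncovered) page, with a line id each.
tokens = ["Invoice", "No", "12345", "Total", "USD", "80"]
token_line_ids = [0, 0, 0, 1, 1, 1]
covered_lines = {1}                    # whole lines are covered, never partial ones

tia_labels = text_image_alignment_labels(token_line_ids, covered_lines)
masked_tokens, mlm_targets = mask_text_tokens(tokens, mask_prob=0.3, rng=random.Random(0))
print(tia_labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the two label sets are independent: a token on a covered line may or may not be masked, which is why the lecture stresses that hiding and masking are different operations.
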
- 00:31:28okay so so far I've talked about text
- 00:31:30and images now let me quickly talk about
- 00:31:32video tasks um uh multimodality could
- 00:31:35also mean you know doing things about
- 00:31:37video and text so for example text video
- 00:31:40retrieval given a text and a collection
- 00:31:42of videos find the relevant ones now
- 00:31:44this requires a text embedding and a
- 00:31:45video embedding both of them together in
- 00:31:47the same space okay multiple choice video
- 00:31:50question answering so again given a
- 00:31:51video and a question and multiple
- 00:31:52candidate answers you want to choose
- 00:31:54which is the best one uh you know it's
- 00:31:55analogous to image question uh visual
- 00:31:58question answering which typically just
- 00:31:59relates with an image in that senses
- 00:32:01okay and then you could also have other
- 00:32:03kinds of tasks like action segmentation
- 00:32:04action step localization and so on uh
- 00:32:07where you basically have an action which
- 00:32:09is described in text and then you have a
- 00:32:10video and you want to figure out where
- 00:32:12the action is in that sense also called
- 00:32:14moment detection in that sense okay
- 00:32:17okay so um how do you do this now the
- 00:32:21ideas are pretty similar uh you you see
- 00:32:23I mean if you really think about it what
- 00:32:25is a video video is a sequence of image
- 00:32:27frames okay so in some ways if you
- 00:32:29basically uh are thinking about an image
- 00:32:32as a 3D bitmap where an image has a
- 00:32:35height and a width and basically the
- 00:32:37depth is basically just three
- 00:32:39because you have to incorporate three
- 00:32:40channels RGB red green and blue right a
- 00:32:43video with 100 frames can be thought of again
- 00:32:46as a 3D uh bitmap or a 3D uh cube where
- 00:32:50you have of course the height and the
- 00:32:51width but you also have a depth which is
- 00:32:53basically 3 * 100 if there are 100
- 00:32:55frames in the video okay
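
As a small sketch of the "video as a cube" picture above, the snippet below stacks the RGB channels of 100 frames along the depth axis. The frame size (32 by 32) and the random pixel values are assumptions used only to keep the example light.

```python
import numpy as np

num_frames, height, width, channels = 100, 32, 32, 3
rng = np.random.default_rng(0)

# A video as a sequence of RGB frames: (frames, height, width, channels).
frames = rng.integers(0, 256, size=(num_frames, height, width, channels), dtype=np.uint8)

# Move time next to the channel axis, then merge the two into one depth axis.
cube = frames.transpose(1, 2, 0, 3).reshape(height, width, num_frames * channels)
print(cube.shape)  # (32, 32, 300)

# Depth slice t*3 + c holds channel c of frame t.
t, c = 7, 2
same = np.array_equal(cube[:, :, t * channels + c], frames[t, :, :, c])
print(same)  # True
```

In practice 3D convolutional networks usually keep time as its own axis rather than merging it with the colour channels; the merged layout here simply mirrors the 3 * 100 depth described in the lecture.
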
- 00:32:57so you could really think about a
- 00:32:59video as um a three-dimensional cube in
- 00:33:02that sense right and in that sense you
- 00:33:04could basically then use your 3D
- 00:33:07convolutional neural networks to encode
- 00:33:09this video or you could actually also
- 00:33:11use latest advances you know uh in
- 00:33:14Transformer models so as to be able to encode
- 00:33:15this video of course you know um I mean
- 00:33:19um so as I mentioned video is basically
- 00:33:21a whole bunch of image frames but there
- 00:33:23is also a sequence to them and 3D CNNs help
- 00:33:25you sort of uh Ensure that that sequence
- 00:33:28is also respected when you're trying to
- 00:33:30encode the video right but again this
- 00:33:31session is not on video encoding so I'm
- 00:33:33not really going to go into deep details
- 00:33:34about how do you encode videos there's
- 00:33:36so many interesting models you know all
- 00:33:38the way starting from um you know very
- 00:33:40old models like I3D uh inflated uh 3D
- 00:33:43models and 3D ConvNets and so on to the
- 00:33:46more recent ones but the idea is let's
- 00:33:48say that you have a video encoder right
- 00:33:50and you have a text encoder so and you
- 00:33:52could basically do the same contrastive
- 00:33:54loss kind of training the same you
- 00:33:56know um uh the typical popular uh
- 00:34:01noise contrastive estimation loss
- 00:34:02basically can be used also for doing
- 00:34:05something called VideoCLIP okay just
- 00:34:07like you have CLIP for image and
- 00:34:09text pairs you could also have a
- 00:34:10VideoCLIP which basically tries to do contrastive
- 00:34:11learning uh with
- 00:34:14video and text pairs okay now the question
- 00:34:17is where do you get these video and
- 00:34:18text pairs right so you could of course
- 00:34:20basically make use of transcripts so you
- 00:34:21have video and the visual information
- 00:34:23and you have transcript and you could
- 00:34:24make use of the two to align them but
- 00:34:27one has to be a little cautious
- 00:34:28about this because you know typically if
- 00:34:30uh uh let's say even in this lecture
- 00:34:32video I started off saying that hey in
- 00:34:34this video I would talk about uh you know
- 00:34:36multimodal models but when I talked
- 00:34:39about that visually on the slide you
- 00:34:41couldn't see any multimodal model for that
- 00:34:43matter right similarly if I'm
- 00:34:45making a recipe video I'm going to say
- 00:34:47that hey I'm going to basically teach
- 00:34:48you how to cook chole bhature right and at that
- 00:34:51time on the slide there's no chole bhature at all
- 00:34:54right I mean the chole bhature come much later okay
- 00:34:57or essentially you know um the idea is that
- 00:35:00the speech and what you see on the
- 00:35:03video may not be completely aligned
- 00:35:04always and therefore you have to be a
- 00:35:06little cautious about how you align
- 00:35:08and how you get those positive pairs
- 00:35:10versus the negative pairs but otherwise
- 00:35:12more or less the contrastive estimation
- 00:35:13contrastive loss and so on work the same
- 00:35:15way uh and in fact in their particular
- 00:35:18case in VideoCLIP they basically use
- 00:35:20the same the six layer uh I mean they
- 00:35:22use BERT-base-uncased for both
- 00:35:24the video and text they just use the
- 00:35:25Transformer model for encoding the video
- 00:35:27as well okay um so uh I mean and the
- 00:35:31way they did that was to basically use a
- 00:35:33frozen pre-trained CNN so as to
- 00:35:35essentially encode the image frames and
- 00:35:36then they projected those video tokens
- 00:35:37to the size
- 00:35:40and the dimensionality and the
- 00:35:42space that BERT-base desires by
- 00:35:44training an MLP
- 00:35:46projection layer okay uh so that's that
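
The projection step just described can be sketched as a small MLP that maps frozen CNN features into the hidden size expected by BERT-base (768). Everything else here is an assumption for illustration: the CNN feature size of 512, the single hidden layer with ReLU, the random weights, and the names `init_mlp` and `project`.

```python
import numpy as np

CNN_DIM = 512    # assumed size of the frozen CNN's per-segment feature vector
BERT_DIM = 768   # hidden size of BERT-base

def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialise a two-layer MLP (the only trainable part in this sketch)."""
    return {
        "w1": rng.normal(scale=0.02, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=0.02, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def project(frame_features, params):
    """Map CNN features of shape (num_tokens, CNN_DIM) to (num_tokens, BERT_DIM)."""
    hidden = np.maximum(0.0, frame_features @ params["w1"] + params["b1"])  # ReLU
    return hidden @ params["w2"] + params["b2"]

rng = np.random.default_rng(0)
params = init_mlp(CNN_DIM, BERT_DIM, BERT_DIM, rng)

# Stand-in for the output of a frozen, pre-trained CNN on 8 video segments.
frame_features = rng.normal(size=(8, CNN_DIM))
video_tokens = project(frame_features, params)
print(video_tokens.shape)  # (8, 768)
```

The projected rows can then be fed to the Transformer in place of word embeddings, so the same architecture handles both video and text.
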
- 00:35:49they pre-train on the HowTo100M data
- 00:35:51set and that's basically um how they
- 00:35:54pre-train VideoCLIP
- 00:35:55next uh or you know almost um
- 00:35:58sort of towards the uh towards uh sort
- 00:36:02of trying to uh you know uh moving
- 00:36:05towards uh more and more modalities let me
- 00:36:07talk about this ImageBind model okay uh
- 00:36:11so far we have talked about multiple
- 00:36:13models uh I started off with simple
- 00:36:15Vision Transformers right and I
- 00:36:17basically said hey well they can be used
- 00:36:19for encoding images then I talked about
- 00:36:21ViLBERT and
- 00:36:23VisualBERT and I basically said well
- 00:36:25they can be used for um for encoding uh
- 00:36:28uh you know multimodal tasks involving
- 00:36:30images and text and then we of course
- 00:36:31also talked about CLIP in the same
- 00:36:34theme okay then I talked about
- 00:36:36VideoCLIP and basically I said well you could
- 00:36:37extend this to two modalities not image
- 00:36:39and text this time but video and text
- 00:36:41okay now the obvious question is that hey
- 00:36:44can I include more modalities and there
- 00:36:46are so many tasks with multiple
- 00:36:47modalities okay so for example here are
- 00:36:50various modalities and here
- 00:36:53is a model called ImageBind which
- 00:36:54basically tries to extend this kind of a
- 00:36:56thing to 6 different
- 00:37:00modalities uh these ones are images uh
- 00:37:03text audio depth um so this is the depth
- 00:37:07image right uh uh thermal and inertial
- 00:37:09measurement unit okay images text audio
- 00:37:12are obvious what is the depth image
- 00:37:14depth image basically tells you how far
- 00:37:16away from the uh camera each pixel is
- 00:37:19okay so essentially white tells you that
- 00:37:20it is very close to the camera but black
- 00:37:22tells you that it is very far away from
- 00:37:24the camera okay that's a depth image you
- 00:37:26could also try to bring in in a modality
- 00:37:28called thermal now thermal modality you
- 00:37:30know you might have heard about Flur
- 00:37:31images Flur images so they basically
- 00:37:33used a lot in you know um infrared
- 00:37:36Imaging for for for electrical circuits
- 00:37:38you know you want to figure out is there
- 00:37:39a fault or not right so FL images
- 00:37:42basically make use of these infrared
- 00:37:44cameras so as to um also record
- 00:37:46temperature at every pixel in some ways
- 00:37:48so that's why thermal images in that
- 00:37:50sense right you could also have IMU
- 00:37:52inertial measurement unit kind of data I
- 00:37:54mean this is more like time series data
- 00:37:56sensor data in that sense is which you
- 00:37:57could get let's say if you're trying to
- 00:37:59build a driverless car application you
- 00:38:01might not just want to use the uh you
- 00:38:03know um uh the the input from the camera
- 00:38:07but you might also want to use several
- 00:38:09sensors data coming from several sensors
- 00:38:12inside the car right so to be able to
- 00:38:14make a decision for example whether to
- 00:38:15press a brake or not right so that's
- 00:38:18basically multiple modalities of data uh
- 00:38:20and in several applications
- 00:38:23you may not require processing all the
- 00:38:25modalities but some of those modalities
- 00:38:27become important okay um and what
- 00:38:31ImageBind so therefore the idea is
- 00:38:33that it is a great idea to basically
- 00:38:35learn a model which can process all the
- 00:38:37modalities uh right uh and you know here
- 00:38:39is an inspiring statement why this could
- 00:38:41be useful um and uh how to do this right
- 00:38:45so of course many many applications
- 00:38:47require a combination of these
- 00:38:48modalities the challenge though is that
- 00:38:50there is no data set across all of these
- 00:38:52modalities right so although I might
- 00:38:54want to basically uh compare or build
- 00:38:57an application which basically you
- 00:38:59know uses thermal images along with the
- 00:39:02sensor data unfortunately I might not
- 00:39:04have aligned data there okay but what is
- 00:39:06really interesting is that image binds
- 00:39:08it all okay an image of a beach can
- 00:39:10actually remind us of the sound of the
- 00:39:12Waves audio right the texture of the
- 00:39:14sand right uh a breeze so or even
- 00:39:18inspire a poem you know text and so on
- 00:39:20so you see different modalities can be
- 00:39:21all linked to images in some ways okay
- 00:39:24uh so that is also shown
- 00:39:27here so if you basically just look at
- 00:39:30the image if you basically can just get
- 00:39:32the uh image text image depth image heat
- 00:39:35map image audio image IMU pairs you know
- 00:39:38maybe what you can do is to solve this
- 00:39:40problem of requiring you know
- 00:39:43pairwise data across all possible
- 00:39:44modalities right so if there are
- 00:39:46six different modalities as
- 00:39:48you see you would require 6C2 that is 15 different
- 00:39:50types of paired data you know
- 00:39:52image text data image audio data text
- 00:39:54audio data text IMU data and so on but
- 00:39:56maybe if you basically just go via
- 00:39:58images uh you could basically solve this
- 00:40:00problem and that's what ImageBind banks
- 00:40:01on so they basically make use of a whole
- 00:40:03bunch of image data combined uh with
- 00:40:06another modality data so as to be able
- 00:40:08to train a really really awesome
- 00:40:09multimodal model and that multimodal
- 00:40:12model now helps them to do not just
- 00:40:15multimodal understanding but also helps
- 00:40:17them to do multimodal generation yeah
- 00:40:19now multimodal generation is of course a
- 00:40:20topic for another lecture and we'll talk
- 00:40:22about that later uh but uh you know
- 00:40:25here are some examples so cross
- 00:40:27modal retrieval so essentially if you
- 00:40:30basically just pass on this audio uh
- 00:40:32which is essentially of a crackle of a
- 00:40:33fire you can basically try to retrieve
- 00:40:36these kinds of images or videos which
- 00:40:38actually show the crackle of a fire or
- 00:40:40also retrieve these uh uh depth images
- 00:40:43which basically relate with you know
- 00:40:45fireplace and so on as you can uh as you
- 00:40:47can U understand from the image right
- 00:40:50you can also retrieve you know these
- 00:40:51text pieces which are all talking about
- 00:40:53fire fire crackles while pan and
- 00:40:55remember this is not basically because
- 00:40:57the text called crackle of a fire was
- 00:40:59used to search but because the audio
- 00:41:01relating um you know audio which
- 00:41:04basically just uh is the sound of the
- 00:41:06crackle of a fire basically helps you
- 00:41:08retrieve the text okay this is not
- 00:41:10speech to text by the way so in the
- 00:41:12audio crackle of a fire those words were
- 00:41:13not spoken uh all that was there in the
- 00:41:16audio is the sound of the fire in that
- 00:41:18senses okay so of course it can also be
- 00:41:20used for doing embedding space
- 00:41:22arithmetic so you could basically have
- 00:41:24this image of a crane uh or a
- 00:41:27bird right and you know you can have the
- 00:41:29sound of waves right and then it can
- 00:41:31actually generate images which you
- 00:41:32know basically show the same bird in
- 00:41:35the sea or on the shore and so on
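
Embedding space arithmetic can be illustrated with a toy example. Note the substitution: instead of generating an image, this sketch retrieves the nearest item from a small gallery, which is simpler to show. The three-dimensional vectors are hand-made stand-ins for embeddings in a shared space (axes loosely meaning "bird", "sea", "forest"), and `normalize` and `retrieve` are hypothetical helper names.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query, gallery, names, k=1):
    """Return the names of the k gallery items most similar to the query."""
    sims = normalize(gallery) @ normalize(query)
    order = np.argsort(-sims)[:k]
    return [names[i] for i in order]

# Toy embeddings that are assumed to live in one shared space.
bird_image = np.array([1.0, 0.0, 0.1])    # image of a bird
waves_audio = np.array([0.0, 1.0, 0.0])   # sound of waves

gallery_names = ["bird in forest", "bird on the shore", "empty beach"]
gallery = np.array([
    [0.9, 0.0, 0.9],
    [0.8, 0.8, 0.0],
    [0.0, 1.0, 0.0],
])

# Adding the two normalized embeddings composes "bird" with "sea".
query = normalize(bird_image) + normalize(waves_audio)
print(retrieve(query, gallery, gallery_names))  # ['bird on the shore']
```
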
- 00:41:38okay you could also use this kind of a
- 00:41:39model for audio to image generation so
- 00:41:41given an audio you know can you generate
- 00:41:43an image barking audio generate a dog or
- 00:41:46you know train audio generate a train
- 00:41:49image and so on so forth yeah so how is
- 00:41:52the ImageBind model trained
- 00:41:54well as I mentioned basically the
- 00:41:56model is basically trained by using uh
- 00:41:59uh several kinds of data sets also
- 00:42:01mentioned here uh which basically relate
- 00:42:04visual modality with everything else so
- 00:42:07with other modalities for example video
- 00:42:09and audio from a data set called
- 00:42:11AudioSet uh image depth kind of relationship
- 00:42:14from some other data set image thermal
- 00:42:16from another data set video IMU data
- 00:42:18image text data and so on so forth okay
- 00:42:20uh the model is basically uh of course
- 00:42:23you know it uses large deep learning
- 00:42:25deep neural networks to encode uh the
- 00:42:28image and also any other modality so
- 00:42:31encode the image and any other modality
- 00:42:32M uh and it uses the same InfoNCE um
- 00:42:36contrastive loss so I mean
- 00:42:38essentially just like VideoCLIP or CLIP
- 00:42:40they basically use a symmetric version
- 00:42:42of the InfoNCE loss the noise
- 00:42:43contrastive estimation right where you
- 00:42:46know uh you take both the uh you know
- 00:42:49um image versus the other modality and
- 00:42:52the other modality versus the
- 00:42:53image and so on right so the loss is
- 00:42:56particularly given as follows I mean
- 00:42:57that's basically uh ensuring that the
- 00:43:00positive pairs essentially have a higher
- 00:43:01similarity compared to negative pairs
- 00:43:03which you which you see I mean of course
- 00:43:04in the denominator you have both right
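
A minimal NumPy sketch of a symmetric InfoNCE loss over a batch of aligned pairs is given below. It follows the description above (image versus the other modality, plus the other modality versus the image), but the temperature value of 0.07, the batch of random embeddings and the function names are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, other_emb, temperature=0.07):
    """Row i of image_emb and row i of other_emb are assumed to be a positive pair.

    All other rows in the batch act as negatives, so the denominator of the
    softmax contains both the positive and the negative similarities.
    """
    q = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    k = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    loss_image_to_other = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_other_to_image = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return loss_image_to_other + loss_other_to_image

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 16))

aligned_loss = symmetric_info_nce(image_emb, image_emb)           # perfect pairs
shuffled_loss = symmetric_info_nce(image_emb, image_emb[::-1])    # every pair mismatched
print(aligned_loss < shuffled_loss)  # True
```

Minimising this loss pushes each positive pair's similarity above the similarities with the negatives in the same batch, in both directions.
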
- 00:43:07so that's that now um yeah so so
- 00:43:10essentially for ImageBind
- 00:43:12essentially for the image
- 00:43:14encoder they used the ViT-Huge model
- 00:43:16630 million parameters and for the text
- 00:43:18encoder they basically used a 302
- 00:43:20million parameter model from
- 00:43:22OpenCLIP and um I think for training the uh
- 00:43:27as far as I remember they froze the text
- 00:43:29encoder part but then they trained uh
- 00:43:32the other modalities in that senses um
- 00:43:35that's that so they use the same encoder
- 00:43:36for images plus videos basically where
- 00:43:38videos are just treated as multi frame
- 00:43:41images right um that's that okay so more
- 00:43:46or less this is what I had for you uh
- 00:43:48for this uh for this session uh quickly
- 00:43:52summarizing in this session I talked
- 00:43:54about a whole bunch of
- 00:43:57models I first motivated essentially why
- 00:44:00multimodal modeling is important by um
- 00:44:03you know talking about various vision
- 00:44:04and language tasks like visual question
- 00:44:05answering visual commonsense reasoning
- 00:44:07referring expressions caption based
- 00:44:09image retrieval um you know multimodal
- 00:44:11hate speech detection multimodal fake
- 00:44:13news detection and so on then we talked
- 00:44:15about Vision Transformers which is
- 00:44:16basically um a model that uses Transformers
- 00:44:19to encode images then we talked about
- 00:44:21three different models for encoding
- 00:44:23images uh namely VisualBERT ViLBERT um
- 00:44:27you know and CLIP right uh where
- 00:44:31uh in that order they have been
- 00:44:33pre-trained using larger and larger
- 00:44:35sets of image text pairs right uh lastly or rather
- 00:44:39next I talked about um about visually
- 00:44:42rich document understanding using uh
- 00:44:44this particular model called
- 00:44:45LayoutLM um LayoutLMv2 particularly uh right
- 00:44:50and then uh next we talked about uh
- 00:44:52video tasks uh extending multimodality
- 00:44:55to video and text combinations right and
- 00:44:58then therefore we talked about
- 00:45:00VideoCLIP an obvious extension of the
- 00:45:02standard CLIP model um uh lastly I
- 00:45:06talked about ImageBind hey by the way
- 00:45:08you know ImageBind of course deals with
- 00:45:10six different modalities but there are
- 00:45:11other models uh which can actually deal
- 00:45:14with more modalities for example you
- 00:45:16know I think if I remember correctly
- 00:45:18there's a model of course there's a
- 00:45:19model called TextBind by the way right
- 00:45:21there's also a model called
- 00:45:23Meta-Transformer uh you know uh there is
- 00:45:25also yet another model called
- 00:45:27Composable Diffusion right so if you try
- 00:45:29to search for these models you'll
- 00:45:31basically see that people have tried to
- 00:45:32extend this to many more modalities not
- 00:45:34just six people have tried I think 10
- 00:45:36plus modalities in those in those papers
- 00:45:38okay so hopefully you know this session
- 00:45:40motivates you um to uh to to read up
- 00:45:44more of those papers in multimodal
- 00:45:46modeling and um uh you know and
- 00:45:49potentially also do research in this
- 00:45:51area right uh I'm of course always
- 00:45:53excited to uh work in this area um uh um
- 00:45:57so you know if you want to do more
- 00:45:59research in this area feel free to reach
- 00:46:00out to me uh on these
- 00:46:02coordinates uh you would also find by
- 00:46:05the way on my YouTube channel a whole
- 00:46:06bunch of
- 00:46:07videos um around
- 00:46:09multimodality um so feel free to check
- 00:46:12them out as well okay thanks so much and
- 00:46:15uh happy to take questions
