First Look: Llama 3.2 Vision Instruct 11B and 90B Vision Language Models (VLM) in IBM WatsonX.ai

00:08:04
https://www.youtube.com/watch?v=4pBGuitgnXY

Summary

TLDR: The video explores Meta's newly released Llama 3.2 family, whose 11 billion and 90 billion parameter versions are multimodal, accepting both text and image inputs. Both vision models were tested on IBM watsonx.ai, a hosted platform whose data handling is governed by IBM's data privacy policies. The 90 billion parameter model demands significant hardware, roughly 64 gigabytes of video RAM, while the 11 billion parameter model is small enough to run on a personal machine. The tests included describing images such as a performance benchmark table and an unusual photo of a dog in a hard hat driving a forklift, as well as extracting tabular data from a screenshot without OCR. Running such models locally can cut the API costs of building image-aware applications and keeps data private, which is particularly valuable for privacy-sensitive industries like healthcare. The video emphasizes the growing accessibility and practical benefits of these models for personal and professional use.

Takeaways

  • Meta released the Llama 3.2 model with both text-only and multimodal functionalities.
  • Multimodal models can process both text and images, improving data interaction.
  • The large 90 billion parameter model requires significant computational resources.
  • Using IBM watsonx.ai ensures secure and private data handling.
  • 11 billion parameter models are manageable and can run on personal computers.
  • Data extraction from images without OCR was successful, reducing operational costs.
  • Image recognition models can work locally, aiding industries like healthcare.
  • Running models locally saves on operational costs and improves data privacy.
  • The 90 billion model's refusal hints at built-in content safeguards.
  • The video highlights the growing ease and utility of using AI models.

Timeline

  • 00:00:00 - 00:08:04

    Meta has released Llama 3.2, adding multimodal capability to its 11 billion and 90 billion parameter models, which can interpret but not generate images; the smaller 1 billion and 3 billion parameter versions remain text-only. The demonstration uses IBM watsonx.ai for its safety and speed, starting with the Llama 3.2 90 billion parameter Vision Instruct model, which requires substantial video RAM. That model accurately described a performance benchmark image but refused to process an unconventional image of a dog driving a forklift. The 11 billion parameter model, by contrast, handled the odd image well despite far lower hardware requirements, suggesting the larger model's content safeguards still need refinement.
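
The demonstration is driven entirely through the watsonx.ai Prompt Lab UI. For anyone who would rather script the same image-plus-prompt test, below is a minimal sketch using the ibm-watsonx-ai Python SDK; it is not the presenter's code: the model_id string, the chat() message schema, and the response shape are assumptions to check against the current watsonx.ai documentation, and the credentials, project ID, and file name are placeholders.

```python
# Minimal sketch (not shown in the video): send an image plus a prompt to
# Llama 3.2 90B Vision Instruct hosted on IBM watsonx.ai.
# Assumes the ibm-watsonx-ai SDK's OpenAI-style chat interface; verify the
# model_id and message schema against the current watsonx.ai docs.
import base64

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",   # your watsonx.ai region endpoint
    api_key="YOUR_IBM_CLOUD_API_KEY",
)

model = ModelInference(
    model_id="meta-llama/llama-3-2-90b-vision-instruct",  # assumed catalog id
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

# Encode the benchmark-table screenshot as a data URL for the chat message.
with open("llama2_benchmark_table.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this screenshot? Describe it."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }
]

response = model.chat(messages=messages)  # assumed to return an OpenAI-style dict
print(response["choices"][0]["message"]["content"])
```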

Video Q&A

  • What is the Llama 3.2 model?

    Llama 3.2 is a new model family released by Meta; its 11 billion and 90 billion parameter versions are multimodal, supporting both text and image inputs.

  • What platforms can the Llama models be used on?

    The Llama models were tested on IBM watsonx.ai, a hosted, governed platform that is a fast way to get up and running with AI models.

  • What sizes do the Llama 3.2 models come in?

    The Llama 3.2 models come in versions with 1 billion, 3 billion, 11 billion, and 90 billion parameters.

  • How much video RAM is needed to run the 90 billion parameter model?

    Running the 90 billion parameter model requires around 64 gigabytes of video RAM (a back-of-the-envelope estimate appears after this Q&A list).

  • Which Llama model can I run on a personal computer?

    The 11 billion parameter Llama model needs only about 7 gigabytes of video RAM, so it is manageable on a typical personal computer (a local-run sketch follows this Q&A list).

  • What type of data can these models process?

    They can process text inputs and interpret images for data extraction and descriptions.

  • Why is it advantageous to run a model locally?

    Running a model locally uses far less electricity than a large data center and keeps sensitive data on your own machine.

  • Why did the 90 billion parameter model refuse to interpret an image?

    The refusal appears to come from built-in content safeguards, especially since the smaller 11 billion parameter model described the same image without issue.

  • How does image recognition benefit from these models?

    These models can handle image recognition tasks without needing third-party services, reducing costs and privacy concerns.

  • Why would privacy-sensitive industries, like healthcare, benefit from these models?

    These models allow for secure, local processing of sensitive images, keeping data private and reducing risk.
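
The roughly 64 GB figure quoted for the 90 billion parameter model, and the roughly 7 GB mentioned in the video for the 11 billion parameter one, are consistent with simple back-of-the-envelope arithmetic: parameter count times bytes per parameter, plus some headroom for activations and the KV cache. The estimate below assumes particular quantization levels and a flat 20% overhead; these are illustrative assumptions, not official requirements.

```python
# Back-of-the-envelope VRAM estimates (illustrative, not vendor figures):
# weights only, at a few common precisions, plus a rough 20% allowance
# for activations and the KV cache.
def estimate_vram_gb(params_billions: float, bytes_per_param: float, overhead: float = 0.2) -> float:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes, expressed in GB
    return weights_gb * (1 + overhead)

for name, params in [("Llama 3.2 11B Vision", 11), ("Llama 3.2 90B Vision", 90)]:
    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} @ {precision}: ~{estimate_vram_gb(params, bytes_per_param):.0f} GB")

# Under these assumptions the 90B model lands near 108 GB at int8 and 54 GB at int4,
# bracketing the ~64 GB quoted in the video, while 11B at int4 comes out around 7 GB,
# matching the figure given for the smaller model.
```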
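
The video's closing point is that the 11 billion parameter model is small enough to run on a personal machine, so tasks like the table extraction cost only electricity. The presenter stays on watsonx.ai throughout; as one hypothetical local setup (an assumption, not what the video uses), a tool such as Ollama can serve Llama 3.2 Vision on localhost, and the sketch below replays the video's extraction prompt against it. The model tag, Python package usage, and file name should be checked against Ollama's documentation.

```python
# Hypothetical local setup (not shown in the video): run the table-extraction
# test against Llama 3.2 11B Vision served locally by Ollama.
# Requires `ollama pull llama3.2-vision` and the `ollama` Python package.
import ollama

prompt = (
    "From this image of a table, extract all the data and return it "
    "in a pipe-delimited Markdown table format."
)

response = ollama.chat(
    model="llama3.2-vision",  # the 11B variant is the default tag
    messages=[{
        "role": "user",
        "content": prompt,
        "images": ["llama2_benchmark_table.png"],  # path to the screenshot
    }],
)

print(response["message"]["content"])
```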

Subtitles
  • 00:00:00
    okay I am on the road today but the new
  • 00:00:03
    llama 3.2 model just dropped this is
  • 00:00:06
    from meta this is a multimodal model so
  • 00:00:10
    the 1 and 3 billion parameter versions
  • 00:00:12
    are text only like most generative AI
  • 00:00:14
    most language models the 11 billion and
  • 00:00:17
    90 billion parameters are multimodal
  • 00:00:19
    which means you can put text or images
  • 00:00:21
    in and have it return out um text it
  • 00:00:25
    can't create images it can only
  • 00:00:27
    interpret them so let's go ahead and
  • 00:00:29
    give these two models a test I'm going
  • 00:00:31
    to go ahead and switch to there we go uh
  • 00:00:35
    switch to IBM Watson X we're using IBM
  • 00:00:37
    Watson X because it's a hosted platform
  • 00:00:39
    it's safe it's secure uh your data is
  • 00:00:41
    governed by IBM's uh data privacy
  • 00:00:45
    policies and uh frankly it's the fastest
  • 00:00:48
    way to get up and running so let's go to
  • 00:00:49
    our foundation models page and we're
  • 00:00:51
    going to select the model we want to use
  • 00:00:53
    we want to start with llama 3.2 90
  • 00:00:56
    billion parameter Vision Instruct now
  • 00:00:58
    this is a big model if you were to run
  • 00:01:00
    this yourself let's go ahead and and hit
  • 00:01:01
    okay if you were to run this yourself
  • 00:01:03
    you would need uh a computer with like
  • 00:01:05
    64 gig of video RAM which is a lot uh
  • 00:01:09
    let's go ahead and take a look at the
  • 00:01:10
    system prompt uh
  • 00:01:12
    pretty pretty stock here um I'm gonna just
  • 00:01:17
    I'll leave this in in place now one of
  • 00:01:19
    the things I like about IBM Watson X is
  • 00:01:21
    you can change the system prompt let's
  • 00:01:23
    go ahead and upload a document let's add
  • 00:01:27
    an image and let's find an image all
  • 00:01:30
    right let's go ahead and let's start
  • 00:01:31
    with this image so this image is a
  • 00:01:34
    benchmark table this is ironically of
  • 00:01:35
    the Llama uh 2 performance benchmarks
  • 00:01:39
    we're going to go ahead and load this in
  • 00:01:41
    and here we go the image loaded we're
  • 00:01:43
    going to hit add and now we're going to
  • 00:01:45
    ask it the model to say what do you see
  • 00:01:48
    in this
  • 00:01:50
    screenshot describe it this image
  • 00:01:52
    represents a comprehensive benchmark table
  • 00:01:54
    titled Benchmark higher is better
  • 00:01:56
    evaluating different performances good
  • 00:01:58
    so it can clearly
  • 00:02:00
    uh see what's in the image let's give it
  • 00:02:02
    another image let's clear the chat here
  • 00:02:07
    and now let's put in a ridiculous
  • 00:02:11
    image this is a pitbull driving a
  • 00:02:14
    forklift what do you see in this image
  • 00:02:18
    describe it I'm not comfortable
  • 00:02:19
    responding to this conversation subject
  • 00:02:22
    okay let's try that again let's clear
  • 00:02:25
    the chat and now let's remove this and
  • 00:02:29
    we're still
  • 00:02:30
    90b let's use our images let's put our
  • 00:02:35
    black
  • 00:02:37
    pitbull with a hard hat driving a yellow
  • 00:02:39
    forklift what do you see in this image
  • 00:02:44
    describe it interesting so the model is
  • 00:02:46
    just the model is inherently refusing to
  • 00:02:49
    answer this question okay fair enough
  • 00:02:51
    that will be fixed once the open models
  • 00:02:53
    Community removes things like that now
  • 00:02:56
    let's switch
  • 00:02:57
    models what we were using was
  • 00:03:00
    the 90 billion parameter model which is
  • 00:03:02
    a big model 60 gig of video RAM beefy
  • 00:03:06
    computers need to you need a beefy
  • 00:03:08
    computer to run that I'm going to run
  • 00:03:10
    the Llama 3.2 11 billion parameter model
  • 00:03:13
    this requires about 7 gigabytes of
  • 00:03:15
    video RAM which is much more manageable
  • 00:03:19
    um I expect the quality to be worse so
  • 00:03:22
    let's go ahead and start with our images
  • 00:03:25
    here let's browse actually I should be
  • 00:03:27
    able to find my previous images yep
  • 00:03:30
    there's my previous image that is the
  • 00:03:32
    MMLU table what do you see in this
  • 00:03:37
    screenshot describe it good that's
  • 00:03:42
    better than I expected it to
  • 00:03:44
    do let's try the pitbull again in the 11
  • 00:03:49
    billion parameter model what do you see
  • 00:03:51
    in this interesting so the 90 billion
  • 00:03:53
    parameter model saw something it didn't
  • 00:03:55
    like and refused
  • 00:03:57
    the 11 billion parameter model
  • 00:04:00
    recognized it okay it says a dog wearing
  • 00:04:02
    a yellow hard hat driving a forklift and
  • 00:04:04
    wear yeah I mean that's that's accurate
  • 00:04:06
    um let's try one more thing gonna let's
  • 00:04:10
    go start your chat we're going to select
  • 00:04:13
    we're going to go back to our table
  • 00:04:15
    let's see how good this model's Vision
  • 00:04:17
    actually is from this image of a table
  • 00:04:23
    extract all the data and return it in a
  • 00:04:27
    pipe delimited
  • 00:04:30
    markdown table
  • 00:04:33
    format let's see if we can do
  • 00:04:35
    this you reach your quote up okay that's
  • 00:04:38
    really interesting that the model was
  • 00:04:40
    able to the 11 billion parameter model
  • 00:04:43
    was able to accomplish the task and the
  • 00:04:45
    and the 90 billion parameter model did
  • 00:04:46
    not and that it can extract data from images
  • 00:04:49
    this is a big deal
  • 00:04:51
    because image
  • 00:04:53
    extraction getting data out of images
  • 00:04:56
    from a model that can recognize imagery
  • 00:04:57
    and not have to do OCR means that you
  • 00:04:59
    can extract data from images in a local
  • 00:05:03
    open
  • 00:05:05
    model when you use something like Google
  • 00:05:07
    Gemini or chat GPT or whatever uh
  • 00:05:10
    through their apis if you were to build
  • 00:05:11
    this into an app this would cost you a
  • 00:05:13
    decent amount of money every time you
  • 00:05:15
    you did it and if you're building an app
  • 00:05:17
    you're talking you know hundreds of
  • 00:05:18
    thousands of dollars not millions of
  • 00:05:20
    dollars possibly per day of users trying
  • 00:05:24
    to to to use your app these models these
  • 00:05:28
    Vision language models
  • 00:05:30
    allow you
  • 00:05:32
    to essentially do the same thing for the
  • 00:05:34
    cost of electricity so in this case I'm
  • 00:05:36
    using IBM's data center the 11 billion
  • 00:05:38
    parameter model I can run that on my Mac
  • 00:05:41
    and so I don't have to to use a third
  • 00:05:45
    party data center I don't have to to do
  • 00:05:47
    anything crazy like that I can do it
  • 00:05:48
    straight from my machine that's a big
  • 00:05:51
    deal that is a big deal for a couple
  • 00:05:54
    reasons obviously one is the
  • 00:05:56
    sustainability angle you can run an 11
  • 00:05:58
    billion parameter model on your laptop and get
  • 00:06:00
    good results out of it as as we clearly
  • 00:06:02
    were um you don't have to use a big data
  • 00:06:05
    center which means you don't have to
  • 00:06:05
    burn nearly as much electricity on the
  • 00:06:08
    task and two with an image recognition
  • 00:06:10
    model there's so many
  • 00:06:12
    images you might not be comfortable
  • 00:06:15
    handing off to a third party you know uh
  • 00:06:18
    a really good example would be health
  • 00:06:19
    care imaging maybe you know images of
  • 00:06:21
    X-rays and scans and stuff that is very
  • 00:06:24
    clearly very very personal
  • 00:06:26
    data no matter what the the service
  • 00:06:30
    guarantee or the terms of service are on
  • 00:06:32
    a on a on a website or a service like
  • 00:06:35
    chat GPT or
  • 00:06:37
    whatever I wouldn't feel comfortable
  • 00:06:39
    putting those kinds of images in there
  • 00:06:41
    that is very personal very sensitive
  • 00:06:43
    data on a Model that I can run on my
  • 00:06:45
    laptop and know for sure my data is not
  • 00:06:48
    going anywhere that's a great
  • 00:06:50
    application for me my next steps are
  • 00:06:52
    going to be trying to get the 11 billion
  • 00:06:54
    parameter model working locally on my
  • 00:06:55
    machine and if I can I'm going to write
  • 00:06:58
    a piece of software and the piece of software I'm
  • 00:06:59
    going to write is it's going to take my
  • 00:07:03
    folder of poorly labeled
  • 00:07:05
    screenshots and look at them and then
  • 00:07:08
    rename the files like what's in the
  • 00:07:10
    screenshot like a black dog and a yellow
  • 00:07:12
    hard hat wearing a uh wearing a yellow
  • 00:07:15
    hard hat and driving a forklift
  • 00:07:16
    right that would be a useful thing or
  • 00:07:18
    that other one you know an MMLU Benchmark
  • 00:07:20
    table being able to rename files that's
  • 00:07:22
    a very pedestrian use of this technology
  • 00:07:24
    but again if I can get the 11 billion
  • 00:07:25
    parameter model working on my machine
  • 00:07:27
    then it's just the cost of electricity
  • 00:07:29
    and that my machine consumes way less
  • 00:07:31
    electricity than a big data
  • 00:07:32
    center so that's the new llama 3.2
  • 00:07:36
    models um very impressed I'm a bit
  • 00:07:40
    confused as to why the 90 billion
  • 00:07:42
    parameter model had a refusal and the 11
  • 00:07:44
    billion didn't for the same image I
  • 00:07:47
    think there's
  • 00:07:49
    some interesting
  • 00:07:50
    stuff going on um but this is a a
  • 00:07:53
    technology that a lot of people are
  • 00:07:55
    going to want to use and the fact that
  • 00:07:57
    you can run it for the cost of
  • 00:07:58
    electricity on your machine is is
  • 00:08:00
    amazing so that's all for the for this
  • 00:08:02
    time talk to you soon
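
The presenter's stated next step is a small program that looks through a folder of poorly labeled screenshots and renames each file after what it shows. No such code appears in the video; the sketch below is one hypothetical way to build it around a locally served Llama 3.2 11B Vision model via Ollama, with the folder path, prompt, and model tag all assumed for illustration.

```python
# Hypothetical sketch of the screenshot-renaming idea described at the end of
# the video: ask a locally served vision model to describe each image, then
# rename the file to a slugified version of that description.
# Assumes Ollama is running locally with the llama3.2-vision model pulled.
import re
from pathlib import Path

import ollama

SCREENSHOT_DIR = Path("~/Screenshots").expanduser()  # assumed folder of screenshots
PROMPT = "Describe this image in at most eight words, suitable for a filename."

def slugify(text: str) -> str:
    """Lowercase the description and keep only filename-safe characters."""
    text = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return text[:80] or "unlabeled"

for image_path in sorted(SCREENSHOT_DIR.glob("*.png")):
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": PROMPT, "images": [str(image_path)]}],
    )
    description = response["message"]["content"]
    new_path = image_path.with_name(f"{slugify(description)}{image_path.suffix}")
    if not new_path.exists():  # never overwrite an existing file
        image_path.rename(new_path)
        print(f"{image_path.name} -> {new_path.name}")
```
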
Tags
  • Llama 3.2
  • Meta
  • Multimodal Model
  • IBM watsonx.ai
  • Image Recognition
  • Data Privacy
  • AI Models
  • Text and Image Processing
  • Parameter Sizes
  • Hardware Requirements