First Look: Llama 3.2 Vision Instruct 11B and 90B Vision Language Models (VLM) in IBM WatsonX.ai

00:08:04
https://www.youtube.com/watch?v=4pBGuitgnXY

Summary

TLDR: The video explores Meta's newly released Llama 3.2 family, whose 11 billion and 90 billion parameter versions are multimodal, accepting both text and image inputs. Both vision models were tested on IBM watsonx.ai, a hosted platform whose data handling is governed by IBM's data privacy policies. The 90 billion parameter model demands significant hardware, roughly 64 gigabytes of video RAM, while the 11 billion parameter model is small enough to run on a personal machine. The tests included describing images such as a performance benchmark table and an unusual photo of a dog in a hard hat driving a forklift, as well as extracting tabular data from a screenshot without OCR. Running such models locally can cut the API costs of building image-aware applications and keeps data private, which is particularly valuable for privacy-sensitive industries like healthcare. The video emphasizes the growing accessibility and practical benefits of these models for personal and professional use.

Takeaways

  • Meta released the Llama 3.2 model with both text-only and multimodal functionalities.
  • Multimodal models can process both text and images, improving data interaction.
  • The large 90 billion parameter model requires significant computational resources.
  • Using IBM watsonx.ai ensures secure and private data handling.
  • 11 billion parameter models are manageable and can run on personal computers.
  • Data extraction from images without OCR was successful, reducing operational costs.
  • Image recognition models can work locally, aiding industries like healthcare.
  • Running models locally saves on operational costs and improves data privacy.
  • The 90 billion model's refusal hints at built-in content safeguards.
  • The video highlights the growing ease and utility of using AI models.

Timeline

  • 00:00:00 - 00:08:04

    Meta has released Llama 3.2, adding multimodal capability to its 11 billion and 90 billion parameter models, which can interpret but not generate images; the smaller 1 billion and 3 billion parameter versions remain text-only. The demonstration uses IBM watsonx.ai for its safety and speed, starting with the Llama 3.2 90 billion parameter Vision Instruct model, which requires substantial video RAM. That model accurately described a performance benchmark image but refused to process an unconventional image of a dog driving a forklift. The 11 billion parameter model, by contrast, handled the odd image well despite far lower hardware requirements, suggesting the larger model's content safeguards still need refinement.
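
The demonstration is driven entirely through the watsonx.ai Prompt Lab UI. For anyone who would rather script the same image-plus-prompt test, below is a minimal sketch using the ibm-watsonx-ai Python SDK; it is not the presenter's code: the model_id string, the chat() message schema, and the response shape are assumptions to check against the current watsonx.ai documentation, and the credentials, project ID, and file name are placeholders.

```python
# Minimal sketch (not shown in the video): send an image plus a prompt to
# Llama 3.2 90B Vision Instruct hosted on IBM watsonx.ai.
# Assumes the ibm-watsonx-ai SDK's OpenAI-style chat interface; verify the
# model_id and message schema against the current watsonx.ai docs.
import base64

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",   # your watsonx.ai region endpoint
    api_key="YOUR_IBM_CLOUD_API_KEY",
)

model = ModelInference(
    model_id="meta-llama/llama-3-2-90b-vision-instruct",  # assumed catalog id
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

# Encode the benchmark-table screenshot as a data URL for the chat message.
with open("llama2_benchmark_table.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this screenshot? Describe it."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }
]

response = model.chat(messages=messages)  # assumed to return an OpenAI-style dict
print(response["choices"][0]["message"]["content"])
```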

Video Q&A

  • What is the Llama 3.2 model?

    Llama 3.2 is a new model family released by Meta; its 11 billion and 90 billion parameter versions are multimodal, supporting both text and image inputs.

  • What platforms can the Llama models be used on?

    The Llama models were tested on IBM watsonx.ai, a hosted, governed platform that is a fast way to get up and running with AI models.

  • What sizes do the Llama 3.2 models come in?

    The Llama 3.2 models come in versions with 1 billion, 3 billion, 11 billion, and 90 billion parameters.

  • How much video RAM is needed to run the 90 billion parameter model?

    Running the 90 billion parameter model requires around 64 gigabytes of video RAM (a back-of-the-envelope estimate appears after this Q&A list).

  • Which Llama model can I run on a personal computer?

    The 11 billion parameter Llama model needs only about 7 gigabytes of video RAM, so it is manageable on a typical personal computer (a local-run sketch follows this Q&A list).

  • What type of data can these models process?

    They can process text inputs and interpret images for data extraction and descriptions.

  • Why is it advantageous to run a model locally?

    Running a model locally uses far less electricity than a large data center and keeps sensitive data on your own machine.

  • Why did the 90 billion parameter model refuse to interpret an image?

    The refusal appears to come from built-in content safeguards, especially since the smaller 11 billion parameter model described the same image without issue.

  • How does image recognition benefit from these models?

    These models can handle image recognition tasks without needing third-party services, reducing costs and privacy concerns.

  • Why would privacy-sensitive industries, like healthcare, benefit from these models?

    These models allow for secure, local processing of sensitive images, keeping data private and reducing risk.
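
The roughly 64 GB figure quoted for the 90 billion parameter model, and the roughly 7 GB mentioned in the video for the 11 billion parameter one, are consistent with simple back-of-the-envelope arithmetic: parameter count times bytes per parameter, plus some headroom for activations and the KV cache. The estimate below assumes particular quantization levels and a flat 20% overhead; these are illustrative assumptions, not official requirements.

```python
# Back-of-the-envelope VRAM estimates (illustrative, not vendor figures):
# weights only, at a few common precisions, plus a rough 20% allowance
# for activations and the KV cache.
def estimate_vram_gb(params_billions: float, bytes_per_param: float, overhead: float = 0.2) -> float:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes, expressed in GB
    return weights_gb * (1 + overhead)

for name, params in [("Llama 3.2 11B Vision", 11), ("Llama 3.2 90B Vision", 90)]:
    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} @ {precision}: ~{estimate_vram_gb(params, bytes_per_param):.0f} GB")

# Under these assumptions the 90B model lands near 108 GB at int8 and 54 GB at int4,
# bracketing the ~64 GB quoted in the video, while 11B at int4 comes out around 7 GB,
# matching the figure given for the smaller model.
```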
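
The video's closing point is that the 11 billion parameter model is small enough to run on a personal machine, so tasks like the table extraction cost only electricity. The presenter stays on watsonx.ai throughout; as one hypothetical local setup (an assumption, not what the video uses), a tool such as Ollama can serve Llama 3.2 Vision on localhost, and the sketch below replays the video's extraction prompt against it. The model tag, Python package usage, and file name should be checked against Ollama's documentation.

```python
# Hypothetical local setup (not shown in the video): run the table-extraction
# test against Llama 3.2 11B Vision served locally by Ollama.
# Requires `ollama pull llama3.2-vision` and the `ollama` Python package.
import ollama

prompt = (
    "From this image of a table, extract all the data and return it "
    "in a pipe-delimited Markdown table format."
)

response = ollama.chat(
    model="llama3.2-vision",  # the 11B variant is the default tag
    messages=[{
        "role": "user",
        "content": prompt,
        "images": ["llama2_benchmark_table.png"],  # path to the screenshot
    }],
)

print(response["message"]["content"])
```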

Subtitles
  • 00:00:00
    okay I am on the road today but the new
  • 00:00:03
    llama 3.2 model just dropped this is
  • 00:00:06
    from meta this is a multimodal model so
  • 00:00:10
    the 1 and 3 billion parameter versions
  • 00:00:12
    are text only like most generative AI
  • 00:00:14
    most language models the 11 billion and
  • 00:00:17
    90 billion parameters are multimodal
  • 00:00:19
    which means you can put text or images
  • 00:00:21
    in and have it return out um text it
  • 00:00:25
    can't create images it can only
  • 00:00:27
    interpret them so let's go ahead and
  • 00:00:29
    give these two models a test I'm going
  • 00:00:31
    to go ahead and switch to there we go uh
  • 00:00:35
    switch to IBM Watson X we're using IBM
  • 00:00:37
    Watson X because it's a hosted platform
  • 00:00:39
    it's safe it's secure uh your data is
  • 00:00:41
    governed by IBM's uh data privacy
  • 00:00:45
    policies and uh frankly it's the fastest
  • 00:00:48
    way to get up and running so let's go to
  • 00:00:49
    our foundation models page and we're
  • 00:00:51
    going to select the model we want to use
  • 00:00:53
    we want to start with llama 3.2 90
  • 00:00:56
    billion parameter Vision Instruct now
  • 00:00:58
    this is a big model if you were to run
  • 00:01:00
    this yourself let's go ahead and and hit
  • 00:01:01
    okay if you were to run this yourself
  • 00:01:03
    you would need uh a computer with like
  • 00:01:05
    64 gig of video RAM which is a lot uh
  • 00:01:09
    let's go ahead and take a look at the
  • 00:01:10
    system prompt uh
  • 00:01:12
    pretty pretty stock here um I'm gonna just
  • 00:01:17
    I'll leave this in in place now one of
  • 00:01:19
    the things I like about IBM Watson X is
  • 00:01:21
    you can change the system prompt let's
  • 00:01:23
    go ahead and upload a document let's add
  • 00:01:27
    an image and let's find an image all
  • 00:01:30
    right let's go ahead and let's start
  • 00:01:31
    with this image so this image is a
  • 00:01:34
    benchmark table this is ironically of
  • 00:01:35
    the Llama uh 2 performance benchmarks
  • 00:01:39
    we're going to go ahead and load this in
  • 00:01:41
    and here we go the image loaded we're
  • 00:01:43
    going to hit add and now we're going to
  • 00:01:45
    ask it the model to say what do you see
  • 00:01:48
    in this
  • 00:01:50
    screenshot describe it this image
  • 00:01:52
    represents a comprehensive benchmark table
  • 00:01:54
    titled Benchmark higher is better
  • 00:01:56
    evaluating different performances good
  • 00:01:58
    so it can clearly
  • 00:02:00
    uh see what's in the image let's give it
  • 00:02:02
    another image let's clear the chat here
  • 00:02:07
    and now let's put in a ridiculous
  • 00:02:11
    image this is a pitbull driving a
  • 00:02:14
    forklift what do you see in this image
  • 00:02:18
    describe it I'm not comfortable
  • 00:02:19
    responding to this conversation subject
  • 00:02:22
    okay let's try that again let's clear
  • 00:02:25
    the chat and now let's remove this and
  • 00:02:29
    we're still
  • 00:02:30
    90b let's use our images let's put our
  • 00:02:35
    black
  • 00:02:37
    pitbull with a hard hat driving a yellow
  • 00:02:39
    forklift what do you see in this image
  • 00:02:44
    describe it interesting so the model is
  • 00:02:46
    just the model is inherently refusing to
  • 00:02:49
    answer this question okay fair enough
  • 00:02:51
    that will be fixed once the open models
  • 00:02:53
    Community removes things like that now
  • 00:02:56
    let's switch
  • 00:02:57
    models what we were using was
  • 00:03:00
    the 90 billion parameter model which is
  • 00:03:02
    a big model 60 gig of video RAM beefy
  • 00:03:06
    computers need to you need a beefy
  • 00:03:08
    computer to run that I'm going to run
  • 00:03:10
    the Llama 3.2 11 billion parameter model
  • 00:03:13
    this requires about 7 gigabytes of
  • 00:03:15
    video RAM which is much more manageable
  • 00:03:19
    um I expect the quality to be worse so
  • 00:03:22
    let's go ahead and start with our images
  • 00:03:25
    here let's browse actually I should be
  • 00:03:27
    able to find my previous images yep
  • 00:03:30
    there's my previous image that is the
  • 00:03:32
    MMLU table what do you see in this
  • 00:03:37
    screenshot describe it good that's
  • 00:03:42
    better than I expected it to
  • 00:03:44
    do let's try the pitbull again in the 11
  • 00:03:49
    billion parameter model what do you see
  • 00:03:51
    in this interesting so the 90 billion
  • 00:03:53
    parameter model saw something it didn't
  • 00:03:55
    like and refused
  • 00:03:57
    the 11 billion parameter model
  • 00:04:00
    recognized it okay it says a dog wearing
  • 00:04:02
    a yellow hard hat driving a forklift and
  • 00:04:04
    wear yeah I mean that's that's accurate
  • 00:04:06
    um let's try one more thing gonna let's
  • 00:04:10
    go start your chat we're going to select
  • 00:04:13
    we're going to go back to our table
  • 00:04:15
    let's see how good this model's Vision
  • 00:04:17
    actually is from this image of a table
  • 00:04:23
    extract all the data and return it in a
  • 00:04:27
    pipe delimited
  • 00:04:30
    markdown table
  • 00:04:33
    format let's see if we can do
  • 00:04:35
    this you reach your quote up okay that's
  • 00:04:38
    really interesting that the model was
  • 00:04:40
    able to the 11 billion parameter model
  • 00:04:43
    was able to accomplish the task and the
  • 00:04:45
    and the 90 billion parameter model did
  • 00:04:46
    not and that it can extract data from images
  • 00:04:49
    this is a big deal
  • 00:04:51
    because image
  • 00:04:53
    extraction getting data out of images
  • 00:04:56
    from a model that can recognize imagery
  • 00:04:57
    and not have to do OCR means that you
  • 00:04:59
    can extract data from images in a local
  • 00:05:03
    open
  • 00:05:05
    model when you use something like Google
  • 00:05:07
    Gemini or chat GPT or whatever uh
  • 00:05:10
    through their apis if you were to build
  • 00:05:11
    this into an app this would cost you a
  • 00:05:13
    decent amount of money every time you
  • 00:05:15
    you did it and if you're building an app
  • 00:05:17
    you're talking you know hundreds of
  • 00:05:18
    thousands of dollars not millions of
  • 00:05:20
    dollars possibly per day of users trying
  • 00:05:24
    to to to use your app these models these
  • 00:05:28
    Vision language models
  • 00:05:30
    allow you
  • 00:05:32
    to essentially do the same thing for the
  • 00:05:34
    cost of electricity so in this case I'm
  • 00:05:36
    using IBM's data center the 11 billion
  • 00:05:38
    parameter model I can run that on my Mac
  • 00:05:41
    and so I don't have to to use a third
  • 00:05:45
    party data center I don't have to to do
  • 00:05:47
    anything crazy like that I can do it
  • 00:05:48
    straight from my machine that's a big
  • 00:05:51
    deal that is a big deal for a couple
  • 00:05:54
    reasons obviously one is the
  • 00:05:56
    sustainability angle you can run an 11
  • 00:05:58
    billion parameter model on your laptop and get
  • 00:06:00
    good results out of it as as we clearly
  • 00:06:02
    were um you don't have to use a big data
  • 00:06:05
    center which means you don't have to
  • 00:06:05
    burn nearly as much electricity on the
  • 00:06:08
    task and two with an image recognition
  • 00:06:10
    model there's so many
  • 00:06:12
    images you might not be comfortable
  • 00:06:15
    handing off to a third party you know uh
  • 00:06:18
    a really good example would be health
  • 00:06:19
    care imaging maybe you know images of
  • 00:06:21
    X-rays and scans and stuff that is very
  • 00:06:24
    clearly very very personal
  • 00:06:26
    data no matter what the the service
  • 00:06:30
    guarantee or the terms of service are on
  • 00:06:32
    a on a on a website or a service like
  • 00:06:35
    chat GPT or
  • 00:06:37
    whatever I wouldn't feel comfortable
  • 00:06:39
    putting those kinds of images in there
  • 00:06:41
    that is very personal very sensitive
  • 00:06:43
    data on a Model that I can run on my
  • 00:06:45
    laptop and know for sure my data is not
  • 00:06:48
    going anywhere that's a great
  • 00:06:50
    application for me my next steps are
  • 00:06:52
    going to be trying to get the 11 billion
  • 00:06:54
    parameter model working locally on my
  • 00:06:55
    machine and if I can I'm going to write
  • 00:06:58
    a piece of software and the piece of software I'm
  • 00:06:59
    going to write is it's going to take my
  • 00:07:03
    folder of poorly labeled
  • 00:07:05
    screenshots and look at them and then
  • 00:07:08
    rename the files like what's in the
  • 00:07:10
    screenshot like a black dog and a yellow
  • 00:07:12
    hard hat wearing a uh wearing a yellow
  • 00:07:15
    hard hat and driving a forklift
  • 00:07:16
    right that would be a useful thing or
  • 00:07:18
    that other one you know an MMLU Benchmark
  • 00:07:20
    table being able to rename files that's
  • 00:07:22
    a very pedestrian use of this technology
  • 00:07:24
    but again if I can get the 11 billion
  • 00:07:25
    parameter model working on my machine
  • 00:07:27
    then it's just the cost of electricity
  • 00:07:29
    and that my machine consumes way less
  • 00:07:31
    electricity than a big data
  • 00:07:32
    center so that's the new llama 3.2
  • 00:07:36
    models um very impressed I'm a bit
  • 00:07:40
    confused as to why the 90 billion
  • 00:07:42
    parameter model had a refusal and the 11
  • 00:07:44
    billion didn't for the same image I
  • 00:07:47
    think there's
  • 00:07:49
    some interesting
  • 00:07:50
    stuff going on um but this is a a
  • 00:07:53
    technology that a lot of people are
  • 00:07:55
    going to want to use and the fact that
  • 00:07:57
    you can run it for the cost of
  • 00:07:58
    electricity on your machine is is
  • 00:08:00
    amazing so that's all for the for this
  • 00:08:02
    time talk to you soon
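
The presenter's stated next step is a small program that looks through a folder of poorly labeled screenshots and renames each file after what it shows. No such code appears in the video; the sketch below is one hypothetical way to build it around a locally served Llama 3.2 11B Vision model via Ollama, with the folder path, prompt, and model tag all assumed for illustration.

```python
# Hypothetical sketch of the screenshot-renaming idea described at the end of
# the video: ask a locally served vision model to describe each image, then
# rename the file to a slugified version of that description.
# Assumes Ollama is running locally with the llama3.2-vision model pulled.
import re
from pathlib import Path

import ollama

SCREENSHOT_DIR = Path("~/Screenshots").expanduser()  # assumed folder of screenshots
PROMPT = "Describe this image in at most eight words, suitable for a filename."

def slugify(text: str) -> str:
    """Lowercase the description and keep only filename-safe characters."""
    text = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return text[:80] or "unlabeled"

for image_path in sorted(SCREENSHOT_DIR.glob("*.png")):
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": PROMPT, "images": [str(image_path)]}],
    )
    description = response["message"]["content"]
    new_path = image_path.with_name(f"{slugify(description)}{image_path.suffix}")
    if not new_path.exists():  # never overwrite an existing file
        image_path.rename(new_path)
        print(f"{image_path.name} -> {new_path.name}")
```
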
Tags
  • Llama 3.2
  • Meta
  • Multimodal Model
  • IBM watsonx.ai
  • Image Recognition
  • Data Privacy
  • AI Models
  • Text and Image Processing
  • Parameter Sizes
  • Hardware Requirements