Ollama Llama 3 - RAG: How to create a local RAG system with Llama 3 using Ollama

00:28:00
https://www.youtube.com/watch?v=WfFpeBNfaeQ

Summary

TL;DR: In this video, Sidan shows how to host a Llama 3 model locally and perform retrieval augmented generation (RAG) using the Ollama framework. The process involves pulling the Llama 3 model onto a local system and preparing it with several tools, including LangChain for both the language model and document management. The tutorial demonstrates how to load a document, split it into manageable chunks, convert those chunks into vector embeddings with a Hugging Face model, and store them in FAISS for similarity search. It also covers installing the necessary dependencies and the coding steps needed to implement RAG, so queries are answered from a specific document instead of the model's trained data alone. Finally, it explains how a question-answering chain combines the LLM with the retrieved content to respond accurately to questions about the document.

Takeaways

  • 🤖 Use the Llama 3 model for local LLM hosting.
  • 📄 Implement RAG to answer document-specific queries.
  • 🔧 Utilize Ollama to pull and configure models.
  • 🛠️ Use Python libraries like LangChain and FAISS.
  • 🔍 Perform similarity searches with document embeddings.
  • 💾 Manage document content with character text splitting.
  • ⚙️ Convert text to vector embeddings with Hugging Face.
  • 🔗 Chain together steps for efficient document QA.
  • 🌐 Explore alternatives like GPT models with LangChain.
  • 📚 Learn the process from model setup to query answering.

Timeline

  • 00:00:00 - 00:05:00

    In this video, the presenter explains the concept of RAG (Retrieval Augmented Generation): using a Llama 3 model hosted with Ollama to answer questions from a specific document rather than from the model's entire training corpus. The video covers how to pull the Llama 3 model with Ollama and host it on the local system, as sketched below.
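
    For reference, the pull-and-host step from a terminal looks roughly like this (a sketch assuming Ollama is already installed; check the Ollama library page for the exact model tags):

        ollama pull llama3:instruct   # 8B instruct variant used in the video
        ollama pull llama3:8b         # base 8B variant, pulled for comparison
        ollama list                   # confirm which models are available locally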

  • 00:05:00 - 00:10:00

    The next step is setting up the necessary libraries and dependencies, including LangChain, FAISS for similarity search, and Transformers for the embedding models. The presenter notes potential CUDA version issues when choosing between the CPU and GPU builds of FAISS and stresses picking the right libraries for text processing and vector embeddings; an install sketch follows.
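
    A hedged install sketch matching the libraries named here (pin versions to whatever the shared notebook specifies):

        pip install langchain langchain-community
        pip install faiss-cpu                  # faiss-gpu is the alternative if your CUDA setup cooperates
        pip install "unstructured[pdf]"        # PDF support for the unstructured loader
        pip install transformers sentence-transformers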

  • 00:10:00 - 00:15:00

    The process then involves loading the Llama 3 model, creating document chunks, and converting the text chunks into vector embeddings. This includes setting model parameters such as temperature, which controls response randomness, and using character text splitting to produce manageable chunks for embedding (see the sketch below).
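
    A minimal sketch of this step, assuming the llama3:instruct model has already been pulled and that the PDF sits next to the notebook (the filename attention_is_all_you_need.pdf is a placeholder):

        from langchain_community.llms import Ollama
        from langchain_community.document_loaders import UnstructuredFileLoader
        from langchain.text_splitter import CharacterTextSplitter

        # load the locally hosted model; temperature=0 minimizes randomness in answers
        llm = Ollama(model="llama3:instruct", temperature=0)

        # read the PDF into LangChain Document objects
        loader = UnstructuredFileLoader("attention_is_all_you_need.pdf")
        documents = loader.load()

        # split on newlines into ~1000-character chunks with 200 characters of overlap
        text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
        text_chunks = text_splitter.split_documents(documents)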

  • 00:15:00 - 00:20:00

    After embedding the text, the presenter shows how to build a knowledge base of vector embeddings for similarity search. The RetrievalQA chain is then set up with LangChain, tying the LLM to the vector store so that questions are converted into embeddings, relevant chunks are retrieved, and answers are generated; a sketch follows.
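
    Continuing the sketch (HuggingFaceEmbeddings() with no arguments falls back to a default sentence-transformers model, as in the video):

        from langchain_community.embeddings import HuggingFaceEmbeddings
        from langchain_community.vectorstores import FAISS
        from langchain.chains import RetrievalQA

        # embed the chunks and build an in-memory FAISS index as the knowledge base
        embeddings = HuggingFaceEmbeddings()
        knowledge_base = FAISS.from_documents(text_chunks, embeddings)

        # tie the LLM to the vector store so retrieved chunks are passed as context
        qa_chain = RetrievalQA.from_chain_type(llm, retriever=knowledge_base.as_retriever())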

  • 00:20:00 - 00:28:00

    Finally, the presentation demonstrates querying the system with questions about the document, showing how responses are generated by the Llama 3 model (example below). A recap covers the entire process: model loading, PDF processing, text chunking, embedding, and generating Q&A responses using vector similarity and the LLM.
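
    Querying the finished chain might look like this; the generated answer comes back under the "result" key:

        question = "What is this document about?"
        response = qa_chain.invoke({"query": question})
        print(response["result"])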

Video Q&A

  • What is RAG?

    RAG stands for Retrieval Augmented Generation, where LLMs answer queries based on content from a specific document rather than their entire training corpus.

  • How can Llama 3 be hosted locally?

    By using the Ollama framework to pull and host the Llama 3 model on a local system.

  • What are the prerequisites for working with Llama 3 locally?

    You must install Ollama and ensure your system meets the GPU requirements for the model version you choose.

  • Which programming libraries are used in the tutorial?

    The tutorial uses Python with libraries like LangChain, FAISS, Hugging Face embeddings, and Unstructured.

  • What is the purpose of character text splitting?

    Character text splitting is used to divide document content into manageable chunks for processing in RAG.
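
    A small sketch of the splitter, using the parameters quoted in the video (split_documents works on Document objects; split_text is the variant for plain strings, both mentioned in the tutorial):

        from langchain.text_splitter import CharacterTextSplitter

        text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
        text_chunks = text_splitter.split_documents(documents)  # `documents` from the loader step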

  • Can different versions of Llama or Gemma be used?

    Yes, different versions of Llama or Gemma models can be used depending on the system's GPU capabilities.

  • Why is embedding conversion necessary?

    Embedding conversion allows document text and queries to be transformed into vectors for similarity searches.
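
    For instance, a query string can be embedded directly (embed_query is the single-string entry point; the vector length depends on the underlying model, e.g. 768 for the default):

        from langchain_community.embeddings import HuggingFaceEmbeddings

        embeddings = HuggingFaceEmbeddings()  # default sentence-transformers model
        vector = embeddings.embed_query("What is machine translation?")
        print(len(vector))  # dimensionality of the embedding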

  • What is FAISS used for in this tutorial?

    FAISS is employed for performing similarity searches over vector embeddings of text chunks.
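
    The retrieval half can also be exercised on its own (a sketch; knowledge_base is the FAISS index built from the document chunks):

        # return the k chunks whose embeddings are closest to the query's embedding
        docs = knowledge_base.similarity_search("machine translation", k=3)
        for doc in docs:
            print(doc.page_content[:200])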

  • What does 'invoke' do in this code?

    'Invoke' runs the RetrievalQA chain to produce an answer from the LLM based on the retrieved content.
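
    For example (qa_chain as assembled in the tutorial; the returned dict echoes the query alongside the generated result):

        response = qa_chain.invoke({"query": "What architecture does the paper propose?"})
        print(response["result"])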

  • What alternatives to Llama can be used for the LLM in this setup?

    Alternatives like GPT models available through LangChain can be used as the LLM instead of Llama.
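
    A sketch of that swap, assuming an OpenAI API key is set in the environment and the langchain-openai package is installed; everything downstream of the LLM stays the same:

        from langchain_openai import ChatOpenAI

        # drop-in replacement for the local Ollama LLM
        llm = ChatOpenAI(model="gpt-4o", temperature=0)
        qa_chain = RetrievalQA.from_chain_type(llm, retriever=knowledge_base.as_retriever())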

Subtitles
  • 00:00:01
    hello everyone I'm Sidan in this video
  • 00:00:03
    let's host a Llama 3 model on local using
  • 00:00:06
    Ollama and using this LLM we are going
  • 00:00:08
    to perform retrieval augmented
  • 00:00:10
    generation shortly known as RAG so RAG
  • 00:00:13
    is basically when you want your llm to
  • 00:00:16
    answer from a context window so let's
  • 00:00:18
    say that you have a document and you
  • 00:00:19
    want the llm to answer based on the
  • 00:00:22
    content that's present in this document
  • 00:00:24
    instead of the large corpus of data it
  • 00:00:26
    has been trained on so this is what RAG
  • 00:00:28
    means and in this video we are going to
  • 00:00:30
    do all this locally by using Ollama
  • 00:00:32
    using which we will pull our Llama 3
  • 00:00:34
    model and and we will host it in our
  • 00:00:35
    local system so this will be the agenda
  • 00:00:38
    for today's video and let's get started
  • 00:00:40
    so the first step is to pull the Llama 3
  • 00:00:44
    model so for that first you need to
  • 00:00:46
    install Ollama so I've already made a video
  • 00:00:48
    on how to get started with this Ollama
  • 00:00:50
    thing uh you can go through that video
  • 00:00:53
    if you want I'll give the link of that
  • 00:00:55
    video in this video description you can
  • 00:00:56
    check that out so next Once you are done
  • 00:00:58
    with that go to Google search Ollama Llama 3
  • 00:01:02
    so if you want to download you can just
  • 00:01:03
    download from this page as well so the
  • 00:01:05
    download process is pretty simple for uh
  • 00:01:08
    Mac Windows as well as Linux it's like
  • 00:01:10
    not tough at all so once we are done
  • 00:01:13
    with the installation now we can pull
  • 00:01:14
    the model so here you can find the
  • 00:01:16
    different models that's present over
  • 00:01:18
    here so you can search the models like
  • 00:01:19
    Gemma Llama 3 and so on so our model of
  • 00:01:22
    interest is Llama 3 and even if you want
  • 00:01:24
    to use a Gemma let's say that you have a
  • 00:01:26
    machine that doesn't have like that
  • 00:01:27
    large of a GPU so for example my my
  • 00:01:30
    local machine only has like a 4 GB GPU
  • 00:01:33
    because it's an NVIDIA GTX 1650 so it
  • 00:01:37
    only comes with 4 GB of GPU RAM so here
  • 00:01:40
    I'm currently using an EC2 instance with
  • 00:01:42
    16GB GPU which is uh T4 Tensor Core
  • 00:01:46
    Tesla GPU so again so I'm using a Llama 3
  • 00:01:49
    model so if not you can just use a Gemma
  • 00:01:51
    model as well so that's the main thing
  • 00:01:53
    so let's say that we have searched Llama
  • 00:01:55
    3 in Ollama and you have this different
  • 00:01:58
    version so you have 70 billion version
  • 00:01:59
    you can also see the size over here it's
  • 00:02:01
    40 GB and the 8 billion version is 4.7
  • 00:02:04
    GB and you can also see it's Q4 which is
  • 00:02:07
    4-bit quantization so the weights or the
  • 00:02:10
    parameters of the model are reduced to a
  • 00:02:12
    precision of 4 bits so you know the
  • 00:02:15
    purpose of doing this is your model's
  • 00:02:18
    size decreases but the performance or
  • 00:02:21
    the efficiency isn't affected so
  • 00:02:23
    much so of course there will be a drop
  • 00:02:25
    in the performance but it's not like
  • 00:02:26
    that much so that's about quantization
  • 00:02:28
    so for RAG we can use the instruct version
  • 00:02:32
    instead of the base version so I'll
  • 00:02:34
    click this instruct and in order to pull
  • 00:02:36
    this instruct version you have to use
  • 00:02:37
    this particular command so this is ollama
  • 00:02:40
    run so this is used when you have
  • 00:02:43
    already pulled a model and then let's
  • 00:02:45
    say you want to run this model so you
  • 00:02:47
    would use that so for the first step you
  • 00:02:49
    need to pull it so I'll come to my
  • 00:02:51
    terminal and if you are in Windows you
  • 00:02:54
    probably would have installed this Ollama
  • 00:02:55
    and after that you can access this from
  • 00:02:57
    the terminal from your windows
  • 00:02:59
    Powershell so Linux and Mac users also
  • 00:03:01
    can access this from the terminal you
  • 00:03:02
    can say ollama space help so this will
  • 00:03:05
    display like all the commands that you
  • 00:03:07
    have over here so serve create Show run
  • 00:03:09
    pull and so on so let's say that uh I
  • 00:03:12
    can just type
  • 00:03:14
    in list and this will list all the
  • 00:03:16
    models that I have pulled already so
  • 00:03:19
    here you can see that I have Gemma 7
  • 00:03:20
    billion model Llama 3 instruct this
  • 00:03:23
    particular model so let's say that I
  • 00:03:24
    want to pull just the base version 8
  • 00:03:27
    billion so what I should do is I should
  • 00:03:29
    come to the
  • 00:03:30
    terminal and say ollama pull so look at the
  • 00:03:35
    name over here which is llama3 colon 8
  • 00:03:38
    billion so llama3:8b so
  • 00:03:42
    when I run this so it's going to load
  • 00:03:45
    the model so it's going to pull the
  • 00:03:46
    model from the repository so that's the
  • 00:03:49
    first step but again we are not going to
  • 00:03:51
    work with 8 billion so we are going to
  • 00:03:53
    work with 8 billion instruct right so
  • 00:03:56
    I'll clear all this and list this again
  • 00:04:00
    so now I have 8 billion 8 billion
  • 00:04:01
    instruct and Gemma 7B so this is how you
  • 00:04:03
    can pull the model if you want you can
  • 00:04:05
    uh you know use again Gemma as well so Gemma
  • 00:04:07
    comes with 7 billion version and 2
  • 00:04:09
    billion version so you can check that
  • 00:04:10
    out as well so that's the other
  • 00:04:12
    thing
  • 00:04:15
    okay so now we can go into the coding
  • 00:04:18
    aspects where we will load this model
  • 00:04:21
    and how to do the retrieval augmented
  • 00:04:22
    Generation by passing a document and
  • 00:04:24
    asking a question to the llm so that
  • 00:04:26
    will be the next step so for this you
  • 00:04:28
    need a bunch of libraries so we are
  • 00:04:30
    going to install LangChain and LangChain
  • 00:04:33
    community so the versions are given over
  • 00:04:35
    here so I'll share you this particular
  • 00:04:36
    notebook and then we have this FAISS
  • 00:04:39
    which stands for Facebook AI similarity
  • 00:04:42
    search so this was developed by
  • 00:04:43
    Facebook's AI research team so this is
  • 00:04:46
    just to find like a similarity search of
  • 00:04:48
    the vectors so here we will convert all
  • 00:04:51
    the content of the document into vector
  • 00:04:53
    embeddings and the query will also be
  • 00:04:54
    converted into vector embeddings and we
  • 00:04:56
    do a similarity search so for that we
  • 00:04:58
    need this and FAISS comes in two versions
  • 00:05:00
    one is CPU and the other one is GPU and
  • 00:05:04
    because of some CUDA version errors so
  • 00:05:06
    GPU is giving me some issues so I'm
  • 00:05:08
    going with CPU and even if you are in
  • 00:05:10
    Python 3.10 you probably shouldn't be
  • 00:05:12
    facing any issues but if you're in 3.12
  • 00:05:15
    so sometimes like it's causing issues
  • 00:05:17
    with the CUDA and so on so you can
  • 00:05:19
    figure out maybe but if not you can just
  • 00:05:21
    use the CPU version for now so we have
  • 00:05:23
    the faiss-cpu and then we have the
  • 00:05:25
    unstructured so this unstructured
  • 00:05:27
    library is used to load the content of
  • 00:05:29
    the PDF file and then we have this PDF
  • 00:05:32
    uh you know related code for this
  • 00:05:34
    unstructured so that's comes with this
  • 00:05:36
    unstructured uh square bracket PDF so
  • 00:05:38
    that model will be downloaded and then
  • 00:05:40
    you have Transformers so we are
  • 00:05:43
    downloading the Transformers and the
  • 00:05:44
    sentence Transformers so this is used to
  • 00:05:46
    load uh the embedding model so we have
  • 00:05:49
    the sentence embedding models uh so in
  • 00:05:51
    order to load that we are going to use
  • 00:05:53
    the Transformers and sentence
  • 00:05:54
    Transformer Library so these are all the
  • 00:05:56
    dependencies that we need so the next
  • 00:05:58
    step is once you have have installed all
  • 00:06:00
    this libraries I've already installed so
  • 00:06:02
    I'm just have commented it so if not uh
  • 00:06:05
    if you haven't already installed it you
  • 00:06:06
    can uncomment this and work on it so the
  • 00:06:08
    next step will be importing the
  • 00:06:10
    dependencies so here I'll
  • 00:06:15
    say importing the
  • 00:06:21
    dependencies so I'll say import os I'll
  • 00:06:24
    tell you why we need all these things or
  • 00:06:28
    the next thing is from langchain_
  • 00:06:35
    community.
  • 00:06:37
    llms import
  • 00:06:40
    Ollama and
  • 00:06:43
    from langchain.
  • 00:06:46
    document_
  • 00:06:52
    loaders
  • 00:06:56
    import UnstructuredFileLoader
  • 00:07:06
    from langchain.
  • 00:07:09
    embeddings I'll explain you the purpose
  • 00:07:11
    of all this libraries and functions in a
  • 00:07:13
    minute once I'm done with this FAISS so all
  • 00:07:16
    should be
  • 00:07:17
    uppercase from lang
  • 00:07:21
    chain dot text
  • 00:07:26
    splitter
  • 00:07:28
    import Character
  • 00:07:30
    Text
  • 00:07:34
    Splitter and finally we
  • 00:07:37
    need langchain.
  • 00:07:40
    chains import Retrieval
  • 00:07:45
    QA okay so these are the things let's
  • 00:07:47
    understand this so we need uh one
  • 00:07:50
    second this should be
  • 00:07:57
    import os so os is used in order to uh you
  • 00:08:02
    know access some of the files that's
  • 00:08:04
    present in our system so for that we use
  • 00:08:06
    this and then we have this Ollama Ollama is
  • 00:08:08
    used in order to access the model that
  • 00:08:11
    we have just now pulled which is Llama 3
  • 00:08:13
    or Gemma model and so on and then we
  • 00:08:15
    have this UnstructuredFileLoader in
  • 00:08:16
    order to load our PDF file and then we
  • 00:08:18
    have this langchain.embeddings so from
  • 00:08:21
    langchain.embeddings we are importing
  • 00:08:24
    FAISS so okay so this should be not
  • 00:08:27
    embeddings but
  • 00:08:29
    okay let me try this
  • 00:08:32
    again
  • 00:08:40
    from langchain
  • 00:08:48
    community dot vector
  • 00:08:54
    stores import FAISS so from langchain.
  • 00:08:58
    embeddings we have to
  • 00:09:00
    load the Hugging Face
  • 00:09:04
    embeddings so these are the things let's
  • 00:09:07
    run this and see if this is
  • 00:09:19
    working okay so it's worked well so we
  • 00:09:22
    have Ollama in order to access the LLM that
  • 00:09:24
    we have just now pulled and we need this
  • 00:09:26
    unstructured file loader in order to
  • 00:09:28
    read the PDF file and get all the text
  • 00:09:30
    content out of it and then you have this
  • 00:09:32
    FAISS for similarity search of vector
  • 00:09:34
    embeddings and then you have this uh
  • 00:09:37
    langchain.embeddings we are importing
  • 00:09:39
    this Hugging Face embeddings using which
  • 00:09:40
    we will uh you know convert the text
  • 00:09:42
    into Vector embeddings so the next step
  • 00:09:44
    is character text splitter so this is used in
  • 00:09:46
    order to chunk the content that we have
  • 00:09:49
    in the PDF file so we cannot like
  • 00:09:51
    pass this entire text so instead we
  • 00:09:53
    would uh chunk this in a bit by bit
  • 00:09:55
    fashion so once all all these chunks are
  • 00:09:58
    present we will convert those things into a
  • 00:10:00
    vector embedding thing and store it in a
  • 00:10:02
    vector DB or just create a knowledge
  • 00:10:04
    base out of it and later we can use FAISS
  • 00:10:07
    in order to do a similarity search so
  • 00:10:09
    for chunking it we are using this
  • 00:10:10
    character text splitter and then we
  • 00:10:12
    have retrieval QA so there are like
  • 00:10:14
    different chains so there are like
  • 00:10:15
    conversational chains retrieval QA chains
  • 00:10:17
    and so on so our Focus here is just
  • 00:10:19
    document question answering so for that
  • 00:10:21
    uh we can just stick with retrieval QA
  • 00:10:24
    where you can just pass this document
  • 00:10:26
    thing and and pass a question and and
  • 00:10:28
    get a response out of it so this is how
  • 00:10:30
    all the things work so if you are also
  • 00:10:32
    not very sure about the concept of
  • 00:10:34
    retrieval augmented generation I've
  • 00:10:37
    already made a video about that so it's
  • 00:10:38
    like a conceptual video of all the you
  • 00:10:41
    know things that's come together and the
  • 00:10:43
    architecture of it so I'll give that
  • 00:10:45
    video link in the description as well so
  • 00:10:46
    you can check that out so that's all
  • 00:10:49
    about the import statement now let's uh
  • 00:10:51
    load the LLM so I'll say llm is equal to
  • 00:10:57
    Ollama and within that I can
  • 00:11:03
    say so llm is equal to
  • 00:11:08
    Ollama model is equal to so we have to
  • 00:11:11
    name this model so the model name is
  • 00:11:13
    llama3
  • 00:11:16
    colon instruct so this is the 8 billion
  • 00:11:19
    version the other version is 70 billion
  • 00:11:22
    so llama3:
  • 00:11:25
    instruct and the other thing that I'm
  • 00:11:27
    going to need is parameters temperature
  • 00:11:29
    so you can also include other parameters
  • 00:11:31
    as well so temperature is kind of
  • 00:11:33
    determines the randomness of the
  • 00:11:35
    response of the model so when I run this
  • 00:11:37
    this will load the Llama 3 instruct model
  • 00:11:40
    so this won't work if you haven't pulled
  • 00:11:41
    your model already so this will work
  • 00:11:43
    only for the list of models that you
  • 00:11:44
    have pulled and it's like available to
  • 00:11:46
    you so here also you can see the sizes
  • 00:11:49
    of the model so that's the next step so
  • 00:11:51
    we have loaded our Llama 3 instruct again
  • 00:11:53
    if you're working with Gemma right so you
  • 00:11:55
    just have to replace this Llama 3
  • 00:11:57
    instruct with whatever Gemma version that
  • 00:11:59
    you're working on so 2 billion just
  • 00:12:01
    comes with gemma colon 2 billion so you
  • 00:12:03
    can check that out once and you can load
  • 00:12:05
    that so the next step is loading the
  • 00:12:07
    document so here I'll put a
  • 00:12:11
    comment saying
  • 00:12:15
    that sorry so loading the LLM which is
  • 00:12:18
    Llama 3 in this case the next step
  • 00:12:22
    is loading the
  • 00:12:24
    document so for this I can create a
  • 00:12:27
    variable called as loader and loader is
  • 00:12:29
    equal to UnstructuredFileLoader which
  • 00:12:33
    is the function that we have
  • 00:12:36
    imported and within that we can pass the
  • 00:12:39
    name of the file so for this we are
  • 00:12:41
    going to work with a PDF file and that
  • 00:12:44
    PDF file is the paper that's that was
  • 00:12:47
    released on Transformers so I'll copy
  • 00:12:49
    the name of this
  • 00:12:54
    file and paste it over here so both this
  • 00:12:57
    notebook and this file are in the same
  • 00:12:58
    directory so I'm not giving any path to
  • 00:13:00
    this I'm just mentioning the name of
  • 00:13:02
    this file and I can say
  • 00:13:06
    documents is equal to loader.load so
  • 00:13:09
    this will load the content of the PDF in
  • 00:13:12
    in a document data type which is an
  • 00:13:14
    unstructured data type let's run this so
  • 00:13:17
    this will load this content so this may
  • 00:13:19
    take some time depending upon the size
  • 00:13:21
    of the PDF so if you want
  • 00:13:23
    to uh know how big of this PDF file is
  • 00:13:27
    right maybe I can open it and show you
  • 00:13:29
    to
  • 00:13:33
    you so in the
  • 00:13:40
    downloads oh
  • 00:13:42
    sorry this is the PDF file so you can
  • 00:13:45
    see the total number of pages is 11 and
  • 00:13:48
    it has like some content about like the
  • 00:13:50
    transformer architecture so this is the
  • 00:13:52
    paper that was uh you know released by
  • 00:13:54
    Google first basically on the
  • 00:13:56
    Transformer model so we have this
  • 00:13:58
    documents is equal to loader.load so this
  • 00:14:00
    will load the documents and the next
  • 00:14:03
    step is we are going to create chunks
  • 00:14:04
    for this
  • 00:14:06
    document
  • 00:14:08
    create document
  • 00:14:13
    chunks so I'll say text
  • 00:14:16
    splitter is equal
  • 00:14:19
    to character text
  • 00:14:24
    splitter and we can mention like what
  • 00:14:26
    kind of separator needs to be so let's
  • 00:14:28
    say that the separator is slash n which
  • 00:14:31
    is a newline
  • 00:14:32
    separator and then we have this chunk
  • 00:14:36
    size and chunk size let's say that it's
  • 00:14:39
    1000 you can play around with the
  • 00:14:41
    different numbers and overlap is equal
  • 00:14:44
    to
  • 00:14:48
    200 right so here we are creating an
  • 00:14:50
    instance of this character splitter and
  • 00:14:52
    giving the parameters for this the
  • 00:14:54
    separator can be new line the chunk size
  • 00:14:56
    can be 1000 and the overlap is equal to
  • 00:14:58
    200 so wherever it sees a newline
  • 00:15:00
    character it's going to separate those
  • 00:15:02
    into different chunks and again uh the
  • 00:15:04
    maximum number of uh characters that can
  • 00:15:08
    be present in one chunk is 1000 so that's
  • 00:15:10
    what we are mentioning over here and the
  • 00:15:11
    overlap is 200 so that means uh we won't
  • 00:15:15
    chunk it just by the character sizes so
  • 00:15:17
    what happens is when you kind of Chunk
  • 00:15:20
    it like in a in a discrete way the llm
  • 00:15:23
    or the vector embeddings may not have this
  • 00:15:25
    proper context so maybe I'll show you
  • 00:15:28
    show this to you visually so let's say
  • 00:15:30
    that we have this a large paragraph
  • 00:15:33
    about this Transformer thing or let's
  • 00:15:35
    say we have this paragraph right and uh
  • 00:15:37
    when we chunk it let's say
  • 00:15:40
    that H let's take this example so these
  • 00:15:44
    lines are chunked to one chunk and the
  • 00:15:47
    line after that are chunk to a different
  • 00:15:49
    chunk so we don't do that so instead we
  • 00:15:52
    will first make this as the first chunk and
  • 00:15:55
    do a overlap so let's say that uh this
  • 00:15:57
    has like let's say 6 7 lines and the
  • 00:15:59
    second line Second chunk will start from
  • 00:16:01
    this line so there will be a overlap so
  • 00:16:03
    that the context is not lost when we do
  • 00:16:06
    this conversion from
  • 00:16:08
    uh you know text to vector embeddings so
  • 00:16:10
    that's the reason we are mentioning this
  • 00:16:11
    overlap parameter of 200 again you can
  • 00:16:13
    play around with this number so this is
  • 00:16:16
    the instance that we are getting okay so
  • 00:16:19
    this should be chunk
  • 00:16:22
    overlap so we are instantiating
  • 00:16:24
    this CharacterTextSplitter with all this
  • 00:16:26
    parameters now we can pass our document
  • 00:16:29
    which is basically the content that the
  • 00:16:31
    unstructured file loader has loaded so
  • 00:16:34
    here I can say you can print and show
  • 00:16:37
    this to you so you can see so this is
  • 00:16:39
    the entire text that has been derived
  • 00:16:42
    from the PDF using this unstructured
  • 00:16:44
    file loader thing and it's form of it's
  • 00:16:46
    in the type of document thing page
  • 00:16:48
    content all this you know entire content
  • 00:16:51
    so I'll delete this for now so here we
  • 00:16:54
    can say text
  • 00:16:56
    chunks is equal to
  • 00:16:59
    text
  • 00:17:02
    splitter dot split_documents so this is
  • 00:17:05
    in documents format so we can use split
  • 00:17:07
    documents sometimes you can also you
  • 00:17:09
    know chunk this as just as text string
  • 00:17:12
    data types so in those cases we will use
  • 00:17:14
    split text so here we are saying split
  • 00:17:17
    documents and within that we can pass
  • 00:17:19
    the documents variable that we have
  • 00:17:20
    created over here so let's run
  • 00:17:25
    this so This step is done so first we
  • 00:17:28
    have loaded the document using
  • 00:17:29
    unstructured file loader and then uh we
  • 00:17:32
    are splitting this uh documents content
  • 00:17:34
    into different
  • 00:17:36
    chunks so the next step is loading the
  • 00:17:39
    embedding model so here I'll
  • 00:17:44
    say loading the vector embedding model
  • 00:17:47
    so this is used to convert this text
  • 00:17:49
    into Vector
  • 00:17:51
    embeddings so let's create a variable
  • 00:17:54
    called as embeddings is equal to Hugging
  • 00:17:56
    Face embeddings so within this
  • 00:17:59
    Hugging Face embeddings you can use specific
  • 00:18:01
    models that are present in the Hugging Face
  • 00:18:03
    library sorry the Hugging Face library
  • 00:18:05
    or Hugging Face Hub but you can just go
  • 00:18:07
    with the default one by just mentioning
  • 00:18:09
    HuggingFaceEmbeddings so let's run this so
  • 00:18:12
    similar to loading the LLM this will load
  • 00:18:14
    the embedding
  • 00:18:16
    model there are some warnings but you
  • 00:18:18
    can ignore that for now it's about like
  • 00:18:20
    resuming the download so the next step
  • 00:18:23
    is converting all this text content into
  • 00:18:27
    Vector embeddings using this embeddings
  • 00:18:29
    model so here I'll
  • 00:18:33
    say knowledge
  • 00:18:39
    base so knowledge base is equal to
  • 00:18:45
    FAISS dot from_
  • 00:18:52
    documents text_chunks comma
  • 00:18:55
    embeddings so we are going to convert
  • 00:18:58
    all these
  • 00:19:00
    uh text chunks into vector embeddings
  • 00:19:02
    and later this will be used for
  • 00:19:04
    similarity search from this FAISS uh you
  • 00:19:07
    know the similarity search Library so
  • 00:19:09
    this will be the next
  • 00:19:12
    step so we are almost done with this
  • 00:19:15
    only like there are few things left over
  • 00:19:17
    here so now I'll
  • 00:19:23
    say retrieval QA chain so let's say
  • 00:19:28
    that QA chain is equal to retrieval QA
  • 00:19:31
    the chain that we have
  • 00:19:34
    imported dot from_chain_
  • 00:19:40
    type and here I'll say llm
  • 00:19:47
    comma retriever is equal
  • 00:19:52
    to knowledge base
  • 00:19:55
    dot as retriever
  • 00:20:05
    okay so we are creating a QA chain a
  • 00:20:09
    question answering chain okay knowledge
  • 00:20:12
    base must have
  • 00:20:15
    a NameError
  • 00:20:18
    okay let's rename this and run this
  • 00:20:22
    again H we are creating a variable
  • 00:20:24
    called as QA chain and using this
  • 00:20:26
    retrieval QA which is used for this
  • 00:20:28
    question answering task coupled with
  • 00:20:30
    this uh you know vector DB and LLMs and
  • 00:20:33
    from chain type we are passing a llm and
  • 00:20:35
    the retriever retriever is basically
  • 00:20:37
    your knowledge base as retriever so what
  • 00:20:40
    basically happens in this rag thing is
  • 00:20:42
    right so you have a document so in this
  • 00:20:44
    case the attention is all you need paper
  • 00:20:46
    is the document right uh this entire
  • 00:20:49
    content will be read and this will be
  • 00:20:51
    stored in a vector database so if you want
  • 00:20:53
    to use an actual database you can go
  • 00:20:56
    with Chroma DB Lance DB and other vector
  • 00:20:58
    DBs that are present this FAISS just
  • 00:21:00
    runs in the memory so let's say that we
  • 00:21:02
    are considering a vector database which
  • 00:21:04
    stores these vector embeddings so all
  • 00:21:06
    these contents will be converted into
  • 00:21:08
    vector embeddings and stored in the
  • 00:21:10
    vector DB and for each of the text let's
  • 00:21:13
    say that a particular sentence has been
  • 00:21:14
    converted into a vector embedding so
  • 00:21:18
    it's nothing but some set of numbers
  • 00:21:20
    that represent this text as vectors and
  • 00:21:22
    there will be this corresponding IDs as
  • 00:21:24
    well to extract this corresponding text
  • 00:21:26
    from those vectors so now you have
  • 00:21:28
    vector DB that has the vector embeddings
  • 00:21:30
    of this entire document and now the user
  • 00:21:32
    asks a question so let's say that the
  • 00:21:33
    user asks a question about this Hardware
  • 00:21:35
    and Schedule so now what happens is you
  • 00:21:37
    would convert that query as well into a
  • 00:21:41
    vector embedding and now you search the
  • 00:21:44
    vectors that's similar to this question
  • 00:21:46
    to this entire vectors of this document
  • 00:21:48
    so let's say that in this case right so
  • 00:21:51
    when the user asks uh you know something
  • 00:21:54
    about let's
  • 00:21:57
    say uh the machine translation thing so
  • 00:21:59
    the question is about machine
  • 00:22:00
    translation so it will convert that into
  • 00:22:03
    a vector and search that Vector with
  • 00:22:05
    this entire document's vector embeddings
  • 00:22:07
    and let's say that it will extract this
  • 00:22:09
    particular text so now you have the
  • 00:22:12
    question and the content in which the
  • 00:22:15
    answer may be present so now you sent
  • 00:22:17
    this question in text format as well as
  • 00:22:19
    the answer in text format to the llm and
  • 00:22:22
    the llm will frame the answer so this is
  • 00:22:24
    all that's happening it's not like the
  • 00:22:26
    llm kind of feeds from the vector
  • 00:22:28
    embeddings so it doesn't it doesn't kind
  • 00:22:30
    of feed on the vector embeddings so what
  • 00:22:32
    happens is the vector embeddings is used
  • 00:22:33
    in order to do the similarity search and
  • 00:22:35
    find the similar vectors only the
  • 00:22:37
    question and the relevant content that
  • 00:22:39
    has been retrieved from the vector
  • 00:22:42
    database is sent to the model so it's
  • 00:22:44
    basically like the question will be what
  • 00:22:46
    is machine translation mentioned in this
  • 00:22:48
    paper so that question and this
  • 00:22:49
    particular text will be sent to the
  • 00:22:51
    model and the model or the llm in this
  • 00:22:53
    case the Llama 3 model will frame
  • 00:22:56
    the answer for this question and it will
  • 00:22:57
    send this to so this is what happens
  • 00:22:59
    it's not like the entire content is fed
  • 00:23:01
    to the llm so that's not how this works
  • 00:23:03
    so this is about the architecture of
  • 00:23:05
    this Rag and this is what I've explained
  • 00:23:07
    in my conceptual video as well you can
  • 00:23:09
    check that out if like you want to
  • 00:23:11
    understand like this concept like more
  • 00:23:13
    clearly so this is about this QA chain next
  • 00:23:16
    let's let's ask a
  • 00:23:18
    question so let's say that this question
  • 00:23:20
    is equal
  • 00:23:22
    to let's ask a simple question of what
  • 00:23:24
    is this document about
  • 00:23:30
    and then you have to say
  • 00:23:34
    response is equal to qa_chain the
  • 00:23:37
    retrieval QA chain that we have created
  • 00:23:39
    dot invoke so you have to use this invoke
  • 00:23:42
    and within this you can pass this query
  • 00:23:45
    as a
  • 00:23:49
    dictionary
  • 00:23:50
    query colon
  • 00:23:55
    question now let's print this response
  • 00:23:57
    the
  • 00:23:59
    final answer will be present within this
  • 00:24:02
    response and within the key called as
  • 00:24:06
    result so you can check that okay so
  • 00:24:10
    let's run this so here we are creating a
  • 00:24:12
    QA chain and passing the llm and the
  • 00:24:14
    knowledge base the vector embeddings
  • 00:24:16
    that stored and then we are passing a
  • 00:24:17
    question so this question will be
  • 00:24:19
    converted into Vector embedding and we
  • 00:24:20
    will retrieve the relevant information
  • 00:24:22
    and then pass this to llm so this is
  • 00:24:24
    what happens in a QA chain and we are
  • 00:24:26
    using this QA chain. invoke to invoke
  • 00:24:28
    this chain which has this llm and the
  • 00:24:30
    knowledge base put this query and get
  • 00:24:33
    the relevant response from this so this
  • 00:24:35
    is how this rag for this retrieval QA
  • 00:24:38
    works and let's
  • 00:24:40
    see so the llm of choice in this case is
  • 00:24:44
    you know that it's Llama 3 but again as I
  • 00:24:46
    said you can use Gemma or if you want to
  • 00:24:48
    use uh GPT-4o even you can use the Lang
  • 00:24:52
    chain version of this GPT-4o so you can
  • 00:24:54
    access the GPT-4o API from LangChain itself so
  • 00:24:58
    that API thing you can pass to this LLM
  • 00:25:00
    and then pass this in turn to this
  • 00:25:02
    retrieval QA so this answer says that
  • 00:25:04
    this document appears to be a research
  • 00:25:06
    paper on article about the Transformer
  • 00:25:08
    sequence transduction model and so on so
  • 00:25:10
    it was able to capture that information
  • 00:25:12
    properly maybe let's ask another
  • 00:25:14
    question of again a similar question of
  • 00:25:17
    what is uh you know the architecture
  • 00:25:19
    discussed in this model
  • 00:25:35
    let's ask
  • 00:25:36
    this so this the second turn uh kind of
  • 00:25:40
    took about 4 or 5 seconds it won't
  • 00:25:42
    take like that much of a time so the
  • 00:25:45
    first time the model loads and all the
  • 00:25:47
    things happen so it might take a some
  • 00:25:49
    time but the later questions that you
  • 00:25:51
    ask won't take that much of a time again
  • 00:25:53
    it also depends on the GPU that you are
  • 00:25:55
    using so the answer is the architecture
  • 00:25:57
    discussed in the model and so on again
  • 00:25:59
    so we can't assure that all the
  • 00:26:01
    questions that we ask kind of gives you
  • 00:26:03
    accurate response because this is fairly
  • 00:26:05
    a smaller model so Llama 3 8 billion
  • 00:26:08
    version so the larger version is 70
  • 00:26:10
    billion again of course we can't load it
  • 00:26:12
    because of the GPU limitation but if you
  • 00:26:14
    are working in a company and they have
  • 00:26:17
    that machine with them or the cloud
  • 00:26:19
    instance with them then you can use like
  • 00:26:20
    the 70 billion version but all this
  • 00:26:23
    steps Remains the Same so you can you
  • 00:26:25
    know go to an EC2 uh you know instead
  • 00:26:28
    of 8 billion you can kind of load the 70
  • 00:26:31
    billion version and so on again if
  • 00:26:32
    you're working on your local if you have
  • 00:26:34
    like let's say 8GB GPU you can you know
  • 00:26:37
    load Llama 3 8 billion version itself so
  • 00:26:39
    it's only about 4.7 GB but this is how
  • 00:26:42
    this works let's just uh you know do a
  • 00:26:44
    quick recap of what we have done so the
  • 00:26:46
    first step we have imported all the
  • 00:26:48
    dependencies that we need so we have
  • 00:26:50
    this Ollama in order to load the LLM unstructured
  • 00:26:52
    file loader to load the PDF
  • 00:26:53
    file FAISS in order to do the
  • 00:26:56
    similarity search of the vectors and
  • 00:26:57
    then again embeddings used in order to
  • 00:26:59
    convert the text into Vector embeddings
  • 00:27:01
    character text splitter is used to chunk
  • 00:27:04
    your PDF documents content into smaller
  • 00:27:07
    chunks and then retrieval QA where we
  • 00:27:09
    will pass the llm as well as the
  • 00:27:11
    knowledge base which is the vector
  • 00:27:12
    embedding that we have stored then we are
  • 00:27:15
    loading the LLM from Ollama the Llama 3
  • 00:27:18
    instruct model that we have pulled with
  • 00:27:19
    some temperature parameter zero so that
  • 00:27:21
    we don't want any uh larger Randomness
  • 00:27:25
    to the model so and then we are loading
  • 00:27:27
    this using the UnstructuredFileLoader
  • 00:27:29
    and then using text splitter text chunks
  • 00:27:31
    and so on and then we are loading the
  • 00:27:32
    embedding
  • 00:27:34
    model and we are converting this text
  • 00:27:36
    documents uh into vector embeddings and
  • 00:27:39
    storing this in this FAISS thing for
  • 00:27:41
    similarity search and finally we have this
  • 00:27:44
    retrieval QA chain passing the llm and the
  • 00:27:46
    knowledge base to it and then we can
  • 00:27:48
    ask this question by using QA chain.
  • 00:27:50
    invoke so I hope everyone is clear about
  • 00:27:53
    this and please try and see how this
  • 00:27:55
    code works so that is all from my side
  • 00:27:57
    and I'll see you in the next upload thanks for
  • 00:27:59
    watching
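
Putting the whole walkthrough together, a hedged end-to-end sketch of the pipeline described above (the PDF filename is a placeholder; it assumes llama3:instruct has already been pulled with Ollama):

    from langchain_community.llms import Ollama
    from langchain_community.document_loaders import UnstructuredFileLoader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS
    from langchain.chains import RetrievalQA

    # 1. load the locally hosted Llama 3 instruct model
    llm = Ollama(model="llama3:instruct", temperature=0)

    # 2. read the PDF and split it into overlapping chunks
    documents = UnstructuredFileLoader("attention_is_all_you_need.pdf").load()
    splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
    text_chunks = splitter.split_documents(documents)

    # 3. embed the chunks and build the FAISS knowledge base
    embeddings = HuggingFaceEmbeddings()
    knowledge_base = FAISS.from_documents(text_chunks, embeddings)

    # 4. wire the LLM to the retriever and ask a question
    qa_chain = RetrievalQA.from_chain_type(llm, retriever=knowledge_base.as_retriever())
    print(qa_chain.invoke({"query": "What is this document about?"})["result"])
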
Tags
  • Llama 3
  • RAG
  • retrieval augmented generation
  • local hosting
  • vector embeddings
  • LangChain
  • Ollama
  • FAISS
  • document processing
  • LLM