Ollama Llama 3 - RAG: How to create a local RAG system with Llama 3 using Ollama

00:28:00
https://www.youtube.com/watch?v=WfFpeBNfaeQ

Summary

TL;DR: In this video, Sidan shows how to host a Llama 3 model locally and perform retrieval augmented generation (RAG) using the Ollama framework. The process involves pulling the Llama 3 model onto a local system and preparing it with several tools, including LangChain for both the language model and document management. The tutorial demonstrates how to load a document, split it into manageable chunks, convert those chunks into vector embeddings with a Hugging Face model, and store them in FAISS for similarity search. It also covers installing the necessary dependencies and the coding steps needed to implement RAG, so queries are answered from a specific document instead of the model's trained data alone. Finally, it explains how a question-answering chain combines the LLM with the retrieved content to respond accurately to questions about the document.

Takeaways

  • 🤖 Use the Llama 3 model for local LLM hosting.
  • 📄 Implement RAG to answer document-specific queries.
  • 🔧 Utilize Ollama to pull and configure models.
  • 🛠️ Use Python libraries like LangChain and FAISS.
  • 🔍 Perform similarity searches with document embeddings.
  • 💾 Manage document content with character text splitting.
  • ⚙️ Convert text to vector embeddings with Hugging Face.
  • 🔗 Chain together steps for efficient document QA.
  • 🌐 Explore alternatives like GPT models with LangChain.
  • 📚 Learn the process from model setup to query answering.

Timeline

  • 00:00:00 - 00:05:00

    In this video, the presenter explains the concept of RAG (Retrieval Augmented Generation): using a Llama 3 model hosted with Ollama to answer questions from a specific document rather than from the model's entire training corpus. The video covers how to pull the Llama 3 model with Ollama and host it on the local system, as sketched below.
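
    For reference, the pull-and-host step from a terminal looks roughly like this (a sketch assuming Ollama is already installed; check the Ollama library page for the exact model tags):

        ollama pull llama3:instruct   # 8B instruct variant used in the video
        ollama pull llama3:8b         # base 8B variant, pulled for comparison
        ollama list                   # confirm which models are available locally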

  • 00:05:00 - 00:10:00

    The next step is setting up the necessary libraries and dependencies, including LangChain, FAISS for similarity search, and Transformers for the embedding models. The presenter notes potential CUDA version issues when choosing between the CPU and GPU builds of FAISS and stresses picking the right libraries for text processing and vector embeddings; an install sketch follows.
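
    A hedged install sketch matching the libraries named here (pin versions to whatever the shared notebook specifies):

        pip install langchain langchain-community
        pip install faiss-cpu                  # faiss-gpu is the alternative if your CUDA setup cooperates
        pip install "unstructured[pdf]"        # PDF support for the unstructured loader
        pip install transformers sentence-transformers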

  • 00:10:00 - 00:15:00

    The process then involves loading the Llama 3 model, creating document chunks, and converting the text chunks into vector embeddings. This includes setting model parameters such as temperature, which controls response randomness, and using character text splitting to produce manageable chunks for embedding (see the sketch below).
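
    A minimal sketch of this step, assuming the llama3:instruct model has already been pulled and that the PDF sits next to the notebook (the filename attention_is_all_you_need.pdf is a placeholder):

        from langchain_community.llms import Ollama
        from langchain_community.document_loaders import UnstructuredFileLoader
        from langchain.text_splitter import CharacterTextSplitter

        # load the locally hosted model; temperature=0 minimizes randomness in answers
        llm = Ollama(model="llama3:instruct", temperature=0)

        # read the PDF into LangChain Document objects
        loader = UnstructuredFileLoader("attention_is_all_you_need.pdf")
        documents = loader.load()

        # split on newlines into ~1000-character chunks with 200 characters of overlap
        text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
        text_chunks = text_splitter.split_documents(documents)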

  • 00:15:00 - 00:20:00

    After embedding the text, the presenter shows how to build a knowledge base of vector embeddings for similarity search. The RetrievalQA chain is then set up with LangChain, tying the LLM to the vector store so that questions are converted into embeddings, relevant chunks are retrieved, and answers are generated; a sketch follows.
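
    Continuing the sketch (HuggingFaceEmbeddings() with no arguments falls back to a default sentence-transformers model, as in the video):

        from langchain_community.embeddings import HuggingFaceEmbeddings
        from langchain_community.vectorstores import FAISS
        from langchain.chains import RetrievalQA

        # embed the chunks and build an in-memory FAISS index as the knowledge base
        embeddings = HuggingFaceEmbeddings()
        knowledge_base = FAISS.from_documents(text_chunks, embeddings)

        # tie the LLM to the vector store so retrieved chunks are passed as context
        qa_chain = RetrievalQA.from_chain_type(llm, retriever=knowledge_base.as_retriever())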

  • 00:20:00 - 00:28:00

    Finally, the presentation demonstrates querying the system with questions about the document, showing how responses are generated by the Llama 3 model (example below). A recap covers the entire process: model loading, PDF processing, text chunking, embedding, and generating Q&A responses using vector similarity and the LLM.
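
    Querying the finished chain might look like this; the generated answer comes back under the "result" key:

        question = "What is this document about?"
        response = qa_chain.invoke({"query": question})
        print(response["result"])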

Video Q&A

  • What is RAG?

    RAG stands for Retrieval Augmented Generation, where LLMs answer queries based on content from a specific document rather than their entire training corpus.

  • How can Llama 3 be hosted locally?

    By using the Ollama framework to pull and host the Llama 3 model on a local system.

  • What are the prerequisites for working with Llama 3 locally?

    You must install Ollama and ensure your system meets the GPU requirements for the model version you choose.

  • Which programming libraries are used in the tutorial?

    The tutorial uses Python with libraries like LangChain, FAISS, Hugging Face embeddings, and Unstructured.

  • What is the purpose of character text splitting?

    Character text splitting is used to divide document content into manageable chunks for processing in RAG.
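
    A small sketch of the splitter, using the parameters quoted in the video (split_documents works on Document objects; split_text is the variant for plain strings, both mentioned in the tutorial):

        from langchain.text_splitter import CharacterTextSplitter

        text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
        text_chunks = text_splitter.split_documents(documents)  # `documents` from the loader step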

  • Can different versions of Llama or Gemma be used?

    Yes, different versions of Llama or Gemma models can be used depending on the system's GPU capabilities.

  • Why is embedding conversion necessary?

    Embedding conversion allows document text and queries to be transformed into vectors for similarity searches.
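
    For instance, a query string can be embedded directly (embed_query is the single-string entry point; the vector length depends on the underlying model, e.g. 768 for the default):

        from langchain_community.embeddings import HuggingFaceEmbeddings

        embeddings = HuggingFaceEmbeddings()  # default sentence-transformers model
        vector = embeddings.embed_query("What is machine translation?")
        print(len(vector))  # dimensionality of the embedding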

  • What is FAISS used for in this tutorial?

    FAISS is employed for performing similarity searches over vector embeddings of text chunks.
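
    The retrieval half can also be exercised on its own (a sketch; knowledge_base is the FAISS index built from the document chunks):

        # return the k chunks whose embeddings are closest to the query's embedding
        docs = knowledge_base.similarity_search("machine translation", k=3)
        for doc in docs:
            print(doc.page_content[:200])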

  • What does 'invoke' do in this code?

    'Invoke' runs the RetrievalQA chain to produce an answer from the LLM based on the retrieved content.
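
    For example (qa_chain as assembled in the tutorial; the returned dict echoes the query alongside the generated result):

        response = qa_chain.invoke({"query": "What architecture does the paper propose?"})
        print(response["result"])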

  • What alternatives to Llama can be used for the LLM in this setup?

    Alternatives like GPT models available through LangChain can be used as the LLM instead of Llama.
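
    A sketch of that swap, assuming an OpenAI API key is set in the environment and the langchain-openai package is installed; everything downstream of the LLM stays the same:

        from langchain_openai import ChatOpenAI

        # drop-in replacement for the local Ollama LLM
        llm = ChatOpenAI(model="gpt-4o", temperature=0)
        qa_chain = RetrievalQA.from_chain_type(llm, retriever=knowledge_base.as_retriever())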

Subtitles
  • 00:00:01
    hello everyone I'm Sidan in this video
  • 00:00:03
    let's host a Llama 3 model on local using
  • 00:00:06
    Ollama and using this LLM we are going
  • 00:00:08
    to perform retrieval augmented
  • 00:00:10
    generation shortly known as RAG so RAG
  • 00:00:13
    is basically when you want your llm to
  • 00:00:16
    answer from a context window so let's
  • 00:00:18
    say that you have a document and you
  • 00:00:19
    want the llm to answer based on the
  • 00:00:22
    content that's present in this document
  • 00:00:24
    instead of the large corpus of data it
  • 00:00:26
    has been trained on so this is what RAG
  • 00:00:28
    means and in this video we are going to
  • 00:00:30
    do all this locally by using Ollama
  • 00:00:32
    using which we will pull our Llama 3
  • 00:00:34
    model and and we will host it in our
  • 00:00:35
    local system so this will be the agenda
  • 00:00:38
    for today's video and let's get started
  • 00:00:40
    so the first step is to pull the Llama 3
  • 00:00:44
    model so for that first you need to
  • 00:00:46
    install Ollama so I've already made a video
  • 00:00:48
    on how to get started with this Ollama
  • 00:00:50
    thing uh you can go through that video
  • 00:00:53
    if you want I'll give the link of that
  • 00:00:55
    video in this video description you can
  • 00:00:56
    check that out so next Once you are done
  • 00:00:58
    with that go to Google search Ollama Llama 3
  • 00:01:02
    so if you want to download you can just
  • 00:01:03
    download from this page as well so the
  • 00:01:05
    download process is pretty simple for uh
  • 00:01:08
    Mac Windows as well as Linux it's like
  • 00:01:10
    not tough at all so once we are done
  • 00:01:13
    with the installation now we can pull
  • 00:01:14
    the model so here you can find the
  • 00:01:16
    different models that's present over
  • 00:01:18
    here so you can search the models like
  • 00:01:19
    Gemma Llama 3 and so on so our model of
  • 00:01:22
    interest is Llama 3 and even if you want
  • 00:01:24
    to use a Gemma let's say that you have a
  • 00:01:26
    machine that doesn't have like that
  • 00:01:27
    large of a GPU so for example my my
  • 00:01:30
    local machine only has like a 4 GB GPU
  • 00:01:33
    because it's an NVIDIA GTX 1650 so it
  • 00:01:37
    only comes with 4 GB of GPU RAM so here
  • 00:01:40
    I'm currently using an EC2 instance with
  • 00:01:42
    16GB GPU which is uh T4 Tensor Core
  • 00:01:46
    Tesla GPU so again so I'm using a Llama 3
  • 00:01:49
    model so if not you can just use a Gemma
  • 00:01:51
    model as well so that's the main thing
  • 00:01:53
    so let's say that we have searched Llama
  • 00:01:55
    3 in Ollama and you have this different
  • 00:01:58
    version so you have 70 billion version
  • 00:01:59
    you can also see the size over here it's
  • 00:02:01
    40 GB and the 8 billion version is 4.7
  • 00:02:04
    GB and you can also see it's Q4 which is
  • 00:02:07
    4-bit quantization so the weights or the
  • 00:02:10
    parameters of the model are reduced to a
  • 00:02:12
    precision of 4 bits so you know the
  • 00:02:15
    purpose of doing this is your model's
  • 00:02:18
    size decreases but the performance or
  • 00:02:21
    the efficiency isn't affected so
  • 00:02:23
    much so of course there will be a drop
  • 00:02:25
    in the performance but it's not like
  • 00:02:26
    that much so that's about quantization
  • 00:02:28
    so for RAG we can use the instruct version
  • 00:02:32
    instead of the base version so I'll
  • 00:02:34
    click this instruct and in order to pull
  • 00:02:36
    this instruct version you have to use
  • 00:02:37
    this particular command so this is ollama
  • 00:02:40
    run so this is used when you have
  • 00:02:43
    already pulled a model and then let's
  • 00:02:45
    say you want to run this model so you
  • 00:02:47
    would use that so for the first step you
  • 00:02:49
    need to pull it so I'll come to my
  • 00:02:51
    terminal and if you are in Windows you
  • 00:02:54
    probably would have installed this Ollama
  • 00:02:55
    and after that you can access this from
  • 00:02:57
    the terminal from your windows
  • 00:02:59
    Powershell so Linux and Mac users also
  • 00:03:01
    can access this from the terminal you
  • 00:03:02
    can say ollama space help so this will
  • 00:03:05
    display like all the commands that you
  • 00:03:07
    have over here so serve create Show run
  • 00:03:09
    pull and so on so let's say that uh I
  • 00:03:12
    can just type
  • 00:03:14
    in list and this will list all the
  • 00:03:16
    models that I have pulled already so
  • 00:03:19
    here you can see that I have Gemma 7
  • 00:03:20
    billion model Llama 3 instruct this
  • 00:03:23
    particular model so let's say that I
  • 00:03:24
    want to pull just the base version 8
  • 00:03:27
    billion so what I should do is I should
  • 00:03:29
    come to the
  • 00:03:30
    terminal and say ollama pull so look at the
  • 00:03:35
    name over here which is llama3 colon 8
  • 00:03:38
    billion so llama3:8b so
  • 00:03:42
    when I run this so it's going to load
  • 00:03:45
    the model so it's going to pull the
  • 00:03:46
    model from the repository so that's the
  • 00:03:49
    first step but again we are not going to
  • 00:03:51
    work with 8 billion so we are going to
  • 00:03:53
    work with 8 billion instruct right so
  • 00:03:56
    I'll clear all this and list this again
  • 00:04:00
    so now I have 8 billion 8 billion
  • 00:04:01
    instruct and Gemma 7B so this is how you
  • 00:04:03
    can pull the model if you want you can
  • 00:04:05
    uh you know use again Gemma as well so Gemma
  • 00:04:07
    comes with 7 billion version and 2
  • 00:04:09
    billion version so you can check that
  • 00:04:10
    out as well so that's the other
  • 00:04:12
    thing
  • 00:04:15
    okay so now we can go into the coding
  • 00:04:18
    aspects where we will load this model
  • 00:04:21
    and how to do the retrieval augmented
  • 00:04:22
    Generation by passing a document and
  • 00:04:24
    asking a question to the llm so that
  • 00:04:26
    will be the next step so for this you
  • 00:04:28
    need a bunch of libraries so we are
  • 00:04:30
    going to install LangChain and LangChain
  • 00:04:33
    community so the versions are given over
  • 00:04:35
    here so I'll share you this particular
  • 00:04:36
    notebook and then we have this FAISS
  • 00:04:39
    which stands for Facebook AI similarity
  • 00:04:42
    search so this was developed by
  • 00:04:43
    Facebook's AI research team so this is
  • 00:04:46
    just to find like a similarity search of
  • 00:04:48
    the vectors so here we will convert all
  • 00:04:51
    the content of the document into vector
  • 00:04:53
    embeddings and the query will also be
  • 00:04:54
    converted into vector embeddings and we
  • 00:04:56
    do a similarity search so for that we
  • 00:04:58
    need this and FAISS comes in two versions
  • 00:05:00
    one is CPU and the other one is GPU and
  • 00:05:04
    because of some CUDA version errors so
  • 00:05:06
    GPU is giving me some issues so I'm
  • 00:05:08
    going with CPU and even if you are in
  • 00:05:10
    Python 3.10 you probably shouldn't be
  • 00:05:12
    facing any issues but if you're in 3.12
  • 00:05:15
    so sometimes like it's causing issues
  • 00:05:17
    with the CUDA and so on so you can
  • 00:05:19
    figure out maybe but if not you can just
  • 00:05:21
    use the CPU version for now so we have
  • 00:05:23
    the faiss-cpu and then we have the
  • 00:05:25
    unstructured so this unstructured
  • 00:05:27
    library is used to load the content of
  • 00:05:29
    the PDF file and then we have this PDF
  • 00:05:32
    uh you know related code for this
  • 00:05:34
    unstructured so that's comes with this
  • 00:05:36
    unstructured uh square bracket PDF so
  • 00:05:38
    that model will be downloaded and then
  • 00:05:40
    you have Transformers so we are
  • 00:05:43
    downloading the Transformers and the
  • 00:05:44
    sentence Transformers so this is used to
  • 00:05:46
    load uh the embedding model so we have
  • 00:05:49
    the sentence embedding models uh so in
  • 00:05:51
    order to load that we are going to use
  • 00:05:53
    the Transformers and sentence
  • 00:05:54
    Transformer Library so these are all the
  • 00:05:56
    dependencies that we need so the next
  • 00:05:58
    step is once you have have installed all
  • 00:06:00
    this libraries I've already installed so
  • 00:06:02
    I'm just have commented it so if not uh
  • 00:06:05
    if you haven't already installed it you
  • 00:06:06
    can uncomment this and work on it so the
  • 00:06:08
    next step will be importing the
  • 00:06:10
    dependencies so here I'll
  • 00:06:15
    say importing the
  • 00:06:21
    dependencies so I'll say import os I'll
  • 00:06:24
    tell you why we need all these things or
  • 00:06:28
    the next thing is from langchain_
  • 00:06:35
    community.
  • 00:06:37
    llms import
  • 00:06:40
    Ollama and
  • 00:06:43
    from langchain.
  • 00:06:46
    document_
  • 00:06:52
    loaders
  • 00:06:56
    import UnstructuredFileLoader
  • 00:07:06
    from langchain.
  • 00:07:09
    embeddings I'll explain you the purpose
  • 00:07:11
    of all this libraries and functions in a
  • 00:07:13
    minute once I'm done with this FAISS so all
  • 00:07:16
    should be
  • 00:07:17
    uppercase from lang
  • 00:07:21
    chain dot text
  • 00:07:26
    splitter
  • 00:07:28
    import Character
  • 00:07:30
    Text
  • 00:07:34
    Splitter and finally we
  • 00:07:37
    need langchain.
  • 00:07:40
    chains import Retrieval
  • 00:07:45
    QA okay so these are the things let's
  • 00:07:47
    understand this so we need uh one
  • 00:07:50
    second this should be
  • 00:07:57
    import os so os is used in order to uh you
  • 00:08:02
    know access some of the files that's
  • 00:08:04
    present in our system so for that we use
  • 00:08:06
    this and then we have this Ollama Ollama is
  • 00:08:08
    used in order to access the model that
  • 00:08:11
    we have just now pulled which is Llama 3
  • 00:08:13
    or Gemma model and so on and then we
  • 00:08:15
    have this UnstructuredFileLoader in
  • 00:08:16
    order to load our PDF file and then we
  • 00:08:18
    have this langchain.embeddings so from
  • 00:08:21
    langchain.embeddings we are importing
  • 00:08:24
    FAISS so okay so this should be not
  • 00:08:27
    embeddings but
  • 00:08:29
    okay let me try this
  • 00:08:32
    again
  • 00:08:40
    from langchain
  • 00:08:48
    community dot vector
  • 00:08:54
    stores import FAISS so from langchain.
  • 00:08:58
    embeddings we have to
  • 00:09:00
    load the Hugging Face
  • 00:09:04
    embeddings so these are the things let's
  • 00:09:07
    run this and see if this is
  • 00:09:19
    working okay so it's worked well so we
  • 00:09:22
    have Ollama in order to access the LLM that
  • 00:09:24
    we have just now pulled and we need this
  • 00:09:26
    unstructured file loader in order to
  • 00:09:28
    read the PDF file and get all the text
  • 00:09:30
    content out of it and then you have this
  • 00:09:32
    FAISS for similarity search of vector
  • 00:09:34
    embeddings and then you have this uh
  • 00:09:37
    langchain.embeddings we are importing
  • 00:09:39
    this Hugging Face embeddings using which
  • 00:09:40
    we will uh you know convert the text
  • 00:09:42
    into Vector embeddings so the next step
  • 00:09:44
    is character text splitter so this is used in
  • 00:09:46
    order to chunk the content that we have
  • 00:09:49
    in the PDF file so we cannot like
  • 00:09:51
    pass this entire text so instead we
  • 00:09:53
    would uh chunk this in a bit by bit
  • 00:09:55
    fashion so once all all these chunks are
  • 00:09:58
    present we will convert those things into a
  • 00:10:00
    vector embedding thing and store it in a
  • 00:10:02
    vector DB or just create a knowledge
  • 00:10:04
    base out of it and later we can use FAISS
  • 00:10:07
    in order to do a similarity search so
  • 00:10:09
    for chunking it we are using this
  • 00:10:10
    character text splitter and then we
  • 00:10:12
    have retrieval QA so there are like
  • 00:10:14
    different chains so there are like
  • 00:10:15
    conversational chains retrieval QA chains
  • 00:10:17
    and so on so our Focus here is just
  • 00:10:19
    document question answering so for that
  • 00:10:21
    uh we can just stick with retrieval QA
  • 00:10:24
    where you can just pass this document
  • 00:10:26
    thing and and pass a question and and
  • 00:10:28
    get a response out of it so this is how
  • 00:10:30
    all the things work so if you are also
  • 00:10:32
    not very sure about the concept of
  • 00:10:34
    retrieval augmented generation I've
  • 00:10:37
    already made a video about that so it's
  • 00:10:38
    like a conceptual video of all the you
  • 00:10:41
    know things that's come together and the
  • 00:10:43
    architecture of it so I'll give that
  • 00:10:45
    video link in the description as well so
  • 00:10:46
    you can check that out so that's all
  • 00:10:49
    about the import statement now let's uh
  • 00:10:51
    load the LLM so I'll say llm is equal to
  • 00:10:57
    Ollama and within that I can
  • 00:11:03
    say so llm is equal to
  • 00:11:08
    Ollama model is equal to so we have to
  • 00:11:11
    name this model so the model name is
  • 00:11:13
    llama3
  • 00:11:16
    colon instruct so this is the 8 billion
  • 00:11:19
    version the other version is 70 billion
  • 00:11:22
    so llama3:
  • 00:11:25
    instruct and the other thing that I'm
  • 00:11:27
    going to need is parameters temperature
  • 00:11:29
    so you can also include other parameters
  • 00:11:31
    as well so temperature is kind of
  • 00:11:33
    determines the randomness of the
  • 00:11:35
    response of the model so when I run this
  • 00:11:37
    this will load the Llama 3 instruct model
  • 00:11:40
    so this won't work if you haven't pulled
  • 00:11:41
    your model already so this will work
  • 00:11:43
    only for the list of models that you
  • 00:11:44
    have pulled and it's like available to
  • 00:11:46
    you so here also you can see the sizes
  • 00:11:49
    of the model so that's the next step so
  • 00:11:51
    we have loaded our Llama 3 instruct again
  • 00:11:53
    if you're working with Gemma right so you
  • 00:11:55
    just have to replace this Llama 3
  • 00:11:57
    instruct with whatever Gemma version that
  • 00:11:59
    you're working on so 2 billion just
  • 00:12:01
    comes with gemma colon 2 billion so you
  • 00:12:03
    can check that out once and you can load
  • 00:12:05
    that so the next step is loading the
  • 00:12:07
    document so here I'll put a
  • 00:12:11
    comment saying
  • 00:12:15
    that sorry so loading the LLM which is
  • 00:12:18
    Llama 3 in this case the next step
  • 00:12:22
    is loading the
  • 00:12:24
    document so for this I can create a
  • 00:12:27
    variable called as loader and loader is
  • 00:12:29
    equal to UnstructuredFileLoader which
  • 00:12:33
    is the function that we have
  • 00:12:36
    imported and within that we can pass the
  • 00:12:39
    name of the file so for this we are
  • 00:12:41
    going to work with a PDF file and that
  • 00:12:44
    PDF file is the paper that's that was
  • 00:12:47
    released on Transformers so I'll copy
  • 00:12:49
    the name of this
  • 00:12:54
    file and paste it over here so both this
  • 00:12:57
    notebook and this file are in the same
  • 00:12:58
    directory so I'm not giving any path to
  • 00:13:00
    this I'm just mentioning the name of
  • 00:13:02
    this file and I can say
  • 00:13:06
    documents is equal to loader.load so
  • 00:13:09
    this will load the content of the PDF in
  • 00:13:12
    in a document data type which is an
  • 00:13:14
    unstructured data type let's run this so
  • 00:13:17
    this will load this content so this may
  • 00:13:19
    take some time depending upon the size
  • 00:13:21
    of the PDF so if you want
  • 00:13:23
    to uh know how big of this PDF file is
  • 00:13:27
    right maybe I can open it and show you
  • 00:13:29
    to
  • 00:13:33
    you so in the
  • 00:13:40
    downloads oh
  • 00:13:42
    sorry this is the PDF file so you can
  • 00:13:45
    see the total number of pages is 11 and
  • 00:13:48
    it has like some content about like the
  • 00:13:50
    transformer architecture so this is the
  • 00:13:52
    paper that was uh you know released by
  • 00:13:54
    Google first basically on the
  • 00:13:56
    Transformer model so we have this
  • 00:13:58
    documents is equal to loader.load so this
  • 00:14:00
    will load the documents and the next
  • 00:14:03
    step is we are going to create chunks
  • 00:14:04
    for this
  • 00:14:06
    document
  • 00:14:08
    create document
  • 00:14:13
    chunks so I'll say text
  • 00:14:16
    splitter is equal
  • 00:14:19
    to character text
  • 00:14:24
    splitter and we can mention like what
  • 00:14:26
    kind of separator needs to be so let's
  • 00:14:28
    say that the separator is slash n which
  • 00:14:31
    is a newline
  • 00:14:32
    separator and then we have this chunk
  • 00:14:36
    size and chunk size let's say that it's
  • 00:14:39
    1000 you can play around with the
  • 00:14:41
    different numbers and overlap is equal
  • 00:14:44
    to
  • 00:14:48
    200 right so here we are creating an
  • 00:14:50
    instance of this character splitter and
  • 00:14:52
    giving the parameters for this the
  • 00:14:54
    separator can be new line the chunk size
  • 00:14:56
    can be 1000 and the overlap is equal to
  • 00:14:58
    200 so wherever it sees a newline
  • 00:15:00
    character it's going to separate those
  • 00:15:02
    into different chunks and again uh the
  • 00:15:04
    maximum number of uh characters that can
  • 00:15:08
    be present in one chunk is 1000 so that's
  • 00:15:10
    what we are mentioning over here and the
  • 00:15:11
    overlap is 200 so that means uh we won't
  • 00:15:15
    chunk it just by the character sizes so
  • 00:15:17
    what happens is when you kind of Chunk
  • 00:15:20
    it like in a in a discrete way the llm
  • 00:15:23
    or the vector embeddings may not have this
  • 00:15:25
    proper context so maybe I'll show you
  • 00:15:28
    show this to you visually so let's say
  • 00:15:30
    that we have this a large paragraph
  • 00:15:33
    about this Transformer thing or let's
  • 00:15:35
    say we have this paragraph right and uh
  • 00:15:37
    when we chunk it let's say
  • 00:15:40
    that H let's take this example so these
  • 00:15:44
    lines are chunked to one chunk and the
  • 00:15:47
    line after that are chunk to a different
  • 00:15:49
    chunk so we don't do that so instead we
  • 00:15:52
    will first make this as the first chunk and
  • 00:15:55
    do a overlap so let's say that uh this
  • 00:15:57
    has like let's say 6 7 lines and the
  • 00:15:59
    second line Second chunk will start from
  • 00:16:01
    this line so there will be a overlap so
  • 00:16:03
    that the context is not lost when we do
  • 00:16:06
    this conversion from
  • 00:16:08
    uh you know text to vector embeddings so
  • 00:16:10
    that's the reason we are mentioning this
  • 00:16:11
    overlap parameter of 200 again you can
  • 00:16:13
    play around with this number so this is
  • 00:16:16
    the instance that we are getting okay so
  • 00:16:19
    this should be chunk
  • 00:16:22
    overlap so we are instantiating
  • 00:16:24
    this CharacterTextSplitter with all this
  • 00:16:26
    parameters now we can pass our document
  • 00:16:29
    which is basically the content that the
  • 00:16:31
    unstructured file loader has loaded so
  • 00:16:34
    here I can say you can print and show
  • 00:16:37
    this to you so you can see so this is
  • 00:16:39
    the entire text that has been derived
  • 00:16:42
    from the PDF using this unstructured
  • 00:16:44
    file loader thing and it's form of it's
  • 00:16:46
    in the type of document thing page
  • 00:16:48
    content all this you know entire content
  • 00:16:51
    so I'll delete this for now so here we
  • 00:16:54
    can say text
  • 00:16:56
    chunks is equal to
  • 00:16:59
    text
  • 00:17:02
    splitter dot split_documents so this is
  • 00:17:05
    in documents format so we can use split
  • 00:17:07
    documents sometimes you can also you
  • 00:17:09
    know chunk this as just as text string
  • 00:17:12
    data types so in those cases we will use
  • 00:17:14
    split text so here we are saying split
  • 00:17:17
    documents and within that we can pass
  • 00:17:19
    the documents variable that we have
  • 00:17:20
    created over here so let's run
  • 00:17:25
    this so This step is done so first we
  • 00:17:28
    have loaded the document using
  • 00:17:29
    unstructured file loader and then uh we
  • 00:17:32
    are splitting this uh documents content
  • 00:17:34
    into different
  • 00:17:36
    chunks so the next step is loading the
  • 00:17:39
    embedding model so here I'll
  • 00:17:44
    say loading the vector embedding model
  • 00:17:47
    so this is used to convert this text
  • 00:17:49
    into Vector
  • 00:17:51
    embeddings so let's create a variable
  • 00:17:54
    called as embeddings is equal to Hugging
  • 00:17:56
    Face embeddings so within this
  • 00:17:59
    Hugging Face embeddings you can use specific
  • 00:18:01
    models that are present in the Hugging Face
  • 00:18:03
    library sorry the Hugging Face library
  • 00:18:05
    or Hugging Face Hub but you can just go
  • 00:18:07
    with the default one by just mentioning
  • 00:18:09
    HuggingFaceEmbeddings so let's run this so
  • 00:18:12
    similar to loading the LLM this will load
  • 00:18:14
    the embedding
  • 00:18:16
    model there are some warnings but you
  • 00:18:18
    can ignore that for now it's about like
  • 00:18:20
    resuming the download so the next step
  • 00:18:23
    is converting all this text content into
  • 00:18:27
    Vector embeddings using this embeddings
  • 00:18:29
    model so here I'll
  • 00:18:33
    say knowledge
  • 00:18:39
    base so knowledge base is equal to
  • 00:18:45
    FAISS dot from_
  • 00:18:52
    documents text_chunks comma
  • 00:18:55
    embeddings so we are going to convert
  • 00:18:58
    all these
  • 00:19:00
    uh text chunks into vector embeddings
  • 00:19:02
    and later this will be used for
  • 00:19:04
    similarity search from this FAISS uh you
  • 00:19:07
    know the similarity search Library so
  • 00:19:09
    this will be the next
  • 00:19:12
    step so we are almost done with this
  • 00:19:15
    only like there are few things left over
  • 00:19:17
    here so now I'll
  • 00:19:23
    say retrieval QA chain so let's say
  • 00:19:28
    that QA chain is equal to retrieval QA
  • 00:19:31
    the chain that we have
  • 00:19:34
    imported dot from_chain_
  • 00:19:40
    type and here I'll say llm
  • 00:19:47
    comma retriever is equal
  • 00:19:52
    to knowledge base
  • 00:19:55
    dot as retriever
  • 00:20:05
    okay so we are creating a QA chain a
  • 00:20:09
    question answering chain okay knowledge
  • 00:20:12
    base must have
  • 00:20:15
    a NameError
  • 00:20:18
    okay let's rename this and run this
  • 00:20:22
    again H we are creating a variable
  • 00:20:24
    called as QA chain and using this
  • 00:20:26
    retrieval QA which is used for this
  • 00:20:28
    question answering task coupled with
  • 00:20:30
    this uh you know vector DB and LLMs and
  • 00:20:33
    from chain type we are passing a llm and
  • 00:20:35
    the retriever retriever is basically
  • 00:20:37
    your knowledge base as retriever so what
  • 00:20:40
    basically happens in this rag thing is
  • 00:20:42
    right so you have a document so in this
  • 00:20:44
    case the attention is all you need paper
  • 00:20:46
    is the document right uh this entire
  • 00:20:49
    content will be read and this will be
  • 00:20:51
    stored in a vector database so if you want
  • 00:20:53
    to use an actual database you can go
  • 00:20:56
    with Chroma DB Lance DB and other vector
  • 00:20:58
    DBs that are present this FAISS just
  • 00:21:00
    runs in the memory so let's say that we
  • 00:21:02
    are considering a vector database which
  • 00:21:04
    stores these vector embeddings so all
  • 00:21:06
    these contents will be converted into
  • 00:21:08
    vector embeddings and stored in the
  • 00:21:10
    vector DB and for each of the text let's
  • 00:21:13
    say that a particular sentence has been
  • 00:21:14
    converted into a vector embedding so
  • 00:21:18
    it's nothing but some set of numbers
  • 00:21:20
    that represent this text as vectors and
  • 00:21:22
    there will be this corresponding IDs as
  • 00:21:24
    well to extract this corresponding text
  • 00:21:26
    from those vectors so now you have
  • 00:21:28
    vector DB that has the vector embeddings
  • 00:21:30
    of this entire document and now the user
  • 00:21:32
    asks a question so let's say that the
  • 00:21:33
    user asks a question about this Hardware
  • 00:21:35
    and Schedule so now what happens is you
  • 00:21:37
    would convert that query as well into a
  • 00:21:41
    vector embedding and now you search the
  • 00:21:44
    vectors that's similar to this question
  • 00:21:46
    to this entire vectors of this document
  • 00:21:48
    so let's say that in this case right so
  • 00:21:51
    when the user asks uh you know something
  • 00:21:54
    about let's
  • 00:21:57
    say uh the machine translation thing so
  • 00:21:59
    the question is about machine
  • 00:22:00
    translation so it will convert that into
  • 00:22:03
    a vector and search that Vector with
  • 00:22:05
    this entire document's vector embeddings
  • 00:22:07
    and let's say that it will extract this
  • 00:22:09
    particular text so now you have the
  • 00:22:12
    question and the content in which the
  • 00:22:15
    answer may be present so now you sent
  • 00:22:17
    this question in text format as well as
  • 00:22:19
    the answer in text format to the llm and
  • 00:22:22
    the llm will frame the answer so this is
  • 00:22:24
    all that's happening it's not like the
  • 00:22:26
    llm kind of feeds from the vector
  • 00:22:28
    embeddings so it doesn't it doesn't kind
  • 00:22:30
    of feed on the vector embeddings so what
  • 00:22:32
    happens is the vector embeddings is used
  • 00:22:33
    in order to do the similarity search and
  • 00:22:35
    find the similar vectors only the
  • 00:22:37
    question and the relevant content that
  • 00:22:39
    has been retrieved from the vector
  • 00:22:42
    database is sent to the model so it's
  • 00:22:44
    basically like the question will be what
  • 00:22:46
    is machine translation mentioned in this
  • 00:22:48
    paper so that question and this
  • 00:22:49
    particular text will be sent to the
  • 00:22:51
    model and the model or the llm in this
  • 00:22:53
    case the Llama 3 model will frame
  • 00:22:56
    the answer for this question and it will
  • 00:22:57
    send this to so this is what happens
  • 00:22:59
    it's not like the entire content is fed
  • 00:23:01
    to the llm so that's not how this works
  • 00:23:03
    so this is about the architecture of
  • 00:23:05
    this Rag and this is what I've explained
  • 00:23:07
    in my conceptual video as well you can
  • 00:23:09
    check that out if like you want to
  • 00:23:11
    understand like this concept like more
  • 00:23:13
    clearly so this is about this QA chain next
  • 00:23:16
    let's let's ask a
  • 00:23:18
    question so let's say that this question
  • 00:23:20
    is equal
  • 00:23:22
    to let's ask a simple question of what
  • 00:23:24
    is this document about
  • 00:23:30
    and then you have to say
  • 00:23:34
    response is equal to qa_chain the
  • 00:23:37
    retrieval QA chain that we have created
  • 00:23:39
    dot invoke so you have to use this invoke
  • 00:23:42
    and within this you can pass this query
  • 00:23:45
    as a
  • 00:23:49
    dictionary
  • 00:23:50
    query colon
  • 00:23:55
    question now let's print this response
  • 00:23:57
    the
  • 00:23:59
    final answer will be present within this
  • 00:24:02
    response and within the key called as
  • 00:24:06
    result so you can check that okay so
  • 00:24:10
    let's run this so here we are creating a
  • 00:24:12
    QA chain and passing the llm and the
  • 00:24:14
    knowledge base the vector embeddings
  • 00:24:16
    that stored and then we are passing a
  • 00:24:17
    question so this question will be
  • 00:24:19
    converted into Vector embedding and we
  • 00:24:20
    will retrieve the relevant information
  • 00:24:22
    and then pass this to llm so this is
  • 00:24:24
    what happens in a QA chain and we are
  • 00:24:26
    using this QA chain. invoke to invoke
  • 00:24:28
    this chain which has this llm and the
  • 00:24:30
    knowledge base put this query and get
  • 00:24:33
    the relevant response from this so this
  • 00:24:35
    is how this rag for this retrieval QA
  • 00:24:38
    works and let's
  • 00:24:40
    see so the llm of choice in this case is
  • 00:24:44
    you know that it's Llama 3 but again as I
  • 00:24:46
    said you can use Gemma or if you want to
  • 00:24:48
    use uh GPT-4o even you can use the Lang
  • 00:24:52
    chain version of this GPT-4o so you can
  • 00:24:54
    access the GPT-4o API from LangChain itself so
  • 00:24:58
    that API thing you can pass to this LLM
  • 00:25:00
    and then pass this in turn to this
  • 00:25:02
    retrieval QA so this answer says that
  • 00:25:04
    this document appears to be a research
  • 00:25:06
    paper on article about the Transformer
  • 00:25:08
    sequence transduction model and so on so
  • 00:25:10
    it was able to capture that information
  • 00:25:12
    properly maybe let's ask another
  • 00:25:14
    question of again a similar question of
  • 00:25:17
    what is uh you know the architecture
  • 00:25:19
    discussed in this model
  • 00:25:35
    let's ask
  • 00:25:36
    this so this the second turn uh kind of
  • 00:25:40
    took about 4 or 5 seconds it won't
  • 00:25:42
    take like that much of a time so the
  • 00:25:45
    first time the model loads and all the
  • 00:25:47
    things happen so it might take a some
  • 00:25:49
    time but the later questions that you
  • 00:25:51
    ask won't take that much of a time again
  • 00:25:53
    it also depends on the GPU that you are
  • 00:25:55
    using so the answer is the architecture
  • 00:25:57
    discussed in the model and so on again
  • 00:25:59
    so we can't assure that all the
  • 00:26:01
    questions that we ask kind of gives you
  • 00:26:03
    accurate response because this is fairly
  • 00:26:05
    a smaller model so Llama 3 8 billion
  • 00:26:08
    version so the larger version is 70
  • 00:26:10
    billion again of course we can't load it
  • 00:26:12
    because of the GPU limitation but if you
  • 00:26:14
    are working in a company and they have
  • 00:26:17
    that machine with them or the cloud
  • 00:26:19
    instance with them then you can use like
  • 00:26:20
    the 70 billion version but all this
  • 00:26:23
    steps Remains the Same so you can you
  • 00:26:25
    know go to an EC2 uh you know instead
  • 00:26:28
    of 8 billion you can kind of load the 70
  • 00:26:31
    billion version and so on again if
  • 00:26:32
    you're working on your local if you have
  • 00:26:34
    like let's say 8GB GPU you can you know
  • 00:26:37
    load Llama 3 8 billion version itself so
  • 00:26:39
    it's only about 4.7 GB but this is how
  • 00:26:42
    this works let's just uh you know do a
  • 00:26:44
    quick recap of what we have done so the
  • 00:26:46
    first step we have imported all the
  • 00:26:48
    dependencies that we need so we have
  • 00:26:50
    this Ollama in order to load the LLM unstructured
  • 00:26:52
    file loader to load the PDF
  • 00:26:53
    file FAISS in order to do the
  • 00:26:56
    similarity search of the vectors and
  • 00:26:57
    then again embeddings used in order to
  • 00:26:59
    convert the text into Vector embeddings
  • 00:27:01
    character text splitter is used to chunk
  • 00:27:04
    your PDF documents content into smaller
  • 00:27:07
    chunks and then retrieval QA where we
  • 00:27:09
    will pass the llm as well as the
  • 00:27:11
    knowledge base which is the vector
  • 00:27:12
    embedding that we have stored then we are
  • 00:27:15
    loading the LLM from Ollama the Llama 3
  • 00:27:18
    instruct model that we have pulled with
  • 00:27:19
    some temperature parameter zero so that
  • 00:27:21
    we don't want any uh larger Randomness
  • 00:27:25
    to the model so and then we are loading
  • 00:27:27
    this using the UnstructuredFileLoader
  • 00:27:29
    and then using text splitter text chunks
  • 00:27:31
    and so on and then we are loading the
  • 00:27:32
    embedding
  • 00:27:34
    model and we are converting this text
  • 00:27:36
    documents uh into vector embeddings and
  • 00:27:39
    storing this in this FAISS thing for
  • 00:27:41
    similarity search and finally we have this
  • 00:27:44
    retrieval QA chain passing the llm and the
  • 00:27:46
    knowledge base to it and then we can
  • 00:27:48
    ask this question by using QA chain.
  • 00:27:50
    invoke so I hope everyone is clear about
  • 00:27:53
    this and please try and see how this
  • 00:27:55
    code works so that is all from my side
  • 00:27:57
    and I'll see you in the next upload thanks for
  • 00:27:59
    watching
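
Putting the whole walkthrough together, a hedged end-to-end sketch of the pipeline described above (the PDF filename is a placeholder; it assumes llama3:instruct has already been pulled with Ollama):

    from langchain_community.llms import Ollama
    from langchain_community.document_loaders import UnstructuredFileLoader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS
    from langchain.chains import RetrievalQA

    # 1. load the locally hosted Llama 3 instruct model
    llm = Ollama(model="llama3:instruct", temperature=0)

    # 2. read the PDF and split it into overlapping chunks
    documents = UnstructuredFileLoader("attention_is_all_you_need.pdf").load()
    splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
    text_chunks = splitter.split_documents(documents)

    # 3. embed the chunks and build the FAISS knowledge base
    embeddings = HuggingFaceEmbeddings()
    knowledge_base = FAISS.from_documents(text_chunks, embeddings)

    # 4. wire the LLM to the retriever and ask a question
    qa_chain = RetrievalQA.from_chain_type(llm, retriever=knowledge_base.as_retriever())
    print(qa_chain.invoke({"query": "What is this document about?"})["result"])
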
Tags
  • Llama 3
  • RAG
  • retrieval augmented generation
  • local hosting
  • vector embeddings
  • LangChain
  • Ollama
  • FAISS
  • document processing
  • LLM