Hello everyone, I'm Sidan. In this video, let's host a Llama 3 model locally using Ollama, and with this LLM we're going to perform retrieval-augmented generation, known for short as RAG. RAG is basically when you want your LLM to answer from a context window: say you have a document, and you want the LLM to answer based on the content present in that document instead of the large corpus of data it has been trained on. That's what RAG means. In this video we're going to do all of this locally using Ollama, with which we'll pull the Llama 3 model and host it on our local system. That's the agenda for today's video, so let's get started.
The first step is to pull the Llama 3 model, and for that you first need to install Ollama. I've already made a video on how to get started with Ollama; you can go through that if you want, and I'll put the link in the video description.

Once that's done, go to Google and search for "ollama llama 3". If you still need to install Ollama, you can download it from that page as well; the download process is pretty simple for Mac, Windows and Linux. Once the installation is done, we can pull the model. On the Ollama site you can find the different models that are available and search for models like Gemma, Llama 3 and so on. Our model of interest is Llama 3, but you can also use Gemma, say if your machine doesn't have a large GPU. For example, my local machine only has a 4 GB GPU, since it's an NVIDIA GTX 1650 with 4 GB of GPU RAM. Here I'm using an EC2 instance with a 16 GB GPU, a Tesla T4 Tensor Core GPU, so I'm using the Llama 3 model; if that's not an option for you, you can use a Gemma model instead.

When you search for Llama 3 on the Ollama site, you'll see the different versions. There's the 70 billion parameter version, and you can see its size is 40 GB, while the 8 billion version is 4.7 GB. You can also see it's Q4, which means 4-bit quantization: the weights, or parameters, of the model are reduced to a precision of 4 bits. The purpose of doing this is that the model's size decreases while the performance doesn't suffer that much. As a rough sanity check, 8 billion parameters at 4 bits (half a byte) each is about 4 GB, which is why the download is around 4.7 GB. Of course there will be some drop in performance, but it's not that large. So that's quantization.

For RAG we can use the instruct version instead of the base version, so I'll click on instruct, and the page shows the command for this tag. The command shown is "ollama run", which is used when you have already pulled a model and want to run it; as the first step you need to pull it. So I'll come to my terminal. If you're on Windows, once you've installed Ollama you can access it from Windows PowerShell; Linux and Mac users can access it from the terminal as well. You can type "ollama help", and this will display all the commands you have: serve, create, show, run, pull and so on. I can also type "ollama list", which lists all the models I've already pulled; here you can see I have the Gemma 7 billion model and the Llama 3 instruct model. Let's say I want to pull just the base 8 billion version: I'd come to the terminal and type "ollama pull" followed by the name shown on the site, which is llama3:8b. When I run this, it pulls the model from the repository. That's the first step, but again, we're not going to work with the 8B base model; we're going to work with 8B instruct. So I'll clear all this and run list again, and now I have the 8 billion base, the 8 billion instruct and Gemma 7B. That's how you pull a model. If you want, you can use Gemma as well; Gemma comes in a 7 billion and a 2 billion version, so you can check that out too.
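To summarize the terminal side of this, the commands I'm running look roughly like the following; the exact model tags come from the Ollama library page, so double-check them there:

```
ollama help                   # list the available commands (serve, create, show, run, pull, ...)
ollama list                   # show the models already pulled on this machine
ollama pull llama3:8b         # pull the Llama 3 8B base model
ollama pull llama3:instruct   # pull the 8B instruct variant we'll use for RAG
ollama run llama3:instruct    # only needed if you want to chat with it directly in the terminal
```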
Okay, now we can get into the coding side, where we'll load this model and do retrieval-augmented generation by passing a document and asking the LLM a question about it. For this you need a bunch of libraries. We're going to install langchain and langchain-community; the versions are given in the notebook, and I'll share this notebook with you. Then we have FAISS, which stands for Facebook AI Similarity Search; it was developed by Facebook's AI research team and is used to do similarity search over vectors. Here we'll convert all the content of the document into vector embeddings, the query will also be converted into vector embeddings, and we'll do a similarity search between them, so we need FAISS for that. FAISS comes in two versions, one for CPU and one for GPU; because of some CUDA version errors the GPU build was giving me issues, so I'm going with the CPU build. If you're on Python 3.10 you probably won't face any issues, but on some newer Python versions it can cause problems with CUDA and so on; you can try to sort that out, but if not, just use the CPU version for now. So we have faiss-cpu, and then we have unstructured: this library is used to load the content of the PDF file, and the PDF-related pieces come with the "unstructured[pdf]" extra, so that will be downloaded as well. Then we have transformers and sentence-transformers; these are used to load the embedding model, the sentence embedding model, so to load that we'll use the transformers and sentence-transformers libraries. Those are all the dependencies we need. I've already installed them, so I've just commented out the install cell; if you haven't installed them yet, you can uncomment it and run it.
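In the notebook the commented install cell looks roughly like this; it's a sketch, and I'm not pinning exact versions here, so use the ones shown in the shared notebook:

```python
# Uncomment and run once if the libraries aren't installed yet.
# !pip install langchain langchain-community
# !pip install faiss-cpu                              # CPU build of FAISS (GPU build can hit CUDA version issues)
# !pip install "unstructured[pdf]"                    # unstructured plus its PDF extras, for loading the PDF
# !pip install transformers sentence-transformers     # needed for the Hugging Face embedding model
```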
The next step is importing the dependencies, so here I'll add a heading saying "importing the dependencies". I'll say import os (I'll tell you why we need all of these things shortly). The next thing is from langchain_community.llms import Ollama, and then from langchain.document_loaders import UnstructuredFileLoader. Then from langchain.embeddings I'll import FAISS; I'll explain the purpose of all these libraries and functions in a minute once I'm done, and note that FAISS should be all uppercase. Then from langchain.text_splitter import CharacterTextSplitter, and finally we need from langchain.chains import RetrievalQA.

Okay, those are the imports; let's go through them. One second, let me fix this first line. os is used to access some of the files present in our system, so that's why we need it. Then we have Ollama, which is used to access the model we just pulled, Llama 3 or a Gemma model and so on. Then we have UnstructuredFileLoader to load our PDF file. And then there's the langchain.embeddings line: from langchain.embeddings we were importing FAISS, but that's not where it lives. Okay, let me try this again: it's from langchain_community.vectorstores import FAISS, and from langchain.embeddings we load HuggingFaceEmbeddings. Those are the imports; let's run this and see if it's working. Okay, it worked fine.
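After those corrections, the import cell ends up looking roughly like this; it's a sketch based on what we just typed, and depending on your LangChain version some of these classes also live under langchain_community:

```python
import os  # handy for working with file paths on the local system

from langchain_community.llms import Ollama                        # talk to the locally hosted Llama 3 model
from langchain.document_loaders import UnstructuredFileLoader      # read the PDF into text
from langchain_community.vectorstores import FAISS                 # in-memory vector store / similarity search
from langchain.embeddings import HuggingFaceEmbeddings             # sentence-transformer embedding model
from langchain.text_splitter import CharacterTextSplitter          # chunk the document text
from langchain.chains import RetrievalQA                           # document question-answering chain
```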
To run through the imports once more: we have Ollama in order to access the LLM we just pulled, and we need UnstructuredFileLoader to read the PDF file and get all the text content out of it. Then you have FAISS for similarity search over vector embeddings, and from langchain.embeddings we're importing HuggingFaceEmbeddings, which we'll use to convert the text into vector embeddings. The next one is CharacterTextSplitter: this is used to chunk the content of the PDF file. We can't pass the entire text at once, so instead we chunk it bit by bit; once all these chunks are ready, we'll encode them into vector embeddings and store them in a vector DB, or simply create a knowledge base out of them, and later we can use FAISS to do a similarity search. So for chunking we use CharacterTextSplitter. Then we have RetrievalQA: there are different chains, like conversational chains, retrieval QA chains and so on, but our focus here is just document question answering, so we can stick with RetrievalQA, where you pass in the document side of things along with a question and get a response back. That's how all the pieces fit together. If you're not very sure about the concept of retrieval-augmented generation, I've already made a conceptual video about how all the pieces come together and what the architecture looks like; I'll put that video link in the description as well, so you can check it out. So that's all about the import statements.
Now let's load the LLM. I'll say llm = Ollama(...), and within that I have to name the model, so I'll pass model="llama3:instruct". This is the 8 billion version; the other version is 70 billion. The other thing I'm going to set is the temperature parameter; you can include other parameters as well, but temperature more or less determines the randomness of the model's response, and here I'm keeping it at zero. When I run this, it loads the Llama 3 instruct model. This won't work if you haven't pulled the model already; it only works for the models you have pulled and have available, and in the list you can also see the sizes of the models. So that's the next step: we've loaded our Llama 3 instruct model. Again, if you're working with Gemma, you just have to replace "llama3:instruct" with whichever Gemma version you're working with; the 2 billion one is just "gemma:2b", so you can check that out and load it.
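That cell looks roughly like this (temperature set to zero, as mentioned in the recap later, to keep the responses as deterministic as possible):

```python
# Loading the LLM: the Llama 3 instruct model served locally by Ollama
llm = Ollama(
    model="llama3:instruct",  # must match a tag you have already pulled with `ollama pull`
    temperature=0,            # low randomness, so answers stay grounded in the retrieved context
)
```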
I'll also put a heading above that cell saying "loading the LLM", which is Llama 3 in this case. The next step is loading the document. For this I'll create a variable called loader, and loader is equal to UnstructuredFileLoader, the class we imported; within that we pass the name of the file. We're going to work with a PDF file, and that PDF file is the paper that was released on Transformers, so I'll copy the name of that file and paste it here. Both this notebook and the file are in the same directory, so I'm not giving any path; I'm just mentioning the file name. Then I'll say documents = loader.load(), and this loads the content of the PDF into the Document data type that the loader returns. Let's run this; it may take some time depending on the size of the PDF. If you want to know how big this PDF is, I can open it and show you: it's in my downloads, and this is the file. You can see the total number of pages is 11, and it has content about the Transformer architecture; this is the paper on the Transformer model that was first released by Google. So we have documents = loader.load(), and this loads the document.
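That cell, roughly; the file name here is a placeholder, so use the actual name of the PDF sitting next to your notebook:

```python
# Loading the document: the "Attention Is All You Need" Transformer paper, saved as a PDF
loader = UnstructuredFileLoader("attention_is_all_you_need.pdf")  # placeholder file name

# Read the whole PDF into LangChain Document objects (page_content plus metadata)
documents = loader.load()
```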
The next step is to create chunks for this document, so I'll add a heading "create document chunks". I'll say text_splitter = CharacterTextSplitter(...), and we can mention what kind of separator it should use; let's say the separator is "\n", the newline separator. Then we have the chunk size; let's say it's 1000 (you can play around with different numbers), and the overlap is 200. So here we're creating an instance of CharacterTextSplitter and giving it the parameters: the separator is the newline character, the chunk size is 1000 and the overlap is 200. Wherever it sees a newline character it will separate the text into different chunks, and the maximum number of characters that can be present in one chunk is 1000; that's what we're specifying here. The overlap of 200 means we won't cut the text into completely separate pieces. If you chunk it in a purely discrete way, the LLM, or rather the vector embeddings, may not carry the proper context. Maybe I'll show this to you visually: say we have a large paragraph about this Transformer topic. If these lines go into one chunk and the lines after that go into a different chunk, we lose continuity, so we don't do that. Instead we make the first chunk, and the second chunk starts a few lines back inside the first one, so there's an overlap and the context isn't lost when we convert the text to vector embeddings. That's why we're setting this overlap parameter of 200; again, you can play around with this number. So that's the instance we're creating. Oh, and the parameter name should be chunk_overlap.

So we're instantiating CharacterTextSplitter with all these parameters, and now we can pass our document, which is basically the content that UnstructuredFileLoader has loaded. I can print it to show you: this is the entire text that has been extracted from the PDF using UnstructuredFileLoader, and it's in the Document format, with page_content holding the whole content. I'll delete that print. Here we can say text_chunks = text_splitter.split_documents(documents); the data is in Document format, so we use split_documents. Sometimes you may want to chunk plain text string data instead, and in those cases you'd use split_text. So here we're using split_documents, and within that we pass the documents variable we created above. Let's run this. This step is done: first we loaded the document using UnstructuredFileLoader, and now we've split the document's content into different chunks.
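Put together, the chunking cell looks roughly like this:

```python
# Create document chunks: split on newlines, at most ~1000 characters per chunk,
# with a 200-character overlap so context isn't lost at chunk boundaries
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
)

# The loader gave us Document objects, so we use split_documents (split_text is for plain strings)
text_chunks = text_splitter.split_documents(documents)
```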
The next step is loading the embedding model, so here I'll add a heading saying "loading the vector embedding model"; this is what's used to convert the text into vector embeddings. Let's create a variable called embeddings, equal to HuggingFaceEmbeddings(). Within HuggingFaceEmbeddings you can point to a specific model from the Hugging Face hub, but you can also just go with the default one by calling HuggingFaceEmbeddings with no arguments. Let's run this; similar to loading the LLM, this loads the embedding model. There are some warnings, but you can ignore them; they're just about resuming the download.
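That cell is a one-liner, roughly:

```python
# Loading the vector embedding model; with no arguments, HuggingFaceEmbeddings
# falls back to its default sentence-transformers model
embeddings = HuggingFaceEmbeddings()
```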
The next step is converting all this text content into vector embeddings using this embeddings model. So here I'll add a heading "knowledge base", and then knowledge_base = FAISS.from_documents(text_chunks, embeddings). We're converting all these text chunks into vector embeddings, and later this will be used for similarity search through FAISS, the similarity search library.
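In code, roughly:

```python
# Knowledge base: embed every chunk and index the vectors in an in-memory FAISS store
knowledge_base = FAISS.from_documents(text_chunks, embeddings)
```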
So we're almost done; there are only a few things left. Now I'll set up the retrieval QA chain: let's say qa_chain = RetrievalQA.from_chain_type(...), using the chain we imported, and here I'll pass the llm and retriever=knowledge_base.as_retriever(). So we're creating a QA chain, a question-answering chain. Hmm, okay, the knowledge base variable name must have been typed slightly differently; let me rename it and run this again. So we're creating a variable called qa_chain using RetrievalQA, which handles the question-answering task coupled with the vector DB and the LLM, and through from_chain_type we're passing the LLM and the retriever, where the retriever is basically your knowledge base exposed as a retriever.
00:20:37
your knowledge base as retriever so what
00:20:40
basically happens in this rag thing is
00:20:42
right so you have a document so in this
00:20:44
case the attention is all you need paper
00:20:46
is the document right uh this entire
00:20:49
content will be read and this will be
00:20:51
stored in a vcta database so if you want
00:20:53
to use an actual database you can go
00:20:56
with chroma DB Lance DB and other Vector
00:20:58
DB that are present this FAS is just
00:21:00
runs in the memory so let's say that we
00:21:02
are considering a vector database which
00:21:04
stor these Vector embeddings so all
00:21:06
these contents will be converted into
00:21:08
vector embeddings and stored in the
00:21:10
vector DB and for each of the text let's
00:21:13
say that a particular sentence has been
00:21:14
converted into a vector embedding so
00:21:18
it's nothing but some set of numbers
00:21:20
that represent this text as vectors and
00:21:22
there will be this corresponding IDs as
00:21:24
well to extract this corresponding text
00:21:26
from those vectors so now you have
00:21:28
vector B that has the vector embeddings
00:21:30
of this entire document and now the user
00:21:32
ask a question so let's say that the
00:21:33
user ask question about this hardware
00:21:35
and schedule so now what happens is you
00:21:37
would convert that query as well into a
00:21:41
vector eming and now you search the
00:21:44
vectors that's similar to this question
00:21:46
to this entire vectors of this document
00:21:48
so let's say that in this case right so
00:21:51
when the user asks uh you know something
00:21:54
about let's
00:21:57
say uh the machine translation thing so
00:21:59
the question is about machine
00:22:00
translation so it will convert that into
00:22:03
a vector and search that Vector with
00:22:05
this entire documents vector embeding
00:22:07
and let's say that it will extract this
00:22:09
particular text so now you have the
00:22:12
question and the content in which the
00:22:15
answer may be present so now you sent
00:22:17
this question in text format as well as
00:22:19
the answer in text format to the llm and
00:22:22
the llm will frame the answer so this is
00:22:24
all that's happening it's not like the
00:22:26
llm kind of feeds from the vector
00:22:28
embeddings so it doesn't it doesn't kind
00:22:30
of fre the vector embeddings so what
00:22:32
happens is the vector embeddings is used
00:22:33
in order to do the similarity search and
00:22:35
find the similar vectors only the
00:22:37
question and the relevant content that
00:22:39
has been retrieved from the vector
00:22:42
database is sent to the model so it's
00:22:44
basically like the question will be what
00:22:46
is machine translation mentioned in this
00:22:48
paper so that question and this
00:22:49
particular text will be sent to the
00:22:51
model and the model or the llm in this
00:22:53
case will the Lama 3 Model will frame
00:22:56
the answer for this question and it will
00:22:57
send this to so this is what happens
00:22:59
it's not like the entire content is fit
00:23:01
to the llm so that's not how this works
00:23:03
so this is about the architecture of
00:23:05
this Rag and this is what I've explained
00:23:07
in my conceptual video as well you can
00:23:09
check that out if like you want to
00:23:11
understand like this concept like more
00:23:13
clearly so this is about this Q next
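If you want to see just the retrieval half of that on its own, without the LLM, you could do something like the following; this is an illustrative sketch rather than a cell from the video, and the query string is hypothetical:

```python
# Peek at what the retriever would hand to the LLM for a given question
hits = knowledge_base.similarity_search("machine translation results", k=3)  # hypothetical query
for doc in hits:
    print(doc.page_content[:200], "\n---")
```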
Next, let's ask a question. Let's create a variable called question and ask something simple: "What is this document about?". Then you say response = qa_chain.invoke(...), using the retrieval QA chain we created; you have to use invoke, and within it you pass the query as a dictionary, with the key "query" mapping to the question. Now let's print the response; the final answer will be present within this response, under the key called "result", so you can check that.
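Roughly:

```python
# Ask a question through the chain and pull the answer out of the "result" key
question = "What is this document about?"
response = qa_chain.invoke({"query": question})
print(response["result"])
```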
Okay, let's run this. Here we've created a QA chain, passing the LLM and the knowledge base, the stored vector embeddings, and then we're passing a question. The question gets converted into a vector embedding, the relevant information is retrieved, and that's passed to the LLM; this is what happens inside a QA chain, and we use qa_chain.invoke to invoke this chain, which holds the LLM and the knowledge base, pass in the query and get the relevant response back. That's how RAG works with this RetrievalQA chain; let's see.

The LLM of choice in this case, as you know, is Llama 3, but as I said you could use Gemma, or if you want to use GPT-4o you can use the LangChain wrapper for it; you can access the GPT-4o API from LangChain itself, pass that as the LLM, and pass it in turn to this RetrievalQA chain. The answer says this document appears to be a research paper or article about the Transformer sequence transduction model, and so on, so it was able to capture that information properly. Maybe let's ask another, similar question: what is the architecture discussed in this paper?

This second turn took about four or five seconds. It won't usually take that long; the first time, the model loads and everything gets set up, so it might take some time, but the later questions you ask won't take that much time. Again, it also depends on the GPU you're using. The answer describes the architecture discussed in the paper, and so on. We can't guarantee that every question we ask gets an accurate response, because this is a fairly small model, the Llama 3 8 billion version; the larger version is 70 billion. Of course we can't load that here because of the GPU limitation, but if you're working at a company that has such a machine or a cloud instance, then you can use the 70 billion version, and all these steps remain the same; you can go to an EC2 instance and load the 70 billion version instead of the 8 billion one. And if you're working on your local machine with, say, an 8 GB GPU, you can load the Llama 3 8 billion version itself; it's only about 4.7 GB.
But that's how this works. Let's do a quick recap of what we've done. As the first step we imported all the dependencies we need: Ollama in order to load the LLM, UnstructuredFileLoader to load the PDF file, FAISS to do similarity search over the vectors, HuggingFaceEmbeddings to convert the text into vector embeddings, CharacterTextSplitter to chunk the PDF document's content into smaller chunks, and RetrievalQA, where we pass the LLM as well as the knowledge base, which is the vector embeddings we've stored. Then we loaded the LLM from Ollama, the Llama 3 instruct model we pulled, with the temperature parameter set to zero so that we don't get much randomness from the model. Then we loaded the document using UnstructuredFileLoader, used the text splitter to get the text chunks, loaded the embedding model, converted the text documents into vector embeddings and stored them in FAISS for similarity search, and finally set up the RetrievalQA chain, passing the LLM and the knowledge base to it, so we can ask questions using qa_chain.invoke. I hope everyone is clear about this; please try it out and see how this code works. That's all from my side, and I'll see you in the next upload. Thanks for watching.