In a previous video we looked at the GraphRAG project from Microsoft, which aims to combine knowledge graphs with retrieval-augmented generation to address the limitations of traditional RAG systems. In that video we used GPT-4o as our LLM, but that was pretty expensive. In this video I'll show you how you can use a local model through Ollama, as well as the Groq API, and we're also going to talk about why it's probably not a good idea to use local models with GraphRAG.
To get started, you will first need to download Ollama on your local machine and then choose the model that you want to use. In this case we're going to use Llama 3, but I'd recommend a much bigger model if your hardware can support it; I'll explain the reason for using bigger models later in the video.
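Getting the model running takes two commands (assuming you want the plain llama3 tag, which resolves to the 8-billion-parameter variant):

```bash
# Download the model weights (the plain llama3 tag is the 8B variant)
ollama pull llama3

# The server usually starts with the Ollama desktop app; if not, start it manually
ollama serve
```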
Ollama follows the same API standard as OpenAI, so it's very easy to replace the OpenAI API server with this new endpoint. By default it runs at localhost, port 11434, under /v1, which is the API version. We will need this base URL as well as the API key, which in this case is simply going to be ollama.
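As a quick sanity check, you can hit Ollama's OpenAI-compatible endpoint directly; a minimal sketch:

```bash
# Ollama ignores the API key, but OpenAI-style clients still expect one to be set
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```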
Now we just need to point the GraphRAG application at that API endpoint. For that, go to the project that we set up for GraphRAG, then open settings.yaml, and look at the llm section. You will need to have the project set up already, so if you're not familiar with it, I highly recommend watching my previous video, in which I cover both the theoretical side of how GraphRAG works and how to set it up. In this case we need to make a few changes. Initially the llm section was pointing at the GRAPHRAG_API_KEY environment variable, which is basically the OpenAI API key, but I'm going to provide the API key as ollama. You can keep the type as openai_chat, because Ollama follows the same standard as OpenAI. For the model we're going to provide llama3, the model we're currently serving, and since Ollama also supports JSON mode, you can set model_supports_json to true. For the api_base we're going to provide the endpoint we're running on our local machine.
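Here is roughly what my llm block in settings.yaml looks like after these changes (a sketch; key names are from the graphrag version I'm using):

```yaml
llm:
  api_key: ollama                      # Ollama ignores the key, but the field must be set
  type: openai_chat                    # keep this: Ollama speaks the OpenAI chat API
  model: llama3                        # the model we are serving locally
  model_supports_json: true            # Ollama supports JSON mode
  api_base: http://localhost:11434/v1  # the local endpoint from above
```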
Now there are a couple of other parameters you'll want to set if you're working with something like Groq. If you're serving the model through Groq, you will need to change the API key to your Groq API key, and instead of the base URL we're currently using, you'll use Groq's base URL; I'll show you that later in the video, but it's one of the changes you need to make. The second change is the model you want to use; for example, you can select the Llama 3 70 billion model, so you'll have to provide that model as well. Now, Groq does have rate limits, and you need to make sure you stay within them. For most models you can make only 30 requests per minute; that's the maximum number of requests Groq allows on their free tier. So for Groq we need to set the number of requests per minute to 30, or probably less, to make sure we don't get throttled. But keep in mind that if you set that, it's going to take a while for the process to finish.
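For Groq, the same llm block would look roughly like this (the model id and the environment-variable name are what I'd expect; treat it as a sketch, not a definitive config):

```yaml
llm:
  api_key: ${GROQ_API_KEY}       # your Groq API key, here read from an environment variable
  type: openai_chat              # Groq also exposes an OpenAI-compatible API
  model: llama3-70b-8192         # Groq's id for Llama 3 70B at the time of recording
  model_supports_json: true
  api_base: https://api.groq.com/openai/v1
  requests_per_minute: 30        # stay at or below the free-tier limit
```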
The second aspect is the embedding model that is going to be used. In my case, I haven't really found a solution to replace the OpenAI embedding model, and the reason, I think, is that there is no standard API for embedding models that other API providers follow. When I was experimenting with local embedding models, I couldn't make them work, because I believe they don't follow the same standard as OpenAI. Now, even if you use the OpenAI embedding model, the cost associated with embeddings is pretty small compared to the LLM. For example, in my previous experiment we made only 25 requests to the embedding model, compared to 570 requests to GPT-4o. So even if you use the embedding model from OpenAI, I don't think it's going to cost too much.
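So the embeddings section of settings.yaml stays pointed at OpenAI; roughly:

```yaml
embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}    # still the OpenAI key; embeddings stay on OpenAI
    type: openai_embedding
    model: text-embedding-3-small   # graphrag's default embedding model
```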
Okay, so once you set this up, the next thing is to run the local indexing; that's the first part. For that we're going to use python -m graphrag.index, because we want to create the index, and we need to provide the project path. By default it looks at this folder, and within this folder we have an input folder.
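Concretely, the indexing command looks like this (I'm assuming the project folder is called ragtest, as in the standard setup; adjust the path to wherever you created yours):

```bash
# Build the index; --root points at the project folder that contains ./input
python -m graphrag.index --root ./ragtest
```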
You can change that in the settings.yaml file: here is the input section, where you can provide another folder name if you want. You can also change the chunk size as well as the overlap, but I'm not doing that in this case.
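For reference, these are the relevant sections of settings.yaml (the values shown are the defaults generated for me; yours may differ):

```yaml
input:
  type: file
  file_type: text
  base_dir: "input"   # rename this to read documents from a different folder

chunks:
  size: 300           # tokens per chunk
  overlap: 100        # token overlap between consecutive chunks
```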
Currently the entity extraction part is running. It has taken about 21 minutes and has only completed about 50%, or 58% in this case. I'm running this on an M2 MacBook Pro with 96 GB of RAM, and here's what the GPU usage looks like. This is not the only process running on my machine; I have gazillions of Chrome tabs open as well as some other processes. We can also look at the output folder: this is where the embedding vectors are created, along with the reports, which are where you can see exactly what is going on.
Let me give you a quick overview of what is happening. For the base URL it's using the API endpoint we provided, and it is using the Llama 3 model, so that's a good sign. It also picks up the other settings from the settings.yaml file, and you can see it making calls to the API endpoint and sometimes retrying. For example, here is an error it hit, but it was able to recover because it retried multiple times, and you can select how many retries it will make. Now, when I was running GPT-4o, the process took much less time, and the results will probably be much better with the bigger GPT-4o model than with something like Llama 3 8 billion, but I just wanted to show you how to set this up. For better results, it might be better to use the Llama 3 70 billion model from Groq; however, in that case you need to set the requests per minute to a lower value, and that could mean a much longer time to run the entity extraction and build the corresponding graph. It has already taken about 27 minutes, so I'm going to wait for this process to complete, and then I'll walk you through how to run queries.
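The retry and throttling behavior mentioned above is controlled from the same llm block in settings.yaml; a sketch of the relevant knobs (names as in the graphrag version I'm using):

```yaml
llm:
  # ...model settings as above...
  max_retries: 10           # how many times a failed request is retried
  max_retry_wait: 10.0      # maximum backoff (seconds) between retries
  requests_per_minute: 30   # throttle; important for rate-limited providers like Groq
```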
To test this out, we're going to use the same prompt we used in the previous video. We'll use python -m graphrag.query, and we need to provide where our documents are located, or rather where the index and graph were created. The method we're going to use is global; I explain this in my previous video, so I highly recommend watching that to understand the difference between global and local search. And the prompt is going to be "What is the main theme of the book?". So let's run this.
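Put together, the query command looks like this (again assuming the ragtest project folder):

```bash
# Global search runs over the community summaries built during indexing
python -m graphrag.query --root ./ragtest --method global \
  "What is the main theme of the book?"
```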
Okay, so here's the response. Global Search Response: the main theme of the book, as highlighted by multiple analysts, is the importance of human connection in relationships; this theme is evident throughout the character of Scrooge, who continues to answer to Marley's name even after his death. So the summary, or main theme, that we get from Llama 3 is not as good as GPT-4o's, and that is kind of self-explanatory, because when it comes to GraphRAG, the choice of LLM you use is a lot more critical than the choice of LLM in a traditional RAG system.
To explain this, let's look at a traditional RAG system first. In a traditional RAG system, the most important component is your embedding model, because it determines what type of chunks the LLM receives when it's generating responses. So you want to make sure that both the chunking strategy and the embedding model you choose are good, so that the LLM receives proper context. In that case, even with a smaller LLM, if you provide great context it will be able to generate good responses.
But the GraphRAG approach is very different, because the way you build these knowledge graphs is by first extracting entities from your text. You need a really capable LLM that can recognize the different entities present in your documents and extract the relationships between them. If you use a smaller LLM like Llama 3 8 billion, it will not be able to extract those relationships accurately, and as a result the graph you create is not going to be great. Then you also need to create summaries of the communities that are formed from the graph the LLM built. So there are multiple places where the LLM plays a much more critical role in GraphRAG than in a traditional RAG system. A smaller LLM is probably not a great choice here; you want to look at much bigger LLMs, like the Llama 3 70 billion model.
So here are the results when I tried the Llama 3 70 billion model from Groq. In this case I replaced the base API, changed the requests per minute, and used the Llama 3 70 billion model, and this is the result. Keep in mind that I didn't embed the whole file this time, because that was taking way too long, so it's just a small portion of the book. It says the main theme of the book revolves around Scrooge's transformative journey, marked by supernatural events and interactions with various entities, and it talks about the significance of the supernatural events and the implications of the theme. So it's much better than the Llama 3 8 billion model, but still not as good as GPT-4o, the reason being that it was only looking at a very small portion of the document rather than the whole book.
Another thing to consider is the prompts being used. By now we know that different LLMs react differently to the same prompt, so you really need to look at your prompts and hand-craft them for each and every LLM. A prompt that works great for GPT-4o may not be a great prompt for Llama 3 8 billion, or even Llama 3 70 billion. So if you're going to use GraphRAG in your system, make sure you're able to modify these prompts based on the LLM you're using. That is a critical component that a lot of people simply ignore, and then the system doesn't produce good outputs.
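In the graphrag version I'm using, these prompts live as editable text files in the project's prompts folder, created when the project is initialized, so you can tune each one per model; roughly:

```bash
ls ./ragtest/prompts
# claim_extraction.txt    community_report.txt
# entity_extraction.txt   summarize_descriptions.txt
```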
I'm going to be experimenting with GraphRAG a lot more, because I think it's a great framework that needs a lot more exploration, and there are some other implementations of GraphRAG as well, so we're going to look at some of them in subsequent videos. If that's something that interests you, make sure to subscribe to the channel. Thanks for watching, and as always, see you in the next one.