In a previous video we looked at the GraphRAG project from Microsoft, which aims to combine knowledge graphs with retrieval-augmented generation to address the limitations of traditional RAG systems. In that video we used GPT-4o as our LLM, but that was pretty expensive. In this video I'll show you how you can use a local model through Ollama, as well as the Groq API, and we're also going to talk about why it's probably not a good idea to use local models with GraphRAG.
To get started, you will first need to download Ollama on your local machine and then choose the model that you want to use. In this case we're going to use Llama 3, but I'd recommend a much bigger model if your hardware can support it; I'll explain the reason for using bigger models later in the video.
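Getting the model running takes two commands (assuming you want the plain llama3 tag, which resolves to the 8-billion-parameter variant):

```bash
# Download the model weights (the plain llama3 tag is the 8B variant)
ollama pull llama3

# The server usually starts with the Ollama desktop app; if not, start it manually
ollama serve
```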
Ollama follows the same API standard as OpenAI, so it's very easy to replace the OpenAI API server with this new endpoint. By default it runs at localhost, port 11434, under /v1, which is the API version. We will need this base URL as well as the API key, which in this case is simply going to be ollama.
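As a quick sanity check, you can hit Ollama's OpenAI-compatible endpoint directly; a minimal sketch:

```bash
# Ollama ignores the API key, but OpenAI-style clients still expect one to be set
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```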
Now we just need to point the GraphRAG application at that API endpoint. For that, go to the project that we set up for GraphRAG, then open settings.yaml, and look at the llm section. You will need to have the project set up already, so if you're not familiar with it, I highly recommend watching my previous video, in which I cover both the theoretical side of how GraphRAG works and how to set it up. In this case we need to make a few changes. Initially the llm section was pointing at the GRAPHRAG_API_KEY environment variable, which is basically the OpenAI API key, but I'm going to provide the API key as ollama. You can keep the type as openai_chat, because Ollama follows the same standard as OpenAI. For the model we're going to provide llama3, the model we're currently serving, and since Ollama also supports JSON mode, you can set model_supports_json to true. For the api_base we're going to provide the endpoint we're running on our local machine.
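Here is roughly what my llm block in settings.yaml looks like after these changes (a sketch; key names are from the graphrag version I'm using):

```yaml
llm:
  api_key: ollama                      # Ollama ignores the key, but the field must be set
  type: openai_chat                    # keep this: Ollama speaks the OpenAI chat API
  model: llama3                        # the model we are serving locally
  model_supports_json: true            # Ollama supports JSON mode
  api_base: http://localhost:11434/v1  # the local endpoint from above
```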
Now there are a couple of other parameters you'll want to set if you're working with something like Groq. If you're serving the model through Groq, you will need to change the API key to your Groq API key, and instead of the base URL we're currently using, you'll use Groq's base URL; I'll show you that later in the video, but it's one of the changes you need to make. The second change is the model you want to use; for example, you can select the Llama 3 70 billion model, so you'll have to provide that model as well. Now, Groq does have rate limits, and you need to make sure you stay within them. For most models you can make only 30 requests per minute; that's the maximum number of requests Groq allows on their free tier. So for Groq we need to set the number of requests per minute to 30, or probably less, to make sure we don't get throttled. But keep in mind that if you set that, it's going to take a while for the process to finish.
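For Groq, the same llm block would look roughly like this (the model id and the environment-variable name are what I'd expect; treat it as a sketch, not a definitive config):

```yaml
llm:
  api_key: ${GROQ_API_KEY}       # your Groq API key, here read from an environment variable
  type: openai_chat              # Groq also exposes an OpenAI-compatible API
  model: llama3-70b-8192         # Groq's id for Llama 3 70B at the time of recording
  model_supports_json: true
  api_base: https://api.groq.com/openai/v1
  requests_per_minute: 30        # stay at or below the free-tier limit
```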
The second aspect is the embedding model that is going to be used. In my case, I haven't really found a solution to replace the OpenAI embedding model, and the reason, I think, is that there is no standard API for embedding models that other API providers follow. When I was experimenting with local embedding models, I couldn't make them work, because I believe they don't follow the same standard as OpenAI. Now, even if you use the OpenAI embedding model, the cost associated with embeddings is pretty small compared to the LLM. For example, in my previous experiment we made only 25 requests to the embedding model, compared to 570 requests to GPT-4o. So even if you use the embedding model from OpenAI, I don't think it's going to cost too much.
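So the embeddings section of settings.yaml stays pointed at OpenAI; roughly:

```yaml
embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}    # still the OpenAI key; embeddings stay on OpenAI
    type: openai_embedding
    model: text-embedding-3-small   # graphrag's default embedding model
```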
Okay, so once you set this up, the next thing is to run the local indexing; that's the first part. For that we're going to use python -m graphrag.index, because we want to create the index, and we need to provide the project path. By default it looks at this folder, and within this folder we have an input folder.
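Concretely, the indexing command looks like this (I'm assuming the project folder is called ragtest, as in the standard setup; adjust the path to wherever you created yours):

```bash
# Build the index; --root points at the project folder that contains ./input
python -m graphrag.index --root ./ragtest
```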
You can change that in the settings.yaml file: here is the input section, where you can provide another folder name if you want. You can also change the chunk size as well as the overlap, but I'm not doing that in this case.
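For reference, these are the relevant sections of settings.yaml (the values shown are the defaults generated for me; yours may differ):

```yaml
input:
  type: file
  file_type: text
  base_dir: "input"   # rename this to read documents from a different folder

chunks:
  size: 300           # tokens per chunk
  overlap: 100        # token overlap between consecutive chunks
```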
Currently the entity extraction part is running. It has taken about 21 minutes and has only completed about 50%, or 58% in this case. I'm running this on an M2 MacBook Pro with 96 GB of RAM, and here's what the GPU usage looks like. This is not the only process running on my machine; I have gazillions of Chrome tabs open as well as some other processes. We can also look at the output folder: this is where the embedding vectors are created, along with the reports, which are where you can see exactly what is going on.
Let me give you a quick overview of what is happening. For the base URL it's using the API endpoint we provided, and it is using the Llama 3 model, so that's a good sign. It also picks up the other settings from the settings.yaml file, and you can see it making calls to the API endpoint and sometimes retrying. For example, here is an error it hit, but it was able to recover because it retried multiple times, and you can select how many retries it will make. Now, when I was running GPT-4o, the process took much less time, and the results will probably be much better with the bigger GPT-4o model than with something like Llama 3 8 billion, but I just wanted to show you how to set this up. For better results, it might be better to use the Llama 3 70 billion model from Groq; however, in that case you need to set the requests per minute to a lower value, and that could mean a much longer time to run the entity extraction and build the corresponding graph. It has already taken about 27 minutes, so I'm going to wait for this process to complete, and then I'll walk you through how to run queries.
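The retry and throttling behavior mentioned above is controlled from the same llm block in settings.yaml; a sketch of the relevant knobs (names as in the graphrag version I'm using):

```yaml
llm:
  # ...model settings as above...
  max_retries: 10           # how many times a failed request is retried
  max_retry_wait: 10.0      # maximum backoff (seconds) between retries
  requests_per_minute: 30   # throttle; important for rate-limited providers like Groq
```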
To test this out, we're going to use the same prompt we used in the previous video. We'll use python -m graphrag.query, and we need to provide where our documents are located, or rather where the index and graph were created. The method we're going to use is global; I explain this in my previous video, so I highly recommend watching that to understand the difference between global and local search. And the prompt is going to be "What is the main theme of the book?". So let's run this.
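Put together, the query command looks like this (again assuming the ragtest project folder):

```bash
# Global search runs over the community summaries built during indexing
python -m graphrag.query --root ./ragtest --method global \
  "What is the main theme of the book?"
```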
Okay, so here's the response. Global Search Response: the main theme of the book, as highlighted by multiple analysts, is the importance of human connection in relationships; this theme is evident throughout the character of Scrooge, who continues to answer to Marley's name even after his death. So the summary, or main theme, that we get from Llama 3 is not as good as GPT-4o's, and that is kind of self-explanatory, because when it comes to GraphRAG, the choice of LLM you use is a lot more critical than the choice of LLM in a traditional RAG system.
To explain this, let's look at a traditional RAG system first. In a traditional RAG system, the most important component is your embedding model, because it determines what type of chunks the LLM receives when it's generating responses. So you want to make sure that both the chunking strategy and the embedding model you choose are good, so that the LLM receives proper context. In that case, even with a smaller LLM, if you provide great context it will be able to generate good responses.
But the GraphRAG approach is very different, because the way you build these knowledge graphs is by first extracting entities from your text. You need a really capable LLM that can recognize the different entities present in your documents and extract the relationships between them. If you use a smaller LLM like Llama 3 8 billion, it will not be able to extract those relationships accurately, and as a result the graph you create is not going to be great. Then you also need to create summaries of the communities that are formed from the graph the LLM built. So there are multiple places where the LLM plays a much more critical role in GraphRAG than in a traditional RAG system. A smaller LLM is probably not a great choice here; you want to look at much bigger LLMs, like the Llama 3 70 billion model.
So here are the results when I tried the Llama 3 70 billion model from Groq. In this case I replaced the base API, changed the requests per minute, and used the Llama 3 70 billion model, and this is the result. Keep in mind that I didn't embed the whole file this time, because that was taking way too long, so it's just a small portion of the book. It says the main theme of the book revolves around Scrooge's transformative journey, marked by supernatural events and interactions with various entities, and it talks about the significance of the supernatural events and the implications of the theme. So it's much better than the Llama 3 8 billion model, but still not as good as GPT-4o, the reason being that it was only looking at a very small portion of the document rather than the whole book.
Another thing to consider is the prompts being used. By now we know that different LLMs react differently to the same prompt, so you really need to look at your prompts and hand-craft them for each and every LLM. A prompt that works great for GPT-4o may not be a great prompt for Llama 3 8 billion, or even Llama 3 70 billion. So if you're going to use GraphRAG in your system, make sure you're able to modify these prompts based on the LLM you're using. That is a critical component that a lot of people simply ignore, and then the system doesn't produce good outputs.
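In the graphrag version I'm using, these prompts live as editable text files in the project's prompts folder, created when the project is initialized, so you can tune each one per model; roughly:

```bash
ls ./ragtest/prompts
# claim_extraction.txt    community_report.txt
# entity_extraction.txt   summarize_descriptions.txt
```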
I'm going to be experimenting with GraphRAG a lot more, because I think it's a great framework that needs a lot more exploration, and there are some other implementations of GraphRAG as well, so we're going to look at some of them in subsequent videos. If that's something that interests you, make sure to subscribe to the channel. Thanks for watching, and as always, see you in the next one.