Graph RAG with Ollama - Save $$$ with Local LLMs

00:12:08
https://www.youtube.com/watch?v=_XOCAVsr3KU

Summary

TL;DR: This video explores how to run Microsoft's Project GraphRAG with local models instead of costly cloud-based LLMs like GPT-4o. The host walks through setting up the local environment with Ollama and the Llama 3 model, emphasizing the key configuration needed to point GraphRAG at the local API. The limitations of local models are discussed, particularly around entity and relationship extraction, underscoring the importance of capable models for building knowledge graphs. The video also highlights how different LLMs respond differently to the same prompts, so queries must be carefully tailored per model for optimal results. Comparisons of outputs from Llama 3 8B, Llama 3 70B, and GPT-4o show clear differences in quality and the inherent challenges of running GraphRAG locally. The presenter encourages further experimentation with GraphRAG, indicating future videos on related topics.

Key Takeaways

  • 📈 Using local models can reduce costs compared to cloud-based LLMs.
  • 🔧 Setting up Ollama and Llama 3 on your local machine is straightforward.
  • ⚙️ Proper API configuration is essential for effective GraphRAG usage.
  • 👥 Larger models are preferred for better relationship extraction in knowledge graphs.
  • 📉 Smaller models like Llama 3 8B may struggle to extract entities and relationships accurately.
  • 📝 Different LLMs react uniquely to prompts, requiring tailored queries for each.
  • 📐 Rate limits on services like Groq must be respected to avoid timeouts.
  • ⚠️ The quality of the LLM significantly impacts overall performance in GraphRAG.
  • 📊 Llama 3 70B improves on the 8B results but still falls short of GPT-4o.
  • 🔍 Continued exploration of GraphRAG can yield valuable insights and improvements.

Timeline

  • 00:00:00 - 00:05:00

    In this video, the presenter discusses using a local model with Project GraphRAG from Microsoft, following a previous video that used GPT-4o. The local setup involves downloading Ollama, running it on a local machine, selecting a model such as Llama 3, and configuring the API endpoint appropriately. It's emphasized that larger models yield better results with GraphRAG because of their stronger capability in entity extraction and relationship recognition.

  • 00:05:00 - 00:12:08

    The presenter runs the local indexing process on an M2 MacBook Pro, detailing the setup and challenges faced, including API error handling. The importance of a suitable LLM is highlighted: in the GraphRAG framework, entity extraction is critical for creating an effective knowledge graph. Testing with prompts shows that while the output improves when moving from Llama 3 8B to Llama 3 70B, it still falls short of GPT-4o quality, underlining the need for prompt engineering tailored to each model.

Video Q&A

  • What is Project GraphRAG?

    Project GraphRAG combines knowledge graphs with retrieval-augmented generation to improve traditional RAG systems.

  • Why is Ollama recommended for serving local models?

    Ollama exposes an OpenAI-compatible API, which makes it easy to swap in for the OpenAI endpoint without changing the client code.

  • What are the limitations of using smaller models with GraphRAG?

    Smaller models may struggle to accurately extract relationships, which is critical for building effective knowledge graphs.

  • How does the LLM impact GraphRAG compared to traditional RAG?

    In GraphRAG, the quality of the LLM matters far more, because the LLM builds the knowledge graph itself; in traditional RAG systems, the embedding model and chunking strategy matter most.

  • What should you consider when crafting prompts for different LLMs?

    Different LLMs respond differently to the same prompts, so they should be tailored for each specific model.

  • What are the request-rate limits on services like Groq?

    On Groq's free tier, most models allow a maximum of 30 requests per minute.

  • Can you replace the OpenAI embedding model with local models?

    Currently, there is no standard API for embedding models across providers, making it challenging to replace OpenAI's embedding model.

  • Why are larger models preferred in this setup?

    Larger models provide better entity recognition and relationship extraction crucial for effective graph building.

  • What is the main theme of the book discussed in the experiment?

    The main theme revolves around the importance of human connection in relationships, as illustrated by the character Scrooge.

Transcript (auto-generated, English)

  • 00:00:00
    In a previous video we looked at Project GraphRAG from Microsoft, which aims to combine knowledge graphs with retrieval-augmented generation to address the limitations of traditional RAG systems. In that video we used GPT-4o as our LLM, but that was pretty expensive to use. In this video I'll show you how to use a local model via Ollama as well as the Groq API, and we'll also talk about why it's probably not a good idea to use local models with GraphRAG.
  • 00:00:31
    To get started, you will first need to download Ollama on your local machine and then choose the model you want to use. In this case we're going to be using Llama 3, but I'd recommend a much bigger model if your hardware can support it; I'll explain the reason for using bigger models later in the video.
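
Before touching GraphRAG, it's worth sanity-checking the local endpoint. A minimal sketch, assuming Ollama's default port and the llama3 model tag:

    # pull the model, then hit Ollama's OpenAI-compatible endpoint
    ollama pull llama3
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ollama" \
      -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'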
  • 00:00:50
    Ollama follows the same API standard as OpenAI, so it's very easy to replace the OpenAI API server with this new endpoint. By default it runs at localhost on port 11434, and v1 is the API version. We will need this base URL as well as the API key, which is simply "ollama" in this case.
  • 00:01:15
    Now we just need to point the GraphRAG application at that API endpoint. For that, go to the project we set up for GraphRAG, then open settings.yaml and look at the llm section. You will need to have set up the project already; if you're not familiar with it, I highly recommend watching my previous video, which covers both the theoretical side of how GraphRAG works and how to set it up.
  • 00:01:47
    In this case we'll need to make a few changes. Initially the llm section pointed to the GraphRAG API key, which is basically the OpenAI API key, but I'm going to set the API key to "ollama". You can keep the type as openai_chat, because Ollama follows the same standard as OpenAI. For the model, provide llama3, the model we are currently serving, and since Ollama also supports JSON mode, you can set that flag to true. For the API base, provide the endpoint we are currently running on our local machine.
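
Collected in one place, the llm block described above looks roughly like this; the key names follow GraphRAG's settings schema, but since the exact file isn't shown on screen, treat it as an approximation:

    llm:
      api_key: ollama                  # any non-empty string works for Ollama
      type: openai_chat                # Ollama mimics the OpenAI chat API
      model: llama3
      model_supports_json: true        # Ollama supports JSON mode
      api_base: http://localhost:11434/v1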
  • 00:02:27
    There are a couple of other parameters you'll want to set if you are working with something like Groq. If you are serving the model through Groq, you will need to change the API key to your Groq API key, and instead of the base URL we're currently using, you'll use Groq's base URL; I'll show you that later in the video. The second change is the model you want to use: for example, you can select the Llama 3 70 billion model, so you will have to provide that model name as well.
  • 00:03:05
    Groq does have rate limits, and you need to make sure you stay within them. For most models you can make only 30 requests per minute; that's the maximum Groq allows on their free tier. So for Groq we will need to set the number of requests per minute to 30, or probably less than that, to ensure we don't time out. Keep in mind that if you set that, it's going to take a while for the process to finish.
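
The Groq variant of the same block might look like this; the base URL is Groq's OpenAI-compatible endpoint, while the model id and throttling keys are assumptions to verify against Groq's and GraphRAG's docs:

    llm:
      api_key: ${GROQ_API_KEY}          # your Groq API key
      type: openai_chat
      model: llama3-70b-8192            # Groq's Llama 3 70B id at the time
      api_base: https://api.groq.com/openai/v1
      requests_per_minute: 25           # stay under the free tier's 30 rpm
      max_retries: 10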
  • 00:03:42
    The second aspect is the embedding model that will be used. In my case I haven't found a solution to replace the OpenAI embedding model, and the reason, I think, is that there is no standard API for embedding models that other API providers follow. When I was experimenting with local embedding models I couldn't really make it work, because they don't follow the same standard as OpenAI. But even if you use the OpenAI embedding model, the cost associated with embeddings is pretty small compared to the LLM: for example, in my previous experiment we made only 25 requests to the embedding model compared to 570 requests to GPT-4o. So even if you use OpenAI's embedding model, it's not going to cost too much.
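
For comparison, the embeddings side of settings.yaml stays on OpenAI; a sketch of what that block typically looks like (the model name is OpenAI's small embedding model, an assumption since the video doesn't show this file):

    embeddings:
      llm:
        api_key: ${GRAPHRAG_API_KEY}   # OpenAI key, e.g. loaded from .env
        type: openai_embedding
        model: text-embedding-3-small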
  • 00:04:37
    Okay, once you have set this up, the next step is to run the local indexing; that's the first part. For that we use python -m graphrag.index, because we want to create the index, and we provide the file path. By default it looks at this folder, and within this folder we have an input folder. You can change that within the settings.yaml file: here is the input folder, and you can give it another name if you want. You can also change the chunk size as well as the overlap, but I'm not doing that in this case.
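
A sketch of the indexing run, assuming the getting-started layout with a ./ragtest project root:

    # documents live in ./ragtest/input; config in ./ragtest/settings.yaml
    python -m graphrag.index --root ./ragtest

The chunk size and overlap mentioned above live under the chunks: section of settings.yaml.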
  • 00:05:10
    Currently the entity extraction part is running. It has taken about 21 minutes and has only completed about 50%, or 58% in this case. I'm running this on an M2 MacBook Pro with 96 GB of RAM, and here's what the GPU usage looks like. This is not the only process running on my machine; I have gazillions of Chrome tabs open as well as some other processes running.
  • 00:05:39
    We can also look at the output folder. The output folder will contain the embedding vectors as well as reports, which is where you can see what exactly is going on. Let me give you a quick overview of what is happening: for the base URL it's using the API endpoint we provided, and it is using the Llama 3 model, so that's a good thing. It also picks up the other settings from the settings.yaml file. You can see it's making calls to the API endpoint and sometimes it retries: for example, here was an error it hit, but it was able to recover because it retried multiple times, and you can configure how many retries it will make.
  • 00:06:30
    When I was running GPT-4o, the process took much less time, and the results will probably be much better with the bigger GPT-4o model than with something like Llama 3 8 billion; I just wanted to show you how to set this up. For better results it might be better to use the Llama 3 70 billion model from Groq; however, in that case you need to set the requests per minute to a lower value, and that can make it take much longer to run the entity extraction and build the corresponding graph. It has already taken about 27 minutes, so I'm going to wait for this process to complete and then walk you through how to run queries.
  • 00:07:17
    To test this out, we're going to use the same prompt we used in the previous video. We will use python -m graphrag.query, and we need to provide where our documents are located, or rather where the index and graph were created. The method we're going to use is global community search; I explained this in my previous video, so I highly recommend watching that to understand the difference between global and local. The prompt is going to be "What is the main theme of the book?", so let's run this.
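
The query call looks like this, reusing the assumed ./ragtest root from the indexing step:

    python -m graphrag.query --root ./ragtest --method global \
      "What is the main theme of the book?"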
  • 00:07:51
    Okay, here's the response. Global search response: the main theme of the book, as highlighted by multiple analysts, is the importance of human connection in relationships; this theme is evident through the character of Scrooge, who continues to answer to Marley's name even after his death. So the summary we get from Llama 3 is not as good as GPT-4o's, and that is somewhat self-explanatory, because when it comes to GraphRAG the choice of LLM is a lot more critical than the choice of LLM in a traditional RAG system.
  • 00:08:30
    To explain this, let's look at a traditional RAG system. In a traditional RAG system the most important component is your embedding model, because that determines what type of chunks the LLM receives when it tries to generate responses. You want to make sure both the chunking strategy and the embedding model you choose are good, so that the LLM receives the proper context. In that case, even if you have a smaller LLM, if you provide great context it will be able to generate good responses.
  • 00:09:07
    But in the case of the GraphRAG approach it's very different, because the way you build these knowledge graphs is that you first extract entities from your text. You need a really capable LLM that can recognize the different entities present in your documents and extract relationships. If you use a smaller LLM like Llama 3 8 billion, it will not be able to extract those relationships accurately, and as a result the graph you create is not going to be good. Then you also need to create summaries of the communities derived from the graph the LLM built. So there are multiple stages at which the LLM plays a much more critical role in GraphRAG than in a traditional RAG system. A smaller LLM is probably not a great choice here; you want to look at much bigger LLMs like the Llama 3 70 billion model.
  • 00:10:06
    Here are the results when I tried the Llama 3 70 billion model from Groq. In this case I replaced the API base and also changed the requests per minute; we are using the Llama 3 70 billion model, and this is the result. Keep in mind that I didn't index the whole file this time, because that was taking way too long, so it's just a small portion of the book. It says the main theme of the book revolves around Scrooge's transformative journey, marked by supernatural events and interactions with various entities, and it talks about the significance of the supernatural events and the implications of the theme. So it's much better than the Llama 3 8 billion model, but still not as good as GPT-4o, the reason being that it was only looking at a very small portion of the document compared to the whole book.
  • 00:11:00
    Another thing to consider is the prompts being used. By now we know that different LLMs react differently to the same prompt, so you really need to look at your prompts and hand-craft them for each and every LLM. A prompt that works great for GPT-4o may not be a great prompt for Llama 3 8 billion or even Llama 3 70 billion. So if you are going to use GraphRAG in your system, make sure you are able to modify these prompts based on the LLM you are using; that is a very critical component that a lot of people simply ignore, and then the system doesn't produce good outputs.
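
In practice that means editing the prompt files GraphRAG scaffolds into the project; the names below follow the layout generated by the init step (verify against your graphrag version):

    ragtest/prompts/
      entity_extraction.txt        # entity and relationship extraction
      summarize_descriptions.txt   # merging entity descriptions
      community_report.txt         # community summary generation
      claim_extraction.txt         # optional claim extraction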
  • 00:11:44
    I'm going to be experimenting with GraphRAG a lot more, because I think it's a great framework that needs a lot more exploration, and there are some other implementations of GraphRAG as well, so we're going to look at some of them in subsequent videos. If that's something that interests you, make sure to subscribe to the channel. Thanks for watching, and as always, see you in the next one.
Tags
  • GraphRAG
  • Microsoft
  • LLM
  • Local Models
  • Llama 3
  • API Integration
  • Knowledge Graphs
  • Retrieval-Augmented Generation
  • Embedding Models
  • Prompt Engineering