Windows Handles Local LLMs… Before Linux Destroys It

00:13:05
https://www.youtube.com/watch?v=7RTXliAe4DI

Summary

TL;DR: The video examines LLM (Large Language Model) performance on different operating systems, specifically Windows and Linux. Using LM Studio, the host tests several models on an Nvidia GeForce RTX 5080 GPU. The results show that Linux, especially when running directly on the hardware, can be faster than Windows, with speeds of up to 173 tokens per second for Gemma 3 4B. WSL (Windows Subsystem for Linux) also performed well but was generally slower than bare-metal Linux. The video also highlights the importance of hardware and how different models affect performance.

Takeaways

  • 💻 Windows vs. Linux for LLMs: Linux is faster!
  • 🚀 Gemma 3 4B reached 173 tokens/second on Linux.
  • 🖥️ WSL delivers good performance, but not quite on par with bare-metal Linux.
  • 📊 Generation speed drops as more tokens are sent to the model.
  • 🔍 Models tested include Gemma, DeepSeek, and Qwen.
  • 🛡️ Surfshark offers security and anonymity online.
  • 📈 Performance varies between different GPUs.
  • 🔄 The largest models ran very slowly on Windows and WSL and failed to load on Ubuntu.
  • 📅 Watch for upcoming videos with more hardware tests.
  • 🤖 LLMs need powerful hardware for optimal performance.

Timeline

  • 00:00:00 - 00:05:00

    In this video we examine the performance of LLMs (Large Language Models) on different operating systems, including Windows and Linux. We start by running the Gemma 3 4B model on Windows with an Nvidia GeForce RTX 5080 GPU, reaching 102 tokens per second. The hardware configuration is also discussed, and it is noted that no Mac is included in this test. The goal is to keep the hardware identical for a fair comparison.

  • 00:05:00 - 00:13:05

    After testing on Windows, we switch to WSL (Windows Subsystem for Linux) and run the same model, reaching 108 tokens per second, which shows that WSL performance is nearly on par with Windows. We then switch to a full Linux installation (Ubuntu) and see a significant improvement, reaching up to 173 tokens per second. Comparisons across models show that Ubuntu generally has the edge, but the results vary with model size.

Video Q&A

  • Which operating system is faster for LLMs, Windows or Linux?

    In this test, Linux turns out to be faster than Windows.

  • What is LM Studio?

    LM Studio is an application for running and testing LLM models locally.

  • Which GPU was used in the test?

    An Nvidia GeForce RTX 5080 was used for the test.

  • What is WSL?

    WSL stands for Windows Subsystem for Linux, which allows Linux applications to run on Windows.

  • How does the number of tokens affect speed?

    In general, the more tokens you send to the model, the slower generation becomes.

  • What is the difference between WSL and real Linux?

    WSL runs Linux inside Windows rather than directly on the hardware, which can affect performance.

  • Which models were tested?

    Models such as Gemma 3 4B, DeepSeek R1, and Qwen2.5 Coder were tested.

  • What are the advantages of using Surfshark?

    Surfshark offers encryption, an ad blocker, and use on an unlimited number of devices.

  • What was the best LLM performance in the video?

    The best result was 173 tokens per second for Gemma 3 4B on Ubuntu.

  • What happened with the larger models?

    The largest (32B) models failed to load on Ubuntu, and on Windows and WSL they ran extremely slowly because part of the model was offloaded to the CPU.

Transcript (English)

  • 00:00:00
    All right. As software developers, we often have to deal with different kinds of environments like Windows, Linux, and macOS, and we want to run LLMs on those different environments. But the question that often comes up is: what's faster, Linux or Windows? Is Windows going to hinder your performance for LLMs? So, I've got LM Studio here. We're going to pick a model real quick. Let's do this one: Gemma 3 4B. I've got a bunch of them I'm going to run today, but we're going to start with that one just to get a baseline. I'm doing 34 layers out of 34 for GPU offload; we want to run everything on the GPU today. I'm going to check that through Task Manager here in Windows. I'm starting out with Windows, but we're going to do Linux too, don't worry. Taking a look at the GPU, I've got the Nvidia GeForce RTX 5080 with 16 GB of VRAM, and 6.7 GB of dedicated memory is being used by this model. It's loaded up, so let's just warm it up by saying hello. Now, I know some of you are going to say, "Hi? That's not a very good prompt, is it? It's not realistic." No, it's not realistic. It's only one token, but it gives us a good baseline; it's a relative comparison between the different machines. I'm also going to send other prompts, of course, but this is a good start with one token. For those of you not familiar with how LLMs work, the more tokens you send to the model, the slower it generally gets, so this is going to be the fastest possible output. And we're getting 102 tokens per second here with this model.

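A tokens-per-second figure like this can also be reproduced outside the LM Studio GUI. LM Studio can expose an OpenAI-compatible local server, by default on http://localhost:1234, and a short script can time a one-word prompt against it. The following is a minimal sketch under those assumptions, not the setup used in the video; the model identifier is a placeholder, and whether the response carries a usage block depends on the server configuration.

```python
# Minimal sketch: time a short prompt against LM Studio's OpenAI-compatible
# local server and estimate tokens per second. Assumes the server is running
# on the default address and that a model is already loaded; the model name
# below is a placeholder, not necessarily the identifier the video used.
import time
import requests

URL = "http://localhost:1234/v1/chat/completions"  # default LM Studio server address

payload = {
    "model": "gemma-3-4b",  # placeholder; use whatever identifier your install reports
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 256,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=120)
elapsed = time.perf_counter() - start
resp.raise_for_status()
data = resp.json()

# Most OpenAI-compatible servers report token counts in a "usage" block;
# fall back to a rough word count if it is missing.
tokens = data.get("usage", {}).get("completion_tokens")
if tokens is None:
    tokens = len(data["choices"][0]["message"]["content"].split())

print(f"{tokens} tokens in {elapsed:.2f}s -> ~{tokens / elapsed:.1f} tokens/sec")
```

Note that the elapsed time here includes prompt processing and HTTP overhead, so the number will read a little lower than the generation-only tokens-per-second figure LM Studio shows in its UI.
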
  • 00:01:24
    Now, as far as hardware goes, I have a couple of different pieces that I've been testing. This is the 5090, and this is the RTX Pro 6000, which is brand new; I just got it and haven't had a chance to test it yet. Stay tuned for those videos for sure, but the point today is not to compare the hardware. In fact, if you do want to see videos comparing these cards against each other, let me know in the comments down below. I bought all the different Nvidia varieties, and no, they did not send them to me. I kind of wish they had, but no, I was out there hunting for these things. This one was especially hard to get, but I'm getting off topic. Another thing about hardware: notice I don't have a Mac here today. I have plenty of videos comparing Macs with these cards, so check those out if that's what you're interested in. Today I want to keep the hardware exactly the same, which is the point of this exercise. I'm using the exact same machine, which I've got right over here: my AI machine that I built a month or so ago.

  • 00:02:29
    I swapped out the 4090 for the 5090, and now I'm on the 5080 in there, so we're going to keep it at the 5080. The next thing I want to do is test this in WSL, because that's a valid environment where people do their work: it's Linux inside Windows. If you don't know what that is, or if you've never used it, it's basically a full Linux kernel running inside Windows, but it's not a virtual machine; it lives directly inside the kernel (I believe; somebody will correct me in the comments, I'm sure). The point is that the performance degradation is supposed to be very, very small; we're not even supposed to notice it. Well, let's see. I installed LM Studio here as well. Let me quit the Windows one; just remember, 102 tokens per second for Gemma 3 4B. I want you to take a guess right now what's going to happen when I run the Linux version of that in WSL. Notice I quit the other one, and yes, I have my GUI enabled here, so there's a little extra overhead going on. And there it is. You can see we're still getting GPU offload of 34 layers out of 34. Even though this is running inside WSL, it's communicating properly with our GPU; in fact, the latest version of WSL supports GPU offload when you install it, so we're good to go with that. I'm going to load that up, and we can prove it by going to Task Manager in Windows: you'll see the model is now taking up GPU memory. We're good to go.

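One practical aside, not shown in the video: before loading a model inside WSL you can confirm that the GPU is actually visible from the Linux side, since LM Studio's GPU offload depends on it. A minimal sketch, assuming WSL 2 with a Windows Nvidia driver that includes WSL support, which is what exposes nvidia-smi inside the distro:

```python
# Rough sketch: confirm the Nvidia GPU is visible from inside WSL 2 before
# loading a model. nvidia-smi is exposed to WSL by the Windows driver; if the
# command is missing, GPU offload in LM Studio will not work either.
import shutil
import subprocess

def gpu_visible() -> bool:
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found: GPU passthrough is not set up in this WSL distro")
        return False
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print("nvidia-smi failed:", result.stderr.strip())
        return False
    print(result.stdout.strip())  # e.g. "NVIDIA GeForce RTX 5080, <driver>, 16303 MiB, 6700 MiB"
    return True

if __name__ == "__main__":
    gpu_visible()
```

If the command is missing or errors out, LM Studio will typically fall back to CPU inference, which is exactly the slow path you want to avoid.
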
  • 00:03:57
    Let's say hi here. Boom, and there it is: what, 108 tokens per second? Of course, that's within the margin of error, but that's pretty good. Now we're down to 82 because I said hi again, and the second turn takes the entire conversation as context, so it's going to be a little slower; that's to be expected. Let's start a new conversation and say hi: 118 tokens per second. We're just getting faster and faster here. One more time: 103. Okay, so you see the range there; we're somewhere between 100 and 118.

  • 00:04:33
    I want to go back to Windows now and do it again. But if you're in WSL and you want to use LM Studio or the LM Studio CLI, the CLI is pretty cool, by the way: you can query everything and load up models through the terminal programmatically. In fact, I recently did a members video showing how to do all of that. As an aside, thank you to the members of the channel for supporting the channel; members get extra videos, live streams, and things like that. If you want to join, there's a join button right down below, and if you just subscribe, that's free and you get all these videos. All right, let's close this one out. I just want to redo LM Studio one more time in Windows. There's our model, 34 layers out of 34. I'm going to create a new chat and say hi real quick. Boom: 115. So, as you can see, between WSL and Windows there's not much of a difference, so you can keep working in the Linux environment in WSL. But I know what you're thinking right now. I know what you're thinking. You're saying, "Uh, Alex, this is not real Linux." Well, it kind of is; it mostly is. But the thing is, it's not living on the hardware, right? It's not directly on the metal. So let's do that. Let's have Linux be on the metal.

  • 00:05:45
    Next week I'll be at Google I/O, shoulder to shoulder with thousands of developers on the same sketchy public Wi-Fi. My source code and my NAS login are staying private, so Surfshark comes with me. One tap, and every packet (git push, docker pull, Slack ping) shoots through an AES-256 encrypted tunnel. CleanWeb is their ad-blocking service, and it quietly nukes the trackers that slow pages down. Surfshark keeps no logs (Deloitte literally checked this), and the servers run on RAM only, so a reboot wipes everything. My favorite perk: unlimited devices. I run it here on my MacBook, on the mini PC media server back at the office, and on my phone, so I can upload my B-roll from the hotel without ISP throttling. Go to surfshark.com/allexiscant or use code alexiskin at checkout to get four extra months of Surfshark VPN, and of course there's a 30-day money-back guarantee, so it's risk-free to try.

  • 00:06:41
    Now I'm going to shut down Windows and get into Ubuntu. And by the way, it is pronounced Ubuntu; that's the right way. All right, I want to show you that I have all these models here, and I wrote a little program that uses the LM Studio CLI to automatically load them up one at a time and run some prompts against them, not just a short little prompt but something a bit longer. I'm going to run that, and I have not seen the results yet, so I'm really curious what the results will be across all of these, because different models are going to behave slightly differently. This was Gemma 3, and we saw that it's pretty much the same, but different model architectures, like Qwen, Llama, or Gemma, will behave slightly differently. Let's go ahead and restart this machine.

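He doesn't show the script itself, so here is only a hedged sketch of what such an automation could look like. It assumes the `lms` command-line tool that ships with LM Studio is on PATH and the local server is running on its default port; the model keys, the prompt, and the exact flags are illustrative, not the ones used in the video, and should be checked against `lms --help`.

```python
# Hedged sketch of a "benchmark every model" loop in the spirit of what the
# video describes. Assumes the `lms` CLI is on PATH and the local server is
# running on the default port; model keys and the prompt are placeholders.
import subprocess
import time
import requests

SERVER = "http://localhost:1234/v1/chat/completions"
MODELS = ["gemma-3-4b", "deepseek-r1-distill-qwen-7b"]  # placeholder model keys
PROMPT = "Explain the difference between a process and a thread in one paragraph."

for key in MODELS:
    subprocess.run(["lms", "load", key], check=True)      # load the model via the CLI
    start = time.perf_counter()
    resp = requests.post(SERVER, json={
        # Note: the API model name may differ from the CLI download key; adjust as needed.
        "model": key,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 512,
    }, timeout=600)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()
    tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
    print(f"{key}: {tokens} tokens in {elapsed:.1f}s -> ~{tokens / elapsed:.1f} tok/s")
    subprocess.run(["lms", "unload", "--all"], check=True)  # free VRAM before the next model
```

Unloading between runs keeps each measurement independent, so one model's weights never compete for VRAM with the next.
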
  • 00:07:28
    Did it freeze? Come on, what happened here? Restart. Boom. I miss the old dedicated restart buttons, or reset buttons as they used to be called. Do you remember those? Come on, I was just in you, Ubuntu. These things happen; that's how it goes. When you expect something to work, it doesn't. Good thing this is not live. Reboot. We're getting closer, though. I feel like this reboot is going to be the one. One eternity later...

  • 00:07:55
    All right, I just downloaded the model again, and here I am in Ubuntu on the metal with LM Studio installed. Let's make it a little bit larger. Whoa, whoa, whoa. Right over here: Gemma 3 4B. Boom. Notice it's detecting our Nvidia GPU. How do we monitor that? Well, there is no Task Manager here per se, but there's nvidia-smi, which shows me that we're on the right driver (the latest one), and there's another little tool called nvtop that we can run; hopefully we'll be able to see the activity. There's a memory chart right here, and right now it shows 1.28 out of 16 GB of VRAM being used, and the GPU activity is zero. So let's load up this model. Create a new chat, load up the model, Gemma 3, and load. There it is: there's our memory going up, to about 6.7 GB. That was actually pretty fast; it seemed faster than on Windows. Let's say hi to our little model here. Boom. What? Holy smokes. I take it back: all you people who said things about Linux being faster, you might have been right. Let's do that again. Hi. 170 tokens per second. Oh my goodness, this is a ridiculous speedup over Windows. 173 for the same model. I'm just blown away; this is crazy. All you people who are working in Windows, I know that you have to, I know that people have to. And you Linux people, don't yell at the Windows people, because they have to. Okay, but wow, this is crazy.

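The nvtop memory chart he keeps an eye on can also be approximated from a script. A small sketch, not from the video, that polls nvidia-smi once a second while a model loads; it assumes a single GPU and that nvidia-smi is on PATH:

```python
# Small sketch: poll GPU memory once per second, roughly what the nvtop memory
# chart shows while LM Studio loads a model. Assumes a single GPU; stop with Ctrl+C.
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

try:
    while True:
        first_gpu = subprocess.check_output(QUERY, text=True).strip().splitlines()[0]
        used, total = first_gpu.split(", ")
        print(f"VRAM: {int(used):>6} MiB / {int(total)} MiB")
        time.sleep(1)
except KeyboardInterrupt:
    pass
```

Watching the used figure climb to roughly the model's size is a quick way to confirm the weights really landed in VRAM rather than in system RAM.
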
  • 00:09:40
    Now, let's take a look at the chart. It takes all the models I have installed in LM Studio and uses the LM Studio CLI to iterate over each of them: pull them in, load them up, run a prompt, note down the tokens per second, and report it. So, let's see what we got. For WSL, here's DeepSeek R1 Distill Qwen 7B, and we're at 122 tokens per second. Let's find the same one on Windows: here it is, 143, so quite a bit more. And what about Ubuntu? Here it is: 135. So in this case, for some reason, it's a little lower on Ubuntu. There was a significant jump for Gemma 3 4B, but for DeepSeek R1 it was slower; again, that's roughly within the margin of error, so for this model it doesn't seem to matter much. Let's look at a model that's a bit more useful. For Gemma 3 12B, on Ubuntu we're getting 83 tokens per second, 71 on Windows (slower), and 66 in WSL, so a pretty consistent drop from Ubuntu at the top to WSL at the bottom. Qwen2.5 Coder 14B: 66 in WSL, 70 in Windows, and there it is, 74.9, almost 75, in Ubuntu. Ubuntu seems to have a slight edge; not as huge a jump as we saw initially, but still. Let's look at Gemma 3 1B: Ubuntu, 288 tokens per second; Windows, 307, which is faster. Holy cow. Notice Llama 3.2 1B went up to 367 tokens per second on Windows, and Gemma 3 1B hit 193 in WSL. Wow, that's still fast, but the spread between WSL, Windows, and Ubuntu is staggering for the smaller models. Also notice that on Ubuntu the larger models just completely failed to run. On Windows, those larger models, the 32-billion-parameter ones, are really large and barely fit, if they fit at all, inside the VRAM of this 16 GB card, and they ran really, really slowly. What that means is that LM Studio offloaded part of the model to the CPU, making it extremely slow. We don't want that in real life, but it happened here. And the QwQ 32-billion-parameter model, this Q4 quant (try saying that fast ten times; not going to happen), is 18.49 GB on disk, so it's even larger once loaded. I can't believe it ran, but it did; the Windows version of LM Studio allowed it to. Let's see what happened in WSL: yeah, look at that, it ran in WSL 2 as well, giving us an even lower score, 1.7 tokens per second. Unacceptable, but it ran, which is pretty incredible. So, luckily, the comparison here stands on its own, because everything is relative to the same machine.

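A rough back-of-the-envelope check explains the CPU spill he describes: compare the quantized model's size, plus some overhead for the KV cache and buffers, against the card's VRAM. This is a sketch with assumed numbers; the 1.2x overhead factor and the roughly 2.5 GB figure for a Q4 Gemma 3 4B are illustrative estimates, while 18.49 GB for QwQ 32B Q4 is the on-disk size quoted in the video.

```python
# Back-of-the-envelope check: will a quantized model fit in VRAM with room for
# context? The ~1.2x overhead factor (weights + KV cache + buffers) and the
# Gemma size below are rough assumptions, not measured values.
def fits_in_vram(model_size_gb: float, vram_gb: float, overhead: float = 1.2) -> bool:
    return model_size_gb * overhead <= vram_gb

for name, size_gb in [("Gemma 3 4B Q4 (approx.)", 2.5), ("QwQ 32B Q4", 18.49)]:
    verdict = "fits on GPU" if fits_in_vram(size_gb, 16.0) else "spills to CPU/system RAM"
    print(f"{name}: {size_gb} GB on disk -> {verdict} on a 16 GB card")
```

With 18.49 GB of weights against 16 GB of VRAM, some layers have no choice but to live in system RAM, which is why the 32B runs collapse to a few tokens per second.
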
  • 00:12:30
    Now, if you want to see how this card does, definitely subscribe and stay tuned for that. I'm also working on another video with the AMD Ryzen AI Max+ 395, a couple of those, including this new little machine right here. That video might be done already, and if it is, it'll be up here. And this one is another video you might want to watch. So, thanks for watching, and I'll see you in the next video.

Tags
  • LLM
  • Windows
  • Linux
  • LM Studio
  • GPU
  • Nvidia
  • WSL
  • performance
  • tokens
  • Gemma