Windows Handles Local LLMs… Before Linux Destroys It

00:13:05
https://www.youtube.com/watch?v=7RTXliAe4DI

Summary

TL;DR: The video examines LLM (Large Language Model) performance on different operating systems, specifically Windows and Linux. Using LM Studio, the host tests several models on an Nvidia GeForce RTX 5080 GPU. The results show that Linux, especially when running directly on the hardware, can be faster than Windows, with speeds of up to 173 tokens per second for Gemma 3 4B. WSL (Windows Subsystem for Linux) also performed well but was generally slower than bare-metal Linux. The video also highlights the importance of hardware and how different models affect performance.

Takeaways

  • 💻 Windows vs. Linux for LLMs: Linux is faster!
  • 🚀 Gemma 3 4B reached 173 tokens/second on Linux.
  • 🖥️ WSL delivers good performance, but not quite on par with bare-metal Linux.
  • 📊 Generation speed drops as more tokens are sent to the model.
  • 🔍 Models tested include Gemma, DeepSeek, and Qwen.
  • 🛡️ Surfshark offers security and anonymity online.
  • 📈 Performance varies between different GPUs.
  • 🔄 The largest models ran very slowly on Windows and WSL and failed to load on Ubuntu.
  • 📅 Watch for upcoming videos with more hardware tests.
  • 🤖 LLMs need powerful hardware for optimal performance.

Timeline

  • 00:00:00 - 00:05:00

    In this video we examine the performance of LLMs (Large Language Models) on different operating systems, including Windows and Linux. We start by running the Gemma 3 4B model on Windows with an Nvidia GeForce RTX 5080 GPU, reaching 102 tokens per second. The hardware configuration is also discussed, and it is noted that no Mac is included in this test. The goal is to keep the hardware identical for a fair comparison.

  • 00:05:00 - 00:13:05

    After testing on Windows, we switch to WSL (Windows Subsystem for Linux) and run the same model, reaching 108 tokens per second, which shows that WSL performance is nearly on par with Windows. We then switch to a full Linux installation (Ubuntu) and see a significant improvement, reaching up to 173 tokens per second. Comparisons across models show that Ubuntu generally has the edge, but the results vary with model size.

Video Q&A

  • Which operating system is faster for LLMs, Windows or Linux?

    In this test, Linux turns out to be faster than Windows.

  • What is LM Studio?

    LM Studio is an application for running and testing LLM models locally.

  • Which GPU was used in the test?

    An Nvidia GeForce RTX 5080 was used for the test.

  • What is WSL?

    WSL stands for Windows Subsystem for Linux, which allows Linux applications to run on Windows.

  • How does the number of tokens affect speed?

    In general, the more tokens you send to the model, the slower generation becomes.

  • What is the difference between WSL and real Linux?

    WSL runs Linux inside Windows rather than directly on the hardware, which can affect performance.

  • Which models were tested?

    Models such as Gemma 3 4B, DeepSeek R1, and Qwen2.5 Coder were tested.

  • What are the advantages of using Surfshark?

    Surfshark offers encryption, an ad blocker, and use on an unlimited number of devices.

  • What was the best LLM performance in the video?

    The best result was 173 tokens per second for Gemma 3 4B on Ubuntu.

  • What happened with the larger models?

    The largest (32B) models failed to load on Ubuntu, and on Windows and WSL they ran extremely slowly because part of the model was offloaded to the CPU.

Transcript (English)

  • 00:00:00
    All right. As software developers, we often have to deal with different kinds of environments like Windows, Linux, and macOS, and we want to run LLMs on those different environments. But the question that often comes up is: what's faster, Linux or Windows? Is Windows going to hinder your performance for LLMs? So, I've got LM Studio here. We're going to pick a model real quick. Let's do this one: Gemma 3 4B. I've got a bunch of them I'm going to run today, but we're going to start with that one just to get a baseline. I'm doing 34 layers out of 34 for GPU offload; we want to run everything on the GPU today. I'm going to check that through Task Manager here in Windows. I'm starting out with Windows, but we're going to do Linux too, don't worry. Taking a look at the GPU, I've got the Nvidia GeForce RTX 5080 with 16 GB of VRAM, and 6.7 GB of dedicated memory is being used by this model. It's loaded up, so let's just warm it up by saying hello. Now, I know some of you are going to say, "Hi? That's not a very good prompt, is it? It's not realistic." No, it's not realistic. It's only one token, but it gives us a good baseline; it's a relative comparison between the different machines. I'm also going to send other prompts, of course, but this is a good start with one token. For those of you not familiar with how LLMs work, the more tokens you send to the model, the slower it generally gets, so this is going to be the fastest possible output. And we're getting 102 tokens per second here with this model.

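A tokens-per-second figure like this can also be reproduced outside the LM Studio GUI. LM Studio can expose an OpenAI-compatible local server, by default on http://localhost:1234, and a short script can time a one-word prompt against it. The following is a minimal sketch under those assumptions, not the setup used in the video; the model identifier is a placeholder, and whether the response carries a usage block depends on the server configuration.

```python
# Minimal sketch: time a short prompt against LM Studio's OpenAI-compatible
# local server and estimate tokens per second. Assumes the server is running
# on the default address and that a model is already loaded; the model name
# below is a placeholder, not necessarily the identifier the video used.
import time
import requests

URL = "http://localhost:1234/v1/chat/completions"  # default LM Studio server address

payload = {
    "model": "gemma-3-4b",  # placeholder; use whatever identifier your install reports
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 256,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=120)
elapsed = time.perf_counter() - start
resp.raise_for_status()
data = resp.json()

# Most OpenAI-compatible servers report token counts in a "usage" block;
# fall back to a rough word count if it is missing.
tokens = data.get("usage", {}).get("completion_tokens")
if tokens is None:
    tokens = len(data["choices"][0]["message"]["content"].split())

print(f"{tokens} tokens in {elapsed:.2f}s -> ~{tokens / elapsed:.1f} tokens/sec")
```

Note that the elapsed time here includes prompt processing and HTTP overhead, so the number will read a little lower than the generation-only tokens-per-second figure LM Studio shows in its UI.
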
  • 00:01:24
    Now, as far as hardware goes, I have a couple of different pieces that I've been testing. This is the 5090, and this is the RTX Pro 6000, which is brand new; I just got it and haven't had a chance to test it yet. Stay tuned for those videos for sure, but the point today is not to compare the hardware. In fact, if you do want to see videos comparing these cards against each other, let me know in the comments down below. I bought all the different Nvidia varieties, and no, they did not send them to me. I kind of wish they had, but no, I was out there hunting for these things. This one was especially hard to get, but I'm getting off topic. Another thing about hardware: notice I don't have a Mac here today. I have plenty of videos comparing Macs with these cards, so check those out if that's what you're interested in. Today I want to keep the hardware exactly the same, which is the point of this exercise. I'm using the exact same machine, which I've got right over here: my AI machine that I built a month or so ago.

  • 00:02:29
    I swapped out the 4090 for the 5090, and now I'm on the 5080 in there, so we're going to keep it at the 5080. The next thing I want to do is test this in WSL, because that's a valid environment where people do their work: it's Linux inside Windows. If you don't know what that is, or if you've never used it, it's basically a full Linux kernel running inside Windows, but it's not a virtual machine; it lives directly inside the kernel (I believe; somebody will correct me in the comments, I'm sure). The point is that the performance degradation is supposed to be very, very small; we're not even supposed to notice it. Well, let's see. I installed LM Studio here as well. Let me quit the Windows one; just remember, 102 tokens per second for Gemma 3 4B. I want you to take a guess right now what's going to happen when I run the Linux version of that in WSL. Notice I quit the other one, and yes, I have my GUI enabled here, so there's a little extra overhead going on. And there it is. You can see we're still getting GPU offload of 34 layers out of 34. Even though this is running inside WSL, it's communicating properly with our GPU; in fact, the latest version of WSL supports GPU offload when you install it, so we're good to go with that. I'm going to load that up, and we can prove it by going to Task Manager in Windows: you'll see the model is now taking up GPU memory. We're good to go.

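One practical aside, not shown in the video: before loading a model inside WSL you can confirm that the GPU is actually visible from the Linux side, since LM Studio's GPU offload depends on it. A minimal sketch, assuming WSL 2 with a Windows Nvidia driver that includes WSL support, which is what exposes nvidia-smi inside the distro:

```python
# Rough sketch: confirm the Nvidia GPU is visible from inside WSL 2 before
# loading a model. nvidia-smi is exposed to WSL by the Windows driver; if the
# command is missing, GPU offload in LM Studio will not work either.
import shutil
import subprocess

def gpu_visible() -> bool:
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found: GPU passthrough is not set up in this WSL distro")
        return False
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version,memory.total,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print("nvidia-smi failed:", result.stderr.strip())
        return False
    print(result.stdout.strip())  # e.g. "NVIDIA GeForce RTX 5080, <driver>, 16303 MiB, 6700 MiB"
    return True

if __name__ == "__main__":
    gpu_visible()
```

If the command is missing or errors out, LM Studio will typically fall back to CPU inference, which is exactly the slow path you want to avoid.
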
  • 00:03:57
    Let's say hi here. Boom, and there it is: what, 108 tokens per second? Of course, that's within the margin of error, but that's pretty good. Now we're down to 82 because I said hi again, and the second turn takes the entire conversation as context, so it's going to be a little slower; that's to be expected. Let's start a new conversation and say hi: 118 tokens per second. We're just getting faster and faster here. One more time: 103. Okay, so you see the range there; we're somewhere between 100 and 118.

  • 00:04:33
    I want to go back to Windows now and do it again. But if you're in WSL and you want to use LM Studio or the LM Studio CLI, the CLI is pretty cool, by the way: you can query everything and load up models through the terminal programmatically. In fact, I recently did a members video showing how to do all of that. As an aside, thank you to the members of the channel for supporting the channel; members get extra videos, live streams, and things like that. If you want to join, there's a join button right down below, and if you just subscribe, that's free and you get all these videos. All right, let's close this one out. I just want to redo LM Studio one more time in Windows. There's our model, 34 layers out of 34. I'm going to create a new chat and say hi real quick. Boom: 115. So, as you can see, between WSL and Windows there's not much of a difference, so you can keep working in the Linux environment in WSL. But I know what you're thinking right now. I know what you're thinking. You're saying, "Uh, Alex, this is not real Linux." Well, it kind of is; it mostly is. But the thing is, it's not living on the hardware, right? It's not directly on the metal. So let's do that. Let's have Linux be on the metal.

  • 00:05:45
    Next week I'll be at Google I/O, shoulder to shoulder with thousands of developers on the same sketchy public Wi-Fi. My source code and my NAS login are staying private, so Surfshark comes with me. One tap, and every packet (git push, docker pull, Slack ping) shoots through an AES-256 encrypted tunnel. CleanWeb is their ad-blocking service, and it quietly nukes the trackers that slow pages down. Surfshark keeps no logs (Deloitte literally checked this), and the servers run on RAM only, so a reboot wipes everything. My favorite perk: unlimited devices. I run it here on my MacBook, on the mini PC media server back at the office, and on my phone, so I can upload my B-roll from the hotel without ISP throttling. Go to surfshark.com/allexiscant or use code alexiskin at checkout to get four extra months of Surfshark VPN, and of course there's a 30-day money-back guarantee, so it's risk-free to try.

  • 00:06:41
    Now I'm going to shut down Windows and get into Ubuntu. And by the way, it is pronounced Ubuntu; that's the right way. All right, I want to show you that I have all these models here, and I wrote a little program that uses the LM Studio CLI to automatically load them up one at a time and run some prompts against them, not just a short little prompt but something a bit longer. I'm going to run that, and I have not seen the results yet, so I'm really curious what the results will be across all of these, because different models are going to behave slightly differently. This was Gemma 3, and we saw that it's pretty much the same, but different model architectures, like Qwen, Llama, or Gemma, will behave slightly differently. Let's go ahead and restart this machine.

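He doesn't show the script itself, so here is only a hedged sketch of what such an automation could look like. It assumes the `lms` command-line tool that ships with LM Studio is on PATH and the local server is running on its default port; the model keys, the prompt, and the exact flags are illustrative, not the ones used in the video, and should be checked against `lms --help`.

```python
# Hedged sketch of a "benchmark every model" loop in the spirit of what the
# video describes. Assumes the `lms` CLI is on PATH and the local server is
# running on the default port; model keys and the prompt are placeholders.
import subprocess
import time
import requests

SERVER = "http://localhost:1234/v1/chat/completions"
MODELS = ["gemma-3-4b", "deepseek-r1-distill-qwen-7b"]  # placeholder model keys
PROMPT = "Explain the difference between a process and a thread in one paragraph."

for key in MODELS:
    subprocess.run(["lms", "load", key], check=True)      # load the model via the CLI
    start = time.perf_counter()
    resp = requests.post(SERVER, json={
        # Note: the API model name may differ from the CLI download key; adjust as needed.
        "model": key,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 512,
    }, timeout=600)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()
    tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
    print(f"{key}: {tokens} tokens in {elapsed:.1f}s -> ~{tokens / elapsed:.1f} tok/s")
    subprocess.run(["lms", "unload", "--all"], check=True)  # free VRAM before the next model
```

Unloading between runs keeps each measurement independent, so one model's weights never compete for VRAM with the next.
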
  • 00:07:28
    Did it freeze? Come on, what happened here? Restart. Boom. I miss the old dedicated restart buttons, or reset buttons as they used to be called. Do you remember those? Come on, I was just in you, Ubuntu. These things happen; that's how it goes. When you expect something to work, it doesn't. Good thing this is not live. Reboot. We're getting closer, though. I feel like this reboot is going to be the one. One eternity later...

  • 00:07:55
    All right, I just downloaded the model again, and here I am in Ubuntu on the metal with LM Studio installed. Let's make it a little bit larger. Whoa, whoa, whoa. Right over here: Gemma 3 4B. Boom. Notice it's detecting our Nvidia GPU. How do we monitor that? Well, there is no Task Manager here per se, but there's nvidia-smi, which shows me that we're on the right driver (the latest one), and there's another little tool called nvtop that we can run; hopefully we'll be able to see the activity. There's a memory chart right here, and right now it shows 1.28 out of 16 GB of VRAM being used, and the GPU activity is zero. So let's load up this model. Create a new chat, load up the model, Gemma 3, and load. There it is: there's our memory going up, to about 6.7 GB. That was actually pretty fast; it seemed faster than on Windows. Let's say hi to our little model here. Boom. What? Holy smokes. I take it back: all you people who said things about Linux being faster, you might have been right. Let's do that again. Hi. 170 tokens per second. Oh my goodness, this is a ridiculous speedup over Windows. 173 for the same model. I'm just blown away; this is crazy. All you people who are working in Windows, I know that you have to, I know that people have to. And you Linux people, don't yell at the Windows people, because they have to. Okay, but wow, this is crazy.

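The nvtop memory chart he keeps an eye on can also be approximated from a script. A small sketch, not from the video, that polls nvidia-smi once a second while a model loads; it assumes a single GPU and that nvidia-smi is on PATH:

```python
# Small sketch: poll GPU memory once per second, roughly what the nvtop memory
# chart shows while LM Studio loads a model. Assumes a single GPU; stop with Ctrl+C.
import subprocess
import time

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

try:
    while True:
        first_gpu = subprocess.check_output(QUERY, text=True).strip().splitlines()[0]
        used, total = first_gpu.split(", ")
        print(f"VRAM: {int(used):>6} MiB / {int(total)} MiB")
        time.sleep(1)
except KeyboardInterrupt:
    pass
```

Watching the used figure climb to roughly the model's size is a quick way to confirm the weights really landed in VRAM rather than in system RAM.
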
  • 00:09:40
    Now, let's take a look at the chart. It takes all the models I have installed in LM Studio and uses the LM Studio CLI to iterate over each of them: pull them in, load them up, run a prompt, note down the tokens per second, and report it. So, let's see what we got. For WSL, here's DeepSeek R1 Distill Qwen 7B, and we're at 122 tokens per second. Let's find the same one on Windows: here it is, 143, so quite a bit more. And what about Ubuntu? Here it is: 135. So in this case, for some reason, it's a little lower on Ubuntu. There was a significant jump for Gemma 3 4B, but for DeepSeek R1 it was slower; again, that's roughly within the margin of error, so for this model it doesn't seem to matter much. Let's look at a model that's a bit more useful. For Gemma 3 12B, on Ubuntu we're getting 83 tokens per second, 71 on Windows (slower), and 66 in WSL, so a pretty consistent drop from Ubuntu at the top to WSL at the bottom. Qwen2.5 Coder 14B: 66 in WSL, 70 in Windows, and there it is, 74.9, almost 75, in Ubuntu. Ubuntu seems to have a slight edge; not as huge a jump as we saw initially, but still. Let's look at Gemma 3 1B: Ubuntu, 288 tokens per second; Windows, 307, which is faster. Holy cow. Notice Llama 3.2 1B went up to 367 tokens per second on Windows, and Gemma 3 1B hit 193 in WSL. Wow, that's still fast, but the spread between WSL, Windows, and Ubuntu is staggering for the smaller models. Also notice that on Ubuntu the larger models just completely failed to run. On Windows, those larger models, the 32-billion-parameter ones, are really large and barely fit, if they fit at all, inside the VRAM of this 16 GB card, and they ran really, really slowly. What that means is that LM Studio offloaded part of the model to the CPU, making it extremely slow. We don't want that in real life, but it happened here. And the QwQ 32-billion-parameter model, this Q4 quant (try saying that fast ten times; not going to happen), is 18.49 GB on disk, so it's even larger once loaded. I can't believe it ran, but it did; the Windows version of LM Studio allowed it to. Let's see what happened in WSL: yeah, look at that, it ran in WSL 2 as well, giving us an even lower score, 1.7 tokens per second. Unacceptable, but it ran, which is pretty incredible. So, luckily, the comparison here stands on its own, because everything is relative to the same machine.

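A rough back-of-the-envelope check explains the CPU spill he describes: compare the quantized model's size, plus some overhead for the KV cache and buffers, against the card's VRAM. This is a sketch with assumed numbers; the 1.2x overhead factor and the roughly 2.5 GB figure for a Q4 Gemma 3 4B are illustrative estimates, while 18.49 GB for QwQ 32B Q4 is the on-disk size quoted in the video.

```python
# Back-of-the-envelope check: will a quantized model fit in VRAM with room for
# context? The ~1.2x overhead factor (weights + KV cache + buffers) and the
# Gemma size below are rough assumptions, not measured values.
def fits_in_vram(model_size_gb: float, vram_gb: float, overhead: float = 1.2) -> bool:
    return model_size_gb * overhead <= vram_gb

for name, size_gb in [("Gemma 3 4B Q4 (approx.)", 2.5), ("QwQ 32B Q4", 18.49)]:
    verdict = "fits on GPU" if fits_in_vram(size_gb, 16.0) else "spills to CPU/system RAM"
    print(f"{name}: {size_gb} GB on disk -> {verdict} on a 16 GB card")
```

With 18.49 GB of weights against 16 GB of VRAM, some layers have no choice but to live in system RAM, which is why the 32B runs collapse to a few tokens per second.
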
  • 00:12:30
    Now, if you want to see how this card does, definitely subscribe and stay tuned for that. I'm also working on another video with the AMD Ryzen AI Max+ 395, a couple of those, including this new little machine right here. That video might be done already, and if it is, it'll be up here. And this one is another video you might want to watch. So, thanks for watching, and I'll see you in the next video.

Tags
  • LLM
  • Windows
  • Linux
  • LM Studio
  • GPU
  • Nvidia
  • WSL
  • performance
  • tokens
  • Gemma