00:00:00
All right. As software developers, we often have to deal with different kinds of environments, like Windows, Linux, and macOS, and we want to run LLMs in those environments. But the question that often comes up is: what's faster, Linux or Windows? Is Windows going to hinder your LLM performance?
00:00:17
So, I've got LM Studio here. We're going to pick a model real quick. Let's do this one: Gemma 3 4B. I've got a bunch of models I'm going to run today, but we're going to start with that one just to get a baseline. I'm doing 34 layers out of 34 for GPU offload, because we want to run everything on the GPU today, and I'm going to check that through Task Manager here in Windows. I'm starting out with Windows, but we're going to do Linux too, don't worry. Taking a look at the GPU, I've got the NVIDIA GeForce RTX 5080 with 16 GB of VRAM, and 6.7 GB of dedicated memory is being used by this model.
00:00:52
It's loaded up. Let's just warm it up by saying hello. Now, I know some of you are going to say, "Hi? That's not a very good prompt, is it? It's not realistic." No, it's not realistic. It's only one token, but it gives us a good baseline, a relative comparison between the different machines. I'm also going to send other prompts, of course, but one token is a good start. And for those of you who aren't familiar with how LLMs work, the more tokens you send, the slower generation generally gets, so this is going to be the fastest possible output. And we're getting 102 tokens per second with this model.
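If you want to reproduce this kind of number yourself, outside the LM Studio chat window, here's a minimal sketch, assuming LM Studio's local server is running on its default port 1234 with the model already loaded; the model identifier and prompt are placeholders. Note that it times the whole request, so it includes prompt processing and will read a bit lower than the generation-only figure LM Studio shows.

    # Minimal tokens-per-second probe against LM Studio's OpenAI-compatible local server.
    # Assumes the server is running on the default port with a model already loaded;
    # the model id below is a placeholder for whatever your install reports.
    import time
    import requests

    URL = "http://localhost:1234/v1/chat/completions"
    payload = {
        "model": "gemma-3-4b-it",  # placeholder id
        "messages": [{"role": "user", "content": "hi"}],
        "max_tokens": 200,
    }

    start = time.time()
    response = requests.post(URL, json=payload, timeout=120)
    elapsed = time.time() - start
    response.raise_for_status()

    generated = response.json()["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.2f} s = {generated / elapsed:.1f} tok/s")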
00:01:27
Now, as far as hardware goes, I have a couple of different pieces that I've been testing. This is the 5090, and this is the RTX Pro 6000, which is brand new; I just got it and haven't had a chance to test it out yet. Stay tuned for those videos for sure, but the point today is not to compare the hardware. In fact, if you do want to see videos comparing these cards against each other, let me know in the comments down below. I bought all the different NVIDIA varieties, and no, they did not send them to me. I kind of wish they had, but no, I was out there hunting for these things; this one was especially hard to get. But I'm getting off topic. Another thing about hardware: notice I don't have a Mac here today. I have plenty of videos comparing Macs with these cards, so check those out if that's what you're interested in. Today, I want to keep the hardware exactly the same, which is the point of this exercise. I'm using the exact same machine, which I've got right over here: my AI machine that I built a month or so ago.
00:02:29
I swapped out the 4090 for the 5090, and now I've got the 5080 in there, so we're going to keep it at the 5080. The next thing I want to do is test this out in WSL, because that's a valid environment where people do their work: it's Linux inside Windows. If you don't know what that is, or you've never used it, it's basically a full Linux kernel running inside Windows, but it's not a full-blown virtual machine; the kernel is shipped and managed by Windows itself. I believe somebody will correct me in the comments, I'm sure. But the point is that the performance degradation is supposed to be very, very small, so small that we're not even supposed to notice it. Well, let's see.
00:03:07
I installed LM Studio here as well. Let me quit the Windows one first. Just remember: 102 tokens per second for Gemma 3 4B. And I want you to take a guess right now about what's going to happen when I run the Linux version of that in WSL. Notice I quit the other one. And yes, I have my GUI enabled here, so there's a little extra overhead going on. And there it is. Here you can see that we're still getting GPU offload of 34 layers out of 34. Even though this is running inside WSL, it's communicating properly with our GPU. In fact, when you install WSL, the latest version supports GPU offload, so we're good to go with that.
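As a quick sanity check that the GPU really is visible from inside WSL, rather than silently falling back to CPU, you can ask nvidia-smi from the Linux side. A small sketch, assuming the Windows NVIDIA driver with WSL support is installed:

    # Quick check that the NVIDIA GPU is exposed inside WSL.
    # If this fails, LM Studio will fall back to CPU inference.
    import subprocess

    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        print("GPU visible in WSL:", result.stdout.strip())
    else:
        print("No GPU visible; check the Windows NVIDIA driver and your WSL version.")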
00:03:46
I'm going to load that up. And we can prove this by just going to Task Manager in Windows now, and you'll see the model is now taking up some GPU memory. We're good to go. Let's say hi here.
00:03:57
Boom. And there it is. What? 108 tokens per second. Of course, that's within the margin of error, but that's pretty good. Now we're down to 82, because I said hi again, and the second iteration takes the entire conversation as context, so it's going to be a little slower. That's to be expected. Let's start a new conversation and say hi: 118 tokens per second. We're just getting faster and faster here. One more time: 103. Okay, so you see the range there; we're somewhere between 100 and 118.
00:04:34
I want to go back to Windows now and do it again. But if you're in WSL and you want to use LM Studio, there's also the LM Studio CLI, which is pretty cool, by the way: you can query everything and load up models through the terminal programmatically. In fact, I recently did a members video showing how to do all that.
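For a taste of what driving LM Studio from the terminal looks like, here's a minimal sketch that lists downloaded models and loads one by name. It assumes the lms CLI that ships with LM Studio is on your PATH, and the model key is a placeholder; copy the real one from the listing.

    # Minimal sketch: query and load LM Studio models from a script via the lms CLI.
    # Assumes `lms` (bundled with LM Studio) has been added to PATH.
    import subprocess

    # List the models you have downloaded.
    listing = subprocess.run(["lms", "ls"], capture_output=True, text=True, check=True)
    print(listing.stdout)

    # Load one by its identifier (placeholder; use a key from the listing above).
    subprocess.run(["lms", "load", "gemma-3-4b-it"], check=True)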
00:04:51
As an aside, thank you to the members of the channel for your support. Members get extra videos, live streams, and things like that. So, if you want to join, there's a join button right down below. But also, if you subscribe, that's free, and you get all these videos.
00:05:03
All right, let's close this one out. I just want to go back and run LM Studio one more time in Windows. There's our model, 34 layers out of 34, and I'm going to create a new chat. Let's say hi real quick. Boom: 115. So, as you can see, between WSL and Windows there's not much of a difference, so you can happily work in the Linux environment in WSL. But I know what you're thinking right now. I mean, I know what you're thinking. You're saying, "Uh, Alex, this is not real Linux." Well, it kind of is. It mostly is. But the thing is, it's not living on the hardware, right? It's not directly on the metal. So, let's do that. Let's have Linux on the metal.
00:05:45
Next week, I'll be at Google I/O, shoulder to shoulder with thousands of developers on the same sketchy public Wi-Fi. My source code and my NAS login are staying private, so Surfshark comes with me. One tap, and every packet, git push, docker pull, and Slack ping shoots through an AES-256 encrypted tunnel. CleanWeb is their ad-blocking service, and it quietly nukes the trackers that slow pages down. Surfshark keeps no logs; Deloitte literally checked this, and the servers run on RAM only, so a reboot wipes everything. My favorite perk: unlimited devices. I run it here on my MacBook, on the mini PC media server back at the office, and on my phone, so I can upload my B-roll from the hotel without ISP throttling. Go to surfshark.com/alexziskind or use code alexziskind at checkout to get four extra months of Surfshark VPN, and of course a 30-day money-back guarantee, so it's risk-free to try.
00:06:43
Now, I'm going to shut down Windows and boot into Ubuntu. And by the way, it is pronounced Ubuntu. That's the right way. Not Ubuntu, not Ubuntu; it's Ubuntu.
00:06:52
All right, I want to show you that I have all these models here, and I wrote a little program that, using the LM Studio CLI, automatically loads each one up in turn and runs some prompts against it; not just a short little prompt, but something a bit longer.
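The exact script isn't shown in the video, but here's a rough sketch of the idea: shell out to the lms CLI to load each model, then time a longer prompt through the local server's OpenAI-compatible endpoint. The model list, the unload flag, and the prompt are placeholders and assumptions, so check lms --help for the exact commands on your install.

    # Rough sketch of a per-model benchmark loop (not the exact script from the video).
    # Assumes the `lms` CLI is on PATH and the LM Studio server is listening on :1234.
    import subprocess
    import time
    import requests

    MODELS = ["gemma-3-4b-it", "qwen2.5-coder-14b-instruct"]  # placeholder identifiers
    PROMPT = "Explain the difference between a process and a thread."
    URL = "http://localhost:1234/v1/chat/completions"

    for model in MODELS:
        subprocess.run(["lms", "unload", "--all"], check=False)  # assumed flag; see `lms --help`
        subprocess.run(["lms", "load", model], check=True)

        start = time.time()
        response = requests.post(URL, json={
            "model": model,
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 512,
        }, timeout=600)
        elapsed = time.time() - start

        tokens = response.json()["usage"]["completion_tokens"]
        print(f"{model}: {tokens / elapsed:.1f} tok/s (wall clock, includes prompt processing)")

Averaging a few runs per model would also smooth out the warm-up variance we saw earlier.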
And I'm going to run that now. I have not seen the results yet, so I'm really curious to see what they'll be across all of these, because different models are going to act slightly differently. This was Gemma 3, and we saw that it's pretty much the same, but models with different architectures, like Qwen, Llama, or Gemma, will behave slightly differently. And let's go ahead and restart this machine.
00:07:28
Did it freeze? Come on. What happened here? Restart. Boom. I miss the old dedicated restart buttons, or reset buttons as they used to be called. Do you remember those? Come on, I was just in you, Ubuntu. These things happen; that's how it goes. When you expect something to work, it doesn't. Good thing this is not live. Reboot. We're getting closer, though. I feel like this reboot is going to be the one. One eternity later...
00:07:57
All right, I just downloaded the model again. And here I am in Ubuntu, on the metal, and I have LM Studio installed. Let's make it a little bit larger. Whoa, whoa, whoa, whoa, whoa. Right over here: Gemma 3 4B. Boom. And notice this is detecting our NVIDIA GPU.
00:08:14
How do we monitor that? Well, there is no Task Manager here per se, but there's nvidia-smi, which shows me that we're on the right driver; it's the latest driver. But there's another little tool here called nvtop that we can run, and hopefully we'll be able to see the activity. There's a memory chart right here, and right now it's showing us 1.28 out of 16 GB of VRAM being used. And there's the GPU activity, which is zero right now.
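If you'd rather log this than eyeball nvtop, here's a tiny sketch that polls VRAM usage once a second, for example while a model is loading; it only assumes nvidia-smi is installed, which it is alongside the driver.

    # Poll VRAM usage once a second, e.g. while LM Studio loads a model.
    import subprocess
    import time

    for _ in range(30):
        usage = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(usage)
        time.sleep(1)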
00:08:44
So, let's load up this model. Create a new chat, load up the model, Gemma 3, boom, and load. And you'll see, there it is: there's our memory going up. We're at about 6.7 GB. That was actually pretty fast; it seemed faster than on Windows. Let's say hi to our little model here. Boom. What? Holy smokes. I take it back. All you people who said things about Linux being faster, you might have been right. Let's do that again.
00:09:16
Hi. 170 tokens per second. Oh my goodness. This is a ridiculous speedup over Windows. 173 for the same model. I'm just blown away. This is crazy. All you people who are working in Windows, I know that you have to; I know that people have to. And you Linux people, don't yell at the Windows people, because they have to. Okay, but wow, this is crazy.
00:09:42
Now, let's take a look at the chart. It comes from that script, which takes all the models I have installed in LM Studio, uses the LM Studio CLI to iterate over each of them, pull them in, load them up, run a prompt against each one, and note down the tokens per second. So, let's see what we got.
00:09:58
For WSL, here's DeepSeek R1 Distill Qwen 7B, and we're at 122 tokens per second. Let's find that same one on Windows. Here it is: 143, so quite a bit more. But what's that on Ubuntu? Here it is: 135. So in this case, for some reason, it's a little bit lower on Ubuntu. There was a significant jump for Gemma 3 4B, but DeepSeek R1 was slower. Again, it's roughly within the margin of error, so for this model it doesn't seem to matter that much.
00:10:30
Let's take a look at a model that's a little bit more useful. For Gemma 3 12B, on Ubuntu we're getting 83 tokens per second, 71 on Windows, so slower on Windows, and 66 in WSL. So, a pretty consistent drop there, from Ubuntu being the highest to WSL being the lowest. Qwen 2.5 Coder 14B: 66 in WSL, 70 on Windows, and there it is, 74.9, almost 75, on Ubuntu. Ubuntu seems to have a slight edge; not as huge a jump as we saw initially, but still.
00:11:02
Let's take a look at Gemma 3 1B. Ubuntu: 288 tokens per second. Windows: 307, faster. Holy cow. Notice that Llama 3.2 1B went up to 367 tokens per second on Windows, and Gemma 3 1B got 193 in WSL. Wow, that's still fast. But the difference between WSL, Windows, and Ubuntu here is staggering for the smaller models.
00:11:30
Notice that on Ubuntu, the larger models just completely failed. On Windows, those larger models, the 32-billion-parameter ones, are really large; they barely fit, if they fit at all, inside the VRAM of that video card, which is a 16 GB card, and they ran really, really slowly. What that means is that LM Studio offloaded some of the model to the CPU, making it extremely slow. We don't want to do that in real life, but it happened here.
00:11:59
Then there's the QwQ 32-billion-parameter model, the Q4 quant. Try saying that fast ten times; that's not going to happen. That model is 18.49 GB on disk, so it's going to be even larger when you load it up. I can't believe it ran, but it did; the Windows version of LM Studio allowed it to. Let's see what happened in WSL for those. Yeah, look at that: it ran in WSL 2 as well, giving us an even lower score, 1.7 tokens per second. Unacceptable, but it ran, which is pretty incredible. So luckily, the comparison here is a standalone comparison, because everything is relative to the same hardware.
00:12:31
Now, if you want to see how this card does, definitely subscribe and stay tuned for that. I'm also working on another video with the AMD Ryzen AI Max+ 395, a couple of those, including this new little machine right here. The video might be done already, and if it is, it's going to be up here. And this one is another one you might want to watch. So, thanks for watching, and I'll see you in the next video.