Running machine learning models is a parallel task that certain types of processors are really good at. CPUs are not great at running things in parallel, so running models on CPUs is very slow. But GPUs are great at parallel processing, and that's why they're commonly used for this. Dedicated GPUs like the RTX 4090 and its brothers and sisters are traditionally really fast, but they're also really expensive and use a lot of power. That's why Apple silicon emerged as the next best thing for consumers to run at home: it can be used to run local LLMs for a lot cheaper. Yeah, and I can't believe I'm saying this about Apple, but it's actually a lot cheaper to run a fully blown-out MacBook Pro than a couple of RTX 4090s, and that's both the initial cost and the ongoing operating costs.

I'm getting to the cluster, okay? Just give me a moment. I just want to mention that besides the GPU, if you want to run larger models, you're going to need more memory, or RAM. In Apple silicon terms that's unified memory, which means the CPU and GPU can use the same memory; you might already know this. So you can have a Mac mini like this with 64 GB of RAM, while even the most expensive consumer Nvidia card only has 24, and that's $2,000 right there.

Now, how do you use this stuff for machine learning? To actually write machine learning code, you're going to want to use a framework or a library. For example, you might have heard of the old TensorFlow; then there's PyTorch, and those will run on Apple silicon machines, and on Nvidia and other hardware as well. But not too long ago a new framework emerged that's optimized for Apple silicon, and it's called MLX. It was released in 2023, I think, by Apple's machine learning team, and it's supposed to squeeze more juice out of Apple silicon chips for machine learning; in benchmarks it's been shown to perform better than PyTorch. So essentially it's Apple's answer to Nvidia's CUDA, and it allows something like this tiny Mac mini to run MLX with pretty good performance.
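Just to make that concrete, here's a minimal MLX sketch (mine, not from the video) that also shows the unified-memory point from earlier; mlx.core, the stream argument, and mx.eval are MLX's documented API:

```python
import mlx.core as mx

# Arrays live in unified memory, so the CPU and GPU can
# both operate on them without any copying.
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

c = a @ b                        # runs on the default device (the GPU)
d = mx.add(a, b, stream=mx.cpu)  # same arrays, computed on the CPU

# MLX is lazy; eval() forces the computation.
mx.eval(c, d)
print(c.shape, d.shape)
```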
Now, since machine learning takes advantage of parallelism and is parallel in nature, running several Macs in parallel should theoretically distribute the load even better, right? Well, theoretically it should, and that's why I set up this cluster: to try it out and answer some questions. For example, is it faster to run models if machines are clustered? I also wanted to reassure myself that extending a machine with another machine will give me more capability to run larger models. These are the basic questions that get addressed by setting up a distributed system, or cluster, like this. You can set up MLX with distributed communication, but as you can see, it's a bit of a setup, and there's definitely some tuning and tweaking involved.
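For reference, this is roughly what MLX's distributed layer looks like; mx.distributed.init() and all_sum are part of MLX's documented API, and the launch details (MPI or MLX's launcher, plus a host file) are where the tuning and tweaking come in:

```python
import mlx.core as mx

# Initialize the distributed group; this expects the process to have
# been started by a launcher such as mpirun or mlx.launch.
world = mx.distributed.init()

# Each node computes a partial result...
x = mx.ones((1024,)) * world.rank()

# ...and all_sum aggregates it across every node in the group.
total = mx.distributed.all_sum(x)
mx.eval(total)

print(f"node {world.rank()} of {world.size()}: sum = {total[0].item()}")
```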
That's why I've already made a couple of videos about this thing. It's called EXO, and it's really easy to set up and get going. I already made a video setting this up, and I'll link to it down below, but this thing is evolving very quickly, so the steps in that video are already a little bit outdated; still, it gives you the idea of where to go and what to do. Anyway, EXO wraps up that difficulty of distributed computing in a nice, simple package for you. By the way, this video is not sponsored by EXO, or MLX, or Apple; I just really like this stuff. However, I do want to thank the members of the channel: your support means a lot, and it goes towards me purchasing these kinds of things and doing these kinds of setups, so thank you.

So, first of all, these are not all the same machine. These are the specs of the machines: there are two M4 Pro machines in there and three M4 machines with slightly different configurations. There are, however, two base models, the $600 machines, and I wanted to see if running just two base models is any better than running one M4 machine with twice the RAM, 32 gigs. So 16 + 16 versus one with 32, and also 16 + 16 versus the base model M4 Pro. Now, what's incredible here is that all of these are running right now, and I have all the screens up here. As you can see, there's stuff running on these machines; they're not just sitting idle. Well, they're not doing much, but they have a few things running on them, like a browser, a terminal, Activity Monitor, things like that. And we're using 28 watts of power for all these machines: very low power usage.

There are multiple ways of hooking these up. These are actually connected via Thunderbolt Bridge, and through some experimentation I've discovered that, yes, you can run everything through Wi-Fi or through LAN, but it's faster when you're doing it through Thunderbolt Bridge. On the surface that kind of makes sense, but you actually have to test it, because what's theoretically supposed to happen is that each machine downloads a chunk of the model to work on based on its capabilities (this is determined automatically by EXO), but in reality there is communication during the model run, which can be influenced by the network connection. To set that up, I go to Network and then Thunderbolt Bridge and make sure I've configured my IP addresses manually on all the machines: I have 192.168.10.10, .20, .30, .40, and .50. Then, also under Thunderbolt Bridge, under Hardware, I've set up jumbo packets. Now, I haven't tested this yet, but theoretically, with jumbo packets the packets are larger, sending more information across and therefore decreasing the processing power required to handle each packet. Makes sense; got to test it out, maybe in a different video. Let me know if you're interested in that.
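If you'd rather do that configuration from the command line, macOS's networksetup tool can handle both pieces. A hedged sketch, assuming the service is named "Thunderbolt Bridge" and the addressing scheme above; your service name and interface may differ:

```python
import subprocess

# Assign a manual IP to the Thunderbolt Bridge service
# (equivalent to System Settings > Network > Thunderbolt Bridge).
subprocess.run([
    "networksetup", "-setmanual", "Thunderbolt Bridge",
    "192.168.10.10",   # this node's address; .20/.30/... on the others
    "255.255.255.0",   # subnet mask
], check=True)

# Enable jumbo frames by raising the MTU on the bridge interface.
# "bridge0" is the usual Thunderbolt Bridge device, but verify yours
# with `networksetup -listallhardwareports` first.
subprocess.run(["networksetup", "-setMTU", "bridge0", "9000"], check=True)
```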
Now, a couple of you have noticed that I have five Mac minis, but each Mac mini has only three Thunderbolt ports. So if this is my home node down here at the bottom, then I can only connect it to three other Mac minis, and that's why I added a Thunderbolt hub. I do have a Thunderbolt 5 hub here, but I've tested it against the little one I had before, which is a Thunderbolt 4 one, and so far I haven't noticed any difference between them as far as the networking side of things goes. I want to dig deeper into Thunderbolt connectivity in a different video, but that's not this one. Even though I'm using all Thunderbolt 5 cables here, and two of these machines are capable of Thunderbolt 5, let's just assume it's a Thunderbolt 4 system, which is still an improvement over regular Wi-Fi and LAN.

Anyway, we're going to start out with some smaller models as proof of concept, and I'm going to run EXO here. So this is Llama 3.2, the 1-billion-parameter one. It's small and it's fast. Let's say hello to it, and we get 90 tokens per second. Write a story... so yeah, if it's sustained, it's about 73 tokens per second. That's pretty fast, pretty good. So about 70 tokens sustained, and this is running on the M4 Mac mini base model; no problem fitting it in there, because it's a tiny, tiny model.
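EXO also exposes a ChatGPT-compatible HTTP API alongside its web UI, so you can drive these same tests from a script. A minimal sketch, assuming the node address from the Thunderbolt Bridge setup, EXO's default port, and a model id of "llama-3.2-1b"; check exo's startup output for the actual port and model names:

```python
import time
import requests

start = time.time()
resp = requests.post(
    "http://192.168.10.10:52415/v1/chat/completions",  # assumed node address and default EXO port
    json={
        "model": "llama-3.2-1b",
        "messages": [{"role": "user", "content": "Write a story"}],
    },
)
elapsed = time.time() - start

body = resp.json()
text = body["choices"][0]["message"]["content"]
# Rough throughput estimate: completion tokens over wall-clock time,
# falling back to a word count if no usage field is returned.
tokens = body.get("usage", {}).get("completion_tokens", len(text.split()))
print(f"~{tokens / elapsed:.1f} tokens/sec")
print(text)
```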
Now for comparison, running this on the M4 Pro chip gives me close to 100 tokens per second: 96, 95, 94. So, quite a bit better, and that gives you a baseline for the difference between the two machines, because those two chips have different memory bandwidths. It's the memory bandwidth, not the size: size doesn't matter here, we've taken that out of the equation, because it's a 1-billion-parameter model, which is small. Here the memory bandwidth is what shows through, and memory bandwidth plays a huge role in how quickly these things generate tokens. I made a whole video on bandwidth; you can check it out, I'll link to it down below.
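The back-of-envelope version of that claim: generating one token streams essentially all the weights through memory once, so bandwidth divided by model footprint gives a throughput ceiling. A sketch using Apple's published bandwidth figures for these chips (120 GB/s for the M4, 273 GB/s for the M4 Pro) and an assumed footprint for a 4-bit 1B model:

```python
# Rough token-throughput ceiling: memory bandwidth / bytes read per token.
model_bytes = 1e9 * 4 / 8 + 0.2e9   # 1B params at 4 bits, plus assumed overhead

for chip, bandwidth in [("M4", 120e9), ("M4 Pro", 273e9)]:
    print(f"{chip}: ~{bandwidth / model_bytes:.0f} tokens/sec ceiling")

# Prints ~171 for the M4 and ~390 for the M4 Pro. Real numbers land well
# below the ceiling because of compute and framework overhead, but the
# M4 Pro's extra bandwidth is why it pulls ahead at the same model size.
```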
All right, so going back to our example of running on one base model and getting 70 tokens per second: I'm now going to start up EXO on two base models. So now you'll see that I have two nodes running here, a Mac mini with 16 gigs and another Mac mini with 16 gigs. Together, not too much power, but let's see if we get a speed improvement. I'm going to run the same Llama 3.2 1B, and we're getting 45 tokens per second. Considerably worse this time around, and this is happening because of the connections on the back: both base models are going through the Thunderbolt hub. I found out that this plays a role, and unfortunately it's a negative one.

So now I've connected the two machines directly to each other with Thunderbolt. All right, ready? Watch this: 87. Write a story... 82, 83, 87, 99, 100 tokens per second. It was 73 sustained when it was running on one machine, and now we're getting 95. Same model, same prompt, when it's two machines connected together via Thunderbolt, not Wi-Fi.

So let's see if we can get the same kind of result using one of the other machines; we're looking to beat 95 tokens per second here. I'm going to head over to Mac mini 4, because that's the one that's also an M4 chip but has 32 gigs of RAM, and let's run that. All right, we're getting 89 here. Let's do a sustained run... I have a feeling it's going to be... yeah, there it is: 73-74 tokens per second. Clearly the amount of RAM didn't have any effect on this; we're still dealing with the M4 chip. So this kind of proves that we're limited by the M4 chip and not the RAM in this case, and we can conclude that one base model M4 Pro machine will give you similar performance, as far as this model and tokens per second go, as two M4 base models.

And finally, machine 3: write a story. We're getting about 104 tokens per second here, about 100, dipping around 97, 96... oh come on, make up your mind... we're down to 93. One other thing to keep in mind is that EXO is not without overhead. If you run MLX directly against Llama 3.2 1B, for example with 4-bit quantization, there it is, it's going pretty fast, and we're talking about 281 tokens per second here. Let's do it again to make sure, because that seems a little high... 280 tokens per second. That's pretty good. Remember, with EXO we were getting about 100, so if you're running on only one machine, that's something to keep in mind as well.
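That direct MLX run would look something like this with the mlx-lm package; load and generate are its documented entry points, and the 4-bit community conversion of Llama 3.2 1B is the kind of model id you'd pass (repo name assumed, verify it on the mlx-community Hugging Face page):

```python
from mlx_lm import load, generate

# Load a 4-bit quantized Llama 3.2 1B converted for MLX
# (model repo name assumed; check mlx-community on Hugging Face).
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

# verbose=True prints the generation along with tokens-per-second stats,
# which is where numbers like the ~280 tok/s above come from.
response = generate(
    model,
    tokenizer,
    prompt="Write a story",
    max_tokens=256,
    verbose=True,
)
```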
Now let's take it up to the max, really push this system hard, and see what kind of power usage we get when all five machines are going full bore. For that, I'm going to run a loop generating on all five machines at the same time.
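The loop itself can be as simple as hammering each node's API endpoint from a script. A hedged sketch against the ChatGPT-compatible endpoint, with the node addresses and port assumed to match the Thunderbolt Bridge setup from earlier:

```python
import threading
import requests

NODES = [f"http://192.168.10.{n}:52415" for n in (10, 20, 30, 40, 50)]  # assumed addresses/port

def hammer(node: str) -> None:
    # Keep each machine generating continuously.
    while True:
        requests.post(
            f"{node}/v1/chat/completions",
            json={
                "model": "llama-3.2-1b",
                "messages": [{"role": "user", "content": "Write a long story"}],
            },
        )

for node in NODES:
    threading.Thread(target=hammer, args=(node,), daemon=True).start()

input("Generating on all nodes; press Enter to stop.\n")
```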
Each of the M4 Pro machines takes up about 87 watts, and each of the M4 machines takes about 50, for a total of almost 200 watts, just over 200 now. So this is sustained power usage when everything is being utilized to the max, and we're still using less power than this. And this is interesting right here: the machine on the bottom. Huh, this might not be the best rack setup, folks. We're at almost 40 degrees on the bottom machine, and the other ones seem quite a bit cooler. This one right here, machine number three, is the other M4 Pro machine, and this one, this one, and the top one are all M4 machines. So the M4s run a lot cooler than the M4 Pros, but the one on the bottom... I don't know if it's because that's the central hub and it's got more work to do with Thunderbolt, or because all the air is sort of blowing down, although, I don't know, it's getting dispersed, so I'm not sure why. I'm going to have to blame the Thunderbolt connectivity for this one; let me know in the comments if you think otherwise.

Let's push this a little bit further to see how big a model we can run. Here is the base model Mac mini, and I want to switch this to, let's say, Qwen 2.5 Coder, 7 billion parameters. Code something... boom, and there it is. It's actually running the 7-billion-parameter Qwen model, which is supposed to be very good at coding, and it's giving me 21 tokens per second, which is pretty good. Write primes in JS... there it is, generating primes up to a certain limit in JavaScript at 20 tokens per second. Pretty good.

Now let's try the Qwen 2.5 Coder 32-billion-parameter model, which should definitely break it. Try again... and it started downloading the model, which makes perfect sense, because the model is not locally available on this machine. Now, the way it's set up by default, each machine gets a copy of the model. But if you start up a whole cluster without running anything first, five nodes or four nodes or whatever it may be, and then you run a model that might not fit, it's supposed to download only parts of that model to each machine. That's not the case here; I'm only running it on one right now, so it's going to be 17 GB, and it's going to take a little bit.
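That 17 GB figure lines up with the usual back-of-envelope estimate for quantized models, and the same arithmetic drives how a cluster could split layers in proportion to each node's RAM. A sketch of both, with the partitioning being my illustration of the idea rather than EXO's actual algorithm:

```python
def model_gb(params_b: float, bits: int, overhead: float = 1.1) -> float:
    """Approximate weight footprint: params * bits/8, plus ~10% overhead."""
    return params_b * bits / 8 * overhead

print(f"32B at 4-bit: ~{model_gb(32, 4):.1f} GB")   # ~17.6 GB
print(f"70B at 4-bit: ~{model_gb(70, 4):.1f} GB")   # ~38.5 GB

# Splitting a model's layers proportionally to each node's memory,
# which is the spirit of EXO's default partitioning (illustration only).
def split_layers(num_layers: int, node_ram_gb: list[int]) -> list[int]:
    total = sum(node_ram_gb)
    shares = [round(num_layers * ram / total) for ram in node_ram_gb]
    shares[-1] += num_layers - sum(shares)  # fix rounding drift
    return shares

# The five minis in this video: two 16 GB, one 24 GB, one 32 GB, one 64 GB.
print(split_layers(64, [16, 16, 24, 32, 64]))  # -> [7, 7, 10, 13, 27]
```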
Well, after a bunch of downloading, it finally did it: it's doing a 32-billion-parameter model. Code something... it's using the two base model Mac minis, but it's doing it kind of slow; we're going at about eight tokens per second here. So it's not super practical to run such a large model on two base Mac minis. And just for comparison, it is a little bit better when running on just one M4 Pro Mac mini: we're getting about 12 tokens per second there.
Time to try a big one. Let's go to Nemotron 70B. Hello... "Hello! It's nice to meet you. Is there something I can help you with?" So it's going kind of slow, 4.2 tokens per second, and that's running on the two most powerful machines I have, the M4 Pro down here and the M4 Pro over here. That's a total of 88 GB of RAM; my math is not that great, but you know, 64 + 24. We don't need that much RAM for this because it's a 4-bit quantization, but even at 4 bits it should be really fast, and it's at 4.9 tokens per second. Pretty unusable.
Okay, it's the moment of truth: this is when I get to run a model on all five machines. So I'm going to start this up on all five machines and go with something nice and easy, Llama 3.2 1B. Hello... it worked, the whole thing works. "Nice to meet you." 67 tokens per second. What's your name? Okay, 69, 72, 74 tokens per second. All right, this is actually working, and it's working pretty well: we've got a cluster of five nodes and it's going at 74 tokens per second. Up to 74, I should say; it ranges sometimes, but that's not too bad. But it's a small model, 1 billion parameters, and 74 is pretty much what we had to start with on one machine, which we can easily do, so I guess I should try a bigger model.

Because the Mac mini has only three Thunderbolt connections, I have to use that hub, and because I use the hub there's a little bit of network contention going on. These hubs are mostly made for displays, if you want to run multiple displays off of one Thunderbolt port; they're not really meant for connecting multiple computers to each other for networking. So the best possible scenario I can show you here is four machines, not five. I wanted to demonstrate that with all of these machines connected directly to each other through Thunderbolt, and we're going to go with Qwen 2.5 Coder 32 billion. Hello... "Hello, how can I assist you today?" Good, 16.4 tokens. Write some JS code to find primes... there we go, and there it is, it's going, giving me about 16, then 12 tokens per second on average as it writes the function. It's not terribly slow, but it's pretty slow. And while that's happening, you can see that all these machines, the four machines I should say, not five, are all using some of the GPU, up to probably 80% of it, and we're consuming just about 50 watts of power for all of this. So overall, not terrible. If you're going to use this with a smaller model, that would be ideal; the power savings are pretty tremendous here.

So why bother? What's the point? Can anyone explain why clustering a bunch of Macs is better than just having a PC with a GPU cluster? Well, in some ways it is, and in some ways it's not. For example, the Mac mini M4 has unified memory, which means on-chip memory can be used by the GPU, giving it up to 64 GB of GPU RAM. I already said this earlier in the video: the biggest consumer GPU has only 24 GB of RAM, unless you go for the A100 or the H100, which are very expensive. So theoretically, you can run bigger models on the minis.

Seeing that that's the case, why doesn't Musk just buy 100,000 Macs? Well, Musk has a lot of money that I don't have, and you probably don't have the kind of money Musk has either. Musk can afford to buy 100,000 $30,000 boards and probably pay hundreds of thousands of dollars, if not millions, in electricity costs to run them. So for those of you who were about to write in the comments that this is useless, this is pointless, why not just run a bunch of these: well, this is why. Depending on your use case, the machines you're using, and the model you're running, you will get wildly different results.

For now, this is a great concept and I love the idea, but so far I haven't found that it's better to run a cluster than to just get a MacBook Pro with the M4 Max and 128 gigs of RAM, which I'll also be making videos about and testing, so make sure you don't miss that. That machine has been shown to have insane performance for these types of applications. Of course, if you can afford to get two of those, you'll still come out cheaper than an equivalent setup of Nvidia cards. I can't swing that, so I'm just going to stick with one for now and hope for the best. Yeah, I'm going to sell my M2 Max anyway.

Thanks for watching. It's been fun showing you this tower, a fun experiment, but I don't think I'll actually use something like this in the long run, at least not yet; these are still very early days for this kind of technology. This setup will probably have some uses, especially if you combine four 64 GB machines, which I happen not to have, but Alex Cheema on Twitter showed off his setup, and if you want to follow project EXO along, I'll link to it down below as well so you can check it out. Anyway, I hope you have a good one, and I will see you in the next one.