M4 Mac Mini CLUSTER 🤯

00:18:06
https://www.youtube.com/watch?v=GBR6pHZ68Ho

Summary

TLDR: The video discusses the transition from CPUs to GPUs in machine learning and the capabilities of Apple Silicon. It introduces MLX, a framework optimized for Apple Silicon that improves machine learning performance. Models are run on various setups, including a cluster of Mac Minis, and sustained token-generation rates are compared across them. Challenges such as network contention when using Thunderbolt hubs and overall power consumption are also covered, ultimately questioning whether a cluster is more efficient than a single high-end MacBook Pro. In conclusion, while clustering has benefits, a single robust machine may be more practical in many scenarios.

Takeaways

  • ⚡️ GPU vs CPU: GPUs excel in parallel tasks, making them ideal for ML.
  • 💰 Apple Silicon: A cost-effective alternative to expensive GPUs.
  • 🧠 MLX Framework: Optimized for Apple Silicon, outperforms traditional frameworks.
  • 🔗 Thunderbolt Hubs: Can cause network contention, affecting performance.
  • 📊 Cluster Efficiency: Clustering multiple Macs shows varied results compared with a single high-performance machine.

Timeline

  • 00:00:00 - 00:05:00

    Running machine learning models is a highly parallel task, which is difficult for CPUs. GPUs such as the RTX 4090 offer better performance, but they are expensive and power-hungry. Apple's silicon architecture emerged to offer a cheaper and more energy-efficient alternative. Apple's M4 Pro model, with its unified memory, delivers better performance than Nvidia cards in some respects, but the base M4 model may not perform as well. The new MLX library is Apple's answer to Nvidia's CUDA, and it optimizes Apple hardware for machine learning.

  • 00:05:00 - 00:10:00

    Setting up the cluster raises the question of whether running models across several machines yields a speed benefit. Building a cluster normally requires manual configuration, but the EXO software makes the setup simple (a minimal client sketch follows this timeline). Testing the different M4 and M4 Pro configurations produced varied results: distributing the load improved performance, and model sharding and memory capacity played an important role. Routing the connections through a Thunderbolt hub surfaced problems, while direct Thunderbolt links showed potentially greater efficiency.

  • 00:10:00 - 00:18:06

    Cluster testing showed that running smaller models can be efficient, but larger models ran into higher power consumption and slower generation. Big models, such as one with 32 billion parameters, do not run easily, and there are serious performance gaps between the EXO and MLX libraries. Using all five devices in the cluster did not deliver the expected results compared to a single MacBook Pro, which offers more RAM and GPU performance at a lower price.
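The timeline's second entry mentions how EXO simplifies cluster setup. A concrete illustration: each Mac simply runs the `exo` command, the nodes discover each other automatically, and the cluster exposes a ChatGPT-compatible HTTP API. Below is a minimal client sketch; the node IP, port (52415), and model id are assumptions based on EXO's documentation at the time of writing, so check your version's README.

```python
# Hedged sketch: querying a running EXO cluster through its
# ChatGPT-compatible endpoint. Start `exo` on every node first;
# the address, port, and model id below are assumptions.
import requests

resp = requests.post(
    "http://192.168.10.10:52415/v1/chat/completions",  # assumed node address
    json={
        "model": "llama-3.2-1b",  # assumed model id
        "messages": [{"role": "user", "content": "write a story"}],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```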


Video Q&A

  • Why are GPUs better for running machine learning models?

    GPUs are optimized for parallel processing, making them much faster than CPUs for running machine learning tasks.

  • What is MLX?

    MLX is a machine learning framework optimized for Apple Silicon; in the video's benchmarks it outperforms traditional frameworks like PyTorch (a minimal usage sketch follows this Q&A list).

  • How does Apple Silicon compare to powerful GPUs like the RTX 4090?

    Apple Silicon offers competitive performance at a lower cost with lower power consumption for running machine learning models.

  • What issues arise from using a Thunderbolt Hub in a cluster?

    Using a Thunderbolt Hub can lead to network contention and reduced performance compared to direct connections.

  • Is it more efficient to cluster multiple Mac Minis or use a powerful MacBook Pro?

    In some cases, using a single powerful MacBook Pro with an M4 Max can be more efficient than clustering several Mac Minis.
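For reference, this is roughly what running a model directly on MLX looks like, the setup that reached about 280 tokens per second in the video without EXO's overhead. A minimal sketch, assuming `pip install mlx-lm` on an Apple Silicon Mac; the 4-bit Llama 3.2 1B model id is one plausible community conversion, not necessarily the exact one used in the video.

```python
# Minimal sketch of direct MLX inference via the mlx-lm package
# (single machine, no EXO overhead). Requires Apple Silicon.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
text = generate(model, tokenizer, prompt="write a story",
                verbose=True)  # verbose prints tokens-per-second stats
```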

Subtitles
  • 00:00:00
    running machine learning models is a
  • 00:00:01
    parallel task that certain types of
  • 00:00:03
    processes are really good at CPUs are
  • 00:00:06
    not great at running things in parallel
  • 00:00:07
    so running models on CPUs is very slow
  • 00:00:10
    but gpus are great at parallel
  • 00:00:12
    processing so that's why they're
  • 00:00:14
    commonly used for this and dedicated
  • 00:00:16
    gpus like the RTX 4090 and its brothers
  • 00:00:19
    and sisters are traditionally really
  • 00:00:22
    fast but they're also really expensive
  • 00:00:24
    and use a lot of power and that's why
  • 00:00:26
    Apple silicon architecture emerged as
  • 00:00:28
    the next best thing for consumers to run
  • 00:00:30
    at home that can be used to run local
  • 00:00:32
    llms for a lot cheaper yeah and I can't
  • 00:00:34
    believe I'm saying this about Apple it's
  • 00:00:37
    actually a lot cheaper to run fully
  • 00:00:38
    blown out MacBook Pro than a couple of
  • 00:00:40
    RTX 4090s that's the initial cost and the
  • 00:00:43
    ongoing operating costs I'm getting to
  • 00:00:45
    the cluster okay just just give me a
  • 00:00:47
    moment I just want to mention that
  • 00:00:48
    besides the GPU if you want to run
  • 00:00:50
    larger models you're going to need more
  • 00:00:51
    memory or Ram in apple silicon terms
  • 00:00:54
    that's unified memory which means that
  • 00:00:55
    the CPU and GPU can use the same memory
  • 00:00:57
    you might already know this so you can
  • 00:00:59
    have a Mac mini like this with 64 GB of
  • 00:01:02
    RAM and even the most expensive consumer
  • 00:01:05
    Nvidia card only has 24 and that's 2,000
  • 00:01:10
    bucks right there now how do you use
  • 00:01:11
    this stuff for machine learning well to
  • 00:01:13
    actually write machine learning code
  • 00:01:14
    you're going to want to use a framework
  • 00:01:16
    or a library for example you might have
  • 00:01:18
    heard of the old TensorFlow then
  • 00:01:20
    there's pytorch and those will run on
  • 00:01:21
    Apple silicon machines and Nvidia and
  • 00:01:24
    other Hardware as well but not too long
  • 00:01:26
    ago a new framework emerged that's
  • 00:01:28
    optimized for Apple silicon and that's
  • 00:01:30
    called mlx it was released I think in
  • 00:01:32
    2023 by Apple's machine learning team
  • 00:01:35
    and it's supposed to squeeze more juice
  • 00:01:36
    out of apple silicon chips for machine
  • 00:01:38
    learning and in benchmarks is showing
  • 00:01:40
    that it performs better than PyTorch so
  • 00:01:42
    essentially it's Apple's answer to
  • 00:01:44
    nvidia's Cuda and this allows something
  • 00:01:46
    like this tiny Mac Mini to run mlx and
  • 00:01:49
    have pretty good performance now since
  • 00:01:51
    machine learning takes advantage of
  • 00:01:52
    parallelism and it's parallel in nature
  • 00:01:54
    running several Macs in parallel should
  • 00:01:57
    theoretically even better distribute the
  • 00:01:59
    load right well theoretically it should
  • 00:02:02
    so that's why I set up the cluster to
  • 00:02:04
    try this out and answer some questions
  • 00:02:05
    for example is it faster to run models
  • 00:02:08
    if machines are clustered and I also
  • 00:02:09
    wanted to reassure myself that extending
  • 00:02:11
    a machine with another machine will give
  • 00:02:13
    me more capability to run larger models
  • 00:02:15
    these are the basic questions that are
  • 00:02:16
    addressed by setting up a distributed
  • 00:02:19
    system like this or a cluster you can
  • 00:02:21
    set up mlx with a distributed
  • 00:02:23
    communication but uh as you can see it's
  • 00:02:27
    a bit of a setup and there's definitely
  • 00:02:30
    some tuning and tweaking involved so
  • 00:02:33
    that's why I've already made a couple of
  • 00:02:34
    videos about this thing it's called EXO
  • 00:02:38
    and it's really easy to set up and get
  • 00:02:41
    going I already made a video setting
  • 00:02:43
    this up and I'll link to that down below
  • 00:02:45
    but this thing is evolving very quickly so
  • 00:02:46
    the steps in that video are already a
  • 00:02:48
    little bit outdated but it gives you the
  • 00:02:50
    idea of where to go and what to do
  • 00:02:52
    anyway EXO wraps up that uh difficulty
  • 00:02:55
    of distributed computing in a nice
  • 00:02:57
    simple package for you by the way this
  • 00:02:59
    video is not sponsored by them uh or mlx
  • 00:03:01
    or apple or EXO I just really like this
  • 00:03:04
    however I do want to thank the members
  • 00:03:05
    of the channel your support means a lot
  • 00:03:07
    and it goes towards me purchasing these
  • 00:03:09
    kinds of things and doing these kinds of
  • 00:03:10
    setups so thank you so first of all
  • 00:03:13
    these are not all the same machine these
  • 00:03:15
    are the specs of the machines there's
  • 00:03:16
    two M4 Pro machines in there and three
  • 00:03:19
    M4 machines with slightly different
  • 00:03:21
    configurations there are however two
  • 00:03:23
    base models the $600 machines in here
  • 00:03:26
    and I wanted to see if running just two
  • 00:03:29
    base models is going to be any better
  • 00:03:31
    than running one M4 machine with twice
  • 00:03:34
    the ram 32 gigs so 16 + 16 versus one
  • 00:03:38
    with 32 and also 16 + 16 versus the base
  • 00:03:42
    model M4 Pro now what's incredible here
  • 00:03:45
    is that all these are running right now
  • 00:03:47
    and I have all the screens up here and
  • 00:03:49
    as you can see there's stuff running on
  • 00:03:50
    these machines they're not just sitting
  • 00:03:52
    idly well they're not doing much but
  • 00:03:54
    they have a few things running on them
  • 00:03:55
    like the browser terminal uh activity
  • 00:03:57
    monitor things like that and we are
  • 00:03:59
    using 28 watts of power for all these
  • 00:04:02
    machines very low power usage there's
  • 00:04:05
    multiple ways of hooking these up these
  • 00:04:06
    are actually connected via Thunderbolt
  • 00:04:09
    bridge and through some experimentation
  • 00:04:11
    I've discovered that yes you can run
  • 00:04:14
    everything through Wi-Fi or through LAN
  • 00:04:16
    but it's faster when you're doing it
  • 00:04:18
    through Thunderbolt bridge I mean on the
  • 00:04:20
    surface that kind of makes sense but you
  • 00:04:21
    actually have to test it out because
  • 00:04:23
    what's theoretically supposed to happen
  • 00:04:25
    is each machine is going to download a
  • 00:04:27
    chunk of the model to work on based on
  • 00:04:30
    its capabilities this is determined
  • 00:04:31
    automatically by Exo but in reality
  • 00:04:33
    there is communication during the model
  • 00:04:35
    run which can be influenced by the
  • 00:04:37
    network connection so to set that up I
  • 00:04:40
    go to network and then Thunderbolt
  • 00:04:43
    Bridge make sure I configured my IP
  • 00:04:45
    addresses manually on all the machines I
  • 00:04:48
    have uh 192.168.10.10, 10.20, 10.30, 10.40, 10.50
  • 00:04:53
    and then also under Thunderbolt Bridge
  • 00:04:55
    Under Hardware I've set up jumbo packets
  • 00:04:57
    now I haven't tested this out yet
  • 00:04:59
    theoretically if you have jumbo packets
  • 00:05:01
    the packets are larger sending more
  • 00:05:03
    information across therefore decreasing
  • 00:05:05
    the processing power required to process
  • 00:05:07
    each packet makes sense got to test it
  • 00:05:09
    out different video maybe let me know if
  • 00:05:11
    you're interested in that now a couple
  • 00:05:13
    of you have noticed that I have five Mac
  • 00:05:15
    minis but each Mac Mini has only three
  • 00:05:18
    Thunderbolt ports so if this is my home
  • 00:05:20
    one down here at the bottom then I can
  • 00:05:22
    only run it to three other Mac minis so
  • 00:05:26
    that's why I added a thunderbolt Hub now
  • 00:05:28
    I do have a thunderbolt 5 Hub here but
  • 00:05:32
    um I've tested this one versus the
  • 00:05:34
    little one that I had before which is a
  • 00:05:35
    thunderbolt 4 one and I so far didn't
  • 00:05:37
    notice any difference between them as
  • 00:05:39
    far as the networking side of things now
  • 00:05:41
    I want to dig deeper into Thunderbolt
  • 00:05:43
    connectivity in a different video but
  • 00:05:44
    that's not for this one even though I'm
  • 00:05:45
    using all Thunderbolt 5 cables here and
  • 00:05:48
    two of these machines are capable with
  • 00:05:49
    Thunderbolt 5 let's just assume it's a
  • 00:05:51
    thunderbolt 4 system which has an
  • 00:05:53
    improvement over regular Wi-Fi and LAN
  • 00:05:56
    anyway we're going to start out with
  • 00:05:57
    some smaller models as proof of concept
  • 00:05:59
    and I'm going to run EXO here so this is
  • 00:06:02
    llama
  • 00:06:03
    3.2 1 billion parameter one and it's
  • 00:06:06
    small and it's fast let's say hello to
  • 00:06:08
    it and we get 90 tokens per second write
  • 00:06:13
    uh
  • 00:06:14
    story so yeah if it's sustained it's
  • 00:06:16
    about 73 tokens per second that's pretty
  • 00:06:19
    fast it's pretty good so about 70 tokens
  • 00:06:22
    sustained and this is running on the M4
  • 00:06:25
    Mac Mini base model no problem fitting
  • 00:06:27
    that in there because it's a tiny tiny
  • 00:06:29
    any
  • 00:06:31
    model now for comparison running this on
  • 00:06:34
    the M4 Pro chip is giving me a close to
  • 00:06:38
    100 tokens per second 96 95 94 so quite
  • 00:06:42
    a bit better just to give you a baseline
  • 00:06:43
    of the differences between the two
  • 00:06:44
    machines because those two chips have
  • 00:06:46
    different memory bandwidths and memory
  • 00:06:48
    bandwidth not the size here because the
  • 00:06:50
    Size Doesn't Matter we've taken that out
  • 00:06:52
    of the equation cuz it's a one billion
  • 00:06:53
    parameter model which is small here the
  • 00:06:55
    memory bandwidth is what's showing
  • 00:06:57
    through memory bandwidth plays a huge
  • 00:06:58
    role in how quickly these things
  • 00:07:00
    generate tokens [a back-of-the-envelope sketch follows the transcript] and I made a whole video
  • 00:07:02
    on bandwidth you can check it out I'll
  • 00:07:03
    link to it down below all right so going
  • 00:07:05
    back to our example with running it on
  • 00:07:07
    one base model and getting 70 tokens per
  • 00:07:10
    second I'm now going to start up EXO on
  • 00:07:13
    two base models so now you'll see that I
  • 00:07:16
    have two nodes running here Mac Mini 16
  • 00:07:19
    gigs and another Mac Mini with 16 gigs
  • 00:07:22
    together not too much power but let's
  • 00:07:24
    see if we get a speed Improvement so I'm
  • 00:07:26
    going to run the same llama 3 .21
  • 00:07:31
    billion and we're getting 45 tokens per
  • 00:07:34
    second considerably worse this time
  • 00:07:36
    around and this is happening because the
  • 00:07:39
    connections on the back both the base
  • 00:07:41
    models are going through the Thunderbolt
  • 00:07:43
    Hub I found out that this plays a role
  • 00:07:45
    and it's a negative consequence
  • 00:07:49
    unfortunately so now I've connected the
  • 00:07:51
    two machines directly to each other with
  • 00:07:53
    Thunderbolt all right ready watch
  • 00:07:56
    this 87
  • 00:07:59
    write a story 82 83 87 99 100 tokens
  • 00:08:06
    per second it was 73 sustained when it
  • 00:08:09
    was running on one machine now we're
  • 00:08:11
    getting 95 same model same prompt when
  • 00:08:14
    it's two machines connected together via
  • 00:08:17
    Thunderbolt not Wi-Fi so let's see if we
  • 00:08:19
    can get the same kind of result by using
  • 00:08:22
    one of the other machines look we're
  • 00:08:23
    looking to beat 95 tokens per second
  • 00:08:25
    here I'm going to head over to Mac Mini
  • 00:08:27
    4 cuz that's the one that that's also an
  • 00:08:30
    M4 chip but it has 32 gigs of RAM and
  • 00:08:33
    let's run
  • 00:08:34
    that all right all right so we're
  • 00:08:37
    getting 89 here let's do a
  • 00:08:41
    sustained I have a feeling it's going to
  • 00:08:42
    be yeah there it is so it's 73-74 tokens
  • 00:08:46
    per second clearly the amount of ram
  • 00:08:48
    didn't have any effect on this we're
  • 00:08:50
    still dealing with the M4 chip so this
  • 00:08:51
    kind of proves that we're limited by the
  • 00:08:53
    M4 chip and not the ram in this case so
  • 00:08:55
    we can kind of conclude that getting one
  • 00:08:57
    base model M4 Pro machine will give you
  • 00:09:00
    similar performance as far as this model
  • 00:09:02
    and tokens per second that two M4 base
  • 00:09:04
    models will give you and finally machine
  • 00:09:07
    3 write a story so we're getting about
  • 00:09:11
    104 tokens per second here about 100
  • 00:09:14
    dipping around 97 96 oh come on make up
  • 00:09:17
    your mind we're down to 93 one other
  • 00:09:19
    thing to keep in mind is EXO is not
  • 00:09:21
    without overhead if you run mlx directly
  • 00:09:24
    against llama 3.2 1 billion for example
  • 00:09:28
    with 4-bit uh quantization there it is
  • 00:09:30
    it's it's going pretty fast and we're
  • 00:09:33
    talking about 281 tokens per second here
  • 00:09:36
    let's do it again to make sure it's that
  • 00:09:39
    seems a little high 280 tokens per
  • 00:09:41
    second that's pretty good remember with
  • 00:09:44
    EXO we were getting about 100 so if
  • 00:09:45
    you're running only on one machine
  • 00:09:47
    that's something to keep in mind as well
  • 00:09:48
    now let's take it up to the max really
  • 00:09:50
    push this system hard and see what kind
  • 00:09:52
    of power usage it's going to have when all
  • 00:09:54
    five machines are just going full bore
  • 00:09:57
    for that I'm going to run a
  • 00:10:00
    loop generating on all five machines at
  • 00:10:04
    the same time each of the M4 Pro
  • 00:10:06
    machines takes up about 87 watts and
  • 00:10:08
    each of the M4 machines is taking up
  • 00:10:10
    about 50 for a total of almost 200 Watts
  • 00:10:13
    just over 200 now so this is sustained
  • 00:10:16
    power usage when everything is being
  • 00:10:18
    utilized to the max we're still using
  • 00:10:20
    less power than this and this is
  • 00:10:23
    interesting right here the machine on
  • 00:10:24
    the bottom
  • 00:10:28
    [Music]
  • 00:10:29
    H this might not be the best rack setup
  • 00:10:32
    folks we're at almost 40°C on the bottom
  • 00:10:35
    machine the other ones seem like they're
  • 00:10:38
    quite a bit cooler this is the other
  • 00:10:41
    this one right here machine number three
  • 00:10:42
    is the other M4 Pro machine and this one
  • 00:10:47
    this one and the top one are all M4
  • 00:10:49
    machines so the m4s are a lot cooler the
  • 00:10:52
    than the M4 Pros but the one on the
  • 00:10:54
    bottom I don't know if it's because
  • 00:10:57
    that's the central Hub and it's got more
  • 00:10:59
    work to do with Thunderbolt or because
  • 00:11:01
    all the air is sort of blowing down
  • 00:11:04
    although I don't know it's getting
  • 00:11:06
    dispersed so I'm not sure why I'm going
  • 00:11:08
    to have to blame the Thunderbolt
  • 00:11:09
    connectivity on this let me know in the
  • 00:11:11
    comments if you think otherwise let's
  • 00:11:13
    push this a little bit further to see
  • 00:11:15
    how big a model we can run so here is
  • 00:11:17
    the base model Mac Mini and I want to
  • 00:11:19
    switch this to let's say Qwen 2.5 coder
  • 00:11:24
    7 billion parameters code something boom
  • 00:11:28
    and there it is it's actually running
  • 00:11:30
    the 7 billion parameter Qwen model which
  • 00:11:32
    is supposed to be very good at coding
  • 00:11:34
    and it's giving me 21 tokens per second
  • 00:11:37
    which is pretty good
  • 00:11:39
    write primes
  • 00:11:42
    in js there it is it's generating primes up
  • 00:11:45
    to a certain limit in JavaScript 20
  • 00:11:47
    tokens per second pretty good now let's
  • 00:11:50
    try the Qwen 2.5 coder 32 billion
  • 00:11:54
    parameter model which should definitely
  • 00:11:56
    break try again and it started
  • 00:12:00
    downloading the model and that makes
  • 00:12:02
    perfect sense because the model is not
  • 00:12:04
    locally available on this machine now
  • 00:12:06
    the way it's set up by default is that
  • 00:12:08
    each machine will get a copy of the
  • 00:12:10
    model but if you start up a whole
  • 00:12:12
    cluster without running anything first
  • 00:12:14
    like five nodes or four nodes or
  • 00:12:16
    whatever it may be and then you run a
  • 00:12:18
    model that might not fit it's supposed
  • 00:12:20
    to only download parts of that model to
  • 00:12:23
    each machine but that's not the case
  • 00:12:24
    here I'm only running it on one right
  • 00:12:26
    now so it's going to be 17 gb 5 it's
  • 00:12:29
    going to take a little bit well after a
  • 00:12:31
    bunch of downloading it finally did it
  • 00:12:33
    it's it's doing a 32 billion parameter
  • 00:12:36
    model code something and it's using the
  • 00:12:39
    two base model Mac minis but it's doing
  • 00:12:41
    it kind of slow we're going about eight
  • 00:12:43
    tokens per second here so not super
  • 00:12:47
    practical to run such a large model
  • 00:12:50
    on two base Mac Minis and just for
  • 00:12:52
    comparison it is a little bit better
  • 00:12:54
    when running on just one M4 Pro Mac Mini
  • 00:12:58
    we're getting about 12 tokens per second
  • 00:13:00
    here time to try a big one let's go to
  • 00:13:03
    uh Nemotron
  • 00:13:05
    70b hello hello it's nice to meet you is
  • 00:13:10
    there something I can help you with so
  • 00:13:11
    it's going kind of slow 4.2 tokens per
  • 00:13:14
    second and that's running on the two
  • 00:13:16
    most powerful ones I have the M4 Pro
  • 00:13:19
    down here and the M4 Pro over here
  • 00:13:21
    that's a total of um 88 GB of RAM my
  • 00:13:24
    math is not that great but you know 64 +
  • 00:13:27
    24 we don't need that much RAM for this
  • 00:13:29
    because it's a 4 bit quantization but
  • 00:13:31
    even at 4 bits it should be really fast
  • 00:13:33
    it's 4.9 tokens per second pretty
  • 00:13:34
    unusable okay it's the Moment of Truth
  • 00:13:37
    this is when I get to run a model on all
  • 00:13:40
    five machines so I'm going to start this
  • 00:13:43
    up on all five machines let's go with
  • 00:13:46
    something nice and easy llama 3.2 1
  • 00:13:48
    billion
  • 00:13:52
    hello it worked the whole thing works
  • 00:13:55
    nice to meet
  • 00:13:58
    you 67 tokens per second what's your
  • 00:14:03
    name okay 69 72 74 tokens per second all
  • 00:14:08
    right this is actually working and it's
  • 00:14:10
    working pretty well we got a cluster of
  • 00:14:12
    five nodes and it's going 74 tokens per
  • 00:14:15
    second
  • 00:14:16
    high up to 74 I should say sometimes it
  • 00:14:20
    ranges but you know that's not too bad
  • 00:14:22
    but it's a small model 1 billion
  • 00:14:24
    parameters and 74 is pretty much what we
  • 00:14:27
    had to start with on one machine which
  • 00:14:29
    we can easily do I guess I should try a
  • 00:14:31
    bigger model so because the Mac Mini has
  • 00:14:33
    only three Thunderbolt connections I
  • 00:14:35
    have to use that Hub and because I use
  • 00:14:37
    the Hub there is a little bit of network
  • 00:14:40
    contention going on there these hubs are
  • 00:14:42
    mostly used for displays if you want to
  • 00:14:44
    have multiple displays off of one
  • 00:14:45
    Thunderbolt Port but they're not really
  • 00:14:47
    meant for connecting multiple computers
  • 00:14:49
    to each other for networking purposes so
  • 00:14:51
    that's why the best possible scenario
  • 00:14:53
    that I can show you here is having four
  • 00:14:55
    machines not five so I wanted to
  • 00:14:57
    demonstrate that while all the these
  • 00:14:59
    machines are connected directly to each
  • 00:15:01
    other through Thunderbolt and we're
  • 00:15:03
    going to go with Qwen 2.5 coder 32
  • 00:15:07
    billion
  • 00:15:08
    hello hello how can I assist you today
  • 00:15:11
    good 16.4 tokens write some js code to
  • 00:15:16
    find uh primes there we
  • 00:15:19
    go and there it is it's going and it's
  • 00:15:23
    giving me about 16 12 tokens per second
  • 00:15:26
    on average it's writing the function
  • 00:15:28
    it's not terribly slow but it's pretty
  • 00:15:31
    slow and while that's happening you can
  • 00:15:34
    see that all these machines the four
  • 00:15:36
    machines I should say not five they're
  • 00:15:39
    all using some of the GPU up to probably
  • 00:15:42
    80% of it and we're consuming just about
  • 00:15:46
    50 watts of power for all this so overall
  • 00:15:50
    not terrible if you're going to use this
  • 00:15:51
    on a smaller model that would be ideal
  • 00:15:53
    the power savings are pretty tremendous
  • 00:15:55
    here so why bother what's the point can
  • 00:15:57
    anyone explain why cluster a bunch of
  • 00:15:59
    Macs is better than just having a PC with
  • 00:16:00
    a GPU cluster well in some ways it is
  • 00:16:03
    and in some way it's not for example the
  • 00:16:05
    Mac Mini M4 has a unified memory which
  • 00:16:07
    means on chip memory can be used for the
  • 00:16:08
    GPU giving it up to 64 GB of GPU RAM I
  • 00:16:11
    already said this earlier in the video
  • 00:16:13
    the biggest consumer GPU has only 24 GB
  • 00:16:16
    of RAM unless you go for the a100 or the
  • 00:16:19
    h100 which are very expensive so
  • 00:16:22
    theoretically you can run bigger models
  • 00:16:23
    on the minis seeing that that's the case
  • 00:16:25
    why doesn't musk just buy 100,000 Macs
  • 00:16:28
    well musk has a lot of money I don't
  • 00:16:31
    have and you probably don't have that
  • 00:16:33
    kind of money that musk has so musk can
  • 00:16:35
    afford buying 100,000
  • 00:16:37
    $30,000 boards and probably paying
  • 00:16:40
    hundreds of thousands of dollars if not
  • 00:16:42
    millions of dollars to run them the
  • 00:16:44
    electricity costs so for those of you
  • 00:16:46
    that were about to write that in the
  • 00:16:47
    comments this is useless this is
  • 00:16:49
    pointless why not just run a bunch of
  • 00:16:51
    these well this is why so depending on
  • 00:16:53
    your use case the machines that you're
  • 00:16:54
    using and the model that you're running
  • 00:16:56
    you will get wildly different results
  • 00:16:59
    for now this is a great concept and I
  • 00:17:01
    love this idea but so far I haven't
  • 00:17:03
    found that it's better to run a cluster
  • 00:17:06
    than to just get a MacBook Pro with 128
  • 00:17:09
    gigs of RAM and the M4 Max with 128 gigs
  • 00:17:12
    of RAM which I'll also be making videos
  • 00:17:13
    about and testing so make sure you don't
  • 00:17:15
    miss that that has been shown to have
  • 00:17:18
    insane performance for these types of
  • 00:17:20
    applications of course if you can afford
  • 00:17:22
    it and get two of these then you'll
  • 00:17:25
    still be cheaper than getting an
  • 00:17:27
    equivalent setup in Nvidia cards I can't
  • 00:17:29
    swing that so I'm just going to stick
  • 00:17:31
    with one for now and uh hope for the
  • 00:17:33
    best yeah I'm going to sell my M2 Max
  • 00:17:36
    anyway thanks for watching it's been fun
  • 00:17:37
    showing you this Tower fun experiment
  • 00:17:39
    but uh I don't think I'm going to
  • 00:17:41
    actually use something like that in the
  • 00:17:43
    long run at least not yet this is still
  • 00:17:45
    very early days for this kind of
  • 00:17:47
    Technology this setup will probably have
  • 00:17:48
    some uses especially if you combine four 64
  • 00:17:52
    GB machines which I happen not to have
  • 00:17:55
    but Alex Cheema on Twitter he showed off
  • 00:17:58
    his setup so if you want to follow
  • 00:17:59
    project EXO along I'll link to it down
  • 00:18:01
    below as well you can check it out
  • 00:18:03
    anyway I hope you have a good one and I
  • 00:18:04
    will see you in the next one
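The transcript notes that MLX supports distributed communication directly but that it takes some setup, which is what EXO wraps. For the curious, a minimal sketch of that lower-level API, assuming a recent MLX build with the distributed module and a working backend (MPI or ring); how you launch it (`mlx.launch`, `mpirun`) varies by version.

```python
# Hedged sketch of MLX's distributed API -- the "bit of a setup"
# that EXO packages up. Run one copy per machine.
import mlx.core as mx

group = mx.distributed.init()      # join the process group
x = mx.ones(4) * group.rank()      # each node contributes its rank
total = mx.distributed.all_sum(x)  # summed across all nodes
mx.eval(total)                     # force the lazy computation
print(f"rank {group.rank()} of {group.size()}: {total}")
```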
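The transcript also stresses that memory bandwidth, not memory size, governs token rate for small models. A back-of-the-envelope check makes this concrete: each generated token streams roughly the entire quantized model through the GPU, so bandwidth divided by model size gives an upper bound. The bandwidth figures below are Apple's published specs (M4 about 120 GB/s, M4 Pro about 273 GB/s); treat the results as ceilings, not benchmarks.

```python
# Rough tokens-per-second ceiling from memory bandwidth: every token
# reads ~all model weights, so ceiling ≈ bandwidth / model size.
def max_tokens_per_sec(params_b: float, bits: int, bandwidth_gbs: float) -> float:
    model_gb = params_b * bits / 8        # GB of weights read per token
    return bandwidth_gbs / model_gb

print(max_tokens_per_sec(1.0, 4, 120))    # M4:     ~240 tok/s ceiling
print(max_tokens_per_sec(1.0, 4, 273))    # M4 Pro: ~546 tok/s ceiling
```

The 73-100 tokens per second observed under EXO and the roughly 280 under direct MLX sit well below these ceilings, consistent with framework overhead dominating for a 1-billion-parameter model.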
Tags
  • Machine Learning
  • Apple Silicon
  • GPUs vs CPUs
  • MLX Framework
  • Cluster Computing
  • Power Consumption
  • Performance Testing
  • Parallel Processing