00:00:00
Hey, I'm Dave. Welcome to my shop. I'm
00:00:03
Dave Plummer, a retired software engineer
00:00:04
from Microsoft, going back to the MS-DOS
00:00:06
and Windows 95 days. And today, we're
00:00:09
going to venture into the world of
00:00:10
modern memory architecture, but with a
00:00:12
twist. Because while everybody's busy
00:00:14
talking about raw CPU core counts and
00:00:16
GPU teraflops, there's something even
00:00:18
more foundational lurking under the hood
00:00:20
that makes or breaks your system's real
00:00:22
world performance, especially in
00:00:24
mixed-use creative or technical
00:00:25
workflows. And that something is memory:
00:00:28
how it's accessed, how it's shared, how
00:00:30
fast it is, and who gets to use how much
00:00:32
of it at a time. And to make that
00:00:34
exploration interesting, we're going to
00:00:35
do what I'm known for doing, pitting two
00:00:37
radically different platforms
00:00:39
head-to-head. And then I'll share some
00:00:40
comparison benchmarks towards the end
00:00:42
once we understand the platforms better.
00:00:45
On the one side, we've got the sleek,
00:00:46
streamlined M2 Ultra Mac Pro from Apple,
00:00:49
featuring 128 GB of what Apple calls
00:00:52
unified memory. On the other, we've got
00:00:54
GMKtec NucBox, a sleek 16-core Ryzen
00:00:57
desktop whose AMD APU provides Radeon
00:01:01
8060S integrated graphics, but
00:01:03
running on shared DDR5 system memory. They
00:01:05
both use integrated graphics. They can
00:01:07
both edit video. They both run
00:01:09
productivity apps, but beyond those
00:01:11
surface similarities, their memory
00:01:12
systems may as well come from two
00:01:14
different planets, and that's what we're
00:01:15
going to explore today. So, buckle up
00:01:17
because this is going to be a deep dive
00:01:19
into bandwidth, bus width, cache
00:01:21
coherency, and a bit of silicon
00:01:22
wizardry. Let's start with the Apple
00:01:24
side of the fence. Apple's M2 Ultra and
00:01:27
their Pro systems, and indeed, most of
00:01:29
Apple silicon is built around something
00:01:30
called a unified memory architecture, or
00:01:32
UMA. The M2 Ultra is a behemoth of a
00:01:35
chip, with up to 24 CPU cores, 60 GPU cores, and
00:01:38
a neural engine to boot. But what sets
00:01:41
it apart isn't just what's on the chip.
00:01:42
It's how the memory is arranged around
00:01:44
it. Instead of slapping some SO-DIMMs
00:01:46
on a motherboard and calling it a day,
00:01:48
Apple took the bold step of integrating
00:01:50
the memory directly onto the chip
00:01:51
package using a silicon interposer. That
00:01:54
means the LPDDR5 memory modules aren't
00:01:56
somewhere off in the weeds or on the
00:01:58
bus. They're right next to the SoC. And
00:02:01
not in a close enough kind of way, but
00:02:03
in a shared substrate with a thousand
00:02:05
pin connection kind of way. We're
00:02:06
talking about a 1024-bit memory bus
00:02:09
capable of delivering up to 800 GB per
00:02:11
second of bandwidth. That's not a typo.
00:02:13
800 gigabytes per second. If that number
00:02:16
doesn't make your jaw drop, let me put
00:02:18
it this way. That's 8 to 10 times the
00:02:20
bandwidth you'd typically find in a
00:02:22
modern Ryzen desktop running dual
00:02:23
channel
00:02:24
DDR5. So, what does all the extra
00:02:26
bandwidth and proximity actually buy
00:02:28
you? Well, in Apple's design, all the
00:02:30
components, the CPU, the GPU, the NPU,
00:02:33
and even the image signal processor
00:02:35
share access to the same pool of memory
00:02:37
in a cache coherent fashion. That means
00:02:39
if the GPU writes to a memory address,
00:02:41
the CPU can read that exact data without
00:02:43
needing to copy it out to another buffer
00:02:45
or go through explicit synchronization.
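The bandwidth gap we've been talking about is easy to sanity-check with back-of-the-envelope arithmetic. Here's a quick sketch in Python using the transfer rates and bus widths quoted in this video; these are peak theoretical numbers, not measured throughput:

```python
# Peak bandwidth (GB/s) = transfers per second x bus width in bits / 8 bits per byte
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    return mega_transfers_per_s * 1e6 * bus_width_bits / 8 / 1e9

# M2 Ultra: LPDDR5 at 6,400 MT/s on a 1024-bit bus
m2_ultra = peak_bandwidth_gb_s(6400, 1024)

# Typical Ryzen desktop: dual-channel DDR5 at 5,600 MT/s on a 128-bit bus
ryzen = peak_bandwidth_gb_s(5600, 128)

print(f"M2 Ultra: {m2_ultra:.1f} GB/s")      # 819.2 GB/s, marketed as "800 GB/s"
print(f"Ryzen:    {ryzen:.1f} GB/s")         # 89.6 GB/s
print(f"Ratio:    {m2_ultra / ryzen:.1f}x")  # 9.1x, the "8 to 10 times" figure
```

Real-world throughput is lower on both platforms, but it's the ratio between them that matters here.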
00:02:47
And this is a big deal because on a
00:02:49
traditional PC, things aren't nearly as
00:02:51
cooperative. Enter the AMD 8060S in the
00:02:54
GMKtec NucBox. The 8060S lives inside
00:02:58
a Ryzen APU, and like many of its x86
00:03:01
siblings, it runs in what's called a
00:03:02
shared memory model. That's a much older
00:03:05
approach where the CPU and the GPU
00:03:07
technically share the same pool of RAM,
00:03:09
but functionally they don't share it
00:03:10
well. Instead, a portion of your
00:03:12
system's main memory, say 2 GB or 4 GB
00:03:15
or 16 or 32, is carved out and reserved
00:03:17
as VRAM for the GPU. This reservation is
00:03:20
handled by the firmware or the BIOS, and
00:03:22
the operating system treats it as off
00:03:24
limits for everything else. So, yes, the
00:03:26
memory is shared, but that's more of a
00:03:27
logistical arrangement than a true
00:03:29
architectural unification. Data still
00:03:31
has to move and buffers are still copied
00:03:33
and everything still travels through a
00:03:35
memory controller located on the APU die
00:03:37
which then talks to your RAM modules
00:03:39
through a relatively narrow pipe: a 128-bit
00:03:41
or 256-bit dual-channel DDR5 memory
00:03:45
interface, and you're not getting 256-bit
00:03:47
unless you're quad-channel, and I don't
00:03:48
think they do that on the Ryzen desktop
00:03:50
yet. But compared to the 1024-bit beast
00:03:53
on the M2 Ultra, that's a bit like
00:03:54
trying to hydrate a stadium using a
00:03:56
garden hose. But now let's talk speed
00:03:58
cuz that's what we care about. Apple's
00:04:00
LPDDR5 memory is not only wide, it's
00:04:02
also fast. Running at around 6,400
00:04:05
megatransfers per second, each module can move
00:04:07
data very quickly. And when multiplied
00:04:09
across 1024 bits of access width, you
00:04:11
can start to see where that 800 GB a
00:04:14
second number comes from. And all of
00:04:15
this happens right on package, meaning
00:04:17
there are no long PCB traces, no
00:04:20
motherboard routes, no DIMM slots, no
00:04:22
latency inducing connectors. Data moves
00:04:24
quickly and efficiently. On the Ryzen
00:04:26
side, DDR5 might also clock in at around
00:04:29
5,200 or 5,600 megatransfers per second.
00:04:32
But because the memory bus is narrower,
00:04:34
the total bandwidth is limited to
00:04:35
somewhere in the 80 GB per second range, depending
00:04:37
on configuration. Not bad, but once
00:04:40
again, about 1/8 of what the M2 Ultra
00:04:42
can do. And that's assuming that the CPU
00:04:44
and GPU aren't stepping on each other's
00:04:46
toes. In reality, contention can further
00:04:48
reduce effective bandwidth during mixed
00:04:50
workloads. So, when you're editing 8K
00:04:52
video or training a neural network and
00:04:54
both the CPU and the GPU want to chew on
00:04:56
the same data set, Apple's architecture
00:04:59
can serve both without blinking. With
00:05:01
the Ryzen, it's got to mediate who goes
00:05:03
next. Now, let's talk bit width because
00:05:05
this is one of those classic size
00:05:07
matters situations. On the M2 Ultra,
00:05:10
that memory interface we identified was
00:05:11
1024 bits wide. That means the CPU, the
00:05:15
GPU, or the neural engine can request
00:05:17
huge chunks of data in a single
00:05:18
transaction. That's great for tasks like
00:05:21
high resolution video rendering where
00:05:22
you're moving gigabytes of raw pixel
00:05:24
data around per second. The Ryzen APU,
00:05:26
by contrast, is working with a 128 bit
00:05:29
bus. And that smaller highway means more
00:05:31
memory transactions are required to move
00:05:34
the same amount of data. Not only does
00:05:36
that slow things down, it also consumes
00:05:38
more power and can increase memory
00:05:39
contention when multiple agents are
00:05:41
requesting access. And speaking of
00:05:43
contention, let's move on to latency and
00:05:45
cache coherency. Apple's on package
00:05:48
memory and system level cache design
00:05:49
mean that the CPU and GPU can both
00:05:52
access the same data without having to
00:05:54
make redundant copies. This is
00:05:56
especially powerful for things like
00:05:57
Metal-accelerated machine learning, where
00:05:59
data sets can live in shared memory and
00:06:01
be updated in place by whichever engine
00:06:03
is working on them. In contrast, the
00:06:06
Ryzen's architecture requires more
00:06:07
fencing and mapping. The GPU might have
00:06:10
its own view of a buffer and when the
00:06:11
CPU wants to read or write to it, a copy
00:06:14
operation or at least a synchronization
00:06:16
operation is often required. That adds
00:06:18
latency and burns power. And while Ryzen
00:06:21
does have a shared L3 cache across its
00:06:24
CPU cores, usually in the 16 to 32
00:06:26
megabyte range, it doesn't extend that
00:06:28
cache to the GPU in a unified fashion.
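That copy-versus-share difference is easy to demonstrate even without touching a GPU. This little Python sketch, a loose analogy rather than a driver benchmark, duplicates a large buffer the way a copy-based pipeline must, then hands out a zero-copy view the way a unified, cache-coherent pool can:

```python
import time

N = 256 * 1024 * 1024          # a 256 MB buffer, like a chunk of video frames
src = bytearray(N)

# Copy path: the consumer gets its own duplicate of the data
t0 = time.perf_counter()
dst = bytes(src)               # full 256 MB copy
t1 = time.perf_counter()

# Shared path: the consumer gets a zero-copy view of the same memory
t2 = time.perf_counter()
view = memoryview(src)         # no data moves at all
t3 = time.perf_counter()

print(f"copy:      {(t1 - t0) * 1e3:8.2f} ms")
print(f"zero-copy: {(t3 - t2) * 1e3:8.5f} ms")
```

On top of the time cost, the copy also doubles the memory footprint, which is exactly the kind of overhead a unified pool avoids.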
00:06:31
Apple on the other hand includes a
00:06:32
massive 64 megabyte system level cache
00:06:35
that is accessible and usable by all the
00:06:37
cores in the SoC. That means hot data
00:06:39
can be kept very close to all the
00:06:41
engines, reducing latency and boosting
00:06:43
throughput, which also brings us now to
00:06:45
power efficiency. Now, I'm not saying
00:06:47
the Apple silicon is magic, but if you
00:06:49
squint hard enough, it's starting to
00:06:50
feel that way. The M2 Ultra's tight
00:06:52
coupling of compute and memory, coupled
00:06:54
with the inherently lower power draw of
00:06:56
LPDDR5 versus desktop DDR5, means it can
00:06:59
deliver incredible performance per watt.
00:07:02
There are very few data copies, less
00:07:04
movement across physical interconnects,
00:07:06
and much lower idle and leakage power.
00:07:08
Ryzen, meanwhile, has to move data from
00:07:10
the APU die to external DIMMs and across
00:07:12
the traces of your motherboard. That not
00:07:14
only takes more energy, but it also
00:07:16
means you've got signal integrity
00:07:17
issues, timing coordination, and more
00:07:19
power spent on memory controller
00:07:21
overhead. And sure, desktop DDR5
00:07:24
supports things like power down modes,
00:07:25
but nothing beats the efficiency of an
00:07:27
SoC where everything lives under one
00:07:29
digital roof. So, here's a philosophical
00:07:31
question. What's more important, raw
00:07:33
performance or flexibility? Because
00:07:35
while the M2 Ultra absolutely crushes
00:07:37
the Ryzen 8060S in terms of architectural
00:07:40
elegance and performance per watt,
00:07:42
there's one area where the Ryzen system
00:07:44
still has a clear advantage.
00:07:46
Upgradability. On the Apple side, what
00:07:48
you buy is what you live with. If you
00:07:50
get the 128 GB of memory, great, but
00:07:52
it's expensive and it's soldered onto
00:07:54
the SoC package, and there's no going
00:07:55
back. That's fine for video editors or
00:07:58
machine learning engineers who know
00:07:59
their memory footprint. But for the rest
00:08:01
of us, especially those who like to
00:08:02
tinker or to buy small and scale up down
00:08:05
the road, it can be a hard stop. The
00:08:07
Ryzen system, on the other hand, uses
00:08:09
industry-standard DDR5 DIMMs in socketed
00:08:11
slots. If you want to swap in more RAM,
00:08:13
upgrade it to 128 GB, use faster memory,
00:08:16
or even run mixed mode configurations,
00:08:18
you can do that. And while that doesn't
00:08:20
help your integrated GPU performance, it
00:08:22
does offer system level flexibility that
00:08:24
Apple simply doesn't. If you're doing
00:08:26
prolevel video editing, machine
00:08:27
learning, or any kind of high resolution
00:08:29
media work, the M2 Ultra's unified
00:08:31
memory setup is hands down the better
00:08:33
tool. You get massive bandwidth, zero
00:08:35
copy data sharing between compute units,
00:08:37
and incredibly low latency. But if
00:08:39
you're gaming, browsing, running
00:08:41
spreadsheets with the occasional bit of
00:08:42
Photoshop thrown in, then the Ryzen APU
00:08:45
with shared memory is a fine choice. You
00:08:47
might not get the absolute best
00:08:48
performance per watt or the flashiest
00:08:50
benchmarks, but you get a solid
00:08:52
capability at a fraction of the price,
00:08:54
and you can upgrade or tinker to your
00:08:55
heart's content. What we're looking at
00:08:57
here isn't just two different chips, but
00:08:58
two different design philosophies. Apple
00:09:01
is betting everything on tight
00:09:02
integration, shared resources, and
00:09:04
vertical control of their hardware. AMD
00:09:07
and the broader x86 world is still built
00:09:09
around flexibility, modularity, and user
00:09:11
choice, even if it comes at a cost in
00:09:13
performance and efficiency. And you
00:09:15
know what? There's room for both in this
00:09:17
world. There's one more subtle but
00:09:19
incredibly important angle that we need
00:09:20
to cover, and it's all about real world
00:09:22
optimization. Because it's easy to get
00:09:24
swept up in the specs: gigabytes per
00:09:26
second, cache sizes, and bus widths.
00:09:29
But the rubber meets the road when you
00:09:31
ask a very simple question. How well
00:09:33
does your software stack actually use
00:09:34
your hardware? And here's where Apple's
00:09:36
approach really shines, especially if
00:09:38
you're inside their walled garden. Let
00:09:40
me explain. When you run Final Cut on an
00:09:43
M2 Ultra, you're running a program
00:09:44
tailor made to leverage everything
00:09:46
Apple's architecture has to offer. It
00:09:48
can stream data straight from SSD to RAM
00:09:50
to GPU to display, all without
00:09:52
translation layers, driver issues, or
00:09:54
copying buffers back and forth. Metal,
00:09:57
Apple's graphics and compute API, which
00:09:59
is kind of like CUDA, was built from the
00:10:01
ground up to play nice with unified
00:10:03
memory. And that means real gains
00:10:04
because projects that used to need
00:10:06
intermediate render passes on disk can
00:10:08
be computed on the fly. Machine learning
00:10:10
effects can tap into the neural engine
00:10:12
without exporting model data across
00:10:14
buses or dealing with interop headaches.
00:10:16
The OS, the apps, and the hardware all
00:10:18
speak the same language, and it's a
00:10:20
private dialect. Contrast that with the
00:10:22
Ryzen system. Yes, you can run Resolve
00:10:24
or Blender or PyTorch, but now you're
00:10:26
relying on drivers from AMD, OpenCL or
00:10:29
Vulkan interop layers, and a dance of
00:10:31
memory synchronization going on between
00:10:33
the CPU and GPU buffers. You can still
00:10:35
get good results, but you're doing it with
00:10:36
a lot more legwork behind the scenes.
00:10:38
Now, this doesn't mean that the PC is
00:10:40
inferior. It just means that open
00:10:42
platforms carry with them the burden of
00:10:44
interoperability. Every layer adds
00:10:46
flexibility for sure, but also friction.
00:10:49
And nowhere does that show up more
00:10:50
clearly than how memory is used and
00:10:51
managed across components. There's one
00:10:54
last philosophical contrast I want to
00:10:56
touch on. When you look at the Apple M2
00:10:58
Ultra, it's clear that Apple is chasing
00:10:59
a specific vision, a monolithic, highly
00:11:02
integrated compute engine where
00:11:04
specialization is internal, not
00:11:05
external. Everything's on the SoC.
00:11:08
Memory is shared. Caches are unified.
00:11:10
The user doesn't manage resources. The
00:11:12
system does. It's elegant, but it's also
00:11:14
very rigid. You're buying into a fixed
00:11:16
future. The Ryzen desktop, by contrast,
00:11:19
is almost modular by design. You get
00:11:21
to pick your CPU, your RAM, your GPU if
00:11:23
you want one, and you can add more later
00:11:25
and change cooling systems, tune
00:11:26
voltages. It's messy, sure, but it's
00:11:28
yours. And that openness is why the PC
00:11:31
ecosystem has survived for decades and
00:11:33
adapted to workloads that Apple could
00:11:34
never have imagined. So, which is
00:11:37
better? Well, if you're building a
00:11:38
system for a tightly scoped, high
00:11:40
bandwidth professional content creation
00:11:42
task, especially video, photography, or
00:11:44
machine learning pipelines, then Apple's
00:11:46
unified memory architecture can offer a
00:11:48
level of performance and simplicity
00:11:50
that's hard to match. So, if like me,
00:11:52
your main reason for having a Mac is to
00:11:54
run Final Cut, it's almost perfect for
00:11:56
that task. But if your needs are more
00:11:58
general, or if you value upgrade paths,
00:12:00
flexibility, or just being able to
00:12:02
tinker and learn, then a Ryzen-based
00:12:04
system with shared memory is not only
00:12:05
good enough, it might actually be better
00:12:06
for you as a long-term investment. If
00:12:09
you're doing AI workloads, the ability
00:12:10
to run the larger models is appreciated
00:12:12
on both systems, but the Macs run them
00:12:14
significantly faster than the Ryzen
00:12:16
APU. But if you're running a more
00:12:18
CPU-friendly task like finding prime
00:12:20
numbers, the 16 high-speed cores of the
00:12:23
Ryzen actually put the Mac to shame,
00:12:25
turning in nearly double the
00:12:26
performance. In fact, the NucBox is
00:12:28
the fastest single-core chip I've ever
00:12:30
tested, faster than both the M2 Ultra
00:12:32
Mac Pro and the Ryzen Threadripper 7995WX
00:12:35
on single core workloads. And the CPU is
00:12:38
fast enough that it even beats my older
00:12:39
32-core Threadripper 3970X on
00:12:42
multi-core tests despite having only
00:12:44
half the core count. So, the key is
00:12:46
knowing what kind of work you do and
00:12:47
then choosing the tool that best matches
00:12:49
that profile. If you found today's look
00:12:52
at memory architecture to be any
00:12:53
combination of informative or
00:12:55
entertaining, remember that I'm mostly
00:12:56
in this for the subs and likes. So, I'd
00:12:58
be honored if you consider leaving me
00:12:59
one of each before you go today. And if
00:13:01
you're already subscribed to the
00:13:02
channel, thank you. In the meantime, and
00:13:04
in between time, I hope to see you next
00:13:06
time right here in Dave's Garage.