00:00:00
Hey, I'm Dave. Welcome to my shop. I'm
00:00:03
Dave Plummer, a retired software engineer
00:00:04
from Microsoft, going back to the MS-DOS
00:00:06
and Windows 95 days. And today, we're
00:00:09
going to venture into the world of
00:00:10
modern memory architecture, but with a
00:00:12
twist. Because while everybody's busy
00:00:14
talking about raw CPU core counts and
00:00:16
GPU teraflops, there's something even
00:00:18
more foundational lurking under the hood
00:00:20
that makes or breaks your system's real
00:00:22
world performance, especially in
00:00:24
mixed-use creative or technical
00:00:25
workflows. And that something is memory:
00:00:28
how it's accessed, how it's shared, how
00:00:30
fast it is, and who gets to use how much
00:00:32
of it at a time. And to make that
00:00:34
exploration interesting, we're going to
00:00:35
do what I'm known for doing, pitting two
00:00:37
radically different platforms
00:00:39
head-to-head. And then I'll share some
00:00:40
comparison benchmarks towards the end
00:00:42
once we understand the platforms better.
00:00:45
On the one side, we've got the sleek,
00:00:46
streamlined M2 Ultra Mac Pro from Apple,
00:00:49
featuring 128 GB of what Apple calls
00:00:52
unified memory. On the other, we've got
00:00:54
GMKtec NucBox, a sleek 16-core Ryzen
00:00:57
desktop whose AMD APU provides Radeon
00:01:01
8060S integrated graphics, but
00:01:03
running on shared DDR5 system memory. They
00:01:05
both use integrated graphics. They can
00:01:07
both edit video. They both run
00:01:09
productivity apps, but beyond those
00:01:11
surface similarities, their memory
00:01:12
systems may as well come from two
00:01:14
different planets, and that's what we're
00:01:15
going to explore today. So, buckle up
00:01:17
because this is going to be a deep dive
00:01:19
into bandwidth, bus width, cache
00:01:21
coherency, and a bit of silicon
00:01:22
wizardry. Let's start with the Apple
00:01:24
side of the fence. Apple's M2 Ultra and
00:01:27
their Pro systems, and indeed, most of
00:01:29
Apple silicon is built around something
00:01:30
called a unified memory architecture, or
00:01:32
UMA. The M2 Ultra is a behemoth of a
00:01:35
chip, with up to 24 CPU cores, 60 GPU cores, and
00:01:38
a neural engine to boot. But what sets
00:01:41
it apart isn't just what's on the chip.
00:01:42
It's how the memory is arranged around
00:01:44
it. Instead of slapping some SO-DIMMs
00:01:46
on a motherboard and calling it a day,
00:01:48
Apple took the bold step of integrating
00:01:50
the memory directly onto the chip
00:01:51
package using a silicon interposer. That
00:01:54
means the LPDDR5 memory modules aren't
00:01:56
somewhere off in the weeds or on the
00:01:58
bus. They're right next to the SoC. And
00:02:01
not in a close enough kind of way, but
00:02:03
in a shared substrate with a thousand
00:02:05
pin connection kind of way. We're
00:02:06
talking about a 1024-bit memory bus
00:02:09
capable of delivering up to 800 GB per
00:02:11
second of bandwidth. That's not a typo.
00:02:13
800 gigabytes per second. If that number
00:02:16
doesn't make your jaw drop, let me put
00:02:18
it this way. That's 8 to 10 times the
00:02:20
bandwidth you'd typically find in a
00:02:22
modern Ryzen desktop running dual
00:02:23
channel
00:02:24
DDR5. So, what does all the extra
00:02:26
bandwidth and proximity actually buy
00:02:28
you? Well, in Apple's design, all the
00:02:30
components, the CPU, the GPU, the NPU,
00:02:33
and even the image signal processor
00:02:35
share access to the same pool of memory
00:02:37
in a cache coherent fashion. That means
00:02:39
if the GPU writes to a memory address,
00:02:41
the CPU can read that exact data without
00:02:43
needing to copy it out to another buffer
00:02:45
or go through explicit synchronization.
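The bandwidth gap we've been talking about is easy to sanity-check with back-of-the-envelope arithmetic. Here's a quick sketch in Python using the transfer rates and bus widths quoted in this video; these are peak theoretical numbers, not measured throughput:

```python
# Peak bandwidth (GB/s) = transfers per second x bus width in bits / 8 bits per byte
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    return mega_transfers_per_s * 1e6 * bus_width_bits / 8 / 1e9

# M2 Ultra: LPDDR5 at 6,400 MT/s on a 1024-bit bus
m2_ultra = peak_bandwidth_gb_s(6400, 1024)

# Typical Ryzen desktop: dual-channel DDR5 at 5,600 MT/s on a 128-bit bus
ryzen = peak_bandwidth_gb_s(5600, 128)

print(f"M2 Ultra: {m2_ultra:.1f} GB/s")      # 819.2 GB/s, marketed as "800 GB/s"
print(f"Ryzen:    {ryzen:.1f} GB/s")         # 89.6 GB/s
print(f"Ratio:    {m2_ultra / ryzen:.1f}x")  # 9.1x, the "8 to 10 times" figure
```

Real-world throughput is lower on both platforms, but it's the ratio between them that matters here.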
00:02:47
And this is a big deal because on a
00:02:49
traditional PC, things aren't nearly as
00:02:51
cooperative. Enter the AMD 8060S in the
00:02:54
GMKtec NucBox. The 8060S lives inside
00:02:58
a Ryzen APU, and like many of its x86
00:03:01
siblings, it runs in what's called a
00:03:02
shared memory model. That's a much older
00:03:05
approach where the CPU and the GPU
00:03:07
technically share the same pool of RAM,
00:03:09
but functionally they don't share it
00:03:10
well. Instead, a portion of your
00:03:12
system's main memory, say 2 GB or 4 GB
00:03:15
or 16 or 32, is carved out and reserved
00:03:17
as VRAM for the GPU. This reservation is
00:03:20
handled by the firmware or the BIOS, and
00:03:22
the operating system treats it as off
00:03:24
limits for everything else. So, yes, the
00:03:26
memory is shared, but that's more of a
00:03:27
logistical arrangement than a true
00:03:29
architectural unification. Data still
00:03:31
has to move and buffers are still copied
00:03:33
and everything still travels through a
00:03:35
memory controller located on the APU die
00:03:37
which then talks to your RAM modules
00:03:39
through a relatively narrow pipe: a 128-bit
00:03:41
or 256-bit dual-channel DDR5 memory
00:03:45
interface, and you're not getting 256-bit
00:03:47
unless you're quad-channel, and I don't
00:03:48
think they do that on the Ryzen desktop
00:03:50
yet. But compared to the 1024-bit beast
00:03:53
on the M2 Ultra, that's a bit like
00:03:54
trying to hydrate a stadium using a
00:03:56
garden hose. But now let's talk speed
00:03:58
cuz that's what we care about. Apple's
00:04:00
LPDDR5 memory is not only wide, it's
00:04:02
also fast. Running at around 6,400
00:04:05
megatransfers per second, each module can move
00:04:07
data very quickly. And when multiplied
00:04:09
across 1024 bits of access width, you
00:04:11
can start to see where that 800 GB a
00:04:14
second number comes from. And all of
00:04:15
this happens right on package, meaning
00:04:17
there are no long PCB traces, no
00:04:20
motherboard routes, no DIMM slots, no
00:04:22
latency inducing connectors. Data moves
00:04:24
quickly and efficiently. On the Ryzen
00:04:26
side, DDR5 might also clock in at around
00:04:29
5,200 or 5,600 megatransfers per second.
00:04:32
But because the memory bus is narrower,
00:04:34
the total bandwidth is limited to
00:04:35
somewhere in the 80 GB per second range, depending
00:04:37
on configuration. Not bad, but once
00:04:40
again, about 1/8 of what the M2 Ultra
00:04:42
can do. And that's assuming that the CPU
00:04:44
and GPU aren't stepping on each other's
00:04:46
toes. In reality, contention can further
00:04:48
reduce effective bandwidth during mixed
00:04:50
workloads. So, when you're editing 8K
00:04:52
video or training a neural network and
00:04:54
both the CPU and the GPU want to chew on
00:04:56
the same data set, Apple's architecture
00:04:59
can serve both without blinking. With
00:05:01
the Ryzen, it's got to mediate who goes
00:05:03
next. Now, let's talk bit width because
00:05:05
this is one of those classic size
00:05:07
matters situations. On the M2 Ultra,
00:05:10
that memory interface we identified was
00:05:11
1024 bits wide. That means the CPU, the
00:05:15
GPU, or the neural engine can request
00:05:17
huge chunks of data in a single
00:05:18
transaction. That's great for tasks like
00:05:21
high resolution video rendering where
00:05:22
you're moving gigabytes of raw pixel
00:05:24
data around per second. The Ryzen APU,
00:05:26
by contrast, is working with a 128 bit
00:05:29
bus. And that smaller highway means more
00:05:31
memory transactions are required to move
00:05:34
the same amount of data. Not only does
00:05:36
that slow things down, it also consumes
00:05:38
more power and can increase memory
00:05:39
contention when multiple agents are
00:05:41
requesting access. And speaking of
00:05:43
contention, let's move on to latency and
00:05:45
cache coherency. Apple's on package
00:05:48
memory and system level cache design
00:05:49
mean that the CPU and GPU can both
00:05:52
access the same data without having to
00:05:54
make redundant copies. This is
00:05:56
especially powerful for things like
00:05:57
Metal-accelerated machine learning, where
00:05:59
data sets can live in shared memory and
00:06:01
be updated in place by whichever engine
00:06:03
is working on them. In contrast, the
00:06:06
Ryzen's architecture requires more
00:06:07
fencing and mapping. The GPU might have
00:06:10
its own view of a buffer and when the
00:06:11
CPU wants to read or write to it, a copy
00:06:14
operation or at least a synchronization
00:06:16
operation is often required. That adds
00:06:18
latency and burns power. And while Ryzen
00:06:21
does have a shared L3 cache across its
00:06:24
CPU cores, usually in the 16 to 32
00:06:26
megabyte range, it doesn't extend that
00:06:28
cache to the GPU in a unified fashion.
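That copy-versus-share difference is easy to demonstrate even without touching a GPU. This little Python sketch, a loose analogy rather than a driver benchmark, duplicates a large buffer the way a copy-based pipeline must, then hands out a zero-copy view the way a unified, cache-coherent pool can:

```python
import time

N = 256 * 1024 * 1024          # a 256 MB buffer, like a chunk of video frames
src = bytearray(N)

# Copy path: the consumer gets its own duplicate of the data
t0 = time.perf_counter()
dst = bytes(src)               # full 256 MB copy
t1 = time.perf_counter()

# Shared path: the consumer gets a zero-copy view of the same memory
t2 = time.perf_counter()
view = memoryview(src)         # no data moves at all
t3 = time.perf_counter()

print(f"copy:      {(t1 - t0) * 1e3:8.2f} ms")
print(f"zero-copy: {(t3 - t2) * 1e3:8.5f} ms")
```

On top of the time cost, the copy also doubles the memory footprint, which is exactly the kind of overhead a unified pool avoids.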
00:06:31
Apple on the other hand includes a
00:06:32
massive 64 megabyte system level cache
00:06:35
that is accessible and usable by all the
00:06:37
cores in the SoC. That means hot data
00:06:39
can be kept very close to all the
00:06:41
engines, reducing latency and boosting
00:06:43
throughput, which also brings us now to
00:06:45
power efficiency. Now, I'm not saying
00:06:47
the Apple silicon is magic, but if you
00:06:49
squint hard enough, it's starting to
00:06:50
feel that way. The M2 Ultra's tight
00:06:52
coupling of compute and memory, coupled
00:06:54
with the inherently lower power draw of
00:06:56
LPDDR5 versus desktop DDR5, means it can
00:06:59
deliver incredible performance per watt.
00:07:02
There are very few data copies, less
00:07:04
movement across physical interconnects,
00:07:06
and much lower idle and leakage power.
00:07:08
Ryzen, meanwhile, has to move data from
00:07:10
the APU die to external DIMMs and across
00:07:12
the traces of your motherboard. That not
00:07:14
only takes more energy, but it also
00:07:16
means you've got signal integrity
00:07:17
issues, timing coordination, and more
00:07:19
power spent on memory controller
00:07:21
overhead. And sure, desktop DDR5
00:07:24
supports things like power down modes,
00:07:25
but nothing beats the efficiency of an
00:07:27
SoC where everything lives under one
00:07:29
digital roof. So, here's a philosophical
00:07:31
question. What's more important, raw
00:07:33
performance or flexibility? Because
00:07:35
while the M2 Ultra absolutely crushes
00:07:37
the Ryzen 8060S in terms of architectural
00:07:40
elegance and performance per watt,
00:07:42
there's one area where the Ryzen system
00:07:44
still has a clear advantage.
00:07:46
Upgradability. On the Apple side, what
00:07:48
you buy is what you live with. If you
00:07:50
get the 128 GB of memory, great, but
00:07:52
it's expensive and it's soldered onto
00:07:54
the SoC package, and there's no going
00:07:55
back. That's fine for video editors or
00:07:58
machine learning engineers who know
00:07:59
their memory footprint. But for the rest
00:08:01
of us, especially those who like to
00:08:02
tinker or to buy small and scale up down
00:08:05
the road, it can be a hard stop. The
00:08:07
Ryzen system, on the other hand, uses
00:08:09
industry-standard DDR5 DIMMs in socketed
00:08:11
slots. If you want to swap in more RAM,
00:08:13
upgrade it to 128 GB, use faster memory,
00:08:16
or even run mixed mode configurations,
00:08:18
you can do that. And while that doesn't
00:08:20
help your integrated GPU performance, it
00:08:22
does offer system level flexibility that
00:08:24
Apple simply doesn't. If you're doing
00:08:26
prolevel video editing, machine
00:08:27
learning, or any kind of high resolution
00:08:29
media work, the M2 Ultra's unified
00:08:31
memory setup is hands down the better
00:08:33
tool. You get massive bandwidth, zero
00:08:35
copy data sharing between compute units,
00:08:37
and incredibly low latency. But if
00:08:39
you're gaming, browsing, running
00:08:41
spreadsheets with the occasional bit of
00:08:42
Photoshop thrown in, then the Ryzen APU
00:08:45
with shared memory is a fine choice. You
00:08:47
might not get the absolute best
00:08:48
performance per watt or the flashiest
00:08:50
benchmarks, but you get a solid
00:08:52
capability at a fraction of the price,
00:08:54
and you can upgrade or tinker to your
00:08:55
heart's content. What we're looking at
00:08:57
here isn't just two different chips, but
00:08:58
two different design philosophies. Apple
00:09:01
is betting everything on tight
00:09:02
integration, shared resources, and
00:09:04
vertical control of their hardware. AMD
00:09:07
and the broader x86 world is still built
00:09:09
around flexibility, modularity, and user
00:09:11
choice, even if it comes at a cost in
00:09:13
performance and efficiency. And you
00:09:15
know what? There's room for both in this
00:09:17
world. There's one more subtle but
00:09:19
incredibly important angle that we need
00:09:20
to cover, and it's all about real world
00:09:22
optimization. Because it's easy to get
00:09:24
swept up in the specs: gigabytes per
00:09:26
second, cache sizes, and bus widths.
00:09:29
But the rubber meets the road when you
00:09:31
ask a very simple question. How well
00:09:33
does your software stack actually use
00:09:34
your hardware? And here's where Apple's
00:09:36
approach really shines, especially if
00:09:38
you're inside their walled garden. Let
00:09:40
me explain. When you run Final Cut on an
00:09:43
M2 Ultra, you're running a program
00:09:44
tailor made to leverage everything
00:09:46
Apple's architecture has to offer. It
00:09:48
can stream data straight from SSD to RAM
00:09:50
to GPU to display, all without
00:09:52
translation layers, driver issues, or
00:09:54
copying buffers back and forth. Metal,
00:09:57
Apple's graphics and compute API, which
00:09:59
is kind of like CUDA, was built from the
00:10:01
ground up to play nice with unified
00:10:03
memory. And that means real gains
00:10:04
because projects that used to need
00:10:06
intermediate render passes on disk can
00:10:08
be computed on the fly. Machine learning
00:10:10
effects can tap into the neural engine
00:10:12
without exporting model data across
00:10:14
buses or dealing with interop headaches.
00:10:16
The OS, the apps, and the hardware all
00:10:18
speak the same language, and it's a
00:10:20
private dialect. Contrast that with the
00:10:22
Ryzen system. Yes, you can run Resolve
00:10:24
or Blender or PyTorch, but now you're
00:10:26
relying on drivers from AMD, OpenCL or
00:10:29
Vulkan interop layers, and a dance of
00:10:31
memory synchronization going on between
00:10:33
the CPU and GPU buffers. You can still
00:10:35
get good results, but you're doing it with
00:10:36
a lot more legwork behind the scenes.
00:10:38
Now, this doesn't mean that the PC is
00:10:40
inferior. It just means that open
00:10:42
platforms carry with them the burden of
00:10:44
interoperability. Every layer adds
00:10:46
flexibility for sure, but also friction.
00:10:49
And nowhere does that show up more
00:10:50
clearly than how memory is used and
00:10:51
managed across components. There's one
00:10:54
last philosophical contrast I want to
00:10:56
touch on. When you look at the Apple M2
00:10:58
Ultra, it's clear that Apple is chasing
00:10:59
a specific vision, a monolithic, highly
00:11:02
integrated compute engine where
00:11:04
specialization is internal, not
00:11:05
external. Everything's on the SoC.
00:11:08
Memory is shared. Caches are unified.
00:11:10
The user doesn't manage resources. The
00:11:12
system does. It's elegant, but it's also
00:11:14
very rigid. You're buying into a fixed
00:11:16
future. The Ryzen desktop, by contrast,
00:11:19
is almost modular by design. You get
00:11:21
to pick your CPU, your RAM, your GPU if
00:11:23
you want one, and you can add more later
00:11:25
and change cooling systems, tune
00:11:26
voltages. It's messy, sure, but it's
00:11:28
yours. And that openness is why the PC
00:11:31
ecosystem has survived for decades and
00:11:33
adapted to workloads that Apple could
00:11:34
never have imagined. So, which is
00:11:37
better? Well, if you're building a
00:11:38
system for a tightly scoped, high
00:11:40
bandwidth professional content creation
00:11:42
task, especially video, photography, or
00:11:44
machine learning pipelines, then Apple's
00:11:46
unified memory architecture can offer a
00:11:48
level of performance and simplicity
00:11:50
that's hard to match. So, if like me,
00:11:52
your main reason for having a Mac is to
00:11:54
run Final Cut, it's almost perfect for
00:11:56
that task. But if your needs are more
00:11:58
general, or if you value upgrade paths,
00:12:00
flexibility, or just being able to
00:12:02
tinker and learn, then a Ryzen-based
00:12:04
system with shared memory is not only
00:12:05
good enough, it might actually be better
00:12:06
for you as a long-term investment. If
00:12:09
you're doing AI workloads, the ability
00:12:10
to run the larger models is appreciated
00:12:12
on both systems, but the Macs run them
00:12:14
significantly faster than the Ryzen
00:12:16
APU. But if you're running a more
00:12:18
CPU-friendly task like finding prime
00:12:20
numbers, the 16 high-speed cores of the
00:12:23
Ryzen actually put the Mac to shame,
00:12:25
turning in nearly double the
00:12:26
performance. In fact, the NucBox is
00:12:28
the fastest single-core chip I've ever
00:12:30
tested, faster than both the M2 Ultra
00:12:32
Mac Pro and the Ryzen Threadripper 7995WX
00:12:35
on single core workloads. And the CPU is
00:12:38
fast enough that it even beats my older
00:12:39
32-core Threadripper 3970X on
00:12:42
multi-core tests despite having only
00:12:44
half the core count. So, the key is
00:12:46
knowing what kind of work you do and
00:12:47
then choosing the tool that best matches
00:12:49
that profile. If you found today's look
00:12:52
at memory architecture to be any
00:12:53
combination of informative or
00:12:55
entertaining, remember that I'm mostly
00:12:56
in this for the subs and likes. So, I'd
00:12:58
be honored if you consider leaving me
00:12:59
one of each before you go today. And if
00:13:01
you're already subscribed to the
00:13:02
channel, thank you. In the meantime, and
00:13:04
in between time, I hope to see you next
00:13:06
time right here in Dave's Garage.