CUDA Hardware

00:42:20
https://www.youtube.com/watch?v=kUqkOAU84bA

Summary

TLDR: This overview focuses on how CUDA hardware has changed over the years, in particular Nvidia's GPUs. The video explains how each GPU micro-architecture, from Tesla through Ampere, offers different capabilities for general-purpose computation as well as for graphics. It discusses how CPUs and GPUs differ: a CPU is optimized for low-latency execution, while a GPU is optimized for high throughput. It also shows how CUDA lets these devices be programmed for general-purpose and machine-learning workloads, and closes with a review of terminology and of Nvidia's well-known product generations.
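
As a rough illustration of the latency-versus-throughput contrast described above, the sketch below (assumed, not from the video; the function names are placeholders) shows the same element-wise addition written once as a serial CPU loop and once as a CUDA kernel in which each of the GPU's many cores handles one small piece of the job.

    #include <cuda_runtime.h>

    // GPU version: one lightweight thread per element; the hardware runs
    // thousands of these in parallel, favoring total throughput over the
    // latency of any single operation.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // CPU version: one (or a few) powerful cores walk the array serially,
    // relying on large caches to keep per-element latency low.
    void vecAddCpu(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
    }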

Takeaways

  • 🖥️ A CPU is designed primarily for low-latency execution of individual tasks.
  • 🚀 A GPU trades single-task latency for much higher overall throughput.
  • 📈 The CUDA architecture enables general-purpose computation on the GPU (a minimal launch sketch follows this list).
  • 🔍 Nvidia GPUs now ship with newer features such as Tensor Cores for AI.
  • 🔄 Each new generation of technology increases GPU power and capability.
  • 🕰️ CUDA-era micro-architectures include Tesla, Fermi, Kepler, Pascal, and Ampere.
  • 💡 The CUDA architecture allows GPUs to be used as general-purpose compute devices.
  • 🌐 Nvidia GPUs support modern graphics rendering alongside general computation.
  • 🚦 Newer Nvidia hardware improves how work is scheduled and executed on the chip.
  • 🔧 CUDA makes it possible to run the same code across different GPU devices with little extra effort.
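
To make the general-purpose computing point concrete, here is an assumed minimal host-side program (not from the video; the kernel name and sizes are illustrative) showing how a CUDA application allocates device memory, copies data explicitly between CPU and GPU memory, and launches a kernel over a grid of thread blocks. The explicit copies are what unified memory, discussed later, is meant to simplify.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;                                  // buffers in GPU (device) memory
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);    // explicit CPU -> GPU copy
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);      // grid of blocks of 256 threads
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);    // explicit GPU -> CPU copy

        printf("hc[0] = %f\n", hc[0]);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }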

Timeline

  • 00:00:00 - 00:05:00

    This section introduces the evolution of CUDA hardware and contrasts CPU and GPU architecture. CPUs have fewer cores with complex caching aimed at low latency, while GPUs have many cores aimed at high throughput. The GPU architecture allows more parallel processing by devoting less chip area to caching than a CPU does.

  • 00:05:00 - 00:10:00

    The transition from earlier Nvidia GPUs aimed solely at graphics to more modern CUDA-enabled GPUs supporting general computing is highlighted. The architecture evolution is discussed from Nvidia's first GPU in 1999 to modern multi-functional units introduced with micro-architectures like Tesla in 2008, with a focus on reducing feature size and increasing transistor count.

  • 00:10:00 - 00:15:00

    It continues to detail Nvidia's advancements with CUDA-enabled GPUs for general-purpose computing, including the introduction of a unified memory space and tensor cores, catering to the rise of machine learning by optimizing GPU performance for neural-network processing (a unified-memory sketch appears after this timeline).

  • 00:15:00 - 00:20:00

    The historical shift from graphics-specialized hardware to general-purpose computing in GPU chips is discussed. It highlights the redefinition of processing power allocation for flexible computing tasks instead of purely graphics, marking a significant architectural transition starting from the Tesla series.

  • 00:20:00 - 00:25:00

    The architecture of GPUs is further explored using the GeForce 8800, illustrating the central components involved in computation, with particular attention to the streaming multiprocessors (SMs) and the flexibility of managing computational tasks through general-purpose cores.

  • 00:25:00 - 00:30:00

    The structuring of GPUs into clusters and multiprocessors that enable highly parallel processing is detailed. This includes an explanation of the register files and memory hierarchy that support high-throughput processing, and showcases the Fermi architecture's improvements with enhanced cores and caches (a shared-memory sketch appears after this timeline).

  • 00:30:00 - 00:35:00

    Subsequent GPU architectures like Kepler, Maxwell, and Pascal are analyzed for their increased processing power and improved density through more advanced fabrication. Each generation adds more cores and enhanced precision capabilities, addressing evolving computing needs.

  • 00:35:00 - 00:42:20

    Finally, Volta and Ampere architectures are explored, highlighting technological advancements like tensor cores for machine learning applications and changes in manufacturing focus to balance graphical and computing performance, reflecting market needs and technological progression.
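
The unified memory mentioned in the 00:10:00 - 00:15:00 entry can be sketched as follows. This is an assumed minimal example (the kernel name and sizes are placeholders) using the CUDA runtime's cudaMallocManaged, where a single allocation is visible from both the CPU and the GPU and the runtime migrates the data as needed instead of the programmer issuing explicit copies.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));    // one allocation, one address space for CPU and GPU
        for (int i = 0; i < n; ++i) x[i] = 1.0f;     // touched from the CPU
        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // touched from the GPU; the runtime moves pages as needed
        cudaDeviceSynchronize();                     // wait for the GPU before the CPU reads the result
        printf("x[0] = %f\n", x[0]);
        cudaFree(x);
        return 0;
    }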
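
For the register / shared / global memory hierarchy referenced in the 00:25:00 - 00:30:00 entry, here is an assumed block-level reduction (the kernel name and block size are illustrative): per-thread values live in the register file, threads within one SM cooperate through __shared__ memory with __syncthreads(), and the result is written back out to global (DRAM) memory.

    // Sums 256-element chunks of 'in'; launch one block of 256 threads per chunk.
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float tile[256];                  // shared memory, visible to all threads (SPs) in this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;            // 'v' sits in the thread's slice of the register file
        tile[threadIdx.x] = v;
        __syncthreads();                             // pthreads-style coordination over the shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];               // partial sum written back to global (DRAM) memory
    }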

Video Q&A

  • What are the differences between a CPU and a GPU?

    A CPU aims at low-latency execution of individual tasks, while a GPU aims at high throughput, executing a large number of operations at the same time.

  • Why are Tensor Cores used in a GPU?

    Tensor Cores help accelerate machine-learning work, in particular training and evaluating deep neural networks (a short Tensor Core sketch follows this list).

  • Why do Nvidia GPUs keep gaining new technology?

    Newer technology delivers better performance and the ability to run a wider range of workloads, such as AI training.

  • What is meant by the 'CUDA architecture'?

    The CUDA architecture is Nvidia's GPU design that makes general-purpose computation possible on the GPU's many cores.

  • What distinguishes the CUDA era?

    The CUDA era focuses on producing GPUs that can act as general-purpose compute devices in addition to doing graphics work.

  • How do Nvidia's technology advances help?

    The growing processing power makes workloads such as AI practical, computation that goes beyond what CPUs alone can handle.

  • When did CUDA first appear?

    The first CUDA architecture arrived in 2008 with the introduction of Tesla.

  • What advantages do recent Nvidia chips provide?

    They can handle very heavy workloads and include more on-chip memory, which widens the range of fields in which they can be used (a device-query sketch also follows this list).
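
As a sketch of how the Tensor Cores mentioned above are actually programmed, the fragment below uses CUDA's WMMA API from <mma.h> (available on Volta and newer, compiled for sm_70 or later). It is an assumed minimal example, not from the video: a single warp multiplies two 16x16 half-precision tiles and accumulates in single precision; the kernel name is a placeholder.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes C = A * B + C for a single 16x16x16 tile on the tensor cores.
    __global__ void tileMma(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

        wmma::fill_fragment(accFrag, 0.0f);           // start from a zero accumulator
        wmma::load_matrix_sync(aFrag, a, 16);         // leading dimension 16
        wmma::load_matrix_sync(bFrag, b, 16);
        wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
        wmma::store_matrix_sync(c, accFrag, 16, wmma::mem_row_major);
    }
    // Launch with at least one full warp, e.g. tileMma<<<1, 32>>>(a, b, c);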
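
To see the per-device figures discussed above (number of SMs, on-chip memory, compute capability) for a given card, a small assumed query against the CUDA runtime looks like this; cudaGetDeviceProperties and the printed fields are standard runtime API, the rest is illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);            // properties of device 0
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        printf("SMs: %d, shared memory per block: %zu bytes, 32-bit registers per block: %d\n",
               prop.multiProcessorCount, prop.sharedMemPerBlock, prop.regsPerBlock);
        return 0;
    }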

Subtitles (en)
  • 00:00:03
    this is a brief
  • 00:00:04
    overview of some of the cuda hardware
  • 00:00:06
    that's been around for years now
  • 00:00:09
    i wanted to give you some kind of
  • 00:00:10
    perspective on where we've been where
  • 00:00:12
    the current generation is and
  • 00:00:13
    kind of where we're headed maybe by way
  • 00:00:16
    of overview i want to
  • 00:00:17
    spend a little bit of time just
  • 00:00:18
    refreshing your thinking about the
  • 00:00:20
    differences between a cpu
  • 00:00:22
    and a gpu and on the left here we've got
  • 00:00:25
    a illustration of a typical
  • 00:00:26
    modern cpu this particular one is
  • 00:00:29
    showing a four
  • 00:00:30
    core architecture so there's individual
  • 00:00:33
    processing
  • 00:00:34
    cores that are part of the same chip and
  • 00:00:37
    they have
  • 00:00:38
    the core itself which is the kind of the
  • 00:00:40
    processing logic that's actually doing
  • 00:00:42
    things you could think of this
  • 00:00:44
    in general terms as sort of the
  • 00:00:45
    arithmetic logic unit the alu
  • 00:00:48
    there's going to be for each of the
  • 00:00:50
    cores this would be typical of for
  • 00:00:52
    example a current intel processor
  • 00:00:54
    there's going to be some level 1 cache
  • 00:00:56
    that's dedicated to that core it's often
  • 00:00:58
    split into instruction cache and data
  • 00:01:00
    cache
  • 00:01:01
    there's also going to be a bunch of
  • 00:01:02
    control logic so circuitry that kind of
  • 00:01:05
    orchestrates the behavior of the core
  • 00:01:07
    which is distinct from the core itself
  • 00:01:09
    in the sense that the core is there to
  • 00:01:11
    do the computation to do an addition or
  • 00:01:13
    multiplication or whatever and the
  • 00:01:14
    control logic is there to
  • 00:01:16
    orchestrate that process so there's
  • 00:01:19
    basically four copies of that stamped
  • 00:01:21
    out on this particular chip
  • 00:01:22
    and then a very large proportion of the
  • 00:01:24
    chip is going to be taken
  • 00:01:26
    up by additional cache memory to improve
  • 00:01:28
    the performance of the processor
  • 00:01:30
    to lower the latency which is kind of
  • 00:01:33
    the focus
  • 00:01:33
    of a cpu processor
  • 00:01:37
    so we'll have some level 2 cache that's
  • 00:01:40
    going to be
  • 00:01:41
    shared in different ways in different
  • 00:01:42
    architectures and then level 3 cache
  • 00:01:43
    that's usually
  • 00:01:44
    shared across all of the cores and
  • 00:01:46
    that's kind of where the processor chip
  • 00:01:48
    ends and of course we can't have a
  • 00:01:51
    modern computer without additional
  • 00:01:53
    memory so the dram here is just
  • 00:01:55
    the general memory on the motherboard
  • 00:01:57
    for example
  • 00:01:59
    again the point of the architecture of
  • 00:02:02
    the cpu is to
  • 00:02:04
    run a lot of different types of compute
  • 00:02:06
    jobs very flexibly and with very low
  • 00:02:09
    latency
  • 00:02:10
    hence all of the cache that's devoted or
  • 00:02:12
    the space that's devoted on the chip to
  • 00:02:14
    cache
  • 00:02:16
    when we look at the look at the gpu then
  • 00:02:18
    obviously the
  • 00:02:19
    the architecture is quite a lot
  • 00:02:20
    different so there's going to be
  • 00:02:23
    still control logic and a little bit of
  • 00:02:26
    cache memory that's going on
  • 00:02:27
    across the gpu but the overwhelming
  • 00:02:30
    majority of the gpu
  • 00:02:32
    chip itself is going to be given over to
  • 00:02:34
    processing cores
  • 00:02:36
    and very many of them the key difference
  • 00:02:38
    here is that
  • 00:02:39
    the emphasis on the gpu side is not
  • 00:02:42
    going to be
  • 00:02:43
    low latency in other words something
  • 00:02:45
    that's going to require a lot of
  • 00:02:46
    elaborate caching in the in the memory
  • 00:02:49
    hierarchy
  • 00:02:50
    but high throughput so we can apply a
  • 00:02:53
    lot of different processing cores to do
  • 00:02:55
    small pieces of calculations of the
  • 00:02:57
    overall
  • 00:02:58
    job that we're trying to accomplish and
  • 00:03:00
    by
  • 00:03:01
    reducing the amount of chip space that's
  • 00:03:03
    dedicated to these kinds of other things
  • 00:03:05
    like caching and
  • 00:03:07
    more elaborate control you know out of
  • 00:03:08
    order execution and that kind of stuff
  • 00:03:10
    we can dramatically improve the number
  • 00:03:13
    of cores
  • 00:03:14
    or increase the number of cores that are
  • 00:03:15
    available on the chip so we can just do
  • 00:03:17
    more
  • 00:03:18
    processing now it's not going to be as
  • 00:03:20
    low latency because we don't have all of
  • 00:03:22
    that caching we've got a little bit of
  • 00:03:23
    cache but not much by comparison to a cpu
  • 00:03:26
    but because we have so many individual
  • 00:03:28
    processing cores
  • 00:03:29
    we can do a lot of calculation per time
  • 00:03:32
    in other words the throughput can be
  • 00:03:33
    very high
  • 00:03:34
    even though the latency isn't super
  • 00:03:36
    great
  • 00:03:37
    gpu is also going to have some amount of
  • 00:03:39
    level 2 cache that's shared across
  • 00:03:41
    all of the cores and then as with the
  • 00:03:44
    cpu we've got some external memory
  • 00:03:46
    called dram here this would be say in a
  • 00:03:49
    typical
  • 00:03:50
    desktop kind of installation you'd have
  • 00:03:52
    a graphics processing unit that plugged
  • 00:03:54
    into the back plane of the motherboard
  • 00:03:56
    and it would include its own memory
  • 00:03:58
    sometimes called graphics memory or
  • 00:04:00
    frame buffer memory or whatever
  • 00:04:02
    for our purposes in general purpose gpu
  • 00:04:04
    computing
  • 00:04:05
    we're not really thinking of that
  • 00:04:06
    necessarily as something that's going to
  • 00:04:07
    be used to drive
  • 00:04:08
    a video display although in a graphics
  • 00:04:10
    card that's almost always what it's
  • 00:04:12
    being used for but we do have local
  • 00:04:15
    memory on the video card
  • 00:04:16
    that's accessible to the to the gpu
  • 00:04:21
    here's a not so brief summary of
  • 00:04:24
    the various generations that nvidia has
  • 00:04:26
    been through with its cuda architecture
  • 00:04:28
    uh sort of a history of graphics
  • 00:04:31
    processing
  • 00:04:31
    if we go back here to the very beginning
  • 00:04:34
    in 1999
  • 00:04:35
    saw the introduction of nvidia's geforce
  • 00:04:38
    256
  • 00:04:39
    which was really the first graphics
  • 00:04:41
    processing unit to be
  • 00:04:42
    released into the wild and there was a
  • 00:04:44
    whole variety of different
  • 00:04:46
    improvements on that card uh throughout
  • 00:04:48
    the uh
  • 00:04:49
    the early 2000s where we're really
  • 00:04:52
    interested here
  • 00:04:52
    is at the beginning of what i've what
  • 00:04:55
    i've labeled here the beginning of kind
  • 00:04:57
    of the cuda era
  • 00:04:58
    starting here in 2008 and kind of going
  • 00:05:00
    forward until the present
  • 00:05:01
    we've got sort of the modern era of gpus
  • 00:05:05
    and a key distinguishing characteristic
  • 00:05:06
    here is that these gpus allow
  • 00:05:09
    for very high performance graphics
  • 00:05:11
    generation
  • 00:05:12
    but they also are being designed to
  • 00:05:14
    allow for general purpose computing
  • 00:05:16
    which is
  • 00:05:17
    really our interest in them in this
  • 00:05:18
    course
  • 00:05:20
    so the first of these are what what
  • 00:05:22
    nvidia refers to as the micro
  • 00:05:24
    architecture
  • 00:05:25
    which is just kind of a generation or a
  • 00:05:27
    family of gpu
  • 00:05:29
    processors that they've released that
  • 00:05:30
    have kind of common characteristics
  • 00:05:33
    the first of those was called tesla it
  • 00:05:35
    was introduced in 2008
  • 00:05:37
    and there's just some interesting
  • 00:05:39
    statistics here you can see across
  • 00:05:41
    these different generations the first of
  • 00:05:43
    these is the the process
  • 00:05:45
    and the process by that i mean the the
  • 00:05:48
    fabrication process that was used
  • 00:05:50
    to make the chips and one way of
  • 00:05:53
    understanding the the differences from
  • 00:05:55
    one process to another is
  • 00:05:57
    what's known as the feature size or the
  • 00:05:59
    basically the size of a single
  • 00:06:00
    transistor or a single component that
  • 00:06:02
    you're going to put down on the on the
  • 00:06:03
    chip itself
  • 00:06:05
    and that's measured in nanometers which
  • 00:06:08
    is pretty small
  • 00:06:09
    and you can see back in the in the late
  • 00:06:11
    90s with the first gpu
  • 00:06:13
    they were working at 220 nanometers and
  • 00:06:16
    by the time
  • 00:06:17
    we got to 2008 with the tesla we're down
  • 00:06:19
    to 65 nanometers
  • 00:06:21
    and that's been decreasing ever since uh
  • 00:06:23
    to the present
  • 00:06:24
    generation of nvidia gpus uses a seven
  • 00:06:28
    nanometer process
  • 00:06:29
    so it's it's a couple hundred times
  • 00:06:32
    denser
  • 00:06:33
    than what we had back at the beginning
  • 00:06:35
    the upside is that you can can
  • 00:06:37
    you can fit more components on the
  • 00:06:39
    surface of a chip
  • 00:06:40
    which allows you to build something
  • 00:06:42
    that's more and more capable
  • 00:06:44
    you can see also a similar measure here
  • 00:06:46
    in the next column the number of
  • 00:06:47
    transistors that are on that chip
  • 00:06:49
    which surprisingly is something that
  • 00:06:51
    nvidia is pretty happy to trumpet
  • 00:06:53
    if you look for that kind of information
  • 00:06:55
    from intel or amd
  • 00:06:56
    the sort of the cpu vendors they're less
  • 00:06:59
    they're less happy about sharing that
  • 00:07:00
    information
  • 00:07:01
    but you can see we went from you know in
  • 00:07:03
    the neighborhood of a couple of hundred
  • 00:07:05
    million
  • 00:07:05
    transistors which is still a lot to now
  • 00:07:08
    the
  • 00:07:09
    the ampere micro architecture is at 28.3
  • 00:07:13
    billion
  • 00:07:13
    transistors on one chip which is that's
  • 00:07:16
    just a lot of
  • 00:07:17
    billions so we're seeing
  • 00:07:20
    smaller and smaller feature sizes which
  • 00:07:22
    leads to larger and larger numbers of
  • 00:07:24
    transistors that we can use
  • 00:07:25
    and um and that means more and more
  • 00:07:28
    capability over time
  • 00:07:30
    and we're going to kind of unpack some
  • 00:07:31
    of these micro architecture generations
  • 00:07:33
    so we'll start with tesla
  • 00:07:35
    let me just point out some highlights
  • 00:07:37
    here that i've put over here in the
  • 00:07:38
    notes column
  • 00:07:39
    at the pascal architecture one of the
  • 00:07:41
    things that nvidia introduced was this
  • 00:07:43
    notion of a unified memory
  • 00:07:46
    we'll we'll sort of be working with the
  • 00:07:48
    less capable
  • 00:07:50
    parts here so we won't have that
  • 00:07:52
    advantage of this unified memory we have
  • 00:07:54
    to
  • 00:07:54
    we have to orchestrate moving
  • 00:07:55
    information back and forth between cpu
  • 00:07:57
    memory
  • 00:07:58
    and gpu memory but that's kind of
  • 00:08:00
    cumbersome
  • 00:08:01
    and results in some limitations and more
  • 00:08:04
    complexity in the code
  • 00:08:06
    and so nvidia has tried to address that
  • 00:08:07
    by first of all making the
  • 00:08:09
    memory space on the gpu itself unified
  • 00:08:12
    so we'll talk about the different types
  • 00:08:14
    of memory that are on the gpu
  • 00:08:16
    and those now are accessible through a
  • 00:08:18
    single address space
  • 00:08:20
    but they've also now provided a
  • 00:08:21
    mechanism where you can program
  • 00:08:23
    uh your applications in such a way that
  • 00:08:26
    whether you
  • 00:08:26
    allocate memory on the cpu memory or on
  • 00:08:29
    the gpu memory
  • 00:08:30
    they're considered part of the same
  • 00:08:32
    address space and you can kind of move
  • 00:08:33
    things back and forth and access those
  • 00:08:35
    memories
  • 00:08:36
    uh kind of transparently in your code
  • 00:08:38
    and then the runtime environment
  • 00:08:40
    moves things back and forth
  • 00:08:44
    in with the volta micro architecture in
  • 00:08:47
    2017
  • 00:08:48
    uh nvidia introduced what they call
  • 00:08:50
    tensor cores
  • 00:08:52
    so obviously at this point in history
  • 00:08:55
    we're seeing the the rise of machine
  • 00:08:56
    learning and deep neural networks
  • 00:08:58
    and the use of gpus specifically to
  • 00:09:01
    train neural networks and also to
  • 00:09:02
    evaluate
  • 00:09:03
    uh and solve problems using those neural
  • 00:09:05
    networks and although the
  • 00:09:07
    the prior generations of gpus were
  • 00:09:09
    really good at that
  • 00:09:10
    um what nvidia tried to do was uh
  • 00:09:13
    introduce
  • 00:09:14
    uh technology specifically designed for
  • 00:09:18
    processing deep neural networks and
  • 00:09:19
    that's what they refer to as their
  • 00:09:21
    tensor cores
  • 00:09:22
    and then in 2019 with the touring micro
  • 00:09:25
    architecture they
  • 00:09:26
    added a ray tracing capability so
  • 00:09:29
    essentially real time ray tracing for
  • 00:09:31
    graphics rendering
  • 00:09:32
    and a fairly considerable percentage of
  • 00:09:34
    the chip surface
  • 00:09:36
    in the turing architecture is dedicated
  • 00:09:39
    specifically to ray tracing
  • 00:09:41
    so in a way we're seeing kind of another
  • 00:09:43
    one of these pendulum swings that we see
  • 00:09:45
    in computing quite regularly
  • 00:09:47
    back in the in the late 90s early 2000s
  • 00:09:50
    there wasn't any kind of general purpose
  • 00:09:54
    focus in the design of these chips it
  • 00:09:55
    was very specific
  • 00:09:57
    logic on these chips that was focused on
  • 00:09:59
    doing
  • 00:10:00
    graphic stuff only and in order to use
  • 00:10:03
    those
  • 00:10:04
    generations of chips for general purpose
  • 00:10:06
    computing you had to kind of recast your
  • 00:10:08
    problem as a graphics problem
  • 00:10:10
    let the gpu do its thing and then kind
  • 00:10:12
    of map it back into the
  • 00:10:13
    the the more familiar computing domain
  • 00:10:17
    and with with starting with the tesla
  • 00:10:20
    architecture that was not really
  • 00:10:21
    necessary anymore you could just program
  • 00:10:23
    your application directly
  • 00:10:24
    but now we're starting to see a return a
  • 00:10:26
    little bit to some dedicated hardware
  • 00:10:28
    that's really focused on graphics
  • 00:10:29
    uh in this case to do to do real-time
  • 00:10:32
    ray tracing
  • 00:10:34
    um another thing here is that there's a
  • 00:10:36
    little bit of terminological confusion
  • 00:10:38
    and we'll see that there's different
  • 00:10:39
    terminology that gets used in different
  • 00:10:40
    generations there's a little variation
  • 00:10:42
    over time
  • 00:10:43
    but because the first microarchitecture
  • 00:10:46
    that nvidia introduced that did general
  • 00:10:48
    purpose computing
  • 00:10:49
    was called tesla they have kind of held
  • 00:10:53
    that
  • 00:10:53
    term to refer not only to this specific
  • 00:10:56
    generation of chips
  • 00:10:58
    but to refer to the basic idea of doing
  • 00:11:01
    general purpose computing on the
  • 00:11:02
    on the nvidia chipset so you'll still
  • 00:11:05
    hear hear and read people talking about
  • 00:11:07
    the tesla cores
  • 00:11:08
    and that really just is a general or has
  • 00:11:10
    become kind of a general reference
  • 00:11:12
    to this ability to do general purpose
  • 00:11:14
    computing on the gpu
  • 00:11:16
    and one final note here from this table
  • 00:11:18
    is the next generation of
  • 00:11:20
    chip that nvidia has announced is
  • 00:11:22
    actually named after admiral grace
  • 00:11:24
    murray hopper
  • 00:11:25
    who was one of the early pioneers in
  • 00:11:26
    computing and it was kind of cool to see
  • 00:11:29
    she's also the first woman on this list
  • 00:11:31
    obviously these are the names of famous
  • 00:11:33
    scientists and engineers and so forth
  • 00:11:34
    over time and she's the first woman
  • 00:11:36
    and one of the sort of founding founders
  • 00:11:40
    of computer science in in many
  • 00:11:42
    ways with the introduction of some of
  • 00:11:43
    the work that she did uh
  • 00:11:45
    years ago
  • 00:11:48
    all right um there's lots of terminology
  • 00:11:52
    that you kind of need to get embedded a
  • 00:11:54
    little bit when you're reading and
  • 00:11:56
    and trying to understand these gpus so
  • 00:11:58
    this is kind of a little secret decoder
  • 00:12:00
    ring some of the common things that
  • 00:12:01
    might be mystifying
  • 00:12:02
    i've starred the ones that are really
  • 00:12:04
    important so these two
  • 00:12:06
    uh these two terms here the streaming
  • 00:12:07
    multiprocessor and the streaming
  • 00:12:09
    processor
  • 00:12:10
    unfortunately they're kind of close in
  • 00:12:12
    in meaning and also
  • 00:12:14
    in terms of the initialism that's used
  • 00:12:15
    to represent them but those are really
  • 00:12:17
    the two
  • 00:12:18
    major kind of groupings of processors or
  • 00:12:21
    processing capability
  • 00:12:23
    that we care about in the gpu there's
  • 00:12:26
    there's still some specialized graphics
  • 00:12:27
    focused kinds of
  • 00:12:29
    hardware units on the gpu itself we're
  • 00:12:31
    not really going to drill into those in
  • 00:12:32
    any sort of detail
  • 00:12:34
    and you got to kind of filter those
  • 00:12:36
    things out when you're reading
  • 00:12:38
    documentation and spec sheets and stuff
  • 00:12:40
    for nvidia products
  • 00:12:41
    when you're just thinking about doing
  • 00:12:43
    cuda programming as opposed to doing
  • 00:12:44
    graphics programming
  • 00:12:47
    i learned something new the other day
  • 00:12:49
    that these are not acronyms
  • 00:12:50
    that they're actually called initialisms
  • 00:12:53
    uh an acronym is an initialism that's
  • 00:12:55
    also
  • 00:12:56
    a pronounceable word so unless you want
  • 00:12:58
    to pronounce
  • 00:12:59
    this sm as smum and smup or something
  • 00:13:02
    like that
  • 00:13:03
    they're not actually acronyms they're
  • 00:13:05
    initialisms so there you go free
  • 00:13:06
    knowledge
  • 00:13:08
    in addition these are some other other
  • 00:13:10
    terminologies that get used pretty
  • 00:13:12
    regularly
  • 00:13:13
    so the the streaming multiprocessor is
  • 00:13:16
    kind of a collection of streaming
  • 00:13:17
    processors so multiprocessor contains
  • 00:13:19
    processor
  • 00:13:21
    and those streaming multiprocessors are
  • 00:13:23
    also clustered together in kind of
  • 00:13:24
    larger units on the chip
  • 00:13:26
    and we'll see references to the tpc
  • 00:13:29
    which stands for texture slash processor
  • 00:13:32
    cluster
  • 00:13:33
    and gpc which stands for graphics
  • 00:13:35
    processing cluster
  • 00:13:36
    and these um i don't know that the
  • 00:13:38
    distinction between them is that
  • 00:13:39
    important
  • 00:13:40
    they're really just larger groupings of
  • 00:13:42
    symmetric multiprocessors which are
  • 00:13:44
    themselves groupings of
  • 00:13:46
    streaming process sorry streaming
  • 00:13:48
    multiprocessors which group streaming
  • 00:13:49
    processors
  • 00:13:51
    and something else here um the there's a
  • 00:13:54
    important difference here between
  • 00:13:55
    this the single precision and double
  • 00:13:57
    precision arithmetic that are performed
  • 00:13:59
    by the streaming processors
  • 00:14:01
    in the early going because the
  • 00:14:04
    the density with which the manufacturers
  • 00:14:07
    could get
  • 00:14:08
    features on the chips the early
  • 00:14:11
    streaming processors tended to be single
  • 00:14:13
    precision
  • 00:14:14
    which basically meant they could do
  • 00:14:16
    32-bit arithmetic
  • 00:14:18
    as opposed to double precision which is
  • 00:14:21
    64-bit so there was a
  • 00:14:24
    in order to do the 64-bit calculations
  • 00:14:27
    which you could still do you had to sort
  • 00:14:28
    of split them up
  • 00:14:30
    and run half of each on two different
  • 00:14:32
    streaming
  • 00:14:33
    processors which essentially halved the
  • 00:14:36
    performance
  • 00:14:37
    but at some point along the way nvidia
  • 00:14:39
    figured out ways to get enough
  • 00:14:41
    transistors on the surface of the chip
  • 00:14:43
    to allow you to do direct
  • 00:14:44
    double precision arithmetic and when
  • 00:14:46
    that starts to arise in the history of
  • 00:14:49
    these
  • 00:14:49
    of these chipsets we'll we'll kind of
  • 00:14:51
    point out where
  • 00:14:52
    where we see them starting to cite both
  • 00:14:55
    single precision and double precision
  • 00:14:56
    processing on a single
  • 00:14:58
    gpu this is a
  • 00:15:01
    an illustration that i thought was
  • 00:15:03
    particularly clear in
  • 00:15:04
    expressing the different main components
  • 00:15:07
    of an nvidia gpu
  • 00:15:08
    this actually corresponds to the geforce
  • 00:15:10
    8800 which was one of the first
  • 00:15:12
    tesla micro architecture chip sets that
  • 00:15:15
    was released
  • 00:15:17
    and i don't expect you to kind of
  • 00:15:19
    understand all of the details here but
  • 00:15:20
    it kind of gives you a nice
  • 00:15:22
    sort of overall view of things so we can
  • 00:15:24
    see here that
  • 00:15:25
    uh the the gpu as labeled here is really
  • 00:15:29
    the main portion of what's going on in
  • 00:15:31
    this chip there's some other things
  • 00:15:32
    happening like there's a connection to
  • 00:15:34
    allow you to talk to the host computer
  • 00:15:36
    um to to access the memory on them on
  • 00:15:39
    the motherboard and so forth
  • 00:15:41
    as well as access here to to dram that's
  • 00:15:44
    on the
  • 00:15:45
    on the board right so this is the the
  • 00:15:47
    video memory or the frame buffer memory
  • 00:15:49
    but what we're interested in here is
  • 00:15:51
    kind of this central portion
  • 00:15:53
    that represents the key computational
  • 00:15:56
    components that are in play here on
  • 00:15:58
    a gpu so in this particular chip there
  • 00:16:01
    are
  • 00:16:01
    eight of these tpcs which as you'll
  • 00:16:04
    recall back here
  • 00:16:05
    stands for texture slash processor
  • 00:16:07
    cluster and we're thinking about this as
  • 00:16:09
    the processor cluster so this is just a
  • 00:16:12
    kind of a high level grouping
  • 00:16:14
    of the streaming multiprocessor and the
  • 00:16:17
    streaming processor
  • 00:16:18
    to provide a mechanism to kind of get
  • 00:16:21
    multiple different
  • 00:16:22
    parts of the chip to do different things
  • 00:16:23
    at different times and so forth
  • 00:16:25
    there's quite a lot of quite a lot of
  • 00:16:26
    technology turned under to figure out
  • 00:16:28
    how to do that a lot of these
  • 00:16:30
    block diagram elements up here are are
  • 00:16:32
    focused on doing that
  • 00:16:34
    and we're going to drill down into
  • 00:16:35
    what's going on inside those tpcs
  • 00:16:38
    some other thing going on other things
  • 00:16:39
    going on here um there's a what it just
  • 00:16:41
    says interconnection network
  • 00:16:43
    um the uh the the
  • 00:16:46
    genesis of this whole notion of a
  • 00:16:48
    streaming multi-processor
  • 00:16:50
    or streaming processors within the
  • 00:16:51
    streaming multiprocessors
  • 00:16:53
    is that you can sort of stream a
  • 00:16:55
    calculation from one processor to the
  • 00:16:57
    next processor to the next processor
  • 00:17:00
    within the within the gpu so if you if
  • 00:17:02
    you go back and look at some of the
  • 00:17:04
    implementation details of an early
  • 00:17:06
    graphics processing unit
  • 00:17:08
    what you would find is that you'd get
  • 00:17:10
    some input from
  • 00:17:11
    the cpu right so some some element of
  • 00:17:14
    the model that you were trying to render
  • 00:17:16
    say in 3d or 2d and then there would be
  • 00:17:18
    a series of processors
  • 00:17:20
    that were very specifically tailored in
  • 00:17:23
    the hardware
  • 00:17:24
    to handle the the different stages in
  • 00:17:26
    what's called the
  • 00:17:27
    graphics processing pipeline so the cpu
  • 00:17:30
    feeds the gpu some sort of information
  • 00:17:32
    about a vertex or whatever in the scene
  • 00:17:35
    that's to be rendered
  • 00:17:36
    and then these different stages along
  • 00:17:37
    the way represented very specific pieces
  • 00:17:40
    of hardware that did certain kinds of
  • 00:17:41
    things
  • 00:17:42
    and would stream from one to the next to
  • 00:17:44
    the next and eventually
  • 00:17:45
    at the end of this process the the final
  • 00:17:48
    final processor in the step is going to
  • 00:17:50
    put something into into
  • 00:17:51
    the dram the frame buffer memory so it
  • 00:17:54
    actually shows up on the screen
  • 00:17:56
    now that's all fine and good except
  • 00:17:58
    that's not very flexible so if
  • 00:18:00
    you know if this particular step in the
  • 00:18:02
    process
  • 00:18:03
    is really busy and this step in the
  • 00:18:05
    process is really not doing very much
  • 00:18:06
    for a particular scene that's being
  • 00:18:08
    rendered
  • 00:18:09
    you're sort of leaving processing power
  • 00:18:10
    on the table but because these are
  • 00:18:12
    purpose-built
  • 00:18:14
    units within the gpu there's really no
  • 00:18:17
    way to flexibly
  • 00:18:18
    alter that behavior in order to get
  • 00:18:21
    certain parts of the hardware to do a
  • 00:18:22
    different job than it was designed for
  • 00:18:25
    so the idea here of but let me step back
  • 00:18:28
    you can see here the notion of streaming
  • 00:18:30
    though right the information is coming
  • 00:18:31
    from this
  • 00:18:32
    cpu and it streams from one processing
  • 00:18:35
    element to the next to the next until it
  • 00:18:36
    finally ends up in the frame buffer
  • 00:18:38
    so that notion of streaming is still
  • 00:18:40
    really important but instead of having
  • 00:18:42
    these kind of purpose-built modules that
  • 00:18:44
    only do one kind of calculation
  • 00:18:46
    the idea in the general purpose gpu or
  • 00:18:49
    the cuda architecture
  • 00:18:51
    is to say let's not let's not have these
  • 00:18:53
    specific units let's
  • 00:18:55
    let's instead offer a very powerful very
  • 00:18:58
    flexible
  • 00:18:59
    general-purpose processor and allow it
  • 00:19:03
    to be hooked up in such a way that we
  • 00:19:04
    get this kind of streaming behavior
  • 00:19:06
    so what we'll see here each of the
  • 00:19:08
    little elements inside these
  • 00:19:10
    inside these tpcs is one of those
  • 00:19:12
    processors
  • 00:19:13
    and you'll you'll notice here that uh
  • 00:19:17
    in this in this kind of very primitive
  • 00:19:20
    representation of a graphics processing
  • 00:19:22
    pipeline
  • 00:19:23
    this this is kind of a long series of
  • 00:19:25
    operations
  • 00:19:26
    not me not really long but there's
  • 00:19:28
    multiple steps in that process
  • 00:19:30
    and what what the what the streaming
  • 00:19:33
    multiprocessor the streaming processor
  • 00:19:35
    architecture and cuda tries to do
  • 00:19:37
    is to give you the ability to do this
  • 00:19:39
    kind of streaming
  • 00:19:40
    from one process to another to another
  • 00:19:43
    but using these general purpose
  • 00:19:44
    stream processor elements the
  • 00:19:47
    interconnection network then
  • 00:19:49
    is responsible for sort of routing
  • 00:19:51
    intermediate calculation results from
  • 00:19:53
    one streaming processor to another
  • 00:19:55
    but instead of it being instead of it
  • 00:19:58
    being
  • 00:19:59
    helpful to think about it as kind of a
  • 00:20:01
    linear chain
  • 00:20:02
    instead what you're seeing is the
  • 00:20:04
    ability for say
  • 00:20:06
    one of these tpcs to do a bunch of
  • 00:20:08
    calculations
  • 00:20:09
    and that output goes onto this
  • 00:20:11
    interconnection network and might feed
  • 00:20:13
    into
  • 00:20:14
    another set of stream processors
  • 00:20:18
    which could do a different calculation
  • 00:20:19
    and then those results could stream into
  • 00:20:21
    another one
  • 00:20:22
    and eventually that information is going
  • 00:20:23
    to or the result of those
  • 00:20:25
    calculations is going to make it out to
  • 00:20:26
    dram so you can still get the
  • 00:20:30
    this ability to stream calculations from
  • 00:20:32
    one stage to the next stage the next
  • 00:20:33
    stage
  • 00:20:34
    as you transform the information from
  • 00:20:36
    something that's used to model the scene
  • 00:20:38
    on the cpu and end up with pixels on the
  • 00:20:42
    screen
  • 00:20:42
    at the end of that process but it's now
  • 00:20:45
    more flexible than what you had when you
  • 00:20:46
    had dedicated processing stages
  • 00:20:49
    so if if you have more
  • 00:20:52
    more need to replicate the behavior of
  • 00:20:54
    this block you might allocate several
  • 00:20:56
    of the tpcs to doing this function
  • 00:21:00
    and only one of them for example just
  • 00:21:03
    kind of making this up
  • 00:21:05
    to handle this next set of functionality
  • 00:21:07
    and because the processors in the
  • 00:21:09
    streaming
  • 00:21:10
    streaming multiprocessor are general
  • 00:21:12
    purpose computers right
  • 00:21:13
    you can have them do any of these
  • 00:21:14
    functions that you'd like and and change
  • 00:21:17
    that
  • 00:21:17
    over time during the execution of a
  • 00:21:19
    single program or between
  • 00:21:21
    different programs that are going to run
  • 00:21:22
    on the gpu
  • 00:21:24
    the other cool thing about that is we
  • 00:21:27
    can now use these general purpose
  • 00:21:29
    streaming processors to do whatever kind
  • 00:21:31
    of calculations we want
  • 00:21:33
    whether it's specifically tailored to
  • 00:21:34
    doing some graphics thing or we're just
  • 00:21:37
    doing a say a large matrix operation
  • 00:21:39
    that we're storing out here in
  • 00:21:41
    in frame buffer memory but we're never
  • 00:21:43
    actually going to show on the screen
  • 00:21:44
    what that frame buffer memory contains
  • 00:21:46
    because it's just data at that point
  • 00:21:47
    it's not designed to be pixels
  • 00:21:49
    it's just the uh the input or the output
  • 00:21:52
    from the calculation that we're doing
  • 00:21:54
    on all of these very large number of
  • 00:21:56
    streaming processors
  • 00:22:01
    here's a little zoomed in view so this
  • 00:22:02
    is that the picture of the overall
  • 00:22:05
    architecture of the gpu for the 8800
  • 00:22:08
    and this kind of gives you a little bit
  • 00:22:09
    more detail on this on this tpc the pro
  • 00:22:12
    the processing cluster
  • 00:22:14
    so there's two units inside of here that
  • 00:22:17
    are of interest to us
  • 00:22:18
    both of which are labeled sm for
  • 00:22:21
    streaming multiprocessor
  • 00:22:23
    and within the streaming multiprocessor
  • 00:22:25
    there are the individual
  • 00:22:26
    sps the streaming processor and it's the
  • 00:22:29
    sp that's really
  • 00:22:30
    i mean this is what we're what we refer
  • 00:22:31
    to as a core in
  • 00:22:33
    informal terms right that's a that's a
  • 00:22:35
    gpu core or a cuda core you could call
  • 00:22:37
    it
  • 00:22:37
    and you can see in this architecture we
  • 00:22:39
    have um
  • 00:22:41
    eight of those cores eight of those sps
  • 00:22:43
    grouped together
  • 00:22:44
    into an sm and we've got two of those
  • 00:22:47
    sms that are grouped together into a tpc
  • 00:22:49
    and then we saw previously that we have
  • 00:22:51
    eight of these tpcs that are grouped
  • 00:22:53
    together to make up the entirety of the
  • 00:22:55
    gpu
  • 00:22:56
    i've also got here a little bit of an
  • 00:22:58
    exploded diagram of an individual
  • 00:23:00
    stream streaming multiprocessor so we
  • 00:23:02
    can see here in a little bit more detail
  • 00:23:04
    we've got the streaming processors of
  • 00:23:06
    course we have some special functional
  • 00:23:08
    units which do things like
  • 00:23:10
    transcendental functions like sine and
  • 00:23:11
    cosine that sort of thing
  • 00:23:13
    we've got some caching going on as i
  • 00:23:15
    mentioned per per streaming
  • 00:23:17
    multiprocessor so there's an instruction
  • 00:23:19
    level cache
  • 00:23:20
    there's also what's called a constant
  • 00:23:22
    cache so if you have specific constant
  • 00:23:24
    values that you need to refer to
  • 00:23:26
    over and over again nearby the
  • 00:23:28
    processors but
  • 00:23:29
    that need that need to change over time
  • 00:23:31
    you can store those in that cache
  • 00:23:33
    and then there's also some shared memory
  • 00:23:36
    which allows
  • 00:23:37
    or provides access to all of the sps
  • 00:23:40
    on an sm
  • 00:23:41
    so if you have a calculation that that
  • 00:23:44
    you need to
  • 00:23:45
    split up into smaller pieces and allow
  • 00:23:48
    those
  • 00:23:48
    individual pieces to collaborate with
  • 00:23:50
    one another to solve the overall problem
  • 00:23:52
    even at the level of
  • 00:23:54
    individual sps within sm they have
  • 00:23:56
    access to this shared memory
  • 00:23:58
    programming the sps if they need to
  • 00:24:00
    collaborate over shared memory
  • 00:24:01
    is quite similar to what we were looking
  • 00:24:03
    at in the early part of the term when we
  • 00:24:05
    were doing p threads programming
  • 00:24:06
    in particular all of the sps can access
  • 00:24:09
    directly the shared memory within the sm
  • 00:24:12
    and in order to get them to cooperate
  • 00:24:13
    with one another you need to do things
  • 00:24:15
    like synchronizing
  • 00:24:16
    access to memory so that you don't step
  • 00:24:17
    on each other's toes so
  • 00:24:19
    buried inside of this whole architecture
  • 00:24:22
    is a little shared memory machine
  • 00:24:24
    with some number of sps attached to that
  • 00:24:26
    shared memory
  • 00:24:27
    and then of course because there is this
  • 00:24:30
    large collection of dram
  • 00:24:32
    memory that's attached to the entire gpu
  • 00:24:35
    the the sps can also access that global
  • 00:24:38
    memory as well
  • 00:24:40
    right so we're going to take a kind of a
  • 00:24:42
    quick tour through these generations of
  • 00:24:44
    the the chips chipsets and the micro
  • 00:24:46
    architectures that we've been talking
  • 00:24:47
    about so
  • 00:24:48
    starting out with the tesla architecture
  • 00:24:50
    and we've kind of looked at this
  • 00:24:52
    uh from from the previous illustration
  • 00:24:54
    uh this is another representation of
  • 00:24:56
    that same
  • 00:24:57
    8800 core but i wanted to just kind of
  • 00:25:00
    for consistency here
  • 00:25:01
    break out some of the key statistics so
  • 00:25:03
    this guy has
  • 00:25:04
    eight at this point they were calling
  • 00:25:06
    them graphics processing clusters
  • 00:25:08
    so each of these guys is a gpc and then
  • 00:25:11
    within each of those
  • 00:25:12
    there's two individual
  • 00:25:16
    sms which means we have a total of 16
  • 00:25:20
    streaming multiprocessors and within
  • 00:25:22
    each of those streaming multiprocessors
  • 00:25:24
    as we've seen already there were eight
  • 00:25:26
    streaming processors for a total of 128
  • 00:25:28
    streaming processors in this gpu
  • 00:25:31
    and then there's other things out here
  • 00:25:33
    too this block is
  • 00:25:34
    for information about like shading and
  • 00:25:36
    kind of graphic specific things that
  • 00:25:38
    we're going to just sort of
  • 00:25:39
    ignore and remember that there is an l1
  • 00:25:41
    cache that's associated with the gpc
  • 00:25:44
    and then an l2 cache that can be shared
  • 00:25:47
    among
  • 00:25:48
    multiple sms before we get out to
  • 00:25:51
    frame buffer memory here the fb and then
  • 00:25:54
    we can zoom in again on the individual
  • 00:25:57
    streaming multiprocessor so just one of
  • 00:26:00
    these chunks
  • 00:26:00
    of the the tpc is the streaming
  • 00:26:03
    multiprocessor and it's got streaming
  • 00:26:05
    processors inside of it
  • 00:26:07
    uh and then these other units here have
  • 00:26:08
    more to do with with dealing with some
  • 00:26:10
    graphic specific things like texture
  • 00:26:12
    texture maps
  • 00:26:16
    another release in kind of the second
  • 00:26:19
    generation of tesla
  • 00:26:21
    chips was the geforce 280 this is also
  • 00:26:23
    in 2006.
  • 00:26:24
    i guess there was a little bit more
  • 00:26:26
    emphasis this is again
  • 00:26:27
    an illustration grabbed from some of the
  • 00:26:30
    documentation white papers about these
  • 00:26:31
    guys
  • 00:26:32
    so 3d diagrams was apparently really
  • 00:26:34
    cool at that point
  • 00:26:36
    so uh we have a little bit more capable
  • 00:26:39
    uh chip here we've got 10
  • 00:26:41
    tpcs each of which has three streaming
  • 00:26:43
    multiprocessors so a total of 30
  • 00:26:45
    streaming multiprocessors
  • 00:26:47
    and each of those again had eight uh
  • 00:26:50
    streaming processors for a total of 240
  • 00:26:53
    streaming processors i'll have a little
  • 00:26:54
    summary slide here at the end that lets
  • 00:26:56
    us kind of compare
  • 00:26:57
    these key statistics that i've kind of
  • 00:26:59
    outlined here
  • 00:27:00
    for the number of sms and the number of
  • 00:27:02
    sps that's really the critical
  • 00:27:04
    uh the critical number that we're
  • 00:27:06
    interested in when we're thinking about
  • 00:27:07
    how do we program this
  • 00:27:08
    uh using the cuda platform
  • 00:27:12
    this is a a little larger view of
  • 00:27:15
    an individual tpc right so there's ten
  • 00:27:18
    of these
  • 00:27:18
    so we're looking at one of these guys
  • 00:27:20
    and you can see here's
  • 00:27:21
    the three streaming multiprocessors each
  • 00:27:24
    of which has eight
  • 00:27:25
    individual again cores or s this is the
  • 00:27:28
    streaming processor the sp or the core
  • 00:27:31
    and then there's some local memory that
  • 00:27:33
    would be the shared memory that's
  • 00:27:34
    available to all these cores
  • 00:27:35
    and then some texture stuff and a shared
  • 00:27:38
    l1 cache
  • 00:27:39
    across this this tpc
  • 00:27:44
    okay 2010 saw the introduction of the
  • 00:27:47
    fermi
  • 00:27:48
    microarchitecture and you can see here
  • 00:27:51
    this again ripped from the headlines
  • 00:27:54
    just straight from the documentation
  • 00:27:56
    here we're showing that there's 16
  • 00:27:58
    streaming multiprocessors so they're not
  • 00:28:00
    really breaking these out in
  • 00:28:02
    in detail as tpcs at this point in
  • 00:28:04
    history i'm not sure exactly why
  • 00:28:05
    and we can see though so here's
  • 00:28:07
    basically one of those streaming
  • 00:28:08
    multiprocessors
  • 00:28:10
    and each of those has 32 sps for a total
  • 00:28:13
    of 512 sps
  • 00:28:15
    and they're just kind of laying this out
  • 00:28:16
    differently we still have local cache
  • 00:28:18
    here within
  • 00:28:19
    the the streaming multiprocessors
  • 00:28:21
    there's a level two cache as well
  • 00:28:23
    and then the the blocks on the outside
  • 00:28:25
    are basically showing you
  • 00:28:27
    interfaces to the dynamic memory
  • 00:28:30
    if we zoom in on one of these sms we can
  • 00:28:33
    see here that
  • 00:28:34
    we've got a bunch of cores i mentioned
  • 00:28:37
    32 cores or 32 sps per sm so each of
  • 00:28:41
    these is a core and there's actually a
  • 00:28:42
    little exploded diagram here of the core
  • 00:28:44
    notice that this guy has both a floating
  • 00:28:46
    point unit and
  • 00:28:48
    an integer unit so there's two different
  • 00:28:50
    paths through this core that allow you
  • 00:28:52
    to
  • 00:28:53
    simultaneously do a floating point
  • 00:28:55
    operation
  • 00:28:56
    and an integer operation so you kind of
  • 00:28:59
    get double your money if you want for
  • 00:29:01
    the pro
  • 00:29:01
    for certain kinds of calculations you
  • 00:29:03
    can have both of those things active
  • 00:29:05
    i think at this point we're still in the
  • 00:29:07
    single precision uh
  • 00:29:09
    world so if you wanted to do double
  • 00:29:10
    precision arithmetic you
  • 00:29:12
    had to figure out how to how to use
  • 00:29:14
    these things to do that
  • 00:29:17
    we've also here called out the load
  • 00:29:19
    store unit so this is a way of
  • 00:29:21
    streamlining access to memory
  • 00:29:23
    and then the special function units of
  • 00:29:25
    which there's just four
  • 00:29:27
    because presumably you're not going to
  • 00:29:28
    do that many you know sine and cosine
  • 00:29:30
    and transcendental operations
  • 00:29:32
    uh as you are normal sort of
  • 00:29:36
    arithmetic operations you can see here
  • 00:29:38
    there's shared memory the 64k
  • 00:29:40
    of shared memory per
  • 00:29:42
    sm which results in a lot of memory in
  • 00:29:44
    in total
  • 00:29:45
    a connection interconnection network to
  • 00:29:47
    allow you to stream calculations from
  • 00:29:48
    one processor to the next
  • 00:29:50
    and the cores also we haven't seen this
  • 00:29:52
    yet this is essentially
  • 00:29:54
    core local memory so
  • 00:29:58
    in the if you recall in this in a cpu uh
  • 00:30:01
    each of the processor cores has a
  • 00:30:03
    register file right sort of
  • 00:30:04
    really fast nearby static ram that's
  • 00:30:08
    really quick to access by the cores
  • 00:30:10
    there's a similar thing here for the
  • 00:30:11
    cuda cores the register file is quite a
  • 00:30:13
    lot larger
  • 00:30:15
    it's 32,000 entries of 32-bit
  • 00:30:19
    registers
  • 00:30:20
    and those are divided evenly across the
  • 00:30:23
    individual cores and since there's 32 of
  • 00:30:25
    those on this chip
  • 00:30:26
    each of those is going to get 1000 or 1k
  • 00:30:29
    32-bit registers that it can
  • 00:30:31
    individually access really quickly
  • 00:30:33
    so there's really three levels in the
  • 00:30:34
    memory hierarchy an individual core
  • 00:30:36
    has its own dedicated registers from the
  • 00:30:39
    register file so this is split up
  • 00:30:41
    32 different ways those cores can all
  • 00:30:44
    access the shared memory the 64k of
  • 00:30:46
    shared memory here
  • 00:30:48
    and then they can also access what's
  • 00:30:50
    called the global memory which is really
  • 00:30:51
    just the dram
  • 00:30:52
    frame buffers on the gpu card itself
  • 00:30:57
    next up was kepler in 2012
  • 00:31:00
    and the illustration just gets more and
  • 00:31:03
    more dense
  • 00:31:04
    um so you can see there's you know many
  • 00:31:06
    more components
  • 00:31:07
    that are fitting on the on the chip
  • 00:31:09
    surface here and again this is just a
  • 00:31:10
    result of having
  • 00:31:12
    a higher density fabrication process
  • 00:31:14
    that lets you put more processing power
  • 00:31:16
    on the same chip so this guy has
  • 00:31:18
    15 streaming multiprocessors each of
  • 00:31:21
    which has 192
  • 00:31:23
    processors for a total of almost 3 000
  • 00:31:26
    sps on a single chip otherwise the
  • 00:31:29
    architecture is quite similar
  • 00:31:30
    level 1 cache level 2 cache access to
  • 00:31:33
    dram
  • 00:31:34
    local register files and so forth here's
  • 00:31:36
    an exploded view of an almost
  • 00:31:39
    in almost impossible to read a view of
  • 00:31:42
    this
  • 00:31:42
    of this chip but you can see that
  • 00:31:44
    there's a whole variety of different
  • 00:31:45
    cores now that are going to be included
  • 00:31:46
    in here
  • 00:31:47
    some of them are starting to be able to
  • 00:31:49
    provide native
  • 00:31:51
    improvements to floating point
  • 00:31:52
    calculations and so forth they've all
  • 00:31:54
    got load store units
  • 00:31:56
    and then there's also some special
  • 00:31:57
    function units here for transcendental
  • 00:32:00
    operations similarly there's still
  • 00:32:02
    shared memory
  • 00:32:03
    and each of these guys is also going to
  • 00:32:05
    have pieces of a register file
  • 00:32:07
    so it's kind of more of the same being
  • 00:32:09
    stamped out here
  • 00:32:10
    to provide higher throughput
  • 00:32:12
    calculations
  • 00:32:14
    moved to the maxwell architecture in
  • 00:32:16
    2014 so these are nicely evenly spaced for
  • 00:32:19
    the most part in two year increments
  • 00:32:21
    i couldn't find a picture of a maxwell
  • 00:32:24
    gpu but i did find an illustration of a
  • 00:32:26
    maxwell
  • 00:32:27
    streaming multiprocessor nvidia kind of
  • 00:32:30
    played with names here so they called
  • 00:32:32
    some of the sms
  • 00:32:33
    sms and some of them smxs and smms
  • 00:32:36
    it's all kind of the same idea here
  • 00:32:38
    we've got on the
  • 00:32:40
    typical maxwell architecture we're going
  • 00:32:42
    to have 16 of these sms
  • 00:32:44
    with 128 sps per sm so this is one of
  • 00:32:47
    those
  • 00:32:48
    sms and you end up with 2
  • 00:32:51
    000 streaming processors
  • 00:32:54
    per chip again very similar kinds of
  • 00:32:57
    pieces of functionality here i'm
  • 00:32:59
    not going to drill down into these for
  • 00:33:01
    each for each chip set but
  • 00:33:03
    you get the idea that it's just kind of
  • 00:33:05
    more of the same
  • 00:33:07
    2016 introduced the pascal architecture
  • 00:33:10
    and as you can see higher density we've
  • 00:33:13
    got
  • 00:33:14
    now they're starting to use the term
  • 00:33:16
    graphics processing cluster
  • 00:33:18
    graphics processor cluster so there's
  • 00:33:19
    six of these and you can see each of
  • 00:33:21
    those
  • 00:33:22
    is this larger group of of processing
  • 00:33:25
    elements
  • 00:33:26
    within each of those there's 10
  • 00:33:28
    streaming multiprocessors
  • 00:33:30
    right so back at the early generations
  • 00:33:32
    of tesla we had
  • 00:33:33
    10 streaming processors on the whole
  • 00:33:35
    chip now we've got 10 of them just in
  • 00:33:37
    this one little
  • 00:33:38
    section of the chip and we can stamp
  • 00:33:39
    that out a bunch of different times so
  • 00:33:41
    with six gpcs and 10 sms per we've got
  • 00:33:44
    60 streaming multiprocessors
  • 00:33:46
    and those can each have 64 sps
  • 00:33:49
    giving us a total of almost 4 000
  • 00:33:52
    streaming processors on one chip
  • 00:33:54
    otherwise similar things level 1 cache
  • 00:33:56
    level 2 cache register files
  • 00:33:58
    memory access
  • 00:34:02
    here's a close-up view of the individual
  • 00:34:05
    sms and sps so you can see here there's
  • 00:34:09
    some ordinary cores which are similar to
  • 00:34:11
    what we've been seeing already
  • 00:34:12
    but there's these dp units stand for
  • 00:34:15
    double precision units so they know how
  • 00:34:17
    to do double precision arithmetic
  • 00:34:20
    as well as single precision there's a
  • 00:34:22
    load store unit for memory access and
  • 00:34:23
    then a special function unit for each
  • 00:34:25
    group of four cores
  • 00:34:27
    so again more capability on the chip
  • 00:34:30
    giving you more flexibility for the
  • 00:34:32
    calculations you're going to do
  • 00:34:34
    shared memory there's a register file
  • 00:34:36
    for the individual processors and of
  • 00:34:38
    course we have access to global memory
  • 00:34:42
    volta in 2017 again more of the same
  • 00:34:47
    we've got now still six graphics
  • 00:34:49
    processing clusters but instead of 10
  • 00:34:51
    sms per we've got 14 per for a total of
  • 00:34:53
    84
  • 00:34:54
    streaming multiprocessors and now we're
  • 00:34:57
    to the point where we're kind of
  • 00:34:58
    pulling apart single precision and
  • 00:35:01
    double precision so each of these sms
  • 00:35:03
    has 64 single precision cores for
  • 00:35:06
    floating point
  • 00:35:08
    64 cores for 32-bit integer
  • 00:35:11
    arithmetic
  • 00:35:12
    and 32 cores for double precision
  • 00:35:15
    floating point
  • 00:35:16
    and that's kind of the holy grail here
  • 00:35:18
    for gpus in general purpose computing
  • 00:35:20
    most of the
  • 00:35:21
    modeling that gets done in physical
  • 00:35:24
    simulations and similar sorts of things
  • 00:35:27
    requires double precision arithmetic to
  • 00:35:29
    be at all accurate and so the
  • 00:35:31
    introduction of
  • 00:35:33
    dedicated cores on these gpus was a big
  • 00:35:36
    deal
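
(A minimal sketch, not from the lecture, of why those dedicated fp64 cores matter: the same templated kernel instantiated for float and for double. On parts with dedicated double-precision units the second launch keeps a usable fraction of the single-precision rate; on consumer parts without them it falls far behind. Array contents are left uninitialized since only throughput is of interest here.)

```cpp
// Same kernel, two precisions: y = a*x + y.
#include <cuda_runtime.h>

template <typename T>
__global__ void axpy(int n, T a, const T *x, T *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // one fused multiply-add per element
}

int main() {
    const int n = 1 << 20;
    float  *xf, *yf;
    double *xd, *yd;
    cudaMalloc(&xf, n * sizeof(float));  cudaMalloc(&yf, n * sizeof(float));
    cudaMalloc(&xd, n * sizeof(double)); cudaMalloc(&yd, n * sizeof(double));

    // The float version runs on the fp32 cores; the double version needs fp64
    // units, dedicated ones on chips like Volta's (32 per SM).
    axpy<<<(n + 255) / 256, 256>>>(n, 2.0f, xf, yf);
    axpy<<<(n + 255) / 256, 256>>>(n, 2.0,  xd, yd);
    cudaDeviceSynchronize();

    cudaFree(xf); cudaFree(yf); cudaFree(xd); cudaFree(yd);
    return 0;
}
```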
  • 00:35:36
    and then if we you know kind of multiply
  • 00:35:38
    these things out we see that we're
  • 00:35:40
    getting in excess of 5000 single
  • 00:35:42
    precision cores
  • 00:35:43
    and almost three thousand double
  • 00:35:45
    precision floating point cores
  • 00:35:47
    on one chip here's
  • 00:35:50
    an illustration of
  • 00:35:54
    an individual sm with its sps
  • 00:35:58
    kind of similar to before what's new
  • 00:35:59
    here is these tensor cores
  • 00:36:02
    so as i mentioned before a lot of the
  • 00:36:05
    important uses of a gpu these days are in
  • 00:36:08
    the machine learning
  • 00:36:10
    area where we're training deep neural
  • 00:36:12
    networks to
  • 00:36:13
    understand classification problems and
  • 00:36:15
    similar sorts of things
  • 00:36:17
    and the dedicated
  • 00:36:20
    hardware that nvidia builds onto their
  • 00:36:21
    chips
  • 00:36:22
    focused on providing capabilities
  • 00:36:25
    for that is referred to as tensor cores
  • 00:36:27
    so these are kind of purpose-built
  • 00:36:30
    variations on the underlying
  • 00:36:32
    sp processor that directly perform
  • 00:36:35
    some of the basic calculations
  • 00:36:38
    that are necessary for
  • 00:36:39
    training and evaluating a neural net
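
(For a flavor of how those tensor cores are reached from software, here's a minimal sketch using CUDA's wmma API, in which a single warp hands one 16x16x16 matrix multiply-accumulate, d = a*b + c, to the tensor cores. It assumes a Volta-or-newer GPU, i.e. compiled with at least -arch=sm_70; the device pointers in the launch comment are hypothetical.)

```cpp
// Minimal tensor-core sketch: one warp computes a 16x16x16 tile of D = A*B + C
// through the wmma API. Compile with e.g. nvcc -arch=sm_70.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tile_mma(const half *a, const half *b, float *d) {
    // Fragments are opaque, warp-wide views of the 16x16 operand tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // use C = 0 for this sketch
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);            // the tensor-core operation
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}

// Launch example, one warp (a_dev, b_dev, d_dev are hypothetical device pointers):
//   tile_mma<<<1, 32>>>(a_dev, b_dev, d_dev);
```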
  • 00:36:44
    turing in 2018 here again more of the same
  • 00:36:48
    we've got 72
  • 00:36:49
    sms in total 64 cores per sm
  • 00:36:53
    8 tensor cores per sm so that kind of
  • 00:36:56
    continues going forward so we've got
  • 00:36:58
    almost 5 000 cuda cores and
  • 00:37:01
    more than 500 tensor cores otherwise
  • 00:37:04
    l1 l2 memory access and so forth is
  • 00:37:06
    quite similar
  • 00:37:08
    so this is just a diagram of
  • 00:37:11
    the chip
  • 00:37:12
    this right here is actually a micrograph
  • 00:37:14
    so that's actually
  • 00:37:15
    what the components on the chip itself
  • 00:37:17
    look like
  • 00:37:18
    and you can see that you know when we
  • 00:37:20
    say stamp out another core that's
  • 00:37:22
    literally what's going on here they're
  • 00:37:23
    just duplicating that particular chunk
  • 00:37:25
    of the
  • 00:37:26
    image that's used to
  • 00:37:28
    etch the
  • 00:37:29
    components into the surface of the
  • 00:37:31
    silicon
  • 00:37:32
    and it's really evident how that's set
  • 00:37:35
    up when you see an actual picture of it
  • 00:37:38
    and then here's the detail of
  • 00:37:40
    a single sm
  • 00:37:42
    and its sps same as before right
  • 00:37:44
    we've got some integer
  • 00:37:46
    units some floating
  • 00:37:47
    point units some tensor cores
  • 00:37:49
    register files cache special function
  • 00:37:51
    units and so forth
  • 00:37:55
    finally the ampere architecture is the
  • 00:37:57
    current one
  • 00:37:58
    so here's an ampere gpu
  • 00:38:01
    it's got seven graphics processing
  • 00:38:06
    clusters well this looks like eight in
  • 00:38:09
    the picture
  • 00:38:10
    with 12 sms per so a total of 84
  • 00:38:14
    streaming multiprocessors and then some
  • 00:38:18
    additional statistics here 128 cores per
  • 00:38:21
    sm so finally you
  • 00:38:22
    know we're over 10 000 cuda cores just
  • 00:38:25
    on this one chip
  • 00:38:26
    across all of the sms within each gpc
  • 00:38:29
    and 336 tensor cores you know one of the
  • 00:38:32
    things that you see when you look at
  • 00:38:33
    these statistics over time is
  • 00:38:35
    that the manufacturer kind of goes back
  • 00:38:37
    and forth on what they are
  • 00:38:38
    truly trying to emphasize here
  • 00:38:40
    how much space on the chip do they
  • 00:38:44
    dedicate to different kinds of
  • 00:38:45
    processors
  • 00:38:46
    and they're really trying to respond to
  • 00:38:48
    market demands right is it going to be
  • 00:38:49
    more important for us to just do raw
  • 00:38:51
    graphics processing is it going to be
  • 00:38:52
    more important for us to provide good
  • 00:38:54
    capabilities for machine learning and
  • 00:38:55
    that kind of stuff
  • 00:38:57
    and you can see maybe a little decreased
  • 00:38:58
    emphasis in a way on the tensor cores
  • 00:39:00
    from the previous generation
  • 00:39:02
    architecture and more of an emphasis on
  • 00:39:05
    just individual
  • 00:39:06
    standard cuda cores i should
  • 00:39:09
    also point out that
  • 00:39:10
    this is kind of one sample of one
  • 00:39:13
    particular chip
  • 00:39:14
    at each of the architecture levels if
  • 00:39:16
    you go
  • 00:39:17
    look around at different releases from
  • 00:39:19
    nvidia
  • 00:39:20
    you'll see that they will have
  • 00:39:23
    various
  • 00:39:25
    variations on these statistics
  • 00:39:27
    within the same processor family so
  • 00:39:29
    there might be
  • 00:39:30
    you know i'm making this up but
  • 00:39:31
    10 chips in the ampere family
  • 00:39:34
    that have different numbers of each of
  • 00:39:35
    these types of processors
  • 00:39:37
    that are intended to kind of focus on
  • 00:39:38
    different markets right is this a chip
  • 00:39:40
    that's going to be
  • 00:39:41
    used for high performance super
  • 00:39:43
    computing is this a chip
  • 00:39:44
    that's going to be used for you know
  • 00:39:47
    high capacity desktop
  • 00:39:49
    rendering like at a graphics company or
  • 00:39:52
    a film production company or
  • 00:39:54
    for a high-end engineering workstation
  • 00:39:56
    or is this going to be a gpu that's
  • 00:39:57
    going to be used in a mobile device
  • 00:39:59
    which obviously has considerable
  • 00:40:01
    constraints around power and that sort
  • 00:40:03
    of thing
  • 00:40:03
    so these are not the only
  • 00:40:06
    versions of
  • 00:40:07
    the statistics for these
  • 00:40:08
    chips it's just kind of a representative
  • 00:40:10
    sample
  • 00:40:13
    and then here's the ampere streaming
  • 00:40:16
    multiprocessor
  • 00:40:17
    and again you can see integer units
  • 00:40:19
    floating point units 64-bit floating
  • 00:40:21
    point units
  • 00:40:22
    and then the tensor core as well
  • 00:40:26
    by way of summary so here's kind of the
  • 00:40:29
    same
  • 00:40:29
    historic perspective here going
  • 00:40:31
    back to 2008 and the different micro
  • 00:40:33
    architectures
  • 00:40:34
    and what i've done here is just
  • 00:40:35
    collected together the numbers from these slides
  • 00:40:38
    and again these are just representative
  • 00:40:39
    numbers you can find
  • 00:40:41
    parts in each of these product families
  • 00:40:43
    that have more or less
  • 00:40:45
    sms or sps but just kind of a
  • 00:40:47
    representative sample
  • 00:40:49
    back here we had 128 sps now we're
  • 00:40:52
    at 10 000 sps which is pretty remarkable
  • 00:40:55
    growth
  • 00:40:57
    i also wanted to point out here that
  • 00:40:59
    the
  • 00:41:00
    processors on what we call
  • 00:41:02
    our pseudo lab
  • 00:41:04
    those machines actually have a quadro
  • 00:41:05
    k620 which is not a super recent
  • 00:41:09
    gpu board in them
  • 00:41:13
    you can tell here by the model number
  • 00:41:14
    with the k in it that this actually
  • 00:41:16
    follows the kepler architecture
  • 00:41:18
    and the kepler that we looked at earlier
  • 00:41:20
    has 15 sms and
  • 00:41:21
    a lot of sps the ones in the pseudo lab
  • 00:41:24
    actually are
  • 00:41:24
    rather more modest desktop focused
  • 00:41:27
    releases of this chipset
  • 00:41:28
    that have only 384
  • 00:41:33
    sps so we'll just have to make do if
  • 00:41:34
    that's what you're using if you've got a
  • 00:41:36
    more capable
  • 00:41:37
    laptop or desktop you may be in this
  • 00:41:40
    range instead
  • 00:41:41
    that's all good as we'll see
  • 00:41:44
    the programming model that's provided by
  • 00:41:46
    cuda and this is really an important
  • 00:41:48
    idea to keep in mind as you're
  • 00:41:49
    developing software for these guys
  • 00:41:51
    is that it's designed to run exactly the same
  • 00:41:54
    code
  • 00:41:55
    no matter where you are in the history
  • 00:41:58
    of these architectures
  • 00:41:59
    in particular it's designed such that
  • 00:42:01
    you can actually run this on future
  • 00:42:03
    architectures so when the hopper
  • 00:42:04
    architecture
  • 00:42:05
    becomes available as actual silicon
  • 00:42:07
    we'll be able to take those same
  • 00:42:09
    programs and
  • 00:42:10
    run them on the hopper architecture
  • 00:42:12
    without any modifications at all
  • 00:42:14
    because of the abstractions that are
  • 00:42:15
    provided by
  • 00:42:18
    the cuda model for programming
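
(As a small illustration of that portability, and again just a sketch rather than part of the lecture: the code below asks whatever GPU it lands on for its compute capability and SM count, then launches a grid-stride kernel sized from that count, so the identical source runs unchanged on a 384-core Quadro K620 or a 10 000-core Ampere part.)

```cpp
// Portability sketch: query the device, then size the launch from what we find.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float s) {
    // Grid-stride loop: correct for any grid size, on any architecture.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("%s: compute capability %d.%d, %d SMs\n",
                prop.name, prop.major, prop.minor, prop.multiProcessorCount);

    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // A modest heuristic: a few blocks per SM; the kernel is correct either way.
    int blocks = 4 * prop.multiProcessorCount;
    scale<<<blocks, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```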
Tags
  • CUDA
  • GPU
  • CPU
  • Nvidia
  • Architecture
  • Tensor Cores
  • Pascal
  • Ampere
  • Fermi
  • Kepler