CUDA Hardware
Summary
TLDR: This summary focuses on how CUDA hardware has evolved over the years, centered on Nvidia's GPUs. The video explains how each GPU microarchitecture generation, from Tesla to Ampere, offers distinct capabilities for general-purpose computation as well as for graphics. It contrasts CPUs and GPUs, explaining that CPUs prioritize low-latency execution while GPUs prioritize high throughput. It also shows how CUDA lets programmers use these devices for general-purpose and machine-learning workloads, with a detailed walkthrough of the terminology and of Nvidia's well-known product families.
Takeaways
- 🖥️ CPUs are designed around low-latency execution of individual tasks.
- 🚀 GPUs trade latency for much higher overall throughput.
- 📈 The CUDA architecture enables general-purpose computation on the GPU.
- 🔍 Modern Nvidia GPUs include features such as Tensor Cores for AI.
- 🔄 Each new generation increases GPU capability and performance.
- 🕰️ CUDA-era microarchitectures include Tesla, Fermi, Kepler, Pascal, and Ampere.
- 💡 The CUDA architecture allows GPUs to be used as general-purpose compute devices.
- 🌐 Nvidia GPUs support both graphics rendering and modern computation.
- 🚦 Newer Nvidia hardware improves workload scheduling and performance.
- 🔧 CUDA lets the same code run across different devices without difficulty.
Timeline
- 00:00:00 - 00:05:00
This section introduces the evolution of CUDA hardware and builds an understanding of CPU versus GPU architecture. CPUs have fewer cores with complex caching aimed at low latency, while GPUs have numerous cores focused on high throughput; the GPU achieves more parallel processing by giving up the cache space that CPUs devote to latency.
- 00:05:00 - 00:10:00
The transition from earlier Nvidia GPUs aimed solely at graphics to more modern CUDA-enabled GPUs supporting general computing is highlighted. The architecture evolution is discussed from Nvidia's first GPU in 1999 to modern multi-functional units introduced with micro-architectures like Tesla in 2008, with a focus on reducing feature size and increasing transistor count.
- 00:10:00 - 00:15:00
It continues to detail Nvidia's advancements with CUDA-enabled GPUs allowing general-purpose computing, including the introduction of unified memory space and tensor cores, catering to the rise of machine learning by optimizing GPU performance for neural network processing.
- 00:15:00 - 00:20:00
The historical shift from graphics-specialized hardware to general-purpose computing in GPU chips is discussed. It highlights the redefinition of processing power allocation for flexible computing tasks instead of purely graphics, marking a significant architectural transition starting from the Tesla series.
- 00:20:00 - 00:25:00
The architecture of GPUs is further explored using the GeForce 8800, illustrating the central components involved in computation, with particular attention to the streaming multiprocessors (SMs) and the flexibility in managing computational tasks through general-purpose cores.
- 00:25:00 - 00:30:00
The structuring of GPUs into clusters and multiprocessors enabling highly parallel processing capabilities is detailed. This includes the explanation of register files and memory hierarchy that support high-throughput processing, showcasing the Fermi architecture improvements with enhanced cores and cache systems.
- 00:30:00 - 00:35:00
Subsequent GPU architectures like Kepler, Maxwell, and Pascal are analyzed for their increased processing power and improved density through more advanced fabrication. Each generation adds more cores and enhanced precision capabilities, addressing evolving computing needs.
- 00:35:00 - 00:42:20
Finally, Volta and Ampere architectures are explored, highlighting technological advancements like tensor cores for machine learning applications and changes in manufacturing focus to balance graphical and computing performance, reflecting market needs and technological progression.
Video Q&A
What is the difference between a CPU and a GPU?
A CPU targets low-latency execution of individual tasks, while a GPU targets high throughput by executing many operations at once.
Why do GPUs include Tensor Cores?
Tensor Cores accelerate machine-learning work such as training deep neural networks.
Why do Nvidia GPUs keep gaining new features?
New features deliver better performance and the capacity to run diverse workloads such as AI training.
What does 'CUDA architecture' mean?
The CUDA architecture is Nvidia's GPU design that enables general-purpose computation using the GPU's many cores.
What distinguishes the CUDA era?
The CUDA era focuses on producing GPUs that can serve as general-purpose compute devices in addition to doing graphics work.
How are Nvidia's technology advances useful?
Growing hardware capability enables AI and other computations that exceed what CPUs can handle.
When was the first CUDA hardware introduced?
The first CUDA architecture launched in 2008 with the Tesla microarchitecture.
What advantages do recent Nvidia chips offer?
They can run very heavy workloads, and their increased on-chip memory broadens the range of fields in which they can be used.
Transcript
- 00:00:03this is a brief
- 00:00:04overview of some of the cuda hardware
- 00:00:06that's been around for years now
- 00:00:09i wanted to give you some kind of
- 00:00:10perspective on where we've been where
- 00:00:12the current generation is and
- 00:00:13kind of where we're headed maybe by way
- 00:00:16of overview i want to
- 00:00:17spend a little bit of time just
- 00:00:18refreshing your thinking about the
- 00:00:20differences between a cpu
- 00:00:22and a gpu and on the left here we've got
- 00:00:25an illustration of a typical
- 00:00:26modern cpu this particular one is
- 00:00:29showing a four
- 00:00:30core architecture so there's individual
- 00:00:33processing
- 00:00:34cores that are part of the same chip and
- 00:00:37they have
- 00:00:38the core itself which is the kind of the
- 00:00:40processing logic that's actually doing
- 00:00:42things you could think of this
- 00:00:44in general terms as sort of the
- 00:00:45arithmetic logic unit the alu
- 00:00:48there's going to be for each of the
- 00:00:50cores this would be typical of for
- 00:00:52example a current intel processor
- 00:00:54there's going to be some level 1 cache
- 00:00:56that's dedicated to that core it's often
- 00:00:58split into instruction cache and data
- 00:01:00cache
- 00:01:01there's also going to be a bunch of
- 00:01:02control logic so circuitry that kind of
- 00:01:05orchestrates the behavior of the core
- 00:01:07which is distinct from the core itself
- 00:01:09in the sense that the core is there to
- 00:01:11do the computation to do an addition or
- 00:01:13multiplication or whatever and the
- 00:01:14control logic is there to
- 00:01:16orchestrate that process so there's
- 00:01:19basically four copies of that stamped
- 00:01:21out on this particular chip
- 00:01:22and then a very large proportion of the
- 00:01:24chip is going to be taken
- 00:01:26up by additional cache memory to improve
- 00:01:28the performance of the processor
- 00:01:30to lower the latency which is kind of
- 00:01:33the focus
- 00:01:33of a cpu processor
- 00:01:37so we'll have some level 2 cache that's
- 00:01:40going to be
- 00:01:41shared in different ways in different
- 00:01:42architectures and then level 3 cache
- 00:01:43that's usually
- 00:01:44shared across all of the cores and
- 00:01:46that's kind of where the processor chip
- 00:01:48ends and of course we can't have a
- 00:01:51modern computer without additional
- 00:01:53memory so the dram here is just
- 00:01:55the general memory on the motherboard
- 00:01:57for example
- 00:01:59again the point of the architecture of
- 00:02:02the cpu is to
- 00:02:04run a lot of different types of compute
- 00:02:06jobs very flexibly and with very low
- 00:02:09latency
- 00:02:10hence all of the cache that's devoted or
- 00:02:12the space that's devoted on the chip to
- 00:02:14cache
- 00:02:16when we look at the gpu then
- 00:02:18obviously the
- 00:02:19the architecture is quite a lot
- 00:02:20different so there's going to be
- 00:02:23still control logic and a little bit of
- 00:02:26cache memory that's going on
- 00:02:27across the gpu but the overwhelming
- 00:02:30majority of the gpu
- 00:02:32chip itself is going to be given over to
- 00:02:34processing cores
- 00:02:36and very many of them the key difference
- 00:02:38here is that
- 00:02:39the emphasis on the gpu side is not
- 00:02:42going to be
- 00:02:43low latency in other words something
- 00:02:45that's going to require a lot of
- 00:02:46elaborate caching in the in the memory
- 00:02:49hierarchy
- 00:02:50but high throughput so we can apply a
- 00:02:53lot of different processing cores to do
- 00:02:55small pieces of calculations of the
- 00:02:57overall
- 00:02:58job that we're trying to accomplish and
- 00:03:00by
- 00:03:01reducing the amount of chip space that's
- 00:03:03dedicated to these kinds of other things
- 00:03:05like caching and
- 00:03:07more elaborate control you know out of
- 00:03:08order execution and that kind of stuff
- 00:03:10we can dramatically improve the number
- 00:03:13of cores
- 00:03:14or increase the number of cores that are
- 00:03:15available on the chip so we can just do
- 00:03:17more
- 00:03:18processing now it's not going to be as
- 00:03:20low latency because we don't have all of
- 00:03:22that caching we've got a little bit of
- 00:03:23cache but not much by comparison to a cpu
- 00:03:26but because we have so many individual
- 00:03:28processing cores
- 00:03:29we can do a lot of calculation per time
- 00:03:32in other words the throughput can be
- 00:03:33very high
- 00:03:34even though the latency isn't super
- 00:03:36great
- 00:03:37gpu is also going to have some amount of
- 00:03:39level 2 cache that's shared across
- 00:03:41all of the cores and then as with the
- 00:03:44cpu we've got some external memory
- 00:03:46called dram here this would be say in a
- 00:03:49typical
- 00:03:50desktop kind of installation you'd have
- 00:03:52a graphics processing unit that plugged
- 00:03:54into the back plane of the motherboard
- 00:03:56and it would include its own memory
- 00:03:58sometimes called graphics memory or
- 00:04:00frame buffer memory or whatever
- 00:04:02for our purposes in general purpose gpu
- 00:04:04computing
- 00:04:05we're not really thinking of that
- 00:04:06necessarily as something that's going to
- 00:04:07be used to drive
- 00:04:08a video display although in a graphics
- 00:04:10card that's almost always what it's
- 00:04:12being used for but we do have local
- 00:04:15memory on the video card
- 00:04:16that's accessible to the gpu
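To make the throughput idea above concrete, here is a minimal CUDA sketch (the kernel name, sizes, and launch configuration are illustrative assumptions, not from the lecture) in which each of the GPU's many cores computes one small piece of the overall job:

```cuda
#include <cuda_runtime.h>

// Each thread handles exactly one element: no single core is fast,
// but tens of thousands of threads together give high throughput.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];  // one tiny piece of the overall calculation
}

// A typical launch covers the whole array with threads:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```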
- 00:04:21here's a not so brief summary of
- 00:04:24the various generations that nvidia has
- 00:04:26been through with its cuda architecture
- 00:04:28uh sort of a history of graphics
- 00:04:31processing
- 00:04:31if we go back here to the very beginning
- 00:04:34in 1999
- 00:04:35saw the introduction of nvidia's geforce
- 00:04:38256
- 00:04:39which was really the first graphics
- 00:04:41processing unit to be
- 00:04:42released into the wild and there was a
- 00:04:44whole variety of different
- 00:04:46improvements on that card uh throughout
- 00:04:48the uh
- 00:04:49the early 2000s where we're really
- 00:04:52interested here
- 00:04:52is at the beginning of what i've what
- 00:04:55i've labeled here the beginning of kind
- 00:04:57of the cuda era
- 00:04:58starting here in 2008 and kind of going
- 00:05:00forward until the present
- 00:05:01we've got sort of the modern era of gpus
- 00:05:05and a key distinguishing characteristic
- 00:05:06here is that these gpus allow
- 00:05:09for very high performance graphics
- 00:05:11generation
- 00:05:12but they also are being designed to
- 00:05:14allow for general purpose computing
- 00:05:16which is
- 00:05:17really our interest in them in this
- 00:05:18course
- 00:05:20so the first of these is what
- 00:05:22nvidia refers to as a micro
- 00:05:24architecture
- 00:05:25which is just kind of a generation or a
- 00:05:27family of gpu
- 00:05:29processors that they've released that
- 00:05:30have kind of common characteristics
- 00:05:33the first of those was called tesla it
- 00:05:35was introduced in 2008
- 00:05:37and there's just some interesting
- 00:05:39statistics here you can see across
- 00:05:41these different generations the first of
- 00:05:43these is the the process
- 00:05:45and the process by that i mean the the
- 00:05:48fabrication process that was used
- 00:05:50to make the chips and one way of
- 00:05:53understanding the the differences from
- 00:05:55one process to another is
- 00:05:57what's known as the feature size or the
- 00:05:59basically the size of a single
- 00:06:00transistor or a single component that
- 00:06:02you're going to put down on the on the
- 00:06:03chip itself
- 00:06:05and that's measured in nanometers which
- 00:06:08is pretty small
- 00:06:09and you can see back in the in the late
- 00:06:1190s with the first gpu
- 00:06:13they were working at 220 nanometers and
- 00:06:16by the time
- 00:06:17we got to 2008 with the tesla we're down
- 00:06:19to 65 nanometers
- 00:06:21and that's been decreasing ever since uh
- 00:06:23to the present
- 00:06:24generation of nvidia gpus uses a seven
- 00:06:28nanometer process
- 00:06:29so the features are over thirty times
- 00:06:32smaller which is roughly a thousand
- 00:06:33times denser by area than back at the beginning
- 00:06:35the upside is that you can
- 00:06:37you can fit more components on the
- 00:06:39surface of a chip
- 00:06:40which allows you to build something
- 00:06:42that's more and more capable
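As a back-of-the-envelope check on that density claim (the arithmetic here is ours, not from the slides): feature size sets the linear scale, so areal density grows with its square,

$$\left(\frac{220\ \mathrm{nm}}{7\ \mathrm{nm}}\right)^2 \approx (31.4)^2 \approx 990,$$

i.e. roughly a thousand times as many components per unit area.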
- 00:06:44you can see also a similar measure here
- 00:06:46in the next column the number of
- 00:06:47transistors that are on that chip
- 00:06:49which surprisingly is something that
- 00:06:51nvidia is pretty happy to trumpet
- 00:06:53if you look for that kind of information
- 00:06:55from intel or amd
- 00:06:56the sort of the cpu vendors they're less
- 00:06:59they're less happy about sharing that
- 00:07:00information
- 00:07:01but you can see we went from you know in
- 00:07:03the neighborhood of a couple of hundred
- 00:07:05million
- 00:07:05transistors which is still a lot to now
- 00:07:08the
- 00:07:09the ampere micro architecture is at 28.3
- 00:07:13billion
- 00:07:13transistors on one chip which is that's
- 00:07:16just a lot of
- 00:07:17billions so we're seeing
- 00:07:20smaller and smaller feature sizes which
- 00:07:22leads to larger and larger numbers of
- 00:07:24transistors that we can use
- 00:07:25and um and that means more and more
- 00:07:28capability over time
- 00:07:30and we're going to kind of unpack some
- 00:07:31of these micro architecture generations
- 00:07:33so we'll start with tesla
- 00:07:35let me just point out some highlights
- 00:07:37here that i've put over here in the
- 00:07:38notes column
- 00:07:39at the pascal architecture one of the
- 00:07:41things that nvidia introduced was this
- 00:07:43notion of a unified memory
- 00:07:46we'll sort of be working with the
- 00:07:48less capable
- 00:07:50parts here so we won't have that
- 00:07:52advantage of this unified memory we have
- 00:07:54to orchestrate moving
- 00:07:55information back and forth between cpu
- 00:07:57memory
- 00:07:58and gpu memory but that's kind of
- 00:08:00cumbersome
- 00:08:01and results in some limitations and more
- 00:08:04complexity in the code
- 00:08:06and so nvidia has tried to address that
- 00:08:07by first of all making the
- 00:08:09memory space on the gpu itself unified
- 00:08:12so we'll talk about the different types
- 00:08:14of memory that are on the gpu
- 00:08:16and those now are accessible through a
- 00:08:18single address space
- 00:08:20but they've also now provided a
- 00:08:21mechanism where you can program
- 00:08:23uh your applications in such a way that
- 00:08:26whether you
- 00:08:26allocate memory on the cpu memory or on
- 00:08:29the gpu memory
- 00:08:30they're considered part of the same
- 00:08:32address space and you can kind of move
- 00:08:33things back and forth and access those
- 00:08:35memories
- 00:08:36uh kind of transparently in your code
- 00:08:38and then the runtime environment
- 00:08:40moves things back and forth
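A minimal sketch of what this unified-memory programming model looks like in practice (assuming a CUDA 6-or-later toolkit; the kernel and sizes are illustrative): a single cudaMallocManaged allocation is visible to both sides, and the runtime migrates the data back and forth.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    int n = 1 << 20;
    float *x;
    // One allocation in a single address space shared by CPU and GPU;
    // no explicit cudaMemcpy back and forth is needed.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = 1.0f;   // CPU touches the data
    scale<<<(n + 255) / 256, 256>>>(x, n);     // GPU touches the same data
    cudaDeviceSynchronize();                   // wait before the CPU reads
    printf("x[0] = %f\n", x[0]);               // prints 2.000000
    cudaFree(x);
    return 0;
}
```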
- 00:08:44with the volta micro architecture in
- 00:08:472017
- 00:08:48uh nvidia introduced what they call
- 00:08:50tensor cores
- 00:08:52so obviously at this point in history
- 00:08:55we're seeing the the rise of machine
- 00:08:56learning and deep neural networks
- 00:08:58and the use of gpus specifically to
- 00:09:01train neural networks and also to
- 00:09:02evaluate
- 00:09:03uh and solve problems using those neural
- 00:09:05networks and although the
- 00:09:07the prior generations of gpus were
- 00:09:09really good at that
- 00:09:10um what nvidia tried to do was uh
- 00:09:13introduce
- 00:09:14uh technology specifically designed for
- 00:09:18processing deep neural networks and
- 00:09:19that's what they refer to as their
- 00:09:21tensor cores
- 00:09:22and then in 2018 with the turing micro
- 00:09:25architecture they
- 00:09:26added a ray tracing capability so
- 00:09:29essentially real time ray tracing for
- 00:09:31graphics rendering
- 00:09:32and a fairly considerable percentage of
- 00:09:34the chip surface
- 00:09:36in the turing architecture is dedicated
- 00:09:39specifically to ray tracing
- 00:09:41so in a way we're seeing kind of another
- 00:09:43one of these pendulum swings that we see
- 00:09:45in computing quite regularly
- 00:09:47back in the in the late 90s early 2000s
- 00:09:50there wasn't any kind of general purpose
- 00:09:54focus in the design of these chips it
- 00:09:55was very specific
- 00:09:57logic on these chips that was focused on
- 00:09:59doing
- 00:10:00graphic stuff only and in order to use
- 00:10:03those
- 00:10:04generations of chips for general purpose
- 00:10:06computing you had to kind of recast your
- 00:10:08problem as a graphics problem
- 00:10:10let the gpu do its thing and then kind
- 00:10:12of map it back into the
- 00:10:13the the more familiar computing domain
- 00:10:17and with with starting with the tesla
- 00:10:20architecture that was not really
- 00:10:21necessary anymore you could just program
- 00:10:23your application directly
- 00:10:24but now we're starting to see a return a
- 00:10:26little bit to some dedicated hardware
- 00:10:28that's really focused on graphics
- 00:10:29uh in this case to do to do real-time
- 00:10:32ray tracing
- 00:10:34um another thing here is that there's a
- 00:10:36little bit of terminological confusion
- 00:10:38and we'll see that there's different
- 00:10:39terminology that gets used in different
- 00:10:40generations there's a little variation
- 00:10:42over time
- 00:10:43but because the first microarchitecture
- 00:10:46that nvidia introduced that did general
- 00:10:48purpose computing
- 00:10:49was called tesla they have kind of held
- 00:10:53that
- 00:10:53term to refer not only to this specific
- 00:10:56generation of chips
- 00:10:58but to refer to the basic idea of doing
- 00:11:01general purpose computing on the
- 00:11:02on the nvidia chipset so you'll still
- 00:11:05hear and read people talking about
- 00:11:07the tesla cores
- 00:11:08and that really just is a general or has
- 00:11:10become kind of a general reference
- 00:11:12to this ability to do general purpose
- 00:11:14computing on the gpu
- 00:11:16and one final note here from this table
- 00:11:18is the next generation of
- 00:11:20chip that nvidia has announced is
- 00:11:22actually named after admiral grace
- 00:11:24murray hopper
- 00:11:25who was one of the early pioneers in
- 00:11:26computing and it was kind of cool to see
- 00:11:29she's also the first woman on this list
- 00:11:31obviously these are the names of famous
- 00:11:33scientists and engineers and so forth
- 00:11:34over time and she's the first woman
- 00:11:36and one of the sort of founders
- 00:11:40of computer science in many
- 00:11:42ways with the introduction of some of
- 00:11:43the work that she did uh
- 00:11:45years ago
- 00:11:48all right um there's lots of terminology
- 00:11:52that you kind of need to get embedded a
- 00:11:54little bit when you're reading and
- 00:11:56and trying to understand these gpus so
- 00:11:58this is kind of a little secret decoder
- 00:12:00ring some of the common things that
- 00:12:01might be mystifying
- 00:12:02i've starred the ones that are really
- 00:12:04important so these two
- 00:12:06uh these two terms here the streaming
- 00:12:07multiprocessor and the streaming
- 00:12:09processor
- 00:12:10unfortunately they're kind of close in
- 00:12:12in meaning and also
- 00:12:14in terms of the initialism that's used
- 00:12:15to represent them but those are really
- 00:12:17the two
- 00:12:18major kind of groupings of processors or
- 00:12:21processing capability
- 00:12:23that we care about in the gpu there's
- 00:12:26there's still some specialized graphics
- 00:12:27focused kinds of
- 00:12:29hardware units on the gpu itself we're
- 00:12:31not really going to drill into those in
- 00:12:32any sort of detail
- 00:12:34and you got to kind of filter those
- 00:12:36things out when you're reading
- 00:12:38documentation and spec sheets and stuff
- 00:12:40for nvidia products
- 00:12:41when you're just thinking about doing
- 00:12:43cuda programming as opposed to doing
- 00:12:44graphics programming
- 00:12:47i learned something new the other day
- 00:12:49that these are not acronyms
- 00:12:50that they're actually called initialisms
- 00:12:53uh an acronym is an initialism that's
- 00:12:55also
- 00:12:56a pronounceable word so unless you want
- 00:12:58to pronounce
- 00:12:59'sm' and 'sp' as words or something
- 00:13:02like that
- 00:13:03they're not actually acronyms they're
- 00:13:05initialisms so there you go free
- 00:13:06knowledge
- 00:13:08in addition these are some other other
- 00:13:10terminologies that get used pretty
- 00:13:12regularly
- 00:13:13so the the streaming multiprocessor is
- 00:13:16kind of a collection of streaming
- 00:13:17processors so multiprocessor contains
- 00:13:19processor
- 00:13:21and those streaming multiprocessors are
- 00:13:23also clustered together in kind of
- 00:13:24larger units on the chip
- 00:13:26and we'll see references to the tpc
- 00:13:29which stands for texture slash processor
- 00:13:32cluster
- 00:13:33and gpc which stands for graphics
- 00:13:35processing cluster
- 00:13:36and these um i don't know that the
- 00:13:38distinction between them is that
- 00:13:39important
- 00:13:40they're really just larger groupings of
- 00:13:42streaming multiprocessors which are
- 00:13:44themselves groupings of
- 00:13:46streaming
- 00:13:49processors
- 00:13:51and something else here there's an
- 00:13:54important difference here between
- 00:13:55this the single precision and double
- 00:13:57precision arithmetic that are performed
- 00:13:59by the streaming processors
- 00:14:01in the early going because the
- 00:14:04the density with which the manufacturers
- 00:14:07could get
- 00:14:08features on the chips the early
- 00:14:11streaming processors tended to be single
- 00:14:13precision
- 00:14:14which basically meant they could do
- 00:14:1632-bit arithmetic
- 00:14:18as opposed to double precision which is
- 00:14:2164-bit so
- 00:14:24in order to do the 64-bit calculations
- 00:14:27which you could still do you had to sort
- 00:14:28of split them up
- 00:14:30and run half of each on two different
- 00:14:32streaming
- 00:14:33processors which essentially halved the
- 00:14:36performance
- 00:14:37but at some point along the way nvidia
- 00:14:39figured out ways to get enough
- 00:14:41transistors on the surface of the chip
- 00:14:43to allow you to do direct
- 00:14:44double precision arithmetic and when
- 00:14:46that starts to arise in the history of
- 00:14:49these
- 00:14:49of these chipsets we'll kind of
- 00:14:51point out
- 00:14:52where we see them starting to cite both
- 00:14:55single precision and double precision
- 00:14:56processing on a single
- 00:14:58gpu
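As a sketch of that single- versus double-precision distinction (illustrative code, not from the lecture), the same kernel can be instantiated at either width; on parts without dedicated FP64 units the double version runs at a fraction of the float version's throughput:

```cuda
// One kernel body, two precisions: T = float uses the 32-bit (SP) path,
// T = double uses the 64-bit (DP) path where the hardware provides one.
template <typename T>
__global__ void axpy(int n, T a, const T *x, T *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// axpy<float> <<<blocks, threads>>>(n, 2.0f, xf, yf);  // single precision
// axpy<double><<<blocks, threads>>>(n, 2.0,  xd, yd);  // double precision
```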
- 00:15:01this is an illustration that i thought was
- 00:15:03particularly clear in
- 00:15:04expressing the different main components
- 00:15:07of an nvidia gpu
- 00:15:08this actually corresponds to the geforce
- 00:15:108800 which was one of the first
- 00:15:12tesla micro architecture chip sets that
- 00:15:15was released
- 00:15:17and i don't expect you to kind of
- 00:15:19understand all of the details here but
- 00:15:20it kind of gives you a nice
- 00:15:22sort of overall view of things so we can
- 00:15:24see here that
- 00:15:25uh the the gpu as labeled here is really
- 00:15:29the main portion of what's going on in
- 00:15:31this chip there's some other things
- 00:15:32happening like there's a connection to
- 00:15:34allow you to talk to the host computer
- 00:15:36um to to access the memory on them on
- 00:15:39the motherboard and so forth
- 00:15:41as well as access here to to dram that's
- 00:15:44on the
- 00:15:45on the board right so this is the the
- 00:15:47video memory or the frame buffer memory
- 00:15:49but what we're interested in here is
- 00:15:51kind of this central portion
- 00:15:53that represents the key computational
- 00:15:56components that are in play here on
- 00:15:58a gpu so in this particular chip there
- 00:16:01are
- 00:16:01eight of these tpcs which as you'll
- 00:16:04recall back here
- 00:16:05stands for texture slash processor
- 00:16:07cluster and we're thinking about this as
- 00:16:09the processor cluster so this is just a
- 00:16:12kind of a high level grouping
- 00:16:14of the streaming multiprocessor and the
- 00:16:17streaming processor
- 00:16:18to provide a mechanism to kind of get
- 00:16:21multiple different
- 00:16:22parts of the chip to do different things
- 00:16:23at different times and so forth
- 00:16:25there's quite a lot of quite a lot of
- 00:16:26technology turned under to figure out
- 00:16:28how to do that a lot of these
- 00:16:30block diagram elements up here are are
- 00:16:32focused on doing that
- 00:16:34and we're going to drill down into
- 00:16:35what's going on inside those tpcs
- 00:16:38some other things
- 00:16:39going on here there's a block that just
- 00:16:41says interconnection network
- 00:16:43the
- 00:16:46genesis of this whole notion of a
- 00:16:48streaming multi-processor
- 00:16:50or streaming processors within the
- 00:16:51streaming multiprocessors
- 00:16:53is that you can sort of stream a
- 00:16:55calculation from one processor to the
- 00:16:57next processor to the next processor
- 00:17:00within the within the gpu so if you if
- 00:17:02you go back and look at some of the
- 00:17:04implementation details of an early
- 00:17:06graphics processing unit
- 00:17:08what you would find is that you'd get
- 00:17:10some input from
- 00:17:11the cpu right so some some element of
- 00:17:14the model that you were trying to render
- 00:17:16say in 3d or 2d and then there would be
- 00:17:18a series of processors
- 00:17:20that were very specifically tailored in
- 00:17:23the hardware
- 00:17:24to handle the the different stages in
- 00:17:26what's called the
- 00:17:27graphics processing pipeline so the cpu
- 00:17:30feeds the gpu some sort of information
- 00:17:32about a vertex or whatever in the scene
- 00:17:35that's to be rendered
- 00:17:36and then these different stages along
- 00:17:37the way represented very specific pieces
- 00:17:40of hardware that did certain kinds of
- 00:17:41things
- 00:17:42and would stream from one to the next to
- 00:17:44the next and eventually
- 00:17:45at the end of this process the the final
- 00:17:48final processor in the step is going to
- 00:17:50put something into into
- 00:17:51the dram the frame buffer memory so it
- 00:17:54actually shows up on the screen
- 00:17:56now that's all fine and good except
- 00:17:58that's not very flexible so if
- 00:18:00you know if this particular step in the
- 00:18:02process
- 00:18:03is really busy and this step in the
- 00:18:05process is really not doing very much
- 00:18:06for a particular scene that's being
- 00:18:08rendered
- 00:18:09you're sort of leaving processing power
- 00:18:10on the table but because these are
- 00:18:12purpose-built
- 00:18:14units within the gpu there's really no
- 00:18:17way to flexibly
- 00:18:18alter that behavior in order to get
- 00:18:21certain parts of the hardware to do a
- 00:18:22different job than it was designed for
- 00:18:25but let me step back
- 00:18:28you can see here the notion of streaming
- 00:18:30though right the information is coming
- 00:18:31from this
- 00:18:32cpu and it streams from one processing
- 00:18:35element to the next to the next until it
- 00:18:36finally ends up in the frame buffer
- 00:18:38so that notion of streaming is still
- 00:18:40really important but instead of having
- 00:18:42these kind of purpose-built modules that
- 00:18:44only do one kind of calculation
- 00:18:46the idea in the general purpose gpu or
- 00:18:49the cuda architecture
- 00:18:51is to say let's not have these
- 00:18:53specific units
- 00:18:55let's instead offer a very powerful very
- 00:18:58flexible
- 00:18:59general-purpose processor and allow it
- 00:19:03to be hooked up in such a way that we
- 00:19:04get this kind of streaming behavior
- 00:19:06so what we'll see here each of the
- 00:19:08little elements inside these
- 00:19:10inside these tpcs is one of those
- 00:19:12processors
- 00:19:13and you'll you'll notice here that uh
- 00:19:17in this in this kind of very primitive
- 00:19:20representation of a graphics processing
- 00:19:22pipeline
- 00:19:23this is kind of a long series of
- 00:19:25operations
- 00:19:26not really long but there's
- 00:19:28multiple steps in that process
- 00:19:30and what what the what the streaming
- 00:19:33multiprocessor the streaming processor
- 00:19:35architecture and cuda tries to do
- 00:19:37is to give you the ability to do this
- 00:19:39kind of streaming
- 00:19:40from one process to another to another
- 00:19:43but using these general purpose
- 00:19:44stream processor elements the
- 00:19:47interconnection network then
- 00:19:49is responsible for sort of routing
- 00:19:51intermediate calculation results from
- 00:19:53one streaming processor to another
- 00:19:55but instead of it
- 00:19:58being
- 00:19:59helpful to think about it as kind of a
- 00:20:01linear chain
- 00:20:02instead what you're seeing is the
- 00:20:04ability for say
- 00:20:06one of these tpcs to do a bunch of
- 00:20:08calculations
- 00:20:09and that output goes onto this
- 00:20:11interconnection network and might feed
- 00:20:13into
- 00:20:14another set of stream processors
- 00:20:18which could do a different calculation
- 00:20:19and then those results could stream into
- 00:20:21another one
- 00:20:22and eventually that information is going
- 00:20:23to or the result of those
- 00:20:25calculations is going to make it out to
- 00:20:26dram so you can still get the
- 00:20:30this ability to stream calculations from
- 00:20:32one stage to the next stage the next
- 00:20:33stage
- 00:20:34as you transform the information from
- 00:20:36something that's used to model the scene
- 00:20:38on the cpu and end up with pixels on the
- 00:20:42screen
- 00:20:42at the end of that process but it's now
- 00:20:45more flexible than what you had when you
- 00:20:46had dedicated processing stages
- 00:20:49so if if you have more
- 00:20:52more need to replicate the behavior of
- 00:20:54this block you might allocate several
- 00:20:56of the tpcs to doing this function
- 00:21:00and only one of them for example just
- 00:21:03kind of making this up
- 00:21:05to handle this next set of functionality
- 00:21:07and because the processors in the
- 00:21:09streaming
- 00:21:10streaming multiprocessor are general
- 00:21:12purpose computers right
- 00:21:13you can have them do any of these
- 00:21:14functions that you'd like and and change
- 00:21:17that
- 00:21:17over time during the execution of a
- 00:21:19single program or between
- 00:21:21different programs that are going to run
- 00:21:22on the gpu
- 00:21:24the other cool thing about that is we
- 00:21:27can now use these general purpose
- 00:21:29streaming processors to do whatever kind
- 00:21:31of calculations we want
- 00:21:33whether it's specifically tailored to
- 00:21:34doing some graphics thing or we're just
- 00:21:37doing a say a large matrix operation
- 00:21:39that we're storing out here in
- 00:21:41in frame buffer memory but we're never
- 00:21:43actually going to show on the screen
- 00:21:44what that frame buffer memory contains
- 00:21:46because it's just data at that point
- 00:21:47it's not designed to be pixels
- 00:21:49it's just the uh the input or the output
- 00:21:52from the calculation that we're doing
- 00:21:54on all of these very large number of
- 00:21:56streaming processors
- 00:22:01here's a little zoomed in view so this
- 00:22:02is that the picture of the overall
- 00:22:05architecture of the gpu for the 8800
- 00:22:08and this kind of gives you a little bit
- 00:22:09more detail on this tpc
- 00:22:12the processing cluster
- 00:22:14so there's two units inside of here that
- 00:22:17are of interest to us
- 00:22:18both of which are labeled sm for
- 00:22:21streaming multiprocessor
- 00:22:23and within the streaming multiprocessor
- 00:22:25there are the individual
- 00:22:26sps the streaming processor and it's the
- 00:22:29sp that's really
- 00:22:30i mean this is what we're what we refer
- 00:22:31to as a core in
- 00:22:33informal terms right that's a that's a
- 00:22:35gpu core or a cuda core you could call
- 00:22:37it
- 00:22:37and you can see in this architecture we
- 00:22:39have um
- 00:22:41eight of those cores eight of those sps
- 00:22:43grouped together
- 00:22:44into an sm and we've got two of those
- 00:22:47sms that are grouped together into a tpc
- 00:22:49and then we saw previously that we have
- 00:22:51eight of these tpcs that are grouped
- 00:22:53together to make up the entirety of the
- 00:22:55gpu
- 00:22:56i've also got here a little bit of an
- 00:22:58exploded diagram of an individual
- 00:23:00streaming multiprocessor so we
- 00:23:02can see here in a little bit more detail
- 00:23:04we've got the streaming processors of
- 00:23:06course we have some special functional
- 00:23:08units which do things like
- 00:23:10transcendental functions like sine and
- 00:23:11cosine that sort of thing
- 00:23:13we've got some caching going on as i
- 00:23:15mentioned per per streaming
- 00:23:17multiprocessor so there's an instruction
- 00:23:19level cache
- 00:23:20there's also what's called a constant
- 00:23:22cache so if you have specific constant
- 00:23:24values that you need to refer to
- 00:23:26over and over again nearby the
- 00:23:28processors but
- 00:23:29that need that need to change over time
- 00:23:31you can store those in that cache
- 00:23:33and then there's also some shared memory
- 00:23:36which
- 00:23:37provides access to all of the sps
- 00:23:40on an sm
- 00:23:41so if you have a calculation that that
- 00:23:44you need to
- 00:23:45split up into smaller pieces and allow
- 00:23:48those
- 00:23:48individual pieces to collaborate with
- 00:23:50one another to solve the overall problem
- 00:23:52even at the level of
- 00:23:54individual sps within sm they have
- 00:23:56access to this shared memory
- 00:23:58programming the sps if they need to
- 00:24:00collaborate over shared memory
- 00:24:01is quite similar to what we were looking
- 00:24:03at in the early part of the term when we
- 00:24:05were doing p threads programming
- 00:24:06in particular all of the sps can access
- 00:24:09directly the shared memory within the sm
- 00:24:12and in order to get them to cooperate
- 00:24:13with one another you need to do things
- 00:24:15like synchronizing
- 00:24:16access to memory so that you don't step
- 00:24:17on each other's toes so
- 00:24:19buried inside of this whole architecture
- 00:24:22is a little shared memory machine
- 00:24:24with some number of sps attached to that
- 00:24:26shared memory
- 00:24:27and then of course because there is this
- 00:24:30large collection of dram
- 00:24:32memory that's attached to the entire gpu
- 00:24:35the the sps can also access that global
- 00:24:38memory as well
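Here is a minimal sketch of that kind of cooperation (the 256-thread block size and the kernel are assumptions for illustration): the threads running on one SM's SPs share a buffer in the SM's shared memory and use a barrier so they don't step on each other's toes, much like the pthreads examples earlier in the term.

```cuda
// Each block sums 256 consecutive inputs using the SM's shared memory.
// Assumes the launch uses 256-thread blocks and n is a multiple of 256.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];        // shared memory: visible to every
                                      // thread (SP) in this block's SM
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                  // barrier, akin to a pthreads sync

    // tree reduction inside the block, synchronizing at each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}
```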
- 00:24:40right so we're going to take a kind of a
- 00:24:42quick tour through these generations of
- 00:24:44the the chips chipsets and the micro
- 00:24:46architectures that we've been talking
- 00:24:47about so
- 00:24:48starting out with the tesla architecture
- 00:24:50and we've kind of looked at this
- 00:24:52uh from from the previous illustration
- 00:24:54uh this is another representation of
- 00:24:56that same
- 00:24:578800 core but i wanted to just kind of
- 00:25:00for consistency here
- 00:25:01break out some of the key statistics so
- 00:25:03this guy has
- 00:25:04eight at this point they were calling
- 00:25:06them graphics processing clusters
- 00:25:08so each of these guys is a gpc and then
- 00:25:11within each of those
- 00:25:12there's two individual
- 00:25:16sms which means we have a total of 16
- 00:25:20streaming multiprocessors and within
- 00:25:22each of those streaming multiprocessors
- 00:25:24as we've seen already there were eight
- 00:25:26streaming processors for a total of 128
- 00:25:28streaming processors in this gpu
- 00:25:31and then there's other things out here
- 00:25:33too this block is
- 00:25:34for information about like shading and
- 00:25:36kind of graphic specific things that
- 00:25:38we're going to just sort of
- 00:25:39ignore and remember that there is an l1
- 00:25:41cache that's associated with the gpc
- 00:25:44and then an l2 cache that can be shared
- 00:25:47among
- 00:25:48multiple sms before we get out to
- 00:25:51frame buffer memory here the fb and then
- 00:25:54we can zoom in again on the individual
- 00:25:57streaming multiprocessor so just one of
- 00:26:00these chunks
- 00:26:00of the the tpc is the streaming
- 00:26:03multiprocessor and it's got streaming
- 00:26:05processors inside of it
- 00:26:07uh and then these other units here have
- 00:26:08more to do with with dealing with some
- 00:26:10graphic specific things like texture
- 00:26:12texture maps
- 00:26:16another release in kind of the second
- 00:26:19generation of tesla
- 00:26:21chips was the geforce 280 this is also
- 00:26:23in 2008.
- 00:26:24i guess there was a little bit more
- 00:26:26emphasis this is again
- 00:26:27an illustration grabbed from some of the
- 00:26:30documentation white papers about these
- 00:26:31guys
- 00:26:32so 3d diagrams was apparently really
- 00:26:34cool at that point
- 00:26:36so uh we have a little bit more capable
- 00:26:39uh chip here we've got 10
- 00:26:41tpcs each of which has three streaming
- 00:26:43multiprocessors so a total of 30
- 00:26:45streaming multiprocessors
- 00:26:47and each of those again had eight uh
- 00:26:50streaming processors for a total of 240
- 00:26:53streaming processors i'll have a little
- 00:26:54summary slide here at the end that lets
- 00:26:56us kind of compare
- 00:26:57these key statistics that i've kind of
- 00:26:59outlined here
- 00:27:00for the number of sms and the number of
- 00:27:02sps that's really the critical
- 00:27:04uh the critical number that we're
- 00:27:06interested in when we're thinking about
- 00:27:07how do we program this
- 00:27:08uh using the cuda platform
- 00:27:12this is a a little larger view of
- 00:27:15an individual tpc right so there's ten
- 00:27:18of these
- 00:27:18so we're looking at one of these guys
- 00:27:20and you can see here's
- 00:27:21the three streaming multiprocessors each
- 00:27:24of which has eight
- 00:27:25individual again cores or s this is the
- 00:27:28streaming processor the sp or the core
- 00:27:31and then there's some local memory that
- 00:27:33would be the shared memory that's
- 00:27:34available to all these cores
- 00:27:35and then some texture stuff and a shared
- 00:27:38l1 cache
- 00:27:39across this this tpc
- 00:27:44okay 2010 saw the introduction of the
- 00:27:47fermi
- 00:27:48microarchitecture and you can see here
- 00:27:51this again ripped from the headlines
- 00:27:54just straight from the documentation
- 00:27:56here we're showing that there's 16
- 00:27:58streaming multiprocessors so they're not
- 00:28:00really breaking these out in
- 00:28:02in detail as tpcs at this point in
- 00:28:04history i'm not sure exactly why
- 00:28:05and we can see though so here's
- 00:28:07basically one of those streaming
- 00:28:08multiprocessors
- 00:28:10and each of those has 32 sps for a total
- 00:28:13of 512 sps
- 00:28:15and they're just kind of laying this out
- 00:28:16differently we still have local cache
- 00:28:18here within
- 00:28:19the the streaming multiprocessors
- 00:28:21there's a level two cache as well
- 00:28:23and then the the blocks on the outside
- 00:28:25are basically showing you
- 00:28:27interfaces to the dynamic memory
- 00:28:30if we zoom in on one of these sms we can
- 00:28:33see here that
- 00:28:34we've got a bunch of cores i mentioned
- 00:28:3732 cores or 32 sps per sm so each of
- 00:28:41these is a core and there's actually a
- 00:28:42little exploded diagram here of the core
- 00:28:44notice that this guy has both a floating
- 00:28:46point unit and
- 00:28:48an integer unit so there's two different
- 00:28:50paths through this core that allow you
- 00:28:52to
- 00:28:53simultaneously do a floating point
- 00:28:55operation
- 00:28:56and an integer operation so you kind of
- 00:28:59get double your money if you want
- 00:29:01for certain kinds of calculations you
- 00:29:03can have both of those things active
- 00:29:05i think at this point we're still in the
- 00:29:07single precision uh
- 00:29:09world so if you wanted to do double
- 00:29:10precision arithmetic you
- 00:29:12had to figure out how to how to use
- 00:29:14these things to do that
- 00:29:17we've also here called out the load
- 00:29:19store unit so this is a way of
- 00:29:21streamlining access to memory
- 00:29:23and then the special function units of
- 00:29:25which there's just four
- 00:29:27because presumably you're not going to
- 00:29:28do that many you know sine and cosine
- 00:29:30and transcendental operations
- 00:29:32uh as you are normal sort of
- 00:29:36arithmetic operations you can see here
- 00:29:38there's shared memory 64k
- 00:29:40of shared memory per
- 00:29:42sm which results in a lot of memory
- 00:29:44in total
- 00:29:45an interconnection network to
- 00:29:47allow you to stream calculations from
- 00:29:48one processor to the next
- 00:29:50and the cores also we haven't seen this
- 00:29:52yet this is essentially
- 00:29:54core local memory so
- 00:29:58if you recall in a cpu
- 00:30:01each of the processor cores has a
- 00:30:03register file right sort of
- 00:30:04really fast nearby static ram that's
- 00:30:08really quick to access by the cores
- 00:30:10there's a similar thing here for the
- 00:30:11cuda cores the register file is quite a
- 00:30:13lot larger
- 00:30:15it's 32k entries of 32-bit
- 00:30:19registers
- 00:30:20and those are divided evenly across the
- 00:30:23individual cores and since there's 32 of
- 00:30:25those on this sm
- 00:30:26each of those is going to get 1k
- 00:30:2932-bit registers that it can
- 00:30:31individually access really quickly
- 00:30:33so there's really three levels in the
- 00:30:34memory hierarchy an individual core
- 00:30:36has its own dedicated registers from the
- 00:30:39register file so this is split up
- 00:30:4132 different ways those cores can all
- 00:30:44access the shared memory the 64k of
- 00:30:46shared memory here
- 00:30:48and then they can also access what's
- 00:30:50called the global memory which is really
- 00:30:51just the dram
- 00:30:52frame buffers on the gpu card itself
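The three levels just described can be annotated directly in a kernel (an illustrative sketch assuming 128-thread blocks and an input length that is a multiple of 128):

```cuda
// Where each value lives: per-core registers, per-SM shared memory,
// and the global DRAM (frame buffer) on the card.
__global__ void hierarchyDemo(const float *g_in, float *g_out) {
    __shared__ float tile[128];        // per-SM shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = g_in[i];                 // global DRAM read into 'r',
                                       // which lives in a per-core register
    tile[threadIdx.x] = r * r;         // register -> shared memory
    __syncthreads();                   // make the writes visible block-wide
    // read a neighbor's value from shared memory, write back to DRAM
    g_out[i] = tile[(threadIdx.x + 127) % 128];
}
```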
- 00:30:57next up was kepler in 2012
- 00:31:00and the illustration just gets more and
- 00:31:03more dense
- 00:31:04um so you can see there's you know many
- 00:31:06more components
- 00:31:07that are fitting on the on the chip
- 00:31:09surface here and again this is just a
- 00:31:10result of having
- 00:31:12a higher density fabrication process
- 00:31:14that lets you put more processing power
- 00:31:16on the same chip so this guy has
- 00:31:1815 streaming multiprocessors each of
- 00:31:21which has 192
- 00:31:23processors for a total of almost 3 000
- 00:31:26sps on a single chip otherwise the
- 00:31:29architecture is quite similar
- 00:31:30level 1 cache level 2 cache access to
- 00:31:33dram
- 00:31:34local register files and so forth here's
- 00:31:36an exploded view an almost
- 00:31:39impossible to read view of
- 00:31:42this chip but you can see that
- 00:31:44there's a whole variety of different
- 00:31:45cores now that are going to be included
- 00:31:46in here
- 00:31:47some of them are starting to be able to
- 00:31:49provide native
- 00:31:51improvements to floating point
- 00:31:52calculations and so forth they've all
- 00:31:54got load store units
- 00:31:56and then there's also some special
- 00:31:57function units here for transcendental
- 00:32:00operations similarly there's still
- 00:32:02shared memory
- 00:32:03and each of these guys is also going to
- 00:32:05have pieces of a register file
- 00:32:07so it's kind of more of the same being
- 00:32:09stamped out here
- 00:32:10to provide higher throughput
- 00:32:12calculations
- 00:32:14moved to the maxwell architecture in
- 00:32:162014 these are nicely evenly spaced for
- 00:32:19the most part in two year increments
- 00:32:21i couldn't find a picture of a maxwell
- 00:32:24gpu but i did find an illustration of a
- 00:32:26maxwell
- 00:32:27streaming multiprocessor nvidia kind of
- 00:32:30played with names here so they called
- 00:32:32some of the sms
- 00:32:33sms and some of them smxs and smms
- 00:32:36it's all kind of the same idea here
- 00:32:38we've got on the
- 00:32:40typical maxwell architecture we're going
- 00:32:42to have 16 of these sms
- 00:32:44with 128 sps per sm so this is one of
- 00:32:47those
- 00:32:48sms and you end up with
- 00:32:512,048 streaming processors
- 00:32:54per chip again very similar kinds of
- 00:32:57pieces of functionality here i'm
- 00:32:59not going to drill down into these for
- 00:33:01each for each chip set but
- 00:33:03you get the idea that it's just kind of
- 00:33:05more of the same
- 00:33:072016 introduced the pascal architecture
- 00:33:10and as you can see higher density we've
- 00:33:13got
- 00:33:14now they're starting to use the term
- 00:33:16graphics processing cluster so there's
- 00:33:19six of these and you can see each of
- 00:33:21those
- 00:33:22is this larger group of of processing
- 00:33:25elements
- 00:33:26within each of those there's 10
- 00:33:28streaming multiprocessors
- 00:33:30right so back at the early generations
- 00:33:32of tesla we had
- 00:33:3310 streaming multiprocessors on the whole
- 00:33:35chip now we've got 10 of them just in
- 00:33:37this one little
- 00:33:38section of the chip and we can stamp
- 00:33:39that out a bunch of different times so
- 00:33:41with six gpcs and 10 sms per we've got
- 00:33:4460 streaming multiprocessors
- 00:33:46and those can each have 64 sps
- 00:33:49giving us a total of almost 4 000
- 00:33:52streaming processors on one chip
- 00:33:54otherwise similar things level 1 cache
- 00:33:56level 2 cache register files
- 00:33:58memory access
- 00:34:02here's a close-up view of the individual
- 00:34:05sms and sps so you can see here there's
- 00:34:09some ordinary cores which are similar to
- 00:34:11what we've been seeing already
- 00:34:12but there's these dp units stand for
- 00:34:15double precision units so they know how
- 00:34:17to do double precision arithmetic
- 00:34:20as well as single precision there's a
- 00:34:22load store unit for memory access and
- 00:34:23then a special function unit for each
- 00:34:25group of four cores
- 00:34:27so again more capability on the chip
- 00:34:30giving you more flexibility for the
- 00:34:32calculations you're going to do
- 00:34:34shared memory there's a register file
- 00:34:36for the individual processors and of
- 00:34:38course we have access to global memory
- 00:34:42volta in 2017 again more of the same
- 00:34:47we've got now still six graphics
- 00:34:49processing clusters but instead of 10
- 00:34:51sms per we've got 14 per for a total of
- 00:34:5384
- 00:34:54streaming multiprocessors and now we're
- 00:34:57to the point where we're kind of
- 00:34:58pulling apart single precision and
- 00:35:01double precision so each of these sms
- 00:35:03has 64 single precision cores for
- 00:35:06floating point
- 00:35:0864 cores for integer
- 00:35:11arithmetic
- 00:35:12and 32 cores for double precision
- 00:35:15floating point
- 00:35:16and that's kind of the holy grail here
- 00:35:18for gpus in general purpose computing
- 00:35:20most of the
- 00:35:21modeling that gets done in in physical
- 00:35:24simulations and similar sorts of things
- 00:35:27require double precision arithmetic to
- 00:35:29be at all accurate and so the
- 00:35:31introduction of
- 00:35:33dedicated cores on these gpus was a big
- 00:35:36deal
- 00:35:36and then if we you know kind of multiply
- 00:35:38these things out we see that we're
- 00:35:40getting in excess of 5,000 single
- 00:35:42precision cores
- 00:35:43and almost three thousand double
- 00:35:45precision floating point cores
- 00:35:47on one chip here's
- 00:35:50an illustration of
- 00:35:54an individual sm with its sps
- 00:35:58kind of similar to before what's new
- 00:35:59here is these tensor cores
- 00:36:02so as i mentioned before a lot of the
- 00:36:05important uses of a gpu these days is in
- 00:36:08the machine learning
- 00:36:10area where we're training uh deep neural
- 00:36:12networks to
- 00:36:13understand classification problems and
- 00:36:15similar sorts of things
- 00:36:17and the dedicated
- 00:36:20hardware that nvidia builds onto their
- 00:36:21chips
- 00:36:22focused on providing capabilities
- 00:36:25for that is referred to as tensor cores
- 00:36:27so these are kind of purpose-built
- 00:36:30variations on the underlying
- 00:36:32sp processor that do
- 00:36:35directly some of the basic calculations
- 00:36:38that are necessary for
- 00:36:39training and evaluating a neural net
- 00:36:44turing 2018 here again more of the same
- 00:36:48we've got 72
- 00:36:49sms in total 64 cores per sm
- 00:36:538 tensor cores per sm so that kind of
- 00:36:56continues going forward so we've got
- 00:36:58almost 5 000 cores cuda cores and
- 00:37:01more than 500 tensor cores otherwise
- 00:37:04l1 l2 memory access and so forth is
- 00:37:06quite similar
- 00:37:08this is a so this is just a diagram of
- 00:37:11the of the chip
- 00:37:12this right here is actually a micrograph
- 00:37:14so that's actually what the
- 00:37:15what the components on the chip itself
- 00:37:17look like
- 00:37:18and you can see that you know when we
- 00:37:20say stamp out another core that's
- 00:37:22literally what's going on here they're
- 00:37:23just duplicating that particular chunk
- 00:37:25of the
- 00:37:26of the uh of the image that's used to
- 00:37:28etch the
- 00:37:29components into the surface of the
- 00:37:31silicon um and
- 00:37:32and it's really evident how that's set
- 00:37:35up when you see an actual picture of it
- 00:37:38and then here's the the detail here of
- 00:37:40the of a single sm
- 00:37:42and it's sps uh same as before right
- 00:37:44we've got some integer
- 00:37:46units some floating
- 00:37:47point units some tensor cores
- 00:37:49register files cache special function
- 00:37:51units and so forth
- 00:37:55finally the ampere architecture is the
- 00:37:57current one
- 00:37:58so here's an ampere gpu
- 00:38:01it's got seven uh graphics processing
- 00:38:06clusters well this looks like eight in
- 00:38:09the picture
- 00:38:10uh with 12 sms per so a total of 84
- 00:38:14streaming multiprocessors and then a
- 00:38:18additional statistics here 128 cores per
- 00:38:21sm which finally gets us
- 00:38:22over 10,000 cuda cores just
- 00:38:25on this one chip
- 00:38:26across all of the sms within each gpc
- 00:38:29and 336 tensor cores you know one of the
- 00:38:32things that you see when you look at
- 00:38:33these statistics over time is
- 00:38:35as the manufacturer kind of goes back
- 00:38:37and forth between you know what are they
- 00:38:38truly trying to emphasize here
- 00:38:40um how much space on the chip do they
- 00:38:44dedicate to different kinds of
- 00:38:45processors
- 00:38:46and they're really trying to respond to
- 00:38:48market demands right is it going to be
- 00:38:49more important for us to just do raw
- 00:38:51graphics processing is it going to be
- 00:38:52more important for us to provide good
- 00:38:54capabilities for machine learning and
- 00:38:55that kind of stuff
- 00:38:57and you can see maybe a little decreased
- 00:38:58emphasis in a way on on the tensor cores
- 00:39:00from the previous generation
- 00:39:02architecture and more of an emphasis on
- 00:39:05just individual
- 00:39:06individual standard cuda cores i should
- 00:39:09also point out that
- 00:39:10this is kind of one sample of one
- 00:39:13particular chip
- 00:39:14at each of the architecture levels if
- 00:39:16you go look
- 00:39:17look around at different releases from
- 00:39:19nvidia
- 00:39:20you'll see that they will have
- 00:39:25various variations on these statistics
- 00:39:27within the same processor family so
- 00:39:29there might be
- 00:39:30you know i'm making this up but
- 00:39:3110 chips in the ampere family
- 00:39:34that have different numbers of each of
- 00:39:35these types of processors
- 00:39:37that are intended to kind of focus on
- 00:39:38different markets right is this a chip
- 00:39:40that's going to be
- 00:39:41used for high performance super
- 00:39:43computing is this a chip
- 00:39:44that's going to be used for you know
- 00:39:47high capacity desktop
- 00:39:49rendering like at a graphics company or
- 00:39:52a film production company or
- 00:39:54for a high-end engineering workstation
- 00:39:56or is this going to be a gpu that's
- 00:39:57going to be used in a mobile device
- 00:39:59which obviously has considerable
- 00:40:01constraints around power and that sort
- 00:40:03of thing
- 00:40:03so these are not the only
- 00:40:06versions
- 00:40:07of the statistics for these
- 00:40:08chips it's just kind of a representative
- 00:40:10sample
- 00:40:13and then here's the the ampere streaming
- 00:40:16multiprocessor
- 00:40:17and again you can see integer units
- 00:40:19floating point units 64-bit floating
- 00:40:21point units
- 00:40:22and then the tensor core as well
- 00:40:26by way of summary so here's kind of the
- 00:40:29same
- 00:40:29historic perspective here going
- 00:40:31back to 2008 and the different micro
- 00:40:33architectures
- 00:40:34and what i've done here is just
- 00:40:35collected together the from these slides
- 00:40:38and again these are just representative
- 00:40:39numbers you can find
- 00:40:41parts in each of these product families
- 00:40:43that have more or less
- 00:40:45sms or sps but just kind of a
- 00:40:47representative sample
- 00:40:49back here we had 128 sps now we're
- 00:40:52at 10 000 sps which is pretty remarkable
- 00:40:55growth
- 00:40:57i also wanted to point out here that
- 00:41:00the machines in what we call
- 00:41:02our cuda lab
- 00:41:04actually have a quadro
- 00:41:05k620 which is not a super recent
- 00:41:09gpu board in them
- 00:41:13you can tell here by the model number
- 00:41:14with the k in it that this actually
- 00:41:16follows the kepler architecture
- 00:41:18and the kepler that we looked at earlier
- 00:41:20has 15 sms and
- 00:41:21a lot of sps the ones in the cuda lab
- 00:41:24actually are
- 00:41:24rather more modest desktop focused
- 00:41:27releases of this chipset
- 00:41:28um that have only 384
- 00:41:33sps so we'll just have to make do if
- 00:41:34that's what you're using if you've got a
- 00:41:36more capable
- 00:41:37laptop or desktop you may be in in this
- 00:41:40range instead
- 00:41:41that's all good as we'll see the
- 00:41:44programming model that's provided by
- 00:41:46cuda and this is really an important
- 00:41:48idea to keep in mind as you're
- 00:41:49developing software for these guys
- 00:41:51is it's designed to run exactly the same
- 00:41:54code
- 00:41:55no matter where you are in the history
- 00:41:58of these architectures
- 00:41:59in particular it's designed such that
- 00:42:01you can actually run this on future
- 00:42:03architecture so when the hopper
- 00:42:04architecture
- 00:42:05becomes available as actual silicon
- 00:42:07we'll be able to take those same
- 00:42:09programs and
- 00:42:10run them on the hopper architecture
- 00:42:12without any modifications at all
- 00:42:14because of the abstractions that are
- 00:42:15provided by
- 00:42:18the cuda model for programming
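One consequence of that abstraction is that a program can discover at run time what it is running on, so the same binary built today can report a future part's geometry. A minimal sketch using the runtime API (querying device 0 is an assumption for illustration):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Query the device through the CUDA runtime: the compute capability
// tracks the microarchitecture generation, and the SM count is the
// figure this lecture has been tabulating for each chip.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: compute capability %d.%d, %d SMs\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}
```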
- CUDA
- GPU
- CPU
- Nvidia
- Architecture
- Tensor Cores
- Pascal
- Ampere
- Fermi
- Kepler