CUDA Hardware

00:42:20
https://www.youtube.com/watch?v=kUqkOAU84bA

Summary

TLDR: This overview focuses on how CUDA hardware has changed over the years, in particular Nvidia's GPUs. The video explains how each GPU micro-architecture, from Tesla through Ampere, offers different capabilities for general-purpose computation as well as for graphics. It discusses how CPUs and GPUs differ: a CPU is optimized for low-latency execution, while a GPU is optimized for high throughput. It also shows how CUDA lets these devices be programmed for general-purpose and machine-learning workloads, and closes with a review of terminology and of Nvidia's well-known product generations.
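
As a rough illustration of the latency-versus-throughput contrast described above, the sketch below (assumed, not from the video; the function names are placeholders) shows the same element-wise addition written once as a serial CPU loop and once as a CUDA kernel in which each of the GPU's many cores handles one small piece of the job.

    #include <cuda_runtime.h>

    // GPU version: one lightweight thread per element; the hardware runs
    // thousands of these in parallel, favoring total throughput over the
    // latency of any single operation.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // CPU version: one (or a few) powerful cores walk the array serially,
    // relying on large caches to keep per-element latency low.
    void vecAddCpu(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
    }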

Takeaways

  • 🖥️ A CPU is designed primarily for low-latency execution of individual tasks.
  • 🚀 A GPU trades single-task latency for much higher overall throughput.
  • 📈 The CUDA architecture enables general-purpose computation on the GPU (a minimal launch sketch follows this list).
  • 🔍 Nvidia GPUs now ship with newer features such as Tensor Cores for AI.
  • 🔄 Each new generation of technology increases GPU power and capability.
  • 🕰️ CUDA-era micro-architectures include Tesla, Fermi, Kepler, Pascal, and Ampere.
  • 💡 The CUDA architecture allows GPUs to be used as general-purpose compute devices.
  • 🌐 Nvidia GPUs support modern graphics rendering alongside general computation.
  • 🚦 Newer Nvidia hardware improves how work is scheduled and executed on the chip.
  • 🔧 CUDA makes it possible to run the same code across different GPU devices with little extra effort.
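
To make the general-purpose computing point concrete, here is an assumed minimal host-side program (not from the video; the kernel name and sizes are illustrative) showing how a CUDA application allocates device memory, copies data explicitly between CPU and GPU memory, and launches a kernel over a grid of thread blocks. The explicit copies are what unified memory, discussed later, is meant to simplify.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;                                  // buffers in GPU (device) memory
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);    // explicit CPU -> GPU copy
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);      // grid of blocks of 256 threads
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);    // explicit GPU -> CPU copy

        printf("hc[0] = %f\n", hc[0]);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }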

Timeline

  • 00:00:00 - 00:05:00

    This section introduces the evolution of CUDA hardware and contrasts CPU and GPU architecture. CPUs have fewer cores with complex caching aimed at low latency, while GPUs have many cores aimed at high throughput. The GPU architecture allows more parallel processing by devoting less chip area to caching than a CPU does.

  • 00:05:00 - 00:10:00

    The transition from earlier Nvidia GPUs aimed solely at graphics to more modern CUDA-enabled GPUs supporting general computing is highlighted. The architecture evolution is discussed from Nvidia's first GPU in 1999 to modern multi-functional units introduced with micro-architectures like Tesla in 2008, with a focus on reducing feature size and increasing transistor count.

  • 00:10:00 - 00:15:00

    It continues to detail Nvidia's advancements with CUDA-enabled GPUs for general-purpose computing, including the introduction of a unified memory space and tensor cores, catering to the rise of machine learning by optimizing GPU performance for neural-network processing (a unified-memory sketch appears after this timeline).

  • 00:15:00 - 00:20:00

    The historical shift from graphics-specialized hardware to general-purpose computing in GPU chips is discussed. It highlights the redefinition of processing power allocation for flexible computing tasks instead of purely graphics, marking a significant architectural transition starting from the Tesla series.

  • 00:20:00 - 00:25:00

    The architecture of GPUs is further explored using the GeForce 8800, illustrating the central components involved in computation, with particular attention to the streaming multiprocessors (SMs) and the flexibility of managing computational tasks through general-purpose cores.

  • 00:25:00 - 00:30:00

    The structuring of GPUs into clusters and multiprocessors that enable highly parallel processing is detailed. This includes an explanation of the register files and memory hierarchy that support high-throughput processing, and showcases the Fermi architecture's improvements with enhanced cores and caches (a shared-memory sketch appears after this timeline).

  • 00:30:00 - 00:35:00

    Subsequent GPU architectures like Kepler, Maxwell, and Pascal are analyzed for their increased processing power and improved density through more advanced fabrication. Each generation adds more cores and enhanced precision capabilities, addressing evolving computing needs.

  • 00:35:00 - 00:42:20

    Finally, Volta and Ampere architectures are explored, highlighting technological advancements like tensor cores for machine learning applications and changes in manufacturing focus to balance graphical and computing performance, reflecting market needs and technological progression.
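
The unified memory mentioned in the 00:10:00 - 00:15:00 entry can be sketched as follows. This is an assumed minimal example (the kernel name and sizes are placeholders) using the CUDA runtime's cudaMallocManaged, where a single allocation is visible from both the CPU and the GPU and the runtime migrates the data as needed instead of the programmer issuing explicit copies.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));    // one allocation, one address space for CPU and GPU
        for (int i = 0; i < n; ++i) x[i] = 1.0f;     // touched from the CPU
        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // touched from the GPU; the runtime moves pages as needed
        cudaDeviceSynchronize();                     // wait for the GPU before the CPU reads the result
        printf("x[0] = %f\n", x[0]);
        cudaFree(x);
        return 0;
    }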
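
For the register / shared / global memory hierarchy referenced in the 00:25:00 - 00:30:00 entry, here is an assumed block-level reduction (the kernel name and block size are illustrative): per-thread values live in the register file, threads within one SM cooperate through __shared__ memory with __syncthreads(), and the result is written back out to global (DRAM) memory.

    // Sums 256-element chunks of 'in'; launch one block of 256 threads per chunk.
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float tile[256];                  // shared memory, visible to all threads (SPs) in this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;            // 'v' sits in the thread's slice of the register file
        tile[threadIdx.x] = v;
        __syncthreads();                             // pthreads-style coordination over the shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];               // partial sum written back to global (DRAM) memory
    }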

Video Q&A

  • What are the differences between a CPU and a GPU?

    A CPU aims at low-latency execution of individual tasks, while a GPU aims at high throughput, executing a large number of operations at the same time.

  • Why are Tensor Cores used in a GPU?

    Tensor Cores help accelerate machine-learning work, in particular training and evaluating deep neural networks (a short Tensor Core sketch follows this list).

  • Why do Nvidia GPUs keep gaining new technology?

    Newer technology delivers better performance and the ability to run a wider range of workloads, such as AI training.

  • What is meant by the 'CUDA architecture'?

    The CUDA architecture is Nvidia's GPU design that makes general-purpose computation possible on the GPU's many cores.

  • What distinguishes the CUDA era?

    The CUDA era focuses on producing GPUs that can act as general-purpose compute devices in addition to doing graphics work.

  • How do Nvidia's technology advances help?

    The growing processing power makes workloads such as AI practical, computation that goes beyond what CPUs alone can handle.

  • When did CUDA first appear?

    The first CUDA architecture arrived in 2008 with the introduction of Tesla.

  • What advantages do recent Nvidia chips provide?

    They can handle very heavy workloads and include more on-chip memory, which widens the range of fields in which they can be used (a device-query sketch also follows this list).
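
As a sketch of how the Tensor Cores mentioned above are actually programmed, the fragment below uses CUDA's WMMA API from <mma.h> (available on Volta and newer, compiled for sm_70 or later). It is an assumed minimal example, not from the video: a single warp multiplies two 16x16 half-precision tiles and accumulates in single precision; the kernel name is a placeholder.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes C = A * B + C for a single 16x16x16 tile on the tensor cores.
    __global__ void tileMma(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

        wmma::fill_fragment(accFrag, 0.0f);           // start from a zero accumulator
        wmma::load_matrix_sync(aFrag, a, 16);         // leading dimension 16
        wmma::load_matrix_sync(bFrag, b, 16);
        wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
        wmma::store_matrix_sync(c, accFrag, 16, wmma::mem_row_major);
    }
    // Launch with at least one full warp, e.g. tileMma<<<1, 32>>>(a, b, c);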
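
To see the per-device figures discussed above (number of SMs, on-chip memory, compute capability) for a given card, a small assumed query against the CUDA runtime looks like this; cudaGetDeviceProperties and the printed fields are standard runtime API, the rest is illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);            // properties of device 0
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        printf("SMs: %d, shared memory per block: %zu bytes, 32-bit registers per block: %d\n",
               prop.multiProcessorCount, prop.sharedMemPerBlock, prop.regsPerBlock);
        return 0;
    }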

Subtitles (en)
  • 00:00:03
    this is a brief
  • 00:00:04
    overview of some of the cuda hardware
  • 00:00:06
    that's been around for years now
  • 00:00:09
    i wanted to give you some kind of
  • 00:00:10
    perspective on where we've been where
  • 00:00:12
    the current generation is and
  • 00:00:13
    kind of where we're headed maybe by way
  • 00:00:16
    of overview i want to
  • 00:00:17
    spend a little bit of time just
  • 00:00:18
    refreshing your thinking about the
  • 00:00:20
    differences between a cpu
  • 00:00:22
    and a gpu and on the left here we've got
  • 00:00:25
    a illustration of a typical
  • 00:00:26
    modern cpu this particular one is
  • 00:00:29
    showing a four
  • 00:00:30
    core architecture so there's individual
  • 00:00:33
    processing
  • 00:00:34
    cores that are part of the same chip and
  • 00:00:37
    they have
  • 00:00:38
    the core itself which is the kind of the
  • 00:00:40
    processing logic that's actually doing
  • 00:00:42
    things you could think of this
  • 00:00:44
    in general terms as sort of the
  • 00:00:45
    arithmetic logic unit the alu
  • 00:00:48
    there's going to be for each of the
  • 00:00:50
    cores this would be typical of for
  • 00:00:52
    example a current intel processor
  • 00:00:54
    there's going to be some level 1 cache
  • 00:00:56
    that's dedicated to that core it's often
  • 00:00:58
    split into instruction cache and data
  • 00:01:00
    cache
  • 00:01:01
    there's also going to be a bunch of
  • 00:01:02
    control logic so circuitry that kind of
  • 00:01:05
    orchestrates the behavior of the core
  • 00:01:07
    which is distinct from the core itself
  • 00:01:09
    in the sense that the core is there to
  • 00:01:11
    do the computation to do an addition or
  • 00:01:13
    multiplication or whatever and the
  • 00:01:14
    control logic is there to
  • 00:01:16
    orchestrate that process so there's
  • 00:01:19
    basically four copies of that stamped
  • 00:01:21
    out on this particular chip
  • 00:01:22
    and then a very large proportion of the
  • 00:01:24
    chip is going to be taken
  • 00:01:26
    up by additional cache memory to improve
  • 00:01:28
    the performance of the processor
  • 00:01:30
    to lower the latency which is kind of
  • 00:01:33
    the focus
  • 00:01:33
    of a cpu processor
  • 00:01:37
    so we'll have some level 2 cache that's
  • 00:01:40
    going to be
  • 00:01:41
    shared in different ways in different
  • 00:01:42
    architectures and then level 3 cache
  • 00:01:43
    that's usually
  • 00:01:44
    shared across all of the cores and
  • 00:01:46
    that's kind of where the processor chip
  • 00:01:48
    ends and of course we can't have a
  • 00:01:51
    modern computer without additional
  • 00:01:53
    memory so the dram here is just
  • 00:01:55
    the general memory on the motherboard
  • 00:01:57
    for example
  • 00:01:59
    again the point of the architecture of
  • 00:02:02
    the cpu is to
  • 00:02:04
    run a lot of different types of compute
  • 00:02:06
    jobs very flexibly and with very low
  • 00:02:09
    latency
  • 00:02:10
    hence all of the cache that's devoted or
  • 00:02:12
    the space that's devoted on the chip to
  • 00:02:14
    cache
  • 00:02:16
    when we look at the look at the gpu then
  • 00:02:18
    obviously the
  • 00:02:19
    the architecture is quite a lot
  • 00:02:20
    different so there's going to be
  • 00:02:23
    still control logic and a little bit of
  • 00:02:26
    cache memory that's going on
  • 00:02:27
    across the gpu but the overwhelming
  • 00:02:30
    majority of the gpu
  • 00:02:32
    chip itself is going to be given over to
  • 00:02:34
    processing cores
  • 00:02:36
    and very many of them the key difference
  • 00:02:38
    here is that
  • 00:02:39
    the emphasis on the gpu side is not
  • 00:02:42
    going to be
  • 00:02:43
    low latency in other words something
  • 00:02:45
    that's going to require a lot of
  • 00:02:46
    elaborate caching in the in the memory
  • 00:02:49
    hierarchy
  • 00:02:50
    but high throughput so we can apply a
  • 00:02:53
    lot of different processing cores to do
  • 00:02:55
    small pieces of calculations of the
  • 00:02:57
    overall
  • 00:02:58
    job that we're trying to accomplish and
  • 00:03:00
    by
  • 00:03:01
    reducing the amount of chip space that's
  • 00:03:03
    dedicated to these kinds of other things
  • 00:03:05
    like caching and
  • 00:03:07
    more elaborate control you know out of
  • 00:03:08
    order execution and that kind of stuff
  • 00:03:10
    we can dramatically improve the number
  • 00:03:13
    of cores
  • 00:03:14
    or increase the number of cores that are
  • 00:03:15
    available on the chip so we can just do
  • 00:03:17
    more
  • 00:03:18
    processing now it's not going to be as
  • 00:03:20
    low latency because we don't have all of
  • 00:03:22
    that caching we've got a little bit of
  • 00:03:23
    cache but not much by comparison to a cpu
  • 00:03:26
    but because we have so many individual
  • 00:03:28
    processing cores
  • 00:03:29
    we can do a lot of calculation per time
  • 00:03:32
    in other words the throughput can be
  • 00:03:33
    very high
  • 00:03:34
    even though the latency isn't super
  • 00:03:36
    great
  • 00:03:37
    gpu is also going to have some amount of
  • 00:03:39
    level 2 cache that's shared across
  • 00:03:41
    all of the cores and then as with the
  • 00:03:44
    cpu we've got some external memory
  • 00:03:46
    called dram here this would be say in a
  • 00:03:49
    typical
  • 00:03:50
    desktop kind of installation you'd have
  • 00:03:52
    a graphics processing unit that plugged
  • 00:03:54
    into the back plane of the motherboard
  • 00:03:56
    and it would include its own memory
  • 00:03:58
    sometimes called graphics memory or
  • 00:04:00
    frame buffer memory or whatever
  • 00:04:02
    for our purposes in general purpose gpu
  • 00:04:04
    computing
  • 00:04:05
    we're not really thinking of that
  • 00:04:06
    necessarily as something that's going to
  • 00:04:07
    be used to drive
  • 00:04:08
    a video display although in a graphics
  • 00:04:10
    card that's almost always what it's
  • 00:04:12
    being used for but we do have local
  • 00:04:15
    memory on the video card
  • 00:04:16
    that's accessible to the to the gpu
  • 00:04:21
    here's a not so brief summary of
  • 00:04:24
    the various generations that nvidia has
  • 00:04:26
    been through with its cuda architecture
  • 00:04:28
    uh sort of a history of graphics
  • 00:04:31
    processing
  • 00:04:31
    if we go back here to the very beginning
  • 00:04:34
    in 1999
  • 00:04:35
    saw the introduction of nvidia's geforce
  • 00:04:38
    256
  • 00:04:39
    which was really the first graphics
  • 00:04:41
    processing unit to be
  • 00:04:42
    released into the wild and there was a
  • 00:04:44
    whole variety of different
  • 00:04:46
    improvements on that card uh throughout
  • 00:04:48
    the uh
  • 00:04:49
    the early 2000s where we're really
  • 00:04:52
    interested here
  • 00:04:52
    is at the beginning of what i've what
  • 00:04:55
    i've labeled here the beginning of kind
  • 00:04:57
    of the cuda era
  • 00:04:58
    starting here in 2008 and kind of going
  • 00:05:00
    forward until the present
  • 00:05:01
    we've got sort of the modern era of gpus
  • 00:05:05
    and a key distinguishing characteristic
  • 00:05:06
    here is that these gpus allow
  • 00:05:09
    for very high performance graphics
  • 00:05:11
    generation
  • 00:05:12
    but they also are being designed to
  • 00:05:14
    allow for general purpose computing
  • 00:05:16
    which is
  • 00:05:17
    really our interest in them in this
  • 00:05:18
    course
  • 00:05:20
    so the first of these are what what
  • 00:05:22
    nvidia refers to as the micro
  • 00:05:24
    architecture
  • 00:05:25
    which is just kind of a generation or a
  • 00:05:27
    family of gpu
  • 00:05:29
    processors that they've released that
  • 00:05:30
    have kind of common characteristics
  • 00:05:33
    the first of those was called tesla it
  • 00:05:35
    was introduced in 2008
  • 00:05:37
    and there's just some interesting
  • 00:05:39
    statistics here you can see across
  • 00:05:41
    these different generations the first of
  • 00:05:43
    these is the the process
  • 00:05:45
    and the process by that i mean the the
  • 00:05:48
    fabrication process that was used
  • 00:05:50
    to make the chips and one way of
  • 00:05:53
    understanding the the differences from
  • 00:05:55
    one process to another is
  • 00:05:57
    what's known as the feature size or the
  • 00:05:59
    basically the size of a single
  • 00:06:00
    transistor or a single component that
  • 00:06:02
    you're going to put down on the on the
  • 00:06:03
    chip itself
  • 00:06:05
    and that's measured in nanometers which
  • 00:06:08
    is pretty small
  • 00:06:09
    and you can see back in the in the late
  • 00:06:11
    90s with the first gpu
  • 00:06:13
    they were working at 220 nanometers and
  • 00:06:16
    by the time
  • 00:06:17
    we got to 2008 with the tesla we're down
  • 00:06:19
    to 65 nanometers
  • 00:06:21
    and that's been decreasing ever since uh
  • 00:06:23
    to the present
  • 00:06:24
    generation of nvidia gpus uses a seven
  • 00:06:28
    nanometer process
  • 00:06:29
    so it's it's a couple hundred times
  • 00:06:32
    denser
  • 00:06:33
    than what we had back at the beginning
  • 00:06:35
    the upside is that you can can
  • 00:06:37
    you can fit more components on the
  • 00:06:39
    surface of a chip
  • 00:06:40
    which allows you to build something
  • 00:06:42
    that's more and more capable
  • 00:06:44
    you can see also a similar measure here
  • 00:06:46
    in the next column the number of
  • 00:06:47
    transistors that are on that chip
  • 00:06:49
    which surprisingly is something that
  • 00:06:51
    nvidia is pretty happy to trumpet
  • 00:06:53
    if you look for that kind of information
  • 00:06:55
    from intel or amd
  • 00:06:56
    the sort of the cpu vendors they're less
  • 00:06:59
    they're less happy about sharing that
  • 00:07:00
    information
  • 00:07:01
    but you can see we went from you know in
  • 00:07:03
    the neighborhood of a couple of hundred
  • 00:07:05
    million
  • 00:07:05
    transistors which is still a lot to now
  • 00:07:08
    the
  • 00:07:09
    the ampere micro architecture is at 28.3
  • 00:07:13
    billion
  • 00:07:13
    transistors on one chip which is that's
  • 00:07:16
    just a lot of
  • 00:07:17
    billions so we're seeing
  • 00:07:20
    smaller and smaller feature sizes which
  • 00:07:22
    leads to larger and larger numbers of
  • 00:07:24
    transistors that we can use
  • 00:07:25
    and um and that means more and more
  • 00:07:28
    capability over time
  • 00:07:30
    and we're going to kind of unpack some
  • 00:07:31
    of these micro architecture generations
  • 00:07:33
    so we'll start with tesla
  • 00:07:35
    let me just point out some highlights
  • 00:07:37
    here that i've put over here in the
  • 00:07:38
    notes column
  • 00:07:39
    at the pascal architecture one of the
  • 00:07:41
    things that nvidia introduced was this
  • 00:07:43
    notion of a unified memory
  • 00:07:46
    we'll we'll sort of be working with the
  • 00:07:48
    less capable
  • 00:07:50
    parts here so we won't have that
  • 00:07:52
    advantage of this unified memory we have
  • 00:07:54
    to
  • 00:07:54
    we have to orchestrate moving
  • 00:07:55
    information back and forth between cpu
  • 00:07:57
    memory
  • 00:07:58
    and gpu memory but that's kind of
  • 00:08:00
    cumbersome
  • 00:08:01
    and results in some limitations and more
  • 00:08:04
    complexity in the code
  • 00:08:06
    and so nvidia has tried to address that
  • 00:08:07
    by first of all making the
  • 00:08:09
    memory space on the gpu itself unified
  • 00:08:12
    so we'll talk about the different types
  • 00:08:14
    of memory that are on the gpu
  • 00:08:16
    and those now are accessible through a
  • 00:08:18
    single address space
  • 00:08:20
    but they've also now provided a
  • 00:08:21
    mechanism where you can program
  • 00:08:23
    uh your applications in such a way that
  • 00:08:26
    whether you
  • 00:08:26
    allocate memory on the cpu memory or on
  • 00:08:29
    the gpu memory
  • 00:08:30
    they're considered part of the same
  • 00:08:32
    address space and you can kind of move
  • 00:08:33
    things back and forth and access those
  • 00:08:35
    memories
  • 00:08:36
    uh kind of transparently in your code
  • 00:08:38
    and then the runtime environment
  • 00:08:40
    moves things back and forth
  • 00:08:44
    in with the volta micro architecture in
  • 00:08:47
    2017
  • 00:08:48
    uh nvidia introduced what they call
  • 00:08:50
    tensor cores
  • 00:08:52
    so obviously at this point in history
  • 00:08:55
    we're seeing the the rise of machine
  • 00:08:56
    learning and deep neural networks
  • 00:08:58
    and the use of gpus specifically to
  • 00:09:01
    train neural networks and also to
  • 00:09:02
    evaluate
  • 00:09:03
    uh and solve problems using those neural
  • 00:09:05
    networks and although the
  • 00:09:07
    the prior generations of gpus were
  • 00:09:09
    really good at that
  • 00:09:10
    um what nvidia tried to do was uh
  • 00:09:13
    introduce
  • 00:09:14
    uh technology specifically designed for
  • 00:09:18
    processing deep neural networks and
  • 00:09:19
    that's what they refer to as their
  • 00:09:21
    tensor cores
  • 00:09:22
    and then in 2019 with the touring micro
  • 00:09:25
    architecture they
  • 00:09:26
    added a ray tracing capability so
  • 00:09:29
    essentially real time ray tracing for
  • 00:09:31
    graphics rendering
  • 00:09:32
    and a fairly considerable percentage of
  • 00:09:34
    the chip surface
  • 00:09:36
    in the turing architecture is dedicated
  • 00:09:39
    specifically to ray tracing
  • 00:09:41
    so in a way we're seeing kind of another
  • 00:09:43
    one of these pendulum swings that we see
  • 00:09:45
    in computing quite regularly
  • 00:09:47
    back in the in the late 90s early 2000s
  • 00:09:50
    there wasn't any kind of general purpose
  • 00:09:54
    focus in the design of these chips it
  • 00:09:55
    was very specific
  • 00:09:57
    logic on these chips that was focused on
  • 00:09:59
    doing
  • 00:10:00
    graphic stuff only and in order to use
  • 00:10:03
    those
  • 00:10:04
    generations of chips for general purpose
  • 00:10:06
    computing you had to kind of recast your
  • 00:10:08
    problem as a graphics problem
  • 00:10:10
    let the gpu do its thing and then kind
  • 00:10:12
    of map it back into the
  • 00:10:13
    the the more familiar computing domain
  • 00:10:17
    and with with starting with the tesla
  • 00:10:20
    architecture that was not really
  • 00:10:21
    necessary anymore you could just program
  • 00:10:23
    your application directly
  • 00:10:24
    but now we're starting to see a return a
  • 00:10:26
    little bit to some dedicated hardware
  • 00:10:28
    that's really focused on graphics
  • 00:10:29
    uh in this case to do to do real-time
  • 00:10:32
    ray tracing
  • 00:10:34
    um another thing here is that there's a
  • 00:10:36
    little bit of terminological confusion
  • 00:10:38
    and we'll see that there's different
  • 00:10:39
    terminology that gets used in different
  • 00:10:40
    generations there's a little variation
  • 00:10:42
    over time
  • 00:10:43
    but because the first microarchitecture
  • 00:10:46
    that nvidia introduced that did general
  • 00:10:48
    purpose computing
  • 00:10:49
    was called tesla they have kind of held
  • 00:10:53
    that
  • 00:10:53
    term to refer not only to this specific
  • 00:10:56
    generation of chips
  • 00:10:58
    but to refer to the basic idea of doing
  • 00:11:01
    general purpose computing on the
  • 00:11:02
    on the nvidia chipset so you'll still
  • 00:11:05
    hear hear and read people talking about
  • 00:11:07
    the tesla cores
  • 00:11:08
    and that really just is a general or has
  • 00:11:10
    become kind of a general reference
  • 00:11:12
    to this ability to do general purpose
  • 00:11:14
    computing on the gpu
  • 00:11:16
    and one final note here from this table
  • 00:11:18
    is the next generation of
  • 00:11:20
    chip that nvidia has announced is
  • 00:11:22
    actually named after admiral grace
  • 00:11:24
    murray hopper
  • 00:11:25
    who was one of the early pioneers in
  • 00:11:26
    computing and it was kind of cool to see
  • 00:11:29
    she's also the first woman on this list
  • 00:11:31
    obviously these are the names of famous
  • 00:11:33
    scientists and engineers and so forth
  • 00:11:34
    over time and she's the first woman
  • 00:11:36
    and one of the sort of founding founders
  • 00:11:40
    of computer science in in many
  • 00:11:42
    ways with the introduction of some of
  • 00:11:43
    the work that she did uh
  • 00:11:45
    years ago
  • 00:11:48
    all right um there's lots of terminology
  • 00:11:52
    that you kind of need to get embedded a
  • 00:11:54
    little bit when you're reading and
  • 00:11:56
    and trying to understand these gpus so
  • 00:11:58
    this is kind of a little secret decoder
  • 00:12:00
    ring some of the common things that
  • 00:12:01
    might be mystifying
  • 00:12:02
    i've starred the ones that are really
  • 00:12:04
    important so these two
  • 00:12:06
    uh these two terms here the streaming
  • 00:12:07
    multiprocessor and the streaming
  • 00:12:09
    processor
  • 00:12:10
    unfortunately they're kind of close in
  • 00:12:12
    in meaning and also
  • 00:12:14
    in terms of the initialism that's used
  • 00:12:15
    to represent them but those are really
  • 00:12:17
    the two
  • 00:12:18
    major kind of groupings of processors or
  • 00:12:21
    processing capability
  • 00:12:23
    that we care about in the gpu there's
  • 00:12:26
    there's still some specialized graphics
  • 00:12:27
    focused kinds of
  • 00:12:29
    hardware units on the gpu itself we're
  • 00:12:31
    not really going to drill into those in
  • 00:12:32
    any sort of detail
  • 00:12:34
    and you got to kind of filter those
  • 00:12:36
    things out when you're reading
  • 00:12:38
    documentation and spec sheets and stuff
  • 00:12:40
    for nvidia products
  • 00:12:41
    when you're just thinking about doing
  • 00:12:43
    cuda programming as opposed to doing
  • 00:12:44
    graphics programming
  • 00:12:47
    i learned something new the other day
  • 00:12:49
    that these are not acronyms
  • 00:12:50
    that they're actually called initialisms
  • 00:12:53
    uh an acronym is an initialism that's
  • 00:12:55
    also
  • 00:12:56
    a pronounceable word so unless you want
  • 00:12:58
    to pronounce
  • 00:12:59
    this sm as smum and smup or something
  • 00:13:02
    like that
  • 00:13:03
    they're not actually acronyms they're
  • 00:13:05
    initialisms so there you go free
  • 00:13:06
    knowledge
  • 00:13:08
    in addition these are some other other
  • 00:13:10
    terminologies that get used pretty
  • 00:13:12
    regularly
  • 00:13:13
    so the the streaming multiprocessor is
  • 00:13:16
    kind of a collection of streaming
  • 00:13:17
    processors so multiprocessor contains
  • 00:13:19
    processor
  • 00:13:21
    and those streaming multiprocessors are
  • 00:13:23
    also clustered together in kind of
  • 00:13:24
    larger units on the chip
  • 00:13:26
    and we'll see references to the tpc
  • 00:13:29
    which stands for texture slash processor
  • 00:13:32
    cluster
  • 00:13:33
    and gpc which stands for graphics
  • 00:13:35
    processing cluster
  • 00:13:36
    and these um i don't know that the
  • 00:13:38
    distinction between them is that
  • 00:13:39
    important
  • 00:13:40
    they're really just larger groupings of
  • 00:13:42
    symmetric multiprocessors which are
  • 00:13:44
    themselves groupings of
  • 00:13:46
    streaming process sorry streaming
  • 00:13:48
    multiprocessors which group streaming
  • 00:13:49
    processors
  • 00:13:51
    and something else here um the there's a
  • 00:13:54
    important difference here between
  • 00:13:55
    this the single precision and double
  • 00:13:57
    precision arithmetic that are performed
  • 00:13:59
    by the streaming processors
  • 00:14:01
    in the early going because the
  • 00:14:04
    the density with which the manufacturers
  • 00:14:07
    could get
  • 00:14:08
    features on the chips the early
  • 00:14:11
    streaming processors tended to be single
  • 00:14:13
    precision
  • 00:14:14
    which basically meant they could do
  • 00:14:16
    32-bit arithmetic
  • 00:14:18
    as opposed to double precision which is
  • 00:14:21
    64-bit so there was a
  • 00:14:24
    in order to do the 64-bit calculations
  • 00:14:27
    which you could still do you had to sort
  • 00:14:28
    of split them up
  • 00:14:30
    and run half of each on two different
  • 00:14:32
    streaming
  • 00:14:33
    processors which essentially halved the
  • 00:14:36
    performance
  • 00:14:37
    but at some point along the way nvidia
  • 00:14:39
    figured out ways to get enough
  • 00:14:41
    transistors on the surface of the chip
  • 00:14:43
    to allow you to do direct
  • 00:14:44
    double precision arithmetic and when
  • 00:14:46
    that starts to arise in the history of
  • 00:14:49
    these
  • 00:14:49
    of these chipsets we'll we'll kind of
  • 00:14:51
    point out where
  • 00:14:52
    where we see them starting to cite both
  • 00:14:55
    single precision and double precision
  • 00:14:56
    processing on a single
  • 00:14:58
    gpu this is a
  • 00:15:01
    an illustration that i thought was
  • 00:15:03
    particularly clear in
  • 00:15:04
    expressing the different main components
  • 00:15:07
    of an nvidia gpu
  • 00:15:08
    this actually corresponds to the geforce
  • 00:15:10
    8800 which was one of the first
  • 00:15:12
    tesla micro architecture chip sets that
  • 00:15:15
    was released
  • 00:15:17
    and i don't expect you to kind of
  • 00:15:19
    understand all of the details here but
  • 00:15:20
    it kind of gives you a nice
  • 00:15:22
    sort of overall view of things so we can
  • 00:15:24
    see here that
  • 00:15:25
    uh the the gpu as labeled here is really
  • 00:15:29
    the main portion of what's going on in
  • 00:15:31
    this chip there's some other things
  • 00:15:32
    happening like there's a connection to
  • 00:15:34
    allow you to talk to the host computer
  • 00:15:36
    um to to access the memory on them on
  • 00:15:39
    the motherboard and so forth
  • 00:15:41
    as well as access here to to dram that's
  • 00:15:44
    on the
  • 00:15:45
    on the board right so this is the the
  • 00:15:47
    video memory or the frame buffer memory
  • 00:15:49
    but what we're interested in here is
  • 00:15:51
    kind of this central portion
  • 00:15:53
    that represents the key computational
  • 00:15:56
    components that are in play here on
  • 00:15:58
    a gpu so in this particular chip there
  • 00:16:01
    are
  • 00:16:01
    eight of these tpcs which as you'll
  • 00:16:04
    recall back here
  • 00:16:05
    stands for texture slash processor
  • 00:16:07
    cluster and we're thinking about this as
  • 00:16:09
    the processor cluster so this is just a
  • 00:16:12
    kind of a high level grouping
  • 00:16:14
    of the streaming multiprocessor and the
  • 00:16:17
    streaming processor
  • 00:16:18
    to provide a mechanism to kind of get
  • 00:16:21
    multiple different
  • 00:16:22
    parts of the chip to do different things
  • 00:16:23
    at different times and so forth
  • 00:16:25
    there's quite a lot of quite a lot of
  • 00:16:26
    technology turned under to figure out
  • 00:16:28
    how to do that a lot of these
  • 00:16:30
    block diagram elements up here are are
  • 00:16:32
    focused on doing that
  • 00:16:34
    and we're going to drill down into
  • 00:16:35
    what's going on inside those tpcs
  • 00:16:38
    some other thing going on other things
  • 00:16:39
    going on here um there's a what it just
  • 00:16:41
    says interconnection network
  • 00:16:43
    um the uh the the
  • 00:16:46
    genesis of this whole notion of a
  • 00:16:48
    streaming multi-processor
  • 00:16:50
    or streaming processors within the
  • 00:16:51
    streaming multiprocessors
  • 00:16:53
    is that you can sort of stream a
  • 00:16:55
    calculation from one processor to the
  • 00:16:57
    next processor to the next processor
  • 00:17:00
    within the within the gpu so if you if
  • 00:17:02
    you go back and look at some of the
  • 00:17:04
    implementation details of an early
  • 00:17:06
    graphics processing unit
  • 00:17:08
    what you would find is that you'd get
  • 00:17:10
    some input from
  • 00:17:11
    the cpu right so some some element of
  • 00:17:14
    the model that you were trying to render
  • 00:17:16
    say in 3d or 2d and then there would be
  • 00:17:18
    a series of processors
  • 00:17:20
    that were very specifically tailored in
  • 00:17:23
    the hardware
  • 00:17:24
    to handle the the different stages in
  • 00:17:26
    what's called the
  • 00:17:27
    graphics processing pipeline so the cpu
  • 00:17:30
    feeds the gpu some sort of information
  • 00:17:32
    about a vertex or whatever in the scene
  • 00:17:35
    that's to be rendered
  • 00:17:36
    and then these different stages along
  • 00:17:37
    the way represented very specific pieces
  • 00:17:40
    of hardware that did certain kinds of
  • 00:17:41
    things
  • 00:17:42
    and would stream from one to the next to
  • 00:17:44
    the next and eventually
  • 00:17:45
    at the end of this process the the final
  • 00:17:48
    final processor in the step is going to
  • 00:17:50
    put something into into
  • 00:17:51
    the dram the frame buffer memory so it
  • 00:17:54
    actually shows up on the screen
  • 00:17:56
    now that's all fine and good except
  • 00:17:58
    that's not very flexible so if
  • 00:18:00
    you know if this particular step in the
  • 00:18:02
    process
  • 00:18:03
    is really busy and this step in the
  • 00:18:05
    process is really not doing very much
  • 00:18:06
    for a particular scene that's being
  • 00:18:08
    rendered
  • 00:18:09
    you're sort of leaving processing power
  • 00:18:10
    on the table but because these are
  • 00:18:12
    purpose-built
  • 00:18:14
    units within the gpu there's really no
  • 00:18:17
    way to flexibly
  • 00:18:18
    alter that behavior in order to get
  • 00:18:21
    certain parts of the hardware to do a
  • 00:18:22
    different job than it was designed for
  • 00:18:25
    so the idea here of but let me step back
  • 00:18:28
    you can see here the notion of streaming
  • 00:18:30
    though right the information is coming
  • 00:18:31
    from this
  • 00:18:32
    cpu and it streams from one processing
  • 00:18:35
    element to the next to the next until it
  • 00:18:36
    finally ends up in the frame buffer
  • 00:18:38
    so that notion of streaming is still
  • 00:18:40
    really important but instead of having
  • 00:18:42
    these kind of purpose-built modules that
  • 00:18:44
    only do one kind of calculation
  • 00:18:46
    the idea in the general purpose gpu or
  • 00:18:49
    the cuda architecture
  • 00:18:51
    is to say let's not let's not have these
  • 00:18:53
    specific units let's
  • 00:18:55
    let's instead offer a very powerful very
  • 00:18:58
    flexible
  • 00:18:59
    general-purpose processor and allow it
  • 00:19:03
    to be hooked up in such a way that we
  • 00:19:04
    get this kind of streaming behavior
  • 00:19:06
    so what we'll see here each of the
  • 00:19:08
    little elements inside these
  • 00:19:10
    inside these tpcs is one of those
  • 00:19:12
    processors
  • 00:19:13
    and you'll you'll notice here that uh
  • 00:19:17
    in this in this kind of very primitive
  • 00:19:20
    representation of a graphics processing
  • 00:19:22
    pipeline
  • 00:19:23
    this this is kind of a long series of
  • 00:19:25
    operations
  • 00:19:26
    not me not really long but there's
  • 00:19:28
    multiple steps in that process
  • 00:19:30
    and what what the what the streaming
  • 00:19:33
    multiprocessor the streaming processor
  • 00:19:35
    architecture and cuda tries to do
  • 00:19:37
    is to give you the ability to do this
  • 00:19:39
    kind of streaming
  • 00:19:40
    from one process to another to another
  • 00:19:43
    but using these general purpose
  • 00:19:44
    stream processor elements the
  • 00:19:47
    interconnection network then
  • 00:19:49
    is responsible for sort of routing
  • 00:19:51
    intermediate calculation results from
  • 00:19:53
    one streaming processor to another
  • 00:19:55
    but instead of it being instead of it
  • 00:19:58
    being
  • 00:19:59
    helpful to think about it as kind of a
  • 00:20:01
    linear chain
  • 00:20:02
    instead what you're seeing is the
  • 00:20:04
    ability for say
  • 00:20:06
    one of these tpcs to do a bunch of
  • 00:20:08
    calculations
  • 00:20:09
    and that output goes onto this
  • 00:20:11
    interconnection network and might feed
  • 00:20:13
    into
  • 00:20:14
    another set of stream processors
  • 00:20:18
    which could do a different calculation
  • 00:20:19
    and then those results could stream into
  • 00:20:21
    another one
  • 00:20:22
    and eventually that information is going
  • 00:20:23
    to or the result of those
  • 00:20:25
    calculations is going to make it out to
  • 00:20:26
    dram so you can still get the
  • 00:20:30
    this ability to stream calculations from
  • 00:20:32
    one stage to the next stage the next
  • 00:20:33
    stage
  • 00:20:34
    as you transform the information from
  • 00:20:36
    something that's used to model the scene
  • 00:20:38
    on the cpu and end up with pixels on the
  • 00:20:42
    screen
  • 00:20:42
    at the end of that process but it's now
  • 00:20:45
    more flexible than what you had when you
  • 00:20:46
    had dedicated processing stages
  • 00:20:49
    so if if you have more
  • 00:20:52
    more need to replicate the behavior of
  • 00:20:54
    this block you might allocate several
  • 00:20:56
    of the tpcs to doing this function
  • 00:21:00
    and only one of them for example just
  • 00:21:03
    kind of making this up
  • 00:21:05
    to handle this next set of functionality
  • 00:21:07
    and because the processors in the
  • 00:21:09
    streaming
  • 00:21:10
    streaming multiprocessor are general
  • 00:21:12
    purpose computers right
  • 00:21:13
    you can have them do any of these
  • 00:21:14
    functions that you'd like and and change
  • 00:21:17
    that
  • 00:21:17
    over time during the execution of a
  • 00:21:19
    single program or between
  • 00:21:21
    different programs that are going to run
  • 00:21:22
    on the gpu
  • 00:21:24
    the other cool thing about that is we
  • 00:21:27
    can now use these general purpose
  • 00:21:29
    streaming processors to do whatever kind
  • 00:21:31
    of calculations we want
  • 00:21:33
    whether it's specifically tailored to
  • 00:21:34
    doing some graphics thing or we're just
  • 00:21:37
    doing a say a large matrix operation
  • 00:21:39
    that we're storing out here in
  • 00:21:41
    in frame buffer memory but we're never
  • 00:21:43
    actually going to show on the screen
  • 00:21:44
    what that frame buffer memory contains
  • 00:21:46
    because it's just data at that point
  • 00:21:47
    it's not designed to be pixels
  • 00:21:49
    it's just the uh the input or the output
  • 00:21:52
    from the calculation that we're doing
  • 00:21:54
    on all of these very large number of
  • 00:21:56
    streaming processors
  • 00:22:01
    here's a little zoomed in view so this
  • 00:22:02
    is that the picture of the overall
  • 00:22:05
    architecture of the gpu for the 8800
  • 00:22:08
    and this kind of gives you a little bit
  • 00:22:09
    more detail on this on this tpc the pro
  • 00:22:12
    the processing cluster
  • 00:22:14
    so there's two units inside of here that
  • 00:22:17
    are of interest to us
  • 00:22:18
    both of which are labeled sm for
  • 00:22:21
    streaming multiprocessor
  • 00:22:23
    and within the streaming multiprocessor
  • 00:22:25
    there are the individual
  • 00:22:26
    sps the streaming processor and it's the
  • 00:22:29
    sp that's really
  • 00:22:30
    i mean this is what we're what we refer
  • 00:22:31
    to as a core in
  • 00:22:33
    informal terms right that's a that's a
  • 00:22:35
    gpu core or a cuda core you could call
  • 00:22:37
    it
  • 00:22:37
    and you can see in this architecture we
  • 00:22:39
    have um
  • 00:22:41
    eight of those cores eight of those sps
  • 00:22:43
    grouped together
  • 00:22:44
    into an sm and we've got two of those
  • 00:22:47
    sms that are grouped together into a tpc
  • 00:22:49
    and then we saw previously that we have
  • 00:22:51
    eight of these tpcs that are grouped
  • 00:22:53
    together to make up the entirety of the
  • 00:22:55
    gpu
  • 00:22:56
    i've also got here a little bit of an
  • 00:22:58
    exploded diagram of an individual
  • 00:23:00
    stream streaming multiprocessor so we
  • 00:23:02
    can see here in a little bit more detail
  • 00:23:04
    we've got the streaming processors of
  • 00:23:06
    course we have some special functional
  • 00:23:08
    units which do things like
  • 00:23:10
    transcendental functions like sine and
  • 00:23:11
    cosine that sort of thing
  • 00:23:13
    we've got some caching going on as i
  • 00:23:15
    mentioned per per streaming
  • 00:23:17
    multiprocessor so there's an instruction
  • 00:23:19
    level cache
  • 00:23:20
    there's also what's called a constant
  • 00:23:22
    cache so if you have specific constant
  • 00:23:24
    values that you need to refer to
  • 00:23:26
    over and over again nearby the
  • 00:23:28
    processors but
  • 00:23:29
    that need that need to change over time
  • 00:23:31
    you can store those in that cache
  • 00:23:33
    and then there's also some shared memory
  • 00:23:36
    which allows
  • 00:23:37
    or provides access to all of the sps
  • 00:23:40
    on an sm
  • 00:23:41
    so if you have a calculation that that
  • 00:23:44
    you need to
  • 00:23:45
    split up into smaller pieces and allow
  • 00:23:48
    those
  • 00:23:48
    individual pieces to collaborate with
  • 00:23:50
    one another to solve the overall problem
  • 00:23:52
    even at the level of
  • 00:23:54
    individual sps within sm they have
  • 00:23:56
    access to this shared memory
  • 00:23:58
    programming the sps if they need to
  • 00:24:00
    collaborate over shared memory
  • 00:24:01
    is quite similar to what we were looking
  • 00:24:03
    at in the early part of the term when we
  • 00:24:05
    were doing p threads programming
  • 00:24:06
    in particular all of the sps can access
  • 00:24:09
    directly the shared memory within the sm
  • 00:24:12
    and in order to get them to cooperate
  • 00:24:13
    with one another you need to do things
  • 00:24:15
    like synchronizing
  • 00:24:16
    access to memory so that you don't step
  • 00:24:17
    on each other's toes so
  • 00:24:19
    buried inside of this whole architecture
  • 00:24:22
    is a little shared memory machine
  • 00:24:24
    with some number of sps attached to that
  • 00:24:26
    shared memory
  • 00:24:27
    and then of course because there is this
  • 00:24:30
    large collection of dram
  • 00:24:32
    memory that's attached to the entire gpu
  • 00:24:35
    the the sps can also access that global
  • 00:24:38
    memory as well
  • 00:24:40
    right so we're going to take a kind of a
  • 00:24:42
    quick tour through these generations of
  • 00:24:44
    the the chips chipsets and the micro
  • 00:24:46
    architectures that we've been talking
  • 00:24:47
    about so
  • 00:24:48
    starting out with the tesla architecture
  • 00:24:50
    and we've kind of looked at this
  • 00:24:52
    uh from from the previous illustration
  • 00:24:54
    uh this is another representation of
  • 00:24:56
    that same
  • 00:24:57
    8800 core but i wanted to just kind of
  • 00:25:00
    for consistency here
  • 00:25:01
    break out some of the key statistics so
  • 00:25:03
    this guy has
  • 00:25:04
    eight at this point they were calling
  • 00:25:06
    them graphics processing clusters
  • 00:25:08
    so each of these guys is a gpc and then
  • 00:25:11
    within each of those
  • 00:25:12
    there's two individual
  • 00:25:16
    sms which means we have a total of 16
  • 00:25:20
    streaming multiprocessors and within
  • 00:25:22
    each of those streaming multiprocessors
  • 00:25:24
    as we've seen already there were eight
  • 00:25:26
    streaming processors for a total of 128
  • 00:25:28
    streaming processors in this gpu
  • 00:25:31
    and then there's other things out here
  • 00:25:33
    too this block is
  • 00:25:34
    for information about like shading and
  • 00:25:36
    kind of graphic specific things that
  • 00:25:38
    we're going to just sort of
  • 00:25:39
    ignore and remember that there is an l1
  • 00:25:41
    cache that's associated with the gpc
  • 00:25:44
    and then an l2 cache that can be shared
  • 00:25:47
    among
  • 00:25:48
    multiple sms before we get out to
  • 00:25:51
    frame buffer memory here the fb and then
  • 00:25:54
    we can zoom in again on the individual
  • 00:25:57
    streaming multiprocessor so just one of
  • 00:26:00
    these chunks
  • 00:26:00
    of the the tpc is the streaming
  • 00:26:03
    multiprocessor and it's got streaming
  • 00:26:05
    processors inside of it
  • 00:26:07
    uh and then these other units here have
  • 00:26:08
    more to do with with dealing with some
  • 00:26:10
    graphic specific things like texture
  • 00:26:12
    texture maps
  • 00:26:16
    another release in kind of the second
  • 00:26:19
    generation of tesla
  • 00:26:21
    chips was the geforce 280 this is also
  • 00:26:23
    in 2006.
  • 00:26:24
    i guess there was a little bit more
  • 00:26:26
    emphasis this is again
  • 00:26:27
    an illustration grabbed from some of the
  • 00:26:30
    documentation white papers about these
  • 00:26:31
    guys
  • 00:26:32
    so 3d diagrams was apparently really
  • 00:26:34
    cool at that point
  • 00:26:36
    so uh we have a little bit more capable
  • 00:26:39
    uh chip here we've got 10
  • 00:26:41
    tpcs each of which has three streaming
  • 00:26:43
    multiprocessors so a total of 30
  • 00:26:45
    streaming multiprocessors
  • 00:26:47
    and each of those again had eight uh
  • 00:26:50
    streaming processors for a total of 240
  • 00:26:53
    streaming processors i'll have a little
  • 00:26:54
    summary slide here at the end that lets
  • 00:26:56
    us kind of compare
  • 00:26:57
    these key statistics that i've kind of
  • 00:26:59
    outlined here
  • 00:27:00
    for the number of sms and the number of
  • 00:27:02
    sps that's really the critical
  • 00:27:04
    uh the critical number that we're
  • 00:27:06
    interested in when we're thinking about
  • 00:27:07
    how do we program this
  • 00:27:08
    uh using the cuda platform
  • 00:27:12
    this is a a little larger view of
  • 00:27:15
    an individual tpc right so there's ten
  • 00:27:18
    of these
  • 00:27:18
    so we're looking at one of these guys
  • 00:27:20
    and you can see here's
  • 00:27:21
    the three streaming multiprocessors each
  • 00:27:24
    of which has eight
  • 00:27:25
    individual again cores or s this is the
  • 00:27:28
    streaming processor the sp or the core
  • 00:27:31
    and then there's some local memory that
  • 00:27:33
    would be the shared memory that's
  • 00:27:34
    available to all these cores
  • 00:27:35
    and then some texture stuff and a shared
  • 00:27:38
    l1 cache
  • 00:27:39
    across this this tpc
  • 00:27:44
    okay 2010 saw the introduction of the
  • 00:27:47
    fermi
  • 00:27:48
    microarchitecture and you can see here
  • 00:27:51
    this again ripped from the headlines
  • 00:27:54
    just straight from the documentation
  • 00:27:56
    here we're showing that there's 16
  • 00:27:58
    streaming multiprocessors so they're not
  • 00:28:00
    really breaking these out in
  • 00:28:02
    in detail as tpcs at this point in
  • 00:28:04
    history i'm not sure exactly why
  • 00:28:05
    and we can see though so here's
  • 00:28:07
    basically one of those streaming
  • 00:28:08
    multiprocessors
  • 00:28:10
    and each of those has 32 sps for a total
  • 00:28:13
    of 512 sps
  • 00:28:15
    and they're just kind of laying this out
  • 00:28:16
    differently we still have local cache
  • 00:28:18
    here within
  • 00:28:19
    the the streaming multiprocessors
  • 00:28:21
    there's a level two cache as well
  • 00:28:23
    and then the the blocks on the outside
  • 00:28:25
    are basically showing you
  • 00:28:27
    interfaces to the dynamic memory
  • 00:28:30
    if we zoom in on one of these sms we can
  • 00:28:33
    see here that
  • 00:28:34
    we've got a bunch of cores i mentioned
  • 00:28:37
    32 cores or 32 sps per sm so each of
  • 00:28:41
    these is a core and there's actually a
  • 00:28:42
    little exploded diagram here of the core
  • 00:28:44
    notice that this guy has both a floating
  • 00:28:46
    point unit and
  • 00:28:48
    an integer unit so there's two different
  • 00:28:50
    paths through this core that allow you
  • 00:28:52
    to
  • 00:28:53
    simultaneously do a floating point
  • 00:28:55
    operation
  • 00:28:56
    and an integer operation so you kind of
  • 00:28:59
    get double your money if you want for
  • 00:29:01
    the pro
  • 00:29:01
    for certain kinds of calculations you
  • 00:29:03
    can have both of those things active
  • 00:29:05
    i think at this point we're still in the
  • 00:29:07
    single precision uh
  • 00:29:09
    world so if you wanted to do double
  • 00:29:10
    precision arithmetic you
  • 00:29:12
    had to figure out how to how to use
  • 00:29:14
    these things to do that
  • 00:29:17
    we've also here called out the load
  • 00:29:19
    store unit so this is a way of
  • 00:29:21
    streamlining access to memory
  • 00:29:23
    and then the special function units of
  • 00:29:25
    which there's just four
  • 00:29:27
    because presumably you're not going to
  • 00:29:28
    do that many you know sine and cosine
  • 00:29:30
    and transcendental operations
  • 00:29:32
    uh as you are normal sort of
  • 00:29:36
    arithmetic operations you can see here
  • 00:29:38
    there's shared memory the 64k
  • 00:29:40
    of shared memory per
  • 00:29:42
    sm which results in a lot of memory in
  • 00:29:44
    in total
  • 00:29:45
    a connection interconnection network to
  • 00:29:47
    allow you to stream calculations from
  • 00:29:48
    one processor to the next
  • 00:29:50
    and the cores also we haven't seen this
  • 00:29:52
    yet this is essentially
  • 00:29:54
    core local memory so
  • 00:29:58
    in the if you recall in this in a cpu uh
  • 00:30:01
    each of the processor cores has a
  • 00:30:03
    register file right sort of
  • 00:30:04
    really fast nearby static ram that's
  • 00:30:08
    really quick to access by the cores
  • 00:30:10
    there's a similar thing here for the
  • 00:30:11
    cuda cores the register file is quite a
  • 00:30:13
    lot larger
  • 00:30:15
    it's 32,000 entries of 32-bit
  • 00:30:19
    registers
  • 00:30:20
    and those are divided evenly across the
  • 00:30:23
    individual cores and since there's 32 of
  • 00:30:25
    those on this chip
  • 00:30:26
    each of those is going to get 1000 or 1k
  • 00:30:29
    32-bit registers that it can
  • 00:30:31
    individually access really quickly
  • 00:30:33
    so there's really three levels in the
  • 00:30:34
    memory hierarchy an individual core
  • 00:30:36
    has its own dedicated registers from the
  • 00:30:39
    register file so this is split up
  • 00:30:41
    32 different ways those cores can all
  • 00:30:44
    access the shared memory the 64k of
  • 00:30:46
    shared memory here
  • 00:30:48
    and then they can also access what's
  • 00:30:50
    called the global memory which is really
  • 00:30:51
    just the dram
  • 00:30:52
    frame buffers on the gpu card itself
  • 00:30:57
    next up was kepler in 2012
  • 00:31:00
    and the illustration just gets more and
  • 00:31:03
    more dense
  • 00:31:04
    um so you can see there's you know many
  • 00:31:06
    more components
  • 00:31:07
    that are fitting on the on the chip
  • 00:31:09
    surface here and again this is just a
  • 00:31:10
    result of having
  • 00:31:12
    a higher density fabrication process
  • 00:31:14
    that lets you put more processing power
  • 00:31:16
    on the same chip so this guy has
  • 00:31:18
    15 streaming multiprocessors each of
  • 00:31:21
    which has 192
  • 00:31:23
    processors for a total of almost 3 000
  • 00:31:26
    sps on a single chip otherwise the
  • 00:31:29
    architecture is quite similar
  • 00:31:30
    level 1 cache level 2 cache access to
  • 00:31:33
    dram
  • 00:31:34
    local register files and so forth here's
  • 00:31:36
    an exploded view of an almost
  • 00:31:39
    in almost impossible to read a view of
  • 00:31:42
    this
  • 00:31:42
    of this chip but you can see that
  • 00:31:44
    there's a whole variety of different
  • 00:31:45
    cores now that are going to be included
  • 00:31:46
    in here
  • 00:31:47
    some of them are starting to be able to
  • 00:31:49
    provide native
  • 00:31:51
    improvements to floating point
  • 00:31:52
    calculations and so forth they've all
  • 00:31:54
    got load store units
  • 00:31:56
    and then there's also some special
  • 00:31:57
    function units here for transcendental
  • 00:32:00
    operations similarly there's still
  • 00:32:02
    shared memory
  • 00:32:03
    and each of these guys is also going to
  • 00:32:05
    have pieces of a register file
  • 00:32:07
    so it's kind of more of the same being
  • 00:32:09
    stamped out here
  • 00:32:10
    to provide higher throughput
  • 00:32:12
    calculations
  • 00:32:14
    moved to the maxwell architecture in
  • 00:32:16
    2014 so these are nicely evenly spaced for
  • 00:32:19
    the most part in two year increments
  • 00:32:21
    i couldn't find a picture of a maxwell
  • 00:32:24
    gpu but i did find an illustration of a
  • 00:32:26
    maxwell
  • 00:32:27
    streaming multiprocessor nvidia kind of
  • 00:32:30
    played with names here so they called
  • 00:32:32
    some of the sms
  • 00:32:33
    sms and some of them smxs and smms
  • 00:32:36
    it's all kind of the same idea here
  • 00:32:38
    we've got on the
  • 00:32:40
    typical maxwell architecture we're going
  • 00:32:42
    to have 16 of these sms
  • 00:32:44
    with 128 sps per sm so this is one of
  • 00:32:47
    those
  • 00:32:48
    sms and you end up with 2
  • 00:32:51
    000 streaming processors
  • 00:32:54
    per chip again very similar kinds of
  • 00:32:57
    pieces of functionality here i'm
  • 00:32:59
    not going to drill down into these for
  • 00:33:01
    each for each chip set but
  • 00:33:03
    you get the idea that it's just kind of
  • 00:33:05
    more of the same
  • 00:33:07
    2016 introduced the pascal architecture
  • 00:33:10
    and as you can see higher density we've
  • 00:33:13
    got
  • 00:33:14
    now they're starting to use the term
  • 00:33:16
    graphics processing cluster
  • 00:33:18
    graphics processor cluster so there's
  • 00:33:19
    six of these and you can see each of
  • 00:33:21
    those
  • 00:33:22
    is this larger group of of processing
  • 00:33:25
    elements
  • 00:33:26
    within each of those there's 10
  • 00:33:28
    streaming multiprocessors
  • 00:33:30
    right so back at the early generations
  • 00:33:32
    of tesla we had
  • 00:33:33
    10 streaming processors on the whole
  • 00:33:35
    chip now we've got 10 of them just in
  • 00:33:37
    this one little
  • 00:33:38
    section of the chip and we can stamp
  • 00:33:39
    that out a bunch of different times so
  • 00:33:41
    with six gpcs and 10 sms per we've got
  • 00:33:44
    60 streaming multiprocessors
  • 00:33:46
    and those can each have 64 sps
  • 00:33:49
    giving us a total of almost 4 000
  • 00:33:52
    streaming processors on one chip
  • 00:33:54
    otherwise similar things level 1 cache
  • 00:33:56
    level 2 cache register files
  • 00:33:58
    memory access
  • 00:34:02
    here's a close-up view of the individual
  • 00:34:05
    sms and sps so you can see here there's
  • 00:34:09
    some ordinary cores which are similar to
  • 00:34:11
    what we've been seeing already
  • 00:34:12
    but there's these dp units stand for
  • 00:34:15
    double precision units so they know how
  • 00:34:17
    to do double precision arithmetic
  • 00:34:20
    as well as single precision there's a
  • 00:34:22
    load store unit for memory access and
  • 00:34:23
    then a special function unit for each
  • 00:34:25
    group of four cores
  • 00:34:27
    so again more capability on the chip
  • 00:34:30
    giving you more flexibility for the
  • 00:34:32
    calculations you're going to do
  • 00:34:34
    shared memory there's a register file
  • 00:34:36
    for the individual processors and of
  • 00:34:38
    course we have access to global memory
  • 00:34:42
    volta in 2017 again more of the same
  • 00:34:47
    we've got now still six graphics
  • 00:34:49
    processing clusters but instead of 10
  • 00:34:51
    sms per we've got 14 per for a total of
  • 00:34:53
    84
  • 00:34:54
    streaming multiprocessors and now we're
  • 00:34:57
    to the point where we're kind of
  • 00:34:58
    pulling apart single precision and
  • 00:35:01
    double precision so each of these sms
  • 00:35:03
    has 64 single precision cores for
  • 00:35:06
    floating point
  • 00:35:08
    64 cores for 32-bit integer
  • 00:35:11
    arithmetic
  • 00:35:12
    and 32 cores for double precision
  • 00:35:15
    floating point
  • 00:35:16
    and that's kind of the holy grail here
  • 00:35:18
    for gpus in general purpose computing
  • 00:35:20
    most of the
  • 00:35:21
    modeling that gets done in physical
  • 00:35:24
    simulations and similar sorts of things
  • 00:35:27
    requires double precision arithmetic to
  • 00:35:29
    be at all accurate and so the
  • 00:35:31
    introduction of
  • 00:35:33
    dedicated cores on these gpus was a big
  • 00:35:36
    deal
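
(A minimal sketch, not from the lecture, of why those dedicated fp64 cores matter: the same templated kernel instantiated for float and for double. On parts with dedicated double-precision units the second launch keeps a usable fraction of the single-precision rate; on consumer parts without them it falls far behind. Array contents are left uninitialized since only throughput is of interest here.)

```cpp
// Same kernel, two precisions: y = a*x + y.
#include <cuda_runtime.h>

template <typename T>
__global__ void axpy(int n, T a, const T *x, T *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // one fused multiply-add per element
}

int main() {
    const int n = 1 << 20;
    float  *xf, *yf;
    double *xd, *yd;
    cudaMalloc(&xf, n * sizeof(float));  cudaMalloc(&yf, n * sizeof(float));
    cudaMalloc(&xd, n * sizeof(double)); cudaMalloc(&yd, n * sizeof(double));

    // The float version runs on the fp32 cores; the double version needs fp64
    // units, dedicated ones on chips like Volta's (32 per SM).
    axpy<<<(n + 255) / 256, 256>>>(n, 2.0f, xf, yf);
    axpy<<<(n + 255) / 256, 256>>>(n, 2.0,  xd, yd);
    cudaDeviceSynchronize();

    cudaFree(xf); cudaFree(yf); cudaFree(xd); cudaFree(yd);
    return 0;
}
```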
  • 00:35:36
    and then if we you know kind of multiply
  • 00:35:38
    these things out we see that we're
  • 00:35:40
    getting in excess of 5000 single
  • 00:35:42
    precision cores
  • 00:35:43
    and almost three thousand double
  • 00:35:45
    precision floating point cores
  • 00:35:47
    on one chip here's
  • 00:35:50
    an illustration of
  • 00:35:54
    an individual sm with its sps
  • 00:35:58
    kind of similar to before what's new
  • 00:35:59
    here is these tensor cores
  • 00:36:02
    so as i mentioned before a lot of the
  • 00:36:05
    important uses of a gpu these days are in
  • 00:36:08
    the machine learning
  • 00:36:10
    area where we're training deep neural
  • 00:36:12
    networks to
  • 00:36:13
    understand classification problems and
  • 00:36:15
    similar sorts of things
  • 00:36:17
    and the dedicated
  • 00:36:20
    hardware that nvidia builds onto their
  • 00:36:21
    chips
  • 00:36:22
    focused on providing capabilities
  • 00:36:25
    for that is referred to as tensor cores
  • 00:36:27
    so these are kind of purpose-built
  • 00:36:30
    variations on the underlying
  • 00:36:32
    sp processor that directly perform
  • 00:36:35
    some of the basic calculations
  • 00:36:38
    that are necessary for
  • 00:36:39
    training and evaluating a neural net
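
(For a flavor of how those tensor cores are reached from software, here's a minimal sketch using CUDA's wmma API, in which a single warp hands one 16x16x16 matrix multiply-accumulate, d = a*b + c, to the tensor cores. It assumes a Volta-or-newer GPU, i.e. compiled with at least -arch=sm_70; the device pointers in the launch comment are hypothetical.)

```cpp
// Minimal tensor-core sketch: one warp computes a 16x16x16 tile of D = A*B + C
// through the wmma API. Compile with e.g. nvcc -arch=sm_70.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tile_mma(const half *a, const half *b, float *d) {
    // Fragments are opaque, warp-wide views of the 16x16 operand tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // use C = 0 for this sketch
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);            // the tensor-core operation
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}

// Launch example, one warp (a_dev, b_dev, d_dev are hypothetical device pointers):
//   tile_mma<<<1, 32>>>(a_dev, b_dev, d_dev);
```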
  • 00:36:44
    turing in 2018 here again more of the same
  • 00:36:48
    we've got 72
  • 00:36:49
    sms in total 64 cores per sm
  • 00:36:53
    8 tensor cores per sm so that kind of
  • 00:36:56
    continues going forward so we've got
  • 00:36:58
    almost 5 000 cuda cores and
  • 00:37:01
    more than 500 tensor cores otherwise
  • 00:37:04
    l1 l2 memory access and so forth is
  • 00:37:06
    quite similar
  • 00:37:08
    so this is just a diagram of
  • 00:37:11
    the chip
  • 00:37:12
    this right here is actually a micrograph
  • 00:37:14
    so that's actually
  • 00:37:15
    what the components on the chip itself
  • 00:37:17
    look like
  • 00:37:18
    and you can see that you know when we
  • 00:37:20
    say stamp out another core that's
  • 00:37:22
    literally what's going on here they're
  • 00:37:23
    just duplicating that particular chunk
  • 00:37:25
    of the
  • 00:37:26
    image that's used to
  • 00:37:28
    etch the
  • 00:37:29
    components into the surface of the
  • 00:37:31
    silicon
  • 00:37:32
    and it's really evident how that's set
  • 00:37:35
    up when you see an actual picture of it
  • 00:37:38
    and then here's the detail of
  • 00:37:40
    a single sm
  • 00:37:42
    and its sps same as before right
  • 00:37:44
    we've got some integer
  • 00:37:46
    units some floating
  • 00:37:47
    point units some tensor cores
  • 00:37:49
    register files cache special function
  • 00:37:51
    units and so forth
  • 00:37:55
    finally the ampere architecture is the
  • 00:37:57
    current one
  • 00:37:58
    so here's an ampere gpu
  • 00:38:01
    it's got seven graphics processing
  • 00:38:06
    clusters well this looks like eight in
  • 00:38:09
    the picture
  • 00:38:10
    with 12 sms per so a total of 84
  • 00:38:14
    streaming multiprocessors and then some
  • 00:38:18
    additional statistics here 128 cores per
  • 00:38:21
    sm so finally you
  • 00:38:22
    know we're over 10 000 cuda cores just
  • 00:38:25
    on this one chip
  • 00:38:26
    across all of the sms within each gpc
  • 00:38:29
    and 336 tensor cores you know one of the
  • 00:38:32
    things that you see when you look at
  • 00:38:33
    these statistics over time is
  • 00:38:35
    that the manufacturer kind of goes back
  • 00:38:37
    and forth on what they are
  • 00:38:38
    truly trying to emphasize here
  • 00:38:40
    how much space on the chip do they
  • 00:38:44
    dedicate to different kinds of
  • 00:38:45
    processors
  • 00:38:46
    and they're really trying to respond to
  • 00:38:48
    market demands right is it going to be
  • 00:38:49
    more important for us to just do raw
  • 00:38:51
    graphics processing is it going to be
  • 00:38:52
    more important for us to provide good
  • 00:38:54
    capabilities for machine learning and
  • 00:38:55
    that kind of stuff
  • 00:38:57
    and you can see maybe a little decreased
  • 00:38:58
    emphasis in a way on the tensor cores
  • 00:39:00
    from the previous generation
  • 00:39:02
    architecture and more of an emphasis on
  • 00:39:05
    just individual
  • 00:39:06
    standard cuda cores i should
  • 00:39:09
    also point out that
  • 00:39:10
    this is kind of one sample of one
  • 00:39:13
    particular chip
  • 00:39:14
    at each of the architecture levels if
  • 00:39:16
    you go
  • 00:39:17
    look around at different releases from
  • 00:39:19
    nvidia
  • 00:39:20
    you'll see that they will have
  • 00:39:23
    various
  • 00:39:25
    variations on these statistics
  • 00:39:27
    within the same processor family so
  • 00:39:29
    there might be
  • 00:39:30
    you know i'm making this up but
  • 00:39:31
    10 chips in the ampere family
  • 00:39:34
    that have different numbers of each of
  • 00:39:35
    these types of processors
  • 00:39:37
    that are intended to kind of focus on
  • 00:39:38
    different markets right is this a chip
  • 00:39:40
    that's going to be
  • 00:39:41
    used for high performance super
  • 00:39:43
    computing is this a chip
  • 00:39:44
    that's going to be used for you know
  • 00:39:47
    high capacity desktop
  • 00:39:49
    rendering like at a graphics company or
  • 00:39:52
    a film production company or
  • 00:39:54
    for a high-end engineering workstation
  • 00:39:56
    or is this going to be a gpu that's
  • 00:39:57
    going to be used in a mobile device
  • 00:39:59
    which obviously has considerable
  • 00:40:01
    constraints around power and that sort
  • 00:40:03
    of thing
  • 00:40:03
    so these are not the only
  • 00:40:06
    versions of
  • 00:40:07
    the statistics for these
  • 00:40:08
    chips it's just kind of a representative
  • 00:40:10
    sample
  • 00:40:13
    and then here's the ampere streaming
  • 00:40:16
    multiprocessor
  • 00:40:17
    and again you can see integer units
  • 00:40:19
    floating point units 64-bit floating
  • 00:40:21
    point units
  • 00:40:22
    and then the tensor core as well
  • 00:40:26
    by way of summary so here's kind of the
  • 00:40:29
    same
  • 00:40:29
    historic perspective here going
  • 00:40:31
    back to 2008 and the different micro
  • 00:40:33
    architectures
  • 00:40:34
    and what i've done here is just
  • 00:40:35
    collected together the numbers from these slides
  • 00:40:38
    and again these are just representative
  • 00:40:39
    numbers you can find
  • 00:40:41
    parts in each of these product families
  • 00:40:43
    that have more or less
  • 00:40:45
    sms or sps but just kind of a
  • 00:40:47
    representative sample
  • 00:40:49
    back here we had 128 sps now we're
  • 00:40:52
    at 10 000 sps which is pretty remarkable
  • 00:40:55
    growth
  • 00:40:57
    i also wanted to point out here that
  • 00:40:59
    the
  • 00:41:00
    processors on what we call
  • 00:41:02
    our pseudo lab
  • 00:41:04
    those machines actually have a quadro
  • 00:41:05
    k620 which is not a super recent
  • 00:41:09
    gpu board in them
  • 00:41:13
    you can tell here by the model number
  • 00:41:14
    with the k in it that this actually
  • 00:41:16
    follows the kepler architecture
  • 00:41:18
    and the kepler that we looked at earlier
  • 00:41:20
    has 15 sms and
  • 00:41:21
    a lot of sps the ones in the pseudo lab
  • 00:41:24
    actually are
  • 00:41:24
    rather more modest desktop focused
  • 00:41:27
    releases of this chipset
  • 00:41:28
    that have only 384
  • 00:41:33
    sps so we'll just have to make do if
  • 00:41:34
    that's what you're using if you've got a
  • 00:41:36
    more capable
  • 00:41:37
    laptop or desktop you may be in this
  • 00:41:40
    range instead
  • 00:41:41
    that's all good as we'll see
  • 00:41:44
    the programming model that's provided by
  • 00:41:46
    cuda and this is really an important
  • 00:41:48
    idea to keep in mind as you're
  • 00:41:49
    developing software for these guys
  • 00:41:51
    is that it's designed to run exactly the same
  • 00:41:54
    code
  • 00:41:55
    no matter where you are in the history
  • 00:41:58
    of these architectures
  • 00:41:59
    in particular it's designed such that
  • 00:42:01
    you can actually run this on future
  • 00:42:03
    architectures so when the hopper
  • 00:42:04
    architecture
  • 00:42:05
    becomes available as actual silicon
  • 00:42:07
    we'll be able to take those same
  • 00:42:09
    programs and
  • 00:42:10
    run them on the hopper architecture
  • 00:42:12
    without any modifications at all
  • 00:42:14
    because of the abstractions that are
  • 00:42:15
    provided by
  • 00:42:18
    the cuda model for programming
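
(As a small illustration of that portability, and again just a sketch rather than part of the lecture: the code below asks whatever GPU it lands on for its compute capability and SM count, then launches a grid-stride kernel sized from that count, so the identical source runs unchanged on a 384-core Quadro K620 or a 10 000-core Ampere part.)

```cpp
// Portability sketch: query the device, then size the launch from what we find.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float s) {
    // Grid-stride loop: correct for any grid size, on any architecture.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("%s: compute capability %d.%d, %d SMs\n",
                prop.name, prop.major, prop.minor, prop.multiProcessorCount);

    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // A modest heuristic: a few blocks per SM; the kernel is correct either way.
    int blocks = 4 * prop.multiProcessorCount;
    scale<<<blocks, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```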
Tags
  • CUDA
  • GPU
  • CPU
  • Nvidia
  • Architecture
  • Tensor Cores
  • Pascal
  • Ampere
  • Fermi
  • Kepler