00:00:00
How many calculations do you think your
graphics card performs every second
00:00:04
while running video games with incredibly
realistic graphics? Maybe 100 million? Well,
00:00:11
100 million calculations a second is what’s
required to run Mario 64 from 1996. We need
00:00:21
more power. Maybe 100 billion calculations a
second? Well, then you would have a computer
00:00:27
that could run Minecraft back in 2011. In order
to run the most realistic video games such as
00:00:34
Cyberpunk 2077 you need a graphics card that can
perform around 36 trillion calculations a second.
00:00:43
This is an unimaginably large number, so let’s
take a second to try to conceptualize it. Imagine
00:00:50
doing a long multiplication problem once every
second. Now let’s say everyone on the planet does
00:00:57
a similar type of calculation but with different
numbers. To reach the equivalent computational
00:01:03
power of this graphics card and its 36 trillion
calculations a second we would need about 4,400
00:01:12
Earths filled with people, all working together
and completing one calculation each every second.
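To sanity-check that figure with quick arithmetic (a sketch; the world population value here is an assumption):

```python
# Rough check of the "4,400 Earths" figure.
# Assumes ~8.1 billion people, each doing one calculation per second.
gpu_rate = 36e12          # calculations per second (modern high-end GPU)
people_per_earth = 8.1e9  # assumed world population
earths = gpu_rate / people_per_earth
print(f"{earths:,.0f} Earths")  # ~4,444 Earths
```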
00:01:20
It’s rather mind-boggling to think that a
device can manage all these calculations,
00:01:25
so in this video we’ll see how graphics cards work
in two parts. First, we’ll open up this graphics
00:01:32
card and explore the different components inside,
as well as the physical design and architecture
00:01:38
of the GPU or graphics processing unit. Second,
we’ll explore the computational architecture and
00:01:46
see how GPUs process mountains of data, and why
they’re ideal for running video game graphics,
00:01:53
Bitcoin mining, neural networks and AI.
So, stick around and let’s jump right in.
00:02:08
This video is sponsored by
Micron which manufactures
00:02:11
the graphics memory inside this graphics card.
Before we dive into all the parts of the GPU,
00:02:19
let’s first understand the differences between
GPUs and CPUs. Inside this graphics card,
00:02:26
the Graphics Processing Unit or GPU has over
10,000 cores. However, when we look at the
00:02:33
CPU or Central Processing Unit that’s mounted to
the motherboard, we find an integrated circuit or
00:02:40
chip with only 24 cores. So, which one is more
powerful? 10 thousand is a lot more than 24,
00:02:48
so you would think the GPU is more powerful,
however, it’s more complicated than that.
00:02:55
A useful analogy is to think of a GPU as a massive
cargo ship and a CPU as a jumbo jet airplane.
00:03:03
The amount of cargo capacity is the amount of
calculations and data that can be processed,
00:03:09
and the speed of the ship or airplane is the rate at which those calculations
00:03:15
and data are processed. Essentially,
it’s a trade-off between a massive number
00:03:21
of calculations that are executed at a
slower rate versus a few calculations
00:03:27
that can be performed at a much faster rate.
Another key difference is that airplanes are a
00:03:32
lot more flexible since they can carry passengers,
packages, or containers and can take off and land
00:03:39
at any one of tens of thousands of airports.
Likewise CPUs are flexible in that they can run
00:03:46
a variety of programs and instructions. However,
giant cargo ships carry only containers with bulk
00:03:53
contents inside and are limited to traveling
between ports. Similarly, GPUs are a lot
00:04:00
less flexible than CPUs and can only run simple
instructions like basic arithmetic. Additionally,
00:04:08
GPUs can’t run operating systems or interface
with input devices or networks. This analogy
00:04:15
isn’t perfect, but it helps to answer the question
of “which is faster, a CPU or a GPU?”. Essentially
00:04:24
if you want to perform a set of calculations
across mountains of data, then a GPU will be
00:04:30
faster at completing the task. However, if you
have a lot less data that needs to be evaluated
00:04:36
quickly, then a CPU will be faster. Furthermore,
if you need to run an operating system or support
00:04:43
network connections and a wide range of different
applications and hardware, then you’ll want a CPU.
00:04:50
We’re planning a separate video on CPU
architecture, so make sure to subscribe
00:04:55
so you don’t miss it, but let’s now dive into this
graphics card and see how it works. In the center
00:05:02
of this graphics card is the printed circuit
board or PCB, with all the various components
00:05:08
mounted on it, and
00:05:10
we’ll start by exploring the brains which is
the graphics processing unit or GPU. When we
00:05:17
open it up, we find a large chip or die named
GA102 built from 28.3 billion transistors. The
00:05:26
majority of the area of the chip is taken up by
the processing cores which have a hierarchical
00:05:33
organization. Specifically, the chip is divided
into 7 Graphics Processing Clusters or GPCs,
00:05:41
and within each processing cluster are 12
streaming multiprocessors or SMs. Next,
00:05:48
inside each of these streaming multiprocessors
are 4 warps and 1 ray tracing core, and then,
00:05:56
inside each warp are 32 CUDA or shading cores and 1 tensor core. Across the entire GPU are 10,752
00:06:08
CUDA cores, 336 Tensor Cores, and 84 Ray Tracing
Cores. These three types of cores execute all the
00:06:18
calculations of the GPU, and each has a different
function. CUDA cores can be thought of as simple
00:06:24
binary calculators with an addition button, a
multiply button and a few others, and are used
00:06:31
the most when running video games. Tensor cores
are matrix multiplication and addition calculators
00:06:38
and are used for geometric transformations
and working with neural networks and AI. And
00:06:45
ray tracing cores are the largest but the fewest
and are used to execute ray tracing algorithms.
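Those totals follow directly from the hierarchy just described; here is the multiplication written out as a small Python sketch:

```python
# Core counts implied by the GA102 hierarchy described above.
gpcs = 7                  # graphics processing clusters per chip
sms_per_gpc = 12          # streaming multiprocessors per cluster
warps_per_sm = 4          # "warps" (partitions) per SM
cuda_per_warp = 32        # CUDA/shading cores per warp
tensor_per_warp = 1       # tensor cores per warp
rt_per_sm = 1             # ray tracing cores per SM

sms = gpcs * sms_per_gpc                       # 84
cuda = sms * warps_per_sm * cuda_per_warp      # 10,752
tensor = sms * warps_per_sm * tensor_per_warp  # 336
rt = sms * rt_per_sm                           # 84
print(sms, cuda, tensor, rt)  # 84 10752 336 84
```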
00:06:53
Now that we understand the computational
resources inside this chip, one rather interesting
00:06:59
fact is that the 3080, 3090, 3080 Ti, and 3090 Ti
graphics cards all use the same GA102 chip design
00:07:11
for their GPU. This might be counterintuitive
because they have different prices and were
00:07:16
released in different years, but it’s true.
So, why is this? Well, during the manufacturing
00:07:23
process sometimes patterning errors, dust
particles, or other manufacturing issues
00:07:29
cause damage and create defective areas of the
circuit. Instead of throwing out the entire chip
00:07:35
because of a small defect, engineers find the
defective region and permanently isolate and
00:07:41
deactivate the nearby circuitry. By having a GPU
with a highly repetitive design, a small defect in
00:07:49
one core only damages that particular streaming
multiprocessor circuit and doesn’t affect the
00:07:55
other areas of the chip. As a result, these chips
are tested and categorized or binned according
00:08:02
to the number of defects. The 3090 Ti graphics cards have flawless GA102 chips with all 10,752
00:08:13
CUDA cores working properly, the 3090 has 10,496 cores working, the 3080 Ti has 10,240, and the 3080
00:08:26
has 8,704 CUDA cores working, which, at 128 CUDA cores per streaming multiprocessor, is equivalent to having 16 damaged and deactivated streaming
00:08:34
multiprocessors. Additionally, different graphics
cards differ by their maximum clock speed and the
00:08:41
quantity and generation of graphics memory that
supports the GPU, which we’ll explore in a little
00:08:47
bit. Because we’ve been focusing on the physical
architecture of this GA102 GPU chip, let’s zoom
00:08:54
into one of these CUDA cores and see what it looks
like. Inside this simple calculator is a layout
00:09:00
of approximately 410 thousand transistors. This
section of 50 thousand transistors performs the
00:09:08
operation of A times B plus C which is called
fused multiply and add or FMA and is the most
00:09:17
common operation performed by graphics cards.
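As a minimal sketch of that operation, here it is in plain Python with numpy standing in for the GPU's own instruction:

```python
import numpy as np

# Fused multiply-add, a*b + c, the operation a CUDA core performs.
# Done here in 32-bit floating point to mirror the FP32 cores; a true
# hardware FMA keeps full precision internally and rounds only once.
a = np.float32(2.5)
b = np.float32(4.0)
c = np.float32(1.5)
result = a * b + c  # two roundings here, unlike a real fused FMA
print(result)       # 11.5
```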
Half of the CUDA cores execute FMA using 32-bit
00:09:24
floating-point numbers, which is essentially
scientific notation, and the other half
00:09:29
of the cores use either 32-bit integers or 32-bit
floating point numbers. Other sections of this
00:09:36
core accommodate negative numbers and perform
other simple functions like bit-shifting and bit
00:09:42
masking as well as collecting and queueing
the incoming instructions and operands,
00:09:47
and then accumulating and outputting the results.
As a result, this single core is just a simple
00:09:55
calculator with a limited number of functions.
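As a back-of-the-envelope check of the throughput figure quoted next (a sketch assuming one multiply and one add per core per cycle, and a round 1.7 gigahertz clock; the exact boost clock accounts for the small difference):

```python
# Estimated throughput of the 3090's CUDA cores.
cores = 10_496      # working CUDA cores on the 3090's GA102
ops_per_cycle = 2   # one multiply + one add (FMA) per clock cycle
clock_hz = 1.7e9    # assumed ~1.7 GHz clock
total = cores * ops_per_cycle * clock_hz
print(total / 1e12, "trillion calculations per second")  # ~35.7
```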
This calculator completes one multiply and one add
00:10:01
operation each clock cycle, and therefore with this 3090 graphics card and its 10,496 cores and 1.7
00:10:12
gigahertz clock, we get 35.6 trillion calculations
a second. However, if you’re wondering how the GPU
00:10:20
handles more complicated operations like division,
square root, and trigonometric functions, well,
00:10:27
these calculator operations are performed by
the special function units which are far fewer
00:10:33
as only 4 of them can be found in each streaming
multiprocessor. Now that we have an understanding
00:10:38
of what’s inside a single core, let’s zoom out
and take a look at the other sections of the GA102
00:10:45
chip. Around the edge we find 12 graphics memory
controllers, the NVLink Controllers and the PCIe
00:10:54
interface. On the bottom is a 6-megabyte Level
2 SRAM Memory Cache, and here’s the Gigathread
00:11:01
Engine which manages all the graphics processing
clusters and streaming multiprocessors inside.
00:11:08
Now that we’ve explored this GA102 GPU’s physical
architecture, let’s zoom out and take a look at
00:11:15
the other parts inside the graphics card. On this
side are the various ports for the displays to be
00:11:22
plugged into, on the other side is the incoming
12 Volt power connector, and then here are the
00:11:28
PCIe pins that plug into the motherboard. On
the PCB, the majority of the smaller components
00:11:36
constitute the voltage regulator module which
takes the incoming 12 volts and converts it to
00:11:42
1.1 volts and supplies hundreds
of watts of power to the GPU. Because all
00:11:49
this power heats up the GPU, most of the weight
of the graphics card is in the form of a heat
00:11:54
sink with 4 heat pipes that carry heat from the
GPU and memory chips to the radiator fins where
00:12:01
fans then help to remove the heat. Perhaps some of
the most important components, aside from the GPU,
00:12:09
are the 24 gigabytes of graphics memory chips
which are technically called GDDR6X SDRAM and
00:12:17
were manufactured by Micron which is the sponsor
of this video. Whenever you start up a video game
00:12:23
or wait for a loading screen, the time it takes
to load is mostly spent moving all the 3D models
00:12:29
of a particular scene or environment from the
solid-state drive into these graphics memory
00:12:35
chips. As mentioned earlier, the GPU has a small
amount of data storage in its 6-megabyte shared
00:12:41
Level 2 cache which can hold the equivalent of
about this much of the video game’s environment.
00:12:47
Therefore in order to render a video game,
different chunks of the scene are continuously being
00:12:53
transferred between the graphics memory and the
GPU. Because the cores are constantly performing
00:12:59
tens of trillions of calculations a second,
GPUs are data hungry machines and need to be
00:13:06
continuously fed terabytes upon terabytes of data,
and thus these graphics memory chips are designed
00:13:14
kind of like multiple cranes loading a cargo ship
at the same time. Specifically, these 24 chips
00:13:21
transfer a combined 384 bits at a time, which is called the bus width, and the total data that
00:13:28
can be transferred every second, or the bandwidth, is about 1.15 terabytes a second, which works out to each of the 384 wires carrying roughly 24 billion bits a second. In contrast, the sticks of DRAM
00:13:37
that support the CPU only have a 64-bit bus width
and a maximum bandwidth closer to 64 gigabytes a
00:13:44
second. One rather interesting thing is that you
may think that computers only work using binary
00:13:50
ones and zeros. However, in order to increase data
transfer rates, GDDR6X and the latest graphics
00:13:58
memory, GDDR7, send and receive data across the bus
wires using multiple voltage levels beyond just
00:14:06
0 and 1. For example, GDDR7 uses 3 different
encoding schemes to combine binary bits into
00:14:14
ternary digits or PAM-3 symbols with voltages of
0, 1, and negative 1. Here’s the encoding scheme
00:14:22
on how 3 binary bits are encoded into 2 ternary digits, which works because 2 ternary digits offer 3^2 = 9 states, enough for the 2^3 = 8 combinations of 3 bits. This scheme is combined with an 11-bit
00:14:29
to 7-ternary-digit encoding scheme (3^7 = 2,187 states covering 2^11 = 2,048 combinations), resulting in 276 binary bits being sent using only 176 ternary
00:14:39
digits. The previous generation, GDDR6X, which
is the memory in this 3090 graphics card, used a
00:14:47
different encoding scheme, called PAM-4, to send
2 bits of data using 4 different voltage levels,
00:14:54
however, engineers in the graphics memory
industry agreed to switch to PAM-3 for future
00:15:00
generations of graphics chips in order to reduce
encoder complexity, improve the signal to noise
00:15:06
ratio, and improve power efficiency. Micron
delivers consistent innovation to push the
00:15:15
boundaries on how much data can be transferred
every second and to design cutting edge memory
00:15:20
chips. Another advancement by Micron is the
development of HBM, or high-bandwidth memory,
00:15:27
that surrounds AI chips. HBM is built from
stacks of DRAM memory chips and uses TSVs
00:15:35
or through-silicon vias, to connect this stack
into a single chip, essentially forming a cube
00:15:42
of AI memory. For the latest generation of high
bandwidth memory, which is HBM3E, a single cube
00:15:51
can hold 24 to 36 gigabytes of memory, and several such cubes together can yield 192 gigabytes of high-speed memory around
00:15:59
the AI chip. Next time you buy an AI accelerator
system, make sure it uses Micron’s HBM3E which
00:16:07
uses 30% less power than competing products.
However, unless you’re building an AI data center,
00:16:15
you’re likely not in the market to buy one
of these systems which cost between 25 and
00:16:20
40 thousand dollars and are on backorder for a
few years. If you’re curious about high bandwidth
00:16:27
memory, or Micron’s next generation of graphics
memory take a look at one of these links in the
00:16:33
description. Alternatively, if designing the next
generation of memory chips interests you, Micron
00:16:40
is always looking for talented scientists and
engineers to help innovate on cutting edge chips
00:16:46
and you can find out more about working for Micron
using this link. Now that we’ve explored many of
00:16:53
the physical components inside this graphics card
and GPU, let’s next explore the computational
00:17:00
architecture and see how applications like video
game graphics and bitcoin mining run what’s called
00:17:06
“embarrassingly” parallel operations. Although
it may sound like a silly name, embarrassingly
00:17:13
parallel is actually a technical classification
of computer problems where little or no effort is
00:17:19
needed to divide the problem into parallel tasks,
and video game rendering and bitcoin mining easily
00:17:27
fall into this category. Essentially, GPUs solve
embarrassingly parallel problems using a principle
00:17:34
called SIMD, which stands for single instruction
multiple data where the same instructions or steps
00:17:41
are repeated across thousands to millions of
different numbers. Let’s see an example of how
00:17:48
SIMD or single instruction multiple data is used
to create this 3D video game environment. As you
00:17:55
may know already, this cowboy hat on the table is
composed of approximately 28 thousand triangles
00:18:02
built by connecting together around 14,000
vertices, each with X, Y, and Z coordinates.
00:18:10
These vertex coordinates are built using a
coordinate system called model space with the
00:18:16
origin of (0, 0, 0) being at the center of the hat.
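As a preview of the model-space to world-space conversion described over the next few lines, here is a minimal Python sketch; numpy stands in for the GPU, and the vertex values and world position are invented for illustration:

```python
import numpy as np

# One instruction (a vector add) applied across every vertex of the hat.
rng = np.random.default_rng(0)
hat_model_space = rng.uniform(-1, 1, size=(14_000, 3))  # X, Y, Z per vertex
hat_origin_world = np.array([12.0, 0.8, -3.0])          # hat's assumed position in the world

# Same add, different data -- single instruction, multiple data:
hat_world_space = hat_model_space + hat_origin_world
print(hat_world_space.shape)  # (14000, 3)
```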
To build a 3D world we place hundreds of objects,
00:18:24
each with their own model space into the world
environment and, in order for the camera to be
00:18:29
able to tell where each object is relative to
other objects, we have to convert or transform
00:18:36
all the vertices from each separate model space
into the shared world coordinate system or world
00:18:43
space. So, as an example, how do we convert the
14 thousand vertices of the cowboy hat from model
00:18:51
space into world space? Well, we use a single
instruction which adds the position of the origin
00:18:57
of the hat in world space to the corresponding
X,Y, and Z coordinate of a single vertex in
00:19:04
model space. Next we copy this instruction to
multiple data, which is all the remaining X,Y,
00:19:11
and Z coordinates of the other thousands of
vertices that are used to build the hat. Next,
00:19:17
we do the same for the table and the rest of
the hundreds of other objects in the scene,
00:19:22
each time using the same instructions but with
the different objects’ coordinates in world space,
00:19:29
and each object's thousands of vertices in model
space. As a result, all the vertices and triangles
00:19:36
of all the objects are converted to a common
world space coordinate system and the camera
00:19:42
can now determine which objects are in front
and which are behind. This example illustrates
00:19:48
the power of SIMD or single instruction multiple
data and how a single instruction is applied to
00:19:54
5,629 different objects with a total of 8.3
million vertices within the scene resulting
00:20:02
in 25 million addition calculations. The key to
SIMD and embarrassingly parallel programs is that
00:20:09
every one of these millions of calculations
has no dependency on any other calculation,
00:20:15
and thus all these calculations can be distributed
to the thousands of cores of the GPU and completed
00:20:22
in parallel with one another. It's important to
note that vertex transformation from model space
00:20:28
to world space is just one of the first steps of
a rather complicated video game graphics rendering
00:20:34
pipeline and we have a separate video that delves
deeper into each of these other steps. Also,
00:20:41
we skipped over the transformations for the
rotation and scale of each object, but factoring
00:20:46
in these values is a similar process that requires
additional SIMD calculations. Now that we have a
00:20:53
simple understanding of SIMD, let’s discuss how
this computational architecture matches up with
00:20:59
the physical architecture. Essentially, each
instruction is completed by a thread and this
00:21:05
thread is matched to a single CUDA core. Threads
are bundled into groups of 32 called warps,
00:21:11
and the same sequence of instructions is issued to
all the threads in a warp. Next, warps are grouped
00:21:18
into thread blocks which are handled by the
streaming multiprocessor. And then finally thread
00:21:24
blocks are grouped into grids, which are computed
across the overall GPU. All these computations are
00:21:32
managed or scheduled by the Gigathread Engine,
which efficiently maps thread blocks to the
00:21:38
available streaming multiprocessors. One important
distinction is that within SIMD architecture,
00:21:44
all 32 threads in a warp follow the same
instructions and are in lockstep with each
00:21:50
other, kind of like a phalanx of soldiers moving
together. This lockstep execution applied to GPUs
00:21:57
up until around 2016. However, newer GPUs follow
a SIMT architecture or single instruction multiple
00:22:06
threads. The difference between SIMD and SIMT is
that while both send the same set of instructions
00:22:13
to each thread, with SIMT, the individual
threads don’t need to be in lockstep with
00:22:19
each other and can progress at different rates.
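Stepping back to the thread hierarchy for a moment, here is a sketch of how the earlier scene's 8.3 million vertex transforms might map onto warps and thread blocks; the thread-block size here is an assumption, since real launch configurations vary by program:

```python
import math

# Mapping one thread per vertex onto warps and thread blocks.
vertices = 8_300_000
threads = vertices            # one thread per vertex transform
threads_per_warp = 32
threads_per_block = 256       # assumed block size (8 warps per block)

warps = math.ceil(threads / threads_per_warp)
blocks = math.ceil(threads / threads_per_block)
print(f"{warps:,} warps in {blocks:,} thread blocks")
# 259,375 warps in 32,422 thread blocks
```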
In technical jargon, each thread is given its own
00:22:25
program counter. Additionally, with SIMT all the
threads within a streaming multiprocessor use a
00:22:32
shared 128 kilobyte L1 cache and thus data that’s
output by one thread can be subsequently used by
00:22:41
a separate thread. This improvement from SIMD to
SIMT allows for more flexibility when encountering
00:22:48
warp divergence caused by data-dependent conditional branching, and makes it easier for diverged threads
00:22:55
to reconverge at a barrier synchronization point. Essentially
newer architectures of GPUs are more flexible and
00:23:02
efficient especially when encountering branches
in code. One additional note is that although
00:23:08
you may think that the term warp is derived from
warp drives, it actually comes from weaving and
00:23:14
specifically the Jacquard Loom. This loom from
1804 used programmable punch cards to select
00:23:22
specific threads out of a set to weave together
intricate patterns. As fascinating as looms are,
00:23:29
let’s move on. The final topics we’ll explore are
bitcoin mining, tensor cores and neural networks.
00:23:37
But first we’d like to ask you to ‘like’
this video, write a quick comment below,
00:23:42
share it with a colleague, friend or on social
media, and subscribe if you haven’t already.
00:23:49
The dream of Branch Education is to make free
and accessible, visually engaging educational
00:23:55
videos that dive deeply into a variety of topics on
science, engineering, and how technology works,
00:24:02
and then to combine multiple videos into an
entirely free engineering curriculum for high
00:24:08
school and college students. Taking a few seconds
to like, subscribe, and comment below helps us
00:24:15
a ton! Additionally, we have a Patreon page
with AMAs and behind the scenes footage, and,
00:24:22
if you find what we do useful, we would appreciate
any support. Thank you. So now that we’ve explored
00:24:31
how single instruction multiple threads is used
in video games, let’s briefly discuss why GPUs
00:24:38
were initially used for mining bitcoin. We’re not
going to get too far into the algorithm behind
00:24:44
the blockchain and will save it for a separate
episode, but essentially, to create a block on
00:24:50
the blockchain, the SHA-256 hashing algorithm is
run on a set of data that includes transactions,
00:24:57
a timestamp, additional data, and a random number
called a nonce. After feeding these values through
00:25:04
the SHA-256 hashing algorithm a random 256-bit
value is output. You can kind of think of this
00:25:11
algorithm as a lottery ticket generator where you
can’t pick the lottery number, but based on the
00:25:17
input data, the SHA-256 algorithm generates
a random lottery ticket number. Therefore,
00:25:24
if you change the nonce value and keep the rest
of the transaction data the same, you’ll generate
00:25:29
a new random lottery ticket number. The winner of
this bitcoin mining lottery is the first randomly
00:25:35
generated lottery number to have the first 80 bits
all zeroes, while the remaining 176 bits don’t
00:25:42
matter and once a winning bitcoin lottery ticket
is found, the reward is 3.125 bitcoin and the lottery
00:25:49
resets with a new set of transactions and input
values. So, why were graphics cards used? Well,
00:25:56
GPUs ran thousands of iterations of the SHA-256
algorithm with the same transactions, timestamp,
00:26:05
and other data, but with different nonce values.
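Here is a minimal Python sketch of that nonce lottery; the block data is a stand-in, and the difficulty is reduced from 80 zero bits to 16 so the loop finishes quickly:

```python
import hashlib

# Hash the same block data with different nonces until the output
# begins with enough zero bits.
block_data = b"transactions|timestamp|other data"  # stand-in block contents
target_zero_bits = 16   # the figure described above is ~80 bits

nonce = 0
while True:
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "little")).digest()
    value = int.from_bytes(digest, "big")
    if value >> (256 - target_zero_bits) == 0:  # leading bits all zero?
        break
    nonce += 1
print(f"winning nonce: {nonce}")
```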
As a result, a graphics card like this one could
00:26:11
generate around 95 million SHA-256 hashes or 95
million randomly numbered lottery tickets every
00:26:19
second, and hopefully one of those lottery numbers
would have the first 80 bits all zeros.
00:26:27
However, nowadays computers filled with ASICs or
application specific integrated circuits perform
00:26:34
250 trillion hashes a second, or the equivalent of around 2.6 million of these graphics cards, thereby making graphics
00:26:42
cards look like a spoon when mining bitcoin next
to an excavator that is an ASIC mining computer.
00:26:50
Let’s next discuss the design of the tensor cores.
It’ll take multiple full-length videos to cover
00:26:56
generative AI, and neural networks, so we’ll focus
on the exact matrix math that tensor cores solve.
00:27:04
Essentially, tensor cores take three matrices and
multiply the first two, add in the third and then
00:27:11
output the result. Let’s look at one value of the
output. This value is equal to the sum of values
00:27:18
of the first row of the first matrix multiplied
by the values from the first column of the
00:27:23
second matrix, and then the corresponding
value of the third matrix is added in.
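Here is that operation as a minimal sketch, with one output element spelled out; numpy stands in for the tensor core, and the tiny matrices are invented values:

```python
import numpy as np

# The matrix operation a tensor core computes: D = A @ B + C.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])
C = np.array([[0.5, 0.5],
              [0.5, 0.5]])

D = A @ B + C
# One output element spelled out: first row of A dot first column of B,
# plus the corresponding value of C:
d00 = A[0, 0] * B[0, 0] + A[0, 1] * B[1, 0] + C[0, 0]  # 1*5 + 2*7 + 0.5
print(D[0, 0], d00)  # 19.5 19.5
```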
00:27:28
Because all the values of the 3 input
matrices are ready at the same time,
00:27:33
the tensor cores complete all of the matrix
multiplication and addition calculations
00:27:38
concurrently. Neural Networks and generative
AI require trillions to quadrillions of matrix
00:27:45
multiplication and addition operations and typically use much larger matrices. Finally,
00:27:52
there are Ray Tracing Cores which we explored in
a separate video that’s already been released.
00:27:58
That’s pretty much it for graphics cards.
We’re thankful to all our Patreon and YouTube
00:28:03
Membership Sponsors for supporting our videos.
If you want to financially support our work,
00:28:09
you can find the links in the description below.
This is Branch Education, and we create 3D
00:28:15
animations that dive deeply into the technology
that drives our modern world. Watch another Branch
00:28:22
video by clicking one of these cards or click
here to subscribe. Thanks for watching to the end!