00:00:00
What is inferencing? It's an AI model's
00:00:04
time to shine, its moment of truth: a test
00:00:06
of how well the model can apply
00:00:08
information learned during training to
00:00:11
make a prediction or solve a task and
00:00:14
with it comes a focus on cost and speed
00:00:17
let's get into
00:00:19
it. So, an AI model goes through two
00:00:25
primary stages what are those the first
00:00:29
of those is the
00:00:33
training stage where the model learns
00:00:36
how to do stuff and then we have the
00:00:41
inferencing stage that comes after
00:00:46
training now we can think of this as the
00:00:50
difference between learning something
00:00:53
and then putting what we've learned into
00:00:55
practice so during training a deep
00:00:58
learning model computes how the examples
00:01:01
in its training set are related what
00:01:03
it's doing effectively here is it's
00:01:06
figuring out
00:01:08
relationships between all of the data in
00:01:11
its training set and it encodes these
00:01:15
relationships into what are called a
00:01:18
series of model weights these are the
00:01:21
weights that connect its artificial
00:01:24
neurons. So, that's training. Now, during
00:01:27
inference a model goes to work on what
00:01:30
we provide it which is real time data so
00:01:36
this is the actual data that we are
00:01:38
inputting into the
00:01:40
model. What happens during inferencing is the
00:01:42
model compares the user's query with the
00:01:45
information processed during training
00:01:47
and all of those stored weights and what
00:01:49
the model effectively does is it
00:01:52
generalizes based on everything that it has
00:01:54
learned during training so it
00:01:57
generalizes from this stored
00:01:58
representation to be able to interpret
00:02:00
this new unseen data in much the same
00:02:04
way that you and I can draw on prior
00:02:06
knowledge to infer the meaning of a new
00:02:08
word, or make sense of a new situation.
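To make that idea concrete, here is a minimal Python sketch (the weights and numbers are invented for illustration, not taken from any real model): the weights are fixed during training, and inference simply applies them to data the model has never seen before.

```python
import numpy as np

# Toy "model": one layer of artificial neurons whose weights were fixed
# during training (hard-coded here purely for illustration).
trained_weights = np.array([[0.8, -0.3],
                            [0.1,  0.9]])   # encodes the learned relationships
trained_bias = np.array([0.05, -0.10])

def infer(new_input: np.ndarray) -> np.ndarray:
    """Inference: apply the stored weights to unseen data; nothing is learned here."""
    activation = new_input @ trained_weights + trained_bias
    return 1 / (1 + np.exp(-activation))    # sigmoid squashes outputs to (0, 1)

# Real-time data the model has not seen before.
print(infer(np.array([0.6, 0.2])))
```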
00:02:10
And what's the goal of this?
00:02:13
well the goal of AI inference is to
00:02:15
calculate an output basically a result
00:02:19
an actionable
00:02:22
result so what sort of result are we
00:02:26
talking about well let's consider a
00:02:29
model that attempts to accurately flag
00:02:32
incoming email and it's going to flag it
00:02:35
based on whether or not it thinks it is
00:02:38
spam. We are going to build a spam
00:02:41
detector
00:02:43
model. Right, so during the training stage
00:02:48
this model would be fed a large labeled
00:02:51
data set so we get in a whole load of
00:02:54
data here and this contains a bunch of
00:02:58
emails that have been labeled
00:03:00
specifically the labels are spam or not
00:03:06
spam for each email and what happens
00:03:10
here is the model learns to recognize
00:03:12
patterns and features commonly
00:03:14
associated with Spam emails so these
00:03:17
might include the presence of certain
00:03:19
keywords (yeah, those ones), unusual
00:03:23
sender email addresses excessive use of
00:03:25
exclamation marks all that sort of thing
00:03:28
now the model encodes these learned
00:03:30
patterns into its weights here, creating a
00:03:34
complex set of rules to identify spam
00:03:38
now during inference this model is put
00:03:41
to the test it's put to the test with
00:03:44
new unseen data in real time like when a
00:03:49
new email arrives in a user's inbox the
00:03:53
model analyzes the incoming email
00:03:56
comparing its characteristics to the
00:03:58
patterns it's learned during training
00:04:01
and then makes a prediction is this new
00:04:04
unseen email spam or not spam now the
00:04:09
actionable result here might be a
00:04:11
probability score indicating how likely
00:04:14
the email is to be spam which is then
00:04:16
tied into a business rule so for example
00:04:19
if the model assigns a
00:04:22
90%
00:04:23
probability that what we're looking at
00:04:25
here is Spam well we should move that
00:04:30
email directly to the spam folder that's
00:04:32
what the business rule would say but if
00:04:34
the probability the model comes back
00:04:36
with is just
00:04:37
50% the business rule might say to leave
00:04:40
the email in the inbox but flag it for
00:04:42
the user to decide what to do. So
00:04:45
what's happening here is the model is
00:04:48
generalizing it can identify spam emails
00:04:51
even if they don't exactly match any
00:04:53
specific example from its training data
00:04:56
as long as they share similar
00:04:58
characteristics with the spam patterns
00:05:00
it has learned.
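To see that end to end, here is a rough sketch of the spam example in Python using scikit-learn. The tiny dataset and the 90% / 50% thresholds are illustrative assumptions, not a real production system.

```python
# Toy sketch of the spam-detector example (illustrative data and thresholds).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# --- Training stage: a small labeled dataset (spam = 1, not spam = 0) ---
emails = [
    "WIN A FREE PRIZE!!! click now",
    "Limited offer!!! claim your reward",
    "Meeting notes from Tuesday attached",
    "Can we reschedule lunch to Friday?",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails)          # turn text into features
model = LogisticRegression().fit(X_train, labels)   # weights now encode the patterns

# --- Inference stage: a new, unseen email arrives in real time ---
new_email = ["FREE reward!!! click to claim your prize"]
p_spam = model.predict_proba(vectorizer.transform(new_email))[0, 1]

# --- Business rule tied to the probability score ---
if p_spam >= 0.9:
    action = "move to spam folder"
elif p_spam >= 0.5:
    action = "leave in inbox but flag for the user"
else:
    action = "leave in inbox"
print(f"P(spam) = {p_spam:.2f} -> {action}")
```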
00:05:03
Okay, now when the topic of inferencing comes up, it is often
00:05:06
accompanied by four preceding words.
00:05:09
let's cover those
00:05:12
next. "The high cost of": those are the
00:05:16
words often added before inferencing.
00:05:20
Training AI models, particularly large
00:05:21
language models can cost millions of
00:05:23
dollars in computing processing time, but
00:05:26
as expensive as training an AI model can
00:05:28
be, it is dwarfed
00:05:30
by the expense of inferencing. Each time
00:05:34
someone runs an AI model there's a cost
00:05:36
a cost in kilowatt hours a cost in
00:05:38
dollars, a cost in carbon emissions. On
00:05:40
average something like about
00:05:44
90% of an AI model's life is spent in
00:05:49
inferencing mode and therefore most of
00:05:51
the AI's carbon footprint comes from
00:05:54
serving models to the world, not from
00:05:55
training them in fact by some estimates
00:05:57
running a large AI model puts more
00:06:00
carbon into the atmosphere over its
00:06:01
lifetime than the average American car
00:06:04
now the high costs of inferencing they
00:06:07
stem from a number of different factors
00:06:10
so let's take a look at some of those
00:06:12
and first of all, there's just the
00:06:15
sheer scale, the scale of operations.
00:06:18
while training happens just once
00:06:20
inferencing happens millions or even
00:06:23
billions of times over a model's
00:06:25
lifetime a chatbot might field millions
00:06:27
of queries every day each requiring a
00:06:29
separate inference. Second, there's the
00:06:32
need, the need for speed. We want fast AI
00:06:38
models. We're working with real-time data
00:06:41
here, requiring near-instantaneous
00:06:43
responses which often necessitate
00:06:46
powerful, energy-hungry hardware like
00:06:50
GPUs. Third, we also have to consider just
00:06:54
the general
00:06:55
complexity of these AI models as models
00:06:59
grow larger and more sophisticated to
00:07:01
handle more complex tasks they require
00:07:03
more computational resources for each
00:07:05
inference this is particularly true for
00:07:07
LLMs with billions of parameters. And
00:07:10
then finally, there is the cost in terms
00:07:14
of infrastructure: data centers to
00:07:17
maintain and cool, low-latency network
00:07:20
connections to power. All these factors
00:07:22
contribute to significant ongoing costs
00:07:25
in terms of energy consumption, hardware
00:07:27
wear and tear and operational expenses
00:07:30
which brings up the question of whether
00:07:33
there's a better way to do this faster
00:07:36
and more
00:07:38
efficiently. How fast an AI model runs
00:07:41
depends on the stack. What's the stack?
00:07:45
well improvements made at each layer can
00:07:48
speed up inferencing. And at the top of the
00:07:50
stack is hardware. At the hardware level,
00:07:55
engineers are developing specialized
00:07:58
chips. These are chips made for AI,
00:08:03
and they're optimized for the types of
00:08:05
mathematical operations that dominate
00:08:07
deep learning particularly matrix
00:08:09
multiplication. These AI accelerators can
00:08:12
significantly speed up inferencing tasks
00:08:14
compared to traditional CPUs and even to
00:08:16
GPUs, and to do so in a more energy
00:08:19
efficient way.
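As a rough, back-of-the-envelope illustration of why matrix multiplication is the target (the layer sizes below are made up): a single dense layer applied to a batch of inputs is one big matmul, and the multiply-accumulate count grows very quickly.

```python
import numpy as np

# One dense layer applied to a batch of inputs is just a matrix multiplication.
batch, d_in, d_out = 32, 4096, 4096          # illustrative sizes
x = np.random.rand(batch, d_in).astype(np.float32)
W = np.random.rand(d_in, d_out).astype(np.float32)

y = x @ W                                    # the operation AI accelerators optimize

# Each output element needs d_in multiply-accumulate operations.
macs = batch * d_in * d_out
print(f"{macs:,} multiply-accumulates for a single layer")   # ~537 million
```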
00:08:23
Now, at the bottom of the stack I've put software, and on the
00:08:26
software side there are several
00:08:28
approaches to accelerate inferencing. One
00:08:31
is model compression now that involves
00:08:33
techniques like pruning and quantization
00:08:37
so what do we mean by those well first
00:08:39
of all pruning that removes unnecessary
00:08:44
weights from the model so it's reducing
00:08:46
its size without significantly impacting accuracy.
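Here is a toy sketch of one common flavor of this, magnitude pruning, which zeroes out the smallest weights. The weight values and the 50% sparsity target are made-up numbers for illustration.

```python
import numpy as np

# Toy magnitude pruning: zero out the smallest-magnitude weights.
weights = np.array([0.92, -0.04, 0.33, 0.01, -0.57, 0.02])

sparsity = 0.5                                      # fraction of weights to remove
threshold = np.quantile(np.abs(weights), sparsity)  # cutoff magnitude
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

print(pruned)   # small weights become exact zeros -> a smaller, cheaper model
```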
00:08:50
And then, for quantization, what
00:08:53
that is talking about is reducing the
00:08:56
precision of the model's weights, such as
00:08:58
from 32-bit floating-point numbers to
00:09:02
8-bit integers, and that can really speed
00:09:03
up computations and reduce memory
00:09:06
requirements.
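And here is a toy sketch of simple symmetric quantization from 32-bit floats to 8-bit integers. Real quantization schemes are more involved, and the weight values here are invented.

```python
import numpy as np

# Toy symmetric quantization: map 32-bit float weights onto 8-bit integers.
w_fp32 = np.array([0.42, -1.30, 0.07, 0.95, -0.66], dtype=np.float32)

scale = np.abs(w_fp32).max() / 127.0                 # one scale factor for the tensor
w_int8 = np.round(w_fp32 / scale).astype(np.int8)    # 4x smaller to store
w_dequant = w_int8.astype(np.float32) * scale        # approximate reconstruction

print(w_int8)                 # e.g. [  41 -127    7   93  -64]
print(w_dequant - w_fp32)     # small rounding error, big memory savings
```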
00:09:09
Okay, so we've got hardware and software. What's in the middle? Middleware,
00:09:13
of course. Middleware bridges the
00:09:15
gap between the hardware and the
00:09:17
software, and middleware frameworks can
00:09:19
perform a bunch of things to help here
00:09:22
one of those things is called graph
00:09:26
fusion, and graph fusion reduces the
00:09:29
number of nodes in the communication
00:09:31
graph and that minimizes the round trips
00:09:34
between CPUs and GPUs.
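Here is a toy illustration of the idea in plain Python. NumPy does not literally fuse GPU kernels, but it shows how three separate graph nodes, each producing an intermediate result, can be rewritten as a single node doing the same math.

```python
import numpy as np

x = np.random.rand(1024, 1024).astype(np.float32)
W = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

# Unfused: three separate operations, each producing an intermediate result
# that has to be written out and read back (on a GPU, extra round trips).
def unfused(x):
    t1 = x @ W                    # node 1: matrix multiply
    t2 = t1 + b                   # node 2: add bias
    return np.maximum(t2, 0.0)    # node 3: ReLU

# "Fused": the same math expressed as one node, so the intermediates never
# leave the computation. Middleware does this rewriting on the model's graph.
def fused(x):
    return np.maximum(x @ W + b, 0.0)

assert np.allclose(unfused(x), fused(x))   # identical results, fewer nodes
```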
00:09:39
They can also implement parallel tensors as well,
00:09:43
strategically splitting the AI model's
00:09:46
computational graph into chunks and
00:09:48
those chunks can be spread across
00:09:50
multiple gpus and run at the same time
00:09:53
so running a 70-billion-parameter model
00:09:56
requires something like 150 GB of
00:09:59
memory which is nearly twice as much as
00:10:02
an Nvidia A100 GPU holds. But if the parallel-tensor
00:10:07
compiler can split the AI model's
00:10:09
computational graph into strategic
00:10:11
chunks those operations can be spread
00:10:14
across GPUs and run at the same time.
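The arithmetic behind that example works out roughly like this. The 16-bit weights and the 80 GB A100 variant are my assumptions; the 150 GB figure presumably includes some runtime overhead on top of the raw weights.

```python
import math

# Back-of-the-envelope memory math for the 70-billion-parameter example.
params = 70e9                             # 70-billion-parameter model
bytes_per_param = 2                       # assuming FP16 / BF16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")                   # ~140 GB

gpu_memory_gb = 80                        # one Nvidia A100 (80 GB variant)
gpus_needed = math.ceil(weights_gb / gpu_memory_gb)
print(f"GPUs needed just to hold the weights: {gpus_needed}")   # 2
```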
00:10:18
So, that's inferencing: it's a game of
00:10:21
pattern matching that turns complex
00:10:24
training into rapid-fire problem solving,
00:10:27
one spammy email at a
00:10:30
time. If you have any questions, please
00:10:32
drop us a line below and if you want to
00:10:34
see more videos like this in the future
00:10:36
please like and subscribe. Thanks for
00:10:39
watching