00:00:14
Hi, welcome to this session on
end-to-end MLOps for
00:00:17
architects. My name is Sara van de
Moosdijk, but you can call me
00:00:20
Moose. I'm a Senior AI/ML
Partner Solutions Architect at
00:00:24
AWS. Now, my goal for this
session today is to help
00:00:28
architects and developers,
especially those of you not
00:00:31
specialised in machine learning,
to design an MLOps architecture
00:00:34
for your organisation. I will
introduce the different
00:00:37
components in an effective
MLOps setup, and explain why
00:00:40
these components are necessary
without diving too deep into the
00:00:44
details which only data
scientists need to know. Now,
00:00:47
before watching this session,
you should be comfortable with
00:00:50
architecting on AWS, and know
how to use popular services like
00:00:53
S3, Lambda, EventBridge,
CloudFormation, and so on. If
00:00:57
you're unfamiliar with these
services, I recommend watching
00:01:00
some of the other architecture
sessions and reading up on these
00:01:03
topics, then come back to watch
the session. It will also help
00:01:07
if you have a basic
understanding of machine
00:01:08
learning and how it works. Okay,
let's look at what you can
00:01:13
expect. I'll start off the
session with a brief overview of
00:01:17
the challenges many companies
face when deploying machine
00:01:20
learning models and maintaining
them in production. We will then
00:01:23
briefly define MLOps, before
diving straight into some
00:01:26
architecture diagrams.
Specifically, I have designed
00:01:30
the architecture diagrams
according to t-shirt sizes from
00:01:32
small to large. This will allow
you to choose your starting
00:01:36
point based on the size and
maturity of your organisation.
00:01:40
And finally, I'll end the
session with some advice for
00:01:42
starting your own MLOps
journey. First, I want to
00:01:47
quickly go over the machine
learning process to make sure
00:01:49
that we're all on the same page.
Normally, you'd start your
00:01:52
machine learning process because
you have a business problem
00:01:55
which needs to be solved, and
you've determined that machine
00:01:57
learning is the correct
solution. Then a data scientist
00:02:01
will spend quite a bit of time
collecting the data required,
00:02:04
integrating data from various
sources, cleaning the data, and
00:02:07
analysing the data. Next, the
data scientist will start the
00:02:11
process of engineering features,
training and tuning different
00:02:14
machine learning models and
evaluating the performance of
00:02:17
these models. And then based on
these results, the data
00:02:20
scientist might go back to
collect more data or perform
00:02:23
additional data cleaning steps.
But assuming that the models are
00:02:26
performing well, he or she would
then go ahead and deploy the
00:02:29
model so it can be used to
generate predictions. The
00:02:33
final step and certainly a
crucial one is to monitor the
00:02:36
model that is in production.
Much like a new car which
00:02:40
depreciates in value as soon as
you drive it off the lot, a
00:02:43
machine learning model is out of
date, as soon as you've trained
00:02:45
it. The world is constantly
changing and evolving, so the
00:02:49
older your model gets, the worse
it gets at making predictions.
00:02:52
By monitoring the quality of
your model, you will know when
00:02:55
it's time to retrain or perhaps
gather new data for it. Now,
00:03:00
every data scientist will follow
a process along these lines and
00:03:03
when they're just starting out
this process will likely be
00:03:06
entirely manual. But as
businesses embrace machine
00:03:09
learning across their
organisations, manual workflows
00:03:12
for building, training, and deploying
tend to become bottlenecks to
00:03:15
innovation. Even a single
machine learning model in
00:03:18
production needs to be
monitored, managed and retrained
00:03:21
to maintain the quality of its
predictions. Without the right
00:03:24
operational practices in place,
these challenges can negatively
00:03:27
impact data scientist
productivity, model performance
00:03:30
and costs. So to illustrate what
I mean let's take a look at some
00:03:35
architecture diagrams. Here I
have a pretty standard data
00:03:40
science setup. The data
scientist has been given access
00:03:42
to an AWS account with a SageMaker
Studio domain, where they
00:03:46
can use Jupyter Notebooks to
develop their machine learning
00:03:48
models. Data might be pulled
from S3, RDS, Redshift, Glue, any
00:03:54
number of data related AWS
services. The models produced by
00:03:58
the data scientist are then
stored in an S3 bucket. So far,
00:04:02
so good. Unfortunately, this is
often where it stops. Gartner
00:04:06
estimates that only 53% of
machine learning POCs actually
00:04:11
make it into production. And
there are various reasons for
00:04:13
this. Often there's a
misalignment between strategic
00:04:17
objectives of the company and
the machine learning models
00:04:19
being built by the data
scientists. There might be a
00:04:22
lack of communication between
DevOps, security, legal, IT, and
00:04:26
the data scientists,
which can also be a common
00:04:28
challenge blocking models from
reaching production. And
00:04:31
finally, if the company already
has a few models in production,
00:04:35
a data science team can
struggle to maintain those
00:04:37
existing models, while also
pushing out new models. But what
00:04:42
if the model does make it
into production? Let's assume the
00:04:45
data scientist spins up a
SageMaker endpoint to host the
00:04:48
model. And the developer of an
application is able to connect
00:04:52
to this model and generate
predictions through API Gateway
00:04:55
connecting to a Lambda function,
which calls the SageMaker
00:04:58
endpoint. So what challenges do
you see with this architecture?
00:05:04
Well first, any changes to the
machine learning model requires
00:05:07
manual actions by the data
scientist in the form of re-
00:05:10
running cells in a Jupyter
Notebook. Second, the
00:05:14
code which the data scientist
produces is stuck in these
00:05:17
Jupyter Notebooks, which are
difficult to version and
00:05:19
difficult to automate. Third,
the data scientist might have
00:05:23
forgotten to turn on auto-
scaling for the SageMaker
00:05:26
endpoint, so it cannot adjust
capacity according to the number
00:05:29
of requests coming in. And
finally, there's no feedback
00:05:32
loop. If the quality of the
model deteriorates, you would
00:05:35
only find out through complaints
from disgruntled users. These
00:05:38
are just some of the challenges
which can be avoided with a
00:05:41
proper MLOps setup. So what is
MLOps? Well, MLOps is a set of
00:05:47
operational practices to
automate and standardise model
00:05:51
building, training, deployment,
monitoring, management, and
00:05:54
governance. It can help
companies streamline the end-to-
00:05:58
end machine learning lifecycle,
and boost productivity of data
00:06:01
scientists and MLOps teams,
while maintaining high model
00:06:04
accuracy, and enhancing security
and compliance. So the key
00:06:09
phrase in that previous
definition is operational
00:06:12
practices. MLOps, similar to
DevOps, is more than just a set
00:06:16
of technologies or services. You
need the right people, with the
00:06:19
right skills, following the same
standardised processes to
00:06:23
successfully operate machine
learning at scale. The
00:06:26
technology exists to facilitate
these processes and make the job
00:06:30
easier for the people. Now in
this session, I will focus on
00:06:33
the technology and specifically
which AWS services we can use to
00:06:37
build a successful setup. But I
want you to keep in mind that
00:06:41
the architectures provided in
this session will only work if
00:06:43
you have the right teams, and if
those teams are willing to
00:06:46
establish and follow MLOps
processes. Now if you want to
00:06:50
learn more about the people and
process aspects of MLOps, I
00:06:54
will include links to useful
resources at the end of this
00:06:56
session. So without further ado,
let's dive deep with some
00:07:01
architecture diagrams. Now
MLOps can be quite complicated,
00:07:04
with lots of features and
technologies which you could
00:07:07
choose to adopt, but you don't
have to adopt all of it
00:07:09
immediately. So to start off,
I'll give you an example of a
00:07:13
minimal MLOps setup. This would
be suitable for a small company
00:07:17
or a small data science team of
one to three people working on
00:07:21
just a couple of use cases. So
let's take a look. We'll start
00:07:27
off with the same architecture
we looked at previously, only
00:07:30
reduced in size so I can create
more space on the slide. A data
00:07:33
scientist accesses Jupyter
Notebooks through SageMaker
00:07:36
Studio, accesses data from any
of various data sources, and
00:07:40
stores any machine learning
models they create in S3. One of
00:07:45
the challenges I mentioned
previously is that the code is
00:07:47
stuck in Jupyter notebooks and
can be difficult to version and
00:07:49
automate. So the first step
would be to add more versioning
00:07:53
to this architecture. You can
use CodeCommit or any other Git-
00:07:56
based repository to store code.
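For the model artifacts that are already landing in S3, versioning can be switched on with a single call. Here is a minimal boto3 sketch; the bucket name and key are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket that holds the trained model artifacts.
MODEL_BUCKET = "my-ml-model-artifacts"

# Turn on object versioning so every overwrite of a model artifact keeps history.
s3.put_bucket_versioning(
    Bucket=MODEL_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# List the stored versions of a given artifact to confirm versioning is active.
versions = s3.list_object_versions(Bucket=MODEL_BUCKET, Prefix="models/model.tar.gz")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"])
```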
And you can use Amazon Elastic
00:08:00
Container Registry or ECR to
store Docker containers, thereby
00:08:04
versioning the environments
which were used to train the
00:08:06
machine learning models. By
versioning the code, the
00:08:09
environments and the model
artifacts, you improve your
00:08:12
ability to reproduce models and
collaborate with others. Next,
00:08:16
let's talk about automation.
Another challenge I mentioned
00:08:19
previously is that the data
scientists are manually
00:08:22
re-training models instead of
focussing on developing new
00:08:24
models. To solve this, you want
to set up automatic re-training
00:08:28
pipelines. In this architecture,
I use SageMaker Pipelines, but
00:08:32
you could also use Step
Functions or Airflow to build
00:08:34
these repeatable workflows. The
re-training pipeline built by the
00:08:38
data scientist or by a machine
learning engineer will use the
00:08:41
versioned code and environments to
perform data pre-processing,
00:08:44
model training, model
verification, and eventually
00:08:48
save the new model artifacts to
S3. It can use various services
00:08:52
to complete these steps
including SageMaker processing
00:08:54
or training jobs, EMR, or
Lambda. But in order to automate
00:08:59
this pipeline, we need a
trigger. One option is to use
00:09:04
EventBridge to trigger the
pipeline based on a schedule.
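As a rough illustration of the scheduled option, an EventBridge rule can target a SageMaker pipeline directly. This is a sketch only; the rule name, ARNs, schedule, and pipeline parameter are assumptions:

```python
import boto3

events = boto3.client("events")

# Hypothetical ARNs for illustration only.
PIPELINE_ARN = "arn:aws:sagemaker:eu-west-1:111122223333:pipeline/retraining-pipeline"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/EventBridgeStartPipelineRole"

# A rule that fires once a week.
events.put_rule(
    Name="weekly-retraining",
    ScheduleExpression="rate(7 days)",
    State="ENABLED",
)

# Point the rule at the SageMaker pipeline; EventBridge starts a pipeline
# execution with the given role each time the rule fires.
events.put_targets(
    Rule="weekly-retraining",
    Targets=[
        {
            "Id": "retraining-pipeline",
            "Arn": PIPELINE_ARN,
            "RoleArn": EVENTS_ROLE_ARN,
            "SageMakerPipelineParameters": {
                "PipelineParameterList": [
                    {"Name": "InputDataS3Uri", "Value": "s3://my-training-data/latest/"}
                ]
            },
        }
    ],
)
```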
00:09:07
Another option is to have
someone manually trigger the
00:09:09
pipeline. Both triggers are
useful in different contexts
00:09:13
and I'll introduce more triggers
as we progress through these
00:09:16
slides. So now that we have an
automated re-training pipeline, I
00:09:20
want to introduce another
important concept in MLOps, and
00:09:23
that's the model registry. While
S3 provides some versioning and
00:09:27
object locking functionality,
which is useful for storing
00:09:29
different models, a model
registry helps to manage these
00:09:32
models and their versions.
SageMaker model registry allows you
00:09:36
to store metadata alongside your
models, including the values of
00:09:40
hyperparameters and evaluation
metrics, or even the bias and
00:09:43
explainability reports. This
enables you to quickly view and
00:09:47
compare different versions of a
model and to approve or reject a
00:09:50
model version for production.
Now the actual artifacts are
00:09:54
still stored in S3, but model
registry sits on top of that as
00:09:58
an additional layer.
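To make that concrete, here is a boto3 sketch of registering a new model version with evaluation metrics attached and then approving it. The group name, container image, and S3 paths are placeholders, not values from the session:

```python
import boto3

sm = boto3.client("sagemaker")

# Register a new version in a (hypothetical) model package group.
response = sm.create_model_package(
    ModelPackageGroupName="churn-model",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "111122223333.dkr.ecr.eu-west-1.amazonaws.com/churn:latest",
                "ModelDataUrl": "s3://my-ml-model-artifacts/models/model.tar.gz",
            }
        ],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    # Evaluation metrics stored alongside this model version.
    ModelMetrics={
        "ModelQuality": {
            "Statistics": {
                "ContentType": "application/json",
                "S3Uri": "s3://my-ml-model-artifacts/evaluation/metrics.json",
            }
        }
    },
)

# After reviewing the metrics, approve this version for deployment.
sm.update_model_package(
    ModelPackageArn=response["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)
```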
00:10:00
Finally, we reach the
deployment stage. At first
00:10:03
glance, this might look very
different from what we saw
00:10:05
earlier in the session. But the
setup is actually very similar.
00:10:08
I still have machine learning
models deployed on real-time
00:10:11
SageMaker endpoints connected
to Lambda and API Gateway to
00:10:15
communicate with an application.
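The Lambda function in that path is usually very thin. A minimal sketch, assuming a hypothetical endpoint name and a JSON request body coming through API Gateway:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; in practice this would come from an environment variable.
ENDPOINT_NAME = "churn-model-endpoint"


def handler(event, context):
    """API Gateway proxy handler that forwards the request body to the endpoint."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=event["body"],
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```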
The main difference now is that
00:10:19
I have autoscaling set up for
my SageMaker endpoints. So if
00:10:22
there's an unexpected spike in
users, the endpoints can scale
00:10:25
up to handle the requests and
scale back down when the usage
00:10:29
falls. Now one nice feature of
SageMaker endpoints is that you
00:10:32
can replace the machine learning
model without endpoint downtime.
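Roughly, that swap is a new endpoint config followed by an update_endpoint call; SageMaker brings up the new model behind the scenes and only then shifts traffic over. A sketch with hypothetical names:

```python
import boto3

sm = boto3.client("sagemaker")

NEW_MODEL_NAME = "churn-model-v7"          # hypothetical, created from the new artifact
ENDPOINT_NAME = "churn-model-endpoint"     # the live endpoint to update
NEW_CONFIG_NAME = "churn-endpoint-config-v7"

# An endpoint config that points at the new model version.
sm.create_endpoint_config(
    EndpointConfigName=NEW_CONFIG_NAME,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": NEW_MODEL_NAME,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,
        }
    ],
)

# Update the live endpoint in place; the old model keeps serving requests
# until the new one is up, so there is no downtime.
sm.update_endpoint(
    EndpointName=ENDPOINT_NAME,
    EndpointConfigName=NEW_CONFIG_NAME,
)
```

The same update_endpoint call also accepts a DeploymentConfig, which is how the gradual canary rollout described in a moment can be expressed.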
00:10:36
Since I now have an automated
re-training pipeline creating
00:10:39
new models, and a model registry
where I can approve models, it
00:10:42
would be best if the deployment
of the new models is automated
00:10:45
as well. I can achieve this by
building a Lambda function,
00:10:49
which triggers when a new model
is approved to fetch that model,
00:10:53
and then update the endpoint
with it. So now we have
00:10:56
connected all the pieces and
there's one final feature that I
00:10:59
will take advantage of. Not only
can I update the machine
00:11:02
learning models hosted by the
endpoints, but I can actually do
00:11:04
so gradually using a canary
deployment. This means that a
00:11:08
small portion of the user
requests will be diverted to the
00:11:11
new model, and any errors or
issues will trigger a CloudWatch
00:11:15
alarm to inform me. Over
time, the number of requests
00:11:18
sent to the new model will
increase until the new model
00:11:21
gets 100% of the traffic. So I
hope this architecture makes
00:11:24
sense. I started with a very
basic setup, and by adding a few
00:11:28
features and services, I now
have a serviceable MLOps
00:11:31
setup. My deployment strategy is
more robust by using auto-
00:11:34
scaling and canary deployment,
my data scientists save time by
00:11:38
automating model training, and
every artefact is properly
00:11:41
versioned. But as your data
science team grows, this
00:11:45
architecture won't be
sufficient. So let's look at a
00:11:47
slightly more complicated
architecture. The next
00:11:52
architecture will be more
suitable for a growing data
00:11:55
science team of three to
10 data scientists working on
00:11:59
several different use cases at a
larger company. So again, let's
00:12:03
start with the basics. Our data
scientists work in notebooks
00:12:06
through SageMaker Studio,
pulling from various data
00:12:08
sources and versioning their
code environments and model
00:12:11
artifacts. This should look
familiar. I'll also bring back the
00:12:16
automated re-training pipeline.
Nothing has changed here, I've
00:12:19
only made it smaller to create
more room on the slide. And
00:12:22
finally, I'll bring back EventBridge
to schedule the
00:12:24
re-training pipeline and model
registry for storing model
00:12:27
metadata and approving model
versions. All of this is exactly
00:12:32
the same as in the previous
architecture diagram. So what
00:12:34
about deployment? Well, this is
where things change a little. So
00:12:38
I have the same deployment setup
with SageMaker endpoints and an
00:12:42
autoscaling group connected to
Lambda and API Gateway to allow
00:12:45
users to submit inference
requests. However, these
00:12:48
deployment services now sit in a
separate AWS account. A multi-
00:12:53
account strategy is highly
recommended, because this allows
00:12:55
you to separate different
business units, easily define
00:12:58
separate restrictions for
important production workloads,
00:13:01
and have a fine-grained view of
the costs incurred by each
00:13:04
component of your architecture.
The different accounts are best
00:13:08
managed through AWS
Organizations. Now, data
00:13:12
scientists should not have
access to the production
00:13:14
account. This reduces the chance
of mistakes being made on that
00:13:17
account, which would directly affect
your users. In fact, a multi-
00:13:21
account strategy for machine
learning usually has a separate
00:13:24
staging account alongside the
production account. Any new
00:13:27
models are first deployed to the
staging account, tested and only
00:13:31
then deployed on the production
account. So if the data
00:13:34
scientist cannot access these
accounts, clearly, the
00:13:36
deployment must happen
automatically. All of the
00:13:40
services deployed into the
staging and production accounts
00:13:42
are set up automatically using
CloudFormation, controlled by
00:13:45
CodePipeline in the development
account. The next step is to set
00:13:49
up a trigger for CodePipeline.
And we can do so using EventBridge.
00:13:52
So when a model version
is approved in model registry,
00:13:57
this will generate an event
which can be used to trigger
00:13:59
deployment via CodePipeline. So
now everything's connected
00:14:03
again, and this is starting to
look like a proper MLOps
00:14:06
setup. But I'm sure you've
noticed I have plenty of space
00:14:08
left on this slide. So let's add
another feature which becomes
00:14:11
crucial when you have multiple
models running in production for
00:14:14
extended periods of time - that's
model monitor. The goal of
00:14:18
monitoring machine learning
models in production is to
00:14:21
detect a change in behaviour or
accuracy. To start, I enable
00:14:25
data capture on the endpoints in
the staging and production
00:14:28
accounts. This captures the
incoming requests and outgoing
00:14:31
inference results and stores
them in S3 buckets. If you have
00:14:36
a model monitoring use case,
which doesn't require labelling
00:14:39
the incoming requests, then you
could run the whole process
00:14:41
directly on your staging and
production accounts. But in this
00:14:44
case, I assume the data needs to
be combined with labels or other
00:14:48
data that's on the development
account. So I use S3 replication
00:14:51
to move the data onto an S3
bucket in the development account.
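Enabling data capture is part of the endpoint configuration. A hedged sketch with placeholder names, capturing both inputs and outputs at a 100% sampling rate:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="churn-endpoint-config-capture",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "churn-model-v7",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,
        }
    ],
    # Capture both the requests and the model's responses to S3, from where
    # they can be replicated to the development account.
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://prod-inference-capture/churn-endpoint/",
        "CaptureOptions": [
            {"CaptureMode": "Input"},
            {"CaptureMode": "Output"},
        ],
    },
)
```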
00:14:56
Now, in order to tell if the
behaviour of the model or the
00:14:59
data has changed, we need
something to compare it to.
00:15:02
That's where the model baseline
comes in. During the training
00:15:05
process as part of the automated
re-training pipeline, we can
00:15:08
generate a baseline dataset,
which records the expected
00:15:11
behaviour of the data and the
model. So that gives me all the
00:15:15
components I need to set up SageMaker
model monitor, which will
00:15:18
compare the two datasets and
generate a report. The final
00:15:22
step in this architecture is to
take action based on the results
00:15:25
of the model monitoring report.
And we can do this by sending an
00:15:28
event to EventBridge to trigger
the re-training pipeline when a
00:15:31
significant change has been
detected. And that's it for the
00:15:34
medium MLOps
architecture! It contains a lot
00:15:37
of the same features used in the
small architecture, but it
00:15:40
expands to a multi-account
setup, and adds model monitoring
00:15:43
for extra quality checks on the
models in production. Hopefully,
00:15:48
you're now wondering what a
large MLOps architecture looks
00:15:51
like and how I can possibly fit
more features onto a single
00:15:54
slide. So let's take a look at
that now. This architecture is
00:15:57
suitable for companies with
large data science teams of 10
00:16:00
or more people and with machine
learning integrated throughout
00:16:03
the business. Of course, I start
with the same basic setup I had
00:16:09
last time but reduced in size
again. The data scientist is
00:16:12
still using SageMaker Studio
through a development account,
00:16:15
and stores model artifacts and
code in S3 and CodeCommit
00:16:18
respectively. The data sources
are also present, but data is
00:16:22
now stored in a separate
account. It's a common strategy
00:16:25
to have your data lakes set up
in one account with fine-grained
00:16:28
access controls to determine
which datasets can be accessed
00:16:31
by resources in other accounts.
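One possible way to express that kind of control, and this is an assumption on my part since the session doesn't prescribe a specific service for it, is AWS Lake Formation, granting a single table to a role in the development account. The account IDs, database, and table names are hypothetical:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Hypothetical grant: let a role in the ML development account query one table
# of the data lake account, and nothing else.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/DataScientistRole"
    },
    Resource={
        "Table": {
            "CatalogId": "999988887777",      # the data lake account
            "DatabaseName": "customer_data",
            "Name": "transactions",
        }
    },
    Permissions=["SELECT"],
)
```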
Really, the larger a company
00:16:35
becomes, the more AWS accounts
they tend to use, all managed
00:16:38
through AWS Organizations. So
let's continue this trend by
00:16:42
bringing back the automated
re-training pipeline in a
00:16:45
separate operations account. And
let's bring back model registry
00:16:48
as well in yet another account.
All of the components are the
00:16:51
same as in the small and medium
architecture diagrams, but just
00:16:54
split across more accounts. The
operations account is normally
00:16:58
used for any automated workflows
which don't require manual
00:17:01
intervention by the data
scientists. It's also good
00:17:04
practice to store all of your
artifacts in a separate artefact
00:17:07
account like I have here for
model registry. Again, this is
00:17:10
an easy way to prevent data
scientists from accidentally
00:17:13
changing production artifacts.
Next, let's bring back the
00:17:17
production and staging accounts
with the deployment setup. This
00:17:20
is exactly the same as in the
previous architecture, just
00:17:22
reduced in size. The
infrastructure in the production
00:17:26
and staging accounts is still
set up automatically through
00:17:28
CloudFormation and CodePipeline,
but CodePipeline sits in a CI/
00:17:32
CD account. Note that I have
built this diagram based on
00:17:36
account structures I have seen
organisations use but your
00:17:39
organisation might use a
different account setup and
00:17:41
that's totally fine. Use this
diagram as an example and adjust
00:17:45
it to your structure and your
needs. Now, let's connect our
00:17:49
model registry to CodePipeline
by using EventBridge exactly
00:17:52
the same as in the previous
architecture. And now we have
00:17:55
all the pieces connected again.
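The glue here is an EventBridge rule that matches the model-approval event emitted by the model registry and starts the deployment pipeline. A sketch under assumed names and ARNs:

```python
import json
import boto3

events = boto3.client("events")

# Fire whenever a model package in the (hypothetical) group is approved.
events.put_rule(
    Name="model-approved-deploy",
    EventPattern=json.dumps(
        {
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker Model Package State Change"],
            "detail": {
                "ModelPackageGroupName": ["churn-model"],
                "ModelApprovalStatus": ["Approved"],
            },
        }
    ),
    State="ENABLED",
)

# Start the CodePipeline release that deploys to staging and then production.
events.put_targets(
    Rule="model-approved-deploy",
    Targets=[
        {
            "Id": "deploy-pipeline",
            "Arn": "arn:aws:codepipeline:eu-west-1:444455556666:churn-deploy-pipeline",
            "RoleArn": "arn:aws:iam::444455556666:role/EventBridgeStartCodePipelineRole",
        }
    ],
)
```

In a multi-account setup like this one, the approval event would typically be forwarded from the artifact account's event bus to the CI/CD account before it reaches CodePipeline.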
Now, I don't know if you noticed,
00:17:58
but one of the basic building
blocks is still missing in this
00:18:01
picture. Hopefully you spotted
it - ECR disappeared for a little
00:18:05
while. So let's bring it back by
placing it in the artefact
00:18:08
account because environments,
especially production
00:18:10
environments are artifacts which
need to be protected. There's
00:18:14
one more change I want to make
to my use of ECR here. In the
00:18:17
previous architecture diagrams,
I assumed that data scientists
00:18:20
were building Docker containers
and registering these containers
00:18:23
in ECR manually. This process
can be simplified and indeed
00:18:27
automated using CodeBuild and
CodePipeline. The data
00:18:30
scientist or machine learning
engineer can still write the
00:18:33
Docker file, but the building
and registration of the
00:18:35
container is performed
automatically. This saves even
00:18:38
more time, so data scientists
can focus on what they do best.
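As a small illustration, once a CodeBuild project exists for that Dockerfile, a retraining workflow or a push to the repository can kick off the build and push to ECR without anyone running docker commands by hand. The project name and variable are assumptions:

```python
import boto3

codebuild = boto3.client("codebuild")

# Hypothetical CodeBuild project whose buildspec builds the training image
# from the Dockerfile in CodeCommit and pushes it to ECR in the artifact account.
codebuild.start_build(
    projectName="build-training-image",
    environmentVariablesOverride=[
        {"name": "IMAGE_TAG", "value": "v7", "type": "PLAINTEXT"},
    ],
)
```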
00:18:42
Of course, in the previous
architecture, I use model
00:18:44
monitor to trigger model
re-training if significant
00:18:47
changes in model behaviour were
detected. So let's bring that
00:18:50
back as well, starting with the
data capture in the staging and
00:18:53
production accounts, followed by
data replication into the
00:18:56
operations account. As before,
model monitor will need a
00:19:00
baseline to compare performance
and the generation of this
00:19:03
baseline can be a step in
SageMaker Pipelines. Finally, I'll
00:19:07
bring back model monitor to
generate reports on drift and
00:19:10
trigger re-training if necessary.
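For a sense of what that involves, the SageMaker Python SDK lets you generate the baseline and then attach a recurring monitoring job to the endpoint. Paths, names, role, and instance sizes below are placeholders:

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::555566667777:role/ModelMonitorRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Run as a step in the retraining pipeline: profile the training data to create the baseline.
monitor.suggest_baseline(
    baseline_dataset="s3://ops-account-data/training/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://ops-account-data/monitoring/baseline/",
)

# Hourly comparison of the captured endpoint traffic against that baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-endpoint-data-quality",
    endpoint_input="churn-model-endpoint",
    output_s3_uri="s3://ops-account-data/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```

The reports and metrics this produces land in S3 and CloudWatch, where an EventBridge rule can pick up a detected drift and start the retraining pipeline, closing the loop described above.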
This leaves us with all of the
00:19:14
components I had in the medium
MLOps diagram. But there are two
00:19:18
more features that I want to
introduce. The first is SageMaker
00:19:22
feature store, which sits
in the artefact account because
00:19:25
features are artifacts which can
be reused. If you remember the
00:19:28
basic data science workflow from
the beginning of the session,
00:19:31
data scientists will normally
perform feature engineering
00:19:34
before training a model, and it
has a large impact on model
00:19:36
performance. In large companies,
00:19:39
there's a good chance that data
scientists will be working on
00:19:41
separate use cases which rely on
the same dataset. A feature
00:19:45
store allows data scientists to
take advantage of features
00:19:48
created by others. It reduces
their workload and also ensures
00:19:52
consistency in the features that
are created from a dataset. The
00:19:56
final component I want to
introduce is SageMaker Clarify.
00:20:00
Clarify can be used by data
scientists in the development
00:20:02
phase to identify bias in
datasets and to generate
00:20:05
explainability reports for
models. This technology is
00:20:08
important for responsible AI. Now
similar to model monitor, Clarify
00:20:13
can also be used to generate
baseline bias and explainability
00:20:16
reports, which can then be
compared to the behaviour of the
00:20:19
model in the endpoint. If
Clarify finds that bias is
00:20:23
increasing or the explainability
results are changing, it can
00:20:26
trigger a re-training of the
model. Now both feature store
00:20:30
and Clarify can be introduced
much earlier in the medium or
00:20:33
even the small MLOps
architectures. It really depends
00:20:36
on the needs of your business.
And I hope you can use these
00:20:38
example architectures to design
an architecture which works for
00:20:41
you. Now, the architecture
diagrams in this session rely
00:20:45
heavily on different components
offered by Amazon SageMaker.
00:20:49
SageMaker provides purpose-
built tools and built-in
00:20:51
integrations with other AWS
services so you can adopt MLOps
00:20:54
practices across your
organisation. Using Amazon
00:20:58
SageMaker, you can build CI/CD
pipelines to reduce model
00:21:01
management overhead, automate
machine learning workflows to
00:21:04
accelerate data preparation,
model building and model
00:21:06
training, monitor the quality of
models by automatically
00:21:10
detecting bias, model drift, and
concept drift, and automatically
00:21:13
track lineage of code, datasets,
and model artifacts for
00:21:16
governance. But if there's one
thing I want you to take away
00:21:20
from this session, it should be
this: MLOps is a journey; you
00:21:24
don't have to immediately adopt
every feature available in a
00:21:27
complicated architecture design.
Start with the basic steps to
00:21:31
integrate versioning and
automation. Evaluate all the
00:21:34
features I introduced in this
session, and order them
00:21:37
according to the needs of your
business, then start adopting
00:21:39
them as and when they're needed.
The architecture diagrams I
00:21:43
presented in the session are not
the only way to implement
00:21:46
MLOps, but I hope they'll provide
some inspiration to you as an
00:21:49
architect. So to help you get
started, I've collected some
00:21:52
useful resources and placed the
links on this slide. You should
00:21:55
be able to download a copy of
these slides so you can access
00:21:58
these links. The resources on
this page will not only provide
00:22:01
advice on the technology behind
MLOps but also on the
00:22:04
people and processes which we
discussed briefly at the start.
00:22:08
If you're interested in any
other topic related to AWS
00:22:10
Cloud, I recommend checking out
Skill Builder online learning
00:22:13
centre. It offers over 500 free
digital courses for any level of
00:22:17
experience. You should also
consider getting certified
00:22:19
through AWS Certifications. If
you enjoyed the topic of this
00:22:23
particular session, I'd
recommend checking out the
00:22:25
Solutions Architect associate
and professional certifications,
00:22:28
as well as the machine learning
specialty certification. And
00:22:32
that's all I have for today.
Thanks for taking the time to
00:22:35
listen to me talk about MLOps,
and I hope this content helps
00:22:38
you in upcoming projects. I just
have one final request and
00:22:41
that's to complete the session
survey. It's the only way for me
00:22:44
to know if you enjoyed this
session and it only takes a
00:22:47
minute. I hope you have a
wonderful day and enjoy the rest
00:22:49
of Summit!