AWS Summit ANZ 2022 - End-to-end MLOps for architects (ARCH3)

00:23:01
https://www.youtube.com/watch?v=UnAN35gu3Rw

Summary

TLDR: In this session, Sara van de Moosdijk from AWS introduces MLOps for architects and developers who do not specialize in machine learning. She covers the challenges organizations face in getting ML models into production and stresses the role of operational practices in automating the machine learning lifecycle. The presentation lays out MLOps architectures for different organizational sizes, from a minimal setup for small teams to multi-account solutions for large enterprises, built from AWS services that support model management, monitoring, and automated retraining. It also highlights the need to integrate people and processes into MLOps and points to practical resources for getting started. Emphasizing that MLOps is a gradual journey, the session encourages architects to adopt components as organizational needs dictate rather than implementing a complete solution all at once.

Key Takeaways

  • 👉 MLOps involves automating and standardizing the machine learning lifecycle.
  • 👉 A proper MLOps setup can reduce bottlenecks and improve productivity.
  • 👉 Monitor machine learning models to maintain their predictive quality over time (a monitoring sketch follows this list).
  • 👉 Utilize AWS services like SageMaker for a robust MLOps architecture.
  • 👉 Start with basic features of MLOps and expand based on needs.
  • 👉 Automate retraining and versioning of models for better model management.
  • 👉 Implement multi-account strategies for better organization of ML workloads.
  • 👉 Model registries help track model versions and associated metadata.
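
The monitoring takeaway above can be made concrete with Amazon SageMaker Model Monitor. The sketch below is illustrative rather than part of the original session: it assumes an endpoint that already has data capture enabled, a baseline CSV in S3, and a suitable execution role, and every name, ARN, and S3 URI is a placeholder.

```python
# Minimal sketch (assumptions above): schedule SageMaker Model Monitor
# against a live endpoint using the SageMaker Python SDK.
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# One-off job that profiles the training data and suggests statistics/constraints.
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/baseline/train.csv",  # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/monitoring/baseline",    # placeholder
)

# Hourly job that compares captured endpoint traffic against the baseline
# and writes a violations report.
monitor.create_monitoring_schedule(
    monitor_schedule_name="example-data-quality",               # placeholder
    endpoint_input="example-endpoint",                          # placeholder
    output_s3_uri="s3://example-bucket/monitoring/reports",     # placeholder
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```

A drift report produced by this schedule can then feed an EventBridge rule that starts re-training, which is the feedback loop the session describes.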

Timeline

  • 00:00:00 - 00:05:00

    Sara van de Moosdijk, a Senior AI/ML Partner Solutions Architect at AWS, introduces the session on end-to-end MLOps for architects, targeting architects and developers with basic ML knowledge. She outlines the agenda, starting with the challenges of deploying machine learning models, defining MLOps, and showcasing architecture diagrams of MLOps setups sized by organizational maturity. She stresses that understanding AWS services and some ML basics is crucial for following along.

  • 00:05:00 - 00:10:00

    The speaker discusses challenges in the standard data science workflow, such as various bottlenecks preventing machine learning models from reaching production. Common reasons include misalignment of strategic objectives, communication gaps among teams, and difficulties in model maintenance. Highlighting the importance of monitoring production models, she emphasizes that operations must be in place to ensure productivity and cost-effectiveness for data scientists. She introduces the concept of MLOps, focusing on standardizing and automating the model lifecycle to enhance performance.

  • 00:10:00 - 00:15:00

    The speaker introduces a minimal MLOps setup suitable for small organizations, focusing on adding version control and automation. Key features include using Git for code versioning, implementing SageMaker Pipelines for automated re-training workflows, and using the SageMaker model registry for better model management (a pipeline sketch follows this timeline). The speaker describes how automation frees data scientists from manual tasks and improves collaboration and the reproducibility of ML models.

  • 00:15:00 - 00:23:01

    The architecture for medium and large MLOps setups is covered, focusing on multi-account strategies for managing deployment, staging, operations, and data accessibility in large organizations. Introducing model monitoring capabilities and feature stores enhances the production system's robustness. Additionally, the large architecture showcases automation of Docker container registration and re-training processes. The session concludes with resources for further learning and emphasizes starting small with MLOps integration.
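
To make the automated re-training workflow from the 10:00-15:00 segment concrete, here is a minimal sketch of a SageMaker Pipeline with a training step and a model-registration step, written with the SageMaker Python SDK. It is an illustrative example rather than the speaker's code: the training image, data locations, role ARN, and model package group name are all placeholders.

```python
# Minimal sketch (assumptions above): train a model and register the new
# version in the SageMaker model registry as "PendingManualApproval".
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import TrainingStep

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.ap-southeast-2.amazonaws.com/training:latest",  # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/models",                   # placeholder
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://example-bucket/data/train")},  # placeholder
)

register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="example-models",                  # placeholder
    approval_status="PendingManualApproval",
)

pipeline = Pipeline(name="example-retraining", steps=[train_step, register_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # in practice triggered by EventBridge on a schedule
```

Starting this pipeline from an EventBridge schedule, or from a Model Monitor drift alert, gives the automated loop the session describes.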

Video Q&A

  • What is MLOps?

    MLOps is a set of operational practices that automates and standardizes the machine learning lifecycle, including building, training, deploying, and monitoring models.

  • What are the components of an effective MLOps architecture?

    Key components include version control, automated re-training pipelines, model registries, deployment strategies, and monitoring systems.

  • Why do ML models need to be monitored?

    Models require monitoring to ensure their predictions remain accurate over time, as they can degrade in quality due to changes in the underlying data.

  • What AWS services can be used for MLOps?

    AWS services such as S3, SageMaker, Lambda, EventBridge, CodeCommit, ECR, and CodePipeline can be utilized to build a robust MLOps architecture.

  • How can I start implementing MLOps in my organization?

    Begin by integrating versioning and automation in your processes, then evaluate and prioritize MLOps features based on your organization's needs.

  • What are the challenges in deploying ML models?

    Common challenges include misalignment with business objectives, lack of communication between teams, and difficulty in maintaining existing models.

  • What is a model registry?

    A model registry helps manage machine learning models and their versions, storing metadata alongside the models for comparison and tracking.

  • What is the purpose of canary deployment?

    Canary deployment is used to gradually roll out a new model to minimize risk, monitoring for errors before fully shifting user traffic (see the sketch after this list).
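
The canary behaviour described in the answer above maps to SageMaker's deployment guardrails. The snippet below is a hedged sketch using boto3, assuming the endpoint, the new endpoint configuration, and a CloudWatch alarm already exist; their names are placeholders.

```python
# Minimal sketch (assumptions above): shift traffic to a new model gradually
# and roll back automatically if a CloudWatch alarm fires during the rollout.
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="example-endpoint",           # placeholder
    EndpointConfigName="example-config-v2",    # placeholder: points at the new model
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                # Send 10% of capacity to the new model first...
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                # ...then wait ten minutes before shifting the remaining traffic.
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "example-endpoint-errors"}]  # placeholder
        },
    },
)
```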

Transcript

  • 00:00:14
    Hi, welcome to this session on end-to-end MLOps for
  • 00:00:17
    architects. My name is Sara van de Moosdijk, but you can call me
  • 00:00:20
    Moose. I'm a Senior AI/ML Partner Solutions Architect at
  • 00:00:24
    AWS. Now, my goal for this session today is to help
  • 00:00:28
    architects and developers, especially those of you not
  • 00:00:31
    specialised in machine learning, to design an MLOps architecture
  • 00:00:34
    for your organisation. I will introduce the different
  • 00:00:37
    components in an effective MLOps setup, and explain why
  • 00:00:40
    these components are necessary without diving too deep into the
  • 00:00:44
    details which only data scientists need to know. Now,
  • 00:00:47
    before watching this session, you should be comfortable with
  • 00:00:50
    architecting on AWS, and know how to use popular services like
  • 00:00:53
    S3, Lambda, EventBridge, CloudFormation, and so on. If
  • 00:00:57
    you're unfamiliar with these services, I recommend watching
  • 00:01:00
    some of the other architecture sessions and reading up on these
  • 00:01:03
    topics, then come back to watch the session. It will also help
  • 00:01:07
    if you have a basic understanding of machine
  • 00:01:08
    learning and how it works. Okay, let's look at what you can
  • 00:01:13
    expect. I'll start off the session with a brief overview of
  • 00:01:17
    the challenges many companies face when deploying machine
  • 00:01:20
    learning models and maintaining them in production. We will then
  • 00:01:23
    briefly define MLOps, before diving straight into some
  • 00:01:26
    architecture diagrams. Specifically, I have designed
  • 00:01:30
    the architecture diagrams according to t-shirt sizes from
  • 00:01:32
    small to large. This will allow you to choose your starting
  • 00:01:36
    point based on the size and maturity of your organisation.
  • 00:01:40
    And finally, I'll end the session with some advice for
  • 00:01:42
    starting your own MLOps journey. First, I want to
  • 00:01:47
    quickly go over the machine learning process to make sure
  • 00:01:49
    that we're all on the same page. Normally, you'd start your
  • 00:01:52
    machine learning process because you have a business problem
  • 00:01:55
    which needs to be solved, and you've determined that machine
  • 00:01:57
    learning is the correct solution. Then a data scientist
  • 00:02:01
    will spend quite a bit of time collecting the data required,
  • 00:02:04
    integrating data from various sources, cleaning the data, and
  • 00:02:07
    analysing the data. Next, the data scientists will start the
  • 00:02:11
    process of engineering features, training and tuning different
  • 00:02:14
    machine learning models and evaluating the performance of
  • 00:02:17
    these models. And then based on these results, the data
  • 00:02:20
    scientist might go back to collect more data or perform
  • 00:02:23
    additional data cleaning steps. But assuming that the models are
  • 00:02:26
    performing well, he or she would then go ahead and deploy the
  • 00:02:29
    model so it can be used to generate predictions. The
  • 00:02:33
    final step and certainly a crucial one is to monitor the
  • 00:02:36
    model that is in production. Much like a new car which
  • 00:02:40
    depreciates in value as soon as you drive it off the lot, a
  • 00:02:43
    machine learning model is out of date, as soon as you've trained
  • 00:02:45
    it. The world is constantly changing and evolving, so the
  • 00:02:49
    older your model gets, the worse it gets at making predictions.
  • 00:02:52
    By monitoring the quality of your model, you will know when
  • 00:02:55
    it's time to retrain or perhaps gather new data for it. Now,
  • 00:03:00
    every data scientist will follow a process along these lines and
  • 00:03:03
    when they're just starting out this process will likely be
  • 00:03:06
    entirely manual. But as businesses embrace machine
  • 00:03:09
    learning across their organisations, manual workflows
  • 00:03:12
    for building, training, deploying tend to become bottlenecks to
  • 00:03:15
    innovation. Even a single machine learning model in
  • 00:03:18
    production needs to be monitored, managed and retrained
  • 00:03:21
    to maintain the quality of its predictions. Without the right
  • 00:03:24
    operational practices in place, these challenges can negatively
  • 00:03:27
    impact data scientist productivity, model performance
  • 00:03:30
    and costs. So to illustrate what I mean let's take a look at some
  • 00:03:35
    architecture diagrams. Here I have a pretty standard data
  • 00:03:40
    science setup. The data scientist has been given access
  • 00:03:42
    to an AWS account within a SageMaker Studio domain, where they
  • 00:03:46
    can use Jupyter Notebooks to develop their machine learning
  • 00:03:48
    models. Data might be pulled from S3, RDS, Redshift, Glue, any
  • 00:03:54
    number of data related AWS services. The models produced by
  • 00:03:58
    the data scientist are then stored in an S3 bucket. So far,
  • 00:04:02
    so good. Unfortunately, this is often where it stops. Gartner
  • 00:04:06
    estimates that only 53% of machine learning POCs actually
  • 00:04:11
    make it into production. And there are various reasons for
  • 00:04:13
    this. Often there's a misalignment between strategic
  • 00:04:17
    objectives of the company and the machine learning models
  • 00:04:19
    being built by the data scientists. There might be a
  • 00:04:22
    lack of communication between DevOps, security, legal, IT, and
  • 00:04:26
    the data scientists, which can also be a common
  • 00:04:28
    challenge blocking models from reaching production. And
  • 00:04:31
    finally, if the company already has a few models in production,
  • 00:04:35
    a data scientist team can struggle to maintain those
  • 00:04:37
    existing models, while also pushing out new models. But what
  • 00:04:42
    if the model does make it into production? Let's assume the
  • 00:04:45
    data scientist spins up a SageMaker endpoint to host the
  • 00:04:48
    model. And the developer of an application is able to connect
  • 00:04:52
    to this model and generate predictions through API Gateway
  • 00:04:55
    connecting to a Lambda function, which calls the SageMaker
  • 00:04:58
    endpoint. So what challenges do you see with this architecture?
  • 00:05:04
    Well first, any changes to the machine learning model requires
  • 00:05:07
    manual actions by the data scientist in the form of re-
  • 00:05:10
    running cells in a Jupyter Notebook, right. Second, the
  • 00:05:14
    code which the data scientist produces is stuck in these
  • 00:05:17
    Jupyter Notebooks, which are difficult to version and
  • 00:05:19
    difficult to automate. Third, the data scientist might have
  • 00:05:23
    forgotten to turn on auto-scaling for the SageMaker
  • 00:05:26
    endpoint, so it cannot adjust capacity according to the number
  • 00:05:29
    of requests coming in. And finally, there's no feedback
  • 00:05:32
    loop. If the quality of the model deteriorates, you would
  • 00:05:35
    only find out through complaints from disgruntled users. These
  • 00:05:38
    are just some of the challenges which can be avoided with a
  • 00:05:41
    proper MLOps setup. So what is MLOps? Well, MLOps is a set of
  • 00:05:47
    operational practices to automate and standardise model
  • 00:05:51
    building, training, deployment, monitoring, management, and
  • 00:05:54
    governance. It can help companies streamline the end-to-
  • 00:05:58
    end machine learning lifecycle, and boost productivity of data
  • 00:06:01
    scientists and MLOps teams, while maintaining high model
  • 00:06:04
    accuracy, and enhancing security and compliance. So the key
  • 00:06:09
    phrase in that previous definition is operational
  • 00:06:12
    practices. MLOps, similar to DevOps, is more than just a set
  • 00:06:16
    of technologies or services. You need the right people, with the
  • 00:06:19
    right skills, following the same standardised processes to
  • 00:06:23
    successfully operate machine learning at scale. The
  • 00:06:26
    technology exists to facilitate these processes and make the job
  • 00:06:30
    easier for the people. Now in this session, I will focus on
  • 00:06:33
    the technology and specifically which AWS services we can use to
  • 00:06:37
    build a successful setup. But I want you to keep in mind that
  • 00:06:41
    the architectures provided in this session will only work if
  • 00:06:43
    you have the right teams, and if those teams are willing to
  • 00:06:46
    establish and follow MLOps processes. Now if you want to
  • 00:06:50
    learn more about the people and process aspects of MLOps, I
  • 00:06:54
    will include links to useful resources at the end of this
  • 00:06:56
    session. So without further ado, let's dive deep with some
  • 00:07:01
    architecture diagrams. Now MLOps can be quite complicated,
  • 00:07:04
    with lots of features and technologies which you could
  • 00:07:07
    choose to adopt, but you don't have to adopt all of it
  • 00:07:09
    immediately. So to start off, I'll give you an example of a
  • 00:07:13
    minimal MLOps setup. This would be suitable for a small company
  • 00:07:17
    or a small data science team of one to three people working on
  • 00:07:21
    just a couple of use cases. So let's take a look. We'll start
  • 00:07:27
    off with the same architecture we looked at previously, only
  • 00:07:30
    reduced in size so I can create more space on the slide. A data
  • 00:07:33
    scientist accesses Jupyter Notebooks through SageMaker
  • 00:07:36
    Studio, accesses data from any of various data sources, and
  • 00:07:40
    stores any machine learning models they create in S3. One of
  • 00:07:45
    the challenges I mentioned previously is that the code is
  • 00:07:47
    stuck in Jupyter notebooks and can be difficult to version and
  • 00:07:49
    automate. So the first step would be to add more versioning
  • 00:07:53
    to this architecture. You can use CodeCommit or any other Git-
  • 00:07:56
    based repository to store code. And you can use Amazon Elastic
  • 00:08:00
    Container Registry or ECR to store Docker containers, thereby
  • 00:08:04
    versioning the environments which were used to train the
  • 00:08:06
    machine learning models. By versioning the code, the
  • 00:08:09
    environments and the model artifacts, you improve your
  • 00:08:12
    ability to reproduce models and collaborate with others. Next,
  • 00:08:16
    let's talk about automation. Another challenge I mentioned
  • 00:08:19
    previously is that the data scientists are manually
  • 00:08:22
    re-training models instead of focussing on developing new
  • 00:08:24
    models. To solve this, you want to set up automatic re-training
  • 00:08:28
    pipelines. In this architecture, I use SageMaker Pipelines, but
  • 00:08:32
    you could also use Step Functions or Airflow to build
  • 00:08:34
    these repeatable workflows. The re-training pipeline built by the
  • 00:08:38
    data scientist or by a machine learning engineer, will use the
  • 00:08:41
    versioned code and environments to perform data pre-processing,
  • 00:08:44
    model training, model verification, and eventually
  • 00:08:48
    save the new model artifacts to S3. It can use various services
  • 00:08:52
    to complete these steps including SageMaker processing
  • 00:08:54
    or training jobs, EMR, or Lambda. But in order to automate
  • 00:08:59
    this pipeline, we need a trigger. One option is to use
  • 00:09:04
    EventBridge to trigger the pipeline based on a schedule.
  • 00:09:07
    Another option is to have someone manually trigger the
  • 00:09:09
    pipeline. Both triggers are useful in different contexts
  • 00:09:13
    and I'll introduce more triggers as we progress through these
  • 00:09:16
    slides. So now that we have an automated re-training pipeline, I
  • 00:09:20
    want to introduce another important concept in MLOps, and
  • 00:09:23
    that's the model registry. While S3 provides some versioning and
  • 00:09:27
    object locking functionality, which is useful for storing
  • 00:09:29
    different models, a model registry helps to manage these
  • 00:09:32
    models and their versions. SageMaker model registry allows you
  • 00:09:36
    to store metadata alongside your models, including the values of
  • 00:09:40
    hyperparameters and evaluation metrics, or even the bias and
  • 00:09:43
    explainability reports. This enables you to quickly view and
  • 00:09:47
    compare different versions of a model and to approve or reject a
  • 00:09:50
    model version for production. Now the actual artifacts are
  • 00:09:54
    still stored in S3, but model registry sits on top of that as
  • 00:09:58
    an additional layer.
  • 00:10:00
    Finally, we reach the deployment stage. At first
  • 00:10:03
    glance, this might look very different from what we saw
  • 00:10:05
    earlier in the session. But the setup is actually very similar.
  • 00:10:08
    I still have machine learning models deployed on real-time
  • 00:10:11
    SageMaker endpoints connected to Lambda and API Gateway to
  • 00:10:15
    communicate with an application. The main difference now is that
  • 00:10:19
    I have autoscaling set up for my SageMaker endpoints. So if
  • 00:10:22
    there's an unexpected spike in users, the endpoints can scale
  • 00:10:25
    up to handle the requests and scale back down when the usage
  • 00:10:29
    falls. Now one nice feature of SageMaker endpoints is that you
  • 00:10:32
    can replace the machine learning model without endpoint downtime.
  • 00:10:36
    Since I now have an automated re-training pipeline creating
  • 00:10:39
    new models, and a model registry where I can approve models, it
  • 00:10:42
    would be best if the deployment of the new models is automated
  • 00:10:45
    as well. I can achieve this by building a Lambda function,
  • 00:10:49
    which triggers when a new model is approved to fetch that model,
  • 00:10:53
    and then update the endpoint with it. So now we have
  • 00:10:56
    connected all the pieces and there's one final feature that I
  • 00:10:59
    will take advantage of. Not only can I update the machine
  • 00:11:02
    learning models hosted by the endpoints, but I can actually do
  • 00:11:04
    so gradually using a canary deployment. This means that a
  • 00:11:08
    small portion of the user requests will be diverted to the
  • 00:11:11
    new model, and any errors or issues will trigger a CloudWatch
  • 00:11:15
    alarm to inform me. Over time, the number of requests
  • 00:11:18
    sent to the new model will increase until the new model
  • 00:11:21
    gets 100% of the traffic. So I hope this architecture makes
  • 00:11:24
    sense. I started with a very basic setup, and by adding a few
  • 00:11:28
    features and services, I now have a serviceable MLOps
  • 00:11:31
    setup. My deployment strategy is more robust by using auto-
  • 00:11:34
    scaling and canary deployment, my data scientists save time by
  • 00:11:38
    automating model training, and every artefact is properly
  • 00:11:41
    versioned. But as your data science team grows, this
  • 00:11:45
    architecture won't be sufficient. So let's look at a
  • 00:11:47
    slightly more complicated architecture. The next
  • 00:11:52
    architecture will be more suitable for a growing data
  • 00:11:55
    science team of between three to 10 data scientists working on
  • 00:11:59
    several different use cases at a larger company. So again, let's
  • 00:12:03
    start with the basics. Our data scientists work in notebooks
  • 00:12:06
    through SageMaker Studio, pulling from various data
  • 00:12:08
    sources and versioning their code environments and model
  • 00:12:11
    artifacts. This should look familiar. Also bring back the
  • 00:12:16
    automated re-training pipeline. Nothing has changed here, I've
  • 00:12:19
    only made it smaller to create more room on the slide. And
  • 00:12:22
    finally, I'll bring back EventBridge to schedule the
  • 00:12:24
    re-training pipeline and model registry for storing model
  • 00:12:27
    metadata and approving model versions. All of this is exactly
  • 00:12:32
    the same as in the previous architecture diagram. So what
  • 00:12:34
    about deployment? Well, this is where things change a little. So
  • 00:12:38
    I have the same deployment setup with SageMaker endpoints and an
  • 00:12:42
    autoscaling group connected to Lambda and API Gateway to allow
  • 00:12:45
    users to submit inference requests. However, these
  • 00:12:48
    deployment services now sit in a separate AWS account. A multi-
  • 00:12:53
    account strategy is highly recommended, because this allows
  • 00:12:55
    you to separate different business units, easily define
  • 00:12:58
    separate restrictions for important production workloads,
  • 00:13:01
    and have a fine-grained view of the costs incurred by each
  • 00:13:04
    component of your architecture. The different accounts are best
  • 00:13:08
    managed through AWS Organizations. Now, data
  • 00:13:12
    scientists should not have access to the production
  • 00:13:14
    account. This reduces the chance of mistakes being made on that
  • 00:13:17
    account, which directly affects your users. In fact, a multi-
  • 00:13:21
    account strategy for machine learning usually has a separate
  • 00:13:24
    staging account alongside the production account. Any new
  • 00:13:27
    models are first deployed to the staging account, tested and only
  • 00:13:31
    then deployed on the production account. So if the data
  • 00:13:34
    scientist cannot access these accounts, clearly, the
  • 00:13:36
    deployment must happen automatically. All of the
  • 00:13:40
    services deployed into the staging and production accounts
  • 00:13:42
    are set up automatically using CloudFormation, controlled by
  • 00:13:45
    CodePipeline in the development account. The next step is to set
  • 00:13:49
    up a trigger for CodePipeline. And we can do so using EventBridge.
  • 00:13:52
    So when a model version is approved in model registry,
  • 00:13:57
    this will generate an event which can be used to trigger
  • 00:13:59
    deployment via CodePipeline. So now everything's connected
  • 00:14:03
    again, and this is starting to look like a proper MLOps
  • 00:14:06
    setup. But I'm sure you've noticed I have plenty of space
  • 00:14:08
    left on this slide. So let's add another feature which becomes
  • 00:14:11
    crucial when you have multiple models running in production for
  • 00:14:14
    extended periods of time - that's model monitor. The goal of
  • 00:14:18
    monitoring machine learning models in production is to
  • 00:14:21
    detect a change in behaviour or accuracy. To start, I enabled
  • 00:14:25
    data capture on the endpoints in the staging and production
  • 00:14:28
    accounts. This captures the incoming requests and outgoing
  • 00:14:31
    inference results and stores them in S3 buckets. If you have
  • 00:14:36
    a model monitoring use case, which doesn't require labelling
  • 00:14:39
    the incoming requests, then you could run the whole process
  • 00:14:41
    directly on your staging and production accounts. But in this
  • 00:14:44
    case, I assume the data needs to be combined with labels or other
  • 00:14:48
    data that's on the development account. So I use S3 replication
  • 00:14:51
    to move the data onto an S3 bucket in the development account.
  • 00:14:56
    Now, in order to tell if the behaviour of the model or the
  • 00:14:59
    data has changed, we need something to compare it to.
  • 00:15:02
    That's where the model baseline comes in. During the training
  • 00:15:05
    process as part of the automated re-training pipeline, we can
  • 00:15:08
    generate a baseline dataset, which records the expected
  • 00:15:11
    behaviour of the data and the model. So that gives me all the
  • 00:15:15
    components I need to set up SageMaker model monitor, which will
  • 00:15:18
    compare the two datasets and generate a report. The final
  • 00:15:22
    step in this architecture is to take action based on the results
  • 00:15:25
    of the model monitoring report. And we can do this by sending an
  • 00:15:28
    event to EventBridge to trigger the re-training pipeline when a
  • 00:15:31
    significant change has been detected. And that's it for the
  • 00:15:34
    medium MLOps architecture! It contains a lot
  • 00:15:37
    of the same features used in the small architecture, but it
  • 00:15:40
    expands to a multi-account setup, and adds model monitoring
  • 00:15:43
    for extra quality checks on the models in production. Hopefully,
  • 00:15:48
    you're now wondering what a large MLOps architecture looks
  • 00:15:51
    like and how I can possibly fit more features onto a single
  • 00:15:54
    slide. So let's take a look at that now. This architecture is
  • 00:15:57
    suitable for companies with large data science teams of 10
  • 00:16:00
    or more people and with machine learning integrated throughout
  • 00:16:03
    the business. Of course, I start with the same basic setup I had
  • 00:16:09
    last time but reduced in size again. The data scientist is
  • 00:16:12
    still using SageMaker Studio through a development account,
  • 00:16:15
    and stores model artifacts and code in S3 and CodeCommit
  • 00:16:18
    respectively. The data sources are also present, but data is
  • 00:16:22
    now stored in a separate account. It's a common strategy
  • 00:16:25
    to have your data lakes set up in one account with fine-grained
  • 00:16:28
    access controls to determine which datasets can be accessed
  • 00:16:31
    by resources in other accounts. Really, the larger a company
  • 00:16:35
    becomes the more AWS accounts they tend to use, all managed
  • 00:16:38
    through AWS Organizations. So let's continue this trend by
  • 00:16:42
    bringing back the automated re-training pipeline in a
  • 00:16:45
    separate operations account. And let's bring back model registry
  • 00:16:48
    as well in yet another account. All of the components are the
  • 00:16:51
    same as in the small and medium architecture diagrams, but just
  • 00:16:54
    split across more accounts. The operations account is normally
  • 00:16:58
    used for any automated workflows which don't require manual
  • 00:17:01
    intervention by the data scientists. It's also good
  • 00:17:04
    practice to store all of your artifacts in a separate artefact
  • 00:17:07
    account like I have here for model registry. Again, this is
  • 00:17:10
    an easy way to prevent data scientists from accidentally
  • 00:17:13
    changing production artifacts. Next, let's bring back the
  • 00:17:17
    production and staging accounts with the deployment setup. This
  • 00:17:20
    is exactly the same as in the previous architecture, just
  • 00:17:22
    reduced in size. The infrastructure in the production
  • 00:17:26
    and staging accounts is still set up automatically through
  • 00:17:28
    CloudFormation and CodePipeline, but CodePipeline sits in a
  • 00:17:32
    CI/CD account. Note that I have built this diagram based on
  • 00:17:36
    account structures I have seen organisations use but your
  • 00:17:39
    organisation might use a different account setup and
  • 00:17:41
    that's totally fine. Use this diagram as an example and adjust
  • 00:17:45
    it to your structure and your needs. Now, let's connect our
  • 00:17:49
    model registry to CodePipeline by using EventBridge exactly
  • 00:17:52
    the same as in the previous architecture. And now we have
  • 00:17:55
    all the pieces connected again. But I don't know if you noticed,
  • 00:17:58
    but one of the basic building blocks is still missing in this
  • 00:18:01
    picture. Hopefully you spotted it - ECR disappeared for a little
  • 00:18:05
    while. So let's bring it back by placing it in the artefact
  • 00:18:08
    account because environments, especially production
  • 00:18:10
    environments are artifacts which need to be protected. There's
  • 00:18:14
    one more change I want to make to my use of ECR here. In the
  • 00:18:17
    previous architecture diagrams, I assumed that data scientists
  • 00:18:20
    were building Docker containers and registering these containers
  • 00:18:23
    in ECR manually. This process can be simplified and indeed
  • 00:18:27
    automated using CodeBuild and CodePipeline. The data
  • 00:18:30
    scientist or machine learning engineer can still write the
  • 00:18:33
    Docker file, but the building and registration of the
  • 00:18:35
    container is performed automatically. This saves even
  • 00:18:38
    more time, so data scientists can focus on what they do best.
  • 00:18:42
    Of course, in the previous architecture, I use model
  • 00:18:44
    monitor to trigger model re-training if significant
  • 00:18:47
    changes in model behaviour were detected. So let's bring that
  • 00:18:50
    back as well, starting with the data capture in the staging and
  • 00:18:53
    production accounts, followed by data replication into the
  • 00:18:56
    operations account. As before, model monitor will need a
  • 00:19:00
    baseline to compare performance and the generation of this
  • 00:19:03
    baseline can be a step in SageMaker Pipelines. Finally, I'll
  • 00:19:07
    bring back model monitor to generate reports on drift and
  • 00:19:10
    trigger re-training if necessary. This leaves us with all of the
  • 00:19:14
    components I had in the medium MLOps diagram. But there's two
  • 00:19:18
    more features that I want to introduce. The first is SageMaker
  • 00:19:22
    feature store, which sits in the artefact account because
  • 00:19:25
    features are artifacts which can be reused. If you remember the
  • 00:19:28
    basic data science workflow from the beginning of the session,
  • 00:19:31
    data scientists will normally perform feature engineering
  • 00:19:34
    before training a model, and it has a large impact on model
  • 00:19:36
    performance. In large companies,
  • 00:19:39
    there's a good chance that data scientists will be working on
  • 00:19:41
    separate use cases which rely on the same dataset. A feature
  • 00:19:45
    store allows data scientists to take advantage of features
  • 00:19:48
    created by others. It reduces their workload and also ensures
  • 00:19:52
    consistency in the features that are created from a dataset. The
  • 00:19:56
    final component I want to introduce is SageMaker Clarify.
  • 00:20:00
    Clarify can be used by data scientists in the development
  • 00:20:02
    phase to identify bias in datasets and to generate
  • 00:20:05
    explainability reports for models. This technology is
  • 00:20:08
    important for responsible AI. Now similar to model monitor, Clarify
  • 00:20:13
    can also be used to generate baseline bias and explainability
  • 00:20:16
    reports, which can then be compared to the behaviour of the
  • 00:20:19
    model in the endpoint. If Clarify finds that bias is
  • 00:20:23
    increasing or the explainability results are changing, it can
  • 00:20:26
    trigger a re-training of the model. Now both feature store
  • 00:20:30
    and Clarify can be introduced much earlier in the medium or
  • 00:20:33
    even the small MLOps architectures. It really depends
  • 00:20:36
    on the needs of your business. And I hope you can use these
  • 00:20:38
    example architectures to design an architecture which works for
  • 00:20:41
    you. Now, the architecture diagrams in this session rely
  • 00:20:45
    heavily on different components offered by Amazon SageMaker.
  • 00:20:49
    SageMaker provides purpose-built tools and built-in
  • 00:20:51
    integrations with other AWS services so you can adopt MLOps
  • 00:20:54
    practices across your organisation. Using Amazon
  • 00:20:58
    SageMaker, you can build CI/CD pipelines to reduce model
  • 00:21:01
    management overhead, automate machine learning workflows to
  • 00:21:04
    accelerate data preparation, model building and model
  • 00:21:06
    training, monitor the quality of models by automatically
  • 00:21:10
    detecting bias, model drift, and concept drift, and automatically
  • 00:21:13
    track lineage of code, datasets, and model artifacts for
  • 00:21:16
    governance. But if there's one thing I want you to take away
  • 00:21:20
    from this session, it should be this: MLOps is a journey, you
  • 00:21:24
    don't have to immediately adopt every feature available in a
  • 00:21:27
    complicated architecture design. Start with the basic steps to
  • 00:21:31
    integrate versioning and automation. Evaluate all the
  • 00:21:34
    features I introduced in this session, and order them
  • 00:21:37
    according to the needs of your business, then start adopting
  • 00:21:39
    them as and when it's needed. The architecture diagrams I
  • 00:21:43
    presented in the session are not the only way to implement
  • 00:21:46
    MLOps, but I hope they'll provide some inspiration to you as an
  • 00:21:49
    architect. So to help you get started, I've collected some
  • 00:21:52
    useful resources and placed the links on this slide. You should
  • 00:21:55
    be able to download a copy of these slides so you can access
  • 00:21:58
    these links. The resources on this page will not only provide
  • 00:22:01
    advice on the technology behind MLOps but also on the
  • 00:22:04
    people and processes which we discussed briefly at the start.
  • 00:22:08
    If you're interested in any other topic related to AWS
  • 00:22:10
    Cloud, I recommend checking out Skill Builder online learning
  • 00:22:13
    centre. It offers over 500 free digital courses for any level of
  • 00:22:17
    experience. You should also consider getting certified
  • 00:22:19
    through AWS Certifications. If you enjoyed the topic of this
  • 00:22:23
    particular session, I'd recommend checking out the
  • 00:22:25
    Solutions Architect associate and professional certifications,
  • 00:22:28
    as well as the machine learning specialty certification. And
  • 00:22:32
    that's all I have for today. Thanks for taking the time to
  • 00:22:35
    listen to me talk about MLOps, and I hope this content helps
  • 00:22:38
    you in upcoming projects. I just have one final request and
  • 00:22:41
    that's to complete the session survey. It's the only way for me
  • 00:22:44
    to know if you enjoyed this session and it only takes a
  • 00:22:47
    minute. I hope you have a wonderful day and enjoy the rest
  • 00:22:49
    of Summit!
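
The talk describes a Lambda function that fires when a model version is approved in the model registry and then swaps the model behind the live endpoint without downtime. The sketch below is one possible implementation, not the speaker's code: it assumes an EventBridge rule on the "SageMaker Model Package State Change" event invokes the function, that the event detail carries the package ARN and approval status, and that the role ARN and endpoint name arrive as environment variables; all names are placeholders.

```python
# Minimal sketch (assumptions above): deploy a newly approved model package
# to an existing SageMaker endpoint from a Lambda handler.
import os
import time

import boto3

sm = boto3.client("sagemaker")

ROLE_ARN = os.environ["SAGEMAKER_ROLE_ARN"]   # placeholder environment variables
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]


def handler(event, context):
    # Field names assumed from the documented "SageMaker Model Package
    # State Change" event shape.
    detail = event["detail"]
    if detail.get("ModelApprovalStatus") != "Approved":
        return {"skipped": True}

    package_arn = detail["ModelPackageArn"]
    suffix = str(int(time.time()))

    # A SageMaker Model can reference the registered model package directly.
    model_name = f"approved-model-{suffix}"
    sm.create_model(
        ModelName=model_name,
        PrimaryContainer={"ModelPackageName": package_arn},
        ExecutionRoleArn=ROLE_ARN,
    )

    # New endpoint configuration pointing at that model...
    config_name = f"approved-config-{suffix}"
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",   # placeholder
        }],
    )

    # ...and an in-place update of the existing endpoint (no downtime).
    sm.update_endpoint(EndpointName=ENDPOINT_NAME, EndpointConfigName=config_name)
    return {"deployed": config_name}
```

In the medium and large architectures the same approval event would instead start CodePipeline, which deploys the staging and production stacks through CloudFormation.
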
Tags
  • MLOps
  • AWS
  • Machine Learning
  • Architecture
  • Model Deployment
  • Automation
  • Monitoring
  • Version Control
  • Data Science
  • SageMaker