AWS Summit ANZ 2022 - End-to-end MLOps for architects (ARCH3)

00:23:01
https://www.youtube.com/watch?v=UnAN35gu3Rw

Summary

TLDR: In this session, Sara van de Moosdijk from AWS introduces MLOps for architects and developers who do not specialize in machine learning. She covers the challenges organizations face in getting ML models into production and stresses the role of operational practices in automating the machine learning lifecycle. The presentation lays out MLOps architectures for different organizational sizes, from a minimal setup for small teams to multi-account solutions for large enterprises, built from AWS services that support model management, monitoring, and automated retraining. It also highlights the need to integrate people and processes into MLOps and points to practical resources for getting started. Emphasizing that MLOps is a gradual journey, the session encourages architects to adopt components as organizational needs dictate rather than implementing a complete solution all at once.

Key Takeaways

  • 👉 MLOps involves automating and standardizing the machine learning lifecycle.
  • 👉 A proper MLOps setup can reduce bottlenecks and improve productivity.
  • 👉 Monitor machine learning models to maintain their predictive quality over time (a monitoring sketch follows this list).
  • 👉 Utilize AWS services like SageMaker for a robust MLOps architecture.
  • 👉 Start with basic features of MLOps and expand based on needs.
  • 👉 Automate retraining and versioning of models for better model management.
  • 👉 Implement multi-account strategies for better organization of ML workloads.
  • 👉 Model registries help track model versions and associated metadata.
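
The monitoring takeaway above can be made concrete with Amazon SageMaker Model Monitor. The sketch below is illustrative rather than part of the original session: it assumes an endpoint that already has data capture enabled, a baseline CSV in S3, and a suitable execution role, and every name, ARN, and S3 URI is a placeholder.

```python
# Minimal sketch (assumptions above): schedule SageMaker Model Monitor
# against a live endpoint using the SageMaker Python SDK.
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# One-off job that profiles the training data and suggests statistics/constraints.
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/baseline/train.csv",  # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/monitoring/baseline",    # placeholder
)

# Hourly job that compares captured endpoint traffic against the baseline
# and writes a violations report.
monitor.create_monitoring_schedule(
    monitor_schedule_name="example-data-quality",               # placeholder
    endpoint_input="example-endpoint",                          # placeholder
    output_s3_uri="s3://example-bucket/monitoring/reports",     # placeholder
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```

A drift report produced by this schedule can then feed an EventBridge rule that starts re-training, which is the feedback loop the session describes.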

Timeline

  • 00:00:00 - 00:05:00

    Sara van de Moosdijk, a Senior AI/ML Partner Solutions Architect at AWS, introduces the session on end-to-end MLOps for architects, targeting architects and developers with basic ML knowledge. She outlines the agenda, starting with the challenges of deploying machine learning models, defining MLOps, and showcasing architecture diagrams of MLOps setups sized by organizational maturity. She stresses that understanding AWS services and some ML basics is crucial for following along.

  • 00:05:00 - 00:10:00

    The speaker discusses challenges in the standard data science workflow, such as various bottlenecks preventing machine learning models from reaching production. Common reasons include misalignment of strategic objectives, communication gaps among teams, and difficulties in model maintenance. Highlighting the importance of monitoring production models, she emphasizes that operations must be in place to ensure productivity and cost-effectiveness for data scientists. She introduces the concept of MLOps, focusing on standardizing and automating the model lifecycle to enhance performance.

  • 00:10:00 - 00:15:00

    The speaker introduces a minimal MLOps setup suitable for small organizations, focusing on adding version control and automation. Key features include using Git for code versioning, implementing SageMaker Pipelines for automated re-training workflows, and using the SageMaker model registry for better model management (a pipeline sketch follows this timeline). The speaker describes how automation frees data scientists from manual tasks and improves collaboration and the reproducibility of ML models.

  • 00:15:00 - 00:23:01

    The architecture for medium and large MLOps setups is covered, focusing on multi-account strategies for managing deployment, staging, operations, and data accessibility in large organizations. Introducing model monitoring capabilities and feature stores enhances the production system's robustness. Additionally, the large architecture showcases automation of Docker container registration and re-training processes. The session concludes with resources for further learning and emphasizes starting small with MLOps integration.
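
To make the automated re-training workflow from the 10:00-15:00 segment concrete, here is a minimal sketch of a SageMaker Pipeline with a training step and a model-registration step, written with the SageMaker Python SDK. It is an illustrative example rather than the speaker's code: the training image, data locations, role ARN, and model package group name are all placeholders.

```python
# Minimal sketch (assumptions above): train a model and register the new
# version in the SageMaker model registry as "PendingManualApproval".
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import TrainingStep

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.ap-southeast-2.amazonaws.com/training:latest",  # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/models",                   # placeholder
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://example-bucket/data/train")},  # placeholder
)

register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="example-models",                  # placeholder
    approval_status="PendingManualApproval",
)

pipeline = Pipeline(name="example-retraining", steps=[train_step, register_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # in practice triggered by EventBridge on a schedule
```

Starting this pipeline from an EventBridge schedule, or from a Model Monitor drift alert, gives the automated loop the session describes.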

Video Q&A

  • What is MLOps?

    MLOps is a set of operational practices that automates and standardizes the machine learning lifecycle, including building, training, deploying, and monitoring models.

  • What are the components of an effective MLOps architecture?

    Key components include version control, automated re-training pipelines, model registries, deployment strategies, and monitoring systems.

  • Why do ML models need to be monitored?

    Models require monitoring to ensure their predictions remain accurate over time, as they can degrade in quality due to changes in the underlying data.

  • What AWS services can be used for MLOps?

    AWS services such as S3, SageMaker, Lambda, EventBridge, CodeCommit, ECR, and CodePipeline can be utilized to build a robust MLOps architecture.

  • How can I start implementing MLOps in my organization?

    Begin by integrating versioning and automation in your processes, then evaluate and prioritize MLOps features based on your organization's needs.

  • What are the challenges in deploying ML models?

    Common challenges include misalignment with business objectives, lack of communication between teams, and difficulty in maintaining existing models.

  • What is a model registry?

    A model registry helps manage machine learning models and their versions, storing metadata alongside the models for comparison and tracking.

  • What is the purpose of canary deployment?

    Canary deployment is used to gradually roll out a new model to minimize risk, monitoring for errors before fully shifting user traffic (see the sketch after this list).
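
The canary behaviour described in the answer above maps to SageMaker's deployment guardrails. The snippet below is a hedged sketch using boto3, assuming the endpoint, the new endpoint configuration, and a CloudWatch alarm already exist; their names are placeholders.

```python
# Minimal sketch (assumptions above): shift traffic to a new model gradually
# and roll back automatically if a CloudWatch alarm fires during the rollout.
import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="example-endpoint",           # placeholder
    EndpointConfigName="example-config-v2",    # placeholder: points at the new model
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                # Send 10% of capacity to the new model first...
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                # ...then wait ten minutes before shifting the remaining traffic.
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "example-endpoint-errors"}]  # placeholder
        },
    },
)
```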

Transcript

  • 00:00:14
    Hi, welcome to this session on end-to-end MLOps for
  • 00:00:17
    architects. My name is Sara van de Moosdijk, but you can call me
  • 00:00:20
    Moose. I'm a Senior AI/ML Partner Solutions Architect at
  • 00:00:24
    AWS. Now, my goal for this session today is to help
  • 00:00:28
    architects and developers, especially those of you not
  • 00:00:31
    specialised in machine learning, to design an MLOps architecture
  • 00:00:34
    for your organisation. I will introduce the different
  • 00:00:37
    components in an effective MLOps setup, and explain why
  • 00:00:40
    these components are necessary without diving too deep into the
  • 00:00:44
    details which only data scientists need to know. Now,
  • 00:00:47
    before watching this session, you should be comfortable with
  • 00:00:50
    architecting on AWS, and know how to use popular services like
  • 00:00:53
    S3, Lambda, EventBridge, CloudFormation, and so on. If
  • 00:00:57
    you're unfamiliar with these services, I recommend watching
  • 00:01:00
    some of the other architecture sessions and reading up on these
  • 00:01:03
    topics, then come back to watch the session. It will also help
  • 00:01:07
    if you have a basic understanding of machine
  • 00:01:08
    learning and how it works. Okay, let's look at what you can
  • 00:01:13
    expect. I'll start off the session with a brief overview of
  • 00:01:17
    the challenges many companies face when deploying machine
  • 00:01:20
    learning models and maintaining them in production. We will then
  • 00:01:23
    briefly define MLOps, before diving straight into some
  • 00:01:26
    architecture diagrams. Specifically, I have designed
  • 00:01:30
    the architecture diagrams according to t-shirt sizes from
  • 00:01:32
    small to large. This will allow you to choose your starting
  • 00:01:36
    point based on the size and maturity of your organisation.
  • 00:01:40
    And finally, I'll end the session with some advice for
  • 00:01:42
    starting your own MLOps journey. First, I want to
  • 00:01:47
    quickly go over the machine learning process to make sure
  • 00:01:49
    that we're all on the same page. Normally, you'd start your
  • 00:01:52
    machine learning process because you have a business problem
  • 00:01:55
    which needs to be solved, and you've determined that machine
  • 00:01:57
    learning is the correct solution. Then a data scientist
  • 00:02:01
    will spend quite a bit of time collecting the data required,
  • 00:02:04
    integrating data from various sources, cleaning the data, and
  • 00:02:07
    analysing the data. Next, the data scientists will start the
  • 00:02:11
    process of engineering features, training and tuning different
  • 00:02:14
    machine learning models and evaluating the performance of
  • 00:02:17
    these models. And then based on these results, the data
  • 00:02:20
    scientist might go back to collect more data or perform
  • 00:02:23
    additional data cleaning steps. But assuming that the models are
  • 00:02:26
    performing well, he or she would then go ahead and deploy the
  • 00:02:29
    model so it can be used to generate predictions. The
  • 00:02:33
    final step and certainly a crucial one is to monitor the
  • 00:02:36
    model that is in production. Much like a new car which
  • 00:02:40
    depreciates in value as soon as you drive it off the lot, a
  • 00:02:43
    machine learning model is out of date, as soon as you've trained
  • 00:02:45
    it. The world is constantly changing and evolving, so the
  • 00:02:49
    older your model gets, the worse it gets at making predictions.
  • 00:02:52
    By monitoring the quality of your model, you will know when
  • 00:02:55
    it's time to retrain or perhaps gather new data for it. Now,
  • 00:03:00
    every data scientist will follow a process along these lines and
  • 00:03:03
    when they're just starting out this process will likely be
  • 00:03:06
    entirely manual. But as businesses embrace machine
  • 00:03:09
    learning across their organisations, manual workflows
  • 00:03:12
    for building, training, deploying tend to become bottlenecks to
  • 00:03:15
    innovation. Even a single machine learning model in
  • 00:03:18
    production needs to be monitored, managed and retrained
  • 00:03:21
    to maintain the quality of its predictions. Without the right
  • 00:03:24
    operational practices in place, these challenges can negatively
  • 00:03:27
    impact data scientist productivity, model performance
  • 00:03:30
    and costs. So to illustrate what I mean let's take a look at some
  • 00:03:35
    architecture diagrams. Here I have a pretty standard data
  • 00:03:40
    science setup. The data scientist has been given access
  • 00:03:42
    to an AWS account within a SageMaker Studio domain, where they
  • 00:03:46
    can use Jupyter Notebooks to develop their machine learning
  • 00:03:48
    models. Data might be pulled from S3, RDS, Redshift, Glue, any
  • 00:03:54
    number of data related AWS services. The models produced by
  • 00:03:58
    the data scientist are then stored in an S3 bucket. So far,
  • 00:04:02
    so good. Unfortunately, this is often where it stops. Gartner
  • 00:04:06
    estimates that only 53% of machine learning POCs actually
  • 00:04:11
    make it into production. And there are various reasons for
  • 00:04:13
    this. Often there's a misalignment between strategic
  • 00:04:17
    objectives of the company and the machine learning models
  • 00:04:19
    being built by the data scientists. There might be a
  • 00:04:22
    lack of communication between DevOps, security, legal, IT, and
  • 00:04:26
    the data scientists, which can also be a common
  • 00:04:28
    challenge blocking models from reaching production. And
  • 00:04:31
    finally, if the company already has a few models in production,
  • 00:04:35
    a data scientist team can struggle to maintain those
  • 00:04:37
    existing models, while also pushing out new models. But what
  • 00:04:42
    if the model does make it into production? Let's assume the
  • 00:04:45
    data scientist spins up a SageMaker endpoint to host the
  • 00:04:48
    model. And the developer of an application is able to connect
  • 00:04:52
    to this model and generate predictions through API Gateway
  • 00:04:55
    connecting to a Lambda function, which calls the SageMaker
  • 00:04:58
    endpoint. So what challenges do you see with this architecture?
  • 00:05:04
    Well first, any changes to the machine learning model requires
  • 00:05:07
    manual actions by the data scientist in the form of re-
  • 00:05:10
    running cells in a Jupyter Notebook, right. Second, the
  • 00:05:14
    code which the data scientist produces is stuck in these
  • 00:05:17
    Jupyter Notebooks, which are difficult to version and
  • 00:05:19
    difficult to automate. Third, the data scientist might have
  • 00:05:23
    forgotten to turn on auto-scaling for the SageMaker
  • 00:05:26
    endpoint, so it cannot adjust capacity according to the number
  • 00:05:29
    of requests coming in. And finally, there's no feedback
  • 00:05:32
    loop. If the quality of the model deteriorates, you would
  • 00:05:35
    only find out through complaints from disgruntled users. These
  • 00:05:38
    are just some of the challenges which can be avoided with a
  • 00:05:41
    proper MLOps setup. So what is MLOps? Well, MLOps is a set of
  • 00:05:47
    operational practices to automate and standardise model
  • 00:05:51
    building, training, deployment, monitoring, management, and
  • 00:05:54
    governance. It can help companies streamline the end-to-
  • 00:05:58
    end machine learning lifecycle, and boost productivity of data
  • 00:06:01
    scientists and MLOps teams, while maintaining high model
  • 00:06:04
    accuracy, and enhancing security and compliance. So the key
  • 00:06:09
    phrase in that previous definition is operational
  • 00:06:12
    practices. MLOps, similar to DevOps, is more than just a set
  • 00:06:16
    of technologies or services. You need the right people, with the
  • 00:06:19
    right skills, following the same standardised processes to
  • 00:06:23
    successfully operate machine learning at scale. The
  • 00:06:26
    technology exists to facilitate these processes and make the job
  • 00:06:30
    easier for the people. Now in this session, I will focus on
  • 00:06:33
    the technology and specifically which AWS services we can use to
  • 00:06:37
    build a successful setup. But I want you to keep in mind that
  • 00:06:41
    the architectures provided in this session will only work if
  • 00:06:43
    you have the right teams, and if those teams are willing to
  • 00:06:46
    establish and follow MLOps processes. Now if you want to
  • 00:06:50
    learn more about the people and process aspects of MLOps, I
  • 00:06:54
    will include links to useful resources at the end of this
  • 00:06:56
    session. So without further ado, let's dive deep with some
  • 00:07:01
    architecture diagrams. Now MLOps can be quite complicated,
  • 00:07:04
    with lots of features and technologies which you could
  • 00:07:07
    choose to adopt, but you don't have to adopt all of it
  • 00:07:09
    immediately. So to start off, I'll give you an example of a
  • 00:07:13
    minimal MLOps setup. This would be suitable for a small company
  • 00:07:17
    or a small data science team of one to three people working on
  • 00:07:21
    just a couple of use cases. So let's take a look. We'll start
  • 00:07:27
    off with the same architecture we looked at previously, only
  • 00:07:30
    reduced in size so I can create more space on the slide. A data
  • 00:07:33
    scientist accesses Jupyter Notebooks through SageMaker
  • 00:07:36
    Studio, accesses data from any of various data sources, and
  • 00:07:40
    stores any machine learning models they create in S3. One of
  • 00:07:45
    the challenges I mentioned previously is that the code is
  • 00:07:47
    stuck in Jupyter notebooks and can be difficult to version and
  • 00:07:49
    automate. So the first step would be to add more versioning
  • 00:07:53
    to this architecture. You can use CodeCommit or any other Git-
  • 00:07:56
    based repository to store code. And you can use Amazon Elastic
  • 00:08:00
    Container Registry or ECR to store Docker containers, thereby
  • 00:08:04
    versioning the environments which were used to train the
  • 00:08:06
    machine learning models. By versioning the code, the
  • 00:08:09
    environments and the model artifacts, you improve your
  • 00:08:12
    ability to reproduce models and collaborate with others. Next,
  • 00:08:16
    let's talk about automation. Another challenge I mentioned
  • 00:08:19
    previously is that the data scientists are manually
  • 00:08:22
    re-training models instead of focussing on developing new
  • 00:08:24
    models. To solve this, you want to set up automatic re-training
  • 00:08:28
    pipelines. In this architecture, I use SageMaker Pipelines, but
  • 00:08:32
    you could also use Step Functions or Airflow to build
  • 00:08:34
    these repeatable workflows. The re-training pipeline built by the
  • 00:08:38
    data scientist or by a machine learning engineer, will use the
  • 00:08:41
    versioned code and environments to perform data pre-processing,
  • 00:08:44
    model training, model verification, and eventually
  • 00:08:48
    save the new model artifacts to S3. It can use various services
  • 00:08:52
    to complete these steps including SageMaker processing
  • 00:08:54
    or training jobs, EMR, or Lambda. But in order to automate
  • 00:08:59
    this pipeline, we need a trigger. One option is to use
  • 00:09:04
    EventBridge to trigger the pipeline based on a schedule.
  • 00:09:07
    Another option is to have someone manually trigger the
  • 00:09:09
    pipeline. Both triggers are useful in different contexts
  • 00:09:13
    and I'll introduce more triggers as we progress through these
  • 00:09:16
    slides. So now that we have an automated re-training pipeline, I
  • 00:09:20
    want to introduce another important concept in MLOps, and
  • 00:09:23
    that's the model registry. While S3 provides some versioning and
  • 00:09:27
    object locking functionality, which is useful for storing
  • 00:09:29
    different models, a model registry helps to manage these
  • 00:09:32
    models and their versions. SageMaker model registry allows you
  • 00:09:36
    to store metadata alongside your models, including the values of
  • 00:09:40
    hyperparameters and evaluation metrics, or even the bias and
  • 00:09:43
    explainability reports. This enables you to quickly view and
  • 00:09:47
    compare different versions of a model and to approve or reject a
  • 00:09:50
    model version for production. Now the actual artifacts are
  • 00:09:54
    still stored in S3, but model registry sits on top of that as
  • 00:09:58
    an additional layer.
  • 00:10:00
    Finally, we reach the deployment stage. At first
  • 00:10:03
    glance, this might look very different from what we saw
  • 00:10:05
    earlier in the session. But the setup is actually very similar.
  • 00:10:08
    I still have machine learning models deployed on real-time
  • 00:10:11
    SageMaker endpoints connected to Lambda and API Gateway to
  • 00:10:15
    communicate with an application. The main difference now is that
  • 00:10:19
    I have autoscaling set up for my SageMaker endpoints. So if
  • 00:10:22
    there's an unexpected spike in users, the endpoints can scale
  • 00:10:25
    up to handle the requests and scale back down when the usage
  • 00:10:29
    falls. Now one nice feature of SageMaker endpoints is that you
  • 00:10:32
    can replace the machine learning model without endpoint downtime.
  • 00:10:36
    Since I now have an automated re-training pipeline creating
  • 00:10:39
    new models, and a model registry where I can approve models, it
  • 00:10:42
    would be best if the deployment of the new models is automated
  • 00:10:45
    as well. I can achieve this by building a Lambda function,
  • 00:10:49
    which triggers when a new model is approved to fetch that model,
  • 00:10:53
    and then update the endpoint with it. So now we have
  • 00:10:56
    connected all the pieces and there's one final feature that I
  • 00:10:59
    will take advantage of. Not only can I update the machine
  • 00:11:02
    learning models hosted by the endpoints, but I can actually do
  • 00:11:04
    so gradually using a canary deployment. This means that a
  • 00:11:08
    small portion of the user requests will be diverted to the
  • 00:11:11
    new model, and any errors or issues will trigger a CloudWatch
  • 00:11:15
    alarm to inform me. Over time, the number of requests
  • 00:11:18
    sent to the new model will increase until the new model
  • 00:11:21
    gets 100% of the traffic. So I hope this architecture makes
  • 00:11:24
    sense. I started with a very basic setup, and by adding a few
  • 00:11:28
    features and services, I now have a serviceable MLOps
  • 00:11:31
    setup. My deployment strategy is more robust by using auto-
  • 00:11:34
    scaling and canary deployment, my data scientists save time by
  • 00:11:38
    automating model training, and every artefact is properly
  • 00:11:41
    versioned. But as your data science team grows, this
  • 00:11:45
    architecture won't be sufficient. So let's look at a
  • 00:11:47
    slightly more complicated architecture. The next
  • 00:11:52
    architecture will be more suitable for a growing data
  • 00:11:55
    science team of between three to 10 data scientists working on
  • 00:11:59
    several different use cases at a larger company. So again, let's
  • 00:12:03
    start with the basics. Our data scientists work in notebooks
  • 00:12:06
    through SageMaker Studio, pulling from various data
  • 00:12:08
    sources and versioning their code environments and model
  • 00:12:11
    artifacts. This should look familiar. Also bring back the
  • 00:12:16
    automated re-training pipeline. Nothing has changed here, I've
  • 00:12:19
    only made it smaller to create more room on the slide. And
  • 00:12:22
    finally, I'll bring back EventBridge to schedule the
  • 00:12:24
    re-training pipeline and model registry for storing model
  • 00:12:27
    metadata and approving model versions. All of this is exactly
  • 00:12:32
    the same as in the previous architecture diagram. So what
  • 00:12:34
    about deployment? Well, this is where things change a little. So
  • 00:12:38
    I have the same deployment setup with SageMaker endpoints and an
  • 00:12:42
    autoscaling group connected to Lambda and API Gateway to allow
  • 00:12:45
    users to submit inference requests. However, these
  • 00:12:48
    deployment services now sit in a separate AWS account. A multi-
  • 00:12:53
    account strategy is highly recommended, because this allows
  • 00:12:55
    you to separate different business units, easily define
  • 00:12:58
    separate restrictions for important production workloads,
  • 00:13:01
    and have a fine-grained view of the costs incurred by each
  • 00:13:04
    component of your architecture. The different accounts are best
  • 00:13:08
    managed through AWS Organizations. Now, data
  • 00:13:12
    scientists should not have access to the production
  • 00:13:14
    account. This reduces the chance of mistakes being made on that
  • 00:13:17
    account, which directly affects your users. In fact, a multi-
  • 00:13:21
    account strategy for machine learning usually has a separate
  • 00:13:24
    staging account alongside the production account. Any new
  • 00:13:27
    models are first deployed to the staging account, tested and only
  • 00:13:31
    then deployed on the production account. So if the data
  • 00:13:34
    scientist cannot access these accounts, clearly, the
  • 00:13:36
    deployment must happen automatically. All of the
  • 00:13:40
    services deployed into the staging and production accounts
  • 00:13:42
    are set up automatically using CloudFormation, controlled by
  • 00:13:45
    CodePipeline in the development account. The next step is to set
  • 00:13:49
    up a trigger for CodePipeline. And we can do so using EventBridge.
  • 00:13:52
    So when a model version is approved in model registry,
  • 00:13:57
    this will generate an event which can be used to trigger
  • 00:13:59
    deployment via CodePipeline. So now everything's connected
  • 00:14:03
    again, and this is starting to look like a proper MLOps
  • 00:14:06
    setup. But I'm sure you've noticed I have plenty of space
  • 00:14:08
    left on this slide. So let's add another feature which becomes
  • 00:14:11
    crucial when you have multiple models running in production for
  • 00:14:14
    extended periods of time - that's model monitor. The goal of
  • 00:14:18
    monitoring machine learning models in production is to
  • 00:14:21
    detect a change in behaviour or accuracy. To start, I enabled
  • 00:14:25
    data capture on the endpoints in the staging and production
  • 00:14:28
    accounts. This captures the incoming requests and outgoing
  • 00:14:31
    inference results and stores them in S3 buckets. If you have
  • 00:14:36
    a model monitoring use case, which doesn't require labelling
  • 00:14:39
    the incoming requests, then you could run the whole process
  • 00:14:41
    directly on your staging and production accounts. But in this
  • 00:14:44
    case, I assume the data needs to be combined with labels or other
  • 00:14:48
    data that's on the development account. So I use S3 replication
  • 00:14:51
    to move the data onto an S3 bucket in the development account.
  • 00:14:56
    Now, in order to tell if the behaviour of the model or the
  • 00:14:59
    data has changed, we need something to compare it to.
  • 00:15:02
    That's where the model baseline comes in. During the training
  • 00:15:05
    process as part of the automated re-training pipeline, we can
  • 00:15:08
    generate a baseline dataset, which records the expected
  • 00:15:11
    behaviour of the data and the model. So that gives me all the
  • 00:15:15
    components I need to set up SageMaker model monitor, which will
  • 00:15:18
    compare the two datasets and generate a report. The final
  • 00:15:22
    step in this architecture is to take action based on the results
  • 00:15:25
    of the model monitoring report. And we can do this by sending an
  • 00:15:28
    event to EventBridge to trigger the re-training pipeline when a
  • 00:15:31
    significant change has been detected. And that's it for the
  • 00:15:34
    medium MLOps architecture! It contains a lot
  • 00:15:37
    of the same features used in the small architecture, but it
  • 00:15:40
    expands to a multi-account setup, and adds model monitoring
  • 00:15:43
    for extra quality checks on the models in production. Hopefully,
  • 00:15:48
    you're now wondering what a large MLOps architecture looks
  • 00:15:51
    like and how I can possibly fit more features onto a single
  • 00:15:54
    slide. So let's take a look at that now. This architecture is
  • 00:15:57
    suitable for companies with large data science teams of 10
  • 00:16:00
    or more people and with machine learning integrated throughout
  • 00:16:03
    the business. Of course, I start with the same basic setup I had
  • 00:16:09
    last time but reduced in size again. The data scientist is
  • 00:16:12
    still using SageMaker Studio through a development account,
  • 00:16:15
    and stores model artifacts and code in S3 and CodeCommit
  • 00:16:18
    respectively. The data sources are also present, but data is
  • 00:16:22
    now stored in a separate account. It's a common strategy
  • 00:16:25
    to have your data lakes set up in one account with fine-grained
  • 00:16:28
    access controls to determine which datasets can be accessed
  • 00:16:31
    by resources in other accounts. Really, the larger a company
  • 00:16:35
    becomes the more AWS accounts they tend to use, all managed
  • 00:16:38
    through AWS Organizations. So let's continue this trend by
  • 00:16:42
    bringing back the automated re-training pipeline in a
  • 00:16:45
    separate operations account. And let's bring back model registry
  • 00:16:48
    as well in yet another account. All of the components are the
  • 00:16:51
    same as in the small and medium architecture diagrams, but just
  • 00:16:54
    split across more accounts. The operations account is normally
  • 00:16:58
    used for any automated workflows which don't require manual
  • 00:17:01
    intervention by the data scientists. It's also good
  • 00:17:04
    practice to store all of your artifacts in a separate artefact
  • 00:17:07
    account like I have here for model registry. Again, this is
  • 00:17:10
    an easy way to prevent data scientists from accidentally
  • 00:17:13
    changing production artifacts. Next, let's bring back the
  • 00:17:17
    production and staging accounts with the deployment setup. This
  • 00:17:20
    is exactly the same as in the previous architecture, just
  • 00:17:22
    reduced in size. The infrastructure in the production
  • 00:17:26
    and staging accounts is still set up automatically through
  • 00:17:28
    CloudFormation and CodePipeline, but CodePipeline sits in a
  • 00:17:32
    CI/CD account. Note that I have built this diagram based on
  • 00:17:36
    account structures I have seen organisations use but your
  • 00:17:39
    organisation might use a different account setup and
  • 00:17:41
    that's totally fine. Use this diagram as an example and adjust
  • 00:17:45
    it to your structure and your needs. Now, let's connect our
  • 00:17:49
    model registry to CodePipeline by using EventBridge exactly
  • 00:17:52
    the same as in the previous architecture. And now we have
  • 00:17:55
    all the pieces connected again. But I don't know if you noticed,
  • 00:17:58
    but one of the basic building blocks is still missing in this
  • 00:18:01
    picture. Hopefully you spotted it - ECR disappeared for a little
  • 00:18:05
    while. So let's bring it back by placing it in the artefact
  • 00:18:08
    account because environments, especially production
  • 00:18:10
    environments are artifacts which need to be protected. There's
  • 00:18:14
    one more change I want to make to my use of ECR here. In the
  • 00:18:17
    previous architecture diagrams, I assumed that data scientists
  • 00:18:20
    were building Docker containers and registering these containers
  • 00:18:23
    in ECR manually. This process can be simplified and indeed
  • 00:18:27
    automated using CodeBuild and CodePipeline. The data
  • 00:18:30
    scientist or machine learning engineer can still write the
  • 00:18:33
    Docker file, but the building and registration of the
  • 00:18:35
    container is performed automatically. This saves even
  • 00:18:38
    more time, so data scientists can focus on what they do best.
  • 00:18:42
    Of course, in the previous architecture, I use model
  • 00:18:44
    monitor to trigger model re-training if significant
  • 00:18:47
    changes in model behaviour were detected. So let's bring that
  • 00:18:50
    back as well, starting with the data capture in the staging and
  • 00:18:53
    production accounts, followed by data replication into the
  • 00:18:56
    operations account. As before, model monitor will need a
  • 00:19:00
    baseline to compare performance and the generation of this
  • 00:19:03
    baseline can be a step in SageMaker Pipelines. Finally, I'll
  • 00:19:07
    bring back model monitor to generate reports on drift and
  • 00:19:10
    trigger re-training if necessary. This leaves us with all of the
  • 00:19:14
    components I had in the medium MLOps diagram. But there's two
  • 00:19:18
    more features that I want to introduce. The first is SageMaker
  • 00:19:22
    feature store, which sits in the artefact account because
  • 00:19:25
    features are artifacts which can be reused. If you remember the
  • 00:19:28
    basic data science workflow from the beginning of the session,
  • 00:19:31
    data scientists will normally perform feature engineering
  • 00:19:34
    before training a model, and it has a large impact on model
  • 00:19:36
    performance. In large companies,
  • 00:19:39
    there's a good chance that data scientists will be working on
  • 00:19:41
    separate use cases which rely on the same dataset. A feature
  • 00:19:45
    store allows data scientists to take advantage of features
  • 00:19:48
    created by others. It reduces their workload and also ensures
  • 00:19:52
    consistency in the features that are created from a dataset. The
  • 00:19:56
    final component I want to introduce is SageMaker Clarify.
  • 00:20:00
    Clarify can be used by data scientists in the development
  • 00:20:02
    phase to identify bias in datasets and to generate
  • 00:20:05
    explainability reports for models. This technology is
  • 00:20:08
    important for responsible AI. Now similar to model monitor, Clarify
  • 00:20:13
    can also be used to generate baseline bias and explainability
  • 00:20:16
    reports, which can then be compared to the behaviour of the
  • 00:20:19
    model in the endpoint. If Clarify finds that bias is
  • 00:20:23
    increasing or the explainability results are changing, it can
  • 00:20:26
    trigger a re-training of the model. Now both feature store
  • 00:20:30
    and Clarify can be introduced much earlier in the medium or
  • 00:20:33
    even the small MLOps architectures. It really depends
  • 00:20:36
    on the needs of your business. And I hope you can use these
  • 00:20:38
    example architectures to design an architecture which works for
  • 00:20:41
    you. Now, the architecture diagrams in this session rely
  • 00:20:45
    heavily on different components offered by Amazon SageMaker.
  • 00:20:49
    SageMaker provides purpose-built tools and built-in
  • 00:20:51
    integrations with other AWS services so you can adopt MLOps
  • 00:20:54
    practices across your organisation. Using Amazon
  • 00:20:58
    SageMaker, you can build CI/CD pipelines to reduce model
  • 00:21:01
    management overhead, automate machine learning workflows to
  • 00:21:04
    accelerate data preparation, model building and model
  • 00:21:06
    training, monitor the quality of models by automatically
  • 00:21:10
    detecting bias, model drift, and concept drift, and automatically
  • 00:21:13
    track lineage of code, datasets, and model artifacts for
  • 00:21:16
    governance. But if there's one thing I want you to take away
  • 00:21:20
    from this session, it should be this: MLOps is a journey, you
  • 00:21:24
    don't have to immediately adopt every feature available in a
  • 00:21:27
    complicated architecture design. Start with the basic steps to
  • 00:21:31
    integrate versioning and automation. Evaluate all the
  • 00:21:34
    features I introduced in this session, and order them
  • 00:21:37
    according to the needs of your business, then start adopting
  • 00:21:39
    them as and when it's needed. The architecture diagrams I
  • 00:21:43
    presented in the session are not the only way to implement
  • 00:21:46
    MLOps, but I hope they'll provide some inspiration to you as an
  • 00:21:49
    architect. So to help you get started, I've collected some
  • 00:21:52
    useful resources and placed the links on this slide. You should
  • 00:21:55
    be able to download a copy of these slides so you can access
  • 00:21:58
    these links. The resources on this page will not only provide
  • 00:22:01
    advice on the technology behind MLOps but also on the
  • 00:22:04
    people and processes which we discussed briefly at the start.
  • 00:22:08
    If you're interested in any other topic related to AWS
  • 00:22:10
    Cloud, I recommend checking out Skill Builder online learning
  • 00:22:13
    centre. It offers over 500 free digital courses for any level of
  • 00:22:17
    experience. You should also consider getting certified
  • 00:22:19
    through AWS Certifications. If you enjoyed the topic of this
  • 00:22:23
    particular session, I'd recommend checking out the
  • 00:22:25
    Solutions Architect associate and professional certifications,
  • 00:22:28
    as well as the machine learning specialty certification. And
  • 00:22:32
    that's all I have for today. Thanks for taking the time to
  • 00:22:35
    listen to me talk about MLOps, and I hope this content helps
  • 00:22:38
    you in upcoming projects. I just have one final request and
  • 00:22:41
    that's to complete the session survey. It's the only way for me
  • 00:22:44
    to know if you enjoyed this session and it only takes a
  • 00:22:47
    minute. I hope you have a wonderful day and enjoy the rest
  • 00:22:49
    of Summit!
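
The talk describes a Lambda function that fires when a model version is approved in the model registry and then swaps the model behind the live endpoint without downtime. The sketch below is one possible implementation, not the speaker's code: it assumes an EventBridge rule on the "SageMaker Model Package State Change" event invokes the function, that the event detail carries the package ARN and approval status, and that the role ARN and endpoint name arrive as environment variables; all names are placeholders.

```python
# Minimal sketch (assumptions above): deploy a newly approved model package
# to an existing SageMaker endpoint from a Lambda handler.
import os
import time

import boto3

sm = boto3.client("sagemaker")

ROLE_ARN = os.environ["SAGEMAKER_ROLE_ARN"]   # placeholder environment variables
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]


def handler(event, context):
    # Field names assumed from the documented "SageMaker Model Package
    # State Change" event shape.
    detail = event["detail"]
    if detail.get("ModelApprovalStatus") != "Approved":
        return {"skipped": True}

    package_arn = detail["ModelPackageArn"]
    suffix = str(int(time.time()))

    # A SageMaker Model can reference the registered model package directly.
    model_name = f"approved-model-{suffix}"
    sm.create_model(
        ModelName=model_name,
        PrimaryContainer={"ModelPackageName": package_arn},
        ExecutionRoleArn=ROLE_ARN,
    )

    # New endpoint configuration pointing at that model...
    config_name = f"approved-config-{suffix}"
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",   # placeholder
        }],
    )

    # ...and an in-place update of the existing endpoint (no downtime).
    sm.update_endpoint(EndpointName=ENDPOINT_NAME, EndpointConfigName=config_name)
    return {"deployed": config_name}
```

In the medium and large architectures the same approval event would instead start CodePipeline, which deploys the staging and production stacks through CloudFormation.
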
Tags
  • MLOps
  • AWS
  • Machine Learning
  • Architecture
  • Model Deployment
  • Automation
  • Monitoring
  • Version Control
  • Data Science
  • SageMaker