AWS re:Invent 2022 - Reliable scalability: How Amazon.com scales in the cloud (ARC206)

00:57:38
https://www.youtube.com/watch?v=QeW9wCB36ck

Summary

TL;DR: In this session, Seth Eliot discusses how Amazon.com achieves reliable scalability on AWS, detailing the evolution of its architecture from a simple setup in 1995 to a complex microservices architecture today. He emphasizes the importance of scalability, reliability, and the Well-Architected Framework, which includes best practices for building in the cloud. The session features examples such as IMDb's transition to serverless microservices and Global Ops Robotics' cell-based architecture, highlighting the use of automation, chaos engineering, and other strategies to ensure systems can handle high traffic and maintain performance during peak times.

Takeaways

  • 👋 Welcome and introduction to reliable scalability.
  • 📜 History of Amazon's architecture evolution.
  • ⚙️ Importance of scalability for handling increased loads.
  • 🔧 Overview of the Well-Architected Framework.
  • 📈 IMDb's transition to serverless microservices.
  • 🏭 Global Ops Robotics and cell-based architecture.
  • 🚚 Amazon Relay app for truck management.
  • 🔍 Chaos engineering for testing resilience.
  • 📊 Best practices for reliability in cloud services.
  • 🤝 Acknowledgment of the engineers behind the systems.

Timeline

  • 00:00:00 - 00:05:00

    The session introduces the topic of reliable scalability and how Amazon.com utilizes AWS for cloud scalability. Seth Eliot, the speaker, shares his background and experience with Amazon, including his work on the .com side and AWS. He presents a historical overview of Amazon's architecture, starting from its early days in 1995, highlighting the evolution of its systems to meet the demand for scalability.

  • 00:05:00 - 00:10:00

    The need for scalability is emphasized, defined as the ability of a workload to perform its function as the load changes. The architecture evolved from a single server and database to a service-oriented architecture, allowing for more agile development and deployment. The talk transitions to the current architecture, which consists of tens of thousands of microservices interconnected through various dependencies.

  • 00:10:00 - 00:15:00

    The focus shifts to the Well-Architected Framework, particularly the reliability pillar, which includes best practices for building in the cloud. The speaker introduces the first example of IMDb's transition to serverless microservices, explaining how they moved from a monolithic architecture to a federated schema with microservices, improving scalability and reliability.

  • 00:15:00 - 00:20:00

    The architecture of IMDb is discussed, showcasing a gateway-based architecture that connects various backend microservices. The importance of the two-pizza team model is highlighted, emphasizing ownership and accountability within teams, leading to smoother operations and better on-call experiences.

  • 00:20:00 - 00:25:00

    The next best practice discussed is automation in resource scaling. The speaker explains how AWS Lambda functions automatically scale based on requests, and IMDb implemented provisioned concurrency to avoid cold starts, ensuring a seamless user experience during peak loads.

  • 00:25:00 - 00:30:00

    The architecture of IMDb's API gateway is examined, detailing the use of a web application firewall (WAF) and a content delivery network (CDN) to enhance security and performance. The speaker emphasizes the importance of using highly available public endpoints to reduce high-severity issues caused by bots.

  • 00:30:00 - 00:35:00

    The session transitions to Global Ops Robotics, focusing on warehouse management and the use of a cell-based architecture to protect workloads. The concept of bulkhead architecture is introduced, explaining how compartmentalization helps contain failures and maintain operational continuity across fulfillment centers.

  • 00:35:00 - 00:40:00

    The discussion moves to Amazon Relay, which manages the middle mile of the supply chain. The speaker explains how they use multi-region deployments to ensure truck operations continue smoothly, even during service disruptions. The importance of graceful degradation and failover strategies is emphasized to maintain critical functions during outages.

  • 00:40:00 - 00:45:00

    The Classification and Policies Platform is introduced, showcasing how Amazon classifies millions of items using machine learning models. The concept of shuffle sharding is explained, demonstrating how it limits the blast radius of failures by assigning unique worker pairs to clients, enhancing reliability and scalability.
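
    To make shuffle sharding concrete, here is a minimal Python sketch of the scheme as described: each client is deterministically assigned its own pair of workers, so a problem triggered by one client touches at most that pair. The worker names and client IDs here are hypothetical, not from the talk.

        import hashlib
        from itertools import combinations

        WORKERS = [f"worker-{i}" for i in range(8)]
        # With 8 workers there are C(8,2) = 28 distinct 2-worker shards,
        # so few clients share both of their workers.
        SHARDS = list(combinations(WORKERS, 2))

        def shard_for(client_id: str) -> tuple:
            """Deterministically map a client to its own 2-worker shard."""
            digest = hashlib.sha256(client_id.encode()).hexdigest()
            return SHARDS[int(digest, 16) % len(SHARDS)]

        # A poison-pill request from one client is contained to its pair;
        # most other clients still have at least one healthy worker.
        print(shard_for("client-42"))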

  • 00:45:00 - 00:57:38

    The final example focuses on Amazon Search and the implementation of chaos engineering to test system resilience. The speaker outlines the chaos engineering process, emphasizing the importance of steady state, hypothesis testing, and the use of service level objectives (SLOs) to ensure a positive customer experience during turbulent conditions.
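
    The summary doesn't name specific tooling, but as one illustration, AWS Fault Injection Simulator (FIS) can drive such an experiment. A hedged boto3 sketch, assuming a pre-built experiment template; the template ID and tag values are placeholders:

        import boto3

        fis = boto3.client("fis")

        # Before injecting faults, verify steady state against the SLO
        # (e.g., p99 latency and error rate within agreed bounds).
        experiment = fis.start_experiment(
            experimentTemplateId="EXT_PLACEHOLDER",  # hypothetical template
            tags={"hypothesis": "search stays within SLO under injected faults"},
        )
        print(experiment["experiment"]["id"])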

Video Q&A

  • What is the main focus of the session?

    The session focuses on how Amazon.com achieves reliable scalability using AWS.

  • Who is the speaker?

    The speaker is Seth Eliot, a principal developer advocate at AWS.

  • What is the Well-Architected Framework?

    The Well-Architected Framework consists of best practices for building in the cloud, focusing on reliability among other pillars.

  • What is chaos engineering?

    Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.

  • What is the significance of scalability for Amazon?

    Scalability allows Amazon to handle increased loads and maintain performance as the scope of their operations changes.

  • What architectural changes did IMDb implement?

    IMDb transitioned from a monolithic architecture to serverless microservices using AWS Lambda.

  • What is a two-pizza team?

    A two-pizza team is a small, cross-functional team at Amazon that can be fed with two pizzas, emphasizing ownership and agility.

  • How does Amazon ensure reliability in its services?

    Amazon uses best practices from the Well-Architected Framework, including automation, bulkhead architectures, and chaos engineering.

  • What is the role of the Global Ops Robotics team?

    The Global Ops Robotics team manages warehouse operations and uses a cell-based architecture for reliability.

  • What is the purpose of the Relay app?

    The Relay app helps truck drivers manage their loads and routes in Amazon's middle mile logistics.

Transcript

  • 00:00:00
    - Hello. Welcome, everyone.
  • 00:00:01
    Thank you so much for choosing my session.
  • 00:00:03
    I really appreciate you being here.
  • 00:00:04
    You're here of course for reliable scalability,
  • 00:00:08
    how amazon.com runs on AWS
  • 00:00:11
    and how it scales in the cloud on AWS,
  • 00:00:13
    and we're gonna talk a lot about examples of how amazon.com,
  • 00:00:17
    a large, sophisticated customer, uses AWS.
  • 00:00:21
    So my name is Seth Eliot.
  • 00:00:22
    I am currently a developer advocate,
  • 00:00:24
    principal developer advocate for developer relations,
  • 00:00:27
    just that's a recent change for me.
  • 00:00:29
    Prior to that, I was the reliability lead
  • 00:00:31
    for AWS Well-Architected,
  • 00:00:32
    and Well-Architected's gonna play a big part
  • 00:00:34
    in the talk today, but even before that,
  • 00:00:36
    I actually worked for amazon.com.
  • 00:00:38
    So I joined Amazon back in 2005
  • 00:00:40
    and was working on the .com side before moving to AWS.
  • 00:00:45
    So I always like to start off
  • 00:00:46
    with a bit of a history lesson.
  • 00:00:47
    Now, I wasn't there in 1995,
  • 00:00:49
    but this is what the website looked like in 1995.
  • 00:00:52
    Take it in in all its glory.
  • 00:00:55
    Quite amazing for the time period, actually,
  • 00:00:57
    and this is the architecture used
  • 00:01:00
    to run that website you just saw.
  • 00:01:01
    So I want to draw your attention
  • 00:01:03
    to the box that says Obidos.
  • 00:01:03
    Obidos is the place in Brazil
  • 00:01:06
    where the Amazon River is at its narrowest and swiftest,
  • 00:01:09
    and back in those days, they named a lotta things
  • 00:01:12
    after places in Brazil and things on the Amazon River,
  • 00:01:14
    and that is the executable.
  • 00:01:17
    That is a single C, not C++,
  • 00:01:20
    but C binary running on a single server
  • 00:01:23
    talking to a single Oracle database
  • 00:01:26
    running on another server called ACB,
  • 00:01:28
    for amazon.com books, that had all the data in it,
  • 00:01:31
    and that essentially was the architecture.
  • 00:01:33
    You could see there's CC motel. That's a credit card system.
  • 00:01:35
    That was separate so that we could have limited access
  • 00:01:38
    to that so that the credit card numbers could be secure,
  • 00:01:40
    and there's a distribution center,
  • 00:01:42
    later renamed fulfillment centers,
  • 00:01:43
    from which your package would be shipped and show up to you.
  • 00:01:46
    So that's the original architecture.
  • 00:01:48
    Now, the motto of Amazon, especially back in those days,
  • 00:01:51
    is get big fast, and you can see that there's a T-shirt
  • 00:01:55
    from one of the picnics about get big fast,
  • 00:01:57
    and to get big fast, you're gonna need scalability.
  • 00:02:00
    So what is scalability?
  • 00:02:01
    Well, scalability is the ability of a workload
  • 00:02:03
    to perform its agreed function as the scope changes,
  • 00:02:06
    as the load or scope changes.
  • 00:02:08
    So to get there, they had to evolve the architecture.
  • 00:02:12
    So the first thing they looked at was the databases.
  • 00:02:14
    You could see they pulled out this Web database there.
  • 00:02:16
    So that Web database interacts with the customer,
  • 00:02:19
    does the ordering, and then asynchronously syncs
  • 00:02:21
    back to the ACB database periodically.
  • 00:02:24
    Similarly, we've added a new distribution center
  • 00:02:26
    and they each get their own databases too.
  • 00:02:28
    So this is one way to remove one of the big bottlenecks,
  • 00:02:31
    which was the database, but that wasn't enough.
  • 00:02:34
    So let's fast forward to 2000
  • 00:02:36
    and talk about a service-oriented architecture.
  • 00:02:39
    Having a single binary,
  • 00:02:41
    it eventually did become C++, like the original,
  • 00:02:44
    the first engineer at Amazon insisted it stay C,
  • 00:02:47
    but he couldn't control it after a time
  • 00:02:49
    and eventually C++ libraries got into it,
  • 00:02:51
    but still, it was a single binary.
  • 00:02:53
    So if you wanted to make a change,
  • 00:02:54
    so let's say you were in charge of implementing
  • 00:02:56
    one-click ordering, one-click purchase,
  • 00:02:59
    you would have to make your change to that binary
  • 00:03:02
    and everybody else is making changes to that binary
  • 00:03:04
    and you're building along with everybody else
  • 00:03:06
    and you're deploying along with everybody else
  • 00:03:08
    and it's just not a very agile system.
  • 00:03:10
    If somebody else breaks the build,
  • 00:03:11
    you're not deploying today, so that's not great,
  • 00:03:14
    so what can we do?
  • 00:03:15
    Well, in addition to splitting out the databases,
  • 00:03:17
    you can see the customer data got pulled out of ACB
  • 00:03:20
    and you don't wanna be calling the database directly,
  • 00:03:22
    so you're gonna put a service in front of it,
  • 00:03:23
    the customer service, and that customer service
  • 00:03:25
    was originally just for select
  • 00:03:28
    and insert onto that database,
  • 00:03:30
    but it became the location for business logic on customers.
  • 00:03:33
    Similarly, there became an order service and an item service
  • 00:03:36
    and this is the first service-oriented architectures
  • 00:03:39
    at Amazon.
  • 00:03:41
    Now, get big fast. Now let's fast forward to the present.
  • 00:03:44
    The previous Prime Day, Amazon did get big.
  • 00:03:47
    We all know Amazon's big, 100,000 items per minute,
  • 00:03:50
    12 billion in sales as of the last Prime Day,
  • 00:03:53
    but you're not here to learn about that.
  • 00:03:54
    You're here to learn about how they're using AWS, right?
  • 00:03:57
    And so I won't read the numbers off of here.
  • 00:03:59
    There's obviously billions and trillions and millions.
  • 00:04:02
    Go ahead and read them.
  • 00:04:03
    It just shows that Amazon did get big fast
  • 00:04:06
    and they're doing it using AWS and they're letting,
  • 00:04:09
    and they're using AWS to be able to scale
  • 00:04:12
    and scale reliably.
  • 00:04:14
    So if you fast forward to wanna know
  • 00:04:16
    what the architecture looks like today,
  • 00:04:17
    it looks appreciably like it looked back in 2000.
  • 00:04:21
    Anybody believe me on that? No.
  • 00:04:24
    See if you're paying attention. No.
  • 00:04:25
    Okay, so this is actually closer
  • 00:04:27
    to the actual current architecture.
  • 00:04:28
    Each dot on there represents a service or microservice
  • 00:04:31
    and there are tens of thousands of them running amazon.com
  • 00:04:34
    and they're all connected to each other
  • 00:04:36
    through various dependencies.
  • 00:04:37
    I zoomed in on one of them here just to show you
  • 00:04:39
    that there are indeed lines in that diagram.
  • 00:04:41
    I think the diagram is quite beautiful, isn't it?
  • 00:04:43
    But that is the current architecture
  • 00:04:45
    with tens of thousands of services,
  • 00:04:49
    with many thousands of teams owning those services.
  • 00:04:52
    All right, so that brings us to reliable scalability.
  • 00:04:54
    So reliability is the ability of a workload
  • 00:04:57
    to perform its required function correctly and consistently.
  • 00:05:00
    So as we're thinking about that,
  • 00:05:02
    that's why Amazon needed scalability.
  • 00:05:05
    They needed to get big fast and be reliable,
  • 00:05:08
    hence they needed the scalability,
  • 00:05:09
    and today we're gonna be diving into examples
  • 00:05:11
    of amazon.com teams doing that and building on AWS,
  • 00:05:15
    and we're gonna use the Well-Architected Framework
  • 00:05:18
    as a framework to present that to you.
  • 00:05:20
    So the Well-Architected Framework consists of six pillars
  • 00:05:22
    and they're all important, but honestly,
  • 00:05:24
    today we're focused on reliability.
  • 00:05:26
    The reliability pillar has the,
  • 00:05:28
    Well-Architected has best practices.
  • 00:05:30
    Well-Architected is just a documentation
  • 00:05:33
    of all the best practices for building in the cloud.
  • 00:05:35
    It includes other things too. We have hands-on labs.
  • 00:05:38
    We have a Well-Architected Tool
  • 00:05:39
    where you could review your own workloads,
  • 00:05:41
    but honestly in this case,
  • 00:05:43
    we're gonna look at the best practices reliability pillar.
  • 00:05:45
    There are 66 of them.
  • 00:05:46
    We're not gonna look at all 66 of them, but today,
  • 00:05:49
    as I show you the examples I'm showing you,
  • 00:05:51
    I'm gonna talk about which best practice
  • 00:05:53
    is being illustrated in the architectures we're looking at,
  • 00:05:57
    and we're gonna dive right in with our first example.
  • 00:06:02
    Oh, IMDb re-architected to serverless microservices.
  • 00:06:06
    So IMDb, Internet Movie Database.
  • 00:06:09
    Who here has heard of IMDb?
  • 00:06:11
    Okay, and the rest of you just don't wanna raise your hand
  • 00:06:13
    because you don't wanna raise your hand. (laughing)
  • 00:06:17
    Internet Movie Database was acquired by Amazon in 1998.
  • 00:06:22
    It is the number one location to go to learn about movies,
  • 00:06:25
    TV shows, actors, producers, all that good stuff,
  • 00:06:28
    and prior to the re-architecture,
  • 00:06:30
    they were running a monolithic build with a REST API
  • 00:06:35
    on hundreds of EC2 servers.
  • 00:06:38
    So they're on AWS, but they're running on
  • 00:06:40
    hundreds of EC2 instances, servers,
  • 00:06:43
    and when they re-architected,
  • 00:06:44
    they moved to a federated schema with microservices.
  • 00:06:47
    Now, microservices are small, decoupled services
  • 00:06:51
    focused on a specific business domain.
  • 00:06:53
    As for what federated schema is,
  • 00:06:55
    if you don't know already, I'll get to that,
  • 00:06:56
    and they used Lambda for this.
  • 00:06:58
    So they're using Lambda,
  • 00:06:59
    which is the serverless compute in AWS,
  • 00:07:02
    the ability to run code without servers,
  • 00:07:06
    and now we get to the best practice
  • 00:07:08
    and you're gonna see several of
  • 00:07:09
    these slides throughout the talk.
  • 00:07:10
    They have the Well-Architected logo up there
  • 00:07:12
    and the format might be a little odd.
  • 00:07:13
    This, what it is is a snapshot of the Well-Architected Tool
  • 00:07:16
    which is in the AWS console,
  • 00:07:18
    and the way best practices are shown in the framework
  • 00:07:20
    is there's a question that represents
  • 00:07:23
    a set of best practices.
  • 00:07:24
    Then each of those check boxes are a best practice.
  • 00:07:27
    So in this case, the best practices we're interested in is,
  • 00:07:29
    how do you segment your workload,
  • 00:07:31
    and then how do you focus those segments
  • 00:07:35
    on specific business use cases, on specific business needs?
  • 00:07:38
    And you can see I circled microservices there.
  • 00:07:40
    You don't have to use microservices
  • 00:07:42
    to achieve these best practices,
  • 00:07:43
    but that is what IMDb did, so therefore it's circled.
  • 00:07:46
    So these are the first two best practices to look at,
  • 00:07:48
    and to look at that, we're gonna ask a question.
  • 00:07:51
    You're on IMDb and you type in Jackie Chan
  • 00:07:54
    and what it does is run a query:
  • 00:07:56
    what are the top four shows that Jackie Chan is known for?
  • 00:08:00
    Now, Jackie has an id. Every entity in IMDb has an id.
  • 00:08:04
    This nm is a name entity and that's his entity there, 329,
  • 00:08:07
    and so you as a user don't care about that,
  • 00:08:10
    but you've just asked what is Jackie Chan known for,
  • 00:08:13
    and this is the query that the client creates.
  • 00:08:16
    It's GraphQL.
  • 00:08:18
    GraphQL is a query language
  • 00:08:19
    that lets you set up queries like this where you can,
  • 00:08:23
    using a schema, request information
  • 00:08:26
    and get all that information back at once.
  • 00:08:27
    Like with REST, you'd probably have to make four calls.
  • 00:08:29
    Here, you just do it all at once,
  • 00:08:31
    and what is being requested here?
  • 00:08:32
    You could see the name id on top and so that's Jackie Chan,
  • 00:08:36
    and I wanna know the first four things that he's known for,
  • 00:08:39
    and of those, when you gimme those four things,
  • 00:08:40
    I wanna know the title text, I wanna know the release date,
  • 00:08:44
    I wanna know the aggregated ratings,
  • 00:08:46
    and I want an image URL so I could show an image.
  • 00:08:48
    Okay, so that's the query that the front end is making,
  • 00:08:51
    and this is where the microservices
  • 00:08:52
    and the federated schema come into play.
  • 00:08:55
    This request is actually sent
  • 00:08:57
    to four different microservices,
  • 00:08:59
    each fronted by an AWS Lambda in this case.
  • 00:09:02
    So the first one is find me the top four things
  • 00:09:05
    Jackie Chan's known for and it's gonna return the id
  • 00:09:07
    of those four things, which begins with tt.
  • 00:09:09
    Now, that first service doesn't know about
  • 00:09:12
    release date or rating.
  • 00:09:13
    It only knows about top four.
  • 00:09:15
    So the next thing is the title text and the release date.
  • 00:09:19
    That's metadata, so that's gonna go to that other service,
  • 00:09:21
    and that service only knows metadata,
  • 00:09:23
    so it's gonna return the metadata.
  • 00:09:24
    The third one is the ratings,
  • 00:09:26
    so that one only knows ratings.
  • 00:09:27
    It's gonna return the aggregate rating,
  • 00:09:28
    and the last one only knows image URLs,
  • 00:09:30
    so it's gonna return the image URLs,
  • 00:09:31
    and the reason it's a federated schema is
  • 00:09:33
    'cause even though the request is one big schema,
  • 00:09:35
    each of these little microservices only knows
  • 00:09:37
    its own piece to the schema.
  • 00:09:40
    So when the front end gets that response to that request,
  • 00:09:44
    it's gonna show it to the user like this.
  • 00:09:45
    You could see Jackie Chan
  • 00:09:46
    and the four things he's known for.
  • 00:09:48
    You could see the release date.
  • 00:09:49
    You could see the aggregated rating,
  • 00:09:51
    and there's only one thing wrong here.
  • 00:09:54
    "Kung Fu Panda" is missing. How could that be?
  • 00:09:56
    I don't know and I really have a bone to pick
  • 00:09:58
    with the IMDb team.
  • 00:09:59
    I'll let them know about it after the talk.
  • 00:10:02
    All right, so now let's get into architecture.
  • 00:10:04
    Okay, so this is what the IMDb architecture looks like.
  • 00:10:06
    It's a gateway-based architecture.
  • 00:10:08
    So they redesigned their gateway
  • 00:10:09
    into the serverless architecture
  • 00:10:11
    so that it can call all of these backend microservices
  • 00:10:14
    that each know their own little piece of the elephant.
  • 00:10:17
    So here's those backend microservices.
  • 00:10:18
    They're just sitting there fronted by Lambda.
  • 00:10:20
    Some of them are completely serverless.
  • 00:10:22
    Many of the newer ones are.
  • 00:10:23
    Some of them, if there was like a legacy service
  • 00:10:25
    or something that they just wanted to update,
  • 00:10:27
    they'll front it with a Lambda so that they could be called,
  • 00:10:30
    and the Lambda's responsible for shaping the data
  • 00:10:33
    so that the GraphQL query response is in the right format.
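
    As a sketch of what one of those fronting Lambdas could look like, assume a ratings microservice wrapping a legacy backend. All names here are hypothetical; the point is that the handler knows only its slice of the schema and reshapes legacy data into it.

        def legacy_ratings_lookup(title_ids):
            # Stand-in for the legacy service this Lambda fronts.
            return {tid: 7.5 for tid in title_ids}

        def handler(event, context):
            title_ids = event["titleIds"]      # e.g., ["tt0317219", ...]
            ratings = legacy_ratings_lookup(title_ids)
            # Shape the data so the gateway can stitch it into the
            # federated GraphQL response.
            return [{"id": tid, "aggregateRating": ratings.get(tid)}
                    for tid in title_ids]
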
  • 00:10:38
    Okay, over here, okay, so now each of those microservices
  • 00:10:41
    only knows its piece of the schema, and the gateway,
  • 00:10:44
    a gateway being the front end that the client calls,
  • 00:10:47
    the gateway needs to know the entire schema.
  • 00:10:49
    You need a schema manager, so here it is.
  • 00:10:51
    When you create a new service or update a service,
  • 00:10:54
    it publishes its little piece of the schema
  • 00:10:56
    to the schema manager, which publishes it into an S3 bucket.
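
    A sketch of that publish step with boto3; the bucket, key, and schema snippet are placeholders:

        import boto3

        s3 = boto3.client("s3")

        # On deployment, a service publishes its slice of the schema;
        # the gateway assembles the full federated schema from the bucket.
        s3.put_object(
            Bucket="schema-registry-bucket",      # hypothetical bucket
            Key="subgraphs/ratings.graphql",
            Body=b"type Title { id: ID! aggregateRating: Float }",
        )
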
  • 00:10:59
    So the gateway has a full view of the schema,
  • 00:11:03
    and here's the API for, the front end for the API.
  • 00:11:06
    There's a Application Load Balancer. There's a firewall.
  • 00:11:09
    There's a content delivery network piece
  • 00:11:11
    and I'm gonna talk more about that later,
  • 00:11:12
    so I'm gonna put that on hold,
  • 00:11:14
    and this is when we diverge a little bit
  • 00:11:16
    and talk about culture at Amazon.
  • 00:11:17
    I think many of you already heard the two-pizza team.
  • 00:11:20
    A two-pizza team is a team that could be fed
  • 00:11:21
    with approximately two pizzas,
  • 00:11:23
    so not too big, not too small.
  • 00:11:25
    It's a cross-functional team, but it's all about ownership.
  • 00:11:28
    The two-pizza team owns the service
  • 00:11:30
    or services they're responsible for,
  • 00:11:31
    from design to implementation to deployment
  • 00:11:35
    to operation and the business around it.
  • 00:11:38
    So there might be a product manager on the team
  • 00:11:40
    that's a business expert working with developers there.
  • 00:11:42
    So the nice thing about this model
  • 00:11:44
    is that creating these federated microservices
  • 00:11:49
    moved the business logic into those services
  • 00:11:52
    so that the team could own that business logic.
  • 00:11:54
    So the team that owns the metadata is expert on metadata.
  • 00:11:57
    The team that owns the ratings is expert on ratings,
  • 00:12:00
    and this was organizationally a positive thing for the org,
  • 00:12:05
    and what happened was,
  • 00:12:06
    so they have something called on-call.
  • 00:12:08
    They have a rotating on-call rotation
  • 00:12:09
    where if there's any problem in production,
  • 00:12:11
    in the services they own, they have to respond to it,
  • 00:12:12
    and the senior dev told me they were having
  • 00:12:14
    ridiculously smooth on-calls after this
  • 00:12:16
    and that's because the organizational change
  • 00:12:19
    aligned with the technology change
  • 00:12:21
    meant that the teams that owned the business domain
  • 00:12:25
    and the service were available whenever a problem occurred,
  • 00:12:28
    so if a problem occurred in the aggregate service,
  • 00:12:31
    the rating aggregate service,
  • 00:12:33
    that team would be the one called
  • 00:12:34
    and they'd understand what's going on,
  • 00:12:35
    and it also helped that going to serverless
  • 00:12:38
    helped with scalability.
  • 00:12:40
    All right, so the next best practices we're gonna look at
  • 00:12:42
    is using automation when obtaining or scaling resources
  • 00:12:46
    and obtaining resources upon detection that you need them,
  • 00:12:48
    so detecting that you need new resources
  • 00:12:51
    and obtaining them automatically.
  • 00:12:53
    To do that, I'm gonna do a little divergence,
  • 00:12:54
    just talk about Lambda.
  • 00:12:55
    This is not an IMDb architecture.
  • 00:12:57
    This is just a generic serverless architecture
  • 00:12:58
    'cause I wanna talk about Lambda.
  • 00:13:00
    As I said, Lambda is a way to run code without a server,
  • 00:13:03
    but the way it works is you deploy a Lambda instance
  • 00:13:07
    with some code, and then for every request it gets,
  • 00:13:10
    it spins up, invokes a Lambda instance.
  • 00:13:13
    So here you can see six requests. Six Lambdas get spun up.
  • 00:13:17
    They process the requests and if there's no more requests,
  • 00:13:19
    they spin down.
  • 00:13:20
    So it is automatically scaling.
  • 00:13:22
    It'll scale up and down based on
  • 00:13:24
    the number of requests you get,
  • 00:13:26
    and this is the actual metrics for Lambda invocations.
  • 00:13:31
    So these are the number of Lambdas being invoked
  • 00:13:33
    per minute by IMDb and this could also be translated
  • 00:13:37
    to requests per minute because each request,
  • 00:13:39
    each Lambda invocation represents a single request.
  • 00:13:42
    Note it peaks at 800,000 requests per minute,
  • 00:13:44
    which I also converted to requests per second
  • 00:13:46
    if you want to know that,
  • 00:13:47
    and it also goes up and down quite a bit.
  • 00:13:49
    It's quite cyclical.
  • 00:13:51
    I don't know, who saw my Twitter post about this?
  • 00:13:52
    This is one of the things I posted on Twitter and said,
  • 00:13:54
    "Which service is this?"
  • 00:13:55
    Well, it's IMDb. Now you know, and so two things here.
  • 00:13:59
    One is Lambda just scales, right?
  • 00:14:02
    Like, every request it gets, it spins up a Lambda.
  • 00:14:05
    It's auto-scaling,
  • 00:14:06
    but that isn't quite the end of the story.
  • 00:14:09
    The thing about Lambda is that with certain run times,
  • 00:14:13
    when spinning up a new Lambda, it could take,
  • 00:14:15
    there's some latency involved.
  • 00:14:16
    That's called cold start.
  • 00:14:18
    IMDb didn't want any cold starts,
  • 00:14:19
    so they used something called provisioned concurrency.
  • 00:14:21
    With provisioned concurrency,
  • 00:14:23
    you specify a number of Lambdas you wanna keep warm
  • 00:14:25
    and these warm Lambdas won't have cold start,
  • 00:14:27
    and you pay for that, but you pay a fraction
  • 00:14:29
    of what it'll cost to actually run the Lambda.
  • 00:14:31
    So if they specified a flat number like 800,000,
  • 00:14:35
    that'd be wasteful, right?
  • 00:14:36
    'Cause they're not always running 800,000.
  • 00:14:37
    So what you see here is the gray line,
  • 00:14:39
    this is not IMDb, this is a schematic,
  • 00:14:41
    but the gray line represents a number of Lambda invocations
  • 00:14:44
    and the orange stepwise line
  • 00:14:47
    is the provisioned concurrency
  • 00:14:49
    scaling up and then scaling down.
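
    The talk shows the metrics, not the configuration, but a boto3 sketch of provisioned concurrency that scales up and down could look like this. The function name, alias, and capacity numbers are assumptions, not IMDb's actual settings.

        import boto3

        lam = boto3.client("lambda")
        aas = boto3.client("application-autoscaling")

        # Keep a baseline of warm execution environments on the "live" alias.
        lam.put_provisioned_concurrency_config(
            FunctionName="imdb-gateway",          # hypothetical function
            Qualifier="live",
            ProvisionedConcurrentExecutions=100,
        )

        # Let Application Auto Scaling move that number with load, which is
        # roughly the stepwise orange line in the schematic.
        resource_id = "function:imdb-gateway:live"
        aas.register_scalable_target(
            ServiceNamespace="lambda",
            ResourceId=resource_id,
            ScalableDimension="lambda:function:ProvisionedConcurrency",
            MinCapacity=100,
            MaxCapacity=5000,
        )
        aas.put_scaling_policy(
            PolicyName="pc-utilization",
            ServiceNamespace="lambda",
            ResourceId=resource_id,
            ScalableDimension="lambda:function:ProvisionedConcurrency",
            PolicyType="TargetTrackingScaling",
            TargetTrackingScalingPolicyConfiguration={
                "TargetValue": 0.7,
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType":
                        "LambdaProvisionedConcurrencyUtilization"
                },
            },
        )
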
  • 00:14:51
    So not only the number of Lambdas scale up and down,
  • 00:14:54
    but the provisioned concurrency scales up and down,
  • 00:15:01
    which brings us to our next best practice.
  • 00:15:03
    So we're gonna talk about using
  • 00:15:07
    highly available public endpoints.
  • 00:15:09
    So that's that front end I was talking about,
  • 00:15:11
    that actual API endpoint,
  • 00:15:13
    and so I'm gonna zoom in on it here.
  • 00:15:15
    So zooming in on that front end of that gateway,
  • 00:15:18
    we can see a couple of things.
  • 00:15:19
    Okay, they're using the web application firewall, or WAF.
  • 00:15:22
    All right, so WAF is a firewall product offered by AWS
  • 00:15:26
    and they really loved it.
  • 00:15:29
    They said that the initial turn on was exceedingly simple,
  • 00:15:32
    and as soon as they implemented it, it removed,
  • 00:15:34
    they said no more high-sev issues.
  • 00:15:36
    I'll just say vastly reduced their high-sev issues,
  • 00:15:39
    and they didn't have to put
  • 00:15:40
    the manual network blocks in place.
  • 00:15:42
    So what was causing these high-sev issues?
  • 00:15:45
    Robots, either malicious or non-malicious robots.
  • 00:15:48
    They're constantly fighting against the robots
  • 00:15:51
    and so WAF was a solution for them that really worked.
  • 00:15:54
    There's also a CDN here, a content delivery network,
  • 00:15:57
    called CloudFront, and what CloudFront does is
  • 00:15:59
    you might know that AWS is in 30 regions,
  • 00:16:02
    but we have over 410 edge locations.
  • 00:16:05
    So using CloudFront, your user's request, someone using IMDb,
  • 00:16:08
    their request will be routed to one of those edge locations
  • 00:16:11
    closer to them than a region possibly.
  • 00:16:14
    That puts it right on the AWS backbone right away,
  • 00:16:16
    gets better performance,
  • 00:16:17
    and also being a content delivery network,
  • 00:16:19
    it offers caching, so there's caching at that edge location,
  • 00:16:22
    so if it could serve from the cache, it will,
  • 00:16:24
    and finally, the ALB, the Application Load Balancer.
  • 00:16:27
    That is the actual front end that's connected to the Lambda
  • 00:16:30
    that's running the gateway.
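
    IMDb's actual rules aren't shown in the talk, but as an illustration, attaching AWS's managed Bot Control rule group to a CloudFront-scoped web ACL with boto3 looks roughly like this; the ACL and metric names are placeholders.

        import boto3

        # CLOUDFRONT-scoped web ACLs must be created in us-east-1.
        waf = boto3.client("wafv2", region_name="us-east-1")

        waf.create_web_acl(
            Name="api-acl",                       # placeholder name
            Scope="CLOUDFRONT",
            DefaultAction={"Allow": {}},
            Rules=[{
                "Name": "bot-control",
                "Priority": 0,
                # AWS-managed rule group aimed at the kind of bot traffic
                # the team described fighting.
                "Statement": {"ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesBotControlRuleSet",
                }},
                "OverrideAction": {"None": {}},
                "VisibilityConfig": {
                    "SampledRequestsEnabled": True,
                    "CloudWatchMetricsEnabled": True,
                    "MetricName": "bot-control",
                },
            }],
            VisibilityConfig={
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "api-acl",
            },
        )
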
  • 00:16:33
    All right, that was our first example.
  • 00:16:34
    I hope you enjoyed it, and we got a few more.
  • 00:16:36
    So let's talk about Global Ops Robotics
  • 00:16:38
    and how they protect workloads
  • 00:16:39
    with a cell-based architecture.
  • 00:16:42
    So to understand what Global Ops Robotics is,
  • 00:16:44
    you have to understand a little bit
  • 00:16:46
    about Amazon's supply chain.
  • 00:16:47
    So as a user, you have this ordering layer
  • 00:16:49
    that you're seeing, like I see something,
  • 00:16:51
    I order it, it shows up on my door,
  • 00:16:53
    but under that is a supply chain layer.
  • 00:16:55
    There's the warehouse management piece,
  • 00:16:57
    which is things going on inside the warehouse,
  • 00:16:58
    or fulfillment center as we call them, as Amazon calls them,
  • 00:17:02
    middle mile, which is moving things
  • 00:17:04
    to the warehouse or between them, and last mile,
  • 00:17:07
    which is moving things to your front door.
  • 00:17:09
    Well, Ops Robotics is the warehouse management piece.
  • 00:17:12
    That's what they call it, and with Ops Robotics,
  • 00:17:15
    all of these are about scale.
  • 00:17:16
    I wanna talk about the scale
  • 00:17:17
    of warehouse management at Amazon.
  • 00:17:19
    There's over 500 of these warehouses, fulfillment centers.
  • 00:17:22
    They could be up to a million square feet big
  • 00:17:25
    and there's millions of items per fulfillment center.
  • 00:17:28
    Now, the Ops Robotics team that runs warehouse management
  • 00:17:30
    has multiple services.
  • 00:17:32
    So what kind of services?
  • 00:17:33
    Well, they need services that understand
  • 00:17:35
    when material is received, where it needs to go,
  • 00:17:38
    stow, picking it when someone orders it,
  • 00:17:40
    packing it and shipping it,
  • 00:17:42
    and so all of these are services
  • 00:17:43
    that are part of Global Ops Robotics,
  • 00:17:45
    and behind these services are multiple microservices.
  • 00:17:48
    So you have hundreds,
  • 00:17:49
    maybe even 1,000 microservices operating here,
  • 00:17:53
    and the reliability pillar best practice
  • 00:17:56
    we're gonna talk about is using bulkhead architectures.
  • 00:17:59
    Bulkhead architectures mean setting up compartmentalization
  • 00:18:03
    so that you have multiple of these compartments,
  • 00:18:05
    and if a failure occurs in one,
  • 00:18:06
    it can't affect the others,
  • 00:18:09
    and we're doing this with cell-based architecture.
  • 00:18:11
    Again, this is not Global Ops Robotics.
  • 00:18:13
    This is not warehouse management.
  • 00:18:14
    This is a generic slide on cell-based architectures.
  • 00:18:18
    With cell-based architectures what we're doing
  • 00:18:20
    is stamping out a complete stack multiple times
  • 00:18:23
    isolated from each other,
  • 00:18:24
    they don't share data with each other,
  • 00:18:26
    and putting a thin routing layer on top.
  • 00:18:28
    That routing layer deterministically assigns clients,
  • 00:18:32
    I put, see clients in quotes, to a cell.
  • 00:18:35
    So a given client, and when I say clients in quotes,
  • 00:18:38
    you could actually, it could be user ID.
  • 00:18:39
    It could be whatever.
  • 00:18:40
    It could be several different things,
  • 00:18:41
    some partition key to each cell
  • 00:18:43
    so that you have a certain number of clients
  • 00:18:44
    going to each cell, and if there's a failure in one cell,
  • 00:18:47
    yes, the clients in that cell might be affected,
  • 00:18:49
    but the clients in the other cells
  • 00:18:51
    are isolated from the failure.
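
    A minimal sketch of such a thin routing layer, hashing the partition key to pick a cell. This is one common scheme for deterministic assignment; as described later, the Ops Robotics team uses a lookup table instead so cells can be rebalanced. Cell names are hypothetical.

        import hashlib

        CELLS = ["cell-1", "cell-2", "cell-3"]

        def route(partition_key: str) -> str:
            """Same key always lands on the same cell."""
            digest = hashlib.sha256(partition_key.encode()).hexdigest()
            return CELLS[int(digest, 16) % len(CELLS)]

        # A failure in route("fc-042")'s cell leaves clients assigned to
        # the other cells untouched.
        print(route("fc-042"))
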
  • 00:18:54
    Now, in their case, they're fulfillment centers.
  • 00:18:57
    They're the warehouses and their client ID
  • 00:19:00
    would be the fulfillment center ID.
  • 00:19:02
    Each fulfillment center does not share data
  • 00:19:04
    with the other fulfillment centers.
  • 00:19:05
    It's a discrete data set for them only.
  • 00:19:09
    So it makes sense that when we're deciding
  • 00:19:11
    about that routing layer,
  • 00:19:13
    how you assign requests to cells,
  • 00:19:15
    we do it by fulfillment center.
  • 00:19:17
    Each fulfillment center is assigned a cell
  • 00:19:19
    and all their requests go to their cell.
  • 00:19:22
    They might be sharing a cell with other fulfillment centers,
  • 00:19:24
    but their requests always go to the same cell.
  • 00:19:26
    So in this case you could see three fulfillment centers
  • 00:19:28
    sitting near each other and each one assigned
  • 00:19:31
    to a different cell, and this,
  • 00:19:33
    the thing they wanted to establish
  • 00:19:34
    was this geographic redundancy.
  • 00:19:36
    So you notice that these three are kinda clustered together
  • 00:19:39
    and they're serving this area of the United States,
  • 00:19:41
    Ohio, Indiana, I think that is.
  • 00:19:43
    So what happens if there's a failure?
  • 00:19:46
    It's contained to that cell, Cell2.
  • 00:19:49
    That FC might be offline,
  • 00:19:50
    but there's still two more FCs in that geographic region
  • 00:19:54
    and the trucks continue to roll
  • 00:19:55
    and people get their products still.
  • 00:19:58
    Now, when they're deploying these cells,
  • 00:20:00
    they're actually using separate AWS accounts
  • 00:20:02
    for each cell and they're using pipelines,
  • 00:20:04
    pipelines to deploy the infrastructure,
  • 00:20:06
    pipelines to deploy the code.
  • 00:20:07
    So the first deployment goes to a pre-prod cell
  • 00:20:11
    that's not really used in production
  • 00:20:12
    but used for testing and before it rolls out,
  • 00:20:15
    and then subsequently it gets deployed
  • 00:20:17
    to each of the three other cells
  • 00:20:19
    each in a separate AWS account,
  • 00:20:21
    and there's also another account
  • 00:20:24
    that's a centralized repository of all the logs and traces
  • 00:20:27
    that the other cells are exporting to
  • 00:20:30
    so you can get an all-up view of the system
  • 00:20:32
    'cause you don't wanna look at it cell by cell,
  • 00:20:34
    or you do wanna look at it cell by cell sometimes,
  • 00:20:36
    but you also wanna look at it all up,
  • 00:20:37
    so that's an aggregation point
  • 00:20:39
    where all the logs and traces could be aggregated.
  • 00:20:43
    Now here's what it looks like.
  • 00:20:45
    Each of the green boxes is a cell.
  • 00:20:47
    Each one of the yellow circles with a letter in it
  • 00:20:49
    is a fulfillment center and they each have their own ID.
  • 00:20:52
    They're lettered in this case,
  • 00:20:53
    and this is a cellular architecture.
  • 00:20:55
    We're showing Service 1 and Service 2.
  • 00:20:57
    Service 2 depends on Service 1.
  • 00:20:59
    Service 1 is an upstream dependency of Service 2,
  • 00:21:06
    and this is cellular wise.
  • 00:21:07
    This is a cell-based architecture.
  • 00:21:09
    The problem they found here is,
  • 00:21:10
    if there's a failure in cell one in Service 1,
  • 00:21:13
    the way this is architected,
  • 00:21:15
    there are negative impacts on the cells and services
  • 00:21:18
    in Service 2 because of the dependencies
  • 00:21:21
    and how each fulfillment center can be swapping cells
  • 00:21:24
    based on the service.
  • 00:21:26
    So what they wanted to do was establish this.
  • 00:21:29
    Each fulfillment center is assigned to a given cell
  • 00:21:33
    and it's only in that cell for every service in the stack,
  • 00:21:36
    and now if there's a failure, like we saw before,
  • 00:21:39
    it's not the greatest thing in the world to have a failure,
  • 00:21:41
    but it's constrained and those other fulfillment centers
  • 00:21:44
    continue to operate normally,
  • 00:21:47
    and the way they did this was they designed a system
  • 00:21:51
    to assign fulfillment centers to cells.
  • 00:21:54
    They did this using DynamoDB,
  • 00:21:55
    which is our NoSQL, very fast database,
  • 00:21:59
    and this had two effects,
  • 00:22:01
    aligning fulfillment centers to a cell
  • 00:22:03
    but also allowing them to load balance between cells
  • 00:22:06
    'cause fulfillment centers are different sizes.
  • 00:22:08
    So you can't just put three per cell
  • 00:22:09
    or whatever like I did here.
  • 00:22:11
    So this system also runs various rules and heuristics
  • 00:22:14
    to balance out the cells so no cell is
  • 00:22:16
    particularly bigger than another one,
  • 00:22:20
    and that's our cellular architecture.
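
    A sketch of what that DynamoDB-backed assignment lookup could look like; the table and attribute names are hypothetical.

        import boto3

        table = boto3.resource("dynamodb").Table("cell-assignments")

        def cell_for_fc(fc_id: str) -> str:
            """Every service in the stack resolves an FC to the same cell."""
            return table.get_item(Key={"fc_id": fc_id})["Item"]["cell_id"]

        # A separate balancing job (the rules and heuristics mentioned above)
        # rewrites these assignments so no cell grows disproportionately.
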
  • 00:22:21
    Now I wanna talk about Amazon Relay
  • 00:22:24
    and how they use multi-region to keep their trucks moving.
  • 00:22:27
    So trucks are involved here.
  • 00:22:28
    So we're still in the supply chain world.
  • 00:22:30
    So we talked about warehouse management.
  • 00:22:32
    Now we're gonna talk about middle mile management.
  • 00:22:34
    Spoiler alert, I do not have an example for last mile.
  • 00:22:37
    So if you're expecting that, come back next year.
  • 00:22:39
    I'll have one next year.
  • 00:22:41
    All right, so this is about middle mile.
  • 00:22:43
    So middle mile is the semi trucks you see on the road
  • 00:22:46
    with the Amazon Prime symbol on it.
  • 00:22:48
    This is about moving stuff into warehouses
  • 00:22:50
    and between warehouses and making sure
  • 00:22:53
    that all the millions and millions of items
  • 00:22:55
    in Amazon's inventory are in the right place
  • 00:22:58
    to be able to serve customers.
  • 00:23:00
    Now, this example I'm gonna show you
  • 00:23:01
    focuses on North America,
  • 00:23:02
    but middle mile exists around the world,
  • 00:23:05
    and I'm gonna talk about the Relay app.
  • 00:23:07
    The Relay app is an app for iOS and Android
  • 00:23:09
    that the truckers use.
  • 00:23:10
    So if you could think of middle mile
  • 00:23:12
    as having this really sophisticated model
  • 00:23:14
    that determines where stuff should be
  • 00:23:16
    and when it should be there all over United States,
  • 00:23:19
    this is how that model's realized.
  • 00:23:21
    That model is just something in a computer.
  • 00:23:23
    It's meaningless unless you can get trucks rolling
  • 00:23:25
    and moving stuff around.
  • 00:23:26
    This is the realization of that model.
  • 00:23:28
    This is the model that truck drivers use
  • 00:23:30
    to know where to go, when to go there,
  • 00:23:33
    what to pick up, where to take it,
  • 00:23:36
    and you could download this app today on your phone.
  • 00:23:38
    I did it and it's pretty useless
  • 00:23:39
    unless you're a truck driver.
  • 00:23:40
    So truck drivers in the audience, feel free to download it,
  • 00:23:43
    but everybody else. (laughing)
  • 00:23:46
    Okay, so best practice I'm gonna talk about
  • 00:23:49
    is using highly available endpoints.
  • 00:23:51
    All right, so we talked about that already, right?
  • 00:23:52
    Highly available endpoints.
  • 00:23:53
    Oh, we talked about that when we talked about
  • 00:23:55
    the IMDb gateway.
  • 00:23:56
    Well, same thing here.
  • 00:23:57
    Amazon Relay being an app has a gateway too,
  • 00:24:00
    again, a single point of entry that the app is,
  • 00:24:03
    the iOS app and the Android app are both calling into,
  • 00:24:07
    and just like IMDb, there's a gateway
  • 00:24:09
    and it's fronting several backend services.
  • 00:24:11
    In this case, they call them modules.
  • 00:24:12
    So I'm gonna call them the modules too.
  • 00:24:14
    So there you can see the modules there.
  • 00:24:15
    The modules are mostly serverless
  • 00:24:18
    consisting of Lambda and DynamoDB, and there's the gateway.
  • 00:24:22
    So unlike IMDb, they're not using Application Load Balancer.
  • 00:24:24
    They're using API Gateway.
  • 00:24:26
    API Gateway is a highly scalable managed API service
  • 00:24:30
    and you can see there's multiple API Gateways there
  • 00:24:31
    'cause the way this works is you could use,
  • 00:24:34
    well, see, you could use Route 53, which is not shown there,
  • 00:24:37
    which is a DNS system, to create a domain name,
  • 00:24:40
    and then based on path-based routing,
  • 00:24:42
    like what's after the slash and what's after domain name,
  • 00:24:45
    it goes to a different API Gateway
  • 00:24:47
    and then API Gateway fronts one of these backend modules,
  • 00:24:50
    and you can also see there's also
  • 00:24:51
    some authentication logic in there too.
  • 00:24:53
    So that's important and it's calling into
  • 00:24:54
    the Amazon authentication system to do that.
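
    The exact wiring isn't shown in the talk, but one way to get that path-based fan-out is base path mappings on a shared API Gateway custom domain. A boto3 sketch with placeholder domain and API IDs:

        import boto3

        apigw = boto3.client("apigateway")

        # Map /loads and /navigation under one custom domain to different
        # REST APIs (the backend modules' gateways).
        for base_path, api_id in [("loads", "a1b2c3"), ("navigation", "d4e5f6")]:
            apigw.create_base_path_mapping(
                domainName="relay.example.com",   # placeholder domain
                basePath=base_path,
                restApiId=api_id,
                stage="prod",
            )
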
  • 00:24:58
    Now, what they really liked about this model
  • 00:24:59
    when they went to it is there's no shared ownership
  • 00:25:01
    of code or infrastructure between the gateway
  • 00:25:05
    and the backend modules.
  • 00:25:06
    So they could deploy independently.
  • 00:25:08
    They could make changes independently
  • 00:25:09
    as long as they don't break any contracts
  • 00:25:11
    and it gave them a lot more flexibility.
  • 00:25:15
    Now, the other best practice we wanna look into
  • 00:25:17
    with these two teams are to deploy the workload
  • 00:25:19
    to multiple locations and choose the appropriate locations
  • 00:25:22
    for those deployments, and to talk about this,
  • 00:25:26
    we need to go back to December of 2021.
  • 00:25:29
    As many of you know, in us-east-1 in December 2021,
  • 00:25:32
    there was an event that caused several services
  • 00:25:35
    to experience service issues,
  • 00:25:37
    and one of the services affected was SNS,
  • 00:25:39
    or Simple Notification Service,
  • 00:25:41
    and Relay app does depend on that
  • 00:25:44
    and you can see the effect.
  • 00:25:45
    All right, so some truck drivers
  • 00:25:48
    could not get their load assignments.
  • 00:25:49
    They couldn't get the assignment of where to go
  • 00:25:50
    and what to pick up and you can see it wasn't 100%.
  • 00:25:53
    It went up to about 30% at its peak
  • 00:25:55
    and it was for just some limited period of time,
  • 00:25:57
    but that's still an impact on our customers
  • 00:25:59
    and Amazon does not wanna have that kind of impact.
  • 00:26:04
    So what could you do?
  • 00:26:05
    You could redesign to either not use SNS
  • 00:26:08
    or make SNS a soft dependency
  • 00:26:10
    or you could take the approach they did and use spares.
  • 00:26:13
    Spares is where you set up multiple instances of a resource
  • 00:26:16
    so if one of them is not working, you could use the other.
  • 00:26:19
    Now, in this case, SNS is a regional service,
  • 00:26:23
    so in order to be able to use a different SNS service,
  • 00:26:26
    they had to go to another region
  • 00:26:27
    and I'll show you how they did that,
  • 00:26:29
    but first I gotta introduce you
  • 00:26:31
    to another cultural thing at Amazon,
  • 00:26:32
    the COE, or correction of error event.
  • 00:26:35
    So when something like this happens
  • 00:26:36
    where 30% of the truck drivers
  • 00:26:38
    are not able to get their load assignments,
  • 00:26:39
    that's customer impacting, the team does a COE.
  • 00:26:42
    A COE is a deep dive as to what caused the issue
  • 00:26:45
    and how it could be avoided.
  • 00:26:46
    It's blameless. It's not there to point fingers.
  • 00:26:49
    It's not there to find the culprit as a person.
  • 00:26:52
    It's there to find the actual cause of the issue
  • 00:26:54
    and to come up with solutions,
  • 00:26:56
    actions so that issues like this,
  • 00:27:00
    an issue like this or related to this
  • 00:27:01
    can never happen again,
  • 00:27:03
    and here are some of the ones they came up with
  • 00:27:05
    and the ones I'm gonna talk about.
  • 00:27:06
    I'm gonna talk about how they did
  • 00:27:08
    a review of the resiliency of the Relay app
  • 00:27:11
    and how they then deployed to multiple regions
  • 00:27:15
    to enact what was found in that review.
  • 00:27:18
    So in that review,
  • 00:27:21
    their primary goal there was to preserve
  • 00:27:23
    physical operational continuity,
  • 00:27:25
    even if the experience is degraded.
  • 00:27:27
    So what do I mean by degraded? Let's talk about that.
  • 00:27:29
    So the three steps they did was they had to articulate
  • 00:27:31
    the minimum critical workflow.
  • 00:27:33
    So which parts of this have to work, while other parts,
  • 00:27:36
    if they're not working, it's not optimal,
  • 00:27:38
    but we could still keep the trucks rolling.
  • 00:27:41
    Two, design solutions that those critical parts
  • 00:27:44
    remain operational, and three,
  • 00:27:46
    adapt the system so that when the parts
  • 00:27:49
    that are not so critical stop working,
  • 00:27:51
    the system could still operate.
  • 00:27:53
    That's what we mean by the degraded experience.
  • 00:27:55
    It still works, the critical functions are there,
  • 00:27:59
    and they just, as I said before,
  • 00:28:00
    they went with a multi-region approach.
  • 00:28:02
    So they were already deployed in us-east-1,
  • 00:28:05
    and fun fact, they'd been running out of
  • 00:28:08
    what was the predecessor to us-east-1 before AWS existed.
  • 00:28:12
    Amazon had data centers there
  • 00:28:14
    and that's where they ran out of,
  • 00:28:16
    but they also decided to deploy to us-west-2 over in Oregon.
  • 00:28:20
    Amazon has, AWS has 30 regions all over the world.
  • 00:28:25
    You could see those are the ones in North America,
  • 00:28:28
    and the solution looked like this.
  • 00:28:29
    So this is the backend modules, okay?
  • 00:28:31
    So the backend modules weren't necessarily as simple
  • 00:28:33
    as a Lambda and a DynamoDB,
  • 00:28:35
    but the thing about them is that they all were fronted
  • 00:28:37
    by Lambda so they could integrate with the API Gateway
  • 00:28:39
    and they all persisted their important data,
  • 00:28:41
    the data that needed to be shared, in DynamoDB,
  • 00:28:44
    and so in this case, you could see they deployed
  • 00:28:46
    to us-east-1 and us-west-2,
  • 00:28:48
    and the nice thing about DynamoDB,
  • 00:28:52
    it has something called global tables.
  • 00:28:55
    With DynamoDB global tables,
  • 00:28:56
    you can deploy a table in multiple regions
  • 00:28:59
    and write to any of those tables
  • 00:29:01
    and those writes will be replicated to the other regions.
  • 00:29:03
    So they found that just to be an easy solution
  • 00:29:05
    just to put right in there.
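
    A sketch of enabling that with boto3, adding a us-west-2 replica to an existing table (global tables version 2019.11.21); the table name is hypothetical.

        import boto3

        ddb = boto3.client("dynamodb", region_name="us-east-1")

        # Turn an existing table into a global table by adding a replica;
        # writes in either region then replicate to the other.
        ddb.update_table(
            TableName="relay-load-assignments",   # hypothetical table
            ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
        )
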
  • 00:29:07
    Now, each of these modules is owned by a two-pizza team
  • 00:29:10
    or a two-pizza team might own more than one of them,
  • 00:29:12
    but they're all owned by a two-pizza team,
  • 00:29:14
    and the two-pizza teams, based on the criticality analysis,
  • 00:29:17
    decided whether they were gonna go multi-region or not.
  • 00:29:19
    Not all of them did because you have to pick
  • 00:29:21
    where to put your resources right, where to invest.
  • 00:29:26
    Now, the gateway part of it, the part in front there,
  • 00:29:29
    also was deployed to two regions.
  • 00:29:30
    You could see that API Gateway
  • 00:29:31
    which is representing the gateway
  • 00:29:33
    going to us-east-1 and us-west-2,
  • 00:29:36
    and now we put Route 53 in front of it.
  • 00:29:38
    So Route 53 is our DNS system.
  • 00:29:40
    This is called an active/active architecture.
  • 00:29:44
    What it means is that each of the two regions here
  • 00:29:47
    actively receive requests.
  • 00:29:49
    A given request doesn't go to both regions.
  • 00:29:51
    It goes to one or the other.
  • 00:29:52
    How does it decide which one to go to?
  • 00:29:53
    Well, Route 53 offers several routing policies.
  • 00:29:56
    In this case, they decided to use latency routing.
  • 00:29:58
    So Route 53, based on past experience,
  • 00:30:00
    will determine for a given request
  • 00:30:02
    which one's gonna give the lowest latency
  • 00:30:04
    and route the request there.
  • 00:30:05
    There are other routing policies.
  • 00:30:06
    There's weighted routing. There's geolocation routing.
  • 00:30:09
    So it routes it based on where the request came from.
  • 00:30:12
    So there's all kinds of different options.
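
    A sketch of what a latency-routed record pair could look like with boto3; the hosted zone ID, record name, and targets are placeholders.

        import boto3

        r53 = boto3.client("route53")

        def latency_record(region: str, target: str) -> dict:
            return {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.relay.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": region,
                    "Region": region,     # key for latency-based routing
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }

        r53.change_resource_record_sets(
            HostedZoneId="Z_PLACEHOLDER",
            ChangeBatch={"Changes": [
                latency_record("us-east-1", "east.gateway.example.com"),
                latency_record("us-west-2", "west.gateway.example.com"),
            ]},
        )
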
  • 00:30:13
    This team, Relay, went with the latency-based routing,
  • 00:30:17
    and so this is a request to us-east-1
  • 00:30:20
    and you can see the module called module A.
  • 00:30:24
    They are a module that did go multi-region.
  • 00:30:26
    So the request goes to their us-east-1 version
  • 00:30:29
    and then module B there did not go multi-region,
  • 00:30:32
    so the request also goes to us-east-1.
  • 00:30:35
    However, requests that went to us-west-2,
  • 00:30:38
    this is where it gets interesting,
  • 00:30:39
    for module A, it's gonna go to us-west-2,
  • 00:30:42
    but module B, a less critical module,
  • 00:30:44
    didn't set up anything in us-west-2,
  • 00:30:46
    so it's still gonna receive its request in us-east-1,
  • 00:30:49
    and we'll see how that just plays out later
  • 00:30:51
    in various failure scenarios.
  • 00:30:54
    Yeah, so that's how it works.
  • 00:30:55
    All right, so the next best practice
  • 00:30:56
    is to implement graceful degradation
  • 00:30:58
    to turn hard dependencies into soft dependencies.
  • 00:31:01
    Now, I talked a little bit about
  • 00:31:02
    what graceful degradation is.
  • 00:31:03
    It's about maintaining the critical parts of your workload
  • 00:31:06
    while the less critical ones might fail, but overall,
  • 00:31:09
    the end users still can do the things they need to do.
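
    As a toy illustration of turning a hard dependency into a soft one; the function and service names are made up, not Relay's actual code.

        def critical_assignment_service(driver_id):
            return {"driver": driver_id, "load": "LOAD-123"}

        def navigation_service(driver_id):
            raise TimeoutError("navigation dependency unavailable")

        def get_load_assignment(driver_id):
            assignment = critical_assignment_service(driver_id)  # must succeed
            # Soft dependency: turn-by-turn navigation is nice to have;
            # if it fails, degrade gracefully and keep the trucks moving.
            try:
                assignment["route"] = navigation_service(driver_id)
            except Exception:
                assignment["route"] = None   # degraded, still operable
            return assignment

        print(get_load_assignment("driver-7"))
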
  • 00:31:13
    So this is that analysis they did
  • 00:31:14
    when they wrote up that report.
  • 00:31:16
    Going from left to right in order,
  • 00:31:18
    these are the things that a truck, a delivery goes through,
  • 00:31:21
    the various business domain specific things
  • 00:31:24
    that middle mile goes through, and what the red lines,
  • 00:31:26
    the red bars represent are criticality.
  • 00:31:28
    So by creating a graph like this,
  • 00:31:30
    they're able to identify which modules are critical
  • 00:31:33
    and which ones are less critical.
  • 00:31:34
    So for instance, it's critical that they be able
  • 00:31:37
    to complete a delivery.
  • 00:31:39
    It's critical that you could assign drivers
  • 00:31:41
    to pick up their loads.
  • 00:31:42
    Then what's not critical? What can we do without?
  • 00:31:44
    Well, the app provides turn-by-turn navigation.
  • 00:31:47
    So if that goes out, again, not optimal,
  • 00:31:50
    but there's other GPS systems.
  • 00:31:51
    The app also has this long-term booking
  • 00:31:53
    where you could book next week's loads.
  • 00:31:55
    Well, that's important, but it might not be important now
  • 00:31:58
    while there's some kind of issue going on,
  • 00:31:59
    and eventually, whatever the issue is going on,
  • 00:32:02
    it's gonna be solved and then you could assign
  • 00:32:03
    next week's loads.
  • 00:32:04
    So I really like the subtitle here, "The trucks keep moving,
  • 00:32:07
    no products backed up on the docks."
  • 00:32:09
    That's what they told me, "The trucks keep moving,
  • 00:32:10
    no products backed up on the docks,"
  • 00:32:12
    and that's what they're aiming for,
  • 00:32:14
    and so the next best practice
  • 00:32:15
    we're gonna look at is fail over.
  • 00:32:17
    Okay, being able to fail over to healthy resources.
  • 00:32:19
    So what happens if they have another event
  • 00:32:22
    where they want to fail out of us-east-1
  • 00:32:25
    and be purely in us-west-2?
  • 00:32:27
    So in this case, using the routing policy,
  • 00:32:29
    they could turn off all traffic to us-east-1,
  • 00:32:31
    send all the traffic to us-west-2, and this is what happens.
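
    One simple way to express that flip is with weighted records (in place of the latency records sketched earlier), draining us-east-1 by setting its weight to 0; all names remain placeholders.

        import boto3

        r53 = boto3.client("route53")

        for region, weight in [("us-east-1", 0), ("us-west-2", 100)]:
            r53.change_resource_record_sets(
                HostedZoneId="Z_PLACEHOLDER",
                ChangeBatch={"Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.relay.example.com",
                        "Type": "CNAME",
                        "SetIdentifier": region,
                        "Weight": weight,   # 0 = stop sending traffic here
                        "TTL": 60,
                        "ResourceRecords": [
                            {"Value": f"{region}.gateway.example.com"}
                        ],
                    },
                }]},
            )
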
  • 00:32:34
    So, I'm sorry. I'm gonna actually go back.
  • 00:32:36
    So notice that for module A,
  • 00:32:39
    it's gonna use the version of module A that's in us-west-2,
  • 00:32:43
    and that's a critical module
  • 00:32:44
    and it's gonna continue operating.
  • 00:32:46
    What happens to module B?
  • 00:32:47
    Remember, module B never set up a us-west-2 version.
  • 00:32:51
    So one of two things is probably gonna happen.
  • 00:32:52
    Either one, the request is gonna go
  • 00:32:54
    to us-east-1, the region we failed out of,
  • 00:32:57
    but we failed outta there because we're seeing some issue
  • 00:32:59
    that we think we wanna fail out for. That doesn't mean
  • 00:33:01
    the region's hard down; a region's never completely hard down.
  • 00:33:03
    That doesn't happen.
  • 00:33:04
    So the service in us-east-1 still might respond
  • 00:33:07
    and that's a best-case scenario,
  • 00:33:08
    but worst-case scenario, it doesn't respond,
  • 00:33:10
    and because of the way the system's designed
  • 00:33:12
    and graceful degradation,
  • 00:33:14
    it's again a less than optimal experience
  • 00:33:16
    but an experience that allows the users
  • 00:33:18
    to do their critical functions,
  • 00:33:19
    which is to keep the trucks moving,
  • 00:33:20
    nothing backed up on the docks,
  • 00:33:23
    and the last one we're gonna look at
  • 00:33:25
    is about testing your disaster recovery strategy
  • 00:33:27
    'cause you could have a disaster recovery strategy,
  • 00:33:29
    but if you don't test it, you don't know if it works,
  • 00:33:32
    and so they ran a game day, a game day basically to exercise
  • 00:33:34
    this disaster recovery strategy.
  • 00:33:36
    They wanna be prepared for peak 2022.
  • 00:33:38
    Peak at Amazon represents the holiday season.
  • 00:33:40
    I think we're already in it with Black Friday
  • 00:33:42
    and Cyber Monday already going on.
  • 00:33:45
    So what they did was they initiated
  • 00:33:47
    a fail over in production.
  • 00:33:51
    They acted as if they needed to get out of us-east-1.
  • 00:33:53
    They didn't need to, but they acted as if they did,
  • 00:33:56
    got out of us-east-1, failed over,
  • 00:33:58
    sent all the traffic to us-west-2 and this is what happened.
  • 00:34:02
    You notice that the increase in traffic in us-west-2
  • 00:34:04
    is way over 100%.
  • 00:34:05
    If it was evenly balanced,
  • 00:34:07
    you'd expect it to be 100% increase,
  • 00:34:08
    but it was more than 100% increase.
  • 00:34:09
    So this represents that most of the traffic's still going
  • 00:34:12
    to us-east-1 and that's probably just the nature
  • 00:34:14
    of population density in the United States.
  • 00:34:17
    The other thing I forgot to mention is
  • 00:34:18
    when they went to the active/active model,
  • 00:34:20
    truck drivers in the West started seeing
  • 00:34:22
    much lower latencies 'cause their requests
  • 00:34:23
    were being sent to the West region.
  • 00:34:26
    Also, when they failed over,
  • 00:34:27
    they actually were able to successfully run the service
  • 00:34:30
    without any significant customer impact
  • 00:34:32
    or failures, et cetera.
  • 00:34:34
    It took 'em about 10 minutes to execute the fail over.
  • 00:34:36
    They did see an increase in latency,
  • 00:34:40
    and they're working on it, still re-engineering
  • 00:34:42
    to try to get that down, but the increase in latency,
  • 00:34:44
    again, maybe less than optimal,
  • 00:34:46
    still enabled everyone
  • 00:34:48
    to do the critical functions, kept the trucks rolling,
  • 00:34:50
    nothing backing up on the docks.
  • 00:34:56
    All right, our next example is
  • 00:34:57
    the Classification and Policies Platform
  • 00:34:59
    and how they use shuffle sharding to limit blast radius.
  • 00:35:02
    So basically, similar to before,
  • 00:35:06
    blast radius is about containing the failure
  • 00:35:09
    to an area, to a cell, in this case, a shard,
  • 00:35:12
    so that it doesn't affect other parts of the system,
  • 00:35:15
    and so what is Classification and Policy Platform?
  • 00:35:18
    Well, they're part of the catalog service
  • 00:35:21
    and the catalog at Amazon is massive,
  • 00:35:24
    millions and millions of items in the Amazon catalog,
  • 00:35:27
    and every single one of those items needs to be classified.
  • 00:35:30
    So what do I mean by classified?
  • 00:35:31
    Well, there's 50 different classification programs.
  • 00:35:34
    It could be as simple as what type is it.
  • 00:35:35
    Is it clothing? Is it electronics?
  • 00:35:37
    It could be what kinda taxes should be applied.
  • 00:35:40
    Can this thing be put on an airplane? Is it hazardous?
  • 00:35:43
    Is it something that we can sell in a certain state?
  • 00:35:46
    Is it something that children are allowed to use?
  • 00:35:48
    I mean, there's all kinds of classification going on,
  • 00:35:51
    50 of these programs which are actually
  • 00:35:53
    not necessarily part of this team.
  • 00:35:55
    This team runs the platform to host
  • 00:35:57
    all these classification programs
  • 00:35:58
    and applying classification to all the millions
  • 00:36:01
    and millions of things in the Amazon catalog,
  • 00:36:04
    and why is this important?
  • 00:36:05
    Well, I kinda gave this away a little bit because I said,
  • 00:36:07
    all right, so here's an item that I,
  • 00:36:10
    living in Washington, can buy,
  • 00:36:12
    but when John living in California goes to buy it, it says,
  • 00:36:15
    "No, you can't have it because California restricts
  • 00:36:18
    this item or says you can't have it,"
  • 00:36:19
    and that's an example of the classification was applied
  • 00:36:22
    to the item and the ordering service was able
  • 00:36:24
    to read that classification and say,
  • 00:36:25
    "No, I cannot sell it to people in California,"
  • 00:36:29
    and again, it's about scale.
  • 00:36:31
    So there's 50 programs across Amazon
  • 00:36:33
    doing this classification that are using this platform.
  • 00:36:35
    There's over 10,000 machine learning models being applied,
  • 00:36:38
    100,000 rules, so that's like if this, then that,
  • 00:36:41
    so less sophisticated than machine learning
  • 00:36:43
    but still important,
  • 00:36:44
    and there's 100 model updates every day.
  • 00:36:47
    So of these 10,000 machine learning models,
  • 00:36:49
    100 are being updated every day.
  • 00:36:50
    Millions of products are being updated per hour.
  • 00:36:53
    So this is what it looks like.
  • 00:36:54
    All right, so this is,
  • 00:36:55
    if you're dozing off, time to pay attention
  • 00:36:57
    'cause this is where it gets a little complicated.
  • 00:36:58
    I wanna make it simple, all right?
  • 00:37:00
    So you have the millions of items
  • 00:37:01
    that need to be classified and we're breaking them up
  • 00:37:05
    into batches of about 200 each,
  • 00:37:06
    but apparently it can vary quite a bit.
  • 00:37:08
    So that's not so important.
  • 00:37:10
    What's important is that we need to apply about 100,
  • 00:37:12
    not all 10,000 machine learning models,
  • 00:37:14
    but about 100 machine learning models
  • 00:37:17
    to every item coming in
  • 00:37:18
    and we do that in batches of 30 models.
  • 00:37:21
    So that means there's gonna be three requests made,
  • 00:37:23
    three batches of 30.
  • 00:37:24
    Why 30? I'll get to that in a minute.
  • 00:37:27
    So these requests to process these items
  • 00:37:29
    for 30 machine learning models go to a classifier.
  • 00:37:32
    In this case, it's an Elastic Container Service service
  • 00:37:36
    that's running machine learning models against these items,
  • 00:37:39
    and what it does is it pulls the models down from S3.
  • 00:37:43
    So it says, "Oh, these items need these 30 models.
  • 00:37:46
    I'm gonna pull these 30 models down and run them,"
  • 00:37:49
    and it can cache the models and that's important
  • 00:37:51
    'cause we want to actually try to use workers,
  • 00:37:54
    these are all workers, these classifiers,
  • 00:37:55
    that already have those models cached.
  • 00:37:57
    So pulling the models down
  • 00:37:58
    and swapping 'em out is inefficient,
  • 00:38:01
    and then after it does the classification,
  • 00:38:02
    it writes it to DynamoDB.
  • 00:38:04
    So this is the logical view.
  • 00:38:05
    Let me show you the architectural view.
  • 00:38:07
    Oh, wait, I promised to tell you why 30 is important.
  • 00:38:09
    Well, two reasons.
  • 00:38:11
    They found that that's a nice size
  • 00:38:12
    where they could take 30 related models,
  • 00:38:15
    so like one model might actually use
  • 00:38:16
    the output of another one.
  • 00:38:18
    So they're sort of related in some way,
  • 00:38:20
    but the other reason is about
  • 00:38:20
    that caching I was talking about.
  • 00:38:22
    There's only so many models that these services
  • 00:38:25
    can keep in cache.
  • 00:38:27
    So if you told it to be running 100 models,
  • 00:38:29
    it can't possibly keep those all in cache,
  • 00:38:31
    so limiting it to 30, again,
  • 00:38:33
    allows you to avoid the swapping out.
  • 00:38:35
    Remember, trying to avoid that swapping out.
  • 00:38:38
    So this is more an architectural view.
  • 00:38:39
    So taking one of those requests,
  • 00:38:41
    which again is for 30 models,
  • 00:38:43
    goes through Kinesis where Kinesis reads the metadata
  • 00:38:46
    on the request and decides what models are gonna be applied,
  • 00:38:48
    and this is important.
  • 00:38:49
    This is the part where it sends it to an AWS Lambda
  • 00:38:50
    which is acting as a router.
  • 00:38:52
    It's the AWS Lambda that says,
  • 00:38:54
    "Oh, you need these 30 models?
  • 00:38:56
    I'm gonna send you to this worker,"
  • 00:38:58
    and it puts it on an SQS queue where then the workers,
  • 00:39:02
    the ECS services, read it off the queue.
  • 00:39:04
    So in other words, there's 60 of these workers.
  • 00:39:06
    The workers are dumb.
  • 00:39:07
    They'll process whatever you give them.
  • 00:39:09
    You tell 'em to process
  • 00:39:10
    these 30 machine learning models.
  • 00:39:11
    They'll check: is it in cache?
  • 00:39:12
    Yeah, all right, I'll do it.
  • 00:39:13
    It's not in cache? All right, I'll pull it down.
  • 00:39:15
    They don't care, so all the smarts are in that Lambda.
  • 00:39:18
    That Lambda is attempting to keep
  • 00:39:20
    each of these 30 model requests
  • 00:39:22
    in the same worker or workers
  • 00:39:24
    that have processed those 30 before to avoid the swapping,
  • 00:39:30
    and so we're gonna talk about
  • 00:39:31
    a best practice we talked about before,
  • 00:39:33
    using these bulkhead architectures,
  • 00:39:35
    only we're not talking about cells in this case.
  • 00:39:37
    We're gonna be talking about shards
  • 00:39:38
    and specifically shuffle sharding.
  • 00:39:40
    So I'm gonna take about three slides
  • 00:39:42
    to explain shuffle sharding.
  • 00:39:43
    Now, a warning:
  • 00:39:46
    I've seen one-hour talks at this conference
  • 00:39:48
    to explain shuffle sharding
  • 00:39:50
    and I'm gonna do it in three slides.
  • 00:39:51
    So hopefully I land the message. If I don't, don't sweat it.
  • 00:39:53
    I think you can still follow along.
  • 00:39:55
    All right, so this is just an example
  • 00:39:57
    of some service that has multiple workers.
  • 00:39:59
    They could be EC2 instances,
  • 00:40:01
    or like in the case of CPP, they could be ECS services,
  • 00:40:04
    and on top, those different symbols are different clients
  • 00:40:07
    or different callers of the service
  • 00:40:09
    and there's no sharding going on here,
  • 00:40:11
    and the thing about no sharding is what happens if
  • 00:40:14
    one of those clients sends what we call a poison pill.
  • 00:40:16
    It makes either a malformed request, a corrupt request,
  • 00:40:20
    maybe even a malicious request,
  • 00:40:21
    something that kills the service, the process running on it.
  • 00:40:25
    Maybe it tickles a bug that we didn't know we had.
  • 00:40:28
    It takes down that worker.
  • 00:40:31
    Okay, no problem. We have load balancing, right?
  • 00:40:33
    That worker's down. Let's try another worker.
  • 00:40:36
    Oh, it takes that one down.
  • 00:40:38
    It takes the next one down too
  • 00:40:40
    and eventually will work its way through all the workers
  • 00:40:43
    until there are no workers and everybody's outta luck.
  • 00:40:45
    All the clients are now red.
  • 00:40:47
    Nobody's able to call the service.
  • 00:40:50
    That's no sharding. So let's introduce sharding, okay?
  • 00:40:55
    This is sharding.
  • 00:40:56
    It's just taking a resource layer, unlike cells.
  • 00:40:59
    Cells were the entire stack.
  • 00:41:00
    This is just taking some resource layer
  • 00:41:01
    and dividing it into chunks, in this case, chunks of two.
  • 00:41:05
    Shards of two workers each,
  • 00:41:06
    and in this case, the cat does its thing,
  • 00:41:09
    kills its two workers.
  • 00:41:10
    It and the dog are unhappy, but everybody else is happy.
  • 00:41:14
    That's the bulkhead architecture at work.
  • 00:41:16
    It contained the failure to that shard
  • 00:41:19
    and you could see the number of customers impacted
  • 00:41:21
    is customers, 8, divided by shards, 4.
  • 00:41:23
    2 customers impacted. It checks out, right?
  • 00:41:27
    Okay, now shuffle sharding.
  • 00:41:29
    This is where it gets interesting.
  • 00:41:30
    Each client in this case is assigned
  • 00:41:33
    its own unique pair of two workers,
  • 00:41:36
    but they can be sharing workers.
  • 00:41:38
    So what do I mean by this?
  • 00:41:39
    If you look at the bishop here,
  • 00:41:40
    bishop has these two workers.
  • 00:41:42
    That's the bishop shard. The rook has these two workers.
  • 00:41:46
    They are two unique pairs, they're not the same pair,
  • 00:41:49
    but they're sharing a worker.
  • 00:41:51
    Same thing here.
  • 00:41:51
    The cat has these two workers,
  • 00:41:53
    but again, it has its own unique pair of workers.
  • 00:41:56
    None of the clients, none of the eight clients here
  • 00:41:58
    shares the same two.
  • 00:42:00
    They each share with other shards,
  • 00:42:02
    but none of them share the same two with another client.
  • 00:42:05
    So in this case, what happens is the cat does its thing,
  • 00:42:08
    kills its two workers.
  • 00:42:09
    It's down, but even though the rook
  • 00:42:12
    was sharing one worker with it,
  • 00:42:14
    it still has a healthy worker.
  • 00:42:15
    It has its own unique pair of workers.
  • 00:42:17
    Same thing with the bishop and everybody else.
  • 00:42:20
    So the number of customers impacted
  • 00:42:22
    is customers divided by combinations,
  • 00:42:24
    which in this case is 8 customers,
  • 00:42:26
    and I made 8 shuffle shards,
  • 00:42:28
    so only 1 customer was affected,
  • 00:42:30
    which means that at scale,
  • 00:42:32
    our scope of impact is 12 1/2 percent or 1/8,
  • 00:42:35
    meaning that if we had 800 clients, 100 would be impacted,
  • 00:42:38
    but it gets better than this 'cause actually,
  • 00:42:40
    I can make more than 8 shuffle shards out of this.
  • 00:42:43
    With 8 workers and making combinations of 2,
  • 00:42:47
    some of you might recognize this math, it's 8 choose 2,
  • 00:42:51
    and you can actually make 28 combinations,
  • 00:42:52
    so the actual scope of impact is much less,
  • 00:42:56
    and if you don't know that math, don't worry about it.
  • 00:42:57
    This is how many combinations of unique sets of 2 workers
  • 00:43:01
    you can make given 8, and if you really wanna go crazy,
  • 00:43:05
    there's Route 53, over 2,000 workers making shards of 4.
  • 00:43:10
    Your scope of impact is 1 in 730 billion.
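
If you want to check the combinatorics, here's a quick Python sketch; the hash-based client-to-shard assignment at the end is a toy illustration, not Route 53's or CPP's actual algorithm, and the 2,048-worker figure is an assumption chosen to match the "1 in 730 billion" number.

```python
# Checking the shuffle-shard counts with n-choose-k, plus a toy
# deterministic mapping of clients onto unique worker combinations.
import hashlib
from itertools import combinations
from math import comb

print(comb(8, 2))     # 28 unique 2-worker shards from 8 workers
print(comb(60, 3))    # 34,220 unique 3-worker shards from 60 workers
print(comb(2048, 4))  # ~731 billion, matching the "1 in 730 billion" figure

def shuffle_shard(client_id, workers, shard_size):
    """Hash a client onto one combination of workers (toy assignment)."""
    combos = list(combinations(sorted(workers), shard_size))
    digest = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    return combos[digest % len(combos)]

workers = [f"worker-{i}" for i in range(8)]
print(shuffle_shard("cat", workers, 2))
# A real assignment would also guarantee no two clients get the exact
# same combination; plain hashing can collide.
print(shuffle_shard("rook", workers, 2))
```
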
  • 00:43:12
    The math gets kinda crazy at that point,
  • 00:43:15
    but getting back to CPP.
  • 00:43:16
    All right, so CPP has its workers.
  • 00:43:19
    Its workers are tasked with processing these workloads
  • 00:43:22
    of 30 machine learning models at a time.
  • 00:43:25
    There's over 10,000 machine learning models
  • 00:43:28
    and we need to process them in batches of 30.
  • 00:43:31
    So that's 400 total groupings, 400 shards we're gonna need,
  • 00:43:35
    because a given shard we want to be processing
  • 00:43:38
    given machine learning models and not swapping them out.
  • 00:43:40
    So we're gonna need 400 shards.
  • 00:43:41
    So if there are 60 workers, we can do shuffle sharding.
  • 00:43:45
    The blue shard, the green shard, and the orange shard,
  • 00:43:49
    can you see they share workers with each other,
  • 00:43:52
    but each one's a unique combination
  • 00:43:54
    of three in this case.
  • 00:43:57
    So what happens if we have that poison pill incident
  • 00:43:59
    where the orange shard goes down
  • 00:44:02
    because those 30 models were somehow corrupt
  • 00:44:04
    and something happens and that shard is poisoned?
  • 00:44:08
    If we had no shuffle sharding, just standard sharding,
  • 00:44:11
    we took our 400 machine learning groups
  • 00:44:13
    and distributed 'em over 20 shards,
  • 00:44:15
    'cause if we take 60 divided by 3, that's 20,
  • 00:44:17
    then 20 of those machine learning groups would be affected,
  • 00:44:20
    but if we use shuffle sharding, we can create 400 shards.
  • 00:44:23
    So 400 groups of machine learning models, 400 shards.
  • 00:44:27
    One of them gets poisoned, then only that one is affected,
  • 00:44:30
    and again, to remind you,
  • 00:44:31
    it's the same case as we saw with the cat.
  • 00:44:33
    It's because each one has its own unique group
  • 00:44:35
    of three workers, and just to go crazy,
  • 00:44:38
    actually you could create a lot more than 400 shards.
  • 00:44:41
    60 choose 3 is over 34,000.
  • 00:44:45
    All right, and to bring this home,
  • 00:44:48
    the other thing they're doing is
  • 00:44:49
    to implement loosely coupled dependencies.
  • 00:44:51
    Let me show you how that works. So remember the router?
  • 00:44:54
    All the smarts are in that Lambda there.
  • 00:44:55
    It decides which shard.
  • 00:44:57
    So remember, before I said it picks a worker.
  • 00:44:58
    The Lambda actually decides which shard
  • 00:45:00
    it's gonna send the request to.
  • 00:45:05
    Remember, it's putting things on an SQS queue
  • 00:45:07
    which are then being picked up by the worker.
  • 00:45:09
    So that Lambda is actually monitoring those queues.
  • 00:45:11
    It's actually looking at the age of the oldest message.
  • 00:45:14
    If the age of the oldest message is pretty old,
  • 00:45:16
    it probably means that queue is pretty slow and congested.
  • 00:45:19
    So it's actually using back-pressure
  • 00:45:21
    to decide which worker inside a shard it's gonna call.
  • 00:45:26
    So with a shard of three,
  • 00:45:27
    it can choose the worker that's the least busy.
  • 00:45:29
    So you can see there the middle one's the least busy
  • 00:45:31
    so that's the one it chooses.
  • 00:45:32
    So that Lambda is not just a router.
  • 00:45:34
    It's also a load balancer using back-pressure
  • 00:45:38
    to route among those workers in a shard.
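
Here's a minimal sketch of that back-pressure check, assuming the router reads each worker queue's ApproximateAgeOfOldestMessage metric from CloudWatch; the queue names and the five-minute lookback window are illustrative.

```python
# A sketch of the router's back-pressure check: pick the worker queue in
# a shard with the smallest ApproximateAgeOfOldestMessage. Queue names
# and the five-minute lookback window are illustrative.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

def oldest_message_age(queue_name):
    """Backlog age in seconds for a queue (0 if no datapoints yet)."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Maximum"],
    )
    points = resp["Datapoints"]
    return max(p["Maximum"] for p in points) if points else 0.0

def pick_worker_queue(shard_queues):
    """Route the batch to the least congested worker in the shard."""
    return min(shard_queues, key=oldest_message_age)

print(pick_worker_queue(["cpp-worker-12", "cpp-worker-27", "cpp-worker-41"]))
```
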
  • 00:45:42
    Now, what if the load is too high?
  • 00:45:44
    What if there's a spike and all of the workers,
  • 00:45:46
    all three workers in the shard are overloaded?
  • 00:45:48
    This is where load shedding comes in.
  • 00:45:50
    It'll send the request to a load shedding queue
  • 00:45:53
    and come back to it after 15 minutes.
  • 00:45:55
    Why 15 minutes? Well, those ECS services are auto-scaling.
  • 00:45:59
    They're based on CPU levels.
  • 00:46:01
    So if there really is a spike going on,
  • 00:46:03
    those ECS services are gonna see elevated CPU
  • 00:46:06
    and they're gonna scale out.
  • 00:46:07
    So 15 minutes later, we'll come back,
  • 00:46:09
    reprocess that request and it should work at that point.
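
Notably, 15 minutes is exactly the SQS maximum of DelaySeconds=900, so a delay queue can implement the shedding directly. A sketch, assuming the backlog ages come from a check like the one above; the queue URL and congestion threshold are placeholders:

```python
# A sketch of the load-shedding path: if every worker queue in the shard
# is congested, park the request on a shedding queue with the SQS-maximum
# 900-second delay, enough time for CPU-based ECS auto scaling to react.
# The queue URL and congestion threshold are assumptions.
import json
import boto3

sqs = boto3.client("sqs")
SHED_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cpp-shed"
CONGESTED_AGE_S = 120  # treat a queue as overloaded past this backlog age

def route_or_shed(request, queue_ages):
    """queue_ages maps each worker queue URL in the shard to its backlog
    age in seconds, e.g. gathered as in the previous sketch."""
    if all(age > CONGESTED_AGE_S for age in queue_ages.values()):
        sqs.send_message(
            QueueUrl=SHED_QUEUE_URL,
            MessageBody=json.dumps(request),
            DelaySeconds=900,  # 15 minutes: retry after scale-out
        )
        return "shed"
    # Otherwise, route to the least congested worker in the shard.
    best_queue = min(queue_ages, key=queue_ages.get)
    sqs.send_message(QueueUrl=best_queue, MessageBody=json.dumps(request))
    return best_queue
```
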
  • 00:46:16
    All right, this is actually our last example of the day.
  • 00:46:18
    It's about Amazon Search
  • 00:46:20
    and how they're using chaos engineering
  • 00:46:21
    to be ready for Prime Day, any day.
  • 00:46:25
    Amazon Search, I think you've all seen it.
  • 00:46:26
    You have probably all seen the search bar here.
  • 00:46:28
    Why don't we search for chaos engineering
  • 00:46:30
    and see what we get?
  • 00:46:33
    All right, over 1,000 results.
  • 00:46:35
    Okay, the top result there is the "Chaos Engineering" book
  • 00:46:38
    by Casey and Nora.
  • 00:46:39
    That's sort of the chaos engineering bible.
  • 00:46:41
    So that's a good result.
  • 00:46:42
    I also want you to notice the SLO,
  • 00:46:49
    the service level objective book there with the doggy on it
  • 00:46:51
    'cause that's gonna be important too,
  • 00:46:53
    and we're talking about scale here.
  • 00:46:54
    So we're talking about millions of products.
  • 00:46:56
    We're talking about 300 million active users.
  • 00:46:59
    We're talking about last Prime Day,
  • 00:47:01
    84,000 requests per second peak during Prime Day.
  • 00:47:04
    Again, the whole point is to show you
  • 00:47:05
    the scale of these services and how they're using AWS
  • 00:47:08
    to meet the need of that scale,
  • 00:47:10
    and Search, like everything else I showed you,
  • 00:47:12
    consists of multiple backend services
  • 00:47:14
    and using multiple AWS resources,
  • 00:47:17
    and what I really like about the Search team
  • 00:47:19
    is they have their own resilience team.
  • 00:47:21
    So they have a built-in team dedicated to resilience
  • 00:47:23
    doing operational resilience
  • 00:47:25
    and site reliability engineering
  • 00:47:27
    for the Search org across those 40 services,
  • 00:47:30
    and their main goal, their main motto is,
  • 00:47:32
    "We test, improve, and drive the resilience
  • 00:47:34
    of Amazon Search services."
  • 00:47:36
    How do they do that?
  • 00:47:36
    They do that by promoting resilience initiatives,
  • 00:47:38
    helping with load testing and helping to promote
  • 00:47:42
    and orchestrate chaos engineering,
  • 00:47:44
    and that's the part I want to talk about.
  • 00:47:47
    So the best practice in this case is use chaos engineering
  • 00:47:50
    to test your workload, to test your resilience.
  • 00:47:54
    So what is chaos engineering?
  • 00:47:56
    I'm gonna read a slide for you.
  • 00:47:57
    "Chaos engineering is the discipline of experimenting
  • 00:47:59
    on a system in order to build confidence
  • 00:48:01
    in the system's capability to withstand
  • 00:48:03
    turbulent conditions in production."
  • 00:48:05
    Turbulent conditions in production.
  • 00:48:06
    I think we could all identify with that,
  • 00:48:08
    unusual user activity, network issues,
  • 00:48:12
    infrastructure issues, bad deployments.
  • 00:48:15
    I mean, it's a mess out there
  • 00:48:17
    and we need to be resilient to that.
  • 00:48:19
    So the thing to know about chaos engineering,
  • 00:48:21
    it's not about creating chaos.
  • 00:48:23
    It's about acknowledging the chaos that already exists
  • 00:48:26
    and preparing for it and mitigating it
  • 00:48:29
    and avoiding the impact of that chaos.
  • 00:48:31
    So that's the way you gotta be thinking
  • 00:48:32
    about chaos engineering.
  • 00:48:34
    So how do you do chaos engineering?
  • 00:48:36
    This is a one-slide summary of how to do chaos engineering.
  • 00:48:39
    Chaos engineering is ultimately at its core
  • 00:48:40
    a scientific method.
  • 00:48:42
    This is a circular cycle,
  • 00:48:44
    but I'm gonna start with steady state.
  • 00:48:46
    What the heck is steady state?
  • 00:48:47
    Steady state means your workload, the workload under test
  • 00:48:49
    is operating within design parameters,
  • 00:48:51
    and you have to be able to measure that.
  • 00:48:53
    You have to be able to assign metrics to say
  • 00:48:54
    what does it mean to operate within design parameters.
  • 00:48:57
    Then is the hypothesis.
  • 00:48:59
    The hypothesis is if some bad thing happens,
  • 00:49:02
    and you specify the bad thing, if an EC2 instance dies,
  • 00:49:05
    if an Availability Zone is not available,
  • 00:49:07
    if a network link goes out, then my system,
  • 00:49:11
    because I designed it that way, will maintain steady state.
  • 00:49:15
    It will stay within those operational parameters.
  • 00:49:17
    Now, if you didn't design it that way,
  • 00:49:18
    don't do the chaos engineering,
  • 00:49:20
    but if you designed it that way, you're testing that.
  • 00:49:22
    So you run the experiment. You simulate that EC2 failure.
  • 00:49:25
    You simulate that network link outage,
  • 00:49:27
    and then you validate.
  • 00:49:28
    You verify was the hypothesis confirmed.
  • 00:49:32
    If the hypothesis was not confirmed, oh, okay.
  • 00:49:34
    We experienced some sort of outage.
  • 00:49:36
    We went outside of the established parameters.
  • 00:49:38
    We did not maintain steady state. You need to improve.
  • 00:49:41
    You improve by redesigning,
  • 00:49:43
    applying the best practices in the reliability pillar,
  • 00:49:46
    and then you test it again.
  • 00:49:47
    You run the experiment again.
  • 00:49:49
    Oh, now the hypothesis is confirmed
  • 00:49:51
    and we're back to steady state
  • 00:49:52
    and the whole thing repeats all over again.
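
As a rough sketch of that experiment loop against the Fault Injection Simulator API, with a hypothetical template ID and a hypothetical p99-latency alarm standing in for the steady-state measurement:

```python
# A sketch of the chaos experiment loop: run an FIS experiment, then
# validate the hypothesis by checking a steady-state metric. The template
# ID and alarm name are hypothetical.
import time
import boto3

fis = boto3.client("fis")
cloudwatch = boto3.client("cloudwatch")

def run_experiment(template_id):
    """Start an FIS experiment and poll until it reaches a final state."""
    exp = fis.start_experiment(experimentTemplateId=template_id)
    exp_id = exp["experiment"]["id"]
    while True:
        status = fis.get_experiment(id=exp_id)["experiment"]["state"]["status"]
        if status in ("completed", "stopped", "failed"):
            return status
        time.sleep(30)

def hypothesis_holds():
    """Steady state here: the (hypothetical) p99 latency alarm never fired."""
    alarms = cloudwatch.describe_alarms(AlarmNames=["search-p99-latency"])
    return alarms["MetricAlarms"][0]["StateValue"] == "OK"

status = run_experiment("EXT1a2b3c4d5e")  # hypothetical template ID
if status == "completed" and hypothesis_holds():
    print("hypothesis confirmed: steady state maintained")
else:
    print("hypothesis not confirmed: improve, then run the experiment again")
```
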
  • 00:49:56
    So service level objectives,
  • 00:49:57
    I told you this would come up again, so here it is.
  • 00:50:00
    This is an example service level objective.
  • 00:50:01
    This is not one they actually use.
  • 00:50:04
    They didn't really wanna share those,
  • 00:50:05
    but they did want to share the format of it.
  • 00:50:07
    So this is the format of it.
  • 00:50:08
    In a 28-day trailing window, we'll see 99.9% of requests
  • 00:50:12
    with a latency of less than one second.
  • 00:50:14
    That's an example of a service level objective
  • 00:50:17
    that might be used by the Search team,
  • 00:50:19
    and with this service level objective,
  • 00:50:20
    we've established something called the error budget.
  • 00:50:22
    So what's the error budget?
  • 00:50:23
    Well, 99.9% means that .1% can be greater than a second.
  • 00:50:29
    So that's the start of our budget. That's our budget.
  • 00:50:31
    However, with every request that exceeds one second,
  • 00:50:36
    we're consuming that budget.
  • 00:50:38
    Eventually that whole thing will be consumed
  • 00:50:39
    and we'll be out of budget and you can actually look at
  • 00:50:42
    how fast that budget's being burned.
  • 00:50:43
    It's called the burn rate, but there's good news.
  • 00:50:46
    There's a 28-day trailing window.
  • 00:50:48
    So that means the oldest failures,
  • 00:50:51
    the oldest requests that are greater than a second
  • 00:50:53
    will eventually time out,
  • 00:50:54
    will eventually age out, I should say,
  • 00:50:56
    be older than 28 days and your budget replenishes.
  • 00:50:59
    So that's the concept of the error budget.
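
The error budget arithmetic is simple enough to sketch; the request counts below are invented purely for illustration:

```python
# Error budget math for the example SLO: 99.9% of requests under one
# second over a 28-day trailing window. Request counts are invented.
SLO_TARGET = 0.999
requests_in_window = 50_000_000  # total requests in the trailing 28 days
slow_requests = 30_000           # requests that took longer than 1 second

budget = requests_in_window * (1 - SLO_TARGET)  # 50,000 slow requests allowed
consumed = slow_requests / budget               # fraction of budget used

# Burn rate: observed error rate relative to the allowed rate. At 1.0 the
# budget lasts exactly the window; above 1.0 it runs out early, which is
# a stop condition for the chaos experiments.
burn_rate = (slow_requests / requests_in_window) / (1 - SLO_TARGET)

print(f"budget: {budget:,.0f} slow requests")
print(f"consumed: {consumed:.0%}, burn rate: {burn_rate:.2f}")
```
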
  • 00:51:03
    So they wanna do customer-obsessed chaos engineering.
  • 00:51:06
    Chaos engineering is not for the engineering teams.
  • 00:51:09
    It's not for the developers.
  • 00:51:10
    It's so we can establish an experience for our customers
  • 00:51:14
    that's gonna serve their needs,
  • 00:51:15
    and they thought SLO was the best way to do that.
  • 00:51:17
    It's very customer focused.
  • 00:51:18
    It's focused on what the customers experience
  • 00:51:20
    and so the experiments must stay within the error budget,
  • 00:51:25
    and the stop conditions for an experiment,
  • 00:51:27
    you must always have stop conditions
  • 00:51:28
    on your chaos engineering experiments,
  • 00:51:30
    are if the burn rate is too high on the error budget,
  • 00:51:34
    the experiment stops.
  • 00:51:36
    If the Andon cord is pulled.
  • 00:51:37
    So the Andon cord goes back to the Toyota factories
  • 00:51:40
    where they had an actual cord
  • 00:51:42
    that anybody on the assembly line could pull
  • 00:51:45
    if they saw a quality issue.
  • 00:51:46
    Same thing here.
  • 00:51:48
    Several people across the org can push this button
  • 00:51:50
    and will stop and roll back any experiment at any time,
  • 00:51:53
    and then the last thing is
  • 00:51:54
    if there's some kind of event
  • 00:51:56
    happening across Amazon IT, then that's not a good time
  • 00:51:59
    to be doing your chaos engineering,
  • 00:52:00
    so let's stop it and roll it back then too,
  • 00:52:04
    and this is what they designed.
  • 00:52:06
    We're here to talk about architecture.
  • 00:52:07
    So on the right, I just wanna point out it's all centered
  • 00:52:10
    on Fault Injection Simulator.
  • 00:52:11
    Fault Injection Simulator is an AWS service
  • 00:52:14
    that you can use to run chaos experiments
  • 00:52:18
    and they did build around that.
  • 00:52:20
    So on the far right, you can see ECS and EC2.
  • 00:52:23
    That's the search services.
  • 00:52:25
    Remember, there's 40-plus search services.
  • 00:52:27
    So they're using Fault Injection Simulator
  • 00:52:29
    to do chaos engineering on those services.
  • 00:52:32
    What they built was the part on the left.
  • 00:52:33
    That's the orchestration piece.
  • 00:52:35
    Okay, now follow me down to the API Gateway
  • 00:52:38
    in the lower left-hand corner.
  • 00:52:39
    You can see two APIs.
  • 00:52:40
    The first one's the Andon API and the Andon API
  • 00:52:44
    establishes and configures the Andon cords.
  • 00:52:46
    It does this by setting up various CloudWatch alarms
  • 00:52:49
    that FIS will respond to.
  • 00:52:51
    FIS has guardrails.
  • 00:52:52
    Remember, I said a good experiment has to have a guardrail.
  • 00:52:56
    So FIS has guardrails based on CloudWatch
  • 00:52:57
    and so when someone pulls the Andon cord,
  • 00:53:00
    it sets the CloudWatch alarm
  • 00:53:01
    which then stops the FIS experiment.
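
A sketch of that wiring: an FIS experiment template can list a CloudWatch alarm as a stop condition, so pulling the cord can be as simple as forcing that alarm into the ALARM state. The alarm name here is a placeholder:

```python
# A sketch of the Andon cord: the FIS experiment template registers a
# CloudWatch alarm as a stop condition, and "pulling the cord" forces
# that alarm into ALARM, which halts the experiment.
import boto3

cloudwatch = boto3.client("cloudwatch")

# In the FIS experiment template, the alarm is the guardrail, e.g.:
#   stopConditions=[{"source": "aws:cloudwatch:alarm",
#                    "value": "<alarm ARN for search-chaos-andon-cord>"}]
ANDON_ALARM = "search-chaos-andon-cord"  # hypothetical alarm name

def pull_andon_cord(reason):
    """Force the guardrail alarm into ALARM; FIS stops the experiment."""
    cloudwatch.set_alarm_state(
        AlarmName=ANDON_ALARM,
        StateValue="ALARM",
        StateReason=f"Andon cord pulled: {reason}",
    )

pull_andon_cord("operator observed elevated customer-facing errors")
```
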
  • 00:53:03
    Okay, the other API is the run API.
  • 00:53:05
    It has the ability to run an experiment,
  • 00:53:08
    to schedule it for later,
  • 00:53:10
    so there's a Lambda there that's a scheduler
  • 00:53:12
    that can store schedules in DynamoDB,
  • 00:53:14
    and it provides orchestration.
  • 00:53:16
    You see on the right there those three Lambdas.
  • 00:53:19
    So not only can you run the experiment,
  • 00:53:21
    but it gives you the ability to do things
  • 00:53:22
    before the experiment.
  • 00:53:23
    What might you wanna do before an experiment?
  • 00:53:24
    You might wanna send out an alert to various personnel.
  • 00:53:27
    You might wanna stop any in-process deployments.
  • 00:53:31
    So there's various things you might wanna do
  • 00:53:32
    before an experiment, then run the experiment using FIS,
  • 00:53:35
    and then do post-experiment operations,
  • 00:53:37
    like for instance, cleaning things up,
  • 00:53:39
    'cause especially with some experiments,
  • 00:53:42
    the fault injections aren't self-correcting,
  • 00:53:44
    so then you actually have to go in and correct them,
  • 00:53:46
    so it might do something like that,
  • 00:53:49
    and so FIS exists, it's a great service,
  • 00:53:52
    so why did they build this orchestration piece?
  • 00:53:56
    This is why.
  • 00:53:57
    Number one is they're serving 40-plus teams.
  • 00:53:58
    They wanna provide a single pane of glass,
  • 00:54:01
    a consistent experience across those teams
  • 00:54:02
    and make it super easy for them to do chaos engineering.
  • 00:54:07
    They also wanted to add the ability to do scheduling,
  • 00:54:09
    to be able to run it with deployments,
  • 00:54:11
    which FIS can do, but remember,
  • 00:54:13
    all these 40 services are using a pipeline system in common
  • 00:54:16
    so the orchestration is able to design around that
  • 00:54:19
    and make it super easy to run it with deployments,
  • 00:54:22
    provide it consistent guardrails.
  • 00:54:23
    Remember, the SLOs are the important guardrail.
  • 00:54:26
    So they actually have as part of their system
  • 00:54:28
    storage of all the various SLOs
  • 00:54:30
    so that it uses that during experimentation
  • 00:54:32
    to provide a guardrail.
  • 00:54:34
    The Andon cord functionality is not natively part of FIS,
  • 00:54:37
    so they're providing that, and metrics and insights.
  • 00:54:39
    Of course FIS emits metrics,
  • 00:54:42
    but now they could roll up all the metrics
  • 00:54:44
    from all 40 services and provide them
  • 00:54:46
    as a single report to management
  • 00:54:48
    about what kinda chaos engineering they're doing,
  • 00:54:52
    and plus, in addition to FIS,
  • 00:54:53
    they wanna be able to run other kinds of faults.
  • 00:54:55
    Let's talk about that. Oh, well, no.
  • 00:54:57
    Before we do, why did they do all this?
  • 00:54:59
    Why did they build the orchestrator?
  • 00:55:00
    'Cause they wanna be ready for Prime Day, any day.
  • 00:55:03
    All right, type of faults. First, there is the FIS faults.
  • 00:55:06
    These are all supported by FIS,
  • 00:55:07
    things like dropping ECS nodes, killing EC2 instances,
  • 00:55:12
    injecting latency, doing various things.
  • 00:55:17
    SSM, or Systems Manager, lets you run
  • 00:55:19
    any kind of automation you want,
  • 00:55:20
    so you can maybe even simulate an Availability Zone outage,
  • 00:55:23
    but what kind of faults are they doing that's not FIS?
  • 00:55:26
    Well, there's load testing because, actually,
  • 00:55:29
    internal to Amazon, across Amazon teams,
  • 00:55:30
    there's a very popular load test tool.
  • 00:55:32
    They wanna be able to use that
  • 00:55:33
    as part of their experimentation,
  • 00:55:35
    and there's emergency levers.
  • 00:55:36
    So emergency levers are things you can do
  • 00:55:40
    as an operator of a service to help a service under duress.
  • 00:55:43
    For a service under duress, you pull the emergency lever
  • 00:55:46
    and now the service can operate well,
  • 00:55:48
    so for instance, blocking all robots,
  • 00:55:51
    and ultimately, I'm running a little low on time,
  • 00:55:53
    so I'm gonna speed through this, ultimately,
  • 00:55:55
    they wanna provide a benefit to the end user.
  • 00:55:56
    I just wanna point out that it's about higher availability,
  • 00:55:58
    improved resiliency for the end customer,
  • 00:56:01
    and so I wanna talk about graceful degradation
  • 00:56:03
    and emergency levers.
  • 00:56:06
    What does the emergency lever look like for Search?
  • 00:56:07
    Well, with one of their emergency levers,
  • 00:56:10
    actually, they pull the lever
  • 00:56:12
    and it causes graceful degradation on purpose.
  • 00:56:14
    So this is what Search looks like,
  • 00:56:16
    a full Search experience if I'm searching for Lego,
  • 00:56:18
    but if I pull the emergency lever,
  • 00:56:21
    it'll turn off non-critical services.
  • 00:56:23
    So critical services like the image,
  • 00:56:25
    the title, the price are all still there,
  • 00:56:27
    but non-critical services like the reviews
  • 00:56:29
    or the age range are not there.
  • 00:56:31
    So for a system under duress, this can help,
  • 00:56:34
    and they test this using chaos engineering.
  • 00:56:37
    The hypothesis is the lever works
  • 00:56:40
    and it enables Search to handle the stress.
  • 00:56:42
    So they literally generate load during the test
  • 00:56:45
    and then pull the lever and validate that,
  • 00:56:48
    yes, the system's able to handle the duress.
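
A toy version of such a lever, with hypothetical widget names and an in-memory flag standing in for whatever shared configuration store the real lever uses:

```python
# A toy emergency lever: when pulled, the renderer drops non-critical
# widgets. Widget names and the in-memory flag are illustrative; a real
# lever would live in a shared, operator-controlled config store.
CRITICAL = ["image", "title", "price"]
NON_CRITICAL = ["reviews", "age_range", "recommendations"]

class EmergencyLever:
    def __init__(self):
        self.pulled = False

def render_search_result(item, lever):
    """Render critical widgets always; non-critical only when healthy."""
    widgets = list(CRITICAL) if lever.pulled else CRITICAL + NON_CRITICAL
    return {name: item.get(name) for name in widgets}

lever = EmergencyLever()
item = {"image": "lego.jpg", "title": "LEGO set", "price": "$49.99",
        "reviews": "4.8 stars", "age_range": "Ages 6+"}
print(render_search_result(item, lever))  # full experience
lever.pulled = True
print(render_search_result(item, lever))  # degraded on purpose, still usable
```
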
  • 00:56:51
    All right, so in summary,
  • 00:56:52
    these are the five services I covered.
  • 00:56:55
    This part's important to me.
  • 00:56:57
    Okay, so to get all this information
  • 00:56:58
    so I could share it with you, and I hope you enjoyed it,
  • 00:57:01
    I had to work with many smart engineers on multiple teams,
  • 00:57:04
    and the thing about smart engineers
  • 00:57:06
    working on cool stuff is that they're really busy,
  • 00:57:09
    and they took time out to spend with me
  • 00:57:11
    to explain this to me so I could share it with you,
  • 00:57:13
    so my deepest appreciation to those engineers
  • 00:57:16
    and my awe at the engineering that they did.
  • 00:57:18
    I really am impressed by it.
  • 00:57:19
    Hopefully you're impressed by it too.
  • 00:57:22
    Some resources. I won't spend too much time on here.
  • 00:57:24
    You wanna take a snap of that real quick?
  • 00:57:26
    Upcoming talks that might cover the things we talked about,
  • 00:57:29
    and also, two of the examples I covered
  • 00:57:31
    actually have some external resources
  • 00:57:33
    you can check out if you want to learn more.
Tags
  • AWS
  • Scalability
  • Reliability
  • Well-Architected Framework
  • Chaos Engineering
  • Microservices
  • Automation
  • Cloud Computing
  • Amazon
  • Architecture