AWS re:Invent 2022 - Reliable scalability: How Amazon.com scales in the cloud (ARC206)
Summary
TL;DR: In this session, Seth Eliot discusses how Amazon.com achieves reliable scalability on AWS, detailing the evolution of its architecture from a simple setup in 1995 to a complex microservices architecture today. He emphasizes the importance of scalability, reliability, and the Well-Architected Framework, which includes best practices for building in the cloud. The session features examples such as IMDb's transition to serverless microservices and Global Ops Robotics' cell-based architecture, highlighting the use of automation, chaos engineering, and other strategies to ensure systems can handle high traffic and maintain performance during peak times.
Takeaways
- 👋 Welcome and introduction to reliable scalability.
- 📜 History of Amazon's architecture evolution.
- ⚙️ Importance of scalability for handling increased loads.
- 🔧 Overview of the Well-Architected Framework.
- 📈 IMDb's transition to serverless microservices.
- 🏭 Global Ops Robotics and cell-based architecture.
- 🚚 Amazon Relay app for truck management.
- 🔍 Chaos engineering for testing resilience.
- 📊 Best practices for reliability in cloud services.
- 🤝 Acknowledgment of the engineers behind the systems.
Timeline
- 00:00:00 - 00:05:00
The session introduces the topic of reliable scalability and how Amazon.com utilizes AWS for cloud scalability. Seth Eliot, the speaker, shares his background and experience with Amazon, including his work on the .com side and AWS. He presents a historical overview of Amazon's architecture, starting from its early days in 1995, highlighting the evolution of its systems to meet the demand for scalability.
- 00:05:00 - 00:10:00
The need for scalability is emphasized, defined as the ability of a workload to perform its function as the load changes. The architecture evolved from a single server and database to a service-oriented architecture, allowing for more agile development and deployment. The talk transitions to the current architecture, which consists of tens of thousands of microservices interconnected through various dependencies.
- 00:10:00 - 00:15:00
The focus shifts to the Well-Architected Framework, particularly the reliability pillar, which includes best practices for building in the cloud. The speaker introduces the first example of IMDb's transition to serverless microservices, explaining how they moved from a monolithic architecture to a federated schema with microservices, improving scalability and reliability.
- 00:15:00 - 00:20:00
The architecture of IMDb is discussed, showcasing a gateway-based architecture that connects various backend microservices. The importance of the two-pizza team model is highlighted, emphasizing ownership and accountability within teams, leading to smoother operations and better on-call experiences.
- 00:20:00 - 00:25:00
The next best practice discussed is automation in resource scaling. The speaker explains how AWS Lambda functions automatically scale based on requests, and IMDb implemented provisioned concurrency to avoid cold starts, ensuring a seamless user experience during peak loads.
- 00:25:00 - 00:30:00
The architecture of IMDb's API gateway is examined, detailing the use of a web application firewall (WAF) and a content delivery network (CDN) to enhance security and performance. The speaker emphasizes the importance of using highly available public endpoints to reduce high-severity issues caused by bots.
- 00:30:00 - 00:35:00
The session transitions to Global Ops Robotics, focusing on warehouse management and the use of a cell-based architecture to protect workloads. The concept of bulkhead architecture is introduced, explaining how compartmentalization helps contain failures and maintain operational continuity across fulfillment centers.
- 00:35:00 - 00:40:00
The discussion moves to Amazon Relay, which manages the middle mile of the supply chain. The speaker explains how they use multi-region deployments to ensure truck operations continue smoothly, even during service disruptions. The importance of graceful degradation and failover strategies is emphasized to maintain critical functions during outages.
- 00:40:00 - 00:45:00
The Classification and Policies Platform is introduced, showcasing how Amazon classifies millions of items using machine learning models. The concept of shuffle sharding is explained, demonstrating how it limits the blast radius of failures by assigning unique worker pairs to clients, enhancing reliability and scalability.
- 00:45:00 - 00:57:38
The final example focuses on Amazon Search and the implementation of chaos engineering to test system resilience. The speaker outlines the chaos engineering process, emphasizing the importance of steady state, hypothesis testing, and the use of service level objectives (SLOs) to ensure a positive customer experience during turbulent conditions.
Video Q&A
What is the main focus of the session?
The session focuses on how Amazon.com achieves reliable scalability using AWS.
Who is the speaker?
The speaker is Seth Eliot, a principal developer advocate at AWS.
What is the Well-Architected Framework?
The Well-Architected Framework consists of best practices for building in the cloud, focusing on reliability among other pillars.
What is chaos engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
What is the significance of scalability for Amazon?
Scalability allows Amazon to handle increased loads and maintain performance as the scope of their operations changes.
What architectural changes did IMDb implement?
IMDb transitioned from a monolithic architecture to serverless microservices using AWS Lambda.
What is a two-pizza team?
A two-pizza team is a small, cross-functional team at Amazon that can be fed with two pizzas, emphasizing ownership and agility.
How does Amazon ensure reliability in its services?
Amazon uses best practices from the Well-Architected Framework, including automation, bulkhead architectures, and chaos engineering.
What is the role of the Global Ops Robotics team?
The Global Ops Robotics team manages warehouse operations and uses a cell-based architecture for reliability.
What is the purpose of the Relay app?
The Relay app helps truck drivers manage their loads and routes in Amazon's middle mile logistics.
- 00:00:00- Hello. Welcome, everyone.
- 00:00:01Thank you so much for choosing my session.
- 00:00:03I really appreciate you being here.
- 00:00:04You're here of course for reliable scalability,
- 00:00:08how amazon.com runs on AWS
- 00:00:11and how it scales in the cloud on AWS,
- 00:00:13and we're gonna talk a lot about examples of how amazon.com,
- 00:00:17a large, sophisticated customer, uses AWS.
- 00:00:21So my name is Seth Eliot.
- 00:00:22I am currently a developer advocate,
- 00:00:24principal developer advocate for developer relations,
- 00:00:27just that's a recent change for me.
- 00:00:29Prior to that, I was the reliability lead
- 00:00:31for AWS Well-Architected,
- 00:00:32and Well-Architected's gonna play a big part
- 00:00:34in the talk today, but even before that,
- 00:00:36I actually worked for amazon.com.
- 00:00:38So I joined Amazon back in 2005
- 00:00:40and was working on the .com side before moving to AWS.
- 00:00:45So I always like to start off
- 00:00:46with a bit of a history lesson.
- 00:00:47Now, I wasn't there in 1995,
- 00:00:49but this is what the website looked like in 1995.
- 00:00:52Take it in in all its glory.
- 00:00:55Quite amazing for the time period, actually,
- 00:00:57and this is the architecture used
- 00:01:00to run that website you just saw.
- 00:01:01So I want to draw your attention
- 00:01:03to the box that says Obidos.
- 00:01:03Obidos is the place in Brazil
- 00:01:06where the Amazon River is at its narrowest and swiftest,
- 00:01:09and back in those days, they named a lotta things
- 00:01:12after places in Brazil and things on the Amazon River,
- 00:01:14and that is the executable.
- 00:01:17That is a single C, not C++,
- 00:01:20but C binary running on a single server
- 00:01:23talking to a single Oracle database
- 00:01:26running on another server called ACB,
- 00:01:28for amazon.com books, that had all the data in it,
- 00:01:31and that essentially was the architecture.
- 00:01:33You could see there's CC motel. That's a credit card system.
- 00:01:35That was separate so that we could have limited access
- 00:01:38to that so that the credit card numbers could be secure,
- 00:01:40and there's a distribution center,
- 00:01:42later renamed fulfillment centers,
- 00:01:43from which your package would be shipped and show up to you.
- 00:01:46So that's the original architecture.
- 00:01:48Now, the motto of Amazon, especially back in those days,
- 00:01:51is get big fast, and you can see that there's a T-shirt
- 00:01:55from one of the picnics about get big fast,
- 00:01:57and to get big fast, you're gonna need scalability.
- 00:02:00So what is scalability?
- 00:02:01Well, scalability is the ability of a workload
- 00:02:03to perform its agreed function as the scope changes,
- 00:02:06as the load or scope changes.
- 00:02:08So to get there, they had to evolve the architecture.
- 00:02:12So the first thing they looked at was the databases.
- 00:02:14You could see they pulled out this Web database there.
- 00:02:16So that Web database interacts with the customer,
- 00:02:19does the ordering, and then asynchronously syncs
- 00:02:21back to the ACB database periodically.
- 00:02:24Similarly, we've added a new distribution center
- 00:02:26and they each get their own databases too.
- 00:02:28So this is one way to remove one of the big bottlenecks,
- 00:02:31which was the database, but that wasn't enough.
- 00:02:34So let's fast forward to 2000
- 00:02:36and talk about a service-oriented architecture.
- 00:02:39Having a single binary,
- 00:02:41it eventually did become C++, like the original,
- 00:02:44the first engineer at Amazon insisted it stay C,
- 00:02:47but he couldn't control it after a time
- 00:02:49and eventually C++ libraries got into it,
- 00:02:51but still, it was a single binary.
- 00:02:53So if you wanted to make a change,
- 00:02:54so let's say you were in charge of implementing
- 00:02:56one-click ordering, one-click purchase,
- 00:02:59you would have to make your change to that binary
- 00:03:02and everybody else is making changes to that binary
- 00:03:04and you're building along with everybody else
- 00:03:06and you're deploying along with everybody else
- 00:03:08and it's just not a very agile system.
- 00:03:10If somebody else breaks the build,
- 00:03:11you're not deploying today, so that's not great,
- 00:03:14so what can we do?
- 00:03:15Well, in addition to splitting out the databases,
- 00:03:17you can see the customer data got pulled out of ACB
- 00:03:20and you don't wanna be calling the database directly,
- 00:03:22so you're gonna put a service in front of it,
- 00:03:23the customer service, and that customer service
- 00:03:25was originally just for select
- 00:03:28and insert onto that database,
- 00:03:30but it became the location for business logic on customers.
- 00:03:33Similarly, there became an order service and an item service
- 00:03:36and this is the first service-oriented architecture
- 00:03:39at Amazon.
- 00:03:41Now, get big fast. Now let's fast forward to the present.
- 00:03:44The previous Prime Day, Amazon did get big.
- 00:03:47We all know Amazon's big, 100,000 items per minute,
- 00:03:5012 billion in sales as of the last Prime Day,
- 00:03:53but you're not here to learn about that.
- 00:03:54You're here to learn about how they're using AWS, right?
- 00:03:57And so I won't read the numbers off of here.
- 00:03:59There's obviously billions and trillions and millions.
- 00:04:02Go ahead and read them.
- 00:04:03It just shows that Amazon did get big fast
- 00:04:06and they're doing it using AWS,
- 00:04:09and they're using AWS to be able to scale
- 00:04:12and scale reliably.
- 00:04:14So if you fast forward to wanna know
- 00:04:16what the architecture looks like today,
- 00:04:17it looks appreciably like it looked back in 2000.
- 00:04:21Anybody believe me on that? No.
- 00:04:24See if you're paying attention. No.
- 00:04:25Okay, so this is actually closer
- 00:04:27to the actual current architecture.
- 00:04:28Each dot on there represents a service or microservice
- 00:04:31or tens of thousands of them running amazon.com
- 00:04:34and they're all connected to each other
- 00:04:36through various dependencies.
- 00:04:37I zoomed in on one of them here just to show you
- 00:04:39that there are indeed lines in that diagram.
- 00:04:41I think the diagram is quite beautiful, isn't it?
- 00:04:43But that is the current architecture
- 00:04:45with tens of thousands of services,
- 00:04:49with many thousands of teams owning those services.
- 00:04:52All right, so that brings us to reliable scalability.
- 00:04:54So reliability is the ability of a workload
- 00:04:57to perform its required function correctly and consistently.
- 00:05:00So as we're thinking about that,
- 00:05:02that's why Amazon needed scalability.
- 00:05:05They needed to get big fast and be reliable,
- 00:05:08hence they needed the scalability,
- 00:05:09and today we're gonna be diving into examples
- 00:05:11of amazon.com teams doing that and building on AWS,
- 00:05:15and we're gonna use the Well-Architected Framework
- 00:05:18as a framework to present that to you.
- 00:05:20So the Well-Architected Framework consists of six pillars
- 00:05:22and they're all important, but honestly,
- 00:05:24today we're focused on reliability.
- 00:05:26The reliability pillar has the,
- 00:05:28Well-Architected has best practices.
- 00:05:30Well-Architected is just a documentation
- 00:05:33of all the best practices for building in the cloud.
- 00:05:35It includes other things too. We have hands-on labs.
- 00:05:38We have a Well-Architected Tool
- 00:05:39where you could review your own workloads,
- 00:05:41but honestly in this case,
- 00:05:43we're gonna look at the best practices reliability pillar.
- 00:05:45There are 66 of them.
- 00:05:46We're not gonna look at all 66 of them, but today,
- 00:05:49as I show you the examples I'm showing you,
- 00:05:51I'm gonna talk about which best practice
- 00:05:53is being illustrated in the architectures we're looking at,
- 00:05:57and we're gonna dive right in with our first example.
- 00:06:02Oh, IMDb re-architected to serverless microservices.
- 00:06:06So IMDb, Internet Movie Database.
- 00:06:09Who here has heard of IMDb?
- 00:06:11Okay, and the rest of you just don't wanna raise your hand
- 00:06:13because you don't wanna raise your hand. (laughing)
- 00:06:17Internet Movie Database was acquired by Amazon in 1998.
- 00:06:22It is the number one location to go to learn about movies,
- 00:06:25TV shows, actors, producers, all that good stuff,
- 00:06:28and prior to the re-architecture,
- 00:06:30they were running a monolithic build with a REST API
- 00:06:35on hundreds of EC2 servers.
- 00:06:38So they're on AWS, but they're running on
- 00:06:40hundreds of EC2 instances, servers,
- 00:06:43and when they re-architected,
- 00:06:44they moved to a federated schema with microservices.
- 00:06:47Now, microservices are small, decoupled services
- 00:06:51focused on a specific business domain.
- 00:06:53As for what federated schema is,
- 00:06:55if you don't know already, I'll get to that,
- 00:06:56and they used Lambda for this.
- 00:06:58So they're using Lambda,
- 00:06:59which is the serverless compute in AWS,
- 00:07:02the ability to run code without servers,
- 00:07:06and now we get to the best practice
- 00:07:08and you're gonna see several of
- 00:07:09these slides throughout the talk.
- 00:07:10They have the Well-Architected logo up there
- 00:07:12and the format might be a little odd.
- 00:07:13This, what it is is a snapshot of the Well-Architected Tool
- 00:07:16which is in the AWS console,
- 00:07:18and the way best practices are shown in the framework
- 00:07:20is there's a question that represents
- 00:07:23a set of best practices.
- 00:07:24Then each of those check boxes are a best practice.
- 00:07:27So in this case, the best practices we're interested in is,
- 00:07:29how do you segment your workload,
- 00:07:31and then how do you focus those segments
- 00:07:35on specific business use cases, on specific business needs?
- 00:07:38And you can see I circled microservices there.
- 00:07:40You don't have to use microservices
- 00:07:42to achieve these best practices,
- 00:07:43but that is what IMDb did, so therefore it's circled.
- 00:07:46So these are the first two best practices to look at,
- 00:07:48and to look at that, we're gonna ask a question.
- 00:07:51You're on IMDb and you type in Jackie Chan
- 00:07:54and what it does is run a query:
- 00:07:56what are the top four shows that Jackie Chan is known for?
- 00:08:00Now, Jackie has an id. Every entity in IMDb has an id.
- 00:08:04This nm is a name entity and that's his entity there, 329,
- 00:08:07and so you as a user don't care about that,
- 00:08:10but you've just asked what is Jackie Chan known for,
- 00:08:13and this is the query that the client creates.
- 00:08:16It's GraphQL.
- 00:08:18GraphQL is a query language
- 00:08:19that lets you set up queries like this where you can,
- 00:08:23using a schema, request information
- 00:08:26and get all that information back at once.
- 00:08:27Like with REST, you'd probably have to make four calls.
- 00:08:29Here, you just do it all at once,
- 00:08:31and what is being requested here?
- 00:08:32You could see the name id on top and so that's Jackie Chan,
- 00:08:36and I wanna know the first four things that he's known for,
- 00:08:39and of those, when you gimme those four things,
- 00:08:40I wanna know the title text, I wanna know the release date,
- 00:08:44I wanna know the aggregated ratings,
- 00:08:46and I want an image URL so I could show an image.
- 00:08:48Okay, so that's the query that the front end is making,
- 00:08:51and this is where the microservices
- 00:08:52and the federated schema come into play.
- 00:08:55This request is actually sent
- 00:08:57to four different microservices,
- 00:08:59each fronted by an AWS Lambda in this case.
- 00:09:02So the first one is find me the top four things
- 00:09:05Jackie Chan's known for and it's gonna return the id
- 00:09:07of those four things, which begins with tt.
- 00:09:09Now, that first service doesn't know about
- 00:09:12release date or rating.
- 00:09:13It only knows about top four.
- 00:09:15So the next thing is the title text and the release date.
- 00:09:19That's metadata, so that's gonna go to that other service,
- 00:09:21and that service only knows metadata,
- 00:09:23so it's gonna return the metadata.
- 00:09:24The third one is the ratings,
- 00:09:26so that one only knows ratings.
- 00:09:27It's gonna return the aggregate rating,
- 00:09:28and the last one only knows image URLs,
- 00:09:30so it's gonna return the image URLs,
- 00:09:31and the reason it's a federated schema is
- 00:09:33'cause even though the request is one big schema,
- 00:09:35each of these little microservices only knows
- 00:09:37its own piece to the schema.
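To make the single-round-trip idea concrete, here is a minimal sketch of what such a client call could look like. The gateway URL and field names are assumptions modeled on the description above, not IMDb's actual schema; 329 is the name-entity number mentioned in the talk, padded here in the usual nm-prefixed form.

```python
import json

import requests

GATEWAY_URL = "https://graphql-gateway.example.com/graphql"  # hypothetical endpoint

# One query, answered by four federated microservices behind the gateway.
QUERY = """
query KnownFor($nameId: ID!) {
  name(id: $nameId) {
    knownFor(first: 4) {
      title { text }
      releaseDate
      aggregateRating
      imageUrl
    }
  }
}
"""

def fetch_known_for(name_id: str) -> dict:
    """Send a single GraphQL request; the gateway fans out to the microservices."""
    response = requests.post(
        GATEWAY_URL,
        json={"query": QUERY, "variables": {"nameId": name_id}},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(json.dumps(fetch_known_for("nm0000329"), indent=2))
```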
- 00:09:40So when the front end gets that response to that request,
- 00:09:44it's gonna show it to the user like this.
- 00:09:45You could see Jackie Chan
- 00:09:46and the four things he's known for.
- 00:09:48You could see the release date.
- 00:09:49You could see the aggregated rating,
- 00:09:51and there's only one thing wrong here.
- 00:09:54"Kung Fu Panda" is missing. How could that be?
- 00:09:56I don't know and I really have a bone to pick
- 00:09:58with the IMDb team.
- 00:09:59I'll let them know about it after the talk.
- 00:10:02All right, so now let's get into architecture.
- 00:10:04Okay, so this is what the IMDb architecture looks like.
- 00:10:06It's a gateway-based architecture.
- 00:10:08So they redesigned their gateway
- 00:10:09into the serverless architecture
- 00:10:11so that it can call all of these backend microservices
- 00:10:14that each know their own little piece of the elephant.
- 00:10:17So here's those backend microservices.
- 00:10:18They're just sitting there fronted by Lambda.
- 00:10:20Some of them are completely serverless.
- 00:10:22Many of the newer ones are.
- 00:10:23Some of them, if there was like a legacy service
- 00:10:25or something that they just wanted to update,
- 00:10:27they'll front it with a Lambda so that they could be called,
- 00:10:30and the Lambda's responsible for shaping the data
- 00:10:33so that the GraphQL query response is in the right format.
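As a hedged sketch of that pattern (not IMDb's actual code), a Lambda fronting a legacy ratings service might do little more than call the service and reshape its payload into the slice of the federated schema this microservice owns; the internal URL and field names below are hypothetical.

```python
import json
import urllib.request

# Hypothetical internal endpoint of a legacy ratings service.
LEGACY_RATINGS_URL = "https://ratings.internal.example.com/titles/{title_id}"

def handler(event, context):
    """Resolve only the ratings slice of the schema for a batch of title ids."""
    results = []
    for title_id in event.get("titleIds", []):
        with urllib.request.urlopen(LEGACY_RATINGS_URL.format(title_id=title_id)) as resp:
            legacy = json.load(resp)
        # Reshape the legacy payload into the shape the federated schema expects.
        results.append({
            "id": title_id,
            "aggregateRating": legacy.get("avg_score"),
            "voteCount": legacy.get("num_votes"),
        })
    return {"data": results}
```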
- 00:10:38Okay, over here, okay, so now each of those microservices
- 00:10:41only knows its piece of the schema, and the gateway,
- 00:10:44which is the front end that the client calls,
- 00:10:47needs to know the entire schema.
- 00:10:49You need a schema manager, so here it is.
- 00:10:51When you create a new service or update a service,
- 00:10:54it publishes its little piece of the schema
- 00:10:56to the schema manager, which publishes it into an S3 bucket.
- 00:10:59So the gateway has a full view of the schema,
- 00:11:03and here's the API for, the front end for the API.
- 00:11:06There's an Application Load Balancer. There's a firewall.
- 00:11:09There's a content delivery network piece
- 00:11:11and I'm gonna talk more about that later,
- 00:11:12so I'm gonna put that on hold,
- 00:11:14and this is when we diverge a little bit
- 00:11:16and talk about culture at Amazon.
- 00:11:17I think many of you already heard the two-pizza team.
- 00:11:20A two-pizza team is a team that could be fed
- 00:11:21with approximately two pizzas,
- 00:11:23so not too big, not too small.
- 00:11:25It's a cross-functional team, but it's all about ownership.
- 00:11:28The two-pizza team owns the service
- 00:11:30or services they're responsible for,
- 00:11:31from design to implementation to deployment
- 00:11:35to operation and the business around it.
- 00:11:38So there might be a product manager on the team
- 00:11:40that's a business expert working with developers there.
- 00:11:42So the nice thing about this with this model
- 00:11:44is that this model of creating these federated microservices
- 00:11:49is it moved the business logic for those services
- 00:11:52so that the team could own that business logic.
- 00:11:54So the team that owns the metadata is expert on metadata.
- 00:11:57The team that owns the ratings is expert on ratings,
- 00:12:00and this was organizationally a positive thing for the org,
- 00:12:05and what happened was,
- 00:12:06so they have something called on-call.
- 00:12:08They have a rotating on-call rotation
- 00:12:09where if there's any problem in production,
- 00:12:11they own in production, they have to respond to it,
- 00:12:12and the senior dev told me they were having
- 00:12:14ridiculously smooth on-calls after this
- 00:12:16and that's because the organizational change
- 00:12:19aligned with the technology change
- 00:12:21meant that the teams that owned the business domain
- 00:12:25and the service were available whenever a problem occurred,
- 00:12:28so that a problem occurred in the aggregate service,
- 00:12:31the rating aggregate service,
- 00:12:33that team would be the one called
- 00:12:34and they'd understand what's going on,
- 00:12:35and it also helped that going to serverless
- 00:12:38helped with scalability.
- 00:12:40All right, so the next best practices we're gonna look at
- 00:12:42is using automation when obtaining or scaling resources
- 00:12:46and obtaining resources upon detection that you need them,
- 00:12:48so detecting that you need new resources
- 00:12:51and obtaining them automatically.
- 00:12:53To do that, I'm gonna do a little divergence,
- 00:12:54just talk about Lambda.
- 00:12:55This is not an IMDb architecture.
- 00:12:57This is just a generic serverless architecture
- 00:12:58'cause I wanna talk about Lambda.
- 00:13:00As I said, Lambda is a way to run code without a server,
- 00:13:03but the way it works is you deploy a Lambda instance
- 00:13:07with some code, and then for every request it gets,
- 00:13:10it spins up, invokes a Lambda instance.
- 00:13:13So here you can see six requests. Six Lambdas get spun up.
- 00:13:17They process the requests and if there's no more requests,
- 00:13:19they spin down.
- 00:13:20So it is automatically scaling.
- 00:13:22It'll scale up and down based on
- 00:13:24the number of requests you get,
- 00:13:26and this is the actual metrics for Lambda invocations.
- 00:13:31So these are the number of Lambdas being invoked
- 00:13:33per minute by IMDb and this could also be translated
- 00:13:37to requests per minute because each request,
- 00:13:39each Lambda invocation represents a single request.
- 00:13:42Note it peaks at 800,000 requests per minute,
- 00:13:44which I also converted to requests per second
- 00:13:46if you want to know that,
- 00:13:47and it also goes up and down quite a bit.
- 00:13:49It's quite cyclical.
- 00:13:51I don't know, who saw my Twitter post about this?
- 00:13:52This is one of the things I posted on Twitter and said,
- 00:13:54"Which service is this?"
- 00:13:55Well, it's IMDb. Now you know, and so two things here.
- 00:13:59One is Lambda just scales, right?
- 00:14:02Like, every request it gets, it spins up a Lambda.
- 00:14:05It's auto-scaling,
- 00:14:06but that isn't quite the end of the story.
- 00:14:09The thing about Lambda is that with certain run times,
- 00:14:13when spinning up a new Lambda, it could take,
- 00:14:15there's some latency involved.
- 00:14:16That's called cold start.
- 00:14:18IMDb didn't want any cold starts,
- 00:14:19so they used something called provisioned concurrency.
- 00:14:21With provisioned concurrency,
- 00:14:23you specify a number of Lambdas you wanna keep warm
- 00:14:25and these warm Lambdas won't have cold start,
- 00:14:27and you pay for that, but you pay a fraction
- 00:14:29of what it'll cost to actually run the Lambda.
- 00:14:31So if they specified a flat number like 800,000,
- 00:14:35that'd be wasteful, right?
- 00:14:36'Cause they're not always running 800,000.
- 00:14:37So what you see here is the gray line,
- 00:14:39this is not IMDb, this is a schematic,
- 00:14:41but the gray line represents a number of Lambda invocations
- 00:14:44and the orange stepwise line
- 00:14:47is the provisioned concurrency
- 00:14:49scaling up and then scaling down.
- 00:14:51So not only the number of Lambdas scale up and down,
- 00:14:54but the provisioned concurrency scales up and down,
- 00:15:01which brings us to our next best practice.
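Before the next best practice, a minimal sketch of the two mechanisms just described: keeping a warm baseline with provisioned concurrency, and letting that baseline track utilization via Application Auto Scaling so it steps up and down with the request curve. The function name, alias, and capacity numbers are hypothetical, not IMDb's actual settings.

```python
import boto3

FUNCTION = "imdb-gateway"  # hypothetical function name
ALIAS = "live"             # provisioned concurrency applies to a version or alias

lambda_client = boto3.client("lambda")
autoscaling = boto3.client("application-autoscaling")

# 1. Keep a warm baseline so requests don't hit cold starts.
lambda_client.put_provisioned_concurrency_config(
    FunctionName=FUNCTION,
    Qualifier=ALIAS,
    ProvisionedConcurrentExecutions=100,
)

# 2. Register the alias with Application Auto Scaling and track utilization,
#    so the warm pool scales with traffic instead of staying at a flat number.
resource_id = f"function:{FUNCTION}:{ALIAS}"
autoscaling.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=100,
    MaxCapacity=2000,
)
autoscaling.put_scaling_policy(
    PolicyName="pc-target-tracking",
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 0.7,  # aim for ~70% utilization of the warm pool
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        },
    },
)
```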
- 00:15:03So we're gonna talk about using
- 00:15:07highly available public endpoints.
- 00:15:09So that's that front end I was talking about,
- 00:15:11that actual API endpoint,
- 00:15:13and so I'm gonna zoom in on it here.
- 00:15:15So zooming in on that front end of that gateway,
- 00:15:18we can see a couple of things.
- 00:15:19Okay, they're using the web application firewall, or WAF.
- 00:15:22All right, so WAF is a firewall product offered by AWS
- 00:15:26and they really loved it.
- 00:15:29They said that the initial turn on was exceedingly simple,
- 00:15:32and as soon as they implemented it, it removed,
- 00:15:34they said no more high-sev issues.
- 00:15:36I'll just say vastly reduced their high-sev issues,
- 00:15:39and they didn't have to put
- 00:15:40the manual network blocks in place.
- 00:15:42So what was causing these high-sev issues?
- 00:15:45Robots, either malicious or non-malicious robots.
- 00:15:48That's constantly fighting against the robots
- 00:15:51and so WAF was a solution for them that really worked.
- 00:15:54There's also a CDN here, a content delivery network,
- 00:15:57called CloudFront, and what CloudFront does is
- 00:15:59you might know that AWS is in 30 regions,
- 00:16:02but we have over 410 edge locations.
- 00:16:05So using CloudFront, your user's request, someone using IMDb,
- 00:16:08their request will be routed to one of those edge locations
- 00:16:11closer to them than a region possibly.
- 00:16:14That puts it right on the AWS backbone right away,
- 00:16:16gets better performance,
- 00:16:17and also being a content delivery network,
- 00:16:19it offers caching, so there's caching at that edge location,
- 00:16:22so if it could serve from the cache, it will,
- 00:16:24and finally, the ALB, the Application Load Balancer.
- 00:16:27That is the actual front end that's connected to the Lambda
- 00:16:30that's running the gateway.
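A hedged sketch of the bot-protection piece described above: creating a WAF web ACL with the AWS-managed Bot Control rule group, ready to associate with the CloudFront distribution in front of the gateway. The names are hypothetical, and this is not IMDb's actual rule set.

```python
import boto3

# Web ACLs attached to CloudFront must be created in us-east-1.
wafv2 = boto3.client("wafv2", region_name="us-east-1")

wafv2.create_web_acl(
    Name="gateway-bot-protection",
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},
    Rules=[
        {
            "Name": "aws-bot-control",
            "Priority": 0,
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesBotControlRuleSet",
                }
            },
            # Let the managed rule group's own allow/block actions apply.
            "OverrideAction": {"None": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "BotControl",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "GatewayWebAcl",
    },
)
```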
- 00:16:33All right, that was our first example.
- 00:16:34I hope you enjoyed it, and we got a few more.
- 00:16:36So let's talk about Global Ops Robotics
- 00:16:38and how they protect workloads
- 00:16:39with a cell-based architecture.
- 00:16:42So to understand what Global Ops Robotics is,
- 00:16:44you have to understand a little bit
- 00:16:46about Amazon's supply chain.
- 00:16:47So as a user, you have this ordering layer
- 00:16:49that you're seeing, like I see something,
- 00:16:51I order it, it shows up on my door,
- 00:16:53but under that is a supply chain layer.
- 00:16:55There's the warehouse management piece,
- 00:16:57which is things going on inside the warehouse,
- 00:16:58or fulfillment center as we call them, as Amazon calls them,
- 00:17:02middle mile, which is moving things
- 00:17:04to the warehouse or between them, and last mile,
- 00:17:07which is moving things to your front door.
- 00:17:09Well, Ops Robotics is the warehouse management piece.
- 00:17:12That's what they call it, and with Ops Robotics,
- 00:17:15all of these are about scale.
- 00:17:16I wanna talk about the scale
- 00:17:17of warehouse management at Amazon.
- 00:17:19There's over 500 of these warehouses, fulfillment centers.
- 00:17:22They could be up to a million square feet big
- 00:17:25and there's millions of items per fulfillment center.
- 00:17:28Now, the Ops Robotics team that runs warehouse management
- 00:17:30has multiple services.
- 00:17:32So what kind of services?
- 00:17:33Well, they need services that understand
- 00:17:35when material is received, where it needs to go,
- 00:17:38stow, picking it when someone orders it,
- 00:17:40packing it and shipping it,
- 00:17:42and so all of these are services
- 00:17:43that are part of Global Ops Robotics,
- 00:17:45and behind these services are multiple microservices.
- 00:17:48So you have hundreds,
- 00:17:49maybe even 1,000 microservices operating here,
- 00:17:53and the reliability pillar best practice
- 00:17:56we're gonna talk about is using bulkhead architectures.
- 00:17:59Bulkhead architectures mean setting up compartmentalization
- 00:18:03that you have multiple of these compartments,
- 00:18:05and if a failure occurs in one,
- 00:18:06it can't affect the others,
- 00:18:09and we're doing this with cell-based architecture.
- 00:18:11Again, this is not Global Ops Robotics.
- 00:18:13This is not warehouse managers.
- 00:18:14This is a generic slide on cell-based architectures.
- 00:18:18With cell-based architectures what we're doing
- 00:18:20is stamping out a complete stack multiple times
- 00:18:23isolated from each other,
- 00:18:24they don't share data with each other,
- 00:18:26and putting a thin routing layer on top.
- 00:18:28That routing layer deterministically assigns clients,
- 00:18:32I put, see clients in quotes, to a cell.
- 00:18:35So a given client, and when I say clients in quotes,
- 00:18:38you could actually, it could be user ID.
- 00:18:39It could be whatever.
- 00:18:40It could be several different things,
- 00:18:41some partition key to each cell
- 00:18:43so that you have a certain number of clients
- 00:18:44going to each cell, and if there's a failure in one cell,
- 00:18:47yes, the clients in that cell might be affected,
- 00:18:49but the clients in the other cells
- 00:18:51are isolated from the failure.
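One simple way a thin routing layer can assign clients to cells deterministically is by hashing the partition key, as in this sketch (the cell endpoints are hypothetical). As described next, Global Ops Robotics instead pins fulfillment centers to cells with an explicit assignment table, shown in a later sketch, which also allows rebalancing.

```python
import hashlib

# Hypothetical cell endpoints; each is a full, isolated copy of the stack.
CELLS = [
    "https://cell-1.example.internal",
    "https://cell-2.example.internal",
    "https://cell-3.example.internal",
]

def cell_for_client(client_id: str) -> str:
    """Deterministically map a client key (user id, FC id, ...) to one cell."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]

# The same client always routes to the same cell, so a failure in one cell
# only touches the clients assigned there.
assert cell_for_client("FC-A") == cell_for_client("FC-A")
print(cell_for_client("FC-A"), cell_for_client("FC-B"))
```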
- 00:18:54Now, in their case, they're fulfillment centers.
- 00:18:57They're the warehouses and their client ID
- 00:19:00would be the fulfillment center ID.
- 00:19:02Each fulfillment center does not share data
- 00:19:04with the other fulfillment centers.
- 00:19:05It's a discrete data set for them only.
- 00:19:09So it makes sense that when we're just using,
- 00:19:11deciding about that routing layer,
- 00:19:13how you assign requests to cells,
- 00:19:15we do it by fulfillment center.
- 00:19:17Each fulfillment center is assigned a cell
- 00:19:19and all their requests go to their cell.
- 00:19:22They might be sharing with other fulfillment centers,
- 00:19:24but their requests always go to the same cell.
- 00:19:26So in this case you could see three fulfillment centers
- 00:19:28sitting near each other and each one assigned
- 00:19:31to a different cell, and this,
- 00:19:33the thing they wanted to establish
- 00:19:34was this geographic redundancy.
- 00:19:36So you notice that these three are kinda clustered together
- 00:19:39and they're serving this area of the United States,
- 00:19:41Ohio, Indiana, I think that is.
- 00:19:43So what happens if there's a failure?
- 00:19:46It's contained to that cell, Cell2.
- 00:19:49That FC might be offline,
- 00:19:50but there's still two more FCs in that geographic region
- 00:19:54and the trucks continue to roll
- 00:19:55and people get their products still.
- 00:19:58Now, when they're deploying these cells,
- 00:20:00they're actually using separate AWS accounts
- 00:20:02for each cell and they're using pipelines,
- 00:20:04pipelines to deploy the infrastructure,
- 00:20:06pipelines to deploy the code.
- 00:20:07So the first deployment goes to a pre-prod cell
- 00:20:11that's not really used in production
- 00:20:12but used for testing and before it rolls out,
- 00:20:15and then subsequently it gets deployed
- 00:20:17to each of the three other cells
- 00:20:19each in a separate AWS account,
- 00:20:21and there's also another account
- 00:20:24that's a centralized repository of all the logs and traces
- 00:20:27that the other cells are exporting to
- 00:20:30so you can get an all-up view of the system
- 00:20:32'cause you don't wanna look at it cell by cell,
- 00:20:34or you do wanna look at it cell by cell sometimes,
- 00:20:36but you also wanna look at it all up,
- 00:20:37so that's an aggregation point
- 00:20:39where all the logs and traces could be aggregated.
- 00:20:43Now here's what it looks like.
- 00:20:45Each of the green boxes is a cell.
- 00:20:47Each one of the yellow circles with a letter in it
- 00:20:49is a fulfillment center and they each have their own ID.
- 00:20:52They're lettered in this case,
- 00:20:53and this is a cellular architecture.
- 00:20:55We're showing Service 1 and Service 2.
- 00:20:57Service 2 depends on Service 1.
- 00:20:59Service 1 is an upstream dependency of Service 2,
- 00:21:06and this is cellular wise.
- 00:21:07This is a cell-based architecture.
- 00:21:09What they found the problem here is,
- 00:21:10if there's a failure in cell one in Service 1,
- 00:21:13the way this is architected,
- 00:21:15there are negative impacts on the cells and services
- 00:21:18in Service 2 because of the dependencies,
- 00:21:21how each fulfillment center can be swapping cells
- 00:21:24based on the service.
- 00:21:26So what they wanted to do was establish this.
- 00:21:29Each fulfillment center is assigned to a given cell
- 00:21:33and it's only in that cell for every service in the stack,
- 00:21:36and now if there's a failure, like we saw before,
- 00:21:39it's not the greatest thing in the world to have a failure,
- 00:21:41but it's constrained and those other fulfillment centers
- 00:21:44continue to operate normally,
- 00:21:47and the way they did this was they designed a system
- 00:21:51to assign fulfillment centers to cells.
- 00:21:54They did this using DynamoDB,
- 00:21:55which is our NoSQL, very fast database,
- 00:21:59and this had two effects,
- 00:22:01aligning fulfillment centers to a cell
- 00:22:03but also allowing them to load balance between cells
- 00:22:06'cause fulfillment centers are different sizes.
- 00:22:08So you can't just put three per cell
- 00:22:09or whatever like I did here.
- 00:22:11So this system also runs various rules and heuristics
- 00:22:14to balance out the cell so no cell is
- 00:22:16particularly bigger than another one,
- 00:22:20and that's our cellular architecture.
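A hedged sketch of that assignment system: the routing layer looks up a fulfillment center's cell in a DynamoDB table, and a separate balancing job writes the assignments according to the sizing rules and heuristics. Table and attribute names are hypothetical.

```python
import boto3

# Hypothetical assignment table written by the balancing job.
table = boto3.resource("dynamodb").Table("fc-cell-assignments")

def cell_for_fulfillment_center(fc_id: str) -> str:
    """Return the cell endpoint this fulfillment center is pinned to."""
    item = table.get_item(Key={"fc_id": fc_id}).get("Item")
    if item is None:
        raise LookupError(f"no cell assignment for {fc_id}")
    return item["cell_endpoint"]

def assign_fulfillment_center(fc_id: str, cell_endpoint: str) -> None:
    """Called by the job that applies the sizing rules and heuristics."""
    table.put_item(Item={"fc_id": fc_id, "cell_endpoint": cell_endpoint})
```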
- 00:22:21Now I wanna talk about Amazon Relay
- 00:22:24and how they use multi-region to keep their trucks moving.
- 00:22:27So trucks are involved here.
- 00:22:28So we're still in the supply chain world.
- 00:22:30So we talked about warehouse management.
- 00:22:32Now we're gonna talk about middle mile management.
- 00:22:34Spoiler alert, I do not have an example for last mile.
- 00:22:37So if you're expecting that, come back next year.
- 00:22:39I'll have one next year.
- 00:22:41All right, so this is about middle mile.
- 00:22:43So middle mile is the semi trucks you see on the road
- 00:22:46with the Amazon Prime symbol on it.
- 00:22:48This is about moving stuff into warehouses
- 00:22:50and between warehouses and making sure
- 00:22:53that all the millions and millions of items
- 00:22:55in Amazon's inventory are in the right place
- 00:22:58to be able to serve customers.
- 00:23:00Now, this example I'm gonna show you
- 00:23:01focuses on North America,
- 00:23:02but middle mile exists around the world,
- 00:23:05and I'm gonna talk about the Relay app.
- 00:23:07The Relay app is an app for iOS and Android
- 00:23:09that the truckers use.
- 00:23:10So if you could think of middle mile
- 00:23:12as having this really sophisticated model
- 00:23:14that determines where stuff should be
- 00:23:16and when it should be there all over United States,
- 00:23:19this is how that model's realized.
- 00:23:21That model is just something in a computer.
- 00:23:23It's meaningless unless you can get trucks rolling
- 00:23:25and moving stuff around.
- 00:23:26This is the realization of that model.
- 00:23:28This is the model that truck drivers use
- 00:23:30to know where to go, when to go there,
- 00:23:33what to pick up, where to take it,
- 00:23:36and you could download this app today on your phone.
- 00:23:38I did it and it's pretty useless
- 00:23:39unless you're a truck driver.
- 00:23:40So truck drivers in the audience, feel free to download it,
- 00:23:43but everybody else. (laughing)
- 00:23:46Okay, so best practice I'm gonna talk about
- 00:23:49is using highly available endpoints.
- 00:23:51All right, so we talked about that already, right?
- 00:23:52Highly available endpoints.
- 00:23:53Oh, we talked about that when we talked about
- 00:23:55the IMDb gateway.
- 00:23:56Well, same thing here.
- 00:23:57Amazon Relay being an app has a gateway too,
- 00:24:00again, a single point of entry that the app is,
- 00:24:03the iOS app and the Android app are both calling into,
- 00:24:07and just like IMDb, there's a gateway
- 00:24:09and it's fronting several backend services.
- 00:24:11In this case, they call them modules.
- 00:24:12So I'm gonna call them the modules too.
- 00:24:14So there you can see the modules there.
- 00:24:15The modules are mostly serverless
- 00:24:18consisting of Lambda and DynamoDB, and there's the gateway.
- 00:24:22So unlike IMDb, they're not using Application Load Balancer.
- 00:24:24They're using API Gateway.
- 00:24:26API Gateway is a highly scalable managed API
- 00:24:30and you can see there's multiple API Gateways there
- 00:24:31'cause the way this works is you could use,
- 00:24:34well, see, you could use Route 53, which is not shown there,
- 00:24:37which is the DNS system, to create a domain name,
- 00:24:40and then based on path-based routing,
- 00:24:42like what's after the slash and what's after domain name,
- 00:24:45it goes to a different API Gateway
- 00:24:47and then API Gateway fronts one of these backend modules,
- 00:24:50and you can also see there's also
- 00:24:51some authentication logic in there too.
- 00:24:53So that's important and it's calling into
- 00:24:54the Amazon authentication system to do that.
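The talk describes one domain whose paths fan out to different API Gateways, each fronting a module. One common way to realize that, assumed here rather than confirmed as Relay's exact mechanism, is an API Gateway custom domain with base path mappings; the domain, API ids, and paths below are hypothetical.

```python
import boto3

apigw = boto3.client("apigateway")

DOMAIN = "relay-api.example.com"  # hypothetical custom domain pointed at by Route 53

# Map each path prefix under the domain to its own REST API and stage.
MODULE_APIS = {
    "loads": ("a1b2c3d4e5", "prod"),       # e.g. relay-api.example.com/loads/...
    "navigation": ("f6g7h8i9j0", "prod"),  # e.g. relay-api.example.com/navigation/...
}

for base_path, (rest_api_id, stage) in MODULE_APIS.items():
    apigw.create_base_path_mapping(
        domainName=DOMAIN,
        basePath=base_path,
        restApiId=rest_api_id,
        stage=stage,
    )
```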
- 00:24:58Now, what they really liked about this model
- 00:24:59when they went to it is there's no shared ownership
- 00:25:01of code or infrastructure between the gateway
- 00:25:05and the backend modules.
- 00:25:06So they could deploy independently.
- 00:25:08They could make changes independently
- 00:25:09as long as they don't break any contracts
- 00:25:11and it gave them a lot more flexibility.
- 00:25:15Now, the other best practices we wanna look into
- 00:25:17with these two teams are to deploy the workload
- 00:25:19to multiple locations and choose the appropriate locations
- 00:25:22for those deployments, and to talk about this,
- 00:25:26we need to go back to December of 2021.
- 00:25:29As many of you know, in us-east-1 in December 2021,
- 00:25:32there was an event that caused several services
- 00:25:35to experience service issues,
- 00:25:37and one of the services affected was SNS,
- 00:25:39or Simple Notification Service,
- 00:25:41and Relay app does depend on that
- 00:25:44and you can see the effect.
- 00:25:45All right, so some truck drivers
- 00:25:48could not get their load assignments.
- 00:25:49They couldn't get the assignment of where to go
- 00:25:50and what to pick up and you can see it wasn't 100%.
- 00:25:53It went up to about 30% at its peak
- 00:25:55and it was for just some limited period of time,
- 00:25:57but that's still an impact on our customers
- 00:25:59and Amazon does not wanna have that kind of impact.
- 00:26:04So what could you do?
- 00:26:05You could redesign to either not use SNS
- 00:26:08or make SNS a soft dependency
- 00:26:10or you could take the approach they did and use spares.
- 00:26:13Spares is where you set up multiple instances of a resource
- 00:26:16so if one of them is not working, you could use the other.
- 00:26:19Now, in this case, SNS is a regional service,
- 00:26:23so in order to be able to use a different SNS service,
- 00:26:26they had to go to another region
- 00:26:27and I'll show you how they did that,
- 00:26:29but first I gotta introduce you
- 00:26:31to another cultural thing at Amazon,
- 00:26:32the COE, or correction of error event.
- 00:26:35So when something like this happens
- 00:26:36where 30% of the truck drivers
- 00:26:38are not able to get their load assignments,
- 00:26:39that's customer impacting, the team does a COE.
- 00:26:42A COE is a deep dive as to what caused the issue
- 00:26:45and how it could be avoided.
- 00:26:46It's blameless. It's not there to point fingers.
- 00:26:49It's not there to find the culprit as a person.
- 00:26:52It's there to find the actual cause of the issue
- 00:26:54and to come up with solutions,
- 00:26:56actions so that issues like this,
- 00:27:00an issue like this or related to this
- 00:27:01can never happen again,
- 00:27:03and here are some of the ones they came up with
- 00:27:05and the ones I'm gonna talk about.
- 00:27:06I'm gonna talk about how they did
- 00:27:08a review of the resiliency of the Relay app
- 00:27:11and how they then deployed to multiple regions
- 00:27:15to enact what was found in that review.
- 00:27:18So in that review,
- 00:27:21their primary goal there was to preserve
- 00:27:23physical operational continuity,
- 00:27:25even if the experience is degraded.
- 00:27:27So what do I mean by degraded? Let's talk about that.
- 00:27:29So the three steps they did was they had to articulate
- 00:27:31the minimum critical workflow.
- 00:27:33So which parts of this have to work, while other parts,
- 00:27:36if they're not working, it's not optimal,
- 00:27:38but we could still keep the trucks rolling.
- 00:27:41Two, design solutions that those critical parts
- 00:27:44remain operational, and three,
- 00:27:46adapt the system so that when the parts
- 00:27:49that are not so critical stop working,
- 00:27:51the system could still operate.
- 00:27:53That's what we mean by the degraded experience.
- 00:27:55It still works, the critical functions are there,
- 00:27:59and they just, as I said before,
- 00:28:00they went with a multi-region approach.
- 00:28:02So they were already deployed in us-east-1,
- 00:28:05and fun fact, they'd been running out of
- 00:28:08what was the predecessor to us-east-1 before AWS existed.
- 00:28:12Amazon had data centers there
- 00:28:14and that's where they ran out of,
- 00:28:16but they also decided to deploy to us-west-2 over in Oregon.
- 00:28:20Amazon has, AWS has 30 regions all over the world.
- 00:28:25You could see those are the ones in North America,
- 00:28:28and the solution looked like this.
- 00:28:29So this is the backend modules, okay?
- 00:28:31So the backend modules weren't as necessarily as simple
- 00:28:33as a Lambda and a DynamoDB,
- 00:28:35but the thing about them is that they all were fronted
- 00:28:37by Lambda so they could integrate with the API Gateway
- 00:28:39and they all persisted their important data,
- 00:28:41the data that needed to be shared, in DynamoDB,
- 00:28:44and so in this case, you could see they deployed
- 00:28:46to us-east-1 and us-west-2,
- 00:28:48and the nice thing about DynamoDB,
- 00:28:52it has something called global tables.
- 00:28:55With DynamoDB global tables,
- 00:28:56you can deploy a table in multiple regions
- 00:28:59and write to any of those tables
- 00:29:01and those writes will be replicated to the other regions.
- 00:29:03So they found that just to be an easy solution
- 00:29:05just to put right in there.
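A hedged sketch of that pattern: add a us-west-2 replica to an existing table once, then have each module write to the replica in its own region and let global tables handle replication. Table, item, and attribute names are hypothetical.

```python
import os

import boto3

TABLE = "relay-load-assignments"  # hypothetical table name

# One-time setup: add a us-west-2 replica to an existing us-east-1 table.
boto3.client("dynamodb", region_name="us-east-1").update_table(
    TableName=TABLE,
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)

# Runtime path: write to whichever region this instance of the module runs in;
# the write is replicated to the other region automatically.
local_region = os.environ.get("AWS_REGION", "us-east-1")
table = boto3.resource("dynamodb", region_name=local_region).Table(TABLE)
table.put_item(Item={"load_id": "load-123", "driver_id": "driver-42", "status": "ASSIGNED"})
```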
- 00:29:07Now, each of these modules is owned by a two-pizza team
- 00:29:10or a two-pizza team might own more than one of them,
- 00:29:12but they're all owned by a two-pizza team,
- 00:29:14and the two-pizza teams, based on the criticality analysis,
- 00:29:17decided whether they were gonna go multi-region or not.
- 00:29:19Not all of them did because you have to pick
- 00:29:21where to put your resources right, where to invest.
- 00:29:26Now, the gateway part of it, the part in front there,
- 00:29:29also was deployed to two regions.
- 00:29:30You could see that API Gateway
- 00:29:31which is representing the gateway
- 00:29:33going to us-east-1 and us-west-2,
- 00:29:36and now we put Route 53 in front of it.
- 00:29:38So Route 53 is our DNS system.
- 00:29:40This is called an active/active architecture.
- 00:29:44What it means is that each of the two regions here
- 00:29:47actively receive requests.
- 00:29:49A given request doesn't go to both regions.
- 00:29:51It goes to one or the other.
- 00:29:52How does it decide which one to go to?
- 00:29:53Well, Route 53 offers several routing policies.
- 00:29:56In this case, they decided to use latency routing.
- 00:29:58So Route 53, based on past experience,
- 00:30:00will determine for a given request
- 00:30:02which one's gonna give the lowest latency
- 00:30:04and route the request there.
- 00:30:05There are other routing policies.
- 00:30:06There's weighted routing. There's geolocation routing.
- 00:30:09So it routes it based on where the request came from.
- 00:30:12So there's all kinds of different options.
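A hedged sketch of what latency-based records for the two regional endpoints could look like in Route 53; the hosted zone, domain, and endpoint names are hypothetical. These are also the records a failover exercise would later adjust to drain a region.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000EXAMPLE"  # hypothetical hosted zone
NAME = "api.relay.example.com"

def latency_record(region: str, target: str) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": NAME,
            "Type": "CNAME",
            "SetIdentifier": region,  # one record per region
            "Region": region,         # enables latency-based routing
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            latency_record("us-east-1", "east.api-gw.example.com"),
            latency_record("us-west-2", "west.api-gw.example.com"),
        ]
    },
)
```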
- 00:30:13This team, Relay, went with the latency-based routing,
- 00:30:17and so this is a request to us-east-1
- 00:30:20and you can see the module called module A.
- 00:30:24They are a module that did go multi-region.
- 00:30:26So the request goes to their us-east-1 version
- 00:30:29and then module B there did not go multi-region,
- 00:30:32so the request also goes to us-east-1.
- 00:30:35However, requests that went to us-west-2,
- 00:30:38this is where it gets interesting,
- 00:30:39for module A, it's gonna go to us-west-2,
- 00:30:42but module B, a less critical module,
- 00:30:44didn't set up anything in us-west-2,
- 00:30:46so it's still gonna receive its request in us-east-1,
- 00:30:49and we'll see how that just plays out later
- 00:30:51in various failure scenarios.
- 00:30:54Yeah, so that's how it works.
- 00:30:55All right, so the next best practice
- 00:30:56is to implement graceful degradation
- 00:30:58to turn hard dependencies into soft dependencies.
- 00:31:01Now, I talked a little bit about
- 00:31:02what graceful degradation is.
- 00:31:03It's about maintaining the critical parts of your workload
- 00:31:06while the less critical ones might fail, but overall,
- 00:31:09the end users still can do the things they need to do.
- 00:31:13So this is that analysis they did
- 00:31:14when they wrote up that report.
- 00:31:16From going left to right in order,
- 00:31:18these are the things that a truck, a delivery goes through,
- 00:31:21the various business domain specific things
- 00:31:24that middle mile goes through, and what the red lines,
- 00:31:26the red bars represent are criticality.
- 00:31:28So by creating a graph like this,
- 00:31:30they're able to identify which modules are critical
- 00:31:33and which ones are less critical.
- 00:31:34So for instance, it's critical that they be able
- 00:31:37to complete a delivery.
- 00:31:39It's critical that you could assign drivers
- 00:31:41to pick up their loads.
- 00:31:42Then what's not critical? What can we do without?
- 00:31:44Well, the app provides turn-by-turn navigation.
- 00:31:47So if that goes out, again, not optimal,
- 00:31:50but there's other GPS systems.
- 00:31:51The app also has this long-term booking
- 00:31:53where you could book next week's loads.
- 00:31:55Well, that's important, but it might not be important now
- 00:31:58while there's some kind of issue going on,
- 00:31:59and eventually, whatever the issue is going on,
- 00:32:02it's gonna be solved and then you could assign
- 00:32:03next week's loads.
- 00:32:04So I really like the subtitle here, "The trucks keep moving,
- 00:32:07no products backed up on the docks."
- 00:32:09That's what they told me, "The trucks keep moving,
- 00:32:10no products backed up on the docks,"
- 00:32:12and that's what they're aiming for,
- 00:32:14and so the next best practice
- 00:32:15we're gonna look at is fail over.
- 00:32:17Okay, being able to fail over to healthy resources.
- 00:32:19So what happens if they have another event
- 00:32:22where they want to fail out of us-east-1
- 00:32:25and be purely in us-west-2?
- 00:32:27So in this case, using the routing policy,
- 00:32:29they could turn off all traffic to us-east-1,
- 00:32:31send all the traffic to us-west-2, and this is what happens.
- 00:32:34So, I'm sorry. I'm gonna actually go back.
- 00:32:36So notice that for module A,
- 00:32:39it's gonna use the version of module A that's in us-west-2,
- 00:32:43and that's a critical module
- 00:32:44and it's gonna continue operating.
- 00:32:46What happens to module B?
- 00:32:47Remember, module B never set up a us-west-2 version.
- 00:32:51So one of two things is probably gonna happen.
- 00:32:52Either one, the request is gonna, well,
- 00:32:54the request is gonna go to us-east-1 where we failed out of,
- 00:32:57but we failed outta there because we're seeing some issue
- 00:32:59that we think we wanna fail out for but doesn't mean,
- 00:33:01the region's never hard down.
- 00:33:03That doesn't happen.
- 00:33:04So the service in us-east-1 still might respond
- 00:33:07and that's a best-case scenario,
- 00:33:08but worst-case scenario, it doesn't respond,
- 00:33:10and because of the way the system's designed
- 00:33:12and graceful degradation,
- 00:33:14it's again a less than optimal experience
- 00:33:16but an experience that allows the users
- 00:33:18to do their critical functions,
- 00:33:19which is to keep the trucks moving,
- 00:33:20nothing backed up on the docks,
- 00:33:23and the last one we're gonna look at
- 00:33:25is about testing your disaster recovery strategy
- 00:33:27'cause you could have a disaster recovery strategy,
- 00:33:29but if you don't test it, you don't know if it works,
- 00:33:32and so they ran a game day, a game day basically to exercise
- 00:33:34this disaster recovery strategy.
- 00:33:36They wanna be prepared for peak 2022.
- 00:33:38Peak at Amazon represents the holiday season.
- 00:33:40I think we're already in it with Black Friday
- 00:33:42and Cyber Monday already going on.
- 00:33:45So what they did was they initiated
- 00:33:47a fail over in production.
- 00:33:51They acted as if they needed to get out of us-east-1.
- 00:33:53They didn't need to, but they acted as if they did,
- 00:33:56got out of us-east-1, failed over,
- 00:33:58sent all the traffic to us-west-2 and this is what happened.
- 00:34:02You notice that the increase in traffic in us-west-2
- 00:34:04is way over 100%.
- 00:34:05If it was evenly balanced,
- 00:34:07you'd expect it to be 100% increase,
- 00:34:08but it was more than 100% increase.
- 00:34:09So this represents that most of the traffic's still going
- 00:34:12to us-east-1 and that's probably just the nature
- 00:34:14of population density in the United States.
- 00:34:17The other thing I forgot to mention is
- 00:34:18when they went to the active/active model,
- 00:34:20truck drivers in the West started seeing
- 00:34:22much lower latencies 'cause their requests
- 00:34:23were being sent to the West region.
- 00:34:26Also, when they failed over,
- 00:34:27they actually were able to successfully run the service
- 00:34:30without any significant customer impact
- 00:34:32or failures, et cetera.
- 00:34:34It took 'em about 10 minutes to execute the fail over.
- 00:34:36They did see an increase in latency and it was,
- 00:34:40they're working on it and they're still re-engineering
- 00:34:42to try to get that down, but the increase in latency still,
- 00:34:44again, maybe less than optimal,
- 00:34:46but everyone was still able
- 00:34:48to do the critical functions, kept the trucks rolling,
- 00:34:50nothing backing up on the docks.
- 00:34:56All right, our next example is
- 00:34:57the Classification and Policies Platform
- 00:34:59and how they use shuffle sharding to limit blast radius.
- 00:35:02So basically similar to before, similar to before,
- 00:35:06blast radius is about containing the failure
- 00:35:09to an area, to a cell, in this case, a shard,
- 00:35:12so that it doesn't affect other parts of the system,
- 00:35:15and so what is Classification and Policy Platform?
- 00:35:18Well, they're part of the catalog service
- 00:35:21and the catalog at Amazon is massive,
- 00:35:24millions and millions of items in the Amazon catalog,
- 00:35:27and every single one of those items needs to be classified.
- 00:35:30So what do I mean by classified?
- 00:35:31Well, there's 50 different classification programs.
- 00:35:34It could be as simple as what type is it.
- 00:35:35Is it clothing? Is it electronics?
- 00:35:37It could be what kinda taxes should be applied.
- 00:35:40Can this thing be put on an airplane? Is it hazardous?
- 00:35:43Is it something that we can sell in a certain state?
- 00:35:46Is it something that children are allowed to use?
- 00:35:48I mean, there's all kinds of classification going on,
- 00:35:5150 of these programs which are actually
- 00:35:53not necessarily part of this team.
- 00:35:55This team runs the platform to host
- 00:35:57all these classification programs
- 00:35:58and applying classification to all the millions
- 00:36:01and millions of things in the Amazon catalog,
- 00:36:04and why is this important?
- 00:36:05Well, I kinda gave this away a little bit because I said,
- 00:36:07all right, so here's an item that I,
- 00:36:10living in Washington, can buy,
- 00:36:12but when John living in California goes to buy it, it says,
- 00:36:15"No, you can't have it because California restricts
- 00:36:18this item or says you can't have it,"
- 00:36:19and that's an example of how the classification was applied
- 00:36:22to the item and the ordering service was able
- 00:36:24to read that classification and say,
- 00:36:25"No, I cannot sell it to people in California,"
- 00:36:29and again, it's about scale.
- 00:36:31So there's 50 programs across Amazon
- 00:36:33doing this classification that are using this platform.
- 00:36:35There's over 10,000 machine learning models being applied,
- 00:36:38100,000 rules, so that's like if this, then that,
- 00:36:41so less sophisticated than machine learning
- 00:36:43but still important,
- 00:36:44and there's 100 model updates every day.
- 00:36:47So of these 10,000 machine learning models,
- 00:36:49100 are being updated every day.
- 00:36:50Millions of products are being updated per hour.
- 00:36:53So this is what it looks like.
- 00:36:54All right, so this is,
- 00:36:55if you're dozing off, time to pay attention
- 00:36:57'cause this is where it gets a little complicated.
- 00:36:58I wanna make it simple, all right?
- 00:37:00So you have the millions of items
- 00:37:01that need to be classified and we're breaking them up
- 00:37:05into batches of about 200 each,
- 00:37:06but apparently it can vary quite a bit.
- 00:37:08So that's not so important.
- 00:37:10What's important is that we need to apply about 100,
- 00:37:12not all 10,000 machine learning models,
- 00:37:14but about 100 machine learning models
- 00:37:17to every item coming in
- 00:37:18and we do that in batches of 30 models.
- 00:37:21So that means there's gonna be three requests made,
- 00:37:23three batches of 30.
- 00:37:24Why 30? I'll get to that in a minute.
- 00:37:27So these requests to process these items
- 00:37:29for 30 machine learning models go to a classifier.
- 00:37:32In this case, it's an Elastic Container Service service
- 00:37:36that's running machine learning models against these items,
- 00:37:39and what it does is it pulls the models down from S3.
- 00:37:43So it says, "Oh, these items need these 30 models.
- 00:37:46I'm gonna pull these 30 models down and run them,"
- 00:37:49and it can cache the models and that's important
- 00:37:51'cause we want to actually try to use workers,
- 00:37:54these are all workers, these classifiers,
- 00:37:55that already have those models cached.
- 00:37:57So pulling the models down
- 00:37:58and swapping 'em out is inefficient,
- 00:38:01and then after it does the classification,
- 00:38:02it writes it to DynamoDB.
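To make that flow concrete, here is a minimal sketch of what a classifier worker's caching behavior could look like. Everything named here (the bucket, the table name, the deserialization helper) is a hypothetical stand-in rather than the team's actual code; the point is just the cache-hit/miss path from S3 and the write to DynamoDB described above.

```python
import boto3
from collections import OrderedDict

# Hypothetical resources: bucket and table names are placeholders, not the real ones.
s3 = boto3.resource("s3")
table = boto3.resource("dynamodb").Table("item-classifications")

MAX_CACHED_MODELS = 30  # roughly one 30-model group per worker, per the talk


def deserialize_model(raw_bytes):
    # Stand-in: a real worker would deserialize an actual ML model artifact here.
    return lambda item: {"label": "placeholder", "model_size": len(raw_bytes)}


class ModelCache:
    """LRU cache of classification models, pulled down from S3 on a miss."""

    def __init__(self, bucket):
        self.bucket = bucket
        self.models = OrderedDict()

    def get(self, model_id):
        if model_id in self.models:            # cache hit: no S3 round trip
            self.models.move_to_end(model_id)
            return self.models[model_id]
        # Cache miss: pull the model artifact from S3 (the expensive path we want to avoid).
        raw = s3.Object(self.bucket, f"models/{model_id}").get()["Body"].read()
        model = deserialize_model(raw)
        self.models[model_id] = model
        if len(self.models) > MAX_CACHED_MODELS:
            self.models.popitem(last=False)    # evict the least recently used model
        return model


def process_batch(cache, items, model_ids):
    """Apply a group of ~30 models to a batch of catalog items, then persist the results."""
    for item in items:
        results = {m: cache.get(m)(item) for m in model_ids}
        table.put_item(Item={"item_id": item["id"], "classifications": results})
```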
- 00:38:04So this is the logical view.
- 00:38:05Let me show you the architectural view.
- 00:38:07Oh, wait, I promised to tell you why 30 is important.
- 00:38:09Well, two reasons.
- 00:38:11They found that that's a nice size
- 00:38:12where they could take 30 related models,
- 00:38:15so like one model might actually use
- 00:38:16the output of another one.
- 00:38:18So they're sort of related in some way,
- 00:38:20but the other reason is about
- 00:38:20that caching I was talking about.
- 00:38:22There's only so many models that these services
- 00:38:25can keep in cache.
- 00:38:27So if you told it to be running 100 models,
- 00:38:29it can't possibly keep those all in cache,
- 00:38:31so limiting it to 30, again,
- 00:38:33allows you to avoid the swapping out.
- 00:38:35Remember, trying to avoid that swapping out.
- 00:38:38So this is more an architectural view.
- 00:38:39So taking one of those requests,
- 00:38:41which again is for 30 models,
- 00:38:43goes through Kinesis where Kinesis reads the metadata
- 00:38:46on the request and decides what models are gonna be applied,
- 00:38:48and this is important.
- 00:38:49This is the part where it sends it to an AWS Lambda
- 00:38:50which is acting as a router.
- 00:38:52It's the AWS Lambda that says,
- 00:38:54"Oh, you need these 30 models?
- 00:38:56I'm gonna send you to this worker,"
- 00:38:58and it puts it on an SQS queue where then the workers,
- 00:39:02the ECS services, read it off the queue.
- 00:39:04So in other words, there's 60 of these workers.
- 00:39:06The workers are dumb.
- 00:39:07They'll process whatever you give them.
- 00:39:09You tell 'em to process
- 00:39:10these 30 machine learning models.
- 00:39:11They'll check: is it in cache?
- 00:39:12Yeah, all right, I'll do it.
- 00:39:13Is it not in cache? All right, I'll pull it down.
- 00:39:15They don't care, so all the smarts are in that Lambda.
- 00:39:18That Lambda is attempting to keep
- 00:39:20each of these 30 model requests
- 00:39:22in the same worker or workers
- 00:39:24that have processed those 30 before to avoid the swapping,
- 00:39:30and so we're gonna talk about
- 00:39:31a best practice we talked about before,
- 00:39:33using these bulkhead architectures,
- 00:39:35only we're not talking about cells in this case.
- 00:39:37We're gonna be talking about shards
- 00:39:38and specifically shuffle sharding.
- 00:39:40So I'm gonna take about three slides
- 00:39:42to explain shuffle sharding.
- 00:39:43Now, a warning about shuffle sharding:
- 00:39:46I've seen one-hour talks at this conference
- 00:39:48just to explain shuffle sharding
- 00:39:50and I'm gonna do it in three slides.
- 00:39:51So hopefully I land the message. If I don't, don't sweat it.
- 00:39:53I think you can still follow along.
- 00:39:55All right, so this is just an example
- 00:39:57of some service that has multiple workers.
- 00:39:59They could be EC2 instances,
- 00:40:01or like in the case of CPP, they could be ECS services,
- 00:40:04and on top, those different symbols are different clients
- 00:40:07or different callers of the service
- 00:40:09and there's no sharding going on here,
- 00:40:11and the thing about no sharding is what happens when
- 00:40:14one of those clients does what we call a poison pill.
- 00:40:16It makes either a malformed request, a corrupt request,
- 00:40:20maybe even a malicious request,
- 00:40:21something that kills the service, the process running on it.
- 00:40:25Maybe it tickles a bug that we didn't know we had.
- 00:40:28It takes down that worker.
- 00:40:31Okay, no problem. We have load balancing, right?
- 00:40:33That worker's down. Let's try another worker.
- 00:40:36Oh, it takes that one down.
- 00:40:38It takes the next one down too
- 00:40:40and eventually will work its way through all the workers
- 00:40:43until there are no workers and everybody's outta luck.
- 00:40:45All the clients are now red.
- 00:40:47Nobody's able to call the service.
- 00:40:50That's no sharding. So let's introduce sharding, okay?
- 00:40:55This is sharding.
- 00:40:56Unlike cells,
- 00:40:59which were the entire stack,
- 00:41:00this is just taking some resource layer
- 00:41:01and dividing it into chunks, in this case, chunks of two.
- 00:41:05Shards of two workers each,
- 00:41:06and in this case, the cat does its thing,
- 00:41:09kills its two workers.
- 00:41:10It and the dog are unhappy, but everybody else is happy.
- 00:41:14That's the bulkhead architecture at work.
- 00:41:16It contained the failure to that shard
- 00:41:19and you could see number of customers impacted
- 00:41:21is customers, 8, divided by shards, 4.
- 00:41:232 customers impacted. It checks out, right?
- 00:41:27Okay, now shuffle sharding.
- 00:41:29This is where it gets interesting.
- 00:41:30Each client in this case is assigned
- 00:41:33its own unique pair of two workers,
- 00:41:36but they can be sharing workers.
- 00:41:38So what do I mean by this?
- 00:41:39If you look at the bishop here,
- 00:41:40bishop has these two workers.
- 00:41:42That's the bishop shard. The rook has these two workers.
- 00:41:46They are two unique pairs, they're not the same pair,
- 00:41:49but they're sharing a worker.
- 00:41:51Same thing here.
- 00:41:51The cat has these two workers,
- 00:41:53but again, it has its own unique pair of workers.
- 00:41:56None of the clients, none of the eight clients here
- 00:41:58shares the same two.
- 00:42:00They each share with other shards,
- 00:42:02but none of them share the same two with another client.
- 00:42:05So in this case, what happens is the cat does its thing,
- 00:42:08kills its two workers.
- 00:42:09It's down, but even though the rook
- 00:42:12was sharing one worker with it,
- 00:42:14it still has a healthy worker.
- 00:42:15It has its own unique pair of workers.
- 00:42:17Same thing with the bishop and everybody else.
- 00:42:20So the number of customers impacted
- 00:42:22is customers divided by combinations,
- 00:42:24which in this case is 8 customers,
- 00:42:26and I made 8 shuffle shards,
- 00:42:28so only 1 customer was affected,
- 00:42:30which means that, at scale,
- 00:42:32our scope of impact is 12 1/2 percent, or 1/8,
- 00:42:35meaning that if we had 800 clients, 100 would be impacted,
- 00:42:38but it gets better than this 'cause actually,
- 00:42:40I can make more than 8 shuffle shards out of this.
- 00:42:43With 8 workers and making combinations of 2,
- 00:42:47some of you might recognize this math, it's 8 choose 2,
- 00:42:51and you can actually make 28 combinations,
- 00:42:52so the actual scope of impact is much less,
- 00:42:56and if you don't know that math, don't worry about it.
- 00:42:57This is how many combinations of unique sets of 2 workers
- 00:43:01you can make given 8, and if you really wanna go crazy,
- 00:43:05there's Route 53, over 2,000 workers making shards of 4.
- 00:43:10Your scope of impact is 1 in 730 billion.
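If you want to check that math yourself, the combination counts fall straight out of "n choose k". The 2,048-worker figure in the last line is only an assumption that reproduces the roughly 730 billion number; the talk just says "over 2,000 workers" making shards of 4.

```python
from math import comb

print(comb(8, 2))      # 28 possible shuffle shards from 8 workers taken 2 at a time
print(comb(60, 3))     # 34,220 -- the CPP case coming up: 60 workers, shards of 3
print(comb(2048, 4))   # ~731 billion -- assuming 2,048 workers in shards of 4 (Route 53 example)
```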
- 00:43:12The math gets kinda crazy at that point,
- 00:43:15but getting back to CPP.
- 00:43:16All right, so CPP has its workers.
- 00:43:19Its workers are tasked with processing these workloads
- 00:43:22of 30 machine learning models at a time.
- 00:43:25There's over 10,000 machine learning models
- 00:43:28and we need to process them in batches of 30.
- 00:43:31So that's 400 total groupings, 400 shards we're gonna need,
- 00:43:35because a given shard we want to be processing
- 00:43:38given machine learning models and not swapping them out.
- 00:43:40So we're gonna need 400 shards.
- 00:43:41So if there are 60 workers, we can do shuffle sharding.
- 00:43:45The blue shard, the green shard, and the orange shard,
- 00:43:49can you see they share workers with each other,
- 00:43:52but it's each one's a unique combination
- 00:43:54of three in this case.
- 00:43:57So what happens if we have that poison pill incident
- 00:43:59where the orange shard goes down
- 00:44:02because those 30 models were somehow corrupt
- 00:44:04and something happens and that shard is poisoned?
- 00:44:08If we had no shuffle sharding, just standard sharding,
- 00:44:11we took our 400 machine learning groups
- 00:44:13and distributed 'em over 20 shards,
- 00:44:15'cause if we take 60 divided by 3, that's 20,
- 00:44:17then 20 of those machine learning groups would be affected,
- 00:44:20but if we use shuffle sharding, we can create 400 shards.
- 00:44:23So 400 groups of machine learning models, 400 shards.
- 00:44:27One of them gets poisoned, then only that one is affected,
- 00:44:30and again, to remind you,
- 00:44:31it's the same case as we saw with the cat.
- 00:44:33It's because each one has its own unique group
- 00:44:35of three workers, and just to go crazy,
- 00:44:38actually you could create a lot more than 400 shards.
- 00:44:4160 choose 3 is over 34,000.
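One way to picture how a model group could get its own trio of workers is a deterministic assignment like the sketch below. This is an illustration, not the CPP team's actual algorithm: hashing the group ID seeds a random sample, so the same group always maps to the same three workers, and a production assignment would additionally guarantee that no two groups end up with the exact same trio.

```python
import hashlib
import random

NUM_WORKERS = 60
SHARD_SIZE = 3


def shuffle_shard(group_id: str) -> list[int]:
    """Map a model-group ID to a stable set of 3 workers out of 60 (illustrative only)."""
    seed = int(hashlib.sha256(group_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return sorted(rng.sample(range(NUM_WORKERS), SHARD_SIZE))


# The same group always lands on the same workers; different groups overlap
# but generally get different combinations of three.
print(shuffle_shard("model-group-017"))
print(shuffle_shard("model-group-018"))
```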
- 00:44:45All right, and to bring this home,
- 00:44:48the other thing they're doing is
- 00:44:49to implement loosely coupled dependencies.
- 00:44:51Let me show you how that works. So remember the router?
- 00:44:54All the smarts are in that Lambda there.
- 00:44:55It decides which shard.
- 00:44:57So remember, before I said worker.
- 00:44:58The Lambda actually decides which shard
- 00:45:00it's gonna send the request to.
- 00:45:05Remember, it's putting things on an SQS queue
- 00:45:07which are then being picked up by the worker.
- 00:45:09So that Lambda is actually monitoring those queues.
- 00:45:11It's actually looking at the age of the oldest message.
- 00:45:14If the age of the oldest message is pretty old,
- 00:45:16it probably means that queue is pretty slow and congested.
- 00:45:19So it's actually using back-pressure
- 00:45:21to decide which worker inside a shard it's gonna call.
- 00:45:26So within a shard of three,
- 00:45:27it can choose the worker that's the least busy.
- 00:45:29So you can see there the middle one's the least busy
- 00:45:31so that's the one it chooses.
- 00:45:32So that Lambda is not just a router.
- 00:45:34It's also a load balancer using back-pressure
- 00:45:38to route along those workers in a shard.
- 00:45:42Now, what if the load is too high?
- 00:45:44What if there's a spike and all of the workers,
- 00:45:46all three workers in the shard are overloaded?
- 00:45:48This is where load shedding comes in.
- 00:45:50It'll send the request to a load shedding queue
- 00:45:53and come back to it after 15 minutes.
- 00:45:55Why 15 minutes? Well, those ECS services are auto-scaling.
- 00:45:59They're based on CPU levels.
- 00:46:01So if there really is a spike going on,
- 00:46:03those ECS services are gonna see elevated CPU
- 00:46:06and they're gonna scale out.
- 00:46:07So 15 minutes later, we'll come back,
- 00:46:09reprocess that request and it should work at that point.
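Here is a rough sketch of what that routing logic could look like inside the Lambda, assuming the queue age is read from the standard ApproximateAgeOfOldestMessage metric that SQS publishes to CloudWatch. The queue URLs, the congestion threshold, and the 900-second delay (SQS's maximum, which matches the 15 minutes mentioned) are illustrative values, not the team's actual configuration.

```python
import json
from datetime import datetime, timedelta

import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

OVERLOAD_AGE_SECONDS = 300  # hypothetical threshold for "this queue is congested"
LOAD_SHED_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/load-shed"  # placeholder


def oldest_message_age(queue_name: str) -> float:
    """Read the ApproximateAgeOfOldestMessage metric for one worker queue."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Maximum"],
    )
    points = resp["Datapoints"]
    return max(p["Maximum"] for p in points) if points else 0.0


def route_request(request: dict, shard_queues: dict[str, str]) -> None:
    """Pick the least congested worker queue in the shard, or shed load for ~15 minutes."""
    ages = {name: oldest_message_age(name) for name in shard_queues}
    best = min(ages, key=ages.get)
    if ages[best] < OVERLOAD_AGE_SECONDS:
        # Back-pressure routing: send to the least busy worker in the shard.
        sqs.send_message(QueueUrl=shard_queues[best], MessageBody=json.dumps(request))
    else:
        # Every worker in the shard is overloaded: shed the load and retry later,
        # giving ECS auto scaling time to react to the elevated CPU.
        sqs.send_message(QueueUrl=LOAD_SHED_QUEUE_URL,
                         MessageBody=json.dumps(request),
                         DelaySeconds=900)
```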
- 00:46:16All right, this is actually our last example of the day.
- 00:46:18It's about Amazon Search
- 00:46:20and how they're using chaos engineering
- 00:46:21to be ready for Prime Day, any day.
- 00:46:25Amazon Search, I think you've all seen it.
- 00:46:26You have probably all seen the search bar here.
- 00:46:28Why don't we search for chaos engineering
- 00:46:30and see what we get?
- 00:46:33All right, over 1,000 results.
- 00:46:35Okay, the top result there is the "Chaos Engineering" book
- 00:46:38by Casey and Nora.
- 00:46:39That's sort of the chaos engineering bible.
- 00:46:41So that's a good result.
- 00:46:42I also want you to notice the SLO,
- 00:46:49the service level objective book there with the doggy on it
- 00:46:51'cause that's gonna be important too,
- 00:46:53and we're talking about scale here.
- 00:46:54So we're talking about millions of products.
- 00:46:56We're talking about 300 million active users.
- 00:46:59We're talking about last Prime Day,
- 00:47:0184,000 requests per second peak during Prime Day.
- 00:47:04Again, the whole point is to show you
- 00:47:05the scale of these services and how they're using AWS
- 00:47:08to meet the need of that scale,
- 00:47:10and Search, like everything else I showed you,
- 00:47:12consists of multiple backend services
- 00:47:14and uses multiple AWS resources,
- 00:47:17and what I really like about the Search team
- 00:47:19is they have their own resilience team.
- 00:47:21So they have a built-in team dedicated to resilience
- 00:47:23doing operational resilience
- 00:47:25and site reliability engineering
- 00:47:27for the Search org across those 40 services,
- 00:47:30and their main goal, their main motto is,
- 00:47:32"We test, improve, and drive the resilience
- 00:47:34of Amazon Search services."
- 00:47:36How do they do that?
- 00:47:36They do that by promoting resilience initiatives,
- 00:47:38helping with load testing and helping to promote
- 00:47:42and orchestrate chaos engineering,
- 00:47:44and that's the part I want to talk about.
- 00:47:47So the best practice in this case is use chaos engineering
- 00:47:50to test your workload, to test your resilience.
- 00:47:54So what is chaos engineering?
- 00:47:56I'm gonna read a slide for you.
- 00:47:57"Chaos engineering is the discipline of experimenting
- 00:47:59on a system in order to build confidence
- 00:48:01in the system's capability to withstand
- 00:48:03turbulent conditions in production."
- 00:48:05Turbulent conditions in production.
- 00:48:06I think we could all identify with that,
- 00:48:08unusual user activity, network issues,
- 00:48:12infrastructure issues, bad deployments.
- 00:48:15I mean, it's a mess out there
- 00:48:17and we need to be resilient to that.
- 00:48:19So the thing to know about chaos engineering,
- 00:48:21it's not about creating chaos.
- 00:48:23It's about acknowledging the chaos that already exists
- 00:48:26and preparing for it and mitigating it
- 00:48:29and avoiding the impact of that chaos.
- 00:48:31So that's the way you gotta be thinking
- 00:48:32about chaos engineering.
- 00:48:34So how do you do chaos engineering?
- 00:48:36This is a one-slide summary of how to do chaos engineering.
- 00:48:39Chaos engineering is ultimately at its core
- 00:48:40a scientific method.
- 00:48:42This is a circular cycle,
- 00:48:44but I'm gonna start with steady state.
- 00:48:46What the heck is steady state?
- 00:48:47Steady state means your workload, the workload under test
- 00:48:49is operating within design parameters,
- 00:48:51and you have to be able to measure that.
- 00:48:53You have to be able to assign metrics to say
- 00:48:54what does it mean to operate within design parameters.
- 00:48:57Then is the hypothesis.
- 00:48:59The hypothesis is if some bad thing happens,
- 00:49:02and you specify the bad thing, if an EC2 instance dies,
- 00:49:05if an Availability Zone is not available,
- 00:49:07if a network link goes out, then my system,
- 00:49:11because I designed it that way, will maintain steady state.
- 00:49:15It will stay within those operational parameters.
- 00:49:17Now, if you didn't design it that way,
- 00:49:18don't do the chaos engineering,
- 00:49:20but if you designed it that way, you're testing that.
- 00:49:22So you run the experiment. You simulate that EC2 failure.
- 00:49:25You simulate that network link outage,
- 00:49:27and then you validate.
- 00:49:28You verify whether the hypothesis was confirmed.
- 00:49:32If the hypothesis was not confirmed, oh, okay.
- 00:49:34We experienced some sort of outage.
- 00:49:36We went outside of the established parameters.
- 00:49:38We did not maintain steady state. You need to improve.
- 00:49:41You improve by redesigning,
- 00:49:43applying the best practices in the reliability pillar,
- 00:49:46and then you test it again.
- 00:49:47You run the experiment again.
- 00:49:49Oh, now the hypothesis is confirmed
- 00:49:51and we're back to steady state
- 00:49:52and the whole thing repeats all over again.
- 00:49:56So service level objectives,
- 00:49:57I told you this would come up again, so here it is.
- 00:50:00This is an example service level objective.
- 00:50:01This is not one they actually use.
- 00:50:04They didn't really wanna share those,
- 00:50:05but they did want to share the format of it.
- 00:50:07So this is the format of it.
- 00:50:08In a 28-day trailing window, we'll see 99.9% of requests
- 00:50:12with a latency of less than one second.
- 00:50:14That's an example of a service level objective
- 00:50:17that might be used by the Search team,
- 00:50:19and with this service level objective,
- 00:50:20we've established something called the error budget.
- 00:50:22So what's the error budget?
- 00:50:23Well, 99.9% means that .1% can be greater than a second.
- 00:50:29So that's the start of our budget. That's our budget.
- 00:50:31However, with every request that exceeds one second,
- 00:50:36we're consuming that budget.
- 00:50:38Eventually that whole thing will be consumed
- 00:50:39and we'll be out of budget and you can actually look at
- 00:50:42how fast that budget's being burned.
- 00:50:43It's called the burn rate, but there's good news.
- 00:50:46There's a 28-day trailing window.
- 00:50:48So that means the oldest failures,
- 00:50:51the oldest requests that took longer than a second,
- 00:50:53will eventually age out,
- 00:50:54that is, become older than 28 days,
- 00:50:56and your budget replenishes.
- 00:50:59So that's the concept of the error budget.
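As a worked example of that budget math, here is a small sketch assuming the illustrative SLO above (99.9% of requests under one second over a trailing 28 days): the budget is the 0.1% of requests allowed to be slow, consumption is how much of that you've already used, and the burn rate compares that consumption to how far into the window you are.

```python
SLO_TARGET = 0.999   # 99.9% of requests under one second
WINDOW_DAYS = 28


def error_budget_report(total_requests: int, slow_requests: int, window_elapsed_days: float):
    """Illustrative error-budget math for a latency SLO over a trailing window."""
    budget = (1 - SLO_TARGET) * total_requests           # requests allowed to exceed 1 second
    consumed = slow_requests / budget if budget else 0.0  # fraction of the budget already used
    # A burn rate above 1 means the budget is being consumed faster than the window replenishes it.
    burn_rate = consumed / (window_elapsed_days / WINDOW_DAYS) if window_elapsed_days else 0.0
    return {"budget_requests": budget, "budget_consumed": consumed, "burn_rate": burn_rate}


# Example: 10M requests so far, 7,000 of them slower than a second, 14 days into the window.
print(error_budget_report(10_000_000, 7_000, 14))  # burn rate 1.4 -- on pace to blow the budget
```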
- 00:51:03So they wanna do customer-obsessed chaos engineering.
- 00:51:06Chaos engineering is not for the engineering teams.
- 00:51:09It's not for the developers.
- 00:51:10It's so we can establish an experience for our customers
- 00:51:14that's gonna serve their needs,
- 00:51:15and they thought SLO was the best way to do that.
- 00:51:17It's very customer focused.
- 00:51:18It's focused on what the customers experience
- 00:51:20and so the experiments must stay within the error budget,
- 00:51:25and the stop conditions for an experiment,
- 00:51:27you must always have stop conditions
- 00:51:28on your chaos engineering experiments,
- 00:51:30are if the burn rate is too high on the error budget,
- 00:51:34the experiment stops.
- 00:51:36If the Andon cord is pulled.
- 00:51:37So the Andon cord goes back to the Toyota factories
- 00:51:40where they had an actual cord
- 00:51:42that anybody on the assembly line could pull
- 00:51:45if they saw a quality issue.
- 00:51:46Same thing here.
- 00:51:48Several people across the org can push this button
- 00:51:50and will stop and roll back any experiment at any time,
- 00:51:53and then the last thing is
- 00:51:54if there's some kind of big event
- 00:51:56happening across Amazon, then that's not a good time
- 00:51:59to be doing your chaos engineering,
- 00:52:00so let's stop it and roll it back then too,
- 00:52:04and this is what they designed.
- 00:52:06We're here to talk about architecture.
- 00:52:07So on the right, I just wanna point out it's all centered
- 00:52:10on Fault Injection Simulator.
- 00:52:11Fault Injection Simulator is an AWS service
- 00:52:14that you can use to run chaos experiments
- 00:52:18and they did build around that.
- 00:52:20So on the far right, you can see ECS and EC2.
- 00:52:23That's the search services.
- 00:52:25Remember, there's 40-plus search services.
- 00:52:27So they're using Fault Injection Simulator
- 00:52:29to do chaos engineering on those services.
- 00:52:32What they built was the part on the left.
- 00:52:33That's the orchestration piece.
- 00:52:35Okay, now follow me down to the API Gateway
- 00:52:38in the lower left-hand corner.
- 00:52:39You can see two APIs.
- 00:52:40The first one's the Andon API and the Andon API
- 00:52:44establishes and configures the Andon cords.
- 00:52:46It does this by setting up various CloudWatch alarms
- 00:52:49that FIS will respond to.
- 00:52:51FIS has guardrails.
- 00:52:52Remember, I said a good experiment has to have a guardrail.
- 00:52:56So FIS has guardrails based on CloudWatch
- 00:52:57and so when someone pulls the Andon cord,
- 00:53:00it sets the CloudWatch alarm
- 00:53:01which then stops the FIS experiment.
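To give a feel for how an Andon-cord alarm can plug into FIS, here is a minimal sketch of creating an experiment template whose stop condition is a CloudWatch alarm, using the boto3 fis client. The ARNs, tags, and the specific action (terminating a small percentage of tagged EC2 instances) are placeholder assumptions, not the Search team's actual templates.

```python
import uuid

import boto3

fis = boto3.client("fis")

ANDON_ALARM_ARN = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:search-andon-cord"  # placeholder
FIS_ROLE_ARN = "arn:aws:iam::123456789012:role/fis-experiment-role"                    # placeholder

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Terminate a slice of search workers; stop if the Andon cord alarm fires",
    roleArn=FIS_ROLE_ARN,
    stopConditions=[
        # The guardrail: if this alarm goes into ALARM state (for example, someone pulls
        # the Andon cord or the SLO burn rate is too high), FIS halts the experiment.
        {"source": "aws:cloudwatch:alarm", "value": ANDON_ALARM_ARN},
    ],
    targets={
        "search-workers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"service": "search"},   # placeholder tag
            "selectionMode": "PERCENT(10)",
        }
    },
    actions={
        "terminate-some-workers": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "search-workers"},
        }
    },
    tags={"team": "search-resilience"},
)
print(template["experimentTemplate"]["id"])
```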
- 00:53:03Okay, the other API is the run API.
- 00:53:05It has the ability to run an experiment,
- 00:53:08to schedule it for later,
- 00:53:10so there's a Lambda there that's a scheduler
- 00:53:12that can store schedules in DynamoDB,
- 00:53:14and it provides orchestration.
- 00:53:16You see on the right there those three Lambdas.
- 00:53:19So not only can you run the experiment,
- 00:53:21but it gives you the ability to do things
- 00:53:22before the experiment.
- 00:53:23What might you wanna do before an experiment?
- 00:53:24You might wanna send out an alert to various personnel.
- 00:53:27You might wanna stop any in-process deployments.
- 00:53:31So there's various things you might wanna do
- 00:53:32before an experiment, then run the experiment using FIS,
- 00:53:35and then do post-experiment operations,
- 00:53:37like for instance, cleaning things up,
- 00:53:39because some experiments,
- 00:53:42some fault injections, aren't self-correcting,
- 00:53:44so you actually have to go in and correct them,
- 00:53:46so it might do something like that,
- 00:53:49and so FIS exists, it's a great service,
- 00:53:52so why did they build this orchestration piece?
- 00:53:56This is why.
- 00:53:57Number one is they're serving 40-plus teams.
- 00:53:58They wanna provide a single pane of glass,
- 00:54:01a consistent experience across those teams
- 00:54:02and make it super easy for them to do chaos engineering.
- 00:54:07They also wanted to add the ability to do scheduling
- 00:54:09and to be able to run it with deployments,
- 00:54:11which FIS can do, but remember,
- 00:54:13all these 40 services are using a pipeline system in common,
- 00:54:16so the orchestration is designed around that
- 00:54:19to make it super easy to run it with deployments
- 00:54:22and provide consistent guardrails.
- 00:54:23Remember, the SLOs are the important guardrail.
- 00:54:26So they actually have as part of their system
- 00:54:28storage of all the various SLOs
- 00:54:30so that it uses that during experimentation
- 00:54:32to provide a guardrail.
- 00:54:34The Andon cord functionality is not natively part of FIS,
- 00:54:37so they're providing that, and metrics and insights.
- 00:54:39Of course FIS emits metrics,
- 00:54:42but now they could roll up all the metrics
- 00:54:44from all 40 services and provide them
- 00:54:46as a single report to management
- 00:54:48about what kinda chaos engineering they're doing,
- 00:54:52and plus, in addition to FIS,
- 00:54:53they wanna be able to run other kinds of faults.
- 00:54:55Let's talk about that. Oh, well, no, first:
- 00:54:57when they were doing all this, why did they do it?
- 00:54:59Why did they build the orchestrator?
- 00:55:00'Cause they wanna be ready for Prime Day, any day.
- 00:55:03All right, types of faults. First, there are the FIS faults.
- 00:55:06These are all supported by FIS,
- 00:55:07things like dropping ECS nodes, killing EC2 instances,
- 00:55:12and injecting latency.
- 00:55:17SSM, or Systems Manager, lets you run
- 00:55:19any kind of automation you want,
- 00:55:20so you can maybe even simulate an Availability Zone outage,
- 00:55:23but what kind of faults are they doing that's not FIS?
- 00:55:26Well, there's load testing because, actually,
- 00:55:29internal to Amazon, there's a load test tool
- 00:55:30that's very popular across Amazon teams.
- 00:55:32They wanna be able to use that
- 00:55:33as part of their experimentation,
- 00:55:35and there's emergency levers.
- 00:55:36So emergency levers are things you can do
- 00:55:40as an operator of a service to help a service under duress.
- 00:55:43For a service under duress, you pull the emergency lever
- 00:55:46and now the service can operate well,
- 00:55:48so for instance, blocking all robots,
- 00:55:51and ultimately, I'm running a little low on time,
- 00:55:53so I'm gonna speed through this, ultimately,
- 00:55:55they wanna provide a benefit to the end user.
- 00:55:56I just wanna point out that it's about higher availability,
- 00:55:58improved resiliency for the end customer,
- 00:56:01and so I wanna talk about graceful degradation
- 00:56:03and emergency levers.
- 00:56:06What does the emergency lever look like for Search?
- 00:56:07Well, with one of their emergency levers,
- 00:56:10they actually pull the lever
- 00:56:12and it causes graceful degradation on purpose.
- 00:56:14So this is what Search looks like,
- 00:56:16a full Search experience if I'm searching for Lego,
- 00:56:18but if I pull the emergency lever,
- 00:56:21it'll turn off non-critical services.
- 00:56:23So critical services like the image,
- 00:56:25the title, the price are all still there,
- 00:56:27but non-critical services like the reviews
- 00:56:29or the age range are not there.
- 00:56:31So for a system under duress, this can help,
- 00:56:34and they test this using chaos engineering.
- 00:56:37The hypothesis is the lever works
- 00:56:40and it enables Search to handle the stress.
- 00:56:42So they literally generate load during the test
- 00:56:45and then pull the lever and validate that,
- 00:56:48yes, the system's able to handle the duress.
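A minimal sketch of that kind of lever, assuming a hypothetical flag (in practice it might be read from something like a parameter store): when the lever is pulled, the result card is assembled from critical widgets only, and the non-critical ones are simply skipped.

```python
# Hypothetical widget sets; the real Search page composition is far richer than this.
CRITICAL_WIDGETS = ["image", "title", "price"]
NON_CRITICAL_WIDGETS = ["reviews", "age_range", "recommendations"]


def render_search_result(item: dict, lever_pulled: bool) -> dict:
    """Assemble a result card, dropping non-critical widgets when the emergency lever is on."""
    widgets = CRITICAL_WIDGETS if lever_pulled else CRITICAL_WIDGETS + NON_CRITICAL_WIDGETS
    return {name: item.get(name) for name in widgets}


# Under duress, the operator flips the lever and every result degrades gracefully.
lego = {"image": "lego.jpg", "title": "LEGO set", "price": "$49", "reviews": 4.8, "age_range": "9+"}
print(render_search_result(lego, lever_pulled=False))  # full experience
print(render_search_result(lego, lever_pulled=True))   # degraded but still functional
```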
- 00:56:51All right, so in summary,
- 00:56:52these are the five services I covered.
- 00:56:55This part's important to me.
- 00:56:57Okay, so to get all this information
- 00:56:58so I could share it with you, and I hope you enjoyed it,
- 00:57:01I had to work with many smart engineers on multiple teams,
- 00:57:04and the thing about smart engineers
- 00:57:06working on cool stuff is that they're really busy,
- 00:57:09and they took time out to spend with me
- 00:57:11to explain this to me so I could share it with you,
- 00:57:13so my deepest appreciation to those engineers
- 00:57:16and my awe at the engineering that they did.
- 00:57:18I really am impressed by it.
- 00:57:19Hopefully you're impressed by it too.
- 00:57:22Some resources. I won't spend too much time on here.
- 00:57:24You wanna take a snap of that real quick?
- 00:57:26Upcoming talks that might cover the things we talked about,
- 00:57:29and also, two of the examples I covered
- 00:57:31actually have some external resources
- 00:57:33you can check out if you want to learn more.
- AWS
- Scalability
- Reliability
- Well-Architected Framework
- Chaos Engineering
- Microservices
- Automation
- Cloud Computing
- Amazon
- Architecture