AWS re:Invent 2022 - Reliable scalability: How Amazon.com scales in the cloud (ARC206)
Summary
TL;DR: In this session, Seth Eliot discusses how Amazon.com achieves reliable scalability on AWS, detailing the evolution of its architecture from a simple setup in 1995 to a complex microservices architecture today. He emphasizes the importance of scalability, reliability, and the Well-Architected Framework, which includes best practices for building in the cloud. The session features examples such as IMDb's transition to serverless microservices and Global Ops Robotics' cell-based architecture, highlighting the use of automation, chaos engineering, and other strategies to ensure systems can handle high traffic and maintain performance during peak times.
Takeaways
- 👋 Welcome and introduction to reliable scalability.
- 📜 History of Amazon's architecture evolution.
- ⚙️ Importance of scalability for handling increased loads.
- 🔧 Overview of the Well-Architected Framework.
- 📈 IMDb's transition to serverless microservices.
- 🏭 Global Ops Robotics and cell-based architecture.
- 🚚 Amazon Relay app for truck management.
- 🔍 Chaos engineering for testing resilience.
- 📊 Best practices for reliability in cloud services.
- 🤝 Acknowledgment of the engineers behind the systems.
Timeline
- 00:00:00 - 00:05:00
The session introduces the topic of reliable scalability and how Amazon.com utilizes AWS for cloud scalability. Seth Eliot, the speaker, shares his background and experience with Amazon, including his work on the .com side and AWS. He presents a historical overview of Amazon's architecture, starting from its early days in 1995, highlighting the evolution of its systems to meet the demand for scalability.
- 00:05:00 - 00:10:00
The need for scalability is emphasized, defined as the ability of a workload to perform its function as the load changes. The architecture evolved from a single server and database to a service-oriented architecture, allowing for more agile development and deployment. The talk transitions to the current architecture, which consists of tens of thousands of microservices interconnected through various dependencies.
- 00:10:00 - 00:15:00
The focus shifts to the Well-Architected Framework, particularly the reliability pillar, which includes best practices for building in the cloud. The speaker introduces the first example of IMDb's transition to serverless microservices, explaining how they moved from a monolithic architecture to a federated schema with microservices, improving scalability and reliability.
- 00:15:00 - 00:20:00
The architecture of IMDb is discussed, showcasing a gateway-based architecture that connects various backend microservices. The importance of the two-pizza team model is highlighted, emphasizing ownership and accountability within teams, leading to smoother operations and better on-call experiences.
- 00:20:00 - 00:25:00
The next best practice discussed is automation in resource scaling. The speaker explains how AWS Lambda functions automatically scale based on requests, and IMDb implemented provisioned concurrency to avoid cold starts, ensuring a seamless user experience during peak loads.
- 00:25:00 - 00:30:00
The architecture of IMDb's API gateway is examined, detailing the use of a web application firewall (WAF) and a content delivery network (CDN) to enhance security and performance. The speaker emphasizes the importance of using highly available public endpoints to reduce high-severity issues caused by bots.
- 00:30:00 - 00:35:00
The session transitions to Global Ops Robotics, focusing on warehouse management and the use of a cell-based architecture to protect workloads. The concept of bulkhead architecture is introduced, explaining how compartmentalization helps contain failures and maintain operational continuity across fulfillment centers.
- 00:35:00 - 00:40:00
The discussion moves to Amazon Relay, which manages the middle mile of the supply chain. The speaker explains how they use multi-region deployments to ensure truck operations continue smoothly, even during service disruptions. The importance of graceful degradation and failover strategies is emphasized to maintain critical functions during outages.
- 00:40:00 - 00:45:00
The Classification and Policies Platform is introduced, showcasing how Amazon classifies millions of items using machine learning models. The concept of shuffle sharding is explained, demonstrating how it limits the blast radius of failures by assigning unique worker pairs to clients, enhancing reliability and scalability.
- 00:45:00 - 00:57:38
The final example focuses on Amazon Search and the implementation of chaos engineering to test system resilience. The speaker outlines the chaos engineering process, emphasizing the importance of steady state, hypothesis testing, and the use of service level objectives (SLOs) to ensure a positive customer experience during turbulent conditions.
Video Q&A
What is the main focus of the session?
The session focuses on how Amazon.com achieves reliable scalability using AWS.
Who is the speaker?
The speaker is Seth Eliot, a principal developer advocate at AWS.
What is the Well-Architected Framework?
The Well-Architected Framework consists of best practices for building in the cloud, focusing on reliability among other pillars.
What is chaos engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
What is the significance of scalability for Amazon?
Scalability allows Amazon to handle increased loads and maintain performance as the scope of their operations changes.
What architectural changes did IMDb implement?
IMDb transitioned from a monolithic architecture to serverless microservices using AWS Lambda.
What is a two-pizza team?
A two-pizza team is a small, cross-functional team at Amazon that can be fed with two pizzas, emphasizing ownership and agility.
How does Amazon ensure reliability in its services?
Amazon uses best practices from the Well-Architected Framework, including automation, bulkhead architectures, and chaos engineering.
What is the role of the Global Ops Robotics team?
The Global Ops Robotics team manages warehouse operations and uses a cell-based architecture for reliability.
What is the purpose of the Relay app?
The Relay app helps truck drivers manage their loads and routes in Amazon's middle mile logistics.
- 00:00:00- Hello. Welcome, everyone.
- 00:00:01Thank you so much for choosing my session.
- 00:00:03I really appreciate you being here.
- 00:00:04You're here of course for reliable scalability,
- 00:00:08how amazon.com runs on AWS
- 00:00:11and how it scales in the cloud on AWS,
- 00:00:13and we're gonna talk a lot about examples of how amazon.com,
- 00:00:17a large, sophisticated customer, uses AWS.
- 00:00:21So my name is Seth Eliot.
- 00:00:22I am currently a developer advocate,
- 00:00:24principal developer advocate for developer relations,
- 00:00:27just that's a recent change for me.
- 00:00:29Prior to that, I was the reliability lead
- 00:00:31for AWS Well-Architected,
- 00:00:32and Well-Architected's gonna play a big part
- 00:00:34in the talk today, but even before that,
- 00:00:36I actually worked for amazon.com.
- 00:00:38So I joined Amazon back in 2005
- 00:00:40and was working on the .com side before moving to AWS.
- 00:00:45So I always like to start off
- 00:00:46with a bit of a history lesson.
- 00:00:47Now, I wasn't there in 1995,
- 00:00:49but this is what the website looked like in 1995.
- 00:00:52Take it in in all its glory.
- 00:00:55Quite amazing for the time period, actually,
- 00:00:57and this is the architecture used
- 00:01:00to run that website you just saw.
- 00:01:01So I want to draw your attention
- 00:01:03to the box that says Obidos.
- 00:01:03Obidos is the place in Brazil
- 00:01:06where the Amazon River is at its narrowest and swiftest,
- 00:01:09and back in those days, they named a lotta things
- 00:01:12after places in Brazil and things on the Amazon River,
- 00:01:14and that is the executable.
- 00:01:17That is a single C, not C++,
- 00:01:20but C binary running on a single server
- 00:01:23talking to a single Oracle database
- 00:01:26running on another server called ACB,
- 00:01:28for amazon.com books, that had all the data in it,
- 00:01:31and that essentially was the architecture.
- 00:01:33You could see there's CC motel. That's a credit card system.
- 00:01:35That was separate so that we could have limited access
- 00:01:38to that so that the credit card numbers could be secure,
- 00:01:40and there's a distribution center,
- 00:01:42later renamed fulfillment centers,
- 00:01:43from which your package would be shipped and show up to you.
- 00:01:46So that's the original architecture.
- 00:01:48Now, the motto of Amazon, especially back in those days,
- 00:01:51is get big fast, and you can see that there's a T-shirt
- 00:01:55from one of the picnics about get big fast,
- 00:01:57and to get big fast, you're gonna need scalability.
- 00:02:00So what is scalability?
- 00:02:01Well, scalability is the ability of a workload
- 00:02:03to perform its agreed function as the scope changes,
- 00:02:06as the load or scope changes.
- 00:02:08So to get there, they had to evolve the architecture.
- 00:02:12So the first thing they looked at was the databases.
- 00:02:14You could see they pulled out this Web database there.
- 00:02:16So that Web database interacts with the customer,
- 00:02:19does the ordering, and then asynchronously syncs
- 00:02:21back to the ACB database periodically.
- 00:02:24Similarly, we've added a new distribution center
- 00:02:26and they each get their own databases too.
- 00:02:28So this is one way to remove one of the big bottlenecks,
- 00:02:31which was the database, but that wasn't enough.
- 00:02:34So let's fast forward to 2000
- 00:02:36and talk about a service-oriented architecture.
- 00:02:39Having a single binary,
- 00:02:41it eventually did become C++, like the original,
- 00:02:44the first engineer at Amazon insisted it stay C,
- 00:02:47but he couldn't control it after a time
- 00:02:49and eventually C++ libraries got into it,
- 00:02:51but still, it was a single binary.
- 00:02:53So if you wanted to make a change,
- 00:02:54so let's say you were in charge of implementing
- 00:02:56one-click ordering, one-click purchase,
- 00:02:59you would have to make your change to that binary
- 00:03:02and everybody else is making changes to that binary
- 00:03:04and you're building along with everybody else
- 00:03:06and you're deploying along with everybody else
- 00:03:08and it's just not a very agile system.
- 00:03:10If somebody else breaks the build,
- 00:03:11you're not deploying today, so that's not great,
- 00:03:14so what can we do?
- 00:03:15Well, in addition to splitting out the databases,
- 00:03:17you can see the customer data got pulled out of ACB
- 00:03:20and you don't wanna be calling the database directly,
- 00:03:22so you're gonna put a service in front of it,
- 00:03:23the customer service, and that customer service
- 00:03:25was originally just for select
- 00:03:28and insert onto that database,
- 00:03:30but it became the location for business logic on customers.
- 00:03:33Similarly, there became an order service and an item service
- 00:03:36and this is the first service-oriented architecture
- 00:03:39at Amazon.
- 00:03:41Now, get big fast. Now let's fast forward to the present.
- 00:03:44The previous Prime Day, Amazon did get big.
- 00:03:47We all know Amazon's big, 100,000 items per minute,
- 00:03:5012 billion in sales as of the last Prime Day,
- 00:03:53but you're not here to learn about that.
- 00:03:54You're here to learn about how they're using AWS, right?
- 00:03:57And so I won't read the numbers off of here.
- 00:03:59There's obviously billions and trillions and millions.
- 00:04:02Go ahead and read them.
- 00:04:03It just shows that Amazon did get big fast
- 00:04:06and they're doing it using AWS,
- 00:04:09and they're using AWS to be able to scale
- 00:04:12and scale reliably.
- 00:04:14So if you fast forward to wanna know
- 00:04:16what the architecture looks like today,
- 00:04:17it looks appreciably like it looked back in 2000.
- 00:04:21Anybody believe me on that? No.
- 00:04:24See if you're paying attention. No.
- 00:04:25Okay, so this is actually closer
- 00:04:27to the actual current architecture.
- 00:04:28Each dot on there represents a service or microservice
- 00:04:31or tens of thousands of them running amazon.com
- 00:04:34and they're all connected to each other
- 00:04:36through various dependencies.
- 00:04:37I zoomed in on one of them here just to show you
- 00:04:39that there are indeed lines in that diagram.
- 00:04:41I think the diagram is quite beautiful, isn't it?
- 00:04:43But that is the current architecture
- 00:04:45with tens of thousands of services,
- 00:04:49with many thousands of teams owning those services.
- 00:04:52All right, so that brings us to reliable scalability.
- 00:04:54So reliability is the ability of a workload
- 00:04:57to perform its required function correctly and consistently.
- 00:05:00So as we're thinking about that,
- 00:05:02that's why Amazon needed scalability.
- 00:05:05They needed to get big fast and be reliable,
- 00:05:08hence they needed the scalability,
- 00:05:09and today we're gonna be diving into examples
- 00:05:11of amazon.com teams doing that and building on AWS,
- 00:05:15and we're gonna use the Well-Architected Framework
- 00:05:18as a framework to present that to you.
- 00:05:20So the Well-Architected Framework consists of six pillars
- 00:05:22and they're all important, but honestly,
- 00:05:24today we're focused on reliability.
- 00:05:26The reliability pillar has the,
- 00:05:28Well-Architected has best practices.
- 00:05:30Well-Architected is just a documentation
- 00:05:33of all the best practices for building in the cloud.
- 00:05:35It includes other things too. We have hands-on labs.
- 00:05:38We have a Well-Architected Tool
- 00:05:39where you could review your own workloads,
- 00:05:41but honestly in this case,
- 00:05:43we're gonna look at the best practices reliability pillar.
- 00:05:45There are 66 of them.
- 00:05:46We're not gonna look at all 66 of them, but today,
- 00:05:49as I show you the examples I'm showing you,
- 00:05:51I'm gonna talk about which best practice
- 00:05:53is being illustrated in the architectures we're looking at,
- 00:05:57and we're gonna dive right in with our first example.
- 00:06:02Oh, IMDb re-architected to serverless microservices.
- 00:06:06So IMDb, Internet Movie Database.
- 00:06:09Who here has heard of IMDb?
- 00:06:11Okay, and the rest of you just don't wanna raise your hand
- 00:06:13because you don't wanna raise your hand. (laughing)
- 00:06:17Internet Movie Database was acquired by Amazon in 1998.
- 00:06:22It is the number one location to go to learn about movies,
- 00:06:25TV shows, actors, producers, all that good stuff,
- 00:06:28and prior to the re-architecture,
- 00:06:30they were running a monolithic build with a REST API
- 00:06:35on hundreds of EC2 servers.
- 00:06:38So they're on AWS, but they're running on
- 00:06:40hundreds of EC2 instances, servers,
- 00:06:43and when they re-architected,
- 00:06:44they moved to a federated schema with microservices.
- 00:06:47Now, microservices are small, decoupled services
- 00:06:51focused on a specific business domain.
- 00:06:53As for what federated schema is,
- 00:06:55if you don't know already, I'll get to that,
- 00:06:56and they used Lambda for this.
- 00:06:58So they're using Lambda,
- 00:06:59which is the serverless compute in AWS,
- 00:07:02the ability to run code without servers,
- 00:07:06and now we get to the best practice
- 00:07:08and you're gonna see several of
- 00:07:09these slides throughout the talk.
- 00:07:10They have the Well-Architected logo up there
- 00:07:12and the format might be a little odd.
- 00:07:13This, what it is is a snapshot of the Well-Architected Tool
- 00:07:16which is in the AWS console,
- 00:07:18and the way best practices are shown in the framework
- 00:07:20is there's a question that represents
- 00:07:23a set of best practices.
- 00:07:24Then each of those check boxes are a best practice.
- 00:07:27So in this case, the best practices we're interested in is,
- 00:07:29how do you segment your workload,
- 00:07:31and then how do you focus those segments
- 00:07:35on specific business use cases, on specific business needs?
- 00:07:38And you can see I circled microservices there.
- 00:07:40You don't have to use microservices
- 00:07:42to achieve these best practices,
- 00:07:43but that is what IMDb did, so therefore it's circled.
- 00:07:46So these are the first two best practices to look at,
- 00:07:48and to look at that, we're gonna ask a question.
- 00:07:51You're on IMDb and you type in Jackie Chan
- 00:07:54and what it does is run a query:
- 00:07:56what are the top four shows that Jackie Chan is known for?
- 00:08:00Now, Jackie has an id. Every entity in IMDb has an id.
- 00:08:04This nm is a name entity and that's his entity there, 329,
- 00:08:07and so you as a user don't care about that,
- 00:08:10but you've just asked what is Jackie Chan known for,
- 00:08:13and this is the query that the client creates.
- 00:08:16It's GraphQL.
- 00:08:18GraphQL is a query language
- 00:08:19that lets you set up queries like this where you can,
- 00:08:23using a schema, request information
- 00:08:26and get all that information back at once.
- 00:08:27Like with REST, you'd probably have to make four calls.
- 00:08:29Here, you just do it all at once,
- 00:08:31and what is being requested here?
- 00:08:32You could see the name id on top and so that's Jackie Chan,
- 00:08:36and I wanna know the first four things that he's known for,
- 00:08:39and of those, when you gimme those four things,
- 00:08:40I wanna know the title text, I wanna know the release date,
- 00:08:44I wanna know the aggregated ratings,
- 00:08:46and I want an image URL so I could show an image.
- 00:08:48Okay, so that's the query that the front end is making,
- 00:08:51and this is where the microservices
- 00:08:52and the federated schema come into play.
- 00:08:55This request is actually sent
- 00:08:57to four different microservices,
- 00:08:59each fronted by an AWS Lambda in this case.
- 00:09:02So the first one is find me the top four things
- 00:09:05Jackie Chan's known for and it's gonna return the id
- 00:09:07of those four things, which begins with tt.
- 00:09:09Now, that first service doesn't know about
- 00:09:12release date or rating.
- 00:09:13It only knows about top four.
- 00:09:15So the next thing is the title text and the release date.
- 00:09:19That's metadata, so that's gonna go to that other service,
- 00:09:21and that service only knows metadata,
- 00:09:23so it's gonna return the metadata.
- 00:09:24The third one is the ratings,
- 00:09:26so that one only knows ratings.
- 00:09:27It's gonna return the aggregate rating,
- 00:09:28and the last one only knows image URLs,
- 00:09:30so it's gonna return the image URLs,
- 00:09:31and the reason it's a federated schema is
- 00:09:33'cause even though the request is one big schema,
- 00:09:35each of these little microservices only knows
- 00:09:37its own piece to the schema.
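To make the single-round-trip idea concrete, here is a minimal sketch of what such a client call could look like. The gateway URL and field names are assumptions modeled on the description above, not IMDb's actual schema; 329 is the name-entity number mentioned in the talk, padded here in the usual nm-prefixed form.

```python
import json

import requests

GATEWAY_URL = "https://graphql-gateway.example.com/graphql"  # hypothetical endpoint

# One query, answered by four federated microservices behind the gateway.
QUERY = """
query KnownFor($nameId: ID!) {
  name(id: $nameId) {
    knownFor(first: 4) {
      title { text }
      releaseDate
      aggregateRating
      imageUrl
    }
  }
}
"""

def fetch_known_for(name_id: str) -> dict:
    """Send a single GraphQL request; the gateway fans out to the microservices."""
    response = requests.post(
        GATEWAY_URL,
        json={"query": QUERY, "variables": {"nameId": name_id}},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(json.dumps(fetch_known_for("nm0000329"), indent=2))
```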
- 00:09:40So when the front end gets that response to that request,
- 00:09:44it's gonna show it to the user like this.
- 00:09:45You could see Jackie Chan
- 00:09:46and the four things he's known for.
- 00:09:48You could see the release date.
- 00:09:49You could see the aggregated rating,
- 00:09:51and there's only one thing wrong here.
- 00:09:54"Kung Fu Panda" is missing. How could that be?
- 00:09:56I don't know and I really have a bone to pick
- 00:09:58with the IMDb team.
- 00:09:59I'll let them know about it after the talk.
- 00:10:02All right, so now let's get into architecture.
- 00:10:04Okay, so this is what the IMDb architecture looks like.
- 00:10:06It's a gateway-based architecture.
- 00:10:08So they redesigned their gateway
- 00:10:09into the serverless architecture
- 00:10:11so that it can call all of these backend microservices
- 00:10:14that each know their own little piece of the elephant.
- 00:10:17So here's those backend microservices.
- 00:10:18They're just sitting there fronted by Lambda.
- 00:10:20Some of them are completely serverless.
- 00:10:22Many of the newer ones are.
- 00:10:23Some of them, if there was like a legacy service
- 00:10:25or something that they just wanted to update,
- 00:10:27they'll front it with a Lambda so that they could be called,
- 00:10:30and the Lambda's responsible for shaping the data
- 00:10:33so that the GraphQL query response is in the right format.
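As a hedged sketch of that pattern (not IMDb's actual code), a Lambda fronting a legacy ratings service might do little more than call the service and reshape its payload into the slice of the federated schema this microservice owns; the internal URL and field names below are hypothetical.

```python
import json
import urllib.request

# Hypothetical internal endpoint of a legacy ratings service.
LEGACY_RATINGS_URL = "https://ratings.internal.example.com/titles/{title_id}"

def handler(event, context):
    """Resolve only the ratings slice of the schema for a batch of title ids."""
    results = []
    for title_id in event.get("titleIds", []):
        with urllib.request.urlopen(LEGACY_RATINGS_URL.format(title_id=title_id)) as resp:
            legacy = json.load(resp)
        # Reshape the legacy payload into the shape the federated schema expects.
        results.append({
            "id": title_id,
            "aggregateRating": legacy.get("avg_score"),
            "voteCount": legacy.get("num_votes"),
        })
    return {"data": results}
```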
- 00:10:38Okay, over here, okay, so now each of those microservices
- 00:10:41only knows its piece of the schema, and the gateway,
- 00:10:44which is the front end that the client calls,
- 00:10:47needs to know the entire schema.
- 00:10:49You need a schema manager, so here it is.
- 00:10:51When you create a new service or update a service,
- 00:10:54it publishes its little piece of the schema
- 00:10:56to the schema manager, which publishes it into an S3 bucket.
- 00:10:59So the gateway has a full view of the schema,
- 00:11:03and here's the API for, the front end for the API.
- 00:11:06There's an Application Load Balancer. There's a firewall.
- 00:11:09There's a content delivery network piece
- 00:11:11and I'm gonna talk more about that later,
- 00:11:12so I'm gonna put that on hold,
- 00:11:14and this is when we diverge a little bit
- 00:11:16and talk about culture at Amazon.
- 00:11:17I think many of you already heard the two-pizza team.
- 00:11:20A two-pizza team is a team that could be fed
- 00:11:21with approximately two pizzas,
- 00:11:23so not too big, not too small.
- 00:11:25It's a cross-functional team, but it's all about ownership.
- 00:11:28The two-pizza team owns the service
- 00:11:30or services they're responsible for,
- 00:11:31from design to implementation to deployment
- 00:11:35to operation and the business around it.
- 00:11:38So there might be a product manager on the team
- 00:11:40that's a business expert working with developers there.
- 00:11:42So the nice thing about this with this model
- 00:11:44is that this model of creating these federated microservices
- 00:11:49is it moved the business logic for those services
- 00:11:52so that the team could own that business logic.
- 00:11:54So the team that owns the metadata is expert on metadata.
- 00:11:57The team that owns the ratings is expert on ratings,
- 00:12:00and this was organizationally a positive thing for the org,
- 00:12:05and what happened was,
- 00:12:06so they have something called on-call.
- 00:12:08They have a rotating on-call rotation
- 00:12:09where if there's any problem in production,
- 00:12:11they own in production, they have to respond to it,
- 00:12:12and the senior dev told me they were having
- 00:12:14ridiculously smooth on-calls after this
- 00:12:16and that's because the organizational change
- 00:12:19aligned with the technology change
- 00:12:21meant that the teams that owned the business domain
- 00:12:25and the service were available whenever a problem occurred,
- 00:12:28so that a problem occurred in the aggregate service,
- 00:12:31the rating aggregate service,
- 00:12:33that team would be the one called
- 00:12:34and they'd understand what's going on,
- 00:12:35and it also helped that going to serverless
- 00:12:38helped with scalability.
- 00:12:40All right, so the next best practices we're gonna look at
- 00:12:42is using automation when obtaining or scaling resources
- 00:12:46and obtaining resources upon detection that you need them,
- 00:12:48so detecting that you need new resources
- 00:12:51and obtaining them automatically.
- 00:12:53To do that, I'm gonna do a little divergence,
- 00:12:54just talk about Lambda.
- 00:12:55This is not an IMDb architecture.
- 00:12:57This is just a generic serverless architecture
- 00:12:58'cause I wanna talk about Lambda.
- 00:13:00As I said, Lambda is a way to run code without a server,
- 00:13:03but the way it works is you deploy a Lambda instance
- 00:13:07with some code, and then for every request it gets,
- 00:13:10it spins up, invokes a Lambda instance.
- 00:13:13So here you can see six requests. Six Lambdas get spun up.
- 00:13:17They process the requests and if there's no more requests,
- 00:13:19they spin down.
- 00:13:20So it is automatically scaling.
- 00:13:22It'll scale up and down based on
- 00:13:24the number of requests you get,
- 00:13:26and this is the actual metrics for Lambda invocations.
- 00:13:31So these are the number of Lambdas being invoked
- 00:13:33per minute by IMDb and this could also be translated
- 00:13:37to requests per minute because each request,
- 00:13:39each Lambda invocation represents a single request.
- 00:13:42Note it peaks at 800,000 requests per minute,
- 00:13:44which I also converted to requests per second
- 00:13:46if you want to know that,
- 00:13:47and it also goes up and down quite a bit.
- 00:13:49It's quite cyclical.
- 00:13:51I don't know, who saw my Twitter post about this?
- 00:13:52This is one of the things I posted on Twitter and said,
- 00:13:54"Which service is this?"
- 00:13:55Well, it's IMDb. Now you know, and so two things here.
- 00:13:59One is Lambda just scales, right?
- 00:14:02Like, every request it gets, it spins up a Lambda.
- 00:14:05It's auto-scaling,
- 00:14:06but that isn't quite the end of the story.
- 00:14:09The thing about Lambda is that with certain run times,
- 00:14:13when spinning up a new Lambda, it could take,
- 00:14:15there's some latency involved.
- 00:14:16That's called cold start.
- 00:14:18IMDb didn't want any cold starts,
- 00:14:19so they used something called provisioned concurrency.
- 00:14:21With provisioned concurrency,
- 00:14:23you specify a number of Lambdas you wanna keep warm
- 00:14:25and these warm Lambdas won't have cold start,
- 00:14:27and you pay for that, but you pay a fraction
- 00:14:29of what it'll cost to actually run the Lambda.
- 00:14:31So if they specified a flat number like 800,000,
- 00:14:35that'd be wasteful, right?
- 00:14:36'Cause they're not always running 800,000.
- 00:14:37So what you see here is the gray line,
- 00:14:39this is not IMDb, this is a schematic,
- 00:14:41but the gray line represents a number of Lambda invocations
- 00:14:44and the orange stepwise line
- 00:14:47is the provisioned concurrency
- 00:14:49scaling up and then scaling down.
- 00:14:51So not only the number of Lambdas scale up and down,
- 00:14:54but the provisioned concurrency scales up and down,
- 00:15:01which brings us to our next best practice.
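Before the next best practice, a minimal sketch of the two mechanisms just described: keeping a warm baseline with provisioned concurrency, and letting that baseline track utilization via Application Auto Scaling so it steps up and down with the request curve. The function name, alias, and capacity numbers are hypothetical, not IMDb's actual settings.

```python
import boto3

FUNCTION = "imdb-gateway"  # hypothetical function name
ALIAS = "live"             # provisioned concurrency applies to a version or alias

lambda_client = boto3.client("lambda")
autoscaling = boto3.client("application-autoscaling")

# 1. Keep a warm baseline so requests don't hit cold starts.
lambda_client.put_provisioned_concurrency_config(
    FunctionName=FUNCTION,
    Qualifier=ALIAS,
    ProvisionedConcurrentExecutions=100,
)

# 2. Register the alias with Application Auto Scaling and track utilization,
#    so the warm pool scales with traffic instead of staying at a flat number.
resource_id = f"function:{FUNCTION}:{ALIAS}"
autoscaling.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=100,
    MaxCapacity=2000,
)
autoscaling.put_scaling_policy(
    PolicyName="pc-target-tracking",
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 0.7,  # aim for ~70% utilization of the warm pool
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
        },
    },
)
```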
- 00:15:03So we're gonna talk about using
- 00:15:07highly available public endpoints.
- 00:15:09So that's that front end I was talking about,
- 00:15:11that actual API endpoint,
- 00:15:13and so I'm gonna zoom in on it here.
- 00:15:15So zooming in on that front end of that gateway,
- 00:15:18we can see a couple of things.
- 00:15:19Okay, they're using the web application firewall, or WAF.
- 00:15:22All right, so WAF is a firewall product offered by AWS
- 00:15:26and they really loved it.
- 00:15:29They said that the initial turn on was exceedingly simple,
- 00:15:32and as soon as they implemented it, it removed,
- 00:15:34they said no more high-sev issues.
- 00:15:36I'll just say vastly reduced their high-sev issues,
- 00:15:39and they didn't have to put
- 00:15:40the manual network blocks in place.
- 00:15:42So what was causing these high-sev issues?
- 00:15:45Robots, either malicious or non-malicious robots.
- 00:15:48That's constantly fighting against the robots
- 00:15:51and so WAF was a solution for them that really worked.
- 00:15:54There's also a CDN here, a content delivery network,
- 00:15:57called CloudFront, and what CloudFront does is
- 00:15:59you might know that AWS is in 30 regions,
- 00:16:02but we have over 410 edge locations.
- 00:16:05So using CloudFront, your user's request, someone using IMDb,
- 00:16:08their request will be routed to one of those edge locations
- 00:16:11closer to them than a region possibly.
- 00:16:14That puts it right on the AWS backbone right away,
- 00:16:16gets better performance,
- 00:16:17and also being a content delivery network,
- 00:16:19it offers caching, so there's caching at that edge location,
- 00:16:22so if it could serve from the cache, it will,
- 00:16:24and finally, the ALB, the Application Load Balancer.
- 00:16:27That is the actual front end that's connected to the Lambda
- 00:16:30that's running the gateway.
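A hedged sketch of the bot-protection piece described above: creating a WAF web ACL with the AWS-managed Bot Control rule group, ready to associate with the CloudFront distribution in front of the gateway. The names are hypothetical, and this is not IMDb's actual rule set.

```python
import boto3

# Web ACLs attached to CloudFront must be created in us-east-1.
wafv2 = boto3.client("wafv2", region_name="us-east-1")

wafv2.create_web_acl(
    Name="gateway-bot-protection",
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},
    Rules=[
        {
            "Name": "aws-bot-control",
            "Priority": 0,
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesBotControlRuleSet",
                }
            },
            # Let the managed rule group's own allow/block actions apply.
            "OverrideAction": {"None": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "BotControl",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "GatewayWebAcl",
    },
)
```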
- 00:16:33All right, that was our first example.
- 00:16:34I hope you enjoyed it, and we got a few more.
- 00:16:36So let's talk about Global Ops Robotics
- 00:16:38and how they protect workloads
- 00:16:39with a cell-based architecture.
- 00:16:42So to understand what Global Ops Robotics is,
- 00:16:44you have to understand a little bit
- 00:16:46about Amazon's supply chain.
- 00:16:47So as a user, you have this ordering layer
- 00:16:49that you're seeing, like I see something,
- 00:16:51I order it, it shows up on my door,
- 00:16:53but under that is a supply chain layer.
- 00:16:55There's the warehouse management piece,
- 00:16:57which is things going on inside the warehouse,
- 00:16:58or fulfillment center as we call them, as Amazon calls them,
- 00:17:02middle mile, which is moving things
- 00:17:04to the warehouse or between them, and last mile,
- 00:17:07which is moving things to your front door.
- 00:17:09Well, Ops Robotics is the warehouse management piece.
- 00:17:12That's what they call it, and with Ops Robotics,
- 00:17:15all of these are about scale.
- 00:17:16I wanna talk about the scale
- 00:17:17of warehouse management at Amazon.
- 00:17:19There's over 500 of these warehouses, fulfillment centers.
- 00:17:22They could be up to a million square feet big
- 00:17:25and there's millions of items per fulfillment center.
- 00:17:28Now, the Ops Robotics team that runs warehouse management
- 00:17:30has multiple services.
- 00:17:32So what kind of services?
- 00:17:33Well, they need services that understand
- 00:17:35when material is received, where it needs to go,
- 00:17:38stow, picking it when someone orders it,
- 00:17:40packing it and shipping it,
- 00:17:42and so all of these are services
- 00:17:43that are part of Global Ops Robotics,
- 00:17:45and behind these services are multiple microservices.
- 00:17:48So you have hundreds,
- 00:17:49maybe even 1,000 microservices operating here,
- 00:17:53and the reliability pillar best practice
- 00:17:56we're gonna talk about is using bulkhead architectures.
- 00:17:59Bulkhead architectures mean setting up compartmentalization
- 00:18:03that you have multiple of these compartments,
- 00:18:05and if a failure occurs in one,
- 00:18:06it can't affect the others,
- 00:18:09and we're doing this with cell-based architecture.
- 00:18:11Again, this is not Global Ops Robotics.
- 00:18:13This is not warehouse managers.
- 00:18:14This is a generic slide on cell-based architectures.
- 00:18:18With cell-based architectures what we're doing
- 00:18:20is stamping out a complete stack multiple times
- 00:18:23isolated from each other,
- 00:18:24they don't share data with each other,
- 00:18:26and putting a thin routing layer on top.
- 00:18:28That routing layer deterministically assigns clients,
- 00:18:32I put, see clients in quotes, to a cell.
- 00:18:35So a given client, and when I say clients in quotes,
- 00:18:38you could actually, it could be user ID.
- 00:18:39It could be whatever.
- 00:18:40It could be several different things,
- 00:18:41some partition key to each cell
- 00:18:43so that you have a certain number of clients
- 00:18:44going to each cell, and if there's a failure in one cell,
- 00:18:47yes, the clients in that cell might be affected,
- 00:18:49but the clients in the other cells
- 00:18:51are isolated from the failure.
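One simple way a thin routing layer can assign clients to cells deterministically is by hashing the partition key, as in this sketch (the cell endpoints are hypothetical). As described next, Global Ops Robotics instead pins fulfillment centers to cells with an explicit assignment table, shown in a later sketch, which also allows rebalancing.

```python
import hashlib

# Hypothetical cell endpoints; each is a full, isolated copy of the stack.
CELLS = [
    "https://cell-1.example.internal",
    "https://cell-2.example.internal",
    "https://cell-3.example.internal",
]

def cell_for_client(client_id: str) -> str:
    """Deterministically map a client key (user id, FC id, ...) to one cell."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]

# The same client always routes to the same cell, so a failure in one cell
# only touches the clients assigned there.
assert cell_for_client("FC-A") == cell_for_client("FC-A")
print(cell_for_client("FC-A"), cell_for_client("FC-B"))
```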
- 00:18:54Now, in their case, they're fulfillment centers.
- 00:18:57They're the warehouses and their client ID
- 00:19:00would be the fulfillment center ID.
- 00:19:02Each fulfillment center does not share data
- 00:19:04with the other fulfillment centers.
- 00:19:05It's a discrete data set for them only.
- 00:19:09So it makes sense that when we're just using,
- 00:19:11deciding about that routing layer,
- 00:19:13how you assign requests to cells,
- 00:19:15we do it by fulfillment center.
- 00:19:17Each fulfillment center is assigned a cell
- 00:19:19and all their requests go to their cell.
- 00:19:22They might be sharing with other fulfillment centers,
- 00:19:24but their requests always go to the same cell.
- 00:19:26So in this case you could see three fulfillment centers
- 00:19:28sitting near each other and each one assigned
- 00:19:31to a different cell, and this,
- 00:19:33the thing they wanted to establish
- 00:19:34was this geographic redundancy.
- 00:19:36So you notice that these three are kinda clustered together
- 00:19:39and they're serving this area of the United States,
- 00:19:41Ohio, Indiana, I think that is.
- 00:19:43So what happens if there's a failure?
- 00:19:46It's contained to that cell, Cell2.
- 00:19:49That FC might be offline,
- 00:19:50but there's still two more FCs in that geographic region
- 00:19:54and the trucks continue to roll
- 00:19:55and people get their products still.
- 00:19:58Now, when they're deploying these cells,
- 00:20:00they're actually using separate AWS accounts
- 00:20:02for each cell and they're using pipelines,
- 00:20:04pipelines to deploy the infrastructure,
- 00:20:06pipelines to deploy the code.
- 00:20:07So the first deployment goes to a pre-prod cell
- 00:20:11that's not really used in production
- 00:20:12but used for testing and before it rolls out,
- 00:20:15and then subsequently it gets deployed
- 00:20:17to each of the three other cells
- 00:20:19each in a separate AWS account,
- 00:20:21and there's also another account
- 00:20:24that's a centralized repository of all the logs and traces
- 00:20:27that the other cells are exporting to
- 00:20:30so you can get an all-up view of the system
- 00:20:32'cause you don't wanna look at it cell by cell,
- 00:20:34or you do wanna look at it cell by cell sometimes,
- 00:20:36but you also wanna look at it all up,
- 00:20:37so that's an aggregation point
- 00:20:39where all the logs and traces could be aggregated.
- 00:20:43Now here's what it looks like.
- 00:20:45Each of the green boxes is a cell.
- 00:20:47Each one of the yellow circles with a letter in it
- 00:20:49is a fulfillment center and they each have their own ID.
- 00:20:52They're lettered in this case,
- 00:20:53and this is a cellular architecture.
- 00:20:55We're showing Service 1 and Service 2.
- 00:20:57Service 2 depends on Service 1.
- 00:20:59Service 1 is an upstream dependency of Service 2,
- 00:21:06and this is cellular wise.
- 00:21:07This is a cell-based architecture.
- 00:21:09What they found the problem here is,
- 00:21:10if there's a failure in cell one in Service 1,
- 00:21:13the way this is architected,
- 00:21:15there are negative impacts on the cells and services
- 00:21:18in Service 2 because of the dependencies,
- 00:21:21how each fulfillment center can be swapping cells
- 00:21:24based on the service.
- 00:21:26So what they wanted to do was establish this.
- 00:21:29Each fulfillment center is assigned to a given cell
- 00:21:33and it's only in that cell for every service in the stack,
- 00:21:36and now if there's a failure, like we saw before,
- 00:21:39it's not the greatest thing in the world to have a failure,
- 00:21:41but it's constrained and those other fulfillment centers
- 00:21:44continue to operate normally,
- 00:21:47and the way they did this was they designed a system
- 00:21:51to assign fulfillment centers to cells.
- 00:21:54They did this using DynamoDB,
- 00:21:55which is our NoSQL, very fast database,
- 00:21:59and this had two effects,
- 00:22:01aligning fulfillment centers to a cell
- 00:22:03but also allowing them to load balance between cells
- 00:22:06'cause fulfillment centers are different sizes.
- 00:22:08So you can't just put three per cell
- 00:22:09or whatever like I did here.
- 00:22:11So this system also runs various rules and heuristics
- 00:22:14to balance out the cell so no cell is
- 00:22:16particularly bigger than another one,
- 00:22:20and that's our cellular architecture.
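A hedged sketch of that assignment system: the routing layer looks up a fulfillment center's cell in a DynamoDB table, and a separate balancing job writes the assignments according to the sizing rules and heuristics. Table and attribute names are hypothetical.

```python
import boto3

# Hypothetical assignment table written by the balancing job.
table = boto3.resource("dynamodb").Table("fc-cell-assignments")

def cell_for_fulfillment_center(fc_id: str) -> str:
    """Return the cell endpoint this fulfillment center is pinned to."""
    item = table.get_item(Key={"fc_id": fc_id}).get("Item")
    if item is None:
        raise LookupError(f"no cell assignment for {fc_id}")
    return item["cell_endpoint"]

def assign_fulfillment_center(fc_id: str, cell_endpoint: str) -> None:
    """Called by the job that applies the sizing rules and heuristics."""
    table.put_item(Item={"fc_id": fc_id, "cell_endpoint": cell_endpoint})
```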
- 00:22:21Now I wanna talk about Amazon Relay
- 00:22:24and how they use multi-region to keep their trucks moving.
- 00:22:27So trucks are involved here.
- 00:22:28So we're still in the supply chain world.
- 00:22:30So we talked about warehouse management.
- 00:22:32Now we're gonna talk about middle mile management.
- 00:22:34Spoiler alert, I do not have an example for last mile.
- 00:22:37So if you're expecting that, come back next year.
- 00:22:39I'll have one next year.
- 00:22:41All right, so this is about middle mile.
- 00:22:43So middle mile is the semi trucks you see on the road
- 00:22:46with the Amazon Prime symbol on it.
- 00:22:48This is about moving stuff into warehouses
- 00:22:50and between warehouses and making sure
- 00:22:53that all the millions and millions of items
- 00:22:55in Amazon's inventory are in the right place
- 00:22:58to be able to serve customers.
- 00:23:00Now, this example I'm gonna show you
- 00:23:01focuses on North America,
- 00:23:02but middle mile exists around the world,
- 00:23:05and I'm gonna talk about the Relay app.
- 00:23:07The Relay app is an app for iOS and Android
- 00:23:09that the truckers use.
- 00:23:10So if you could think of middle mile
- 00:23:12as having this really sophisticated model
- 00:23:14that determines where stuff should be
- 00:23:16and when it should be there all over United States,
- 00:23:19this is how that model's realized.
- 00:23:21That model is just something in a computer.
- 00:23:23It's meaningless unless you can get trucks rolling
- 00:23:25and moving stuff around.
- 00:23:26This is the realization of that model.
- 00:23:28This is the model that truck drivers use
- 00:23:30to know where to go, when to go there,
- 00:23:33what to pick up, where to take it,
- 00:23:36and you could download this app today on your phone.
- 00:23:38I did it and it's pretty useless
- 00:23:39unless you're a truck driver.
- 00:23:40So truck drivers in the audience, feel free to download it,
- 00:23:43but everybody else. (laughing)
- 00:23:46Okay, so best practice I'm gonna talk about
- 00:23:49is using highly available endpoints.
- 00:23:51All right, so we talked about that already, right?
- 00:23:52Highly available endpoints.
- 00:23:53Oh, we talked about that when we talked about
- 00:23:55the IMDb gateway.
- 00:23:56Well, same thing here.
- 00:23:57Amazon Relay being an app has a gateway too,
- 00:24:00again, a single point of entry that the app is,
- 00:24:03the iOS app and the Android app are both calling into,
- 00:24:07and just like IMDb, there's a gateway
- 00:24:09and it's fronting several backend services.
- 00:24:11In this case, they call them modules.
- 00:24:12So I'm gonna call them the modules too.
- 00:24:14So there you can see the modules there.
- 00:24:15The modules are mostly serverless
- 00:24:18consisting of Lambda and DynamoDB, and there's the gateway.
- 00:24:22So unlike IMDb, they're not using Application Load Balancer.
- 00:24:24They're using API Gateway.
- 00:24:26API Gateway is a highly scalable managed API
- 00:24:30and you can see there's multiple API Gateways there
- 00:24:31'cause the way this works is you could use,
- 00:24:34well, see, you could use Route 53, which is not shown there,
- 00:24:37which is the DNS system, to create a domain name,
- 00:24:40and then based on path-based routing,
- 00:24:42like what's after the slash and what's after domain name,
- 00:24:45it goes to a different API Gateway
- 00:24:47and then API Gateway fronts one of these backend modules,
- 00:24:50and you can also see there's also
- 00:24:51some authentication logic in there too.
- 00:24:53So that's important and it's calling into
- 00:24:54the Amazon authentication system to do that.
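The talk describes one domain whose paths fan out to different API Gateways, each fronting a module. One common way to realize that, assumed here rather than confirmed as Relay's exact mechanism, is an API Gateway custom domain with base path mappings; the domain, API ids, and paths below are hypothetical.

```python
import boto3

apigw = boto3.client("apigateway")

DOMAIN = "relay-api.example.com"  # hypothetical custom domain pointed at by Route 53

# Map each path prefix under the domain to its own REST API and stage.
MODULE_APIS = {
    "loads": ("a1b2c3d4e5", "prod"),       # e.g. relay-api.example.com/loads/...
    "navigation": ("f6g7h8i9j0", "prod"),  # e.g. relay-api.example.com/navigation/...
}

for base_path, (rest_api_id, stage) in MODULE_APIS.items():
    apigw.create_base_path_mapping(
        domainName=DOMAIN,
        basePath=base_path,
        restApiId=rest_api_id,
        stage=stage,
    )
```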
- 00:24:58Now, what they really liked about this model
- 00:24:59when they went to it is there's no shared ownership
- 00:25:01of code or infrastructure between the gateway
- 00:25:05and the backend modules.
- 00:25:06So they could deploy independently.
- 00:25:08They could make changes independently
- 00:25:09as long as they don't break any contracts
- 00:25:11and it gave them a lot more flexibility.
- 00:25:15Now, the other best practices we wanna look into
- 00:25:17with these two teams are to deploy the workload
- 00:25:19to multiple locations and choose the appropriate locations
- 00:25:22for those deployments, and to talk about this,
- 00:25:26we need to go back to December of 2021.
- 00:25:29As many of you know, in us-east-1 in December 2021,
- 00:25:32there was an event that caused several services
- 00:25:35to experience service issues,
- 00:25:37and one of the services affected was SNS,
- 00:25:39or Simple Notification Service,
- 00:25:41and Relay app does depend on that
- 00:25:44and you can see the effect.
- 00:25:45All right, so some truck drivers
- 00:25:48could not get their load assignments.
- 00:25:49They couldn't get the assignment of where to go
- 00:25:50and what to pick up and you can see it wasn't 100%.
- 00:25:53It went up to about 30% at its peak
- 00:25:55and it was for just some limited period of time,
- 00:25:57but that's still an impact on our customers
- 00:25:59and Amazon does not wanna have that kind of impact.
- 00:26:04So what could you do?
- 00:26:05You could redesign to either not use SNS
- 00:26:08or make SNS a soft dependency
- 00:26:10or you could take the approach they did and use spares.
- 00:26:13Spares is where you set up multiple instances of a resource
- 00:26:16so if one of them is not working, you could use the other.
- 00:26:19Now, in this case, SNS is a regional service,
- 00:26:23so in order to be able to use a different SNS service,
- 00:26:26they had to go to another region
- 00:26:27and I'll show you how they did that,
- 00:26:29but first I gotta introduce you
- 00:26:31to another cultural thing at Amazon,
- 00:26:32the COE, or correction of error event.
- 00:26:35So when something like this happens
- 00:26:36where 30% of the truck drivers
- 00:26:38are not able to get their load assignments,
- 00:26:39that's customer impacting, the team does a COE.
- 00:26:42A COE is a deep dive as to what caused the issue
- 00:26:45and how it could be avoided.
- 00:26:46It's blameless. It's not there to point fingers.
- 00:26:49It's not there to find the culprit as a person.
- 00:26:52It's there to find the actual cause of the issue
- 00:26:54and to come up with solutions,
- 00:26:56actions so that issues like this,
- 00:27:00an issue like this or related to this
- 00:27:01can never happen again,
- 00:27:03and here are some of the ones they came up with
- 00:27:05and the ones I'm gonna talk about.
- 00:27:06I'm gonna talk about how they did
- 00:27:08a review of the resiliency of the Relay app
- 00:27:11and how they then deployed to multiple regions
- 00:27:15to enact what was found in that review.
- 00:27:18So in that review,
- 00:27:21their primary goal there was to preserve
- 00:27:23physical operational continuity,
- 00:27:25even if the experience is degraded.
- 00:27:27So what do I mean by degraded? Let's talk about that.
- 00:27:29So the three steps they did was they had to articulate
- 00:27:31the minimum critical workflow.
- 00:27:33So which parts of this have to work, while other parts,
- 00:27:36if they're not working, it's not optimal,
- 00:27:38but we could still keep the trucks rolling.
- 00:27:41Two, design solutions that those critical parts
- 00:27:44remain operational, and three,
- 00:27:46adapt the system so that when the parts
- 00:27:49that are not so critical stop working,
- 00:27:51the system could still operate.
- 00:27:53That's what we mean by the degraded experience.
- 00:27:55It still works, the critical functions are there,
- 00:27:59and they just, as I said before,
- 00:28:00they went with a multi-region approach.
- 00:28:02So they were already deployed in us-east-1,
- 00:28:05and fun fact, they'd been running out of
- 00:28:08what was the predecessor to us-east-1 before AWS existed.
- 00:28:12Amazon had data centers there
- 00:28:14and that's where they ran out of,
- 00:28:16but they also decided to deploy to us-west-2 over in Oregon.
- 00:28:20Amazon has, AWS has 30 regions all over the world.
- 00:28:25You could see those are the ones in North America,
- 00:28:28and the solution looked like this.
- 00:28:29So this is the backend modules, okay?
- 00:28:31So the backend modules weren't as necessarily as simple
- 00:28:33as a Lambda and a DynamoDB,
- 00:28:35but the thing about them is that they all were fronted
- 00:28:37by Lambda so they could integrate with the API Gateway
- 00:28:39and they all persisted their important data,
- 00:28:41the data that needed to be shared, in DynamoDB,
- 00:28:44and so in this case, you could see they deployed
- 00:28:46to us-east-1 and us-west-2,
- 00:28:48and the nice thing about DynamoDB,
- 00:28:52it has something called global tables.
- 00:28:55With DynamoDB global tables,
- 00:28:56you can deploy a table in multiple regions
- 00:28:59and write to any of those tables
- 00:29:01and those writes will be replicated to the other regions.
- 00:29:03So they found that just to be an easy solution
- 00:29:05just to put right in there.
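A hedged sketch of that pattern: add a us-west-2 replica to an existing table once, then have each module write to the replica in its own region and let global tables handle replication. Table, item, and attribute names are hypothetical.

```python
import os

import boto3

TABLE = "relay-load-assignments"  # hypothetical table name

# One-time setup: add a us-west-2 replica to an existing us-east-1 table.
boto3.client("dynamodb", region_name="us-east-1").update_table(
    TableName=TABLE,
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)

# Runtime path: write to whichever region this instance of the module runs in;
# the write is replicated to the other region automatically.
local_region = os.environ.get("AWS_REGION", "us-east-1")
table = boto3.resource("dynamodb", region_name=local_region).Table(TABLE)
table.put_item(Item={"load_id": "load-123", "driver_id": "driver-42", "status": "ASSIGNED"})
```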
- 00:29:07Now, each of these modules is owned by a two-pizza team
- 00:29:10or a two-pizza team might own more than one of them,
- 00:29:12but they're all owned by a two-pizza team,
- 00:29:14and the two-pizza teams, based on the criticality analysis,
- 00:29:17decided whether they were gonna go multi-region or not.
- 00:29:19Not all of them did because you have to pick
- 00:29:21where to put your resources right, where to invest.
- 00:29:26Now, the gateway part of it, the part in front there,
- 00:29:29also was deployed to two regions.
- 00:29:30You could see that API Gateway
- 00:29:31which is representing the gateway
- 00:29:33going to us-east-1 and us-west-2,
- 00:29:36and now we put Route 53 in front of it.
- 00:29:38So Route 53 is our DNS system.
- 00:29:40This is called an active/active architecture.
- 00:29:44What it means is that each of the two regions here
- 00:29:47actively receive requests.
- 00:29:49A given request doesn't go to both regions.
- 00:29:51It goes to one or the other.
- 00:29:52How does it decide which one to go to?
- 00:29:53Well, Route 53 offers several routing policies.
- 00:29:56In this case, they decided to use latency routing.
- 00:29:58So Route 53, based on past experience,
- 00:30:00will determine for a given request
- 00:30:02which one's gonna give the lowest latency
- 00:30:04and route the request there.
- 00:30:05There are other routing policies.
- 00:30:06There's weighted routing. There's geolocation routing.
- 00:30:09So it routes it based on where the request came from.
- 00:30:12So there's all kinds of different options.
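A hedged sketch of what latency-based records for the two regional endpoints could look like in Route 53; the hosted zone, domain, and endpoint names are hypothetical. These are also the records a failover exercise would later adjust to drain a region.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000EXAMPLE"  # hypothetical hosted zone
NAME = "api.relay.example.com"

def latency_record(region: str, target: str) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": NAME,
            "Type": "CNAME",
            "SetIdentifier": region,  # one record per region
            "Region": region,         # enables latency-based routing
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            latency_record("us-east-1", "east.api-gw.example.com"),
            latency_record("us-west-2", "west.api-gw.example.com"),
        ]
    },
)
```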
- 00:30:13This team, Relay, went with the latency-based routing,
- 00:30:17and so this is a request to us-east-1
- 00:30:20and you can see the module called module A.
- 00:30:24They are a module that did go multi-region.
- 00:30:26So the request goes to their us-east-1 version
- 00:30:29and then module B there did not go multi-region,
- 00:30:32so the request also goes to us-east-1.
- 00:30:35However, requests that went to us-west-2,
- 00:30:38this is where it gets interesting,
- 00:30:39for module A, it's gonna go to us-west-2,
- 00:30:42but module B, a less critical module,
- 00:30:44didn't set up anything in us-west-2,
- 00:30:46so it's still gonna receive its request in us-east-1,
- 00:30:49and we'll see how that just plays out later
- 00:30:51in various failure scenarios.
- 00:30:54Yeah, so that's how it works.
- 00:30:55All right, so the next best practice
- 00:30:56is to implement graceful degradation
- 00:30:58to turn hard dependencies into soft dependencies.
- 00:31:01Now, I talked a little bit about
- 00:31:02what graceful degradation is.
- 00:31:03It's about maintaining the critical parts of your workload
- 00:31:06while the less critical ones might fail, but overall,
- 00:31:09the end users still can do the things they need to do.
- 00:31:13So this is that analysis they did
- 00:31:14when they wrote up that report.
- 00:31:16From going left to right in order,
- 00:31:18these are the things that a truck, a delivery goes through,
- 00:31:21the various business domain specific things
- 00:31:24that middle mile goes through, and what the red lines,
- 00:31:26the red bars represent are criticality.
- 00:31:28So by creating a graph like this,
- 00:31:30they're able to identify which modules are critical
- 00:31:33and which ones are less critical.
- 00:31:34So for instance, it's critical that they be able
- 00:31:37to complete a delivery.
- 00:31:39It's critical that you could assign drivers
- 00:31:41to pick up their loads.
- 00:31:42Then what's not critical? What can we do without?
- 00:31:44Well, the app provides turn-by-turn navigation.
- 00:31:47So if that goes out, again, not optimal,
- 00:31:50but there's other GPS systems.
- 00:31:51The app also has this long-term booking
- 00:31:53where you could book next week's loads.
- 00:31:55Well, that's important, but it might not be important now
- 00:31:58while there's some kind of issue going on,
- 00:31:59and eventually, whatever the issue is going on,
- 00:32:02it's gonna be solved and then you could assign
- 00:32:03next week's loads.
- 00:32:04So I really like the subtitle here, "The trucks keep moving,
- 00:32:07no products backed up on the docks."
- 00:32:09That's what they told me, "The trucks keep moving,
- 00:32:10no products backed up on the docks,"
- 00:32:12and that's what they're aiming for,
- 00:32:14and so the next best practice
- 00:32:15we're gonna look at is fail over.
- 00:32:17Okay, being able to fail over to healthy resources.
- 00:32:19So what happens if they have another event
- 00:32:22where they want to fail out of us-east-1
- 00:32:25and be purely in us-west-2?
- 00:32:27So in this case, using the routing policy,
- 00:32:29they could turn off all traffic to us-east-1,
- 00:32:31send all the traffic to us-west-2, and this is what happens.
- 00:32:34So, I'm sorry. I'm gonna actually go back.
- 00:32:36So notice that for module A,
- 00:32:39it's gonna use the version of module A that's in us-west-2,
- 00:32:43and that's a critical module
- 00:32:44and it's gonna continue operating.
- 00:32:46What happens to module B?
- 00:32:47Remember, module B never set up a us-west-2 version.
- 00:32:51So one of two things is probably gonna happen.
- 00:32:52Either one, the request is gonna, well,
- 00:32:54the request is gonna go to us-east-1 where we failed out of,
- 00:32:57but we failed outta there because we're seeing some issue
- 00:32:59that we think we wanna fail out for but doesn't mean,
- 00:33:01the region's never hard down.
- 00:33:03That doesn't happen.
- 00:33:04So the service in us-east-1 still might respond
- 00:33:07and that's a best-case scenario,
- 00:33:08but worst-case scenario, it doesn't respond,
- 00:33:10and because of the way the system's designed
- 00:33:12and graceful degradation,
- 00:33:14it's again a less than optimal experience
- 00:33:16but an experience that allows the users
- 00:33:18to do their critical functions,
- 00:33:19which is to keep the trucks moving,
- 00:33:20nothing backed up on the docks,
- 00:33:23and the last one we're gonna look at
- 00:33:25is about testing your disaster recovery strategy
- 00:33:27'cause you could have a disaster recovery strategy,
- 00:33:29but if you don't test it, you don't know if it works,
- 00:33:32and so they ran a game day, a game day basically to exercise
- 00:33:34this disaster recovery strategy.
- 00:33:36They wanna be prepared for peak 2022.
- 00:33:38Peak at Amazon represents the holiday season.
- 00:33:40I think we're already in it with Black Friday
- 00:33:42and Cyber Monday already going on.
- 00:33:45So what they did was they initiated
- 00:33:47a fail over in production.
- 00:33:51They acted as if they needed to get out of us-east-1.
- 00:33:53They didn't need to, but they acted as if they did,
- 00:33:56got out of us-east-1, failed over,
- 00:33:58sent all the traffic to us-west-2 and this is what happened.
- 00:34:02You notice that the increase in traffic in us-west-2
- 00:34:04is way over 100%.
- 00:34:05If it was evenly balanced,
- 00:34:07you'd expect it to be 100% increase,
- 00:34:08but it was more than 100% increase.
- 00:34:09So this represents that most of the traffic's still going
- 00:34:12to us-east-1 and that's probably just the nature
- 00:34:14of population density in the United States.
- 00:34:17The other thing I forgot to mention is
- 00:34:18when they went to the active/active model,
- 00:34:20truck drivers in the West started seeing
- 00:34:22much lower latencies 'cause their requests
- 00:34:23were being sent to the West region.
- 00:34:26Also, when they failed over,
- 00:34:27they actually were able to successfully run the service
- 00:34:30without any significant customer impact
- 00:34:32or failures, et cetera.
- 00:34:34It took 'em about 10 minutes to execute the fail over.
- 00:34:36They did see an increase in latency and it was,
- 00:34:40they're working on it and they're still re-engineering
- 00:34:42to try to get that down, but the increase in latency still,
- 00:34:44again, maybe less than optimal,
- 00:34:46but everyone was still able
- 00:34:48to do the critical functions, kept the trucks rolling,
- 00:34:50nothing backing up on the docks.
- 00:34:56All right, our next example is
- 00:34:57the Classification and Policies Platform
- 00:34:59and how they use shuffle sharding to limit blast radius.
- 00:35:02So basically similar to before, similar to before,
- 00:35:06blast radius is about containing the failure
- 00:35:09to an area, to a cell, in this case, a shard,
- 00:35:12so that it doesn't affect other parts of the system,
- 00:35:15and so what is Classification and Policy Platform?
- 00:35:18Well, they're part of the catalog service
- 00:35:21and the catalog at Amazon is massive,
- 00:35:24millions and millions of items in the Amazon catalog,
- 00:35:27and every single one of those items needs to be classified.
- 00:35:30So what do I mean by classified?
- 00:35:31Well, there's 50 different classification programs.
- 00:35:34It could be as simple as what type is it.
- 00:35:35Is it clothing? Is it electronics?
- 00:35:37It could be what kinda taxes should be applied.
- 00:35:40Can this thing be put on an airplane? Is it hazardous?
- 00:35:43Is it something that we can sell in a certain state?
- 00:35:46Is it something that children are allowed to use?
- 00:35:48I mean, there's all kinds of classification going on,
- 00:35:5150 of these programs which are actually
- 00:35:53not necessarily part of this team.
- 00:35:55This team runs the platform to host
- 00:35:57all these classification programs
- 00:35:58and applying classification to all the millions
- 00:36:01and millions of things in the Amazon catalog,
- 00:36:04and why is this important?
- 00:36:05Well, I kinda gave this away a little bit because I said,
- 00:36:07all right, so here's an item that I,
- 00:36:10living in Washington, can buy,
- 00:36:12but when John living in California goes to buy it, it says,
- 00:36:15"No, you can't have it because California restricts
- 00:36:18this item or says you can't have it,"
- 00:36:19and that's an example of how the classification was applied
- 00:36:22to the item and the ordering service was able
- 00:36:24to read that classification and say,
- 00:36:25"No, I cannot sell it to people in California,"
- 00:36:29and again, it's about scale.
- 00:36:31So there's 50 programs across Amazon
- 00:36:33doing this classification that are using this platform.
- 00:36:35There's over 10,000 machine learning models being applied,
- 00:36:38100,000 rules, so that's like if this, then that,
- 00:36:41so less sophisticated than machine learning
- 00:36:43but still important,
- 00:36:44and there's 100 model updates every day.
- 00:36:47So of these 10,000 machine learning models,
- 00:36:49100 are being updated every day.
- 00:36:50Millions of products are being updated per hour.
- 00:36:53So this is what it looks like.
- 00:36:54All right, so this is,
- 00:36:55if you're dozing off, time to pay attention
- 00:36:57'cause this is where it gets a little complicated.
- 00:36:58I wanna make it simple, all right?
- 00:37:00So you have the millions of items
- 00:37:01that need to be classified and we're breaking them up
- 00:37:05into batches of about 200 each,
- 00:37:06but apparently it can vary quite a bit.
- 00:37:08So that's not so important.
- 00:37:10What's important is that we need to apply about 100,
- 00:37:12not all 10,000 machine learning models,
- 00:37:14but about 100 machine learning models
- 00:37:17to every item coming in
- 00:37:18and we do that in batches of 30 models.
- 00:37:21So that means there's gonna be three requests made,
- 00:37:23three batches of 30.
- 00:37:24Why 30? I'll get to that in a minute.
- 00:37:27So these requests to process these items
- 00:37:29for 30 machine learning models go to a classifier.
- 00:37:32In this case, it's an Elastic Container Service service
- 00:37:36that's running machine learning models against these items,
- 00:37:39and what it does is it pulls the models down from S3.
- 00:37:43So it says, "Oh, these items need these 30 models.
- 00:37:46I'm gonna pull these 30 models down and run them,"
- 00:37:49and it can cache the models and that's important
- 00:37:51'cause we want to actually try to use workers,
- 00:37:54these are all workers, these classifiers,
- 00:37:55that already have those models cached.
- 00:37:57So pulling the models down
- 00:37:58and swapping 'em out is inefficient,
- 00:38:01and then after it does the classification,
- 00:38:02it writes it to DynamoDB.
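To make that flow concrete, here is a minimal sketch of what a classifier worker's caching behavior could look like. Everything named here (the bucket, the table name, the deserialization helper) is a hypothetical stand-in rather than the team's actual code; the point is just the cache-hit/miss path from S3 and the write to DynamoDB described above.

```python
import boto3
from collections import OrderedDict

# Hypothetical resources: bucket and table names are placeholders, not the real ones.
s3 = boto3.resource("s3")
table = boto3.resource("dynamodb").Table("item-classifications")

MAX_CACHED_MODELS = 30  # roughly one 30-model group per worker, per the talk


def deserialize_model(raw_bytes):
    # Stand-in: a real worker would deserialize an actual ML model artifact here.
    return lambda item: {"label": "placeholder", "model_size": len(raw_bytes)}


class ModelCache:
    """LRU cache of classification models, pulled down from S3 on a miss."""

    def __init__(self, bucket):
        self.bucket = bucket
        self.models = OrderedDict()

    def get(self, model_id):
        if model_id in self.models:            # cache hit: no S3 round trip
            self.models.move_to_end(model_id)
            return self.models[model_id]
        # Cache miss: pull the model artifact from S3 (the expensive path we want to avoid).
        raw = s3.Object(self.bucket, f"models/{model_id}").get()["Body"].read()
        model = deserialize_model(raw)
        self.models[model_id] = model
        if len(self.models) > MAX_CACHED_MODELS:
            self.models.popitem(last=False)    # evict the least recently used model
        return model


def process_batch(cache, items, model_ids):
    """Apply a group of ~30 models to a batch of catalog items, then persist the results."""
    for item in items:
        results = {m: cache.get(m)(item) for m in model_ids}
        table.put_item(Item={"item_id": item["id"], "classifications": results})
```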
- 00:38:04So this is the logical view.
- 00:38:05Let me show you the architectural view.
- 00:38:07Oh, wait, I promised to tell you why 30 is important.
- 00:38:09Well, two reasons.
- 00:38:11They found that that's a nice size
- 00:38:12where they could take 30 related models,
- 00:38:15so like one model might actually use
- 00:38:16the output of another one.
- 00:38:18So they're sort of related in some way,
- 00:38:20but the other reason is about
- 00:38:20that caching I was talking about.
- 00:38:22There's only so many models that these services
- 00:38:25can keep in cache.
- 00:38:27So if you told it to be running 100 models,
- 00:38:29it can't possibly keep those all in cache,
- 00:38:31so limiting it to 30, again,
- 00:38:33allows you to avoid the swapping out.
- 00:38:35Remember, trying to avoid that swapping out.
- 00:38:38So this is more an architectural view.
- 00:38:39So taking one of those requests,
- 00:38:41which again is for 30 models,
- 00:38:43goes through Kinesis where Kinesis reads the metadata
- 00:38:46on the request and decides what models are gonna be applied,
- 00:38:48and this is important.
- 00:38:49This is the part where it sends it to an AWS Lambda
- 00:38:50which is acting as a router.
- 00:38:52It's the AWS Lambda that says,
- 00:38:54"Oh, you need these 30 models?
- 00:38:56I'm gonna send you to this worker,"
- 00:38:58and it puts it on an SQS queue where then the workers,
- 00:39:02the ECS services, read it off the queue.
- 00:39:04So in other words, there's 60 of these workers.
- 00:39:06The workers are dumb.
- 00:39:07They'll process whatever you give them.
- 00:39:09You tell 'em to process
- 00:39:10these 30 machine learning models.
- 00:39:11They'll check: is it in cache?
- 00:39:12Yeah, all right, I'll do it.
- 00:39:13Is it not in cache? All right, I'll pull it down.
- 00:39:15They don't care, so all the smarts are in that Lambda.
- 00:39:18That Lambda is attempting to keep
- 00:39:20each of these 30 model requests
- 00:39:22in the same worker or workers
- 00:39:24that have processed those 30 before to avoid the swapping,
- 00:39:30and so we're gonna talk about
- 00:39:31a best practice we talked about before,
- 00:39:33using these bulkhead architectures,
- 00:39:35only we're not talking about cells in this case.
- 00:39:37We're gonna be talking about shards
- 00:39:38and specifically shuffle sharding.
- 00:39:40So I'm gonna take about three slides
- 00:39:42to explain shuffle sharding.
- 00:39:43Now, a warning about shuffle sharding:
- 00:39:46I've seen one-hour talks at this conference
- 00:39:48just to explain shuffle sharding
- 00:39:50and I'm gonna do it in three slides.
- 00:39:51So hopefully I land the message. If I don't, don't sweat it.
- 00:39:53I think you can still follow along.
- 00:39:55All right, so this is just an example
- 00:39:57of some service that has multiple workers.
- 00:39:59They could be EC2 instances,
- 00:40:01or like in the case of CPP, they could be ECS services,
- 00:40:04and on top, those different symbols are different clients
- 00:40:07or different callers of the service
- 00:40:09and there's no sharding going on here,
- 00:40:11and the thing about no sharding is what happens when
- 00:40:14one of those clients does what we call a poison pill.
- 00:40:16It makes either a malformed request, a corrupt request,
- 00:40:20maybe even a malicious request,
- 00:40:21something that kills the service, the process running on it.
- 00:40:25Maybe it tickles a bug that we didn't know we had.
- 00:40:28It takes down that worker.
- 00:40:31Okay, no problem. We have load balancing, right?
- 00:40:33That worker's down. Let's try another worker.
- 00:40:36Oh, it takes that one down.
- 00:40:38It takes the next one down too
- 00:40:40and eventually will work its way through all the workers
- 00:40:43until there are no workers and everybody's outta luck.
- 00:40:45All the clients are now red.
- 00:40:47Nobody's able to call the service.
- 00:40:50That's no sharding. So let's introduce sharding, okay?
- 00:40:55This is sharding.
- 00:40:56Unlike cells,
- 00:40:59which were the entire stack,
- 00:41:00this is just taking some resource layer
- 00:41:01and dividing it into chunks, in this case, chunks of two.
- 00:41:05Shards of two workers each,
- 00:41:06and in this case, the cat does its thing,
- 00:41:09kills its two workers.
- 00:41:10It and the dog are unhappy, but everybody else is happy.
- 00:41:14That's the bulkhead architecture at work.
- 00:41:16It contained the failure to that shard
- 00:41:19and you could see number of customers impacted
- 00:41:21is customers, 8, divided by shards, 4.
- 00:41:232 customers impacted. It checks out, right?
- 00:41:27Okay, now shuffle sharding.
- 00:41:29This is where it gets interesting.
- 00:41:30Each client in this case is assigned
- 00:41:33its own unique pair of two workers,
- 00:41:36but they can be sharing workers.
- 00:41:38So what do I mean by this?
- 00:41:39If you look at the bishop here,
- 00:41:40bishop has these two workers.
- 00:41:42That's the bishop shard. The rook has these two workers.
- 00:41:46They are two unique pairs, they're not the same pair,
- 00:41:49but they're sharing a worker.
- 00:41:51Same thing here.
- 00:41:51The cat has these two workers,
- 00:41:53but again, it has its own unique pair of workers.
- 00:41:56None of the clients, none of the eight clients here
- 00:41:58shares the same two.
- 00:42:00They each share with other shards,
- 00:42:02but none of them share the same two with another client.
- 00:42:05So in this case, what happens is the cat does its thing,
- 00:42:08kills its two workers.
- 00:42:09It's down, but even though the rook
- 00:42:12was sharing one worker with it,
- 00:42:14it still has a healthy worker.
- 00:42:15It has its own unique pair of workers.
- 00:42:17Same thing with the bishop and everybody else.
- 00:42:20So the number of customers impacted
- 00:42:22is customers divided by combinations,
- 00:42:24which in this case is 8 customers,
- 00:42:26and I made 8 shuffle shards,
- 00:42:28so only 1 customer was affected,
- 00:42:30which means that, at scale,
- 00:42:32our scope of impact is 12 1/2 percent, or 1/8,
- 00:42:35meaning that if we had 800 clients, 100 would be impacted,
- 00:42:38but it gets better than this 'cause actually,
- 00:42:40I can make more than 8 shuffle shards out of this.
- 00:42:43With 8 workers and making combinations of 2,
- 00:42:47some of you might recognize this math, it's 8 choose 2,
- 00:42:51and you can actually make 28 combinations,
- 00:42:52so the actual scope of impact is much less,
- 00:42:56and if you don't know that math, don't worry about it.
- 00:42:57This is how many combinations of unique sets of 2 workers
- 00:43:01you can make given 8, and if you really wanna go crazy,
- 00:43:05there's Route 53, over 2,000 workers making shards of 4.
- 00:43:10Your scope of impact is 1 in 730 billion.
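If you want to check that math yourself, the combination counts fall straight out of "n choose k". The 2,048-worker figure in the last line is only an assumption that reproduces the roughly 730 billion number; the talk just says "over 2,000 workers" making shards of 4.

```python
from math import comb

print(comb(8, 2))      # 28 possible shuffle shards from 8 workers taken 2 at a time
print(comb(60, 3))     # 34,220 -- the CPP case coming up: 60 workers, shards of 3
print(comb(2048, 4))   # ~731 billion -- assuming 2,048 workers in shards of 4 (Route 53 example)
```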
- 00:43:12The math gets kinda crazy at that point,
- 00:43:15but getting back to CPP.
- 00:43:16All right, so CPP has its workers.
- 00:43:19Its workers are tasked with processing these workloads
- 00:43:22of 30 machine learning models at a time.
- 00:43:25There's over 10,000 machine learning models
- 00:43:28and we need to process them in batches of 30.
- 00:43:31So that's 400 total groupings, 400 shards we're gonna need,
- 00:43:35because a given shard we want to be processing
- 00:43:38given machine learning models and not swapping them out.
- 00:43:40So we're gonna need 400 shards.
- 00:43:41So if there are 60 workers, we can do shuffle sharding.
- 00:43:45The blue shard, the green shard, and the orange shard,
- 00:43:49can you see they share workers with each other,
- 00:43:52but it's each one's a unique combination
- 00:43:54of three in this case.
- 00:43:57So what happens if we have that poison pill incident
- 00:43:59where the orange shard goes down
- 00:44:02because those 30 models were somehow corrupt
- 00:44:04and something happens and that shard is poisoned?
- 00:44:08If we had no shuffle sharding, just standard sharding,
- 00:44:11we took our 400 machine learning groups
- 00:44:13and distributed 'em over 20 shards,
- 00:44:15'cause if we take 60 divided by 3, that's 20,
- 00:44:17then 20 of those machine learning groups would be affected,
- 00:44:20but if we use shuffle sharding, we can create 400 shards.
- 00:44:23So 400 groups of machine learning models, 400 shards.
- 00:44:27One of them gets poisoned, then only that one is affected,
- 00:44:30and again, to remind you,
- 00:44:31it's the same case as we saw with the cat.
- 00:44:33It's because each one has its own unique group
- 00:44:35of three workers, and just to go crazy,
- 00:44:38actually you could create a lot more than 400 shards.
- 00:44:4160 choose 3 is over 34,000.
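One way to picture how a model group could get its own trio of workers is a deterministic assignment like the sketch below. This is an illustration, not the CPP team's actual algorithm: hashing the group ID seeds a random sample, so the same group always maps to the same three workers, and a production assignment would additionally guarantee that no two groups end up with the exact same trio.

```python
import hashlib
import random

NUM_WORKERS = 60
SHARD_SIZE = 3


def shuffle_shard(group_id: str) -> list[int]:
    """Map a model-group ID to a stable set of 3 workers out of 60 (illustrative only)."""
    seed = int(hashlib.sha256(group_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return sorted(rng.sample(range(NUM_WORKERS), SHARD_SIZE))


# The same group always lands on the same workers; different groups overlap
# but generally get different combinations of three.
print(shuffle_shard("model-group-017"))
print(shuffle_shard("model-group-018"))
```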
- 00:44:45All right, and to bring this home,
- 00:44:48the other thing they're doing is
- 00:44:49to implement loosely coupled dependencies.
- 00:44:51Let me show you how that works. So remember the router?
- 00:44:54All the smarts are in that Lambda there.
- 00:44:55It decides which shard.
- 00:44:57So remember, before I said worker.
- 00:44:58The Lambda actually decides which shard
- 00:45:00it's gonna send the request to.
- 00:45:05Remember, it's putting things on an SQS queue
- 00:45:07which are then being picked up by the worker.
- 00:45:09So that Lambda is actually monitoring those queues.
- 00:45:11It's actually looking at the age of the oldest message.
- 00:45:14If the age of the oldest message is pretty old,
- 00:45:16it probably means that queue is pretty slow and congested.
- 00:45:19So it's actually using back-pressure
- 00:45:21to decide which worker inside a shard it's gonna call.
- 00:45:26So within a shard of three,
- 00:45:27it can choose the worker that's the least busy.
- 00:45:29So you can see there the middle one's the least busy
- 00:45:31so that's the one it chooses.
- 00:45:32So that Lambda is not just a router.
- 00:45:34It's also a load balancer using back-pressure
- 00:45:38to route along those workers in a shard.
- 00:45:42Now, what if the load is too high?
- 00:45:44What if there's a spike and all of the workers,
- 00:45:46all three workers in the shard are overloaded?
- 00:45:48This is where load shedding comes in.
- 00:45:50It'll send the request to a load shedding queue
- 00:45:53and come back to it after 15 minutes.
- 00:45:55Why 15 minutes? Well, those ECS services are auto-scaling.
- 00:45:59They're based on CPU levels.
- 00:46:01So if there really is a spike going on,
- 00:46:03those ECS services are gonna see elevated CPU
- 00:46:06and they're gonna scale out.
- 00:46:07So 15 minutes later, we'll come back,
- 00:46:09reprocess that request and it should work at that point.
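Here is a rough sketch of what that routing logic could look like inside the Lambda, assuming the queue age is read from the standard ApproximateAgeOfOldestMessage metric that SQS publishes to CloudWatch. The queue URLs, the congestion threshold, and the 900-second delay (SQS's maximum, which matches the 15 minutes mentioned) are illustrative values, not the team's actual configuration.

```python
import json
from datetime import datetime, timedelta

import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

OVERLOAD_AGE_SECONDS = 300  # hypothetical threshold for "this queue is congested"
LOAD_SHED_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/load-shed"  # placeholder


def oldest_message_age(queue_name: str) -> float:
    """Read the ApproximateAgeOfOldestMessage metric for one worker queue."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=datetime.utcnow() - timedelta(minutes=5),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Maximum"],
    )
    points = resp["Datapoints"]
    return max(p["Maximum"] for p in points) if points else 0.0


def route_request(request: dict, shard_queues: dict[str, str]) -> None:
    """Pick the least congested worker queue in the shard, or shed load for ~15 minutes."""
    ages = {name: oldest_message_age(name) for name in shard_queues}
    best = min(ages, key=ages.get)
    if ages[best] < OVERLOAD_AGE_SECONDS:
        # Back-pressure routing: send to the least busy worker in the shard.
        sqs.send_message(QueueUrl=shard_queues[best], MessageBody=json.dumps(request))
    else:
        # Every worker in the shard is overloaded: shed the load and retry later,
        # giving ECS auto scaling time to react to the elevated CPU.
        sqs.send_message(QueueUrl=LOAD_SHED_QUEUE_URL,
                         MessageBody=json.dumps(request),
                         DelaySeconds=900)
```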
- 00:46:16All right, this is actually our last example of the day.
- 00:46:18It's about Amazon Search
- 00:46:20and how they're using chaos engineering
- 00:46:21to be ready for Prime Day, any day.
- 00:46:25Amazon Search, I think you've all seen it.
- 00:46:26You have probably all seen the search bar here.
- 00:46:28Why don't we search for chaos engineering
- 00:46:30and see what we get?
- 00:46:33All right, over 1,000 results.
- 00:46:35Okay, the top result there is the "Chaos Engineering" book
- 00:46:38by Casey and Nora.
- 00:46:39That's sort of the chaos engineering bible.
- 00:46:41So that's a good result.
- 00:46:42I also want you to notice the SLO,
- 00:46:49the service level objective book there with the doggy on it
- 00:46:51'cause that's gonna be important too,
- 00:46:53and we're talking about scale here.
- 00:46:54So we're talking about millions of products.
- 00:46:56We're talking about 300 million active users.
- 00:46:59We're talking about last Prime Day,
- 00:47:0184,000 requests per second peak during Prime Day.
- 00:47:04Again, the whole point is to show you
- 00:47:05the scale of these services and how they're using AWS
- 00:47:08to meet the need of that scale,
- 00:47:10and Search, like everything else I showed you,
- 00:47:12consists of multiple backend services
- 00:47:14and uses multiple AWS resources,
- 00:47:17and what I really like about the Search team
- 00:47:19is they have their own resilience team.
- 00:47:21So they have a built-in team dedicated to resilience
- 00:47:23doing operational resilience
- 00:47:25and site reliability engineering
- 00:47:27for the Search org across those 40 services,
- 00:47:30and their main goal, their main motto is,
- 00:47:32"We test, improve, and drive the resilience
- 00:47:34of Amazon Search services."
- 00:47:36How do they do that?
- 00:47:36They do that by promoting resilience initiatives,
- 00:47:38helping with load testing and helping to promote
- 00:47:42and orchestrate chaos engineering,
- 00:47:44and that's the part I want to talk about.
- 00:47:47So the best practice in this case is use chaos engineering
- 00:47:50to test your workload, to test your resilience.
- 00:47:54So what is chaos engineering?
- 00:47:56I'm gonna read a slide for you.
- 00:47:57"Chaos engineering is the discipline of experimenting
- 00:47:59on a system in order to build confidence
- 00:48:01in the system's capability to withstand
- 00:48:03turbulent conditions in production."
- 00:48:05Turbulent conditions in production.
- 00:48:06I think we could all identify with that,
- 00:48:08unusual user activity, network issues,
- 00:48:12infrastructure issues, bad deployments.
- 00:48:15I mean, it's a mess out there
- 00:48:17and we need to be resilient to that.
- 00:48:19So the thing to know about chaos engineering,
- 00:48:21it's not about creating chaos.
- 00:48:23It's about acknowledging the chaos that already exists
- 00:48:26and preparing for it and mitigating it
- 00:48:29and avoiding the impact of that chaos.
- 00:48:31So that's the way you gotta be thinking
- 00:48:32about chaos engineering.
- 00:48:34So how do you do chaos engineering?
- 00:48:36This is a one-slide summary of how to do chaos engineering.
- 00:48:39Chaos engineering is ultimately at its core
- 00:48:40a scientific method.
- 00:48:42This is a circular cycle,
- 00:48:44but I'm gonna start with steady state.
- 00:48:46What the heck is steady state?
- 00:48:47Steady state means your workload, the workload under test
- 00:48:49is operating within design parameters,
- 00:48:51and you have to be able to measure that.
- 00:48:53You have to be able to assign metrics to say
- 00:48:54what does it mean to operate within design parameters.
- 00:48:57Then is the hypothesis.
- 00:48:59The hypothesis is if some bad thing happens,
- 00:49:02and you specify the bad thing, if an EC2 instance dies,
- 00:49:05if an Availability Zone is not available,
- 00:49:07if a network link goes out, then my system,
- 00:49:11because I designed it that way, will maintain steady state.
- 00:49:15It will stay within those operational parameters.
- 00:49:17Now, if you didn't design it that way,
- 00:49:18don't do the chaos engineering,
- 00:49:20but if you designed it that way, you're testing that.
- 00:49:22So you run the experiment. You simulate that EC2 failure.
- 00:49:25You simulate that network link outage,
- 00:49:27and then you validate.
- 00:49:28You verify whether the hypothesis was confirmed.
- 00:49:32If the hypothesis was not confirmed, oh, okay.
- 00:49:34We experienced some sort of outage.
- 00:49:36We went outside of the established parameters.
- 00:49:38We did not maintain steady state. You need to improve.
- 00:49:41You improve by redesigning,
- 00:49:43applying the best practices in the reliability pillar,
- 00:49:46and then you test it again.
- 00:49:47You run the experiment again.
- 00:49:49Oh, now the hypothesis is confirmed
- 00:49:51and we're back to steady state
- 00:49:52and the whole thing repeats all over again.
- 00:49:56So service level objectives,
- 00:49:57I told you this would come up again, so here it is.
- 00:50:00This is an example service level objective.
- 00:50:01This is not one they actually use.
- 00:50:04They didn't really wanna share those,
- 00:50:05but they did want to share the format of it.
- 00:50:07So this is the format of it.
- 00:50:08In a 28-day trailing window, we'll see 99.9% of requests
- 00:50:12with a latency of less than one second.
- 00:50:14That's an example of a service level objective
- 00:50:17that might be used by the Search team,
- 00:50:19and with this service level objective,
- 00:50:20we've established something called the error budget.
- 00:50:22So what's the error budget?
- 00:50:23Well, 99.9% means that .1% can be greater than a second.
- 00:50:29So that's the start of our budget. That's our budget.
- 00:50:31However, with every request that exceeds one second,
- 00:50:36we're consuming that budget.
- 00:50:38Eventually that whole thing will be consumed
- 00:50:39and we'll be out of budget and you can actually look at
- 00:50:42how fast that budget's being burned.
- 00:50:43It's called the burn rate, but there's good news.
- 00:50:46There's a 28-day trailing window.
- 00:50:48So that means the oldest failures,
- 00:50:51the oldest requests that took longer than a second,
- 00:50:53will eventually age out,
- 00:50:54that is, become older than 28 days,
- 00:50:56and your budget replenishes.
- 00:50:59So that's the concept of the error budget.
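As a worked example of that budget math, here is a small sketch assuming the illustrative SLO above (99.9% of requests under one second over a trailing 28 days): the budget is the 0.1% of requests allowed to be slow, consumption is how much of that you've already used, and the burn rate compares that consumption to how far into the window you are.

```python
SLO_TARGET = 0.999   # 99.9% of requests under one second
WINDOW_DAYS = 28


def error_budget_report(total_requests: int, slow_requests: int, window_elapsed_days: float):
    """Illustrative error-budget math for a latency SLO over a trailing window."""
    budget = (1 - SLO_TARGET) * total_requests           # requests allowed to exceed 1 second
    consumed = slow_requests / budget if budget else 0.0  # fraction of the budget already used
    # A burn rate above 1 means the budget is being consumed faster than the window replenishes it.
    burn_rate = consumed / (window_elapsed_days / WINDOW_DAYS) if window_elapsed_days else 0.0
    return {"budget_requests": budget, "budget_consumed": consumed, "burn_rate": burn_rate}


# Example: 10M requests so far, 7,000 of them slower than a second, 14 days into the window.
print(error_budget_report(10_000_000, 7_000, 14))  # burn rate 1.4 -- on pace to blow the budget
```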
- 00:51:03So they wanna do customer-obsessed chaos engineering.
- 00:51:06Chaos engineering is not for the engineering teams.
- 00:51:09It's not for the developers.
- 00:51:10It's so we can establish an experience for our customers
- 00:51:14that's gonna serve their needs,
- 00:51:15and they thought SLO was the best way to do that.
- 00:51:17It's very customer focused.
- 00:51:18It's focused on what the customers experience
- 00:51:20and so the experiments must stay within the error budget,
- 00:51:25and the stop conditions for an experiment,
- 00:51:27you must always have stop conditions
- 00:51:28on your chaos engineering experiments,
- 00:51:30are if the burn rate is too high on the error budget,
- 00:51:34the experiment stops.
- 00:51:36If the Andon cord is pulled.
- 00:51:37So the Andon cord goes back to the Toyota factories
- 00:51:40where they had an actual cord
- 00:51:42that anybody on the assembly line could pull
- 00:51:45if they saw a quality issue.
- 00:51:46Same thing here.
- 00:51:48Several people across the org can push this button
- 00:51:50and will stop and roll back any experiment at any time,
- 00:51:53and then the last thing is
- 00:51:54if there's some kind of big event
- 00:51:56happening across Amazon, then that's not a good time
- 00:51:59to be doing your chaos engineering,
- 00:52:00so let's stop it and roll it back then too,
- 00:52:04and this is what they designed.
- 00:52:06We're here to talk about architecture.
- 00:52:07So on the right, I just wanna point out it's all centered
- 00:52:10on Fault Injection Simulator.
- 00:52:11Fault Injection Simulator is an AWS service
- 00:52:14that you can use to run chaos experiments
- 00:52:18and they did build around that.
- 00:52:20So on the far right, you can see ECS and EC2.
- 00:52:23That's the search services.
- 00:52:25Remember, there's 40-plus search services.
- 00:52:27So they're using Fault Injection Simulator
- 00:52:29to do chaos engineering on those services.
- 00:52:32What they built was the part on the left.
- 00:52:33That's the orchestration piece.
- 00:52:35Okay, now follow me down to the API Gateway
- 00:52:38in the lower left-hand corner.
- 00:52:39You can see two APIs.
- 00:52:40The first one's the Andon API and the Andon API
- 00:52:44establishes and configures the Andon cords.
- 00:52:46It does this by setting up various CloudWatch alarms
- 00:52:49that FIS will respond to.
- 00:52:51FIS has guardrails.
- 00:52:52Remember, I said a good experiment has to have a guardrail.
- 00:52:56So FIS has guardrails based on CloudWatch
- 00:52:57and so when someone pulls the Andon cord,
- 00:53:00it sets the CloudWatch alarm
- 00:53:01which then stops the FIS experiment.
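To give a feel for how an Andon-cord alarm can plug into FIS, here is a minimal sketch of creating an experiment template whose stop condition is a CloudWatch alarm, using the boto3 fis client. The ARNs, tags, and the specific action (terminating a small percentage of tagged EC2 instances) are placeholder assumptions, not the Search team's actual templates.

```python
import uuid

import boto3

fis = boto3.client("fis")

ANDON_ALARM_ARN = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:search-andon-cord"  # placeholder
FIS_ROLE_ARN = "arn:aws:iam::123456789012:role/fis-experiment-role"                    # placeholder

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Terminate a slice of search workers; stop if the Andon cord alarm fires",
    roleArn=FIS_ROLE_ARN,
    stopConditions=[
        # The guardrail: if this alarm goes into ALARM state (for example, someone pulls
        # the Andon cord or the SLO burn rate is too high), FIS halts the experiment.
        {"source": "aws:cloudwatch:alarm", "value": ANDON_ALARM_ARN},
    ],
    targets={
        "search-workers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"service": "search"},   # placeholder tag
            "selectionMode": "PERCENT(10)",
        }
    },
    actions={
        "terminate-some-workers": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "search-workers"},
        }
    },
    tags={"team": "search-resilience"},
)
print(template["experimentTemplate"]["id"])
```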
- 00:53:03Okay, the other API is the run API.
- 00:53:05It has the ability to run an experiment,
- 00:53:08to schedule it for later,
- 00:53:10so there's a Lambda there that's a scheduler
- 00:53:12that can store schedules in DynamoDB,
- 00:53:14and it provides orchestration.
- 00:53:16You see on the right there those three Lambdas.
- 00:53:19So not only can you run the experiment,
- 00:53:21but it gives you the ability to do things
- 00:53:22before the experiment.
- 00:53:23What might you wanna do before an experiment?
- 00:53:24You might wanna send out an alert to various personnel.
- 00:53:27You might wanna stop any in-process deployments.
- 00:53:31So there's various things you might wanna do
- 00:53:32before an experiment, then run the experiment using FIS,
- 00:53:35and then do post-experiment operations,
- 00:53:37like for instance, cleaning things up,
- 00:53:39because some experiments,
- 00:53:42some fault injections, aren't self-correcting,
- 00:53:44so you actually have to go in and correct them,
- 00:53:46so it might do something like that,
- 00:53:49and so FIS exists, it's a great service,
- 00:53:52so why did they build this orchestration piece?
- 00:53:56This is why.
- 00:53:57Number one is they're serving 40-plus teams.
- 00:53:58They wanna provide a single pane of glass,
- 00:54:01a consistent experience across those teams
- 00:54:02and make it super easy for them to do chaos engineering.
- 00:54:07They also wanted to add the ability to do scheduling
- 00:54:09and to be able to run it with deployments,
- 00:54:11which FIS can do, but remember,
- 00:54:13all these 40 services are using a pipeline system in common,
- 00:54:16so the orchestration is designed around that
- 00:54:19to make it super easy to run it with deployments
- 00:54:22and provide consistent guardrails.
- 00:54:23Remember, the SLOs are the important guardrail.
- 00:54:26So they actually have as part of their system
- 00:54:28storage of all the various SLOs
- 00:54:30so that it uses that during experimentation
- 00:54:32to provide a guardrail.
- 00:54:34The Andon cord functionality is not natively part of FIS,
- 00:54:37so they're providing that, and metrics and insights.
- 00:54:39Of course FIS emits metrics,
- 00:54:42but now they could roll up all the metrics
- 00:54:44from all 40 services and provide them
- 00:54:46as a single report to management
- 00:54:48about what kinda chaos engineering they're doing,
- 00:54:52and plus, in addition to FIS,
- 00:54:53they wanna be able to run other kinds of faults.
- 00:54:55Let's talk about that. Oh, well, no, first:
- 00:54:57when they were doing all this, why did they do it?
- 00:54:59Why did they build the orchestrator?
- 00:55:00'Cause they wanna be ready for Prime Day, any day.
- 00:55:03All right, types of faults. First, there are the FIS faults.
- 00:55:06These are all supported by FIS,
- 00:55:07things like dropping ECS nodes, killing EC2 instances,
- 00:55:12and injecting latency.
- 00:55:17SSM, or Systems Manager, lets you run
- 00:55:19any kind of automation you want,
- 00:55:20so you can maybe even simulate an Availability Zone outage,
- 00:55:23but what kind of faults are they doing that's not FIS?
- 00:55:26Well, there's load testing because, actually,
- 00:55:29internal to Amazon, there's a load test tool
- 00:55:30that's very popular across Amazon teams.
- 00:55:32They wanna be able to use that
- 00:55:33as part of their experimentation,
- 00:55:35and there's emergency levers.
- 00:55:36So emergency levers are things you can do
- 00:55:40as an operator of a service to help a service under duress.
- 00:55:43For a service under duress, you pull the emergency lever
- 00:55:46and now the service can operate well,
- 00:55:48so for instance, blocking all robots,
- 00:55:51and ultimately, I'm running a little low on time,
- 00:55:53so I'm gonna speed through this, ultimately,
- 00:55:55they wanna provide a benefit to the end user.
- 00:55:56I just wanna point out that it's about higher availability,
- 00:55:58improved resiliency for the end customer,
- 00:56:01and so I wanna talk about graceful degradation
- 00:56:03and emergency levers.
- 00:56:06What does the emergency lever look like for Search?
- 00:56:07Well, with one of their emergency levers,
- 00:56:10they actually pull the lever
- 00:56:12and it causes graceful degradation on purpose.
- 00:56:14So this is what Search looks like,
- 00:56:16a full Search experience if I'm searching for Lego,
- 00:56:18but if I pull the emergency lever,
- 00:56:21it'll turn off non-critical services.
- 00:56:23So critical services like the image,
- 00:56:25the title, the price are all still there,
- 00:56:27but non-critical services like the reviews
- 00:56:29or the age range are not there.
- 00:56:31So for a system under duress, this can help,
- 00:56:34and they test this using chaos engineering.
- 00:56:37The hypothesis is the lever works
- 00:56:40and it enables Search to handle the stress.
- 00:56:42So they literally generate load during the test
- 00:56:45and then pull the lever and validate that,
- 00:56:48yes, the system's able to handle the duress.
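A minimal sketch of that kind of lever, assuming a hypothetical flag (in practice it might be read from something like a parameter store): when the lever is pulled, the result card is assembled from critical widgets only, and the non-critical ones are simply skipped.

```python
# Hypothetical widget sets; the real Search page composition is far richer than this.
CRITICAL_WIDGETS = ["image", "title", "price"]
NON_CRITICAL_WIDGETS = ["reviews", "age_range", "recommendations"]


def render_search_result(item: dict, lever_pulled: bool) -> dict:
    """Assemble a result card, dropping non-critical widgets when the emergency lever is on."""
    widgets = CRITICAL_WIDGETS if lever_pulled else CRITICAL_WIDGETS + NON_CRITICAL_WIDGETS
    return {name: item.get(name) for name in widgets}


# Under duress, the operator flips the lever and every result degrades gracefully.
lego = {"image": "lego.jpg", "title": "LEGO set", "price": "$49", "reviews": 4.8, "age_range": "9+"}
print(render_search_result(lego, lever_pulled=False))  # full experience
print(render_search_result(lego, lever_pulled=True))   # degraded but still functional
```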
- 00:56:51All right, so in summary,
- 00:56:52these are the five services I covered.
- 00:56:55This part's important to me.
- 00:56:57Okay, so to get all this information
- 00:56:58so I could share it with you, and I hope you enjoyed it,
- 00:57:01I had to work with many smart engineers on multiple teams,
- 00:57:04and the thing about smart engineers
- 00:57:06working on cool stuff is that they're really busy,
- 00:57:09and they took time out to spend with me
- 00:57:11to explain this to me so I could share it with you,
- 00:57:13so my deepest appreciation to those engineers
- 00:57:16and my awe at the engineering that they did.
- 00:57:18I really am impressed by it.
- 00:57:19Hopefully you're impressed by it too.
- 00:57:22Some resources. I won't spend too much time on here.
- 00:57:24You wanna take a snap of that real quick?
- 00:57:26Upcoming talks that might cover the things we talked about,
- 00:57:29and also, two of the examples I covered
- 00:57:31actually have some external resources
- 00:57:33you can check out if you want to learn more.
- AWS
- Scalability
- Reliability
- Well-Architected Framework
- Chaos Engineering
- Microservices
- Automation
- Cloud Computing
- Amazon
- Architecture