AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)

00:55:43
https://www.youtube.com/watch?v=swQbA4zub20

Summary

TL;DR: In his talk, Peter Vosshall, a distinguished engineer at AWS, explores how AWS minimizes the blast radius of failures in distributed systems. He discusses fundamental constraints that shape system design, such as the speed of light, the CAP theorem, and Murphy's Law. Vosshall outlines the techniques AWS employs, including region isolation, availability zone independence, cell-based architecture, and shuffle sharding, to contain failures and reduce their impact on customers. He emphasizes the importance of operational practices, such as staggered deployments and automation, in maintaining system resilience. The talk aims to provide insight into how AWS builds resilient systems and offers practical techniques attendees can use to reduce the blast radius in their own systems.

Key takeaways

  • 🌍 AWS focuses on minimizing the blast radius of failures.
  • ⚡ The speed of light affects long-distance communications.
  • 📉 The CAP theorem says a partition-tolerant distributed system cannot provide both consistency and availability at the same time.
  • 🔧 AWS employs region isolation to contain failures.
  • 🏢 Availability zones provide fault tolerance within regions.
  • 🔄 Cell-based architecture helps in isolating failures further.
  • 🎲 Shuffle sharding reduces the impact of problematic requests.
  • 🔍 Staggered deployments help in managing risks during updates.
  • 🤖 Automation is key to reducing human error in operations.
  • 🔄 Post-mortem analyses are conducted to learn from failures.

Timeline

  • 00:00:00 - 00:05:00

    The session begins with Peter Vosshall introducing himself as a distinguished engineer at AWS, discussing the importance of minimizing the blast radius of failures in distributed systems. He highlights the challenges posed by the speed of light, the CAP theorem, and Murphy's Law, emphasizing the need for resilience in AWS systems.

  • 00:05:00 - 00:10:00

    Vosshall explains the concept of blast radius, which refers to the degree of impact a failure can have on customers, workloads, and locations. He stresses the importance of reducing this blast radius and outlines AWS's commitment to maintaining high availability while preparing for potential failures.

  • 00:10:00 - 00:15:00

    The discussion shifts to the various ways systems can fail, including server crashes, disk failures, network issues, and external factors like storms. Vosshall also mentions non-physical failures such as traffic surges and software bugs, setting the stage for techniques to contain failures.

  • 00:15:00 - 00:20:00

    Vosshall introduces techniques for reducing blast radius, starting with region isolation. He explains that AWS operates in multiple regions, each with its own set of services, ensuring that failures in one region do not affect others, thus providing a strong layer of isolation.

  • 00:20:00 - 00:25:00

    Next, Vosshall discusses availability zone independence, explaining that each region consists of multiple availability zones (AZs) that are physically separated to minimize correlated failures. He illustrates how applications can be designed to leverage this architecture for fault tolerance.

  • 00:25:00 - 00:30:00

    The presentation continues with the concept of cell-based architecture, where services are compartmentalized into cells that operate independently. This design allows for better fault isolation and resilience, as failures in one cell do not impact others.

  • 00:30:00 - 00:35:00

    Vosshall elaborates on shuffle sharding, a technique that further reduces blast radius by assigning each customer to a small, fixed, pseudo-randomly chosen set of nodes. This ensures that even if one assigned node fails, customers can still reach the service through their other nodes, significantly lowering the impact of failures.

  • 00:35:00 - 00:40:00

    The operational practices that support these architectural techniques are discussed, including staggered deployments, automated testing, and end-to-end ownership by service teams. Vosshall emphasizes the importance of automation in reducing human error and maintaining system integrity.

  • 00:40:00 - 00:45:00

    Vosshall concludes by summarizing the various mechanisms AWS employs to minimize blast radius, including region isolation, availability zones, cell-based architecture, and shuffle sharding, all supported by robust operational practices.

  • 00:45:00 - 00:50:00

    The session ends with a Q&A segment where Vosshall addresses questions about real-world examples of failures, the relationship between regions and cells, and the complexities of version management during deployments.

  • 00:50:00 - 00:55:43

    Overall, the talk provides insights into AWS's strategies for building resilient systems that can withstand failures while minimizing their impact on customers.


Video Q&A

  • What is the main focus of Peter Vosshall's talk?

    The main focus is on how AWS minimizes the blast radius of failures in distributed systems.

  • What are some fundamental properties that affect distributed systems?

    The speed of light, the CAP theorem, and Murphy's Law are key properties that affect distributed systems.

  • What techniques does AWS use to reduce the blast radius?

    AWS uses region isolation, availability zone independence, cell-based architecture, and shuffle sharding to reduce the blast radius.

  • How does AWS ensure operational resilience?

    AWS ensures operational resilience through staggered deployments, automation, and end-to-end ownership by service teams.

  • What is shuffle sharding?

    Shuffle sharding assigns each customer to a small, fixed, pseudo-randomly chosen subset of nodes, so a problematic workload affects only the few customers whose subsets overlap with it (a brief illustrative sketch follows this Q&A list).

  • What is the significance of availability zones?

    Availability zones provide fault tolerance within a region, reducing the impact of failures.

  • How does AWS handle software deployments?

    AWS handles software deployments in a staggered manner to minimize risk and ensure system stability.

  • What is the role of the control plane in AWS services?

    The control plane manages resource administration, while the data plane handles the actual work.

  • What is the importance of automation in AWS operations?

    Automation reduces human error, ensures consistency, and allows for predictable operations.

  • How does AWS approach failure analysis?

    AWS conducts post-mortem analyses to identify root causes and implement changes to prevent future failures.
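
As a rough illustration of the shuffle sharding idea described above (a hedged sketch, not AWS's actual implementation), the snippet below gives each customer a fixed, pseudo-random shard of k nodes out of n by hashing the customer ID; the function name shard_for_customer and the node names are invented for this example.

    import hashlib
    from itertools import combinations

    NODES = [f"node-{i}" for i in range(8)]  # eight workers, as in the talk's example
    SHARD_SIZE = 2                           # each customer gets a fixed pair of nodes

    def shard_for_customer(customer_id, nodes=NODES, k=SHARD_SIZE):
        # Enumerate every k-node combination (28 pairs for 8 nodes) and hash the
        # customer ID to pick one; the assignment is deterministic and fixed,
        # which is what keeps a poison-pill workload contained to the few
        # customers whose shards overlap.
        all_shards = sorted(combinations(nodes, k))
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        return all_shards[int(digest, 16) % len(all_shards)]

    for customer in ["diamonds", "spades", "hearts", "clubs"]:
        print(customer, "->", shard_for_customer(customer))

Enumerating every combination is only practical at toy sizes; with 100 nodes and 5-node shards there are roughly 75 million combinations, so a real router would derive the shard directly (for example by seeded sampling) rather than materialize the full list.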

Subtitles
  • 00:00:01
    good afternoon thanks for coming to this
  • 00:00:05
    session who's here at reinvent for the
  • 00:00:07
    first time awesome welcome should be an
  • 00:00:12
    exciting week lots of sessions lots of
  • 00:00:15
    parties lots of fun so I am Peter Vosshall
  • 00:00:19
    I am a distinguished engineer at AWS
  • 00:00:22
    and today I'm going to talk about how
  • 00:00:25
    AWS minimizes the blast radius of
  • 00:00:27
    failures so I spent my career focused on
  • 00:00:31
    building highly available large-scale
  • 00:00:33
    distributed systems and one of the
  • 00:00:36
    things I've learned and something you
  • 00:00:38
    learn quickly in this field is that
  • 00:00:40
    there are some fundamental properties of
  • 00:00:41
    the universe that you basically have to
  • 00:00:43
    contend with the first is the speed of
  • 00:00:46
    light so 186 miles per millisecond is
  • 00:00:51
    pretty darn fast but as the folks at JPL
  • 00:00:54
    this morning trying to land their their
  • 00:00:57
    Mars rover learned it really gets in the
  • 00:00:59
    way when you travel any long distance
  • 00:01:01
    communications by the way they did land
  • 00:01:03
    the rover successfully so
  • 00:01:05
    congratulations to our our friends at
  • 00:01:07
    JPL a happy Amazon customer a happy AWS
  • 00:01:10
    customer the next thing is sort of this
  • 00:01:14
    pesky property universe known as the cap
  • 00:01:17
    theorem so this is a an observation you
  • 00:01:21
    can't simultaneously have
  • 00:01:22
    consistency and availability in a
  • 00:01:25
    distributed system that is built to be
  • 00:01:26
    partitioned tolerant so this was first
  • 00:01:28
    postulated by Eric Brewer in 98 and then
  • 00:01:31
    proven by Seth Gilbert and Nancy Lynch
  • 00:01:34
    at MIT
  • 00:01:36
    perhaps most annoying though is this
  • 00:01:38
    observation by some guy named Murphy
  • 00:01:40
    which is basically stuff breaks
  • 00:01:43
    inevitably things will fail in a variety
  • 00:01:46
    of interesting ways I don't know if this
  • 00:01:48
    has been proven but in my experience
  • 00:01:49
    that seems to have been borne out to be
  • 00:01:51
    the truth now security availability
  • 00:01:56
    durability these are all super important
  • 00:01:59
    to AWS in fact they're our top priority
  • 00:02:01
    and yet we have to contend with with
  • 00:02:04
    Murphy's Law so today I'm going to talk
  • 00:02:08
    about the techniques that Amazon uses
  • 00:02:11
    AWS uses
  • 00:02:13
    to contain the blast radius or the the
  • 00:02:16
    degree of impact when Murphy's Law does
  • 00:02:19
    in fact strike and I hope you'll walk
  • 00:02:21
    away from this talk getting a better
  • 00:02:23
    understanding of how AWS builds
  • 00:02:25
    resilience into our systems and also
  • 00:02:28
    some techniques that you can use to
  • 00:02:30
    reduce the blast radius in your own
  • 00:02:32
    systems so this term blast radius is a
  • 00:02:36
    it's a really useful one because failure
  • 00:02:40
    isn't binary it's not just failing or
  • 00:02:43
    it's not there is a degree of impact
  • 00:02:45
    it's a really useful term it's part of
  • 00:02:48
    our common language at AWS and basically
  • 00:02:52
    it's a way to describe the degree of
  • 00:02:54
    impact so at AWS if we have a
  • 00:02:57
    failure one way to talk about blast
  • 00:02:59
    radius as well how many customers did it
  • 00:03:01
    impact or how many workloads or what
  • 00:03:04
    functionality maybe it was just a
  • 00:03:05
    portion of the functionality that was
  • 00:03:07
    impacted and then finally in what
  • 00:03:09
    locations was it a rack in a data center
  • 00:03:11
    was it a whole data center was it an
  • 00:03:13
    entire region obviously we would always
  • 00:03:16
    prefer a smaller blast radius like a single
  • 00:03:18
    rack failing to a larger scope failure
  • 00:03:23
    so while we do relentlessly focus on
  • 00:03:26
    preventing failures and and I would say
  • 00:03:28
    that we've gotten very good at keeping
  • 00:03:30
    our systems extremely highly available
  • 00:03:32
    we also relentlessly focus on reducing
  • 00:03:35
    blast radius for those very rare cases
  • 00:03:38
    where we do have a failure and try to
  • 00:03:40
    contain it
  • 00:03:41
    and make it as small as possible one way
  • 00:03:46
    to do that we do this is in our
  • 00:03:48
    correction of errors process so that's
  • 00:03:51
    our post mortem process that we use
  • 00:03:53
    whenever there's an event the service
  • 00:03:55
    team will go through analyze what were
  • 00:03:58
    the root causes of the failure and then
  • 00:04:01
    identify a set of actions to take to
  • 00:04:03
    prevent recurrence one of the questions
  • 00:04:06
    in the template for doing this
  • 00:04:09
    correction of errors is about customer
  • 00:04:10
    impact and we asked as a thought
  • 00:04:12
    exercise
  • 00:04:12
    how could you cut the blast radius for a
  • 00:04:14
    similar event in half so we're always
  • 00:04:16
    thinking about even when we do have an
  • 00:04:18
    event how can we make it even smaller
  • 00:04:20
    blast radius the next time
  • 00:04:23
    so before talking about how to do that
  • 00:04:25
    let's talk about how things can fail so
  • 00:04:26
    one way things can fail as well servers
  • 00:04:28
    crash maybe not like this but servers do
  • 00:04:31
    crash disks fail in a variety of
  • 00:04:33
    interesting ways they might go
  • 00:04:35
    completely offline they might have
  • 00:04:36
    random i/o errors network devices can
  • 00:04:39
    fail and if they don't fail they could
  • 00:04:41
    introduce random bit flips we've
  • 00:04:43
    definitely seen this happen a few times
  • 00:04:46
    meanwhile outside the data center
  • 00:04:49
    utility workers can accidentally cause
  • 00:04:52
    fiber cuts you can have electrical
  • 00:04:55
    storms that can cause utility power
  • 00:04:58
    failures in a more extreme scenario you
  • 00:05:03
    could have data centers actually get
  • 00:05:05
    physically damaged by storms or fires so
  • 00:05:10
    far I've talked about physical failures
  • 00:05:11
    right but there's also non-physical
  • 00:05:14
    failures we have to worry about one is
  • 00:05:16
    just a surge of traffic whether it's a
  • 00:05:18
    DDoS attack or a extreme surge in demand
  • 00:05:23
    on one of our services which can cause
  • 00:05:26
    overload conditions we have to worry
  • 00:05:29
    about black swan requests we actually
  • 00:05:31
    call them these poison pills internally
  • 00:05:33
    I've done a little research and that
  • 00:05:35
    seems to be just an AWS term but you can
  • 00:05:37
    think of these as these particularly
  • 00:05:39
    problematic requests that are either
  • 00:05:41
    really expensive or there's something
  • 00:05:43
    about them that trigger a bug in in the
  • 00:05:46
    system and they can be particularly
  • 00:05:48
    pernicious because a client will retry
  • 00:05:51
    after a failure and so a single failure
  • 00:05:54
    can cascade and cause an entire system
  • 00:05:57
    to get infected
  • 00:05:58
    so we worried a lot about these poison
  • 00:06:00
    pills or black swans we also have to be
  • 00:06:04
    mindful the fact that sometimes a
  • 00:06:05
    software deployment or configuration
  • 00:06:06
    change could introduce the problem and
  • 00:06:10
    then of course most generally there are
  • 00:06:12
    bugs and obviously none of us want to
  • 00:06:16
    have bugs get into production and we
  • 00:06:18
    have a very good success rate and not
  • 00:06:20
    letting bugs get into production but
  • 00:06:21
    there is still that that possible
  • 00:06:23
    eventualities and so now that I've
  • 00:06:28
    talked about failures and what we mean
  • 00:06:30
    by blast radius I'll now talk about
  • 00:06:32
    the various techniques we use to contain
  • 00:06:34
    it
  • 00:06:35
    so region isolation availability zone
  • 00:06:38
    independence cell based architecture
  • 00:06:40
    shuffle sharding and then finally in
  • 00:06:43
    tandem with all these techniques the
  • 00:06:44
    operational practices that we use to get
  • 00:06:47
    the desired outcome so let's start with
  • 00:06:50
    region isolation so as you probably know
  • 00:06:53
    AWS is deployed globally in 19 separate
  • 00:06:57
    locations that we call regions and we
  • 00:07:02
    actually have five additional regions
  • 00:07:03
    that we've announced that will be coming
  • 00:07:04
    on online soon and as a customer you
  • 00:07:08
    choose the region that you want to run
  • 00:07:10
    your workloads in based on factors like
  • 00:07:12
    like latency or maybe data residency
  • 00:07:17
    each one of these regions is a separate
  • 00:07:20
    distinct stack of AWS services and each
  • 00:07:26
    region has a separate set of end points
  • 00:07:28
    for the api's you use to interact with
  • 00:07:30
    the services so for example in ec2 if
  • 00:07:32
    you want to interact with ec2 in uswest
  • 00:07:35
    one you would use the
  • 00:07:38
    ec2.us-west-1.amazonaws.com
  • 00:07:40
    endpoint meanwhile if you wanted to use
  • 00:07:43
    EC2 in us-east-2 then
  • 00:07:46
    you would use a separate endpoint and the
  • 00:07:49
    reason is there's no single global ec2
  • 00:07:53
    there's a separate stack in every region
  • 00:07:56
    and these are isolated instantiations
  • 00:07:58
    and they don't know about each other
  • 00:08:01
    so this is a shared nothing architecture
  • 00:08:03
    which gives us the ultimate in blast
  • 00:08:05
    radius protection so there's our ec2 you
  • 00:08:10
    know Amazon Elastic Compute cloud
  • 00:08:11
    service in us-west-1 and us-east-2
  • 00:08:14
    they don't talk to each other they don't
  • 00:08:16
    know about each other same is true for a
  • 00:08:18
    service like Amazon SQS Amazon Sage
  • 00:08:20
    maker and and so on now you might be
  • 00:08:24
    wondering well but there are multi
  • 00:08:27
    region or global features you know what
  • 00:08:30
    about those features things like s3
  • 00:08:34
    cross region replication or DynamoDB
  • 00:08:36
    global tables or ec2 virtual private
  • 00:08:40
    cloud peering so let me talk about how
  • 00:08:43
    we address those by going through an
  • 00:08:45
    example so there
  • 00:08:46
    is an example of inter-regional private
  • 00:08:49
    cloud peering so virtual private cloud
  • 00:08:51
    or VPC basically lets you provision the
  • 00:08:54
    logically isolated section of the AWS
  • 00:08:57
    cloud where you can launch resources in
  • 00:09:00
    a virtual network that you define you
  • 00:09:02
    define the IP addresses in your virtual
  • 00:09:03
    network configure route tables gateways
  • 00:09:06
    and so forth and with inter region VPC
  • 00:09:10
    peering you can actually take two v pcs
  • 00:09:12
    in two different regions and virtually
  • 00:09:15
    connect them so they can communicate
  • 00:09:17
    without having to go over the the public
  • 00:09:20
    internet so here's our V PC peering
  • 00:09:24
    connection now setting one of these
  • 00:09:26
    things up involves a workflow and a
  • 00:09:28
    setup approvals particularly because
  • 00:09:30
    they can be v pcs owned by different
  • 00:09:32
    customers so you want both customers to
  • 00:09:34
    agree to establishing this connection
  • 00:09:37
    and that involves configuration changes
  • 00:09:39
    in the V PC configuration on each of
  • 00:09:44
    these in each of these regions but these
  • 00:09:47
    EC2 control planes don't talk to
  • 00:09:49
    each other so how do we accomplish this
  • 00:09:51
    the answer we have a dedicated service
  • 00:09:55
    called the cross region Orchestrator
  • 00:09:56
    that sits on top of these systems and
  • 00:10:01
    implements the workflow manages through
  • 00:10:03
    the approval process and comes through
  • 00:10:05
    the sort of the front door of EC2 just
  • 00:10:08
    like any other API would and it also has
  • 00:10:11
    a certain number of safety features it
  • 00:10:13
    ring fences some of the interactions to
  • 00:10:16
    ensure that there is no possibility for
  • 00:10:18
    multiple regions to be impacted by you
  • 00:10:20
    know whatever issue there could be at
  • 00:10:22
    the same time so we can preserve that
  • 00:10:24
    single region blast radius so kind of a
  • 00:10:29
    recap here we've got all these regions
  • 00:10:32
    and we're deeply committed to this
  • 00:10:34
    principle of region isolation so in the
  • 00:10:37
    event of say an earthquake in Japan
  • 00:10:39
    which we did experience a few years ago
  • 00:10:42
    there could be impact in that region
  • 00:10:44
    but it's going to be limited to that
  • 00:10:45
    region so in the worst case we have a
  • 00:10:48
    single region blast radius in case of a
  • 00:10:50
    failure that's still pretty big right
  • 00:10:54
    that's that's a lot of impact especially
  • 00:10:56
    if you're
  • 00:10:57
    a customer that sits only in that region
  • 00:11:00
    so obviously we want to do better so how
  • 00:11:04
    do we do better now I'm gonna talk about
  • 00:11:06
    availability zone independence so let's
  • 00:11:09
    drill into the design of a region kind
  • 00:11:11
    of double click on it and see how we
  • 00:11:13
    limit blast radius within a region so
  • 00:11:17
    there's our region a region is actually
  • 00:11:18
    composed of well before I talk about what
  • 00:11:21
    it's composed of a region sits in a
  • 00:11:23
    location right so at the macro scale its
  • 00:11:25
    Northern Virginia or Dublin or Frankfurt
  • 00:11:28
    and but it's not a single data center
  • 00:11:31
    it's it's actually a it's composed of
  • 00:11:35
    multiple data centers that are spread
  • 00:11:38
    across the metropolitan area of the
  • 00:11:41
    region and we call these different
  • 00:11:43
    locations availability zones and their
  • 00:11:44
    and their cross connected with
  • 00:11:47
    high-speed private fiber links now these
  • 00:11:53
    are far enough apart from each other
  • 00:11:54
    that there's a very very low possibility
  • 00:11:58
    for correlated failure except maybe in
  • 00:12:00
    the in the earthquake case so you can
  • 00:12:02
    think of them as miles apart and that
  • 00:12:06
    means if a tornado comes through it's
  • 00:12:07
    unlikely that it would hit multiple
  • 00:12:09
    facilities if there's a utility issue we
  • 00:12:13
    actually run off of different utility
  • 00:12:15
    suppliers in a region so that won't
  • 00:12:17
    affect multiple availability zones at
  • 00:12:18
    the same time and at the same time
  • 00:12:21
    they're close enough to each other so
  • 00:12:23
    again think about it as miles away they
  • 00:12:26
    don't have that pesky speed of light
  • 00:12:27
    issues you can think of them logically
  • 00:12:28
    being in the same place run things like
  • 00:12:31
    synchronous replication protocols and so
  • 00:12:32
    forth without any latency penalty so
  • 00:12:37
    each availability zone is basically n
  • 00:12:40
    data centers it's not just one data
  • 00:12:41
    center in some so in some cases because
  • 00:12:44
    we keep the size of the data center set
  • 00:12:48
    to a fixed maximum size in larger AZ's
  • 00:12:51
    we might have multiple buildings for an
  • 00:12:53
    AZ and then we have n of those
  • 00:12:56
    availability zones per region usually
  • 00:12:58
    three or more in fact in all future
  • 00:13:01
    region builds will have at least three
  • 00:13:03
    and that's useful first for certain
  • 00:13:06
    distributed systems consensus
  • 00:13:08
    protocols that
  • 00:13:10
    are best suited when you have these
  • 00:13:11
    three locations so you can get to
  • 00:13:14
    consensus agreement across them globally
  • 00:13:17
    we have 19 regions with 57 total
  • 00:13:21
    availability zones and five more regions
  • 00:13:23
    coming online with 15 more AZ's so with
  • 00:13:28
    this architecture of AZ's within a
  • 00:13:30
    region we now have the possibility to
  • 00:13:32
    reduce the blast radius because we
  • 00:13:35
    reduce the possibility of a correlated
  • 00:13:37
    failure across the entire region and you
  • 00:13:40
    can take advantage of this multi a-z
  • 00:13:42
    architecture in your own applications just
  • 00:13:44
    like we do in AWS by using a multi
  • 00:13:46
    a-z architecture for your application so
  • 00:13:49
    let me go through a quick example here
  • 00:13:51
    super simple so you have your
  • 00:13:54
    application it runs on a set of
  • 00:13:55
    instances that you've deployed across
  • 00:13:57
    multiple AZ's and then run an elastic
  • 00:14:00
    load balancer to load balance the traffic
  • 00:14:02
    across them and then behind the scenes
  • 00:14:04
    are using you know a relational database
  • 00:14:06
    maybe for your persistence and that set
  • 00:14:09
    up in a multi a Z primary standby pair
  • 00:14:12
    so if you have a failure in one of the
  • 00:14:14
    availability zones the elastic load
  • 00:14:16
    balancer will detect that and stop
  • 00:14:18
    sending traffic to the failed AZ or the
  • 00:14:20
    instances in the failed a Z
  • 00:14:23
    meanwhile the application is connected
  • 00:14:26
    to the master database that has gone
  • 00:14:28
    away maybe because there's a power failure
  • 00:14:29
    or a network failure but you could fail
  • 00:14:33
    that database over to the to the healthy
  • 00:14:38
    AZ which happens automatically with
  • 00:14:40
    Amazon RDS and then your application is
  • 00:14:43
    up and running again
  • 00:14:43
    and these failure detection events could
  • 00:14:47
    happen fairly quickly so there might be
  • 00:14:49
    a slight hiccup in the operation of your
  • 00:14:51
    application but otherwise it's almost
  • 00:14:53
    like a non-event that you lost you know
  • 00:14:55
    one of the data centers in your
  • 00:14:56
    application so this is how all of our
  • 00:14:58
    services that run regionally in AWS are
  • 00:15:03
    designed and operated it's a really
  • 00:15:07
    powerful model cuz it gives you this
  • 00:15:08
    fault tolerance and it basically means
  • 00:15:11
    that you can withstand an AZ failure and
  • 00:15:14
    not have any impact on your customers
  • 00:15:16
    it's also a powerful design for
  • 00:15:18
    durability so s3 uses the multi AZ model
  • 00:15:21
    to get its eleven nines
  • 00:15:22
    of durability and so with this
  • 00:15:25
    architecture you can basically get to
  • 00:15:26
    zero blast radius when you have a data
  • 00:15:28
    center and get hit by tornado or a
  • 00:15:31
    utility worker cutting a
  • 00:15:33
    fiber connection so what about the
  • 00:15:36
    services in AWS that are zonal so some
  • 00:15:39
    of our applications are some of our
  • 00:15:41
    services give you the opportunity in your
  • 00:15:43
    application to have zonal resources like
  • 00:15:46
    an ec2 instance or an EBS volume you
  • 00:15:50
    decide which AZ these live in and that's
  • 00:15:52
    part of your story for creating a
  • 00:15:54
    resilient application
  • 00:15:56
    these are zone specific resources and
  • 00:16:00
    for minimum blast radius considerations
  • 00:16:04
    we actually have a zone local control
  • 00:16:06
    plane for the resources in each of our
  • 00:16:09
    AZs with a principle of availability
  • 00:16:12
    zone independence so we talk about
  • 00:16:15
    control planes real quickly if you're
  • 00:16:16
    not familiar with the term so control
  • 00:16:18
    plane is the thing that you interact
  • 00:16:19
    with to administer these resources
  • 00:16:21
    whether they're zonal or regional and
  • 00:16:25
    then the data plane is the thing you
  • 00:16:26
    interact with to actually do the work
  • 00:16:28
    you want to get accomplished so some
  • 00:16:30
    examples are here for Amazon ec2 the
  • 00:16:32
    control plane is what handles run
  • 00:16:34
    instances so it takes your API request
  • 00:16:36
    and then does all the work necessary to
  • 00:16:39
    launch the VM as per your instructions
  • 00:16:41
    in your API and then the data plane for
  • 00:16:44
    you to do is basically the instance
  • 00:16:46
    right
  • 00:16:47
    the thing you SSH into or the run your
  • 00:16:49
    application on it's also the network
  • 00:16:51
    that attaches your EC2 instance
  • 00:16:54
    to other instances in your VPC so
  • 00:16:59
    let's talk about AZI or availability
  • 00:17:00
    zone independence for these things so
  • 00:17:02
    the data plane runs in each of these
  • 00:17:04
    AZ's and these are isolated from each
  • 00:17:07
    other they can obviously communicate
  • 00:17:08
    over a network but otherwise the data
  • 00:17:09
    planes don't really know about each
  • 00:17:11
    other and the same is true of the
  • 00:17:13
    control planes for these for these
  • 00:17:16
    resources now I mentioned earlier
  • 00:17:20
    there's a regional endpoint to access
  • 00:17:22
    ec2 so there must be some layer that you
  • 00:17:25
    connect to that sits on top of these and
  • 00:17:27
    there is there's a regional control
  • 00:17:28
    plane that acts as the entry point and
  • 00:17:30
    also handles things that are
  • 00:17:33
    zone specific so for EC2 things like
  • 00:17:36
    security groups are not zone specific it
  • 00:17:40
    also aggregates APIs like describe
  • 00:17:42
    instances so if you want to find out
  • 00:17:44
    about all your instances in a region it
  • 00:17:46
    will need to interrogate the control
  • 00:17:48
    planes across all the AZs so in this
  • 00:17:52
    model again if you lose an AZ you're
  • 00:17:55
    gonna lose your zonal resources but
  • 00:17:56
    you've expected that anyway but the data
  • 00:17:59
    planes in the other AZs are fine the
  • 00:18:01
    control planes for those zones are fine
  • 00:18:03
    and the regional control plane
  • 00:18:06
    is built to be multi AZ fault tolerant
  • 00:18:08
    so it will also continue to operate fine
  • 00:18:12
    except for the fact that it won't be
  • 00:18:14
    able to service requests that target the
  • 00:18:15
    zone that's down so if you have API
  • 00:18:18
    calls in to one of the healthy AZs
  • 00:18:20
    to launch an instance there it should be
  • 00:18:23
    able to route around this particular
  • 00:18:25
    type of AZ failure
  • 00:18:30
    okay so let's review the blast radius
  • 00:18:32
    improvements that we get from
  • 00:18:33
    availability zones so here's a regional
  • 00:18:36
    service spread across three AZ's if zone
  • 00:18:41
    a fails regional service is fine because
  • 00:18:43
    it's able to fail away from from the
  • 00:18:45
    failed AZ meanwhile our zonal service
  • 00:18:49
    has an impact just in that zone but
  • 00:18:51
    not the other zones the other zones are
  • 00:18:53
    isolated and they won't have any impact
  • 00:18:55
    these are all these are all good things
  • 00:18:59
    the theoretical blast radius is a
  • 00:19:01
    different story so by a theoretical
  • 00:19:02
    blast radius I mean in the worst case
  • 00:19:05
    scenario the thing that you know the
  • 00:19:07
    Black Swan event like what's the worst
  • 00:19:09
    case that could happen here for the
  • 00:19:11
    regional service it is the entire
  • 00:19:13
    service and that's the thing that we
  • 00:19:15
    lose sleep at night every night thinking
  • 00:19:17
    about for the zonal service it's it's
  • 00:19:20
    still the zone so there's something nice
  • 00:19:22
    about this property of these
  • 00:19:23
    availability zones we still don't like
  • 00:19:25
    that there's a non infrastructure event
  • 00:19:27
    that could take out a service in a zone
  • 00:19:29
    but it is still nice that it's contained
  • 00:19:31
    to the zone can we get the same kind of
  • 00:19:33
    resilience in our regional service
  • 00:19:36
    without having to sort of target it at a
  • 00:19:38
    in a zone local kind of way
  • 00:19:41
    so let's take a step back and look at
  • 00:19:42
    this abstracted architecture so there's
  • 00:19:46
    an entry point a regional entry point
  • 00:19:49
    into the service it has an aggregation
  • 00:19:52
    layer that might do a few things but
  • 00:19:54
    mostly it's a routing layer into a set
  • 00:19:56
    of compartmentalize resources and then
  • 00:19:59
    there's failure isolation between them
  • 00:20:01
    so in the availability zone case these
  • 00:20:03
    are AZ's down here and then a regional
  • 00:20:06
    control plane that accesses them but
  • 00:20:09
    more generally there's this
  • 00:20:10
    compartmentalization that is giving us
  • 00:20:13
    this nice fault isolation can we use
  • 00:20:16
    that in a in a different way in a
  • 00:20:18
    different dimension than AZ's and get
  • 00:20:21
    some smaller blast radius there's
  • 00:20:24
    another way to think about this
  • 00:20:25
    abstracted architecture which is how
  • 00:20:28
    they build ships so for centuries now
  • 00:20:31
    ships have been built with these
  • 00:20:32
    watertight compartments that are
  • 00:20:35
    separated by bulkheads and the reason
  • 00:20:38
    that this is that if there's a there's
  • 00:20:39
    damage to the hull and it causes
  • 00:20:40
    flooding the flooding is contained into
  • 00:20:43
    one of those compartments and the rest
  • 00:20:46
    of the ship is still intact and the ship
  • 00:20:48
    stays afloat those bulkheads also
  • 00:20:51
    provide structural integrity these are
  • 00:20:53
    both nice properties right you have this
  • 00:20:56
    fault tolerance minimized impact of
  • 00:20:59
    failure and higher structural integrity
  • 00:21:03
    and so we've taken these ideas and
  • 00:21:06
    applied them to our regional services
  • 00:21:08
    and what we call cellular architecture
  • 00:21:10
    or cell based architecture so let me go
  • 00:21:15
    through our simple example again of a
  • 00:21:18
    application with a load balancer compute
  • 00:21:20
    and some storage and it's not shown here
  • 00:21:24
    but this is an application that is
  • 00:21:25
    running in multiple AZs and has the
  • 00:21:27
    the failover as I described before so
  • 00:21:31
    in cell based architecture we take this
  • 00:21:34
    service stack configuration and we
  • 00:21:37
    create multiple instantiations of it and
  • 00:21:43
    these are fully isolated and don't know about
  • 00:21:44
    each other and each one of these stacks
  • 00:21:48
    is what we call a cell
  • 00:21:52
    and then what we'll take our workload
  • 00:21:53
    and basically load balance it partition
  • 00:21:57
    it over these these cells one way to do
  • 00:21:59
    that might be by customer so we'll put
  • 00:22:01
    this section into cell zero the section
  • 00:22:03
    into cell one this section into cell n
  • 00:22:07
    now you guys all look nice but maybe
  • 00:22:10
    there's someone naughty in here it's
  • 00:22:11
    gonna cause us a problem they're gonna
  • 00:22:12
    only gonna cause the problem in that
  • 00:22:14
    one cell the other cells are gonna be
  • 00:22:16
    fine now we need some way to contain
  • 00:22:19
    this thing or make this thing look like
  • 00:22:21
    a single service still so we put a cell
  • 00:22:23
    router on top of it that makes those
  • 00:22:24
    routing decisions and that whole thing
  • 00:22:28
    is what we call a cell based service so
  • 00:22:32
    the cells are an internal structure
  • 00:22:33
    that's invisible to you as a customer
  • 00:22:35
    but provide resilience and fault
  • 00:22:38
    tolerance and this looks just like the
  • 00:22:41
    picture for azi or availability zone
  • 00:22:44
    independence but on a different
  • 00:22:46
    dimension so let's talk about what that
  • 00:22:49
    looks like
  • 00:22:49
    so here's the regional service getting
  • 00:22:52
    resilience from availability zones
  • 00:22:54
    here's that same service divided into
  • 00:22:56
    cells and you can see now that with an
  • 00:23:00
    availability zone failure both services
  • 00:23:02
    are resilient to that because their
  • 00:23:03
    fault tolerant across AZ's and the other
  • 00:23:08
    examples of failure the ones that we
  • 00:23:10
    lose sleep at at night over the failure
  • 00:23:13
    in their in the cell based service it's
  • 00:23:15
    contained to the cell so the impact is 1
  • 00:23:17
    over n where n is the number of cells
  • 00:23:19
    rather than the whole service and I've
  • 00:23:22
    shown three here for the purposes of
  • 00:23:25
    presentation but the number of cells can
  • 00:23:27
    actually be much higher so one over n
  • 00:23:29
    could be a fairly small percentage of
  • 00:23:30
    the overall set of workloads that you're
  • 00:23:33
    supporting now that this approach can
  • 00:23:38
    also be applied to zonal services and we
  • 00:23:40
    do this in our ec2 control plane where
  • 00:23:44
    you divide each of the zonal services
  • 00:23:45
    into cells as well so you still have a
  • 00:23:48
    failure as you'd expect if an
  • 00:23:50
    availability zone goes down what's
  • 00:23:52
    interesting though is as I mentioned
  • 00:23:53
    some availability zones or multiple data
  • 00:23:55
    centers and so you can actually have a
  • 00:23:57
    smaller blast radius
  • 00:23:58
    in certain cases if your zonal cells are
  • 00:24:01
    aligned with the physical infrastructure
  • 00:24:03
    which is the case with our zonal
  • 00:24:05
    easy to control plane services that's
  • 00:24:08
    actually a nice improvement and then
  • 00:24:10
    again for the other failure cases that
  • 00:24:13
    we worry about there is also a smaller
  • 00:24:15
    theoretical blast radius even for the
  • 00:24:17
    zonal services so let's look at the
  • 00:24:21
    system properties of a cell based
  • 00:24:23
    architecture I've talked about some
  • 00:24:26
    of them the first one is workload
  • 00:24:28
    isolation and this is useful not just
  • 00:24:30
    for failures but also just noisy
  • 00:24:31
    neighbor problems and then of course
  • 00:24:35
    there's the the failure containment so
  • 00:24:37
    if we lose a cell the other cells are
  • 00:24:38
    fine there's also this nice property
  • 00:24:43
    that is really powerful really important
  • 00:24:45
    to us at AWS which is how we scale these
  • 00:24:47
    things
  • 00:24:48
    so rather than scaling up a service
  • 00:24:52
    which is sort of the traditional way
  • 00:24:53
    just add you know more and more capacity
  • 00:24:55
    to it in a cell based architecture you
  • 00:24:59
    could also add more capacity to cells
  • 00:25:00
    but one of the things that we include in
  • 00:25:04
    our goals for a cell based architecture is
  • 00:25:06
    that cells like our data centers have a
  • 00:25:09
    maximum size we won't let them grow past
  • 00:25:10
    a certain point and so if we need to
  • 00:25:13
    continue to grow a system rather than
  • 00:25:15
    growing the cells past that point we'll
  • 00:25:18
    add another cell so you grow the system
  • 00:25:20
    by scaling it out with more and more
  • 00:25:22
    cells the fact that the cells have a
  • 00:25:25
    maximum size means you can test them at
  • 00:25:28
    that maximum size with a reasonable test
  • 00:25:30
    configuration so you can test them to
  • 00:25:33
    failure you can do all sorts of stress
  • 00:25:34
    testing and get confidence that you
  • 00:25:36
    understand how that piece of the system
  • 00:25:38
    is going to operate as you you know as
  • 00:25:41
    you get more and more demand on your
  • 00:25:43
    system these cells are also more
  • 00:25:46
    manageable because they're smaller so if
  • 00:25:47
    there's some issue you need to look
  • 00:25:49
    through logs or otherwise inspect the
  • 00:25:52
    nodes in the cell it's gonna be smaller
  • 00:25:55
    it's just a piece of your system rather
  • 00:25:56
    than the whole system so it's gonna be
  • 00:25:58
    easier to work through so let's now talk
  • 00:26:03
    about some of the core considerations in
  • 00:26:05
    a cell based architecture so the first
  • 00:26:09
    is cell size
  • 00:26:14
    sort of the trade-off here is you can
  • 00:26:16
    have a large number of smaller cells or
  • 00:26:19
    a smaller number of large cells in the
  • 00:26:23
    the case where you have smaller cells
  • 00:26:25
    that's nice because now your blast
  • 00:26:26
    radius is you know that much smaller and
  • 00:26:29
    those smaller things are easy to test
  • 00:26:31
    easier to break to understand what their
  • 00:26:33
    breaking points are and easier to operate in
  • 00:26:36
    terms of figuring out if there's an
  • 00:26:37
    issue you know smaller number of nodes
  • 00:26:39
    to go in and take a peek at on the other
  • 00:26:43
    end of the spectrum though larger cells
  • 00:26:45
    have some good properties which is first
  • 00:26:47
    if there's a fixed cost to each of these
  • 00:26:49
    which often there is you know maybe
  • 00:26:51
    there's a separate load balancer for each
  • 00:26:53
    one of them then you get cost efficiency
  • 00:26:55
    by having fewer of them
  • 00:26:57
    you also get reduced splits so that this
  • 00:27:00
    is an important consideration if if
  • 00:27:03
    we're dividing our workload by a
  • 00:27:06
    customer some of our customers might be
  • 00:27:07
    large they have a large number of
  • 00:27:09
    workloads and that may be too large to
  • 00:27:12
    fit in a smaller cell if we're using
  • 00:27:16
    larger cells we may be able to fit that
  • 00:27:18
    larger customer into a single cell right
  • 00:27:20
    and not have to worry about dealing with
  • 00:27:21
    the complexity of splitting across
  • 00:27:23
    multiple cells and finally as a whole
  • 00:27:26
    this system that has fewer cells is
  • 00:27:29
    easier to operate because you it's
  • 00:27:30
    easier to think about easier to look at
  • 00:27:31
    dashboards and so forth there's no right
  • 00:27:34
    answer here except that all things being
  • 00:27:36
    equal we will always prefer the lower
  • 00:27:39
    blast radius over these other
  • 00:27:41
    considerations another which I'm sure
  • 00:27:46
    you've been thinking about looking at
  • 00:27:49
    this diagram is well what about the
  • 00:27:50
    router the cell router is this remaining
  • 00:27:53
    shared component across the entire
  • 00:27:56
    system and so it's really important that
  • 00:27:58
    that thing not fail because now you're
  • 00:28:00
    back to the regional blast radius and so
  • 00:28:04
    we spent a lot of effort making sure
  • 00:28:06
    that that component is stress tested and
  • 00:28:11
    battle-hardened so that we know when it
  • 00:28:13
    we have high confidence that even in the
  • 00:28:16
    Black Swan scenarios that it's going to
  • 00:28:17
    stay resilient and stay up and and one
  • 00:28:21
    of the ways to accomplish that is to
  • 00:28:22
    keep it as simple as possible
  • 00:28:25
    so when we talk to teams about adopting
  • 00:28:27
    a cell based architecture we call this
  • 00:28:29
    component the thinnest possible layer to
  • 00:28:30
    kind of reinforce that it should be really
  • 00:28:33
    simple and yeah that's all I'll say
  • 00:28:38
    about that another consideration is
  • 00:28:42
    partitioning dimension so I've talked
  • 00:28:44
    about how we might divide cells along
  • 00:28:48
    lines of customers but then in the ec2
  • 00:28:51
    control plane case there's an aspect of
  • 00:28:54
    the control plane that actually is cell
  • 00:28:56
    based based on physical infrastructure
  • 00:28:58
    in our data centers which makes sense
  • 00:29:00
    for that application another scenario we
  • 00:29:06
    may divide not by customer but by VPC
  • 00:29:08
    especially because sometimes V pcs may
  • 00:29:11
    have cross customer scenarios and so
  • 00:29:14
    it takes some analysis to decide what's
  • 00:29:16
    the right way to carve this thing up and
  • 00:29:19
    the recommendation I always use is cut
  • 00:29:21
    with the grain and if you don't know
  • 00:29:23
    what that means then think about it this
  • 00:29:25
    way that you know wood has a certain
  • 00:29:26
    grain and it's easy to split along one
  • 00:29:28
    dimension and really hard to split cross
  • 00:29:31
    across the grain and every system has
  • 00:29:34
    has a natural grain to it another
  • 00:29:40
    consideration is what I call cross cell
  • 00:29:42
    use cases these may be unavoidable the
  • 00:29:47
    goal is to keep them to a minimum
  • 00:29:48
    because that adds complexity to the
  • 00:29:51
    thinnest possible layer and also
  • 00:29:53
    increases the blast radius for those
  • 00:29:56
    those operations one example to scatter
  • 00:29:59
    gatherer queries so what this means is
  • 00:30:01
    there may be an API that comes in that
  • 00:30:03
    needs to interrogate multiple cells so
  • 00:30:07
    it scatters requests out and then gathers
  • 00:30:09
    responses and sends out a single reply so
  • 00:30:11
    an example in ec2 is the describe
  • 00:30:13
    instances case I mentioned earlier
  • 00:30:15
    then there are batch operations so if you
  • 00:30:18
    need to execute work on multiple cells
  • 00:30:20
    in a single operation so again in EC2
  • 00:30:23
    maybe the terminate instances API where
  • 00:30:25
    you can send multiple instance IDs
  • 00:30:27
    that can be a cross cell use case the
  • 00:30:33
    last and probably hardest is
  • 00:30:34
    coordinated writes where you're actually
  • 00:30:36
    trying to write atomically
  • 00:30:38
    across
  • 00:30:39
    multiple cells those require careful
  • 00:30:42
    consideration and one example of that is
  • 00:30:44
    cell migration so cell migration is when
  • 00:30:47
    you relocate a workload from one cell to
  • 00:30:50
    another it's maybe we decided this
  • 00:30:52
    customer we're gonna move that customer
  • 00:30:53
    into cell two from cell 1 and you may
  • 00:30:58
    choose do this because you want to
  • 00:31:00
    manage the amount of load or heat that's
  • 00:31:02
    on each cell or maybe just want to load
  • 00:31:05
    balance the sizes of them or maybe you've
  • 00:31:07
    added a cell and you need to and
  • 00:31:09
    your approach for adding cells involves
  • 00:31:12
    you know moving existing workloads over
  • 00:31:14
    into the new cell and the process that
  • 00:31:17
    you used to do the migration is not
  • 00:31:20
    unlike a VM migration so if you're
  • 00:31:22
    familiar with how VM migration works
  • 00:31:24
    basically there's an invisible clone that
  • 00:31:27
    gets created in the target location and
  • 00:31:29
    it gets brought up to date and
  • 00:31:32
    synchronized with the source of course
  • 00:31:35
    the source is still changing so this
  • 00:31:37
    could take a while for it to get close
  • 00:31:40
    to being in sync and at the last possible
  • 00:31:42
    moment
  • 00:31:42
    both are frozen for the final completion
  • 00:31:46
    of the syncing and then an atomic flip
  • 00:31:48
    over to the to the target location that
  • 00:31:52
    works for VMs and that's the same
  • 00:31:53
    approach that we use for migrating
  • 00:31:55
    workloads across cells it requires a
  • 00:31:58
    careful coordination best-managed at the
  • 00:32:02
    router level we have a few approaches
  • 00:32:05
    that we've been using to accomplish this
  • 00:32:09
    so again with the cell based structure
  • 00:32:13
    we're able to reduce the blast radius
  • 00:32:14
    from 100 percent down to one over n
  • 00:32:17
    where n is the number of cells
  • 00:32:18
    I should reinforce that the events that
  • 00:32:23
    cause these types of outages are
  • 00:32:25
    exceedingly rare we could spend a long
  • 00:32:32
    time not even having to worry about the
  • 00:32:34
    kind of failure happening because of all
  • 00:32:35
    the other things that AWS does but
  • 00:32:38
    we're so focused on resilience that
  • 00:32:40
    we're investing additional engineering
  • 00:32:41
    work to get to that picture on the right
  • 00:32:44
    which is a smaller blast radius even
  • 00:32:46
    when those Black Swan events occur
  • 00:32:50
    so cells are great but there's another
  • 00:32:56
    technique that we've been using that is
  • 00:32:57
    even more impressive and more exciting I
  • 00:33:01
    think which is called shuffle sharding
  • 00:33:04
    and shuffle sharding is a technique that
  • 00:33:06
    is like cell based architectures and
  • 00:33:09
    it's particularly useful in stateless or
  • 00:33:12
    soft state services so we'll be walk
  • 00:33:16
    through what shuffle sharding looks like
  • 00:33:17
    so here's another simple service we've
  • 00:33:20
    got eight nodes and these eight nodes are
  • 00:33:23
    handling requests that are sent by a
  • 00:33:26
    load balancer and then we have eight
  • 00:33:29
    different customers that are sending
  • 00:33:30
    requests so we're in Vegas I use some
  • 00:33:33
    some gambling relevant icons here
  • 00:33:36
    to represent our different customers
  • 00:33:39
    and let's imagine one of them Diamond
  • 00:33:42
    here is is introducing a bad workload
  • 00:33:45
    for whatever reason maybe it's expensive
  • 00:33:46
    request maybe it's one of these requests
  • 00:33:50
    that triggers a bug in the system so
  • 00:33:53
    Dimon sends a request in and that
  • 00:33:56
    request causes one of our servers to
  • 00:33:57
    crash okay that's all right we've got
  • 00:34:00
    seven others maybe Dimon will go away or
  • 00:34:03
    change what it's doing probably not
  • 00:34:06
    it'll probably keep retrying and
  • 00:34:09
    eventually take out the whole system so
  • 00:34:11
    here our blast radius is basically all
  • 00:34:14
    the customers this is like the worst
  • 00:34:16
    case scenario we really want to avoid
  • 00:34:18
    this this is where we go to cell based
  • 00:34:21
    architecture so we divide our our
  • 00:34:23
    customers assign a subset of customers
  • 00:34:25
    to each cell
  • 00:34:28
    now when diamond comes along and causes
  • 00:34:31
    problems that problem is contained just
  • 00:34:34
    to the cell this is a 4x improvement
  • 00:34:38
    right we've gone from 100% down to 25%
  • 00:34:40
    the blast radius is the number of
  • 00:34:41
    customers divided by the number of cells
  • 00:34:44
    and again we could improve that further
  • 00:34:46
    by adding more cells or in a system
  • 00:34:50
    where we don't need to really worry
  • 00:34:52
    about which customers land on which
  • 00:34:54
    nodes we could shuffle shard them which
  • 00:34:58
    is a little bit different and it's
  • 00:34:59
    nuanced but you'll see shortly how
  • 00:35:01
    powerful this is
  • 00:35:03
    so we'll take each customer and assign
  • 00:35:05
    them to two nodes effectively at random
  • 00:35:08
    not really at random we'll use hash
  • 00:35:10
    functions so it's predictable where
  • 00:35:15
    these customers land on these nodes
  • 00:35:16
    but basically we've assigned them at
  • 00:35:18
    random so diamond gets assigned to the
  • 00:35:20
    first and fourth nodes we'll put spades
  • 00:35:23
    on those two our roll of two on the dice
  • 00:35:27
    goes to those two nodes and so on
  • 00:35:29
    so these are basically shuffled randomly
  • 00:35:33
    across our set of capacity so now again
  • 00:35:37
    diamond comes along takes out the two
  • 00:35:39
    nodes that are assigned to it but here's
  • 00:35:41
    where it gets interesting look at who's
  • 00:35:44
    sharing those nodes with diamond one of
  • 00:35:47
    them is hearts hearts however has a
  • 00:35:52
    second node that it's assigned to that's
  • 00:35:55
    not impacted by the outage so as
  • 00:35:56
    long as that customer retries it's
  • 00:35:59
    fault-tolerant even though one of its
  • 00:36:02
    nodes is down one of its nodes is up it's
  • 00:36:04
    able to continue operation the same is
  • 00:36:06
    true on the other node where clubs has
  • 00:36:09
    that same property so in this case our
  • 00:36:14
    blast radius is actually the number of
  • 00:36:16
    customers divided by the number of
  • 00:36:17
    combinations of two nodes out of
  • 00:36:21
    eight which turns out there are 28 of
  • 00:36:25
    them which is 3.6 percent so if we had a
  • 00:36:29
    much larger number of customers we'd
  • 00:36:30
    expect these are well distributed
  • 00:36:33
    randomly you would have 3.6%
  • 00:36:35
    of customers impacted by the failure at
  • 00:36:38
    that I showed meanwhile less than half
  • 00:36:42
    of the customers would be in that
  • 00:36:43
    scenario where they're sharing at least
  • 00:36:45
    one of the nodes so they may see a
  • 00:36:47
    little bit of impact a little bit of
  • 00:36:48
    hiccup but they're fine so we went from
  • 00:36:51
    25 percent down to 3.6 percent going
  • 00:36:55
    from the cell-based down to shuffle
  • 00:36:56
    sharding now this is a small system oh I
  • 00:36:59
    should show you the math so the math
  • 00:37:00
    here is probably as you remember from
  • 00:37:01
    high school the binomial coefficient as
  • 00:37:06
    you look at this math to realize as n
  • 00:37:08
    grows our number of combinations grows
  • 00:37:12
    really quickly so let's say we go from 8
  • 00:37:14
    nodes up to 100
  • 00:37:15
    that's not like a huge number of nodes it's
  • 00:37:18
    a reasonable number to run in a
  • 00:37:19
    large-scale system so say we have a
  • 00:37:22
    hundred nodes and then we give each
  • 00:37:23
    customer five combinations we're sorry
  • 00:37:26
    five nodes to represent their
  • 00:37:28
    combination the math tells us that's
  • 00:37:31
    going to be 75 million different
  • 00:37:33
    combinations I think of it basically as
  • 00:37:35
    you know a deck of a hundred cards
  • 00:37:36
    there are you know 75 million
  • 00:37:39
    different combinations of cards you can
  • 00:37:40
    get by picking randomly from that
  • 00:37:42
    deck which is amazing because now you
  • 00:37:45
    can see all right 77 percent of
  • 00:37:47
    customers are not going to see any
  • 00:37:48
    impact when diamond comes along and
  • 00:37:50
    takes out it's five nodes but more
  • 00:37:52
    interestingly ninety-nine point eight
  • 00:37:55
    percent so those first three rows are
  • 00:37:58
    still gonna have a majority of their
  • 00:37:59
    nodes available so they're gonna have a
  • 00:38:01
    better chance than not to completely you
  • 00:38:05
    know route around that problem without
  • 00:38:06
    even having to retry and meanwhile that
  • 00:38:10
    very very very low percentage of
  • 00:38:12
    customers is basically the percentage of
  • 00:38:16
    customers that are going to be sharing
  • 00:38:17
    completely those same five nodes what's
  • 00:38:21
    magical about this and it's all in the
  • 00:38:23
    math is we've created a multi-tenant
  • 00:38:27
    system and then used the shuffle
  • 00:38:30
    sharding to create a single basically a
  • 00:38:32
    single tenant experience which is
  • 00:38:34
    obviously what AWS aspires to do now
  • 00:38:38
    this needs a fault tolerant client as I
  • 00:38:40
    mentioned so one that when it gets a
  • 00:38:42
    failure will retry but that's that's not
  • 00:38:44
    hard that's that's pretty common what's
  • 00:38:47
    interesting also is this not only works
  • 00:38:48
    for servers it can work for queues they
  • 00:38:51
    can work for other resources it's also
  • 00:38:54
    critically dependent on fixed
  • 00:38:56
    assignments so you're you're stuck with
  • 00:39:00
    the hand that we deal you if there's any
  • 00:39:04
    sort of failover like oh well your five
  • 00:39:05
    nodes are down I'll give you these five
  • 00:39:06
    then you get back into that old world
  • 00:39:10
    where now a problem can infect and
  • 00:39:13
    cascade across an entire system so it
  • 00:39:15
    really depends on those fixed
  • 00:39:16
    assignments and if I need some sort of
  • 00:39:18
    routing mechanism so either a shuffle
  • 00:39:20
    sharding aware router or dns can be
  • 00:39:25
    another so in some of our servers both
  • 00:39:27
    will hand a customer's
  • 00:39:28
    if ik DNS name and that will resolve to
  • 00:39:32
    the customer specific shuffle sharted
  • 00:39:34
    set for that customer which basically
  • 00:39:37
    gives them the routing for free cool so
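    To make the fixed-assignment point concrete, here is a minimal sketch (an illustrative assumption, not AWS's implementation) of deterministically deriving a customer's shard from their ID, so a retrying client always lands on the same small set of nodes and there is never a failover onto somebody else's shard.

```python
import hashlib

def shuffle_shard(customer_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Pick a fixed, repeatable subset of nodes for this customer."""
    # Rank every node by a hash of (customer, node); the ranking, and
    # therefore the shard, depends only on the customer ID, so it never
    # changes when nodes fail: no failover onto a different shard.
    ranked = sorted(nodes, key=lambda node: hashlib.sha256(
        f"{customer_id}:{node}".encode()).digest())
    return ranked[:shard_size]

nodes = [f"node-{i:03d}" for i in range(100)]
print(shuffle_shard("customer-42", nodes, 5))  # the same five nodes on every call
```

    A shuffle-sharding-aware router, or the per-customer DNS name mentioned above, would then resolve to exactly this set.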
  • 00:39:42
    Cool. So let's talk about the operational practices that we layer on top of these architectural techniques to achieve the lowest possible blast radius. The first is not even a practice but really a mindset, and maybe even a religion at AWS, which is probably best captured by Werner's blog from earlier this year about compartmentalization. That post is out there and you can just read it, so I won't try to do it justice here, but I'd wager that every new AWS engineer knows within their first week, if not their first day, that we never want to touch more than one zone at a time. This is so important because if we have availability zone fault isolation, or region isolation, as a core tenet of our blast radius reduction, that's going to go out the window if we introduce a correlated failure through some manual action or automated action on multiple of those at the same time.
  • 00:40:43
    The most common, and I guess most obvious, example of this is software deployments. Our software deployments are done in a staggered way, across zones and across regions, over time: quickly enough that we can get features out to customers, because we like to launch features, but slowly enough that we have confidence, as we push a change broader and wider, that it's not going to cause an issue. So we'll start slow, observe, test, and then maybe speed up as it goes out broader and broader. That's the case with cells, it's the case with availability zones, and it's the case with regions. And then within each of those deployment units we'll do a fractional deployment too, so a one-box test is the very first step for a service. For our EC2 deployments we'll start with maybe five or ten machines at first, verify that things are working, and then gradually speed up as it expands across the infrastructure.
  • 00:41:55
    In tandem with that we have a bunch of automated tests that run as part of the deployment, as well as canaries, test applications that mimic a real-world customer invoking APIs, and we monitor the results of those. If there's any problem, the deployment gets automatically rolled back, we look at what happened, and we either decide it was not an issue we need to worry about or fix the problem before we start the deployment again. This is really important; it's what we need to do to make sure we've not compromised the boundaries that we've put between cells and AZs and regions.
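    Purely as an illustration of the staggered, automated rollout described here (the wave names, bake times, and checks are hypothetical, not AWS's pipeline), a deployment can be modeled as ordered waves with a canary check and automatic rollback between them:

```python
import time

# Hypothetical wave definitions: one box first, then widen by AZ and region.
WAVES = [
    {"name": "one-box",    "targets": ["us-east-1a (1 host)"]},
    {"name": "one AZ",     "targets": ["us-east-1a"]},
    {"name": "one region", "targets": ["us-east-1"]},
    {"name": "remaining",  "targets": ["us-west-2", "eu-west-1"]},
]

def deploy(targets: list[str]) -> None:
    print(f"deploying to {targets}")

def canaries_healthy(targets: list[str]) -> bool:
    """Stand-in for the automated tests and canary metrics on the targets."""
    return True  # assume healthy for this sketch

def rollback(targets: list[str]) -> None:
    print(f"rolling back {targets}")

def staggered_rollout(bake_seconds: float = 0.1) -> None:
    completed: list[list[str]] = []
    for wave in WAVES:
        deploy(wave["targets"])
        time.sleep(bake_seconds)              # bake time before widening
        if not canaries_healthy(wave["targets"]):
            # Stop the rollout and undo everything touched so far.
            for targets in reversed(completed + [wave["targets"]]):
                rollback(targets)
            return
        completed.append(wave["targets"])
    print("rollout complete")

staggered_rollout()
```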
  • 00:42:35
    All of that deployment machinery is automated, including the rules about which tests have to succeed and the timing and the windows for when we progress to the next stage. But automation is key in general, not just for deployments. There are other things we do to manage our infrastructure, whether it's configuring network devices or propagating new credentials to the stacks. That could be done by hand, but humans are prone to error, and it's much better to automate it, because you can review whatever it is you're building in your automation, you can test it, it's predictable in how it's going to operate, and you can repeat it over and over again and know it's going to work the same way every time.
  • 00:43:25
    And then finally, as you may have heard, AWS, and Amazon in general, has a very strong philosophy around end-to-end ownership. Our service teams are composed of engineers who are builder-operators: as engineers we design the software, we build it, we test it, we're the ones that deploy it, and we're the ones that operate it and respond to issues in production. What this does is give us a wonderful feedback loop between design choices, what their impact is operationally, and the changes we need to make to avoid future problems when there is a failure we need to worry about. And so the correction-of-errors template that I mentioned earlier is usually filled out by engineers, or in partnership with engineers, where they think about the blast radius, so they're in a great position to then go and implement the changes that make sure the next time that event occurs the blast radius is cut in half or more.
  • 00:44:23
    So to wrap up: we've got a variety of containment and compartmentalization mechanisms that we use to reduce blast radius. It starts with regions and the strong isolation between them, then availability zones, then that alternate dimension of compartmentalization within availability zones, which is cells, and then the magic of shuffle sharding to get sort of virtual cells, taking advantage of the combinatorics of shuffle sharding. We protect that compartmentalization with operational practices: step-by-step, phased deployments that are automated, and service teams that are builder-operators, so they are close to the front lines and understand how their decisions impact the availability of their systems, all with the goal of reducing blast radius. So that's my talk. I hope you learned a few things, and I'm happy to take questions now if anyone has them; I think there are live microphones down the center aisles here.
  • 00:46:01
    Yeah, so the question was: is there an example of Murphy's Law where we thought we had everything nailed down, everything sorted out, that we had answered all the possible failure modes? It goes back to my slide about the bit flips on the networking side; this is an example from many, many years ago. S3 cares deeply about data, cares deeply about the integrity of data, and there are many layers of checksumming in S3, so if there is an errant bit flip introduced, we make sure we detect it and it's not an issue. In 2008 we had an event in S3 where there was one network card on one server that every now and then was flipping one bit. There is a layer of the system that handles the group communication across the system; it uses gossip protocols so it can understand the state of all the servers in the system, whether they're healthy or not. Well, the gossip protocol noticed a funny server name in a packet, because one of the bits had been flipped in the server name, a host it had never heard of before, and that triggered a much more expensive sort of reconciliation protocol. Long story short (you can read the long story; there's a post-mortem, still published somewhere on the status dashboard pages, that talks about the outage), it took down S3 completely. So one server, one NIC, one bit took down a regional service. We had checksumming all the way up and down the stack, and that was the only layer that didn't have the checksumming; someone had forgotten to add it there.
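    As a generic sketch of the per-layer integrity checking described in this story (not S3's actual gossip protocol), a message can carry a checksum that the receiver verifies before acting on any field such as a host name, so a single flipped bit gets dropped instead of triggering expensive reconciliation:

```python
import json
import zlib

def encode_gossip(payload: dict) -> bytes:
    """Serialize a gossip-style message with a CRC32 over the body."""
    body = json.dumps(payload, sort_keys=True).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body

def decode_gossip(message: bytes) -> dict:
    """Reject the message if any bit was flipped in transit."""
    checksum, body = int.from_bytes(message[:4], "big"), message[4:]
    if zlib.crc32(body) != checksum:
        raise ValueError("checksum mismatch: drop message, do not reconcile")
    return json.loads(body)

msg = encode_gossip({"host": "server-0042", "state": "healthy"})
corrupted = bytearray(msg)
corrupted[10] ^= 0x01          # simulate the one-bit flip from the story
try:
    decode_gossip(bytes(corrupted))
except ValueError as err:
    print(err)
```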
  • 00:48:09
    Yeah, so the follow-up question is: wasn't there a more recent event where human error was involved in an S3 incident? Yes, there was, and it comes back to the operational practices slides I was talking about: all your best plans can fall apart if whatever you're doing doesn't respect those fault isolation boundaries. In that case the engineer certainly knew our philosophy; it was more of a human-error mistake on the command line, which goes back to the importance of automation and the importance of testing those things.
  • 00:48:49
    Peter, yes: with everything that you talked about across the different layers, from regions to availability zones to cells to the shuffle sharding, for your customers, are they given to them, or are they for purchase as services? Good question. Regions are given to you, availability zones are given to you. Cells and shuffle sharding are sort of different things, in that they're like the watertight compartments inside a ship: you could ask the purser on the boat to give you a tour of the lower decks and maybe they'd show you the compartments, but otherwise you don't really care about that, and you get them for free; it's just part of the safety of the ship. So they're invisible, and they're free.
  • 00:49:39
    they're free I wanted to have a cell for
  • 00:49:51
    my own service that I'm making and I
  • 00:49:53
    actually care about aligning that cell
  • 00:49:55
    with a particular data center so you can
  • 00:50:05
    get that kind of affinity with your
  • 00:50:07
    zonal resources right so you can you can
  • 00:50:09
    get affinity with your ec2 instances and
  • 00:50:11
    your Cloud HSM instance and your EBS
  • 00:50:15
    volumes because we give you full control
  • 00:50:16
    over the place well and availability
  • 00:50:20
    zone you can think of as a lot
  • 00:50:22
    whole data center there they're one in
  • 00:50:23
    the same so a data center will always be
  • 00:50:28
    in one easy right an AZ will always be
  • 00:50:33
    one or more data centers you're saying
  • 00:50:36
    for the multiple data center or a Z's
  • 00:50:39
    exactly got it yeah no that that is if
  • 00:50:42
    that is not visible to you as a customer
  • 00:50:43
    you don't have control over that Thanks
  • 00:50:46
    Yep. So as you move from regions to cells, how does the version management of what you're deploying work? Is there any operational insight into what's being deployed where? Does it have any impact on managing operations, deploying in a region versus deploying at a shard level? Are there any complexities there? Yeah, so I think what you're asking is that there's this window of time, as we go through the progression of the deployment, when the versions may not be in sync across multiple locations; is that the core of your question? Yes. Yeah, that's definitely a consideration, and a deployment may take, depending on the system, several days to several weeks. In some operating infrastructure we're particularly careful about the pace of deployment; in fact, in some of the networking components for VPC it's a particularly delicate affair to roll out changes, and it might take a while. So yes, that is a complication we have to be mindful of: which version is running where.
  • 00:52:00
    Some questions regarding the global services: we talked a lot about regions and zones, but how are the global services organized with regard to control plane and data plane resiliency with respect to the regions or edge nodes? So is your question about regional services that have global features, or truly global services? For example, IAM. IAM, okay, that's a great question. IAM is a special case, because your account information and your credentials and so forth are available globally, so it is in a sense a global service. However, each region has a separate control plane, sorry, a separate data plane, that can operate completely disconnected from the source of truth for IAM; the control plane, however, is global. So there are particular considerations that we take with a service like that to make sure it continues to be available even in the face of some of the failures I talked about. I had it on a slide and forgot to mention it, but there is another philosophy we have, which is the separation between control plane and data plane, and ensuring that the data plane can continue to function even if there is a control plane issue. That's exactly the scenario with IAM: it's critically important, because a region can be disconnected from the source of truth, or there could be other issues with the control plane, but we want to make sure that you can still validate your credentials and so forth in each region and not have global impact. Can you go to the mic?
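    Here is a toy sketch of that control-plane/data-plane separation (all class and field names are made up for illustration; this is not IAM's design): writes go through a global control plane, while each region validates credentials from a local replica and keeps working when the control plane is unreachable.

```python
class GlobalControlPlane:
    """Source of truth for credential changes (writes)."""
    def __init__(self) -> None:
        self.credentials: dict[str, str] = {}
        self.reachable = True

    def put_credential(self, user: str, secret: str) -> None:
        if not self.reachable:
            raise ConnectionError("control plane unavailable")
        self.credentials[user] = secret


class RegionalDataPlane:
    """Serves reads (credential validation) from a local replica."""
    def __init__(self, control: GlobalControlPlane) -> None:
        self.control = control
        self.replica: dict[str, str] = {}

    def replicate(self) -> None:
        # Asynchronous replication from the source of truth when reachable.
        if self.control.reachable:
            self.replica = dict(self.control.credentials)

    def validate(self, user: str, secret: str) -> bool:
        # Validation uses only regional state: it keeps working when the
        # control plane is disconnected, at the cost of some staleness.
        return self.replica.get(user) == secret


control = GlobalControlPlane()
region = RegionalDataPlane(control)
control.put_credential("alice", "s3cret")
region.replicate()

control.reachable = False                   # lose the global control plane
print(region.validate("alice", "s3cret"))   # True: data plane keeps working
```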
  • 00:53:48
    So when you were talking about the cell structure and the shuffle sharding: with the cell stuff, you were talking about the control layer setting up the routing to cut with the grain of the service, the ways to break it up. When you move to shuffle sharding, is that done at the same level? Because obviously you lose that sort of control based on the characteristics of the service. Yeah, it's a good question. For cell-based services, usually what prevents you from shuffle sharding is that you have to have some control over state, so you can't sort of willy-nilly scatter state across the entire fleet. Some services don't have state, or they have state that they can cache locally and then operate fine; shuffle sharding works well for that latter scenario, where there doesn't need to be any particular node affinity. There's another sort of advanced topic here, which is kind of interesting: you can actually layer both of these things together. You could take a cell-based architecture and then shuffle-shard requests across the cells in a stateless system, to get a different kind of resilience property; that's something we're working on now with DynamoDB. So I don't know if I answered your question. Okay. I guess maybe another way to answer it is that shuffle sharding is not an improvement on cell-based in all cases; it's really system-specific when you might be able to apply it.
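    A minimal sketch of that layering idea, under the assumption of a stateless request path (illustrative names only, not a DynamoDB design): cells stay the isolation unit, and each customer is pinned to a fixed shuffle shard of cells rather than of individual hosts.

```python
import hashlib

CELLS = [f"cell-{i}" for i in range(12)]   # illustrative cell names

def cells_for(customer_id: str, width: int = 3) -> list[str]:
    """Pin a customer to a fixed subset of cells: their shuffle shard of cells."""
    ranked = sorted(CELLS, key=lambda cell: hashlib.sha256(
        f"{customer_id}:{cell}".encode()).digest())
    return ranked[:width]

def route(customer_id: str, request_id: str) -> str:
    """Stateless requests can go to any cell in the customer's shard."""
    shard = cells_for(customer_id)
    index = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % len(shard)
    return shard[index]

print(cells_for("customer-42"))          # the same three cells every time
print(route("customer-42", "req-001"))   # one of those three cells
```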
  • 00:55:30
    Cool. Okay, it looks like there are no more questions. Thanks again for coming, I hope you learned something, and enjoy your week at re:Invent.
  • 00:55:40
    [Applause]
Tags
  • AWS
  • Distributed Systems
  • Resilience
  • Failure Management
  • CAP Theorem
  • Murphy's Law
  • Operational Practices
  • Shuffle Sharding
  • Cell-Based Architecture
  • Availability Zones