AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
Summary
TL;DR: In his talk, Peter Vosshall, a Distinguished Engineer at AWS, explores how AWS minimizes the blast radius of failures in distributed systems. He discusses fundamental constraints that shape system design, such as the speed of light, the CAP theorem, and Murphy's Law. Vosshall outlines techniques employed by AWS, including region isolation, availability zone independence, cell-based architecture, and shuffle sharding, to contain failures and reduce their impact on customers. He emphasizes the importance of operational practices, such as staggered deployments and automation, in maintaining system resilience. The talk aims to provide insight into building resilient systems and offers practical techniques for reducing the blast radius in attendees' own systems.
Key takeaways
- 🌍 AWS focuses on minimizing the blast radius of failures.
- ⚡ The speed of light affects long-distance communications.
- 📉 The CAP theorem says a partition-tolerant distributed system cannot guarantee both consistency and availability.
- 🔧 AWS employs region isolation to contain failures.
- 🏢 Availability zones provide fault tolerance within regions.
- 🔄 Cell-based architecture helps in isolating failures further.
- 🎲 Shuffle sharding reduces the impact of problematic requests.
- 🔍 Staggered deployments help in managing risks during updates.
- 🤖 Automation is key to reducing human error in operations.
- 🔄 Post-mortem analyses are conducted to learn from failures.
Timeline
- 00:00:00 - 00:05:00
The session begins with Peter Vosshall introducing himself as a Distinguished Engineer at AWS and discussing the importance of minimizing the blast radius of failures in distributed systems. He highlights the challenges posed by the speed of light, the CAP theorem, and Murphy's Law, emphasizing the need for resilience in AWS systems.
- 00:05:00 - 00:10:00
Vosshall explains the concept of blast radius, the degree of impact a failure can have on customers, workloads, and locations. He stresses the importance of reducing this blast radius and outlines AWS's commitment to maintaining high availability while preparing for potential failures.
- 00:10:00 - 00:15:00
The discussion shifts to the various ways systems can fail, including server crashes, disk failures, network issues, and external factors like storms. Vosshall also mentions non-physical failures such as traffic surges and software bugs, setting the stage for techniques to contain failures.
- 00:15:00 - 00:20:00
Vosshall introduces techniques for reducing blast radius, starting with region isolation. He explains that AWS operates in multiple regions, each with its own set of services, ensuring that failures in one region do not affect others, thus providing a strong layer of isolation.
- 00:20:00 - 00:25:00
Next, Vosshall discusses availability zone independence, explaining that each region consists of multiple availability zones (AZs) that are physically separated to minimize correlated failures. He illustrates how applications can be designed to leverage this architecture for fault tolerance.
- 00:25:00 - 00:30:00
The presentation continues with the concept of cell-based architecture, in which services are compartmentalized into cells that operate independently. This design allows for better fault isolation and resilience, as failures in one cell do not impact others.
- 00:30:00 - 00:35:00
Vosshall elaborates on shuffle sharding, a technique that further reduces blast radius by randomly assigning each customer to a small subset of nodes. Even if one of a customer's nodes fails, the customer can still be served through its other nodes, significantly lowering the impact of failures.
- 00:35:00 - 00:40:00
The operational practices that support these architectural techniques are discussed, including staggered deployments, automated testing, and end-to-end ownership by service teams. Vosshall emphasizes the importance of automation in reducing human error and maintaining system integrity.
- 00:40:00 - 00:45:00
Vosshall concludes by summarizing the mechanisms AWS employs to minimize blast radius: region isolation, availability zones, cell-based architecture, and shuffle sharding, all supported by robust operational practices.
- 00:45:00 - 00:50:00
The session ends with a Q&A segment where Vosshall addresses questions about real-world examples of failures, the relationship between regions and cells, and the complexities of version management during deployments.
- 00:50:00 - 00:55:43
Overall, the talk provides insights into AWS's strategies for building resilient systems that can withstand failures while minimizing their impact on customers.
Video Q&A
What is the main focus of Peter Vosshall's talk?
The main focus is on how AWS minimizes the blast radius of failures in distributed systems.
What are some fundamental properties that affect distributed systems?
The speed of light, the CAP theorem, and Murphy's Law are key properties that affect distributed systems.
What techniques does AWS use to reduce the blast radius?
AWS uses region isolation, availability zone independence, cell-based architecture, and shuffle sharding to reduce the blast radius.
How does AWS ensure operational resilience?
AWS ensures operational resilience through staggered deployments, automation, and end-to-end ownership by service teams.
What is shuffle sharding?
Shuffle sharding is a technique that assigns customers to multiple nodes randomly to minimize the impact of failures.
What is the significance of availability zones?
Availability zones provide fault tolerance within a region, reducing the impact of failures.
How does AWS handle software deployments?
AWS handles software deployments in a staggered manner to minimize risk and ensure system stability.
What is the role of the control plane in AWS services?
The control plane manages resource administration, while the data plane handles the actual work.
What is the importance of automation in AWS operations?
Automation reduces human error, ensures consistency, and allows for predictable operations.
How does AWS approach failure analysis?
AWS conducts post-mortem analyses to identify root causes and implement changes to prevent future failures.
Transcript

Good afternoon, thanks for coming to this session. Who's here at re:Invent for the first time? Awesome, welcome. It should be an exciting week: lots of sessions, lots of parties, lots of fun. I am Peter Vosshall, a Distinguished Engineer at AWS, and today I'm going to talk about how AWS minimizes the blast radius of failures. I've spent my career focused on building highly available, large-scale distributed systems, and one of the things you learn quickly in this field is that there are some fundamental properties of the universe you basically have to contend with. The first is the speed of light. 186 miles per millisecond is pretty darn fast, but as the folks at JPL trying to land their Mars rover this morning learned, it really gets in the way of long-distance communication. By the way, they did land the rover successfully, so congratulations to our friends at JPL, a happy Amazon customer, an AWS customer.
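The speed-of-light constraint is easy to quantify. Here's a small back-of-the-envelope sketch (mine, not from the talk) of the best-case round-trip delay at various distances; it shows why locations a few miles apart are essentially free latency-wise while long-haul synchronous communication is not:

```python
# Best-case network latency imposed by the speed of light.
# Illustrative numbers only: assumes straight-line paths, and light
# in fiber travels at roughly 2/3 of c, so real latencies are worse.
C_MILES_PER_MS = 186.0  # speed of light in a vacuum, miles per millisecond

def one_way_ms(miles: float, velocity_factor: float = 1.0) -> float:
    """One-way propagation delay in milliseconds."""
    return miles / (C_MILES_PER_MS * velocity_factor)

for name, miles in [("across a metro area", 25),
                    ("US East to US West", 2500),
                    ("Virginia to Sydney", 9700)]:
    rtt = 2 * one_way_ms(miles, velocity_factor=0.66)  # in-fiber estimate
    print(f"{name:20s} ~{rtt:6.2f} ms round trip (in fiber, best case)")
```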
The next thing is this pesky property of the universe known as the CAP theorem. This is the observation that you can't simultaneously have consistency and availability in a distributed system that is built to be partition tolerant. It was first postulated by Eric Brewer in '98 and then proven by Seth Gilbert and Nancy Lynch at MIT. Perhaps most annoying, though, is the observation by some guy named Murphy, which is basically that stuff breaks: inevitably, things will fail in a variety of interesting ways. I don't know if this has been proven, but in my experience it seems to have been borne out as the truth. Now, security, availability, durability: these are all super important to AWS. In fact, they're our top priority, and yet we have to contend with Murphy's Law. So today I'm going to talk about the techniques that AWS uses to contain the blast radius, the degree of impact, when Murphy's Law does in fact strike, and I hope you'll walk away from this talk with a better understanding of how AWS builds resilience into our systems, and also some techniques you can use to reduce the blast radius in your own systems.
- 00:02:32systems so this term blast radius is a
- 00:02:36it's a really useful one because failure
- 00:02:40is in binary it's not a it's failing or
- 00:02:43it's not there is a degree of impact
- 00:02:45it's a really useful term it's part of
- 00:02:48our common language at AWS and basically
- 00:02:52it's a way to describe the degree of
- 00:02:54impact so a table us if we have a
- 00:02:57failure one way to talk about blast
- 00:02:59radius as well how many customers did it
- 00:03:01impact or how many workloads or what
- 00:03:04functionality maybe it was just a
- 00:03:05portion of the functionality that was
- 00:03:07impacted and then finally in what
- 00:03:09locations was it a racket a data center
- 00:03:11was it a whole data center was it an
- 00:03:13entire region obviously we would always
- 00:03:16prefer a smaller blast we as a single
- 00:03:18rack failing to a larger scope failure
- 00:03:23so while we do relentlessly focus on
- 00:03:26preventing failures and and I would say
- 00:03:28that we've gotten very good at keeping
- 00:03:30our systems extremely highly available
- 00:03:32we also relentlessly focus on reducing
- 00:03:35blast radius for those very rare cases
- 00:03:38where we do have a failure and try to
- 00:03:40contain it
- 00:03:41and make it as small as possible one way
- 00:03:46to do that we do this is in our
- 00:03:48correction of errors process so that's
- 00:03:51our post mortem process that we use
- 00:03:53whenever there's an event the service
- 00:03:55team will go through analyze what were
- 00:03:58the root causes of the failure and then
- 00:04:01identify a set of actions to take to
- 00:04:03prevent recurrence one of the questions
- 00:04:06in the template for doing this
- 00:04:09correction of errors is about customer
- 00:04:10impact and we asked as a thought
- 00:04:12exercise
- 00:04:12how could you cut the blast radius for a
- 00:04:14similar event in half so we're always
- 00:04:16thinking about even when we do have an
- 00:04:18event how can we make it even smaller
- 00:04:20blast radius the next time
Before talking about how to do that, let's talk about how things can fail. One way things can fail is that servers crash; maybe not like this, but servers do crash. Disks fail in a variety of interesting ways: they might go completely offline, or they might have random I/O errors. Network devices can fail, and even when they don't fail outright they can introduce random bit flips; we've definitely seen this happen a few times. Meanwhile, outside the data center, utility workers can accidentally cause fiber cuts, and electrical storms can cause utility power failures. In a more extreme scenario, data centers can actually be physically damaged by storms or fires.

So far I've talked about physical failures, but there are also non-physical failures we have to worry about. One is just a surge of traffic, whether it's a DDoS attack or an extreme surge in demand on one of our services, which can cause overload conditions. We also have to worry about black swan requests; we actually call these poison pills internally. I've done a little research and that seems to be an AWS-specific term, but you can think of these as particularly problematic requests that are either really expensive or have something about them that triggers a bug in the system. They can be particularly pernicious because a client will retry after a failure, so a single failure can cascade and infect an entire system. We worry a lot about these poison pills, or black swans. We also have to be mindful of the fact that sometimes a software deployment or configuration change can introduce a problem. And then, most generally, there are bugs. Obviously none of us wants bugs to get into production, and we have a very good success rate at not letting bugs get into production, but it is still a possible eventuality.
Now that I've talked about failures and what we mean by blast radius, let me talk about the various techniques we use to contain it: region isolation, availability zone independence, cell-based architecture, shuffle sharding, and finally, in tandem with all of these techniques, the operational practices we use to get the desired outcome.

Let's start with region isolation. As you probably know, AWS is deployed globally in 19 separate locations that we call regions, and we have five additional regions announced that will be coming online soon. As a customer, you choose the region you want to run your workloads in based on factors like latency or data residency. Each one of these regions is a separate, distinct stack of AWS services, and each region has a separate set of endpoints for the APIs you use to interact with the services. For example, if you want to interact with EC2 in us-west-1, you would use the ec2.us-west-1.amazonaws.com endpoint; if you wanted to use EC2 in us-east-2, you would use a separate endpoint. The reason is that there is no single global EC2. There's a separate stack in every region, and these are isolated instantiations that don't know about each other. This is a shared-nothing architecture, which gives us the ultimate in blast radius protection. Our Amazon Elastic Compute Cloud service in us-west-1 and us-east-2: they don't talk to each other, they don't know about each other. The same is true for a service like Amazon SQS, Amazon SageMaker, and so on.
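Region isolation is visible directly in the SDKs: every client binds to a single region's endpoint, and there is no global EC2 client. Here's a minimal boto3 sketch (my illustration, not from the talk) of talking to the two independent stacks just mentioned:

```python
import boto3

# Each client is bound to one region's isolated EC2 stack via that
# region's endpoint (e.g. ec2.us-west-1.amazonaws.com). Querying two
# regions means two separate clients; there is no cross-region API.
ec2_west = boto3.client("ec2", region_name="us-west-1")
ec2_east = boto3.client("ec2", region_name="us-east-2")

for region, client in [("us-west-1", ec2_west), ("us-east-2", ec2_east)]:
    resp = client.describe_instances()  # scoped to this region only
    count = sum(len(r["Instances"]) for r in resp["Reservations"])
    print(f"{region}: {count} instances")
```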
Now you might be wondering about the multi-region or global features: things like S3 cross-region replication, DynamoDB global tables, or EC2 virtual private cloud peering. Let me talk about how we address those by going through an example: inter-region virtual private cloud peering. A virtual private cloud, or VPC, basically lets you provision a logically isolated section of the AWS cloud where you can launch resources in a virtual network that you define: you define the IP addresses in your virtual network, configure route tables, gateways, and so forth. With inter-region VPC peering you can take two VPCs in two different regions and virtually connect them, so they can communicate without having to go over the public internet.

Setting one of these peering connections up involves a workflow and approvals, particularly because the VPCs can be owned by different customers, so you want both customers to agree to establishing the connection, and that involves configuration changes to the VPC configuration in each of the regions. But these EC2 control planes don't talk to each other, so how do we accomplish this? The answer is that we have a dedicated service called the Cross-Region Orchestrator that sits on top of these systems, implements the workflow, manages the approval process, and comes in through the front door of each region's EC2 just like any other API caller would. It also has a number of safety features: it ring-fences some of the interactions to ensure there is no possibility for multiple regions to be impacted at the same time by whatever issue there might be, so we can preserve that single-region blast radius.
- 00:10:24single region blast radius so kind of a
- 00:10:29recap here we've got all these regions
- 00:10:32and we're deeply committed to this
- 00:10:34principle of region isolation so in the
- 00:10:37event of say an earthquake in Japan
- 00:10:39which we did experience a few years ago
- 00:10:42there could be impacted in that region
- 00:10:44but it's going to be limited to that
- 00:10:45region so in the worst case we have a
- 00:10:48single region blast radius in case of a
- 00:10:50failure that's still pretty big right
- 00:10:54that's that's a lot of impact especially
- 00:10:56if you're
- 00:10:57a customer that sits only in that region
- 00:11:00so obviously we want to do better so how
- 00:11:04do we do better now I'm gonna talk about
Now I'm going to talk about availability zone independence. Let's drill into the design of a region, kind of double-click on it, and see how we limit blast radius within a region. A region sits in a location; at the macro scale it's Northern Virginia or Dublin or Frankfurt. But it's not a single data center. It's actually composed of multiple data centers that are spread across the metropolitan area of the region. We call these different locations availability zones, and they're cross-connected with high-speed private fiber links.

These are far enough apart from each other that there's a very, very low possibility of correlated failure, except maybe in the earthquake case. You can think of them as miles apart, and that means if a tornado comes through, it's unlikely to hit multiple facilities. If there's a utility issue, we actually run off of different utility suppliers in a region, so that won't affect multiple availability zones at the same time. And yet they're close enough to each other, again think miles apart, that they don't have that pesky speed-of-light issue: you can think of them as logically being in the same place, and run things like synchronous replication protocols without any latency penalty.

Each availability zone is basically n data centers, not just one. Because we keep data centers to a fixed maximum size, in larger AZs we might have multiple buildings per AZ. And then we have n of those availability zones per region, usually three or more; in fact, all future region builds will have at least three. That's useful, first of all, for distributed systems consensus protocols, which are best suited to having three locations so you can reach consensus agreement across them. Globally we have 19 regions with 57 total availability zones, and five more regions coming online with 15 more AZs.
- 00:13:23coming online with 15 more AZ's so with
- 00:13:28this architecture of AZ's within a
- 00:13:30region we now have the possibility to
- 00:13:32reduce the blast radius because we
- 00:13:35reduce the possibility of a correlated
- 00:13:37failure across the entire region and you
- 00:13:40can take advantage of this multi a-z
- 00:13:42architecture in old applications just
- 00:13:44like we do in a DBMS by using a multi
- 00:13:46a-z architecture for your application so
- 00:13:49let me go through a quick example here
- 00:13:51super simple so you have your
- 00:13:54application it runs on a set of
- 00:13:55instances that you've deployed across
- 00:13:57multiple AZ's and then run an elastic
- 00:14:00load balancer to low balance the traffic
- 00:14:02across them and then behind the scenes
- 00:14:04are using you know a relational database
- 00:14:06maybe for your persistence and that set
- 00:14:09up in a multi a Z primary standby pair
- 00:14:12so if you have a failure in one of the
- 00:14:14availability zones the elastic load
- 00:14:16balancer will detect that and stop
- 00:14:18sending traffic to the field a Z or the
- 00:14:20instances in the failed a Z
- 00:14:23meanwhile the application is connected
- 00:14:26to the master database that has gone
- 00:14:28away me because there's a power failure
- 00:14:29or a network failure but you could fail
- 00:14:33that database over to the to the healthy
- 00:14:38AZ which happens automatically with
- 00:14:40Amazon RDS and then your application is
- 00:14:43up and running again
- 00:14:43and these failure detection events could
- 00:14:47happen fairly quickly so there might be
- 00:14:49a slight hiccup the operation your
- 00:14:51application but otherwise it's almost
- 00:14:53like a non-event that you lost you know
- 00:14:55one of the data centers in your
- 00:14:56application so this is how all of our
- 00:14:58services that run regionally in AWS are
- 00:15:03designed and operated it's a really
- 00:15:07powerful model cuz it gives you this
- 00:15:08fault tolerance and it basically means
- 00:15:11that you can withstand an AZ failure and
- 00:15:14not have any impact on your customers
- 00:15:16it's also a powerful design for
- 00:15:18durability so s3 uses the multi AZ model
- 00:15:21to get its 11
- 00:15:22of durability and so with this
- 00:15:25architecture you can basically get to
- 00:15:26zero blast radius when you have a data
- 00:15:28center and get hit by tornado or a
- 00:15:31utility worker hitting a cutting out
- 00:15:33fiber connection so what about the
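Here's a rough boto3 sketch of the multi-AZ pattern just described; this is my illustration, with placeholder names and subnet IDs, not anything shown in the talk. The two load-bearing choices are a load balancer whose subnets span multiple AZs and a database created with Multi-AZ failover enabled:

```python
import boto3

elb = boto3.client("elbv2", region_name="us-east-1")
rds = boto3.client("rds", region_name="us-east-1")

# Load balancer registered in subnets in two different AZs, so it can
# route around an AZ failure. Subnet IDs here are placeholders.
elb.create_load_balancer(
    Name="my-app-alb",
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],  # one per AZ
    Type="application",
)

# MultiAZ=True gives a synchronous standby in another AZ; RDS fails
# over to it automatically if the primary's AZ goes down.
rds.create_db_instance(
    DBInstanceIdentifier="my-app-db",
    Engine="mysql",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me-please",  # use a secret store in practice
    MultiAZ=True,
)
```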
So what about the services in AWS that are zonal? Some of our services give you the option of having zonal resources in your application, like an EC2 instance or an EBS volume. You decide which AZ these live in, and that's part of your story for creating a resilient application. These are zone-specific resources, and for minimum blast radius we actually have a zone-local control plane for the resources in each of our AZs, following the principle of availability zone independence.

Let's talk about control planes real quickly, if you're not familiar with the term. The control plane is the thing you interact with to administer resources, whether they're zonal or regional, and the data plane is the thing you interact with to actually do the work you want to get accomplished. Some examples: for Amazon EC2, the control plane is what handles RunInstances; it takes your API request and then does all the work necessary to launch the VM per the instructions in your API call. The data plane for EC2 is basically the instance itself, the thing you SSH into or run your application on. It's also the network that attaches your EC2 instance to the other instances in your VPC.
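To make the control plane/data plane split concrete, here's a small illustration of mine: one control-plane API call asks EC2 to place a VM in a specific AZ, and everything afterwards happens on the instance itself, the data plane. The AMI ID is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Control plane: an administrative API call. EC2 places the VM in the
# AZ we name, making this a zonal resource.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1a"},
)
instance_id = resp["Instances"][0]["InstanceId"]
print(f"launched {instance_id} in us-east-1a")

# Data plane: everything after this point (SSH-ing in, serving traffic,
# talking to other instances over the VPC network) runs on the instance
# itself, independently of the control plane that launched it.
```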
So let's talk about availability zone independence for these things. The data plane runs in each of these AZs, and these are isolated from each other; they can obviously communicate over the network, but otherwise the data planes don't really know about each other, and the same is true of the control planes for these resources. Now, I mentioned earlier that there's a regional endpoint to access EC2, so there must be some layer you connect to that sits on top of these. There is: a regional control plane acts as the entry point and also handles things that aren't zone-specific; for EC2, things like security groups are not zone-specific. It also aggregates APIs like DescribeInstances: if you want to find out about all your instances in a region, it will need to interrogate the control planes across all the AZs.

In this model, again, if you lose an AZ you're going to lose your zonal resources in it, but you expected that anyway. The data planes in the other AZs are fine, the control planes for those zones are fine, and the regional control plane is built to be multi-AZ fault tolerant, so it will also continue to operate fine, except that it won't be able to service requests that target the zone that's down. If you make API calls into one of the healthy AZs, to launch an instance there for example, it should be able to route around this particular type of AZ failure.
OK, so let's review the blast radius improvements we get from availability zones. Here's a regional service spread across three AZs. If zone A fails, the regional service is fine, because it's able to fail away from the failed AZ. Meanwhile, a zonal service has impact just in that zone; the other zones are isolated and won't have any impact. These are all good things. The theoretical blast radius is a different story. By theoretical blast radius I mean the worst-case scenario, the black swan event: what's the worst case that could happen here? For the regional service, it's the entire service, and that's the thing we lose sleep over every night. For the zonal service it's still the zone, so there's something nice about this property of availability zones: we still don't like that a non-infrastructure event could take out a service in a zone, but it's still nice that it's contained to the zone. Can we get the same kind of resilience in our regional services without having to target them in a zone-local kind of way?
Let's take a step back and look at this abstracted architecture. There's a regional entry point into the service; it has an aggregation layer that might do a few things, but mostly it's a routing layer into a set of compartmentalized resources, with failure isolation between them. In the availability zone case, those compartments are AZs with a regional control plane that accesses them, but more generally it's this compartmentalization that gives us the nice fault isolation. Can we use that in a different dimension than AZs and get a smaller blast radius?

There's another way to think about this abstracted architecture, which is how they build ships. For centuries now, ships have been built with watertight compartments separated by bulkheads. The reason is that if there's damage to the hull and it causes flooding, the flooding is contained to one of those compartments, the rest of the ship is intact, and the ship stays afloat. Those bulkheads also provide structural integrity. These are both nice properties: you have fault tolerance with minimized impact of failure, and higher structural integrity.
And so we've taken these ideas and applied them to our regional services in what we call cellular architecture, or cell-based architecture. Let me go through our simple example again: an application with a load balancer, compute, and some storage. It's not shown here, but this is an application running in multiple AZs with the failover I described before. In a cell-based architecture, we take this service stack configuration and create multiple instantiations of it. These are fully isolated and don't know about each other, and each one of these stacks is what we call a cell.

Then we take our workload and basically partition it over these cells. One way to do that might be by customer: we put this section of customers into cell zero, this section into cell one, this section into cell n. Now, you all look nice, but maybe there's someone naughty in here who's going to cause us a problem; they're only going to cause that problem in their one cell, and the other cells are going to be fine. We need some way to make this thing still look like a single service, so we put a cell router on top of it that makes the routing decisions, and that whole thing is what we call a cell-based service. The cells are an internal structure that's invisible to you as a customer but provides resilience and fault tolerance. And this looks just like the picture for availability zone independence, but on a different dimension. So let's talk about what that looks like.
- 00:22:52resilience from availability zones
- 00:22:54here's that same service divided into
- 00:22:56cells and you can see now that with an
- 00:23:00availability zone failure both services
- 00:23:02are resilient to that because their
- 00:23:03fault tolerant across AZ's and the other
- 00:23:08examples of failure the ones that we
- 00:23:10lose sleep at at night over the failure
- 00:23:13in their in the cell based service it's
- 00:23:15contained to the cell so the impact is 1
- 00:23:17over N or n is the number of cells
- 00:23:19rather than the whole service and I've
- 00:23:22shown three here for the purposes of
- 00:23:25presentation but the number of cells can
- 00:23:27actually be much higher so one over n
- 00:23:29could be a fairly small percentage of
- 00:23:30the overall set of workloads that you're
- 00:23:33supporting now that this approach can
- 00:23:38also be applied to zonal services and we
- 00:23:40do this in our ec2 control plane where
- 00:23:44you divide each of the zonal services
- 00:23:45into cells as well so you still have a
- 00:23:48failure as you'd expect if an
- 00:23:50availability zone goes down what's
- 00:23:52interesting though is as I mentioned
- 00:23:53some availability zones or multiple data
- 00:23:55centers and so you can actually have a
- 00:23:57smaller blast radius
- 00:23:58in certain cases if your zonal cells are
- 00:24:01aligned with the physical infrastructure
- 00:24:03which is the case with our zonal
- 00:24:05easy to control plane services that's
- 00:24:08actually a nice improvement and then
- 00:24:10again for the other failure cases that
- 00:24:13we worry about there is also a smaller
- 00:24:15ablation Shrek blast radius even for the
- 00:24:17zonal services so let's look at the
Let's look at the system properties of a cell-based architecture; I've talked about some of them. The first is workload isolation, which is useful not just for failures but also for noisy neighbor problems. Then of course there's failure containment: if we lose a cell, the other cells are fine. There's also a property that is really powerful and really important to us at AWS, which is how we scale these things. Rather than scaling up a service, which is the traditional way, just adding more and more capacity to it, in a cell-based architecture you could add more capacity to cells, but one of the things we include in our goals for cell-based architectures is that cells, like our data centers, have a maximum size; we won't let them grow past a certain point. So if we need to continue to grow a system, rather than growing the cells past that point, we add another cell: you grow the system by scaling it out with more and more cells.

The fact that cells have a maximum size means you can test them at that maximum size with a reasonable test configuration: you can test them to failure, do all sorts of stress testing, and get confidence that you understand how that piece of the system is going to operate as you get more and more demand on your system. These cells are also more manageable because they're smaller: if there's some issue and you need to look through logs or otherwise inspect the nodes in a cell, it's just a piece of your system rather than the whole system, so it's going to be easier to work through.
So let's now talk about some of the core considerations in a cell-based architecture. The first is cell size. The trade-off here is that you can have a large number of smaller cells or a smaller number of larger cells. Smaller cells are nice because your blast radius is that much smaller, and those smaller things are easier to test, easier to break to understand their breaking points, and easier to operate: if there's an issue, there's a smaller number of nodes to go in and take a peek at. On the other end of the spectrum, larger cells have some good properties too. First, if there's a fixed cost to each cell, which often there is, maybe a separate load balancer for each one, then you get cost efficiency by having fewer of them. You also get reduced splits. This is an important consideration: if we're dividing our workload by customer, some of our customers might be large, with a large number of workloads, and that may be too large to fit in a smaller cell. With larger cells we may be able to fit that larger customer into a single cell and not have to deal with the complexity of splitting them across multiple cells. And finally, a system that has fewer cells is easier to operate as a whole, because it's easier to think about, easier to look at dashboards, and so forth. There's no right answer here, except that, all other considerations being equal, we will always prefer the lower blast radius.

Another consideration, which I'm sure you've been thinking about while looking at this diagram, is: what about the router? The cell router is the one remaining shared component across the entire system, so it's really important that it not fail, because then you're back to the regional blast radius. We spend a lot of effort making sure that component is stress tested and battle-hardened, so that we have high confidence that even in the black swan scenarios it's going to stay resilient and stay up. One of the ways to accomplish that is to keep it as simple as possible: when we talk to teams about adopting a cell-based architecture, we call this component the thinnest possible layer, to reinforce that it should be really simple. And that's all I'll say about that.
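A "thinnest possible layer" can be very small indeed. Here's a minimal sketch (my illustration, not AWS's actual router) that maps a customer ID to a cell with a stable hash, so the routing decision is deterministic and nearly stateless:

```python
import hashlib

CELLS = [
    "https://cell-0.myservice.example.com",
    "https://cell-1.myservice.example.com",
    "https://cell-2.myservice.example.com",
]

def route(customer_id: str) -> str:
    """Deterministically pick a cell for a customer.

    Uses a stable cryptographic hash rather than Python's built-in
    hash(), which is randomized per process, so every router instance
    agrees on the mapping without shared state.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(CELLS)
    return CELLS[index]

print(route("customer-42"))  # always the same cell for this customer
```

One caveat with plain modulo mapping: changing the number of cells reshuffles customers, which is one reason a real router would more likely keep an explicit assignment table and treat moves as deliberate cell migrations.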
- 00:28:42partitioning dimension so I've talked
- 00:28:44about how we might divide cells along
- 00:28:48lines of customers but then in the ec2
- 00:28:51control plane case there's an aspect of
- 00:28:54the control plane that actually is cell
- 00:28:56based based on physical infrastructure
- 00:28:58in our data centers which makes sense
- 00:29:00for that application another scenario we
- 00:29:06may divide not by customer by VPC
- 00:29:08especially because sometimes V pcs may
- 00:29:11have crossed customer scenarios and so
- 00:29:14it takes some analysis to decide what's
- 00:29:16the right way to carve this thing up and
- 00:29:19the recommendation I always use is cut
- 00:29:21with the grain and if you don't know
- 00:29:23what that means then think about it this
- 00:29:25way that you know wood has a certain
- 00:29:26grain and it's easy to split along one
- 00:29:28dimension and really hard to split cross
- 00:29:31across the grain and every system has
- 00:29:34has a natural grain to it another
- 00:29:40consideration is what I call cross cell
- 00:29:42use cases these may be unavoidable the
- 00:29:47goal is to keep them to a minimum
- 00:29:48because that adds complexity to the
- 00:29:51thinnest possible layer and also
- 00:29:53increases the blast radius for those
- 00:29:56those operations one example to scatter
- 00:29:59gatherer queries so what this means is
- 00:30:01there may be an API that comes in that
- 00:30:03needs to interrogate multiple cells so
- 00:30:07if scatter requests out and then gather
- 00:30:09responses are sent out a single reply so
- 00:30:11an example in ec2 is the describe
- 00:30:13instance this case I mentioned earlier
- 00:30:15now there is batch operations so if you
- 00:30:18need to execute work on multiple cells
- 00:30:20in a single operation so again an AC -
- 00:30:23maybe the terminate instances API where
- 00:30:25you can spend send multiple instance IDs
- 00:30:27that can be a cross sale use case the
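As a sketch of the scatter-gather shape (again mine, with stubbed-out cell calls): the thin layer fans a read out to every cell concurrently and merges the partial results, which is exactly why such operations carry a wider blast radius than single-cell requests.

```python
from concurrent.futures import ThreadPoolExecutor

CELLS = ["cell-0", "cell-1", "cell-2"]

def query_cell(cell: str, customer_id: str) -> list[dict]:
    """One cell's slice of the answer (stubbed; really an RPC to the cell)."""
    return [{"cell": cell, "customer": customer_id}]

def describe_across_cells(customer_id: str) -> list[dict]:
    """Scatter the query to every cell, then gather one combined reply.

    The trade-off: this call now depends on all cells at once, so its
    blast radius is wider than that of a normal single-cell request.
    """
    with ThreadPoolExecutor(max_workers=len(CELLS)) as pool:
        partials = pool.map(lambda cell: query_cell(cell, customer_id), CELLS)
    return [item for partial in partials for item in partial]

print(describe_across_cells("customer-42"))
```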
The last, and probably the hardest, is coordinated writes, where you're actually trying to write atomically across multiple cells. Those require careful consideration, and one example is cell migration. Cell migration is when you relocate a workload from one cell to another: maybe we've decided to move a customer from cell one into cell two. You might do this because you want to manage the amount of load, or heat, on each cell, or to balance the sizes of the cells, or maybe you've added a cell and your approach for adding cells involves moving existing workloads into the new cell. The process used to do the migration is not unlike a VM migration, if you're familiar with how that works: an invisible clone gets created in the target location, and it gets brought up to date and synchronized with the source. Of course, the source is still changing, so it can take a while to get close to being in sync. At the last possible moment, both are frozen for the final completion of the sync, and then there's an atomic flip over to the target location. That works for VMs, and it's the same approach we use for migrating workloads across cells. It requires careful coordination, best managed at the router level, and we have a few approaches we've been using to accomplish this.
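Here's a toy, runnable simulation of that catch-up-then-freeze sequence; it's my own sketch rather than AWS's mechanism, and real implementations need far more care around failures mid-migration:

```python
import random

SMALL_DELTA = 2
ROUTER = {"customer-42": "cell-1"}      # authoritative routing entry
DATA = {"cell-1": 100, "cell-2": 0}     # latest change applied per cell
frozen = False

def lag(src: str, dst: str) -> int:
    return DATA[src] - DATA[dst]

def copy_changes(src: str, dst: str) -> None:
    DATA[dst] = DATA[src]                  # bring the clone up to date...
    if not frozen:
        DATA[src] += random.randint(0, 3)  # ...while the source keeps moving

def migrate(customer: str, src: str, dst: str) -> None:
    global frozen
    # Phase 1: sync repeatedly while the source still serves writes,
    # until the remaining delta is small enough for a brief freeze.
    while lag(src, dst) > SMALL_DELTA:
        copy_changes(src, dst)
    # Phase 2: freeze writes, apply the final delta, then atomically
    # flip the router entry. The freeze window is the only visible blip.
    frozen = True
    copy_changes(src, dst)
    ROUTER[customer] = dst
    frozen = False

migrate("customer-42", "cell-1", "cell-2")
print(ROUTER, DATA)
```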
So again, with the cell-based structure we're able to reduce the blast radius from 100 percent down to 1/n, where n is the number of cells. I should reinforce that the events that cause these types of outages are exceedingly rare; we could go a long time without ever seeing this kind of failure, because of all the other things AWS does. But we're so focused on resilience that we're investing additional engineering work to get to that picture on the right, which is a smaller blast radius even when those black swan events occur.
Cells are great, but there's another technique we've been using that is even more impressive and, I think, more exciting, which is called shuffle sharding. Shuffle sharding is a technique that is like cell-based architecture, and it's particularly useful in stateless or soft-state services. Let's walk through what shuffle sharding looks like. Here's another simple service: we've got eight nodes, and these eight nodes are handling requests sent by a load balancer. Then we have eight different customers sending requests; we're in Vegas, so I'll use some gambling-relevant icons to represent our different customers. Let's imagine one of them, diamonds here, is introducing a bad workload for whatever reason: maybe it's an expensive request, maybe it's one of those requests that triggers a bug in the system. Diamonds sends a request in, and that request causes one of our servers to crash. OK, that's all right, we've got seven others. Maybe diamonds will go away or change what it's doing. Probably not: it'll probably keep retrying and eventually take out the whole system. Here our blast radius is basically all the customers. This is the worst-case scenario we really want to avoid.

This is where we go to cell-based architecture: we divide our customers and assign a subset of customers to each cell. Now when diamonds comes along and causes problems, the problem is contained to just the one cell. That's a 4x improvement: we've gone from 100 percent down to 25 percent. The blast radius is the number of customers divided by the number of cells, and we could improve that further by adding more cells. Or, in a system where we don't really need to worry about which customers land on which nodes, we could shuffle shard them, which is a little different and nuanced, but you'll see shortly how powerful it is.

We take each customer and assign them to two nodes, effectively at random. Not truly at random: we use hash functions so that it's predictable where each customer lands, but basically we've assigned them at random. So diamonds gets assigned to the first and fourth nodes, spades lands on its two nodes with our roll of the dice, and so on: the customers are basically shuffled randomly across our set of capacity.
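A minimal sketch of that assignment step (my illustration): hash the customer ID to seed a deterministic shuffle, then take the first k nodes, so every customer gets a stable, pseudo-random subset of the fleet.

```python
import hashlib
import random

NODES = [f"node-{i}" for i in range(8)]
SHARD_SIZE = 2  # nodes assigned to each customer

def shuffle_shard(customer_id: str) -> list[str]:
    """Return this customer's fixed, pseudo-random set of nodes.

    Seeding with a hash of the customer ID makes the assignment
    deterministic: the same customer always lands on the same nodes.
    That fixed assignment is essential; handing out fresh nodes on
    failure would let a poison pill walk through the whole fleet.
    """
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return random.Random(seed).sample(NODES, SHARD_SIZE)

print(shuffle_shard("diamonds"))  # same two nodes on every call
print(shuffle_shard("hearts"))    # a different, likely overlapping-by-at-most-one pair
```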
Now diamonds comes along again and takes out the two nodes assigned to it. But here's where it gets interesting: look at who's sharing those nodes with diamonds. One of them is hearts. Hearts, however, has a second node it's assigned to that's not impacted by the outage, so as long as that customer retries, it's fault tolerant: even though one of its nodes is down, one of its nodes is up, and it's able to continue operating. The same is true on the other node, where clubs has the same property. In this case our blast radius is actually the number of customers divided by the number of combinations of two nodes out of eight, and it turns out there are 28 of those, which is 3.6 percent. If we had a much larger number of customers, well distributed randomly, we'd expect 3.6 percent of customers to be fully impacted by the failure I showed, while fewer than half of the customers would be in the scenario where they share at least one of the failed nodes, so they might see a little bit of impact, a little hiccup, but they're fine. So we went from 25 percent with cells down to 3.6 percent with shuffle sharding.

Now this is a small system; I should show you the math. The math here, as you probably remember from high school, is the binomial coefficient, and as you look at it you realize that as n grows, the number of combinations grows really quickly. Let's say we go from 8 nodes up to 100. That's not a huge number of nodes; it's a reasonable number to run in a large-scale system. Say we have a hundred nodes and we give each customer five nodes to represent their combination. The math tells us there are 75 million different combinations; think of it basically as a deck of a hundred cards, with 75 million different five-card hands you can draw from it. Which is amazing, because now 77 percent of customers are not going to see any impact when diamonds comes along and takes out its five nodes. More interestingly, 99.8 percent, those first three rows, are still going to have a majority of their nodes available, so they have a better-than-even chance of completely routing around the problem without even having to retry. And that very, very, very low remaining percentage of customers is basically the percentage of customers sharing exactly those same five nodes.
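The percentages quoted above fall straight out of the combinatorics. A short script (mine, not from the talk) reproduces them for 100 nodes with 5-node shards:

```python
from math import comb

N, K = 100, 5        # fleet size, nodes per customer shard
total = comb(N, K)   # number of distinct shuffle shards

# P(a random customer's shard shares exactly m nodes with the failed shard)
def p_overlap(m: int) -> float:
    return comb(K, m) * comb(N - K, K - m) / total

print(f"possible shards:   {total:,}")           # 75,287,520 (~75 million)
print(f"zero overlap:      {p_overlap(0):.1%}")  # ~77.0%: no impact at all
majority_up = sum(p_overlap(m) for m in range(3))  # share 0, 1, or 2 nodes
print(f"majority still up: {majority_up:.2%}")   # ~99.9% (the talk quotes 99.8%)
print(f"full overlap:      1 in {total:,}")      # identical shard to the bad one
```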
What's magical about this, and it's all in the math, is that we've created a multi-tenant system and then used shuffle sharding to create basically a single-tenant experience, which is obviously what AWS aspires to do. Now, this needs a fault-tolerant client, as I mentioned, one that will retry when it gets a failure, but that's not hard; that's pretty common. What's also interesting is that this doesn't only work for servers; it can work for queues, it can work for other resources. It's also critically dependent on fixed assignments: you're stuck with the hand that we deal you. If there's any sort of failover, like "oh, your five nodes are down, I'll give you these five instead", then you're back in that old world where a problem can infect and cascade across an entire system. So it really depends on those fixed assignments, and it needs some sort of routing mechanism: either a shuffle-sharding-aware router, or DNS can be another. In some of our services we'll hand a customer a customer-specific DNS name, and that will resolve to the customer-specific shuffle-shard set for that customer, which basically gives us the routing for free.
Cool. So let's talk about the operational practices that we layer on top of these architectural techniques to achieve the lowest possible blast radius. The first is not even a practice but really a mindset, maybe even a religion at AWS, which is probably best captured by Werner's blog post from earlier this year about compartmentalization. I'll let you read it rather than do it an injustice here, but I'd wager that every new AWS engineer knows within their first week, if not their first day, that we never want to touch more than one zone at a time. This is so important because if availability zone fault isolation, or region isolation, is a core tenet of our blast radius reduction, that all goes out the window if some correlated failure is introduced by a manual or automated action on multiple of these at the same time.

The most common and most obvious example of this is software deployments. Our software deployments are done in a staggered way, across zones and across regions, over time: quickly enough that we can get features out to customers, because we like to launch features, but slowly enough that we have confidence, as we push a change broader and wider, that it's not going to cause an issue. We start slow, observe, test, and then maybe speed up as the change goes out broader and broader. That's the case with cells, with availability zones, and with regions, and within each of those deployment units we'll do a fractional deployment too, so a one-box test is the very first step for a service. For our EC2 hardware deployments we'll start with maybe five or ten machines at first, verify that things are working, and then gradually speed up as the deployment expands across the infrastructure.
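The wave structure is simple to express. Here's a toy sketch (my own, with hypothetical stage names and sizes) of the kind of plan that deployment machinery automates: each stage has a strictly smaller blast radius than the next, and a failed health check stops the rollout.

```python
# Ordered rollout waves: each stage widens the blast radius only after
# the previous one has baked cleanly. Names and sizes are illustrative.
WAVES = [
    ("one-box",             1),
    ("one AZ of region-1", 50),
    ("rest of region-1",  500),
    ("remaining regions", 5000),
]

def healthy(stage: str) -> bool:
    """Stand-in for canaries, automated tests, and alarm checks."""
    return True

def deploy(build: str) -> None:
    for stage, hosts in WAVES:
        print(f"deploying {build} to {stage} ({hosts} hosts)")
        # A real pipeline waits out a bake period here, watching
        # canaries and alarms before widening the rollout.
        if not healthy(stage):
            print(f"rolling back {build}: {stage} failed health checks")
            return  # never proceed to a wider wave after a failure
    print(f"{build} fully deployed")

deploy("build-1234")
```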
In tandem with that, we have a bunch of automated tests that run as part of the deployment, as well as canaries: test applications that mimic a real-world customer invoking APIs. We monitor the results of those, and if there's any problem, the deployment gets automatically rolled back. We'll look at what happened and either decide it wasn't an issue we need to worry about, or fix the problem before we start the deployment again. This is really important; it's what we need to do to make sure we haven't compromised the boundaries we've put between cells, AZs, and regions.

All of that deployment machinery is automated, including the rules about which tests have to succeed, and the timing and the windows for when we progress to the next stage. But automation is key in general; it's not just deployments. There are other things we do to manage our infrastructure, whether it's configuring network devices or propagating new credentials to the stacks, that could be done by hand, but humans are prone to error. It's much better to automate, because you can review whatever you're building in your automation, you can test it, it's predictable in how it's going to operate, and you can repeat it over and over again and know it's going to work the same way every time.
Then finally, as you may have heard, AWS, and Amazon in general, has a very strong philosophy around end-to-end ownership. Our service teams are composed of engineers who are builder-operators: as engineers we design the software, we build it, we test it, we're also the ones who deploy it, and we're the ones who operate it and respond to issues in production. This gives us a wonderful feedback loop between design choices, their operational impact, and the changes we need to make to avoid future problems when there is a failure we need to worry about. I mentioned the correction of errors template earlier; that's usually filled out by engineers, or in partnership with engineers, and they think about the blast radius, so they're in a great position to then go and implement the changes to make sure that the next time that event occurs, the blast radius is cut in half or more.
- 00:44:33compartmentalization mechanisms that we
- 00:44:35use to reduce blast radius
- 00:44:37it starts with regions and the strong
- 00:44:38isolation between regions then it goes
- 00:44:41into availability zones and then this
- 00:44:44alternate dimension of availability
- 00:44:46zones compartmentalization which
- 00:44:47ourselves and then the magic of shuffle
- 00:44:51charting to get sort of virtual cells
- 00:44:54and taking advantage of the
- 00:44:56combinatorics of the shuffle charting
- 00:44:59and then with those we protect that
- 00:45:02compartmentalization with operation
- 00:45:04operation operational practices so step
- 00:45:08by step phase deployments that are
- 00:45:10automated and then service teams that
- 00:45:13are builder operators so they are close
- 00:45:16to the the frontlines understand how
- 00:45:18their decisions are impacting the
- 00:45:20availability of their systems and all
- 00:45:24with the goal of reducing blast radius
- 00:45:28so that's my talk I hope you learned a
- 00:45:32few things and I'm happy to take
- 00:45:34questions now if anyone has them and I
- 00:45:36think there are yeah there are live
- 00:45:38microphones down the center aisles here
- 00:46:01 Yeah, so the question was: is there an example of Murphy's Law where we thought we had everything nailed down, everything sorted out, that we'd answered all of the possible failure modes? It goes back to my slide about the bit flips on the networking side; this is an example from many, many years ago.
- 00:46:24 S3 cares deeply about the integrity of data, and there are many layers of checksumming in S3, so if there is an errant bit flip introduced, we make sure we detect it and it's not an issue. In 2008 we had an event in S3 where there was one network card on one server that, every now and then, was flipping one bit. And there's a layer of the system that handles basically the group communication across the system; it uses gossip protocols so it can understand what the state is of all the servers in the system, whether they're healthy or not.
- 00:47:16 Well, the gossip protocol noticed there's a funny server name in this packet, because one of the bits was flipped in the server name: "I've never heard of that host before," and it triggered a much more expensive sort of reconciliation protocol. Long story short (you could read the long story; there's a post-mortem that's still published somewhere on the status dashboard pages that talks about the outage), it took down S3 completely. So one server, one NIC, one bit took down our regional service. And we had checksumming all the way up and down the stack; that was the only layer that didn't have the checksumming, someone had forgotten to add it there.
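A tiny sketch of that per-layer checksumming idea, in Python: the sender frames each message with a CRC, and the receiver verifies it before acting, so a single bit flipped by a bad NIC is caught at this layer instead of, say, triggering an expensive reconciliation protocol. The framing format here is made up for illustration.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Prepend a CRC32 of the payload (illustrative wire format)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def unframe(message: bytes) -> bytes:
    """Verify the checksum before trusting the payload."""
    checksum, payload = int.from_bytes(message[:4], "big"), message[4:]
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: discarding corrupted message")
    return payload

msg = frame(b"gossip: host-42 is healthy")
corrupted = msg[:-1] + bytes([msg[-1] ^ 0x01])  # one bit flipped in transit
assert unframe(msg) == b"gossip: host-42 is healthy"
# unframe(corrupted) would raise ValueError instead of acting on bad data
```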
- 00:48:09 Yeah, so the follow-up question here is: wasn't there a more recent event where there was a human error involved in an S3 outage? Yes, there was, and this comes back to the operational-practices slides I was talking about, which is that all your best plans can fall apart if whatever you're doing doesn't respect those fault isolation boundaries. In this case the engineer certainly knew our philosophy, and it was more of just a human-error mistake on the command line, which goes back to the importance of automation and the importance of testing of those things.
- 00:48:49 Peter, with everything that you talked about in the different layers, from regions to availability zones to cells to the shuffle sharding: to your customers, are these given to them, or are they for purchase as services? Good question. So regions are given to you; AZs are given to you. Cells and shuffle sharding are sort of different things, in that they're like the watertight compartments inside the ship. You could ask the purser on the boat to give you a tour of the lower decks, and maybe they'd show you the compartments, but otherwise you don't really care about that, and you get them for free; it's just part of the safety of the ship. So they're invisible, and they're free.
- 00:49:39 What if I wanted to have a cell for my own service that I'm making, and I actually care about aligning that cell with a particular data center? So you can get that kind of affinity with your zonal resources, right: you can get affinity with your EC2 instances and your CloudHSM instances and your EBS volumes, because we give you full control over the placement. And an availability zone you can think of as a logical data center; they're one and the same, so a data center will always be in one AZ. Right, but an AZ will always be one or more data centers? You're saying there can be multiple data centers per AZ? Exactly. Got it. Yeah, and that is not visible to you as a customer; you don't have control over that. Thanks.
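Here is a minimal boto3 sketch of that zonal affinity: you choose the Availability Zone for zonal resources yourself, so a do-it-yourself "cell" can keep its instances and volumes co-located. The AMI ID, region, and AZ names are placeholders, and the calls require valid AWS credentials to actually run.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

cell_az = "us-east-1a"  # placeholder: the AZ this cell is pinned to

instances = ec2.run_instances(
    ImageId="ami-xxxxxxxx",                   # placeholder AMI
    InstanceType="m5.large",
    MinCount=3, MaxCount=3,
    Placement={"AvailabilityZone": cell_az},  # pin the cell's hosts to one AZ
)

volume = ec2.create_volume(
    AvailabilityZone=cell_az,                 # EBS volumes are zonal too
    Size=100, VolumeType="gp3",
)
```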
- 00:50:46 Yep. So as you move from regions to cells, how does the version management of what you're deploying work? Is there any operational insight into what's being deployed there? Does it have any impact on managing operations, deploying in a region versus deploying at a shard level? Are there any complexities there? Yeah, so I think what you're saying is there's this window of time when we're going through the progression of the deployment and the versions may not be in sync across multiple locations; is that the core of your question? Yes. Yeah, that's definitely a consideration, and a deployment may take, depending on the system, several days to several weeks. In some of our infrastructure we're particularly careful about the pace of deployment; in fact, in some of the networking components for VPC it's a particularly delicate affair to roll out changes, and it might take a while. So yeah, that is a complication that we have to be mindful of: well, which version is running here?
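A minimal sketch of one way to tolerate that version skew during a weeks-long rollout: every message carries the sender's protocol version, and receivers refuse versions they don't understand rather than misinterpret them. This illustrates the general problem, not AWS's actual mechanism.

```python
SUPPORTED = {1, 2}          # protocol versions this build understands

def process(payload):
    """Hypothetical stand-in for the real message handling."""
    print("processing", payload)

def handle(message: dict):
    version = message.get("version", 1)
    if version not in SUPPORTED:
        # An old node mid-rollout may receive messages from a newer one;
        # fail safe rather than misinterpret them.
        raise ValueError(f"unsupported protocol version {version}")
    process(message["payload"])

handle({"version": 2, "payload": "route-update"})
```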
- 00:52:00 Yeah, some questions regarding the global services: we talked a lot about regions and zones, but how are the global services handled regarding control-plane and data-plane resiliency, with respect to the regions or edge nodes and so on? So is your question about regional services that have global features, or truly global services? For example, IAM. IAM, okay, yeah, that's a great question. So IAM is a special case, because your account information and your credentials and so forth are available globally, so it is in a sense a global service. However, each region has a separate data plane that can operate completely disconnected from the source of truth for IAM. The control plane, however, is global, and so there are particular considerations that we take with a service like that to make sure that it continues to be available even in the face of some of the failures that I talked about.
- 00:53:03 But I had it on a slide and I forgot to mention it: there is another philosophy that we have, which is the separation between control plane and data plane, and ensuring that the data plane can continue to function even if there is a control plane issue. That's exactly the scenario in IAM, and that's critically important, because a region can be disconnected from the source of truth, or there could be other issues with the control plane, but we want to make sure that you can still validate your credentials and so forth in each region and not have global impact. Can you go to the mic?
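A minimal sketch of that control-plane/data-plane separation: the data plane answers checks from a locally replicated snapshot, so reads keep working even when the global control plane that accepts mutations is unreachable. The names and structures here are illustrative, not IAM internals.

```python
class DataPlane:
    """Per-region read path that survives control-plane disconnection."""

    def __init__(self):
        self.snapshot = {}                 # local replica of credentials

    def sync(self, control_plane_state):
        """Best-effort pull from the (global) control plane."""
        self.snapshot = dict(control_plane_state)

    def validate(self, key, secret):
        """Served entirely from local state, so it works while disconnected."""
        return self.snapshot.get(key) == secret

dp = DataPlane()
dp.sync({"AKIAEXAMPLE": "s3cret"})         # last successful replication
# ...control plane becomes unreachable; credential checks still succeed:
assert dp.validate("AKIAEXAMPLE", "s3cret")
```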
- 00:53:48 So when you're talking about the cell structure and the shuffle sharding: with the cell stuff, you were talking about the control layer setting up the routing to cut with the grain of the service, the ways to break it up. When you move to shuffle sharding, is that done at the same level? Because obviously you lose that sort of control based on the characteristics of the service.
- 00:54:14 Yeah, it's a good question. For cell-based services, usually what prevents you from shuffle sharding is that you have to have some control over state: you can't sort of willy-nilly scatter state across the entire fleet. Some services don't have state, or there's soft state that they can cache locally and then operate fine. Shuffle sharding works well for the latter scenario, where there doesn't need to be any particular node affinity.
- 00:54:55 There's another sort of advanced topic here which is kind of interesting, which is that you can actually layer both of these things together: you could take a cell-based architecture and then shuffle-shard requests across the cells in a stateless system, to get a different kind of resilience property. That's something that we're working on now with DynamoDB. So I don't know if I answered your question. Okay. So I guess maybe another way to answer that is: shuffle sharding is not an improvement on cell-based in all cases; it's really system-specific when you might be able to apply it.
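A minimal sketch of layering the two techniques, assuming made-up parameters (eight cells, shards of two; not AWS's numbers): carve the fleet into cells, deterministically assign each customer a small combination of cells (a "shuffle shard"), and spread their stateless requests across that combination.

```python
import hashlib
import itertools

CELLS = [f"cell-{i}" for i in range(8)]
SHARD_SIZE = 2
SHARDS = list(itertools.combinations(CELLS, SHARD_SIZE))   # C(8,2) = 28 shards

def shard_for(customer: str):
    """Stable, deterministic shard assignment via a hash of the customer ID."""
    digest = int(hashlib.sha256(customer.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

def route(customer: str, request_id: int) -> str:
    """Spread a customer's stateless requests across their shard's cells."""
    shard = shard_for(customer)
    return shard[request_id % len(shard)]

print(shard_for("customer-a"), route("customer-a", 7))
```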
- 00:55:30 Cool, okay, it looks like there are no more questions. Thanks again for coming; I hope you learned something, and enjoy your week at re:Invent.
- [Applause]
- AWS
- Distributed Systems
- Resilience
- Failure Management
- CAP Theorem
- Murphy's Law
- Operational Practices
- Shuffle Sharding
- Cell-Based Architecture
- Availability Zones