AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)

00:55:43
https://www.youtube.com/watch?v=swQbA4zub20

Summary

TL;DR: In his talk, Peter Vosshall, a distinguished engineer at AWS, explores how AWS minimizes the blast radius of failures in distributed systems. He discusses fundamental constraints that shape system design, such as the speed of light, the CAP theorem, and Murphy's Law. Vosshall outlines the techniques AWS employs, including region isolation, availability zone independence, cell-based architecture, and shuffle sharding, to contain failures and reduce their impact on customers. He emphasizes the importance of operational practices, such as staggered deployments and automation, in maintaining system resilience. The talk aims to provide insight into how AWS builds resilient systems and offers practical techniques attendees can use to reduce the blast radius in their own systems.

Key takeaways

  • 🌍 AWS focuses on minimizing the blast radius of failures.
  • ⚡ The speed of light affects long-distance communications.
  • 📉 The CAP theorem says a partition-tolerant distributed system cannot provide both consistency and availability at the same time.
  • 🔧 AWS employs region isolation to contain failures.
  • 🏢 Availability zones provide fault tolerance within regions.
  • 🔄 Cell-based architecture helps in isolating failures further.
  • 🎲 Shuffle sharding reduces the impact of problematic requests.
  • 🔍 Staggered deployments help in managing risks during updates.
  • 🤖 Automation is key to reducing human error in operations.
  • 🔄 Post-mortem analyses are conducted to learn from failures.

Timeline

  • 00:00:00 - 00:05:00

    The session begins with Peter Vosshall introducing himself as a distinguished engineer at AWS, discussing the importance of minimizing the blast radius of failures in distributed systems. He highlights the challenges posed by the speed of light, the CAP theorem, and Murphy's Law, emphasizing the need for resilience in AWS systems.

  • 00:05:00 - 00:10:00

    Vosshall explains the concept of blast radius, which refers to the degree of impact a failure can have on customers, workloads, and locations. He stresses the importance of reducing this blast radius and outlines AWS's commitment to maintaining high availability while preparing for potential failures.

  • 00:10:00 - 00:15:00

    The discussion shifts to the various ways systems can fail, including server crashes, disk failures, network issues, and external factors like storms. Vosshall also mentions non-physical failures such as traffic surges and software bugs, setting the stage for techniques to contain failures.

  • 00:15:00 - 00:20:00

    Vosshall introduces techniques for reducing blast radius, starting with region isolation. He explains that AWS operates in multiple regions, each with its own set of services, ensuring that failures in one region do not affect others, thus providing a strong layer of isolation.

  • 00:20:00 - 00:25:00

    Next, Vosshall discusses availability zone independence, explaining that each region consists of multiple availability zones (AZs) that are physically separated to minimize correlated failures. He illustrates how applications can be designed to leverage this architecture for fault tolerance.

  • 00:25:00 - 00:30:00

    The presentation continues with the concept of cell-based architecture, where services are compartmentalized into cells that operate independently. This design allows for better fault isolation and resilience, as failures in one cell do not impact others.

  • 00:30:00 - 00:35:00

    Vosshall elaborates on shuffle sharding, a technique that further reduces blast radius by assigning each customer to a small, fixed, pseudo-randomly chosen set of nodes. This ensures that even if one assigned node fails, customers can still reach the service through their other nodes, significantly lowering the impact of failures.

  • 00:35:00 - 00:40:00

    The operational practices that support these architectural techniques are discussed, including staggered deployments, automated testing, and end-to-end ownership by service teams. Vosshall emphasizes the importance of automation in reducing human error and maintaining system integrity.

  • 00:40:00 - 00:45:00

    Vosshall concludes by summarizing the various mechanisms AWS employs to minimize blast radius, including region isolation, availability zones, cell-based architecture, and shuffle sharding, all supported by robust operational practices.

  • 00:45:00 - 00:50:00

    The session ends with a Q&A segment where Vosshall addresses questions about real-world examples of failures, the relationship between regions and cells, and the complexities of version management during deployments.

  • 00:50:00 - 00:55:43

    Overall, the talk provides insights into AWS's strategies for building resilient systems that can withstand failures while minimizing their impact on customers.


Video Q&A

  • What is the main focus of Peter Vosshall's talk?

    The main focus is on how AWS minimizes the blast radius of failures in distributed systems.

  • What are some fundamental properties that affect distributed systems?

    The speed of light, the CAP theorem, and Murphy's Law are key properties that affect distributed systems.

  • What techniques does AWS use to reduce the blast radius?

    AWS uses region isolation, availability zone independence, cell-based architecture, and shuffle sharding to reduce the blast radius.

  • How does AWS ensure operational resilience?

    AWS ensures operational resilience through staggered deployments, automation, and end-to-end ownership by service teams.

  • What is shuffle sharding?

    Shuffle sharding assigns each customer to a small, fixed, pseudo-randomly chosen subset of nodes, so a problematic workload affects only the few customers whose subsets overlap with it (a brief illustrative sketch follows this Q&A list).

  • What is the significance of availability zones?

    Availability zones provide fault tolerance within a region, reducing the impact of failures.

  • How does AWS handle software deployments?

    AWS handles software deployments in a staggered manner to minimize risk and ensure system stability.

  • What is the role of the control plane in AWS services?

    The control plane manages resource administration, while the data plane handles the actual work.

  • What is the importance of automation in AWS operations?

    Automation reduces human error, ensures consistency, and allows for predictable operations.

  • How does AWS approach failure analysis?

    AWS conducts post-mortem analyses to identify root causes and implement changes to prevent future failures.
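
As a rough illustration of the shuffle sharding idea described above (a hedged sketch, not AWS's actual implementation), the snippet below gives each customer a fixed, pseudo-random shard of k nodes out of n by hashing the customer ID; the function name shard_for_customer and the node names are invented for this example.

    import hashlib
    from itertools import combinations

    NODES = [f"node-{i}" for i in range(8)]  # eight workers, as in the talk's example
    SHARD_SIZE = 2                           # each customer gets a fixed pair of nodes

    def shard_for_customer(customer_id, nodes=NODES, k=SHARD_SIZE):
        # Enumerate every k-node combination (28 pairs for 8 nodes) and hash the
        # customer ID to pick one; the assignment is deterministic and fixed,
        # which is what keeps a poison-pill workload contained to the few
        # customers whose shards overlap.
        all_shards = sorted(combinations(nodes, k))
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        return all_shards[int(digest, 16) % len(all_shards)]

    for customer in ["diamonds", "spades", "hearts", "clubs"]:
        print(customer, "->", shard_for_customer(customer))

Enumerating every combination is only practical at toy sizes; with 100 nodes and 5-node shards there are roughly 75 million combinations, so a real router would derive the shard directly (for example by seeded sampling) rather than materialize the full list.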

Subtitles
  • 00:00:01
    good afternoon thanks for coming to this
  • 00:00:05
    session who's here at reinvent for the
  • 00:00:07
    first time awesome welcome should be an
  • 00:00:12
    exciting week lots of sessions lots of
  • 00:00:15
    parties lots of fun so I am Peter Vosshall
  • 00:00:19
    I am a distinguished engineer at AWS
  • 00:00:22
    and today I'm going to talk about how
  • 00:00:25
    AWS minimizes the blast radius of
  • 00:00:27
    failures so I spent my career focused on
  • 00:00:31
    building highly available large-scale
  • 00:00:33
    distributed systems and one of the
  • 00:00:36
    things I've learned and something you
  • 00:00:38
    learn quickly in this field is that
  • 00:00:40
    there are some fundamental properties of
  • 00:00:41
    the universe that you basically have to
  • 00:00:43
    contend with the first is the speed of
  • 00:00:46
    light so 186 miles per millisecond is
  • 00:00:51
    pretty darn fast but as the folks at JPL
  • 00:00:54
    this morning trying to land their their
  • 00:00:57
    Mars rover learned it really gets in the
  • 00:00:59
    way when you travel any long distance
  • 00:01:01
    communications by the way they did land
  • 00:01:03
    the rover successfully so
  • 00:01:05
    congratulations to our our friends at
  • 00:01:07
    JPL a happy Amazon customer a happy AWS
  • 00:01:10
    customer the next thing is sort of this
  • 00:01:14
    pesky property universe known as the cap
  • 00:01:17
    theorem so this is a an observation you
  • 00:01:21
    can't simultaneously have
  • 00:01:22
    consistency and availability in a
  • 00:01:25
    distributed system that is built to be
  • 00:01:26
    partitioned tolerant so this was first
  • 00:01:28
    postulated by Eric Brewer in 98 and then
  • 00:01:31
    proven by Seth Gilbert and Nancy Lynch
  • 00:01:34
    at MIT
  • 00:01:36
    perhaps most annoying though is this
  • 00:01:38
    observation by some guy named Murphy
  • 00:01:40
    which is basically stuff breaks
  • 00:01:43
    inevitably things will fail in a variety
  • 00:01:46
    of interesting ways I don't know if this
  • 00:01:48
    has been proven but in my experience
  • 00:01:49
    that seems to have been borne out to be
  • 00:01:51
    the truth now security availability
  • 00:01:56
    durability these are all super important
  • 00:01:59
    to AWS in fact they're our top priority
  • 00:02:01
    and yet we have to contend with with
  • 00:02:04
    Murphy's Law so today I'm going to talk
  • 00:02:08
    about the techniques that Amazon uses
  • 00:02:11
    AWS uses
  • 00:02:13
    to contain the blast radius or the the
  • 00:02:16
    degree of impact when Murphy's Law does
  • 00:02:19
    in fact strike and I hope you'll walk
  • 00:02:21
    away from this talk getting a better
  • 00:02:23
    understanding of how AWS builds
  • 00:02:25
    resilience into our systems and also
  • 00:02:28
    some techniques that you can use to
  • 00:02:30
    reduce the blast radius in your own
  • 00:02:32
    systems so this term blast radius is a
  • 00:02:36
    it's a really useful one because failure
  • 00:02:40
    isn't binary it's not just failing or
  • 00:02:43
    it's not there is a degree of impact
  • 00:02:45
    it's a really useful term it's part of
  • 00:02:48
    our common language at AWS and basically
  • 00:02:52
    it's a way to describe the degree of
  • 00:02:54
    impact so at AWS if we have a
  • 00:02:57
    failure one way to talk about blast
  • 00:02:59
    radius as well how many customers did it
  • 00:03:01
    impact or how many workloads or what
  • 00:03:04
    functionality maybe it was just a
  • 00:03:05
    portion of the functionality that was
  • 00:03:07
    impacted and then finally in what
  • 00:03:09
    locations was it a rack in a data center
  • 00:03:11
    was it a whole data center was it an
  • 00:03:13
    entire region obviously we would always
  • 00:03:16
    prefer a smaller blast radius like a single
  • 00:03:18
    rack failing to a larger scope failure
  • 00:03:23
    so while we do relentlessly focus on
  • 00:03:26
    preventing failures and and I would say
  • 00:03:28
    that we've gotten very good at keeping
  • 00:03:30
    our systems extremely highly available
  • 00:03:32
    we also relentlessly focus on reducing
  • 00:03:35
    blast radius for those very rare cases
  • 00:03:38
    where we do have a failure and try to
  • 00:03:40
    contain it
  • 00:03:41
    and make it as small as possible one way
  • 00:03:46
    to do that we do this is in our
  • 00:03:48
    correction of errors process so that's
  • 00:03:51
    our post mortem process that we use
  • 00:03:53
    whenever there's an event the service
  • 00:03:55
    team will go through analyze what were
  • 00:03:58
    the root causes of the failure and then
  • 00:04:01
    identify a set of actions to take to
  • 00:04:03
    prevent recurrence one of the questions
  • 00:04:06
    in the template for doing this
  • 00:04:09
    correction of errors is about customer
  • 00:04:10
    impact and we asked as a thought
  • 00:04:12
    exercise
  • 00:04:12
    how could you cut the blast radius for a
  • 00:04:14
    similar event in half so we're always
  • 00:04:16
    thinking about even when we do have an
  • 00:04:18
    event how can we make it even smaller
  • 00:04:20
    blast radius the next time
  • 00:04:23
    so before talking about how to do that
  • 00:04:25
    let's talk about how things can fail so
  • 00:04:26
    one way things can fail as well servers
  • 00:04:28
    crash maybe not like this but servers do
  • 00:04:31
    crash disks fail in a variety of
  • 00:04:33
    interesting ways they might go
  • 00:04:35
    completely offline they might have
  • 00:04:36
    random i/o errors network devices can
  • 00:04:39
    fail and if they don't fail they could
  • 00:04:41
    introduce random bit flips we've
  • 00:04:43
    definitely seen this happen a few times
  • 00:04:46
    meanwhile outside the data center
  • 00:04:49
    utility workers can accidentally cause
  • 00:04:52
    fiber cuts you can have electrical
  • 00:04:55
    storms that can cause utility power
  • 00:04:58
    failures in a more extreme scenario you
  • 00:05:03
    could have data centers actually get
  • 00:05:05
    physically damaged by storms or fires so
  • 00:05:10
    far I've talked about physical failures
  • 00:05:11
    right but there's also non-physical
  • 00:05:14
    failures we have to worry about one is
  • 00:05:16
    just a surge of traffic whether it's a
  • 00:05:18
    DDoS attack or a extreme surge in demand
  • 00:05:23
    on one of our services which can cause
  • 00:05:26
    overload conditions we have to worry
  • 00:05:29
    about black swan requests we actually
  • 00:05:31
    call them these poison pills internally
  • 00:05:33
    I've done a little research and that
  • 00:05:35
    seems to be just an AWS term but you can
  • 00:05:37
    think of these as these particularly
  • 00:05:39
    problematic requests that are either
  • 00:05:41
    really expensive or there's something
  • 00:05:43
    about them that trigger a bug in in the
  • 00:05:46
    system and they can be particularly
  • 00:05:48
    pernicious because a client will retry
  • 00:05:51
    after a failure and so a single failure
  • 00:05:54
    can cascade and cause an entire system
  • 00:05:57
    to get infected
  • 00:05:58
    so we worried a lot about these poison
  • 00:06:00
    pills or black swans we also have to be
  • 00:06:04
    mindful the fact that sometimes a
  • 00:06:05
    software deployment or configuration
  • 00:06:06
    change could introduce the problem and
  • 00:06:10
    then of course most generally there are
  • 00:06:12
    bugs and obviously none of us want to
  • 00:06:16
    have bugs get into production and we
  • 00:06:18
    have a very good success rate and not
  • 00:06:20
    letting bugs get into production but
  • 00:06:21
    there is still that that possible
  • 00:06:23
    eventualities and so now that I've
  • 00:06:28
    talked about failures and what we mean
  • 00:06:30
    by blast radius I'll now talk about
  • 00:06:32
    the various techniques we use to contain
  • 00:06:34
    it
  • 00:06:35
    so region isolation availability zone
  • 00:06:38
    independence cell based architecture
  • 00:06:40
    shuffle sharding and then finally in
  • 00:06:43
    tandem with all these techniques the
  • 00:06:44
    operational practices that we use to get
  • 00:06:47
    the desired outcome so let's start with
  • 00:06:50
    region isolation so as you probably know
  • 00:06:53
    AWS is deployed globally in 19 separate
  • 00:06:57
    locations that we call regions and we
  • 00:07:02
    actually have five additional regions
  • 00:07:03
    that we've announced that will be coming
  • 00:07:04
    on online soon and as a customer you
  • 00:07:08
    choose the region that you want to run
  • 00:07:10
    your workloads in based on factors like
  • 00:07:12
    like latency or maybe data residency
  • 00:07:17
    each one of these regions is a separate
  • 00:07:20
    distinct stack of AWS services and each
  • 00:07:26
    region has a separate set of end points
  • 00:07:28
    for the api's you use to interact with
  • 00:07:30
    the services so for example in ec2 if
  • 00:07:32
    you want to interact with ec2 in uswest
  • 00:07:35
    one you would use the
  • 00:07:38
    ec2.us-west-1.amazonaws.com
  • 00:07:40
    endpoint meanwhile if you wanted to use
  • 00:07:43
    EC2 in us-east-2 then
  • 00:07:46
    you would use a separate endpoint and the
  • 00:07:49
    reason is there's no single global ec2
  • 00:07:53
    there's a separate stack in every region
  • 00:07:56
    and these are isolated instantiations
  • 00:07:58
    and they don't know about each other
  • 00:08:01
    so this is a shared nothing architecture
  • 00:08:03
    which gives us the ultimate in blast
  • 00:08:05
    radius protection so there's our ec2 you
  • 00:08:10
    know Amazon Elastic Compute cloud
  • 00:08:11
    service in us-west-1 and us-east-2
  • 00:08:14
    they don't talk to each other they don't
  • 00:08:16
    know about each other same is true for a
  • 00:08:18
    service like Amazon SQS Amazon Sage
  • 00:08:20
    maker and and so on now you might be
  • 00:08:24
    wondering well but there are multi
  • 00:08:27
    region or global features you know what
  • 00:08:30
    about those features things like s3
  • 00:08:34
    cross region replication or DynamoDB
  • 00:08:36
    global tables or ec2 virtual private
  • 00:08:40
    cloud peering so let me talk about how
  • 00:08:43
    we address those by going through an
  • 00:08:45
    example so there
  • 00:08:46
    is an example of inter-regional private
  • 00:08:49
    cloud peering so virtual private cloud
  • 00:08:51
    or VPC basically lets you provision the
  • 00:08:54
    logically isolated section of the AWS
  • 00:08:57
    cloud where you can launch resources in
  • 00:09:00
    a virtual network that you define you
  • 00:09:02
    define the IP addresses in your virtual
  • 00:09:03
    network configure route tables gateways
  • 00:09:06
    and so forth and with inter region VPC
  • 00:09:10
    peering you can actually take two v pcs
  • 00:09:12
    in two different regions and virtually
  • 00:09:15
    connect them so they can communicate
  • 00:09:17
    without having to go over the the public
  • 00:09:20
    internet so here's our V PC peering
  • 00:09:24
    connection now setting one of these
  • 00:09:26
    things up involves a workflow and a
  • 00:09:28
    setup approvals particularly because
  • 00:09:30
    they can be v pcs owned by different
  • 00:09:32
    customers so you want both customers to
  • 00:09:34
    agree to establishing this connection
  • 00:09:37
    and that involves configuration changes
  • 00:09:39
    in the V PC configuration on each of
  • 00:09:44
    these in each of these regions but these
  • 00:09:47
    EC2 control planes don't talk to
  • 00:09:49
    each other so how do we accomplish this
  • 00:09:51
    the answer we have a dedicated service
  • 00:09:55
    called the cross region Orchestrator
  • 00:09:56
    that sits on top of these systems and
  • 00:10:01
    implements the workflow manages through
  • 00:10:03
    the approval process and comes through
  • 00:10:05
    the sort of the front door of EC2 just
  • 00:10:08
    like any other API would and it also has
  • 00:10:11
    a certain number of safety features it
  • 00:10:13
    ring fences some of the interactions to
  • 00:10:16
    ensure that there is no possibility for
  • 00:10:18
    multiple regions to be impacted by you
  • 00:10:20
    know whatever issue there could be at
  • 00:10:22
    the same time so we can preserve that
  • 00:10:24
    single region blast radius so kind of a
  • 00:10:29
    recap here we've got all these regions
  • 00:10:32
    and we're deeply committed to this
  • 00:10:34
    principle of region isolation so in the
  • 00:10:37
    event of say an earthquake in Japan
  • 00:10:39
    which we did experience a few years ago
  • 00:10:42
    there could be impact in that region
  • 00:10:44
    but it's going to be limited to that
  • 00:10:45
    region so in the worst case we have a
  • 00:10:48
    single region blast radius in case of a
  • 00:10:50
    failure that's still pretty big right
  • 00:10:54
    that's that's a lot of impact especially
  • 00:10:56
    if you're
  • 00:10:57
    a customer that sits only in that region
  • 00:11:00
    so obviously we want to do better so how
  • 00:11:04
    do we do better now I'm gonna talk about
  • 00:11:06
    availability zone independence so let's
  • 00:11:09
    drill into the design of a region kind
  • 00:11:11
    of double click on it and see how we
  • 00:11:13
    limit blast radius within a region so
  • 00:11:17
    there's our region a region is actually
  • 00:11:18
    composed of well before I talk about what
  • 00:11:21
    it's composed of a region sits in a
  • 00:11:23
    location right so at the macro scale its
  • 00:11:25
    Northern Virginia or Dublin or Frankfurt
  • 00:11:28
    and but it's not a single data center
  • 00:11:31
    it's it's actually a it's composed of
  • 00:11:35
    multiple data centers that are spread
  • 00:11:38
    across the metropolitan area of the
  • 00:11:41
    region and we call these different
  • 00:11:43
    locations availability zones and their
  • 00:11:44
    and their cross connected with
  • 00:11:47
    high-speed private fiber links now these
  • 00:11:53
    are far enough apart from each other
  • 00:11:54
    that there's a very very low possibility
  • 00:11:58
    for correlated failure except maybe in
  • 00:12:00
    the in the earthquake case so you can
  • 00:12:02
    think of them as miles apart and that
  • 00:12:06
    means if a tornado comes through it's
  • 00:12:07
    unlikely that it would hit multiple
  • 00:12:09
    facilities if there's a utility issue we
  • 00:12:13
    actually run off of different utility
  • 00:12:15
    suppliers in a region so that won't
  • 00:12:17
    affect multiple availability zones at
  • 00:12:18
    the same time and at the same time
  • 00:12:21
    they're close enough to each other so
  • 00:12:23
    again think about it as miles away they
  • 00:12:26
    don't have that pesky speed of light
  • 00:12:27
    issues you can think of them logically
  • 00:12:28
    being in the same place run things like
  • 00:12:31
    synchronous replication protocols and so
  • 00:12:32
    forth without any latency penalty so
  • 00:12:37
    each availability zone is basically n
  • 00:12:40
    data centers it's not just one data
  • 00:12:41
    center in some so in some cases because
  • 00:12:44
    we keep the size of the data center set
  • 00:12:48
    to a fixed maximum size in larger AZ's
  • 00:12:51
    we might have multiple buildings for an
  • 00:12:53
    AZ and then we have n of those
  • 00:12:56
    availability zones per region usually
  • 00:12:58
    three or more in fact in all future
  • 00:13:01
    region builds will have at least three
  • 00:13:03
    and that's useful first for certain
  • 00:13:06
    distributed systems consensus
  • 00:13:08
    protocols that
  • 00:13:10
    are best suited when you have these
  • 00:13:11
    three locations so you can get to
  • 00:13:14
    consensus agreement across them globally
  • 00:13:17
    we have 19 regions with 57 total
  • 00:13:21
    availability zones and five more regions
  • 00:13:23
    coming online with 15 more AZ's so with
  • 00:13:28
    this architecture of AZ's within a
  • 00:13:30
    region we now have the possibility to
  • 00:13:32
    reduce the blast radius because we
  • 00:13:35
    reduce the possibility of a correlated
  • 00:13:37
    failure across the entire region and you
  • 00:13:40
    can take advantage of this multi a-z
  • 00:13:42
    architecture in your own applications just
  • 00:13:44
    like we do in AWS by using a multi
  • 00:13:46
    a-z architecture for your application so
  • 00:13:49
    let me go through a quick example here
  • 00:13:51
    super simple so you have your
  • 00:13:54
    application it runs on a set of
  • 00:13:55
    instances that you've deployed across
  • 00:13:57
    multiple AZ's and then run an elastic
  • 00:14:00
    load balancer to load balance the traffic
  • 00:14:02
    across them and then behind the scenes
  • 00:14:04
    are using you know a relational database
  • 00:14:06
    maybe for your persistence and that set
  • 00:14:09
    up in a multi a Z primary standby pair
  • 00:14:12
    so if you have a failure in one of the
  • 00:14:14
    availability zones the elastic load
  • 00:14:16
    balancer will detect that and stop
  • 00:14:18
    sending traffic to the failed AZ or the
  • 00:14:20
    instances in the failed a Z
  • 00:14:23
    meanwhile the application is connected
  • 00:14:26
    to the master database that has gone
  • 00:14:28
    away maybe because there's a power failure
  • 00:14:29
    or a network failure but you could fail
  • 00:14:33
    that database over to the to the healthy
  • 00:14:38
    AZ which happens automatically with
  • 00:14:40
    Amazon RDS and then your application is
  • 00:14:43
    up and running again
  • 00:14:43
    and these failure detection events could
  • 00:14:47
    happen fairly quickly so there might be
  • 00:14:49
    a slight hiccup in the operation of your
  • 00:14:51
    application but otherwise it's almost
  • 00:14:53
    like a non-event that you lost you know
  • 00:14:55
    one of the data centers in your
  • 00:14:56
    application so this is how all of our
  • 00:14:58
    services that run regionally in AWS are
  • 00:15:03
    designed and operated it's a really
  • 00:15:07
    powerful model cuz it gives you this
  • 00:15:08
    fault tolerance and it basically means
  • 00:15:11
    that you can withstand an AZ failure and
  • 00:15:14
    not have any impact on your customers
  • 00:15:16
    it's also a powerful design for
  • 00:15:18
    durability so s3 uses the multi AZ model
  • 00:15:21
    to get its eleven nines
  • 00:15:22
    of durability and so with this
  • 00:15:25
    architecture you can basically get to
  • 00:15:26
    zero blast radius when you have a data
  • 00:15:28
    center and get hit by tornado or a
  • 00:15:31
    utility worker cutting a
  • 00:15:33
    fiber connection so what about the
  • 00:15:36
    services in AWS that are zonal so some
  • 00:15:39
    of our applications are some of our
  • 00:15:41
    services give you the opportunity in your
  • 00:15:43
    application to have zonal resources like
  • 00:15:46
    an ec2 instance or an EBS volume you
  • 00:15:50
    decide which AZ these live in and that's
  • 00:15:52
    part of your story for creating a
  • 00:15:54
    resilient application
  • 00:15:56
    these are zone specific resources and
  • 00:16:00
    for minimum blast radius considerations
  • 00:16:04
    we actually have a zone local control
  • 00:16:06
    plane for the resources in each of our
  • 00:16:09
    AZs with a principle of availability
  • 00:16:12
    zone independence so we talk about
  • 00:16:15
    control planes real quickly if you're
  • 00:16:16
    not familiar with the term so control
  • 00:16:18
    plane is the thing that you interact
  • 00:16:19
    with to administer these resources
  • 00:16:21
    whether they're zonal or regional and
  • 00:16:25
    then the data plane is the thing you
  • 00:16:26
    interact with to actually do the work
  • 00:16:28
    you want to get accomplished so some
  • 00:16:30
    examples are here for Amazon ec2 the
  • 00:16:32
    control plane is what handles run
  • 00:16:34
    instances so it takes your API request
  • 00:16:36
    and then does all the work necessary to
  • 00:16:39
    launch the VM as per your instructions
  • 00:16:41
    in your API and then the data plane for
  • 00:16:44
    you to do is basically the instance
  • 00:16:46
    right
  • 00:16:47
    the thing you SSH into or the run your
  • 00:16:49
    application on it's also the network
  • 00:16:51
    that attaches your EC2 instance
  • 00:16:54
    to other instances in your VPC so
  • 00:16:59
    let's talk about AZI or availability
  • 00:17:00
    zone independence for these things so
  • 00:17:02
    the data plane runs in each of these
  • 00:17:04
    AZ's and these are isolated from each
  • 00:17:07
    other they can obviously communicate
  • 00:17:08
    over a network but otherwise the data
  • 00:17:09
    planes don't really know about each
  • 00:17:11
    other and the same is true of the
  • 00:17:13
    control planes for these for these
  • 00:17:16
    resources now I mentioned earlier
  • 00:17:20
    there's a regional endpoint to access
  • 00:17:22
    ec2 so there must be some layer that you
  • 00:17:25
    connect to that sits on top of these and
  • 00:17:27
    there is there's a regional control
  • 00:17:28
    plane that acts as the entry point and
  • 00:17:30
    also handles things that are
  • 00:17:33
    zone specific so for EC2 things like
  • 00:17:36
    security groups are not zone specific it
  • 00:17:40
    also aggregates APIs like describe
  • 00:17:42
    instances so if you want to find out
  • 00:17:44
    about all your instances in a region it
  • 00:17:46
    will need to interrogate the control
  • 00:17:48
    planes across all the AZs so in this
  • 00:17:52
    model again if you lose an AZ you're
  • 00:17:55
    gonna lose your zonal resources but
  • 00:17:56
    you've expected that anyway but the data
  • 00:17:59
    planes in the other AZs are fine the
  • 00:18:01
    control planes for those zones are fine
  • 00:18:03
    and the regional control plane
  • 00:18:06
    is built to be multi AZ fault tolerant
  • 00:18:08
    so it will also continue to operate fine
  • 00:18:12
    except for the fact that it won't be
  • 00:18:14
    able to service requests that target the
  • 00:18:15
    zone that's down so if you have API
  • 00:18:18
    calls in to one of the healthy AZs
  • 00:18:20
    to launch an instance there it should be
  • 00:18:23
    able to route around this particular
  • 00:18:25
    type of AZ failure
  • 00:18:30
    okay so let's review the blast radius
  • 00:18:32
    improvements that we get from
  • 00:18:33
    availability zones so here's a regional
  • 00:18:36
    service spread across three AZ's if zone
  • 00:18:41
    a fails regional service is fine because
  • 00:18:43
    it's able to fail away from from the
  • 00:18:45
    failed AZ meanwhile our zonal service
  • 00:18:49
    has an impact just in that zone but
  • 00:18:51
    not the other zones the other zones are
  • 00:18:53
    isolated and they won't have any impact
  • 00:18:55
    these are all these are all good things
  • 00:18:59
    the theoretical blast radius is a
  • 00:19:01
    different story so by a theoretical
  • 00:19:02
    blast radius I mean in the worst case
  • 00:19:05
    scenario the thing that you know the
  • 00:19:07
    Black Swan event like what's the worst
  • 00:19:09
    case that could happen here for the
  • 00:19:11
    regional service it is the entire
  • 00:19:13
    service and that's the thing that we
  • 00:19:15
    lose sleep at night every night thinking
  • 00:19:17
    about for the zonal service it's it's
  • 00:19:20
    still the zone so there's something nice
  • 00:19:22
    about this property of these
  • 00:19:23
    availability zones we still don't like
  • 00:19:25
    that there's a non infrastructure event
  • 00:19:27
    that could take out a service in a zone
  • 00:19:29
    but it is still nice that it's contained
  • 00:19:31
    to the zone can we get the same kind of
  • 00:19:33
    resilience in our regional service
  • 00:19:36
    without having to sort of target it at a
  • 00:19:38
    in a zone local kind of way
  • 00:19:41
    so let's take a step back and look at
  • 00:19:42
    this abstracted architecture so there's
  • 00:19:46
    an entry point a regional entry point
  • 00:19:49
    into the service it has an aggregation
  • 00:19:52
    layer that might do a few things but
  • 00:19:54
    mostly it's a routing layer into a set
  • 00:19:56
    of compartmentalize resources and then
  • 00:19:59
    there's failure isolation between them
  • 00:20:01
    so in the availability zone case these
  • 00:20:03
    are AZ's down here and then a regional
  • 00:20:06
    control plane that accesses them but
  • 00:20:09
    more generally there's this
  • 00:20:10
    compartmentalization that is giving us
  • 00:20:13
    this nice fault isolation can we use
  • 00:20:16
    that in a in a different way in a
  • 00:20:18
    different dimension than AZ's and get
  • 00:20:21
    some smaller blast radius there's
  • 00:20:24
    another way to think about this
  • 00:20:25
    abstracted architecture which is how
  • 00:20:28
    they build ships so for centuries now
  • 00:20:31
    ships have been built with these
  • 00:20:32
    watertight compartments that are
  • 00:20:35
    separated by bulkheads and the reason
  • 00:20:38
    that this is that if there's a there's
  • 00:20:39
    damage to the hull and it causes
  • 00:20:40
    flooding the flooding is contained into
  • 00:20:43
    one of those compartments and the rest
  • 00:20:46
    of the ship is still intact and the ship
  • 00:20:48
    stays afloat those bulkheads also
  • 00:20:51
    provide structural integrity these are
  • 00:20:53
    both nice properties right you have this
  • 00:20:56
    fault tolerance minimized impact of
  • 00:20:59
    failure and higher structural integrity
  • 00:21:03
    and so we've taken these ideas and
  • 00:21:06
    applied them to our regional services
  • 00:21:08
    and what we call cellular architecture
  • 00:21:10
    or cell based architecture so let me go
  • 00:21:15
    through our simple example again of a
  • 00:21:18
    application with a load balancer compute
  • 00:21:20
    and some storage and it's not shown here
  • 00:21:24
    but this is an application that is
  • 00:21:25
    running in multiple AZs and has the
  • 00:21:27
    the failover as I described before so
  • 00:21:31
    in cell based architecture we take this
  • 00:21:34
    service stack configuration and we
  • 00:21:37
    create multiple instantiations of it and
  • 00:21:43
    these are fully isolated and don't know about
  • 00:21:44
    each other and each one of these stacks
  • 00:21:48
    is what we call a cell
  • 00:21:52
    and then what we'll take our workload
  • 00:21:53
    and basically load balance it partition
  • 00:21:57
    it over these these cells one way to do
  • 00:21:59
    that might be by customer so we'll put
  • 00:22:01
    this section into cell zero the section
  • 00:22:03
    into cell one this section into cell n
  • 00:22:07
    now you guys all look nice but maybe
  • 00:22:10
    there's someone naughty in here it's
  • 00:22:11
    gonna cause us a problem they're gonna
  • 00:22:12
    only gonna cause the problem in that
  • 00:22:14
    one cell the other cells are gonna be
  • 00:22:16
    fine now we need some way to contain
  • 00:22:19
    this thing or make this thing look like
  • 00:22:21
    a single service still so we put a cell
  • 00:22:23
    router on top of it that makes those
  • 00:22:24
    routing decisions and that whole thing
  • 00:22:28
    is what we call a cell based service so
  • 00:22:32
    the cells are an internal structure
  • 00:22:33
    that's invisible to you as a customer
  • 00:22:35
    but provide resilience and fault
  • 00:22:38
    tolerance and this looks just like the
  • 00:22:41
    picture for azi or availability zone
  • 00:22:44
    independence but on a different
  • 00:22:46
    dimension so let's talk about what that
  • 00:22:49
    looks like
  • 00:22:49
    so here's the regional service getting
  • 00:22:52
    resilience from availability zones
  • 00:22:54
    here's that same service divided into
  • 00:22:56
    cells and you can see now that with an
  • 00:23:00
    availability zone failure both services
  • 00:23:02
    are resilient to that because their
  • 00:23:03
    fault tolerant across AZ's and the other
  • 00:23:08
    examples of failure the ones that we
  • 00:23:10
    lose sleep at at night over the failure
  • 00:23:13
    in their in the cell based service it's
  • 00:23:15
    contained to the cell so the impact is 1
  • 00:23:17
    over n where n is the number of cells
  • 00:23:19
    rather than the whole service and I've
  • 00:23:22
    shown three here for the purposes of
  • 00:23:25
    presentation but the number of cells can
  • 00:23:27
    actually be much higher so one over n
  • 00:23:29
    could be a fairly small percentage of
  • 00:23:30
    the overall set of workloads that you're
  • 00:23:33
    supporting now that this approach can
  • 00:23:38
    also be applied to zonal services and we
  • 00:23:40
    do this in our ec2 control plane where
  • 00:23:44
    you divide each of the zonal services
  • 00:23:45
    into cells as well so you still have a
  • 00:23:48
    failure as you'd expect if an
  • 00:23:50
    availability zone goes down what's
  • 00:23:52
    interesting though is as I mentioned
  • 00:23:53
    some availability zones or multiple data
  • 00:23:55
    centers and so you can actually have a
  • 00:23:57
    smaller blast radius
  • 00:23:58
    in certain cases if your zonal cells are
  • 00:24:01
    aligned with the physical infrastructure
  • 00:24:03
    which is the case with our zonal
  • 00:24:05
    easy to control plane services that's
  • 00:24:08
    actually a nice improvement and then
  • 00:24:10
    again for the other failure cases that
  • 00:24:13
    we worry about there is also a smaller
  • 00:24:15
    theoretical blast radius even for the
  • 00:24:17
    zonal services so let's look at the
  • 00:24:21
    system properties of a cell based
  • 00:24:23
    architecture I've talked about some
  • 00:24:26
    of them the first one is workload
  • 00:24:28
    isolation and this is useful not just
  • 00:24:30
    for failures but also just noisy
  • 00:24:31
    neighbor problems and then of course
  • 00:24:35
    there's the the failure containment so
  • 00:24:37
    if we lose a cell the other cells are
  • 00:24:38
    fine there's also this nice property
  • 00:24:43
    that is really powerful really important
  • 00:24:45
    to us at AWS which is how we scale these
  • 00:24:47
    things
  • 00:24:48
    so rather than scaling up a service
  • 00:24:52
    which is sort of the traditional way
  • 00:24:53
    just add you know more and more capacity
  • 00:24:55
    to it in a cell based architecture you
  • 00:24:59
    could also add more capacity to cells
  • 00:25:00
    but one of the things that we include in
  • 00:25:04
    our goals for a cell based architecture is
  • 00:25:06
    that cells like our data centers have a
  • 00:25:09
    maximum size we won't let them grow past
  • 00:25:10
    a certain point and so if we need to
  • 00:25:13
    continue to grow a system rather than
  • 00:25:15
    growing the cells past that point we'll
  • 00:25:18
    add another cell so you grow the system
  • 00:25:20
    by scaling it out with more and more
  • 00:25:22
    cells the fact that the cells have a
  • 00:25:25
    maximum size means you can test them at
  • 00:25:28
    that maximum size with a reasonable test
  • 00:25:30
    configuration so you can test them to
  • 00:25:33
    failure you can do all sorts of stress
  • 00:25:34
    testing and get confidence that you
  • 00:25:36
    understand how that piece of the system
  • 00:25:38
    is going to operate as you you know as
  • 00:25:41
    you get more and more demand on your
  • 00:25:43
    system these cells are also more
  • 00:25:46
    manageable because they're smaller so if
  • 00:25:47
    there's some issue you need to look
  • 00:25:49
    through logs or otherwise inspect the
  • 00:25:52
    nodes in the cell it's gonna be smaller
  • 00:25:55
    it's just a piece of your system rather
  • 00:25:56
    than the whole system so it's gonna be
  • 00:25:58
    easier to work through so let's now talk
  • 00:26:03
    about some of the core considerations in
  • 00:26:05
    a cell based architecture so the first
  • 00:26:09
    is cell size
  • 00:26:14
    sort of the trade-off here is you can
  • 00:26:16
    have a large number of smaller cells or
  • 00:26:19
    a smaller number of large cells in the
  • 00:26:23
    the case where you have smaller cells
  • 00:26:25
    that's nice because now your blast
  • 00:26:26
    radius is you know that much smaller and
  • 00:26:29
    those smaller things are easy to test
  • 00:26:31
    easier to break to understand what their
  • 00:26:33
    breaking points are and easier to operate in
  • 00:26:36
    terms of figuring out if there's an
  • 00:26:37
    issue you know smaller number of nodes
  • 00:26:39
    to go in and take a peek at on the other
  • 00:26:43
    end of the spectrum though larger cells
  • 00:26:45
    have some good properties which is first
  • 00:26:47
    if there's a fixed cost to each of these
  • 00:26:49
    which often there is you know maybe
  • 00:26:51
    there's a separate load balancer for each
  • 00:26:53
    one of them then you get cost efficiency
  • 00:26:55
    by having fewer of them
  • 00:26:57
    you also get reduced splits so that this
  • 00:27:00
    is an important consideration if if
  • 00:27:03
    we're dividing our workload by a
  • 00:27:06
    customer some of our customers might be
  • 00:27:07
    large they have a large number of
  • 00:27:09
    workloads and that may be too large to
  • 00:27:12
    fit in a smaller cell if we're using
  • 00:27:16
    larger cells we may be able to fit that
  • 00:27:18
    larger customer into a single cell right
  • 00:27:20
    and not have to worry about dealing with
  • 00:27:21
    the complexity of splitting across
  • 00:27:23
    multiple cells and finally as a whole
  • 00:27:26
    this system that has fewer cells is
  • 00:27:29
    easier to operate because you it's
  • 00:27:30
    easier to think about easier to look at
  • 00:27:31
    dashboards and so forth there's no right
  • 00:27:34
    answer here except that all things being
  • 00:27:36
    equal we will always prefer the lower
  • 00:27:39
    blast radius over these other
  • 00:27:41
    considerations another which I'm sure
  • 00:27:46
    you've been thinking about looking at
  • 00:27:49
    this diagram is well what about the
  • 00:27:50
    router the cell router is this remaining
  • 00:27:53
    shared component across the entire
  • 00:27:56
    system and so it's really important that
  • 00:27:58
    that thing not fail because now you're
  • 00:28:00
    back to the regional blast radius and so
  • 00:28:04
    we spent a lot of effort making sure
  • 00:28:06
    that that component is stress tested and
  • 00:28:11
    battle-hardened so that we know when it
  • 00:28:13
    we have high confidence that even in the
  • 00:28:16
    Black Swan scenarios that it's going to
  • 00:28:17
    stay resilient and stay up and and one
  • 00:28:21
    of the ways to accomplish that is to
  • 00:28:22
    keep it as simple as possible
  • 00:28:25
    so when we talk to teams about adopting
  • 00:28:27
    a cell based architecture we call this
  • 00:28:29
    component the thinnest possible layer to
  • 00:28:30
    kind of reinforce that it should be really
  • 00:28:33
    simple and yeah that's all I'll say
  • 00:28:38
    about that another consideration is
  • 00:28:42
    partitioning dimension so I've talked
  • 00:28:44
    about how we might divide cells along
  • 00:28:48
    lines of customers but then in the ec2
  • 00:28:51
    control plane case there's an aspect of
  • 00:28:54
    the control plane that actually is cell
  • 00:28:56
    based based on physical infrastructure
  • 00:28:58
    in our data centers which makes sense
  • 00:29:00
    for that application another scenario we
  • 00:29:06
    may divide not by customer but by VPC
  • 00:29:08
    especially because sometimes V pcs may
  • 00:29:11
    have cross customer scenarios and so
  • 00:29:14
    it takes some analysis to decide what's
  • 00:29:16
    the right way to carve this thing up and
  • 00:29:19
    the recommendation I always use is cut
  • 00:29:21
    with the grain and if you don't know
  • 00:29:23
    what that means then think about it this
  • 00:29:25
    way that you know wood has a certain
  • 00:29:26
    grain and it's easy to split along one
  • 00:29:28
    dimension and really hard to split cross
  • 00:29:31
    across the grain and every system has
  • 00:29:34
    has a natural grain to it another
  • 00:29:40
    consideration is what I call cross cell
  • 00:29:42
    use cases these may be unavoidable the
  • 00:29:47
    goal is to keep them to a minimum
  • 00:29:48
    because that adds complexity to the
  • 00:29:51
    thinnest possible layer and also
  • 00:29:53
    increases the blast radius for those
  • 00:29:56
    those operations one example to scatter
  • 00:29:59
    gatherer queries so what this means is
  • 00:30:01
    there may be an API that comes in that
  • 00:30:03
    needs to interrogate multiple cells so
  • 00:30:07
    it scatters requests out and then gathers
  • 00:30:09
    responses and sends out a single reply so
  • 00:30:11
    an example in ec2 is the describe
  • 00:30:13
    instances case I mentioned earlier
  • 00:30:15
    then there are batch operations so if you
  • 00:30:18
    need to execute work on multiple cells
  • 00:30:20
    in a single operation so again in EC2
  • 00:30:23
    maybe the terminate instances API where
  • 00:30:25
    you can send multiple instance IDs
  • 00:30:27
    that can be a cross cell use case the
  • 00:30:33
    last and probably hardest is
  • 00:30:34
    coordinated writes where you're actually
  • 00:30:36
    trying to write atomically
  • 00:30:38
    across
  • 00:30:39
    multiple cells those require careful
  • 00:30:42
    consideration and one example of that is
  • 00:30:44
    cell migration so cell migration is when
  • 00:30:47
    you relocate a workload from one cell to
  • 00:30:50
    another it's maybe we decided this
  • 00:30:52
    customer we're gonna move that customer
  • 00:30:53
    into cell two from cell 1 and you may
  • 00:30:58
    choose do this because you want to
  • 00:31:00
    manage the amount of load or heat that's
  • 00:31:02
    on each cell or maybe just want to load
  • 00:31:05
    balance the sizes of them or maybe you've
  • 00:31:07
    added a cell and you need to and
  • 00:31:09
    your approach for adding cells involves
  • 00:31:12
    you know moving existing workloads over
  • 00:31:14
    into the new cell and the process that
  • 00:31:17
    you used to do the migration is not
  • 00:31:20
    unlike a VM migration so if you're
  • 00:31:22
    familiar with how VM migration works
  • 00:31:24
    basically there's an invisible clone that
  • 00:31:27
    gets created in the target location and
  • 00:31:29
    it gets brought up to date and
  • 00:31:32
    synchronized with the source of course
  • 00:31:35
    the source is still changing so this
  • 00:31:37
    could take a while for it to get close
  • 00:31:40
    to being in sync and at the last possible
  • 00:31:42
    moment
  • 00:31:42
    both are frozen for the final completion
  • 00:31:46
    of the syncing and then an atomic flip
  • 00:31:48
    over to the to the target location that
  • 00:31:52
    works for VMs and that's the same
  • 00:31:53
    approach that we use for migrating
  • 00:31:55
    workloads across cells it requires a
  • 00:31:58
    careful coordination best-managed at the
  • 00:32:02
    router level we have a few approaches
  • 00:32:05
    that we've been using to accomplish this
  • 00:32:09
    so again with the cell based structure
  • 00:32:13
    we're able to reduce the blast radius
  • 00:32:14
    from 100 percent down to one over n
  • 00:32:17
    where n is the number of cells
  • 00:32:18
    I should reinforce that the events that
  • 00:32:23
    cause these types of outages are
  • 00:32:25
    exceedingly rare we could spend a long
  • 00:32:32
    time not even having to worry about the
  • 00:32:34
    kind of failure happening because of all
  • 00:32:35
    the other things that AWS does but
  • 00:32:38
    we're so focused on resilience that
  • 00:32:40
    we're investing additional engineering
  • 00:32:41
    work to get to that picture on the right
  • 00:32:44
    which is a smaller blast radius even
  • 00:32:46
    when those Black Swan events occur
  • 00:32:50
    so cells are great but there's another
  • 00:32:56
    technique that we've been using that is
  • 00:32:57
    even more impressive and more exciting I
  • 00:33:01
    think which is called shuffle sharding
  • 00:33:04
    and shuffle sharding is a technique that
  • 00:33:06
    is like cell based architectures and
  • 00:33:09
    it's particularly useful in stateless or
  • 00:33:12
    soft state services so we'll be walk
  • 00:33:16
    through what shuffle sharding looks like
  • 00:33:17
    so here's another simple service we've
  • 00:33:20
    got eight nodes and these eight nodes are
  • 00:33:23
    handling requests that are sent by a
  • 00:33:26
    load balancer and then we have eight
  • 00:33:29
    different customers that are sending
  • 00:33:30
    requests so we're in Vegas I use some
  • 00:33:33
    some gambling relevant icons here
  • 00:33:36
    to represent our different customers
  • 00:33:39
    and let's imagine one of them Diamond
  • 00:33:42
    here is is introducing a bad workload
  • 00:33:45
    for whatever reason maybe it's expensive
  • 00:33:46
    request maybe it's one of these requests
  • 00:33:50
    that triggers a bug in the system so
  • 00:33:53
    Dimon sends a request in and that
  • 00:33:56
    request causes one of our servers to
  • 00:33:57
    crash okay that's all right we've got
  • 00:34:00
    seven others maybe Dimon will go away or
  • 00:34:03
    change what it's doing probably not
  • 00:34:06
    it'll probably keep retrying and
  • 00:34:09
    eventually take out the whole system so
  • 00:34:11
    here our blast radius is basically all
  • 00:34:14
    the customers this is like the worst
  • 00:34:16
    case scenario we really want to avoid
  • 00:34:18
    this this is where we go to cell based
  • 00:34:21
    architecture so we divide our our
  • 00:34:23
    customers assign a subset of customers
  • 00:34:25
    to each cell
  • 00:34:28
    now when diamond comes along and causes
  • 00:34:31
    problems that problem is contained just
  • 00:34:34
    to the cell this is a 4x improvement
  • 00:34:38
    right we've gone from 100% down to 25%
  • 00:34:40
    the blast radius is the number of
  • 00:34:41
    customers divided by the number of cells
  • 00:34:44
    and again we could improve that further
  • 00:34:46
    by adding more cells or in a system
  • 00:34:50
    where we don't need to really worry
  • 00:34:52
    about which customers land on which
  • 00:34:54
    nodes we could shuffle shard them which
  • 00:34:58
    is a little bit different and it's
  • 00:34:59
    nuanced but you'll see shortly how
  • 00:35:01
    powerful this is
  • 00:35:03
    so we'll take each customer and assign
  • 00:35:05
    them to two nodes effectively at random
  • 00:35:08
    not really at random we'll use hash
  • 00:35:10
    functions so it's predictable where
  • 00:35:15
    these customers land on these nodes
  • 00:35:16
    but basically we've assigned them at
  • 00:35:18
    random so diamond gets assigned to the
  • 00:35:20
    first and fourth nodes we'll put spades
  • 00:35:23
    on those two our roll of two on the dice
  • 00:35:27
    goes to those two nodes and so on
  • 00:35:29
    so these are basically shuffled randomly
  • 00:35:33
    across our set of capacity so now again
  • 00:35:37
    diamond comes along takes out the two
  • 00:35:39
    nodes that are assigned to it but here's
  • 00:35:41
    where it gets interesting look at who's
  • 00:35:44
    sharing those nodes with diamond one of
  • 00:35:47
    them is hearts hearts however has a
  • 00:35:52
    second node that it's assigned to that's
  • 00:35:55
    not impacted by the outage so as
  • 00:35:56
    long as that customer retries it's
  • 00:35:59
    fault-tolerant even though one of its
  • 00:36:02
    nodes is down one of its nodes is up it's
  • 00:36:04
    able to continue operation the same is
  • 00:36:06
    true on the other node where clubs has
  • 00:36:09
    that same property so in this case our
  • 00:36:14
    blast radius is actually the number of
  • 00:36:16
    customers divided by the number of
  • 00:36:17
    combinations of two nodes out of
  • 00:36:21
    eight which turns out there are 28 of
  • 00:36:25
    them which is 3.6 percent so if we had a
  • 00:36:29
    much larger number of customers we'd
  • 00:36:30
    expect these are well distributed
  • 00:36:33
    randomly you would have 3.6%
  • 00:36:35
    of customers impacted by the failure at
  • 00:36:38
    that I showed meanwhile less than half
  • 00:36:42
    of the customers would be in that
  • 00:36:43
    scenario where they're sharing at least
  • 00:36:45
    one of the nodes so they may see a
  • 00:36:47
    little bit of impact a little bit of
  • 00:36:48
    hiccup but they're fine so we went from
  • 00:36:51
    25 percent down to 3.6 percent going
  • 00:36:55
    from the cell-based down to shuffle
  • 00:36:56
    sharding now this is a small system oh I
  • 00:36:59
    should show you the math so the math
  • 00:37:00
    here is probably as you remember from
  • 00:37:01
    high school the binomial coefficient as
  • 00:37:06
    you look at this math to realize as n
  • 00:37:08
    grows our number of combinations grows
  • 00:37:12
    really quickly so let's say we go from 8
  • 00:37:14
    nodes up to 100
  • 00:37:15
    that's not like a huge number of nodes it's
  • 00:37:18
    a reasonable number to run in a
  • 00:37:19
    large-scale system so say we have a
  • 00:37:22
    hundred nodes and then we give each
  • 00:37:23
    customer five combinations we're sorry
  • 00:37:26
    five nodes to represent their
  • 00:37:28
    combination the math tells us that's
  • 00:37:31
    going to be 75 million different
  • 00:37:33
    combinations I think of it basically as
  • 00:37:35
    you know a deck of a hundred cards
  • 00:37:36
    there are you know 75 million
  • 00:37:39
    different combinations of cards you can
  • 00:37:40
    get by picking randomly from that
  • 00:37:42
    deck which is amazing because now you
  • 00:37:45
    can see all right 77 percent of
  • 00:37:47
    customers are not going to see any
  • 00:37:48
    impact when diamond comes along and
  • 00:37:50
    takes out it's five nodes but more
  • 00:37:52
    interestingly ninety-nine point eight
  • 00:37:55
    percent so those first three rows are
  • 00:37:58
    still gonna have a majority of their
  • 00:37:59
    nodes available so they're gonna have a
  • 00:38:01
    better chance than not to completely you
  • 00:38:05
    know route around that problem without
  • 00:38:06
    even having to retry and meanwhile that
  • 00:38:10
    very very very low percentage of
  • 00:38:12
    customers is basically the percentage of
  • 00:38:16
    customers that are going to be sharing
  • 00:38:17
    completely those same five nodes what's
  • 00:38:21
    magical about this and it's all in the
  • 00:38:23
    math is we've created a multi-tenant
  • 00:38:27
    system and then used the shuffle
  • 00:38:30
    sharding to create a single basically a
  • 00:38:32
    single tenant experience which is
  • 00:38:34
    obviously what AWS aspires to do now
  • 00:38:38
    this needs a fault tolerant client as I
  • 00:38:40
    mentioned so one that when it gets a
  • 00:38:42
    failure will retry but that's that's not
  • 00:38:44
    hard that's that's pretty common what's
  • 00:38:47
    interesting also is this not only works
  • 00:38:48
    for servers it can work for queues they
  • 00:38:51
    can work for other resources it's also
  • 00:38:54
    critically dependent on fixed
  • 00:38:56
    assignments so you're you're stuck with
  • 00:39:00
    the hand that we deal you if there's any
  • 00:39:04
    sort of failover like oh well your five
  • 00:39:05
    nodes are down I'll give you these five
  • 00:39:06
    then you get back into that old world
  • 00:39:10
    where now a problem can infect and
  • 00:39:13
    cascade across an entire system so it
  • 00:39:15
    really depends on those fixed
  • 00:39:16
    assignments and if I need some sort of
  • 00:39:18
    routing mechanism so either a shuffle
  • 00:39:20
    sharding aware router or dns can be
  • 00:39:25
    another so in some of our servers both
  • 00:39:27
    will hand a customer's
  • 00:39:28
    if ik DNS name and that will resolve to
  • 00:39:32
    the customer specific shuffle sharted
  • 00:39:34
    set for that customer which basically
  • 00:39:37
    gives them the routing for free cool so
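    To make the fixed-assignment point concrete, here is a minimal sketch (an illustrative assumption, not AWS's implementation) of deterministically deriving a customer's shard from their ID, so a retrying client always lands on the same small set of nodes and there is never a failover onto somebody else's shard.

```python
import hashlib

def shuffle_shard(customer_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Pick a fixed, repeatable subset of nodes for this customer."""
    # Rank every node by a hash of (customer, node); the ranking, and
    # therefore the shard, depends only on the customer ID, so it never
    # changes when nodes fail: no failover onto a different shard.
    ranked = sorted(nodes, key=lambda node: hashlib.sha256(
        f"{customer_id}:{node}".encode()).digest())
    return ranked[:shard_size]

nodes = [f"node-{i:03d}" for i in range(100)]
print(shuffle_shard("customer-42", nodes, 5))  # the same five nodes on every call
```

    A shuffle-sharding-aware router, or the per-customer DNS name mentioned above, would then resolve to exactly this set.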
  • 00:39:42
    Cool. So let's talk about the operational practices that we layer on top of these architectural techniques to achieve the lowest possible blast radius. The first is not even a practice but really a mindset, and maybe even a religion at AWS, which is probably best captured by Werner's blog from earlier this year about compartmentalization. That post is out there and you can just read it, so I won't try to do it justice here, but I'd wager that every new AWS engineer knows within their first week, if not their first day, that we never want to touch more than one zone at a time. This is so important because if we have availability zone fault isolation, or region isolation, as a core tenet of our blast radius reduction, that's going to go out the window if we introduce a correlated failure through some manual action or automated action on multiple of those at the same time.
  • 00:40:43
    The most common, and I guess most obvious, example of this is software deployments. Our software deployments are done in a staggered way, across zones and across regions, over time: quickly enough that we can get features out to customers, because we like to launch features, but slowly enough that we have confidence, as we push a change broader and wider, that it's not going to cause an issue. So we'll start slow, observe, test, and then maybe speed up as it goes out broader and broader. That's the case with cells, it's the case with availability zones, and it's the case with regions. And then within each of those deployment units we'll do a fractional deployment too, so a one-box test is the very first step for a service. For our EC2 deployments we'll start with maybe five or ten machines at first, verify that things are working, and then gradually speed up as it expands across the infrastructure.
  • 00:41:55
    In tandem with that we have a bunch of automated tests that run as part of the deployment, as well as canaries, test applications that mimic a real-world customer invoking APIs, and we monitor the results of those. If there's any problem, the deployment gets automatically rolled back, we look at what happened, and we either decide it was not an issue we need to worry about or fix the problem before we start the deployment again. This is really important; it's what we need to do to make sure we've not compromised the boundaries that we've put between cells and AZs and regions.
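    Purely as an illustration of the staggered, automated rollout described here (the wave names, bake times, and checks are hypothetical, not AWS's pipeline), a deployment can be modeled as ordered waves with a canary check and automatic rollback between them:

```python
import time

# Hypothetical wave definitions: one box first, then widen by AZ and region.
WAVES = [
    {"name": "one-box",    "targets": ["us-east-1a (1 host)"]},
    {"name": "one AZ",     "targets": ["us-east-1a"]},
    {"name": "one region", "targets": ["us-east-1"]},
    {"name": "remaining",  "targets": ["us-west-2", "eu-west-1"]},
]

def deploy(targets: list[str]) -> None:
    print(f"deploying to {targets}")

def canaries_healthy(targets: list[str]) -> bool:
    """Stand-in for the automated tests and canary metrics on the targets."""
    return True  # assume healthy for this sketch

def rollback(targets: list[str]) -> None:
    print(f"rolling back {targets}")

def staggered_rollout(bake_seconds: float = 0.1) -> None:
    completed: list[list[str]] = []
    for wave in WAVES:
        deploy(wave["targets"])
        time.sleep(bake_seconds)              # bake time before widening
        if not canaries_healthy(wave["targets"]):
            # Stop the rollout and undo everything touched so far.
            for targets in reversed(completed + [wave["targets"]]):
                rollback(targets)
            return
        completed.append(wave["targets"])
    print("rollout complete")

staggered_rollout()
```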
  • 00:42:35
    All of that deployment machinery is automated, including the rules about which tests have to succeed and the timing and the windows for when we progress to the next stage. But automation is key in general, not just for deployments. There are other things we do to manage our infrastructure, whether it's configuring network devices or propagating new credentials to the stacks. That could be done by hand, but humans are prone to error, and it's much better to automate it, because you can review whatever it is you're building in your automation, you can test it, it's predictable in how it's going to operate, and you can repeat it over and over again and know it's going to work the same way every time.
  • 00:43:25
    And then finally, as you may have heard, AWS, and Amazon in general, has a very strong philosophy around end-to-end ownership. Our service teams are composed of engineers who are builder-operators: as engineers we design the software, we build it, we test it, we're the ones that deploy it, and we're the ones that operate it and respond to issues in production. What this does is give us a wonderful feedback loop between design choices, what their impact is operationally, and the changes we need to make to avoid future problems when there is a failure we need to worry about. And so the correction-of-errors template that I mentioned earlier is usually filled out by engineers, or in partnership with engineers, where they think about the blast radius, so they're in a great position to then go and implement the changes that make sure the next time that event occurs the blast radius is cut in half or more.
  • 00:44:23
    So to wrap up: we've got a variety of containment and compartmentalization mechanisms that we use to reduce blast radius. It starts with regions and the strong isolation between them, then availability zones, then that alternate dimension of compartmentalization within availability zones, which is cells, and then the magic of shuffle sharding to get sort of virtual cells, taking advantage of the combinatorics of shuffle sharding. We protect that compartmentalization with operational practices: step-by-step, phased deployments that are automated, and service teams that are builder-operators, so they are close to the front lines and understand how their decisions impact the availability of their systems, all with the goal of reducing blast radius. So that's my talk. I hope you learned a few things, and I'm happy to take questions now if anyone has them; I think there are live microphones down the center aisles here.
  • 00:46:01
    Yeah, so the question was: is there an example of Murphy's Law where we thought we had everything nailed down, everything sorted out, that we had answered all the possible failure modes? It goes back to my slide about the bit flips on the networking side; this is an example from many, many years ago. S3 cares deeply about data, cares deeply about the integrity of data, and there are many layers of checksumming in S3, so if there is an errant bit flip introduced, we make sure we detect it and it's not an issue. In 2008 we had an event in S3 where there was one network card on one server that every now and then was flipping one bit. There is a layer of the system that handles the group communication across the system; it uses gossip protocols so it can understand the state of all the servers in the system, whether they're healthy or not. Well, the gossip protocol noticed a funny server name in a packet, because one of the bits had been flipped in the server name, a host it had never heard of before, and that triggered a much more expensive sort of reconciliation protocol. Long story short (you can read the long story; there's a post-mortem, still published somewhere on the status dashboard pages, that talks about the outage), it took down S3 completely. So one server, one NIC, one bit took down a regional service. We had checksumming all the way up and down the stack, and that was the only layer that didn't have the checksumming; someone had forgotten to add it there.
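    As a generic sketch of the per-layer integrity checking described in this story (not S3's actual gossip protocol), a message can carry a checksum that the receiver verifies before acting on any field such as a host name, so a single flipped bit gets dropped instead of triggering expensive reconciliation:

```python
import json
import zlib

def encode_gossip(payload: dict) -> bytes:
    """Serialize a gossip-style message with a CRC32 over the body."""
    body = json.dumps(payload, sort_keys=True).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body

def decode_gossip(message: bytes) -> dict:
    """Reject the message if any bit was flipped in transit."""
    checksum, body = int.from_bytes(message[:4], "big"), message[4:]
    if zlib.crc32(body) != checksum:
        raise ValueError("checksum mismatch: drop message, do not reconcile")
    return json.loads(body)

msg = encode_gossip({"host": "server-0042", "state": "healthy"})
corrupted = bytearray(msg)
corrupted[10] ^= 0x01          # simulate the one-bit flip from the story
try:
    decode_gossip(bytes(corrupted))
except ValueError as err:
    print(err)
```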
  • 00:48:09
    Yeah, so the follow-up question is: wasn't there a more recent event where human error was involved in an S3 incident? Yes, there was, and it comes back to the operational practices slides I was talking about: all your best plans can fall apart if whatever you're doing doesn't respect those fault isolation boundaries. In that case the engineer certainly knew our philosophy; it was more of a human-error mistake on the command line, which goes back to the importance of automation and the importance of testing those things.
  • 00:48:49
    Peter, yes: with everything that you talked about across the different layers, from regions to availability zones to cells to the shuffle sharding, for your customers, are they given to them, or are they for purchase as services? Good question. Regions are given to you, availability zones are given to you. Cells and shuffle sharding are sort of different things, in that they're like the watertight compartments inside a ship: you could ask the purser on the boat to give you a tour of the lower decks and maybe they'd show you the compartments, but otherwise you don't really care about that, and you get them for free; it's just part of the safety of the ship. So they're invisible, and they're free.
  • 00:49:39
    they're free I wanted to have a cell for
  • 00:49:51
    my own service that I'm making and I
  • 00:49:53
    actually care about aligning that cell
  • 00:49:55
    with a particular data center so you can
  • 00:50:05
    get that kind of affinity with your
  • 00:50:07
    zonal resources right so you can you can
  • 00:50:09
    get affinity with your ec2 instances and
  • 00:50:11
    your Cloud HSM instance and your EBS
  • 00:50:15
    volumes because we give you full control
  • 00:50:16
    over the place well and availability
  • 00:50:20
    zone you can think of as a lot
  • 00:50:22
    whole data center there they're one in
  • 00:50:23
    the same so a data center will always be
  • 00:50:28
    in one easy right an AZ will always be
  • 00:50:33
    one or more data centers you're saying
  • 00:50:36
    for the multiple data center or a Z's
  • 00:50:39
    exactly got it yeah no that that is if
  • 00:50:42
    that is not visible to you as a customer
  • 00:50:43
    you don't have control over that Thanks
  • 00:50:46
    Yep. So as you move from regions to cells, how does the version management of what you're deploying work? Is there any operational insight into what's being deployed where? Does it have any impact on managing operations, deploying in a region versus deploying at a shard level? Are there any complexities there? Yeah, so I think what you're asking is that there's this window of time, as we go through the progression of the deployment, when the versions may not be in sync across multiple locations; is that the core of your question? Yes. Yeah, that's definitely a consideration, and a deployment may take, depending on the system, several days to several weeks. In some operating infrastructure we're particularly careful about the pace of deployment; in fact, in some of the networking components for VPC it's a particularly delicate affair to roll out changes, and it might take a while. So yes, that is a complication we have to be mindful of: which version is running where.
  • 00:52:00
    Some questions regarding the global services: we talked a lot about regions and zones, but how are the global services organized with regard to control plane and data plane resiliency with respect to the regions or edge nodes? So is your question about regional services that have global features, or truly global services? For example, IAM. IAM, okay, that's a great question. IAM is a special case, because your account information and your credentials and so forth are available globally, so it is in a sense a global service. However, each region has a separate control plane, sorry, a separate data plane, that can operate completely disconnected from the source of truth for IAM; the control plane, however, is global. So there are particular considerations that we take with a service like that to make sure it continues to be available even in the face of some of the failures I talked about. I had it on a slide and forgot to mention it, but there is another philosophy we have, which is the separation between control plane and data plane, and ensuring that the data plane can continue to function even if there is a control plane issue. That's exactly the scenario with IAM: it's critically important, because a region can be disconnected from the source of truth, or there could be other issues with the control plane, but we want to make sure that you can still validate your credentials and so forth in each region and not have global impact. Can you go to the mic?
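    Here is a toy sketch of that control-plane/data-plane separation (all class and field names are made up for illustration; this is not IAM's design): writes go through a global control plane, while each region validates credentials from a local replica and keeps working when the control plane is unreachable.

```python
class GlobalControlPlane:
    """Source of truth for credential changes (writes)."""
    def __init__(self) -> None:
        self.credentials: dict[str, str] = {}
        self.reachable = True

    def put_credential(self, user: str, secret: str) -> None:
        if not self.reachable:
            raise ConnectionError("control plane unavailable")
        self.credentials[user] = secret


class RegionalDataPlane:
    """Serves reads (credential validation) from a local replica."""
    def __init__(self, control: GlobalControlPlane) -> None:
        self.control = control
        self.replica: dict[str, str] = {}

    def replicate(self) -> None:
        # Asynchronous replication from the source of truth when reachable.
        if self.control.reachable:
            self.replica = dict(self.control.credentials)

    def validate(self, user: str, secret: str) -> bool:
        # Validation uses only regional state: it keeps working when the
        # control plane is disconnected, at the cost of some staleness.
        return self.replica.get(user) == secret


control = GlobalControlPlane()
region = RegionalDataPlane(control)
control.put_credential("alice", "s3cret")
region.replicate()

control.reachable = False                   # lose the global control plane
print(region.validate("alice", "s3cret"))   # True: data plane keeps working
```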
  • 00:53:48
    So when you were talking about the cell structure and the shuffle sharding: with the cell stuff, you were talking about the control layer setting up the routing to cut with the grain of the service, the ways to break it up. When you move to shuffle sharding, is that done at the same level? Because obviously you lose that sort of control based on the characteristics of the service. Yeah, it's a good question. For cell-based services, usually what prevents you from shuffle sharding is that you have to have some control over state, so you can't sort of willy-nilly scatter state across the entire fleet. Some services don't have state, or they have state that they can cache locally and then operate fine; shuffle sharding works well for that latter scenario, where there doesn't need to be any particular node affinity. There's another sort of advanced topic here, which is kind of interesting: you can actually layer both of these things together. You could take a cell-based architecture and then shuffle-shard requests across the cells in a stateless system, to get a different kind of resilience property; that's something we're working on now with DynamoDB. So I don't know if I answered your question. Okay. I guess maybe another way to answer it is that shuffle sharding is not an improvement on cell-based in all cases; it's really system-specific when you might be able to apply it.
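    A minimal sketch of that layering idea, under the assumption of a stateless request path (illustrative names only, not a DynamoDB design): cells stay the isolation unit, and each customer is pinned to a fixed shuffle shard of cells rather than of individual hosts.

```python
import hashlib

CELLS = [f"cell-{i}" for i in range(12)]   # illustrative cell names

def cells_for(customer_id: str, width: int = 3) -> list[str]:
    """Pin a customer to a fixed subset of cells: their shuffle shard of cells."""
    ranked = sorted(CELLS, key=lambda cell: hashlib.sha256(
        f"{customer_id}:{cell}".encode()).digest())
    return ranked[:width]

def route(customer_id: str, request_id: str) -> str:
    """Stateless requests can go to any cell in the customer's shard."""
    shard = cells_for(customer_id)
    index = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % len(shard)
    return shard[index]

print(cells_for("customer-42"))          # the same three cells every time
print(route("customer-42", "req-001"))   # one of those three cells
```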
  • 00:55:30
    Cool. Okay, it looks like there are no more questions. Thanks again for coming, I hope you learned something, and enjoy your week at re:Invent.
  • 00:55:40
    [Applause]
Tags
  • AWS
  • Distributed Systems
  • Resilience
  • Failure Management
  • CAP Theorem
  • Murphy's Law
  • Operational Practices
  • Shuffle Sharding
  • Cell-Based Architecture
  • Availability Zones