DS101.2 Cassandra Overview

00:10:08
https://www.youtube.com/watch?v=E-xBog9nWTA

Résumé

TLDRApache Cassandra är en distribuerad databas som är konstruerad för att garantera hög tillgänglighet och linjär skalbarhet. Den är utformad för att inte ha några enskilda felpunkter tack vare sin peer-to-peer-struktur där alla noder är likvärdiga och kan hantera både läs- och skrivbegäranden. Cassandra väljer hög tillgänglighet över strikt konsistens vid nätverksdelningar enligt CAP-teoremet. Data replikerar till flera servrar vilket bidrar till systemets robusthet mot fel. Systemet ger utvecklare möjlighet att konfigurera replikationsfaktorer och konsistensnivåer efter behov, vilket gör det särskilt fördelaktigt för miljöer där hög tillgänglighet är viktigare än strikt datakonsistens. Cassandra stödjer asynkron replikering över flera datacenter, vilket möjliggör datahantering på global nivå. Det fungerar bra med en mängd billiga hårdvaror och kan enkelt hantera kluster oavsett deras storlek.

A retenir

  • 📌 Apache Cassandra är en distribuerad databas med hög tillgänglighet och skalbarhet.
  • 🔄 Peer-to-peer-strukturen eliminerar enskilda felpunkter.
  • 🌓 Data replikeras för att säkerställa motståndskraft mot fel.
  • ⚖️ Cassandra prioriterar tillgänglighet över strikt konsistens enligt CAP-teoremet.
  • 🛠️ Utvecklare kan anpassa replikationsfaktorer och konsistensnivåer.
  • 🌐 Stödjer global asynkron replikering över flera datacenter.
  • 💡 Effektiv för kostnadseffektivt klustermanagement.
  • ⏩ Låg latens och förutsägbar prestanda vid skalning.
  • 🔑 Primärnycklar och partitionsnycklar hanterar datadistribution.
  • 💽 Kan köras på billig serverhårdvara utan centrala kontrollöverhead.

Chronologie

  • 00:00:00 - 00:05:00

    Apache Cassandra är en snabb distribuerad databas designad för hög tillgänglighet och linjär skalbarhet. Cassandra garanterar ett SLA och mycket låg latens, vilket innebär att prestandan förblir förutsägbar när klustret växer från fem till tusen noder utan några ensamstående felpunkter. Den använder peer-to-peer-teknik, vilket innebär att alla noder är jämlika utan behov av master-slave-konfigurationer. Cassandra kan köra på billiga hårdvaror och hanteras enkelt, vilket gör det möjligt för ett litet team att hantera stora kluster effektivt. Det är dock inte en direkt ersättning för en relationsdatabas och kräver design enligt dess datamodelleringsregler.

  • 00:05:00 - 00:10:08

    Cassandra använder alltid asynkron replikering med en replikeringsfaktor som ofta är tre, vilket betyder att tre kopior av varje databit finns i klustret. Om en nod är nere, använder Cassandra 'hinted handoffs' för att återskapa saknade skrivningar när noden återansluter till klustret. Utvecklare kan ställa in konsistensnivå per läs- eller skrivbegäran, något som påverkar hastighet och tillgänglighet beroende på valt värde. Vanliga konsistensnivåer inkluderar 'Consistency Level One' och 'Quorum'. Cassandra tillhandahåller också enkel konfiguration för flera datacenter och replikering över dessa, vilket säkerställer hög tillgänglighet och möjliggör logiska eller fysiska datacenter. Det är enkelt att integrera Cassandra med verktyg som Spark för att hantera olika typer av arbetsbelastningar.

Carte mentale

Vidéo Q&R

  • Vad är Apache Cassandra?

    Apache Cassandra är en snabb, distribuerad databas designad för hög tillgänglighet och linjär skalbarhet.

  • Vilka är fördelarna med Cassandra jämfört med en relationsdatabas?

    Cassandra har ingen enskild felpunkt och kan hantera noder utan ledare, vilket gör att den kan hantera hög tillgänglighet och motstånd mot fel, till skillnad från relationsdatabaser.

  • Hur fungerar datareplikering i Cassandra?

    Data replikeras till flera servrar så att alla noder kan hantera läs- och skrivförfrågningar, vilket bidrar till hög tillgänglighet.

  • Vad betyder "consistency level" i Cassandra?

    Det betyder hur många repliker som måste svara innan en läs- eller skrivbegäran anses vara framgångsrik.

  • Vad är en primär nyckel i Cassandra?

    En primär nyckel inkluderar en partitionsnyckel som används vid datainmatning för att bestämma vilken nod data ska gå till i klustret.

  • Vad innebär replication factor i Cassandra?

    Replication factor anger hur många kopior av varje del av data som ska finnas i klustret.

  • Hur hanterar Cassandra nätverkspartitioner enligt CAP-teoremet?

    Cassandra prioriterar hög tillgänglighet över strikt konsistens under nätverkspartitioner.

  • Vad är "quorum" i Cassandra-konsistensnivåer?

    Quorum innebär en majoritet av repliker, exempelvis 2 av 3, måste svara för en åtgärd ska anses framgångsrik.

  • Hur kan Cassandra användas med flera datacenter?

    Cassandra kan asynkront replikera data mellan flera datacenter, vilket ökar tillgängligheten även om ett helt datacenter går ner.

  • Varför är Apache Cassandra effektiv när det kommer till skalbarhet?

    Den kan köras på billighetsserverhårdvara och enkelt hantera kluster av olika storlekar utan centraliserad kontroll.

Voir plus de résumés vidéo

Accédez instantanément à des résumés vidéo gratuits sur YouTube grâce à l'IA !
Sous-titres
en
Défilement automatique:
  • 00:00:00
    [Music]
  • 00:00:08
    all right
  • 00:00:09
    that brings us to what is apache
  • 00:00:11
    cassandra apache cassandra is a fast
  • 00:00:13
    distributed database
  • 00:00:15
    built for high availability and linear
  • 00:00:16
    scalability we want to be able to get
  • 00:00:18
    predictable performance
  • 00:00:19
    out of our database that means that we
  • 00:00:21
    can guarantee sla
  • 00:00:23
    a very very low latency we know that as
  • 00:00:26
    our cluster scales up
  • 00:00:27
    we're going to have the same performance
  • 00:00:28
    whether it's five nodes or ten nodes or
  • 00:00:30
    a thousand nodes
  • 00:00:31
    we know that we're not going to have any
  • 00:00:34
    single points of failure one of the
  • 00:00:35
    great things about cassandra
  • 00:00:37
    is that it's a peer-to-peer technology
  • 00:00:39
    so there's no master slave there's no
  • 00:00:41
    failover there's no leader election
  • 00:00:42
    there's none of this funny business that
  • 00:00:44
    we really have to worry about anymore
  • 00:00:46
    out of the box with open source we can
  • 00:00:48
    do multi-dc
  • 00:00:49
    when we're talking high availability we
  • 00:00:51
    want to be able to withstand failure of
  • 00:00:52
    an
  • 00:00:53
    entire data center cassandra can do that
  • 00:00:55
    you absolutely are not going to get that
  • 00:00:57
    out of a relational database we also
  • 00:00:59
    want to deploy everything commodity
  • 00:01:00
    hardware we talked about how expensive
  • 00:01:02
    it is to scale things up vertically
  • 00:01:04
    with cassandra you're going to put it on
  • 00:01:05
    cheap hardware and you're just going to
  • 00:01:07
    use a whole bunch of it
  • 00:01:08
    it's really really easy to manage and
  • 00:01:10
    speaking of management it's
  • 00:01:12
    extremely easy to manage operationally
  • 00:01:14
    you can take the same three-man team and
  • 00:01:16
    have them
  • 00:01:17
    manage a three-node cluster or 30-node
  • 00:01:19
    cluster
  • 00:01:20
    or a 100 node cluster i have put this
  • 00:01:22
    into production with that size
  • 00:01:24
    team and it absolutely works one thing
  • 00:01:26
    to keep in mind
  • 00:01:27
    it is not a drop-in replacement for a
  • 00:01:29
    relational database
  • 00:01:30
    so you're not going to take the same
  • 00:01:31
    data model and just throw it in
  • 00:01:32
    cassandra and hope it works
  • 00:01:34
    you are going to have to design your
  • 00:01:35
    application around cassandra's data
  • 00:01:37
    modeling rules but the net result is an
  • 00:01:39
    application that will never go down
  • 00:01:41
    if you want to think about cassandra
  • 00:01:42
    conceptually you can think about it as a
  • 00:01:44
    giant hash ring
  • 00:01:45
    where all nodes in the cluster are equal
  • 00:01:47
    so when i say nodes i basically mean
  • 00:01:48
    virtual machines or machines could
  • 00:01:50
    actually be physical computers
  • 00:01:52
    all participating in a cluster all equal
  • 00:01:54
    and each node owns a
  • 00:01:56
    range of hashes so like a bucket of
  • 00:01:58
    hashes and when you define a data model
  • 00:02:00
    in cassandra and
  • 00:02:02
    we won't be talking too much about cql
  • 00:02:04
    in this video but when you define a data
  • 00:02:05
    model and cassandra when you create a
  • 00:02:07
    table one of the things that you specify
  • 00:02:09
    is a primary key and part of that
  • 00:02:11
    primary key is something called the
  • 00:02:12
    partition key
  • 00:02:13
    and the partition key is what's actually
  • 00:02:16
    used when you insert data intake sander
  • 00:02:18
    the value of that
  • 00:02:19
    partition key is run through a
  • 00:02:20
    consistent hashing function
  • 00:02:22
    and depending on the output we can
  • 00:02:23
    figure out which bucket or which range
  • 00:02:25
    of hashes
  • 00:02:26
    that value fits into and thus which node
  • 00:02:29
    we need to go talk to
  • 00:02:30
    to actually distribute the data around
  • 00:02:33
    the cluster
  • 00:02:34
    the other cool thing about cassandra is
  • 00:02:35
    that data is replicated
  • 00:02:37
    to multiple servers and that all of
  • 00:02:39
    those servers all of them are equal so
  • 00:02:41
    like john mentioned
  • 00:02:42
    when we were just talking on the
  • 00:02:43
    previous slide there's none of this
  • 00:02:45
    master slave there's no zookeeper
  • 00:02:46
    there's no config servers
  • 00:02:48
    all nodes are equal any node in the
  • 00:02:50
    cluster can service any given
  • 00:02:52
    read or write request for the cluster
  • 00:02:54
    okay let's talk about trade-offs one of
  • 00:02:56
    the things that's really important to
  • 00:02:57
    understand when looking at a database
  • 00:02:59
    is how the cap theorem works so the cap
  • 00:03:01
    theorem says that during a network
  • 00:03:02
    partition which basically means when
  • 00:03:04
    computers can't talk to each other
  • 00:03:06
    either between data centers or on a
  • 00:03:07
    single network that you can either
  • 00:03:09
    choose consistency
  • 00:03:10
    or actually you can't get consistency or
  • 00:03:12
    you can get high availability
  • 00:03:14
    so really what happens here is if two
  • 00:03:16
    machines can't talk and you do a right
  • 00:03:18
    to them and you have to be completely
  • 00:03:19
    consistent
  • 00:03:20
    they can't talk to each other the system
  • 00:03:22
    is going to appear as if it's down
  • 00:03:24
    so if we give up consistency that means
  • 00:03:26
    we can be highly available
  • 00:03:27
    so that's what cassandra chooses it
  • 00:03:29
    chooses to be highly available
  • 00:03:31
    in a network partition as opposed to
  • 00:03:33
    being down
  • 00:03:34
    in a lot of applications this is
  • 00:03:35
    definitely way better than downtime
  • 00:03:38
    another thing that we need to know is
  • 00:03:41
    from data center to data center let's
  • 00:03:42
    say we were to take
  • 00:03:43
    three data centers around the world
  • 00:03:44
    maybe one in the u.s one in europe
  • 00:03:47
    one in asia it's completely impractical
  • 00:03:49
    to try and be consistent across data
  • 00:03:51
    centers
  • 00:03:52
    we want to asynchronously replicate our
  • 00:03:54
    data from one dc to another
  • 00:03:56
    because it just takes way too long for
  • 00:03:58
    data to travel from the u.s to asia
  • 00:04:00
    we're limited by the speed of light here
  • 00:04:01
    this is something that we are never
  • 00:04:03
    going to get around
  • 00:04:04
    it's just not going to happen that's why
  • 00:04:05
    we choose availability and that's how
  • 00:04:08
    consistency is affected now we want to
  • 00:04:09
    talk about some of the dials that
  • 00:04:11
    cassandra puts
  • 00:04:12
    into your hands as a developer to kind
  • 00:04:14
    of control this idea of
  • 00:04:16
    fault tolerance so john talked about the
  • 00:04:18
    cap theorem earlier this idea of
  • 00:04:20
    being maybe more consistent or being
  • 00:04:22
    more available it's kind of a sliding
  • 00:04:23
    scale and cassandra doesn't impose
  • 00:04:25
    one model on you you get a couple of
  • 00:04:27
    dials that you get to turn
  • 00:04:29
    to configure this like this idea of
  • 00:04:30
    being more consistent or more available
  • 00:04:32
    so first one we want to talk about is
  • 00:04:34
    replication so replication
  • 00:04:36
    usually this is called replication
  • 00:04:37
    factor it's abbreviated as rf you'll see
  • 00:04:39
    that a lot
  • 00:04:40
    when you're looking at cassandra
  • 00:04:41
    documentation and whatnot and a very
  • 00:04:43
    typical
  • 00:04:44
    replication factor for people running in
  • 00:04:46
    production is a replication factor of
  • 00:04:47
    three
  • 00:04:48
    and essentially all this means is how
  • 00:04:50
    many copies of each piece of data should
  • 00:04:52
    there be
  • 00:04:53
    in your cluster so when i do a write to
  • 00:04:54
    cassandra you can see in this example on
  • 00:04:56
    the slide here client's writing a
  • 00:04:58
    the a node gets a copy the b node gets a
  • 00:05:00
    copy and the c node there also gets a
  • 00:05:01
    copy so three copies total for an rf of
  • 00:05:03
    three
  • 00:05:04
    data is always replicated in cassandra
  • 00:05:06
    so we're going to talk on the next slide
  • 00:05:08
    about consistency level
  • 00:05:09
    data is always replicated in cassandra
  • 00:05:11
    you set this replication factor when you
  • 00:05:13
    configure a key space
  • 00:05:14
    which in cassandra is essentially a
  • 00:05:16
    collection of tables it's very similar
  • 00:05:18
    to
  • 00:05:18
    a schema in oracle or a database in
  • 00:05:20
    mysql or microsoft sql server this
  • 00:05:23
    replication
  • 00:05:24
    happens asynchronously and if a machine
  • 00:05:26
    is down while
  • 00:05:27
    while this replication is supposed to go
  • 00:05:28
    on whatever node you happen to be
  • 00:05:30
    talking to
  • 00:05:31
    is going to save what's called a hint
  • 00:05:33
    and cassandra uses something called
  • 00:05:34
    hinted handoffs
  • 00:05:35
    to be able to replay when that node
  • 00:05:37
    comes back up
  • 00:05:38
    and rejoins the cluster to be able to
  • 00:05:40
    replay all of the writes
  • 00:05:42
    that that node that was down missed the
  • 00:05:44
    other dial that cassandra gives you as a
  • 00:05:46
    developer is something called
  • 00:05:47
    consistency level
  • 00:05:48
    so you get to set this on any given read
  • 00:05:50
    or write request that you do
  • 00:05:52
    as a developer from your application
  • 00:05:54
    talking to cassandra
  • 00:05:56
    so i'm going to show you an example of
  • 00:05:57
    two of the most popular consistency
  • 00:05:59
    levels with the hope that that'll kind
  • 00:06:00
    of illustrate
  • 00:06:01
    exactly what consistency level means but
  • 00:06:03
    basically a consistency level means
  • 00:06:05
    how many replicas do i need to hear when
  • 00:06:07
    i do a read or write
  • 00:06:09
    before that read or that write is
  • 00:06:10
    considered successful so if i'm doing a
  • 00:06:12
    read
  • 00:06:12
    how many replicas do i need to hear from
  • 00:06:14
    before cassandra gives
  • 00:06:15
    the data back to the client or if i'm
  • 00:06:18
    doing a write
  • 00:06:19
    how many replicas need to say yep we got
  • 00:06:21
    your data we've written it to disk
  • 00:06:23
    before cassandra replies to the client
  • 00:06:25
    and says yep got your data
  • 00:06:27
    so the two most popular consistency
  • 00:06:29
    levels are consistency level of one
  • 00:06:31
    which like the name sort of implies just
  • 00:06:33
    means one replica so
  • 00:06:35
    you can see that first example there
  • 00:06:36
    client is writing a to the cluster
  • 00:06:39
    the a node gets its copy and since we're
  • 00:06:41
    writing at consistency level of one we
  • 00:06:42
    can acknowledge right back to the client
  • 00:06:44
    immediately
  • 00:06:44
    yep we got the data the dashed lines on
  • 00:06:46
    that diagram there are there
  • 00:06:48
    to indicate to you that just because you
  • 00:06:49
    write with a consistency level of one
  • 00:06:51
    doesn't mean that cassandra is not going
  • 00:06:53
    to honor your replication factor still
  • 00:06:55
    now the other most popular consistency
  • 00:06:57
    level that people use a lot
  • 00:06:58
    is quorum so quorum if you've never
  • 00:07:00
    heard the term before essentially
  • 00:07:02
    means a majority of replicas so in the
  • 00:07:04
    case of replication factor of three
  • 00:07:06
    uh this is two out of three uh for 51
  • 00:07:09
    or greater so in this example again we
  • 00:07:12
    got a client writing a
  • 00:07:13
    again and you can see the anode gets a
  • 00:07:15
    copy the b node gets a copy and responds
  • 00:07:17
    and at that point once we've got two out
  • 00:07:19
    of the three replicas that have
  • 00:07:20
    acknowledged we can acknowledge back to
  • 00:07:22
    the client
  • 00:07:22
    yes we've got your data now again dashed
  • 00:07:25
    line indicates
  • 00:07:26
    that uh you know we're still going to
  • 00:07:27
    honor your replication factor we're just
  • 00:07:28
    not waiting on that extra node to reply
  • 00:07:31
    before we acknowledge back to the client
  • 00:07:33
    you might say well why would i pick one
  • 00:07:34
    consistency level versus another you
  • 00:07:36
    know what kind of
  • 00:07:37
    impact is it going to have well one
  • 00:07:38
    pretty obvious one if you if you think
  • 00:07:40
    about it
  • 00:07:41
    is how fast you can read and write data
  • 00:07:43
    is definitely going to be impacted by
  • 00:07:45
    what consistency level you use so if i'm
  • 00:07:47
    using a lower consistency level
  • 00:07:49
    like say one if i'm only waiting on a
  • 00:07:51
    single server i'm going to be able to
  • 00:07:52
    read and write data really really
  • 00:07:54
    quickly whereas if i'm using a higher
  • 00:07:55
    consistency level
  • 00:07:56
    gonna be much slower to read and write
  • 00:07:59
    data
  • 00:07:59
    now the other thing that you kind of
  • 00:08:01
    have to keep in mind with consistency
  • 00:08:02
    level is this is also going to impact
  • 00:08:04
    your availability
  • 00:08:05
    so john talked about the cap theorem and
  • 00:08:07
    talked about c versus a
  • 00:08:08
    and how you know it's kind of a sliding
  • 00:08:10
    scale well if i choose a higher
  • 00:08:12
    consistency level
  • 00:08:14
    where i have to hear from more nodes
  • 00:08:15
    where more nodes have to be online to be
  • 00:08:17
    able to acknowledge reads and writes
  • 00:08:19
    then i'm going to be less available i'm
  • 00:08:20
    going to be less tolerant
  • 00:08:22
    to nodes going down whereas if i choose
  • 00:08:24
    a lower consistency level like
  • 00:08:26
    1 i'm going to be much more highly
  • 00:08:28
    available i'm going to be able to
  • 00:08:29
    withstand
  • 00:08:30
    in this example here with an rf3 i'm
  • 00:08:32
    going to be able to withstand two nodes
  • 00:08:33
    going down
  • 00:08:34
    and still be able to do reads and writes
  • 00:08:36
    in my cluster
  • 00:08:37
    and cassandra lets you pick this for
  • 00:08:38
    every query you do so it's not going to
  • 00:08:40
    impose one model
  • 00:08:41
    you as a developer get to choose which
  • 00:08:42
    consistency level is appropriate for
  • 00:08:44
    which parts of your application
  • 00:08:45
    one of the great things about cassandra
  • 00:08:47
    because we get asynchronous replication
  • 00:08:49
    is that it's really easy to do multiple
  • 00:08:51
    data centers so when we do a write to a
  • 00:08:53
    single data center we can specify our
  • 00:08:55
    consistency level
  • 00:08:56
    maybe we say one or quorum if we're
  • 00:08:58
    doing multiple data centers we're going
  • 00:08:59
    to say local one or local quorum
  • 00:09:01
    now when that happens you get your write
  • 00:09:03
    and it happens in your local data center
  • 00:09:05
    and then it returns to the client
  • 00:09:07
    and then let's say you had five data
  • 00:09:08
    centers the information that you wrote
  • 00:09:10
    to your first data center is going to be
  • 00:09:12
    asynchronously replicated
  • 00:09:13
    to the other data centers around the
  • 00:09:15
    world this is very very powerful
  • 00:09:17
    this is how you can get super super high
  • 00:09:19
    availability even when an entire data
  • 00:09:21
    center goes down
  • 00:09:22
    so you can specify the replication
  • 00:09:24
    factor per key space
  • 00:09:26
    so you can have one key space that has
  • 00:09:28
    five replicas one that has three one
  • 00:09:30
    that has one
  • 00:09:31
    it's completely up to you and completely
  • 00:09:33
    configurable
  • 00:09:34
    it's important to understand that a data
  • 00:09:36
    center can be logical or physical
  • 00:09:37
    one of the things that's great about
  • 00:09:38
    cassandra is that you can run it with a
  • 00:09:40
    tool like spark
  • 00:09:41
    and if you're doing that you may want to
  • 00:09:43
    have one data center which is your oltp
  • 00:09:46
    you're serving your fast reads to your
  • 00:09:48
    application and then one data center
  • 00:09:50
    virtually that's serving your olap
  • 00:09:53
    queries
  • 00:09:53
    and then doing that you can make sure
  • 00:09:55
    that your olap queries don't impact your
  • 00:09:57
    oltp stuff
  • 00:10:04
    [Music]
  • 00:10:07
    you
Tags
  • Apache Cassandra
  • distribuerad databas
  • hög tillgänglighet
  • linjär skalbarhet
  • peer-to-peer
  • replikering
  • konsistensnivå
  • CAP-teoremet
  • datacenter
  • datahantering