Vad är Apache Cassandra?

Apache Cassandra är en snabb, distribuerad databas designad för hög tillgänglighet och linjär skalbarhet.

Vilka är fördelarna med Cassandra jämfört med en relationsdatabas?

Cassandra har ingen enskild felpunkt och kan hantera noder utan ledare, vilket gör att den kan hantera hög tillgänglighet och motstånd mot fel, till skillnad från relationsdatabaser.

Hur fungerar datareplikering i Cassandra?

Data replikeras till flera servrar så att alla noder kan hantera läs- och skrivförfrågningar, vilket bidrar till hög tillgänglighet.

Vad betyder "consistency level" i Cassandra?

Det betyder hur många repliker som måste svara innan en läs- eller skrivbegäran anses vara framgångsrik.

Vad är en primär nyckel i Cassandra?

En primär nyckel inkluderar en partitionsnyckel som används vid datainmatning för att bestämma vilken nod data ska gå till i klustret.

Vad innebär replication factor i Cassandra?

Replication factor anger hur många kopior av varje del av data som ska finnas i klustret.

Hur hanterar Cassandra nätverkspartitioner enligt CAP-teoremet?

Cassandra prioriterar hög tillgänglighet över strikt konsistens under nätverkspartitioner.

Vad är "quorum" i Cassandra-konsistensnivåer?

Quorum innebär en majoritet av repliker, exempelvis 2 av 3, måste svara för en åtgärd ska anses framgångsrik.

Hur kan Cassandra användas med flera datacenter?

Cassandra kan asynkront replikera data mellan flera datacenter, vilket ökar tillgängligheten även om ett helt datacenter går ner.

Varför är Apache Cassandra effektiv när det kommer till skalbarhet?

Den kan köras på billighetsserverhårdvara och enkelt hantera kluster av olika storlekar utan centraliserad kontroll.

DS101.2 Cassandra Overview

00:10:08

https://www.youtube.com/watch?v=E-xBog9nWTA

Sintesi

TLDRApache Cassandra är en distribuerad databas som är konstruerad för att garantera hög tillgänglighet och linjär skalbarhet. Den är utformad för att inte ha några enskilda felpunkter tack vare sin peer-to-peer-struktur där alla noder är likvärdiga och kan hantera både läs- och skrivbegäranden. Cassandra väljer hög tillgänglighet över strikt konsistens vid nätverksdelningar enligt CAP-teoremet. Data replikerar till flera servrar vilket bidrar till systemets robusthet mot fel. Systemet ger utvecklare möjlighet att konfigurera replikationsfaktorer och konsistensnivåer efter behov, vilket gör det särskilt fördelaktigt för miljöer där hög tillgänglighet är viktigare än strikt datakonsistens. Cassandra stödjer asynkron replikering över flera datacenter, vilket möjliggör datahantering på global nivå. Det fungerar bra med en mängd billiga hårdvaror och kan enkelt hantera kluster oavsett deras storlek.

Punti di forza

📌 Apache Cassandra är en distribuerad databas med hög tillgänglighet och skalbarhet.
🔄 Peer-to-peer-strukturen eliminerar enskilda felpunkter.
🌓 Data replikeras för att säkerställa motståndskraft mot fel.
⚖️ Cassandra prioriterar tillgänglighet över strikt konsistens enligt CAP-teoremet.
🛠️ Utvecklare kan anpassa replikationsfaktorer och konsistensnivåer.
🌐 Stödjer global asynkron replikering över flera datacenter.
💡 Effektiv för kostnadseffektivt klustermanagement.
⏩ Låg latens och förutsägbar prestanda vid skalning.
🔑 Primärnycklar och partitionsnycklar hanterar datadistribution.
💽 Kan köras på billig serverhårdvara utan centrala kontrollöverhead.

Linea temporale

00:00:00 - 00:05:00
Apache Cassandra är en snabb distribuerad databas designad för hög tillgänglighet och linjär skalbarhet. Cassandra garanterar ett SLA och mycket låg latens, vilket innebär att prestandan förblir förutsägbar när klustret växer från fem till tusen noder utan några ensamstående felpunkter. Den använder peer-to-peer-teknik, vilket innebär att alla noder är jämlika utan behov av master-slave-konfigurationer. Cassandra kan köra på billiga hårdvaror och hanteras enkelt, vilket gör det möjligt för ett litet team att hantera stora kluster effektivt. Det är dock inte en direkt ersättning för en relationsdatabas och kräver design enligt dess datamodelleringsregler.
00:05:00 - 00:10:08
Cassandra använder alltid asynkron replikering med en replikeringsfaktor som ofta är tre, vilket betyder att tre kopior av varje databit finns i klustret. Om en nod är nere, använder Cassandra 'hinted handoffs' för att återskapa saknade skrivningar när noden återansluter till klustret. Utvecklare kan ställa in konsistensnivå per läs- eller skrivbegäran, något som påverkar hastighet och tillgänglighet beroende på valt värde. Vanliga konsistensnivåer inkluderar 'Consistency Level One' och 'Quorum'. Cassandra tillhandahåller också enkel konfiguration för flera datacenter och replikering över dessa, vilket säkerställer hög tillgänglighet och möjliggör logiska eller fysiska datacenter. Det är enkelt att integrera Cassandra med verktyg som Spark för att hantera olika typer av arbetsbelastningar.

Mappa mentale

Video Domande e Risposte

Vad är Apache Cassandra?
Apache Cassandra är en snabb, distribuerad databas designad för hög tillgänglighet och linjär skalbarhet.
Vilka är fördelarna med Cassandra jämfört med en relationsdatabas?
Cassandra har ingen enskild felpunkt och kan hantera noder utan ledare, vilket gör att den kan hantera hög tillgänglighet och motstånd mot fel, till skillnad från relationsdatabaser.
Hur fungerar datareplikering i Cassandra?
Data replikeras till flera servrar så att alla noder kan hantera läs- och skrivförfrågningar, vilket bidrar till hög tillgänglighet.
Vad betyder "consistency level" i Cassandra?
Det betyder hur många repliker som måste svara innan en läs- eller skrivbegäran anses vara framgångsrik.
Vad är en primär nyckel i Cassandra?
En primär nyckel inkluderar en partitionsnyckel som används vid datainmatning för att bestämma vilken nod data ska gå till i klustret.
Vad innebär replication factor i Cassandra?
Replication factor anger hur många kopior av varje del av data som ska finnas i klustret.
Hur hanterar Cassandra nätverkspartitioner enligt CAP-teoremet?
Cassandra prioriterar hög tillgänglighet över strikt konsistens under nätverkspartitioner.
Vad är "quorum" i Cassandra-konsistensnivåer?
Quorum innebär en majoritet av repliker, exempelvis 2 av 3, måste svara för en åtgärd ska anses framgångsrik.
Hur kan Cassandra användas med flera datacenter?
Cassandra kan asynkront replikera data mellan flera datacenter, vilket ökar tillgängligheten även om ett helt datacenter går ner.
Varför är Apache Cassandra effektiv när det kommer till skalbarhet?
Den kan köras på billighetsserverhårdvara och enkelt hantera kluster av olika storlekar utan centraliserad kontroll.

Visualizza altre sintesi video

Ottenete l'accesso immediato ai riassunti gratuiti dei video di YouTube grazie all'intelligenza artificiale!

Sottotitoli

Scorrimento automatico:

00:00:00
[Music]
00:00:08
all right
00:00:09
that brings us to what is apache
00:00:11
cassandra apache cassandra is a fast
00:00:13
distributed database
00:00:15
built for high availability and linear
00:00:16
scalability we want to be able to get
00:00:18
predictable performance
00:00:19
out of our database that means that we
00:00:21
can guarantee sla
00:00:23
a very very low latency we know that as
00:00:26
our cluster scales up
00:00:27
we're going to have the same performance
00:00:28
whether it's five nodes or ten nodes or
00:00:30
a thousand nodes
00:00:31
we know that we're not going to have any
00:00:34
single points of failure one of the
00:00:35
great things about cassandra
00:00:37
is that it's a peer-to-peer technology
00:00:39
so there's no master slave there's no
00:00:41
failover there's no leader election
00:00:42
there's none of this funny business that
00:00:44
we really have to worry about anymore
00:00:46
out of the box with open source we can
00:00:48
do multi-dc
00:00:49
when we're talking high availability we
00:00:51
want to be able to withstand failure of
00:00:52
an
00:00:53
entire data center cassandra can do that
00:00:55
you absolutely are not going to get that
00:00:57
out of a relational database we also
00:00:59
want to deploy everything commodity
00:01:00
hardware we talked about how expensive
00:01:02
it is to scale things up vertically
00:01:04
with cassandra you're going to put it on
00:01:05
cheap hardware and you're just going to
00:01:07
use a whole bunch of it
00:01:08
it's really really easy to manage and
00:01:10
speaking of management it's
00:01:12
extremely easy to manage operationally
00:01:14
you can take the same three-man team and
00:01:16
have them
00:01:17
manage a three-node cluster or 30-node
00:01:19
cluster
00:01:20
or a 100 node cluster i have put this
00:01:22
into production with that size
00:01:24
team and it absolutely works one thing
00:01:26
to keep in mind
00:01:27
it is not a drop-in replacement for a
00:01:29
relational database
00:01:30
so you're not going to take the same
00:01:31
data model and just throw it in
00:01:32
cassandra and hope it works
00:01:34
you are going to have to design your
00:01:35
application around cassandra's data
00:01:37
modeling rules but the net result is an
00:01:39
application that will never go down
00:01:41
if you want to think about cassandra
00:01:42
conceptually you can think about it as a
00:01:44
giant hash ring
00:01:45
where all nodes in the cluster are equal
00:01:47
so when i say nodes i basically mean
00:01:48
virtual machines or machines could
00:01:50
actually be physical computers
00:01:52
all participating in a cluster all equal
00:01:54
and each node owns a
00:01:56
range of hashes so like a bucket of
00:01:58
hashes and when you define a data model
00:02:00
in cassandra and
00:02:02
we won't be talking too much about cql
00:02:04
in this video but when you define a data
00:02:05
model and cassandra when you create a
00:02:07
table one of the things that you specify
00:02:09
is a primary key and part of that
00:02:11
primary key is something called the
00:02:12
partition key
00:02:13
and the partition key is what's actually
00:02:16
used when you insert data intake sander
00:02:18
the value of that
00:02:19
partition key is run through a
00:02:20
consistent hashing function
00:02:22
and depending on the output we can
00:02:23
figure out which bucket or which range
00:02:25
of hashes
00:02:26
that value fits into and thus which node
00:02:29
we need to go talk to
00:02:30
to actually distribute the data around
00:02:33
the cluster
00:02:34
the other cool thing about cassandra is
00:02:35
that data is replicated
00:02:37
to multiple servers and that all of
00:02:39
those servers all of them are equal so
00:02:41
like john mentioned
00:02:42
when we were just talking on the
00:02:43
previous slide there's none of this
00:02:45
master slave there's no zookeeper
00:02:46
there's no config servers
00:02:48
all nodes are equal any node in the
00:02:50
cluster can service any given
00:02:52
read or write request for the cluster
00:02:54
okay let's talk about trade-offs one of
00:02:56
the things that's really important to
00:02:57
understand when looking at a database
00:02:59
is how the cap theorem works so the cap
00:03:01
theorem says that during a network
00:03:02
partition which basically means when
00:03:04
computers can't talk to each other
00:03:06
either between data centers or on a
00:03:07
single network that you can either
00:03:09
choose consistency
00:03:10
or actually you can't get consistency or
00:03:12
you can get high availability
00:03:14
so really what happens here is if two
00:03:16
machines can't talk and you do a right
00:03:18
to them and you have to be completely
00:03:19
consistent
00:03:20
they can't talk to each other the system
00:03:22
is going to appear as if it's down
00:03:24
so if we give up consistency that means
00:03:26
we can be highly available
00:03:27
so that's what cassandra chooses it
00:03:29
chooses to be highly available
00:03:31
in a network partition as opposed to
00:03:33
being down
00:03:34
in a lot of applications this is
00:03:35
definitely way better than downtime
00:03:38
another thing that we need to know is
00:03:41
from data center to data center let's
00:03:42
say we were to take
00:03:43
three data centers around the world
00:03:44
maybe one in the u.s one in europe
00:03:47
one in asia it's completely impractical
00:03:49
to try and be consistent across data
00:03:51
centers
00:03:52
we want to asynchronously replicate our
00:03:54
data from one dc to another
00:03:56
because it just takes way too long for
00:03:58
data to travel from the u.s to asia
00:04:00
we're limited by the speed of light here
00:04:01
this is something that we are never
00:04:03
going to get around
00:04:04
it's just not going to happen that's why
00:04:05
we choose availability and that's how
00:04:08
consistency is affected now we want to
00:04:09
talk about some of the dials that
00:04:11
cassandra puts
00:04:12
into your hands as a developer to kind
00:04:14
of control this idea of
00:04:16
fault tolerance so john talked about the
00:04:18
cap theorem earlier this idea of
00:04:20
being maybe more consistent or being
00:04:22
more available it's kind of a sliding
00:04:23
scale and cassandra doesn't impose
00:04:25
one model on you you get a couple of
00:04:27
dials that you get to turn
00:04:29
to configure this like this idea of
00:04:30
being more consistent or more available
00:04:32
so first one we want to talk about is
00:04:34
replication so replication
00:04:36
usually this is called replication
00:04:37
factor it's abbreviated as rf you'll see
00:04:39
that a lot
00:04:40
when you're looking at cassandra
00:04:41
documentation and whatnot and a very
00:04:43
typical
00:04:44
replication factor for people running in
00:04:46
production is a replication factor of
00:04:47
three
00:04:48
and essentially all this means is how
00:04:50
many copies of each piece of data should
00:04:52
there be
00:04:53
in your cluster so when i do a write to
00:04:54
cassandra you can see in this example on
00:04:56
the slide here client's writing a
00:04:58
the a node gets a copy the b node gets a
00:05:00
copy and the c node there also gets a
00:05:01
copy so three copies total for an rf of
00:05:03
three
00:05:04
data is always replicated in cassandra
00:05:06
so we're going to talk on the next slide
00:05:08
about consistency level
00:05:09
data is always replicated in cassandra
00:05:11
you set this replication factor when you
00:05:13
configure a key space
00:05:14
which in cassandra is essentially a
00:05:16
collection of tables it's very similar
00:05:18
to
00:05:18
a schema in oracle or a database in
00:05:20
mysql or microsoft sql server this
00:05:23
replication
00:05:24
happens asynchronously and if a machine
00:05:26
is down while
00:05:27
while this replication is supposed to go
00:05:28
on whatever node you happen to be
00:05:30
talking to
00:05:31
is going to save what's called a hint
00:05:33
and cassandra uses something called
00:05:34
hinted handoffs
00:05:35
to be able to replay when that node
00:05:37
comes back up
00:05:38
and rejoins the cluster to be able to
00:05:40
replay all of the writes
00:05:42
that that node that was down missed the
00:05:44
other dial that cassandra gives you as a
00:05:46
developer is something called
00:05:47
consistency level
00:05:48
so you get to set this on any given read
00:05:50
or write request that you do
00:05:52
as a developer from your application
00:05:54
talking to cassandra
00:05:56
so i'm going to show you an example of
00:05:57
two of the most popular consistency
00:05:59
levels with the hope that that'll kind
00:06:00
of illustrate
00:06:01
exactly what consistency level means but
00:06:03
basically a consistency level means
00:06:05
how many replicas do i need to hear when
00:06:07
i do a read or write
00:06:09
before that read or that write is
00:06:10
considered successful so if i'm doing a
00:06:12
read
00:06:12
how many replicas do i need to hear from
00:06:14
before cassandra gives
00:06:15
the data back to the client or if i'm
00:06:18
doing a write
00:06:19
how many replicas need to say yep we got
00:06:21
your data we've written it to disk
00:06:23
before cassandra replies to the client
00:06:25
and says yep got your data
00:06:27
so the two most popular consistency
00:06:29
levels are consistency level of one
00:06:31
which like the name sort of implies just
00:06:33
means one replica so
00:06:35
you can see that first example there
00:06:36
client is writing a to the cluster
00:06:39
the a node gets its copy and since we're
00:06:41
writing at consistency level of one we
00:06:42
can acknowledge right back to the client
00:06:44
immediately
00:06:44
yep we got the data the dashed lines on
00:06:46
that diagram there are there
00:06:48
to indicate to you that just because you
00:06:49
write with a consistency level of one
00:06:51
doesn't mean that cassandra is not going
00:06:53
to honor your replication factor still
00:06:55
now the other most popular consistency
00:06:57
level that people use a lot
00:06:58
is quorum so quorum if you've never
00:07:00
heard the term before essentially
00:07:02
means a majority of replicas so in the
00:07:04
case of replication factor of three
00:07:06
uh this is two out of three uh for 51
00:07:09
or greater so in this example again we
00:07:12
got a client writing a
00:07:13
again and you can see the anode gets a
00:07:15
copy the b node gets a copy and responds
00:07:17
and at that point once we've got two out
00:07:19
of the three replicas that have
00:07:20
acknowledged we can acknowledge back to
00:07:22
the client
00:07:22
yes we've got your data now again dashed
00:07:25
line indicates
00:07:26
that uh you know we're still going to
00:07:27
honor your replication factor we're just
00:07:28
not waiting on that extra node to reply
00:07:31
before we acknowledge back to the client
00:07:33
you might say well why would i pick one
00:07:34
consistency level versus another you
00:07:36
know what kind of
00:07:37
impact is it going to have well one
00:07:38
pretty obvious one if you if you think
00:07:40
about it
00:07:41
is how fast you can read and write data
00:07:43
is definitely going to be impacted by
00:07:45
what consistency level you use so if i'm
00:07:47
using a lower consistency level
00:07:49
like say one if i'm only waiting on a
00:07:51
single server i'm going to be able to
00:07:52
read and write data really really
00:07:54
quickly whereas if i'm using a higher
00:07:55
consistency level
00:07:56
gonna be much slower to read and write
00:07:59
data
00:07:59
now the other thing that you kind of
00:08:01
have to keep in mind with consistency
00:08:02
level is this is also going to impact
00:08:04
your availability
00:08:05
so john talked about the cap theorem and
00:08:07
talked about c versus a
00:08:08
and how you know it's kind of a sliding
00:08:10
scale well if i choose a higher
00:08:12
consistency level
00:08:14
where i have to hear from more nodes
00:08:15
where more nodes have to be online to be
00:08:17
able to acknowledge reads and writes
00:08:19
then i'm going to be less available i'm
00:08:20
going to be less tolerant
00:08:22
to nodes going down whereas if i choose
00:08:24
a lower consistency level like
00:08:26
1 i'm going to be much more highly
00:08:28
available i'm going to be able to
00:08:29
withstand
00:08:30
in this example here with an rf3 i'm
00:08:32
going to be able to withstand two nodes
00:08:33
going down
00:08:34
and still be able to do reads and writes
00:08:36
in my cluster
00:08:37
and cassandra lets you pick this for
00:08:38
every query you do so it's not going to
00:08:40
impose one model
00:08:41
you as a developer get to choose which
00:08:42
consistency level is appropriate for
00:08:44
which parts of your application
00:08:45
one of the great things about cassandra
00:08:47
because we get asynchronous replication
00:08:49
is that it's really easy to do multiple
00:08:51
data centers so when we do a write to a
00:08:53
single data center we can specify our
00:08:55
consistency level
00:08:56
maybe we say one or quorum if we're
00:08:58
doing multiple data centers we're going
00:08:59
to say local one or local quorum
00:09:01
now when that happens you get your write
00:09:03
and it happens in your local data center
00:09:05
and then it returns to the client
00:09:07
and then let's say you had five data
00:09:08
centers the information that you wrote
00:09:10
to your first data center is going to be
00:09:12
asynchronously replicated
00:09:13
to the other data centers around the
00:09:15
world this is very very powerful
00:09:17
this is how you can get super super high
00:09:19
availability even when an entire data
00:09:21
center goes down
00:09:22
so you can specify the replication
00:09:24
factor per key space
00:09:26
so you can have one key space that has
00:09:28
five replicas one that has three one
00:09:30
that has one
00:09:31
it's completely up to you and completely
00:09:33
configurable
00:09:34
it's important to understand that a data
00:09:36
center can be logical or physical
00:09:37
one of the things that's great about
00:09:38
cassandra is that you can run it with a
00:09:40
tool like spark
00:09:41
and if you're doing that you may want to
00:09:43
have one data center which is your oltp
00:09:46
you're serving your fast reads to your
00:09:48
application and then one data center
00:09:50
virtually that's serving your olap
00:09:53
queries
00:09:53
and then doing that you can make sure
00:09:55
that your olap queries don't impact your
00:09:57
oltp stuff
00:10:04
[Music]
00:10:07
you

Tag

Apache Cassandra
distribuerad databas
hög tillgänglighet
linjär skalbarhet
peer-to-peer
replikering
konsistensnivå
CAP-teoremet
datacenter
datahantering