Course Intro | DS101: Introduction to Apache Cassandra™

00:11:41
https://www.youtube.com/watch?v=fXk9e00FKvI

Summary

TL;DR: In this video, John Haddad and Luke Tillman introduce Apache Cassandra and discuss the challenges of using traditional relational databases for big data. They explore how relational databases struggle to scale because of expensive hardware requirements, sharding complexity, and replication lag that compromises consistency. They then highlight Cassandra's architecture, which automates data distribution and provides robust fault tolerance without manual sharding or a master-slave setup, and explain how open-source Apache Cassandra and DataStax Enterprise fit different production needs. They emphasize Cassandra's cost-effectiveness compared to vertically scaling traditional databases, advocating commodity hardware to achieve elasticity. The video serves as an educational guide for developers looking to understand and adopt Cassandra for high-availability, large-scale data processing.

Key takeaways

  • 🛠️ Cassandra automates data sharding and distribution, avoiding manual intervention.
  • 🔄 Relational databases face limitations in scaling and maintaining consistency at large scales.
  • 💡 Master-slave replication models introduce complexities and potential downtime issues.
  • 🏷️ DataStax Enterprise offers enhanced features over Apache Cassandra for production use.
  • 💰 Scaling relational databases vertically is costly and inefficient.
  • 📊 Denormalization of data is often necessary to improve performance in relational databases.
  • 🕒 Replication lag can lead to outdated data being read from slaves in relational models.
  • 🌎 Cassandra handles data distribution across multiple nodes for resilience and availability.
  • 💡 Sharded relational databases complicate schema management and data movement.

Timeline

  • 00:00:00 - 00:05:00

    The video introduces John Haddad and Luke Tillman, Technical Evangelists at DataStax, discussing Apache Cassandra. They start by examining relational databases and the challenges of scaling them for high availability, and preview the benefits Cassandra offers, such as high availability and developer-facing controls over fault tolerance. They describe small data, such as text files analyzed with one-off scripts in tools like sed, awk, or Python, then move to medium data, typically managed by a relational database with ACID guarantees on a single machine. Challenges arise when scaling these databases vertically, which quickly becomes expensive. With big data, asynchronous replication introduces replication lag and stale reads, sacrificing consistency and forcing applications to be rewritten to accommodate it.
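
    To make the replication-lag problem concrete, here is a toy Python sketch (a simulation, not a real database client) in which the replica applies writes asynchronously, so a read issued right after a write returns stale data:

      import threading
      import time

      REPLICATION_DELAY = 0.5  # simulated asynchronous replication lag (seconds)

      master, replica = {}, {}

      def write_to_master(key, value):
          master[key] = value
          # Replication is asynchronous: the replica only sees the write later.
          threading.Timer(REPLICATION_DELAY, replica.__setitem__, (key, value)).start()

      def read_from_replica(key):
          return replica.get(key, "<stale: not replicated yet>")

      write_to_master("user:42:email", "new@example.com")
      print(read_from_replica("user:42:email"))   # read beats replication: stale
      time.sleep(REPLICATION_DELAY * 2)
      print(read_from_replica("user:42:email"))   # lag has passed: consistent now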

  • 00:05:00 - 00:11:41

    The video then covers scaling relational databases further through sharding, which fragments the data and forfeits traditional benefits like cross-shard joins and OLAP analytics. Queries on secondary indexes become scatter-gather operations across every shard, so data must be denormalized into multiple copies to keep reads fast, which in turn complicates schema management across shards. High availability via master-slave replication brings its own problems: failover, whether manual or automatic, still implies downtime. The speakers close by sketching a dream database for big data: give up strict consistency, push sharding and rebalancing into the cluster, keep the architecture simple with no master-slave roles, run on commodity hardware, and favor data locality so queries hit a single machine.
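
    As a rough illustration of the scatter-gather problem, here is a hedged Python sketch (the shard layout and field names are invented for the example): a query on a column other than the shard key must visit every shard.

      NUM_SHARDS = 4
      shards = [dict() for _ in range(NUM_SHARDS)]

      def save_user(user_id, state):
          # Placement is decided by the shard key (user_id) only.
          shards[hash(user_id) % NUM_SHARDS][user_id] = {"id": user_id, "state": state}

      def users_in_state(state):
          # Scatter-gather: one query per shard; 100 shards would mean 100 queries.
          return [u for shard in shards for u in shard.values() if u["state"] == state]

      save_user("u1", "MA")
      save_user("u2", "CA")
      print(users_in_state("MA"))   # touches every shard to answer one question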

Frequently asked questions

  • What are some challenges with scaling relational databases?

    Scaling relational databases can be difficult due to costs, the complexity of sharding and replication, and loss of consistency.

  • Why are ACID guarantees not always practical in big data scenarios?

    ACID properties like atomicity and consistency are hard to maintain in large-scale distributed systems, which pushes teams toward alternative database designs.

  • What is replication lag?

    Replication lag occurs when data changes on the master database take time to propagate to the slave databases, resulting in potential inconsistencies.

  • How does Cassandra handle high availability differently?

    Cassandra disperses data across multiple nodes, avoids single points of failure, and automatically handles data replication and distribution.
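
    For example, with the DataStax Python driver (cassandra-driver), replication and consistency are declared rather than hand-built. A minimal sketch, assuming a local cluster whose datacenter is named datacenter1:

      from cassandra.cluster import Cluster
      from cassandra import ConsistencyLevel
      from cassandra.query import SimpleStatement

      cluster = Cluster(["127.0.0.1"])   # contact point; assumes a local node
      session = cluster.connect()

      # Three replicas per row: any single node can fail without data loss.
      session.execute("""
          CREATE KEYSPACE IF NOT EXISTS demo
          WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
      """)

      # One of the "dials" the talk mentions: consistency is tunable per statement.
      query = SimpleStatement(
          "SELECT * FROM demo.users WHERE user_id = %s",
          consistency_level=ConsistencyLevel.QUORUM,   # 2 of 3 replicas must respond
      )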

  • Why might developers choose DataStax Enterprise over Apache Cassandra?

    DataStax Enterprise offers additional features, optimizations, and support that might be beneficial for production environments.

  • Why is vertical scaling expensive for relational databases?

    Vertical scaling involves purchasing more powerful and costly hardware, which can quickly become financially unfeasible.

  • What is the challenge with master-slave architecture for high availability?

    Master-slave architecture can lead to downtime during failover and complexity in managing automatic or manual failover processes.

  • How does Cassandra avoid the problem of manual sharding?

    Cassandra automates data distribution across nodes, removing the need for manual sharding and rebalancing.
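
    A toy sketch of the idea behind that automation: place each key on a hash ring, so ownership is a pure function of the key and the node set (Cassandra's real partitioner uses Murmur3 plus virtual nodes and a replication strategy; this shows only the shape of the technique):

      import bisect
      import hashlib

      def token(key: str) -> int:
          return int(hashlib.md5(key.encode()).hexdigest(), 16)

      class Ring:
          def __init__(self, nodes):
              self.entries = sorted((token(n), n) for n in nodes)
              self.tokens = [t for t, _ in self.entries]

          def node_for(self, key: str) -> str:
              i = bisect.bisect(self.tokens, token(key)) % len(self.entries)
              return self.entries[i][1]

      ring = Ring(["node1", "node2", "node3"])
      print(ring.node_for("user:42"))   # placement decided by the ring, not the app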

Subtitles (en)
  • 00:00:07
    Hi everyone, I'm John Haddad, and I'm Luke Tillman, and we're Technical Evangelists with DataStax. Today we're going to be talking about an introduction to Cassandra. During the talk we're going to cover a few things: first of all, relational databases and some of the problems you run into when you try to scale them for high availability; then some core concepts of Cassandra, how Cassandra works internally, and some of the dials it gives you as a developer to control its fault tolerance and this notion of high availability; and lastly, the different distributions of Cassandra you can choose from, so is open-source Apache Cassandra right for you, or should you maybe think about using the DataStax Enterprise distribution, DSE, when you're deploying Cassandra to production?
  • 00:00:52
    The first thing we're going to talk about is small data. This is when you've got data in text files or in a small SQLite database, and you're going to be using utilities like sed and awk to analyze it. Maybe you whip out a Python script or something in Ruby, but in the end it's generally a one-off: you don't need any concurrency, you're not sharing this with anybody, and it's running on your laptop. It's okay if the batch takes 30 seconds when it's a really, really big file. That's what you're looking at with small data.
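
    A one-off script of the kind described here might look like this minimal Python sketch (the log file name is hypothetical):

      from collections import Counter

      # Tally the first field of each line, e.g. requests per client IP.
      counts = Counter()
      with open("access.log") as f:
          for line in f:
              fields = line.split()
              if fields:
                  counts[fields[0]] += 1

      for ip, n in counts.most_common(5):
          print(ip, n)
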
  • 00:01:20
    We've talked about small data, so let's talk about medium data. If you're a web application developer of some kind, this is probably the typical data set you're working with: data that fits on a single machine. You're probably using a relational database like MySQL or Postgres, serving hundreds of concurrent users, so you've got some concurrency going on now. The other nice thing we get when working with a relational database is the ACID guarantees, ACID standing for atomicity, consistency, isolation, and durability. As developers we've been taught for many years how to build on top of machines like this one; when I put data into a relational database with these ACID guarantees, I can feel warm and cuddly, because I know exactly what's going to happen with my data. The other thing to know is how we typically scale these databases: first by scaling vertically, buying more expensive hardware like more memory or a bigger processor, and this can get expensive really, really quickly. The question we'll now ask ourselves is: can the relational database work for big data?
  • 00:02:20
    The first thing we find when we start to apply a relational database to big data is that ACID is a lie; we're no longer developing in that cocoon of safety, which is atomicity, consistency, isolation, and durability. Take this scenario: we have a single master, a client talking to that master, and a read-heavy workload, so we decide to add replication. An important thing to know about replication is that the data is replicated asynchronously, and this is known as replication lag. When the client writes to the master, it takes a little while to propagate over to the slave, and if the client reads from the slave before the data has been replicated, it gets old data back. What we find is that we have lost our consistency completely in the scope of our database. The certainty we built our entire application around, that we were always going to get up-to-date data, is completely gone; the operations we do are no longer in isolation, and they're definitely not atomic. So we have to recode our entire app to accommodate the fact that we have replication and there is replication lag.
  • 00:03:29
    Another thing we run into when we start dealing with performance problems on a relational database is something like the query on the right side of this slide. Most of us who have worked with relational databases have seen these crazy queries with lots of joins, maybe generated by an ORM behind the scenes. In fact, at a company I used to work for, every day at 1 o'clock a lot of users would try to log on to the system and nobody could log in; when we looked at what was going on behind the scenes, it was crazy queries like this, with lots of joins, essentially locking up the database. Queries like this can cause lots of problems, and it's one of the side effects of using third normal form for all of our data modeling. So when we're dealing with queries like this, with unpredictable or poor performance, a lot of the time we denormalize: we create a table built specifically to answer that query. At write time we denormalize, writing the data into that table, so that at read time we can do a simple select-star query without a lot of expensive joins. That means we've now probably got duplicate copies of our data, and we've violated the third normal form that we're used to and that has been drilled into our heads as developers for a really long time.
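
    A minimal sketch of that write-time denormalization, with invented table names: the write pays for an extra copy so the read becomes a single cheap lookup.

      users_by_id = {}      # canonical copy
      users_by_state = {}   # denormalized copy built to answer one specific query

      def save_user(user):
          users_by_id[user["id"]] = user                              # normal write
          users_by_state.setdefault(user["state"], []).append(user)   # extra copy

      def users_in_state(state):
          return users_by_state.get(state, [])   # read time: one lookup, no joins

      save_user({"id": "u1", "state": "MA"})
      print(users_in_state("MA"))
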
  • 00:04:53
    As we continue to scale our application, the next thing we have to do is implement sharding. Sharding is when you take your data and, instead of having an all-in-one database with one master, you split it up into multiple databases. This works okay for a little while. The big problem is that now your data is all over the place, and even if you were relying on, say, a single master to do your OLAP queries, you can't do it anymore: all of your joins, all of your aggregations, all of that is history. You absolutely cannot do it, and you have to keep building different denormalized views of your data in order to answer queries efficiently.
  • 00:05:27
    We also find that querying secondary indexes doesn't scale well either. Say I split my users into four different shards, and I haven't sharded users on something like state, and I want to run a query for all the users in the state of Massachusetts: that means I have to hit all the shards. This is extremely non-performant; if there were a hundred shards, I'd have to do a hundred queries. This does not scale well at all.
  • 00:05:53
    As a result, we end up denormalizing again: now we store two copies of our users, one by user ID and another by state. And whenever we decide to add shards to our cluster, say to double the number from four to eight, we have to write a tool that will manually move everything over. This requires a lot of coordination between developers and operations, and it's an absolute nightmare, because there are dozens of edge cases that can come up when you're moving your data around; you have to think about all of them in your application, and your ops team has to be aware of them as well.
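
    A back-of-the-envelope Python sketch of why that migration is so painful with naive hash placement (an illustration, not the speakers' example): doubling the shard count relocates roughly half of all keys.

      # With hash(key) % N placement, a key stays put when going from 4 to 8
      # shards only if hash(key) % 8 < 4, i.e. about half the time.
      keys = [f"user:{i}" for i in range(10_000)]
      moved = sum(1 for k in keys if hash(k) % 4 != hash(k) % 8)
      print(f"{moved / len(keys):.0%} of keys must be migrated")
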
  • 00:06:22
    The last thing we find is that managing your schema is a huge pain. If you have 20 shards, all with the same schema, you have to come up with tools to apply schema changes to all the different shards in your system; and remember, it's not just the master that has to take each change, but all of your slaves too. This is a huge burden, an absolute mess, and at the end of the day you look like the guy on the phone in this slide, absolutely out of his mind.
  • 00:06:46
    Earlier, John mentioned using master-slave replication to scale out when you have a read-heavy workload. Another reason people introduce this master-slave architecture with replication is for high availability, or at least higher availability. The thing is, when you introduce this replication, you have to decide how you're going to do the failover. Maybe you build some sort of automatic process to do the failover; maybe it's a manual process, where somebody has to notice that the database has gone down and push a button to fail over to the slave server. But if you build an automatic process, what's going to watch the automatic process to make sure it doesn't crash and end up unable to fail over your database? In any scenario, you still end up with downtime, because whether the failover is manual or automatic, something has to detect that the database is down, and you're having downtime before the failover can kick in. The other thing is that trying to do this with a relational database across multiple data centers is a disaster; it's really, really hard to do. And we're not just talking about unplanned downtime; we know hardware fails, Amazon reboots your servers as a service, that kind of stuff happens. There's also planned downtime, things like OS upgrades or upgrades to your database server software, so you've got to plan for those as well. It would be really nice to have some way to get higher availability than what the master-slave architecture gives us.
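
    A toy Python sketch of the failover dilemma described here (the health check and promotion are stubs): detection itself takes time, and the monitor is one more process that can fail.

      import time

      CHECK_INTERVAL = 5.0   # seconds between health checks

      def is_master_alive() -> bool:
          return False       # stub; a real check would attempt a connection or query

      def promote_slave():
          print("promoting slave to master (hypothetical operation)")

      # Worst case, a failure goes unnoticed for a full CHECK_INTERVAL, and
      # nothing here watches the monitor itself: the objection made in the talk.
      if not is_master_alive():
          promote_slave()
      else:
          time.sleep(CHECK_INTERVAL)
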
  • 00:08:18
    Let's summarize the ways the relational database fails us in handling big data. We know that scaling is an absolute pain: we want to put in bigger and bigger hardware, and that costs a lot of money; we want to shard, and that's an absolute mess; we've got replication, and it's falling behind; and we have to keep changing our application to account for the things we give up in the relational database. ACID, that cocoon of safety, we're not in it anymore; we're basically treating our relational database like a glorified key-value store. We know that when we want to double the size of our cluster and have to re-shard, it's an absolute nightmare to deal with; nobody wants to do it, and it requires way too much coordination. We know we're going to have to denormalize; all those cool third-normal-form queries we loved, the ones we were so proud to write in the first place, are gone. Now we're just writing our data into a bunch of different tables, and sometimes we just JSON-serialize it, and it's an absolute disaster. And high availability? It's not happening. If you want to do multi-DC with MySQL or Postgres, it is absolutely not happening unless you have an entire dedicated engineering team to try to solve all the problems that are going to come up along the way.
  • 00:09:29
    So if we were to take some of the lessons we've learned, some of the points of failure John just summarized, and apply them to a new database, something we might build from scratch that would be good at handling big data, what would those lessons be? The first is that consistency is not practical: this whole idea of ACID consistency is probably not practical in a big distributed system, so we're going to give it up. We also noticed that manual sharding and rebalancing is really hard; we had to write a lot of code just to move data from place to place and handle all the error conditions. So instead, we're going to push that responsibility to the cluster: our dream database can go from 3 to 20 machines, and we as developers don't have to worry about it or write any special code to accommodate it.
  • 00:10:12
    The next thing we know is that every moving part we add to the system, like the master-slave replication we get in a lot of databases, makes things more complex: all this failover, and processes to watch the failover, and so on. So we want our system to be as simple as possible, with as few moving parts as possible, and none of this master-slave architecture.
  • 00:10:32
    We also find that scaling up is really expensive. If you want to vertically scale your database, you're going to have to put things like a SAN in place and buy bigger and bigger servers, and every time you do it, it's more and more expensive; it's a lot of money, and it's really not worth it in the end. So our dream database would only use commodity hardware: we want to spend five or ten thousand dollars per machine instead of a hundred thousand, and we want to buy more machines. That way, when we want to double the capacity of our cluster, we're not going from a hundred-thousand-dollar machine to a two-hundred-thousand-dollar machine; we're just doubling the number of cheap machines we use.
  • 00:11:08
    The last lesson learned here is that scatter-gather queries are not going to be any good. So we want something that pushes us, maybe through its data modeling, towards data locality, where queries only hit a single machine. That keeps us efficient and avoids introducing a whole bunch of extra latency; otherwise, instead of doing a full table scan, we'd be doing a full cluster scan.
Tags
  • Cassandra
  • Relational Databases
  • Data Scaling
  • High Availability
  • Cluster Management
  • Data Distribution
  • Consistency
  • DataStax Enterprise
  • Replication
  • Sharding