Course Intro | DS101: Introduction to Apache Cassandra™

00:11:41
https://www.youtube.com/watch?v=fXk9e00FKvI

Summary

TL;DR: In this video, John Haddad and Luke Tillman introduce Apache Cassandra and discuss the challenges of using traditional relational databases for big data. They explore how relational databases struggle to scale due to expensive hardware requirements, sharding complexity, and replication lag that compromises consistency. They highlight Cassandra's architecture, which automates data distribution and provides robust fault tolerance without the need for manual sharding or a master-slave setup. The presenters explain the differences between open-source Apache Cassandra and DataStax Enterprise, outlining how each distribution fits different production needs. They emphasize Cassandra's cost-effectiveness compared to vertically scaling traditional databases, advocating for commodity hardware to achieve elasticity. The video serves as an educational guide for developers looking to understand and implement Cassandra for high-availability, large-scale data processing, paving the way for more scalable and resilient applications.

Takeaways

  • 🛠️ Cassandra automates data sharding and distribution, avoiding manual intervention.
  • 🔄 Relational databases face limitations in scaling and maintaining consistency at large scales.
  • 💡 Master-slave replication models introduce complexities and potential downtime issues.
  • 🏷️ DataStax Enterprise offers enhanced features over Apache Cassandra for production use.
  • 💰 Scaling relational databases vertically is costly and inefficient.
  • 📊 Denormalization of data is often necessary to improve performance in relational databases.
  • 🕒 Replication lag can lead to outdated data being read from slaves in relational models.
  • 🌎 Cassandra handles data distribution across multiple nodes for resilience and availability.
  • 💡 Sharded relational databases complicate schema management and data movement (see the sketch after this list).
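
To make the sharding takeaways concrete, here is a minimal toy sketch of manual sharding (the shard count, names, and data are invented for illustration and are not from the video). Routing by the shard key touches exactly one shard, while querying by any other attribute degenerates into a scatter-gather across all of them:

```python
# Toy model of manual sharding: users are spread across N shards by a
# hash of user_id (the shard key). Each dict stands in for one database.
N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]

def shard_for(user_id):
    # Pick a shard by hashing the shard key.
    return hash(user_id) % N_SHARDS

def save_user(user_id, name, state):
    shards[shard_for(user_id)][user_id] = {"name": name, "state": state}

def get_user(user_id):
    # Lookup by shard key: touches exactly one shard.
    return shards[shard_for(user_id)].get(user_id)

def users_in_state(state):
    # Lookup by a non-shard key: every shard must be queried and the
    # results merged -- with 100 shards this becomes 100 queries.
    results = []
    for shard in shards:
        results.extend(u for u in shard.values() if u["state"] == state)
    return results

save_user("u1", "Alice", "MA")
save_user("u2", "Bob", "CA")
print(users_in_state("MA"))  # scatter-gather: hits all 4 shards
```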

Timeline

  • 00:00:00 - 00:05:00

    The video introduces John Haddad and Luke Tillman, Technical Evangelists at DataStax, discussing Apache Cassandra. They start by examining relational databases and the challenges of scaling them for high availability, and preview the benefits Cassandra offers, such as high availability and the fault-tolerance dials it exposes to developers. They describe small data, such as data in text files analyzed with one-off scripts in Python or Ruby, then move to medium data, which fits on a single machine and is typically managed by a relational database with ACID guarantees. Challenges arise when scaling these databases vertically, which quickly becomes expensive. With big data, inconsistency and replication lag become prominent, forcing applications to be substantially recoded to cope.

  • 00:05:00 - 00:11:41

    The video delves into the complications of scaling relational databases further through sharding, which fragments data and sacrifices traditional benefits such as OLAP analytics: cross-shard joins and aggregations become impossible. Queries against secondary indexes scale poorly because they must hit every shard, so data is denormalized into multiple copies to keep reads on a single shard, which in turn complicates schema management across shards. Master-slave replication for high availability brings its own problems: failover, whether manual or automatic, implies a window of downtime and complex processes to detect failures. The speakers close by sketching a "dream database" for big data: give up strict consistency, push sharding and rebalancing into the cluster, keep the architecture simple with no master-slave roles, scale out on commodity hardware, and favor data locality so that queries hit a single machine.
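
The write-time denormalization described here is the same pattern Cassandra's query-driven data modeling leans into: one table per query, written at insert time. A brief sketch using the DataStax Python driver (pip install cassandra-driver); the contact point, keyspace, and table names are illustrative assumptions, not from the video:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # illustrative contact point
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")

# Two copies of the same users, each table laid out for one query.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_id (
        user_id text PRIMARY KEY, name text, state text)
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_state (
        state text, user_id text, name text,
        PRIMARY KEY (state, user_id))
""")

def save_user(user_id, name, state):
    # Denormalize at write time: one insert per query-specific table.
    session.execute(
        "INSERT INTO users_by_id (user_id, name, state) VALUES (%s, %s, %s)",
        (user_id, name, state))
    session.execute(
        "INSERT INTO users_by_state (state, user_id, name) VALUES (%s, %s, %s)",
        (state, user_id, name))

save_user("u1", "Alice", "MA")
# Read time is a cheap single-partition query with no joins:
rows = session.execute(
    "SELECT user_id, name FROM users_by_state WHERE state = %s", ("MA",))
```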

Video Q&A

  • What are some challenges with scaling relational databases?

    Scaling relational databases can be difficult due to costs, the complexity of sharding and replication, and loss of consistency.

  • Why are ACID guarantees not always practical in big data scenarios?

    ACID guarantees like atomicity and consistency are hard to maintain in large-scale distributed systems, leading to the need for alternative database solutions.

  • What is replication lag?

    Replication lag occurs when data changes on the master database take time to propagate to the slave databases, resulting in potential inconsistencies.
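
A toy simulation of that stale read (entirely illustrative; the delay, key, and values are invented):

```python
import threading, time

# Toy async master/slave replication: writes land on the master at once;
# a background thread copies them to the slave after a delay (the lag).
master, slave = {}, {}
REPLICATION_LAG = 0.5  # seconds; purely illustrative

def write(key, value):
    master[key] = value
    def replicate():
        time.sleep(REPLICATION_LAG)  # the write has already returned
        slave[key] = value
    threading.Thread(target=replicate).start()

write("balance", 100)
print(slave.get("balance"))  # None -- the read beat the replication
time.sleep(1)
print(slave.get("balance"))  # 100 -- the slave caught up eventually
```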

  • How does Cassandra handle high availability differently?

    Cassandra disperses data across multiple nodes, avoids single points of failure, and automatically handles data replication and distribution.
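
A sketch of what this can look like from the client side, using the DataStax Python driver; the node addresses, keyspace, and users table are hypothetical. Replication is declared once per keyspace, and the per-query consistency level is one of the fault-tolerance "dials" the talk mentions:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Any node can serve as a contact point -- there is no master to single out.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Replication factor 3: every row is automatically stored on three nodes,
# so losing any one node does not make the data unavailable.
# (NetworkTopologyStrategy would additionally spread replicas across DCs.)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Tunable consistency: require a majority (2 of 3) of replicas to
# acknowledge this read, trading a little latency for fresher data.
query = SimpleStatement(
    "SELECT * FROM app.users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
rows = session.execute(query, ("u1",))
```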

  • Why might developers choose DataStax Enterprise over Apache Cassandra?

    DataStax Enterprise offers additional features, optimizations, and support that might be beneficial for production environments.

  • Why is vertical scaling expensive for relational databases?

    Vertical scaling involves purchasing more powerful and costly hardware, which can quickly become financially unfeasible.

  • What is the challenge with master-slave architecture for high availability?

    Master-slave architecture can lead to downtime during failover and complexity in managing automatic or manual failover processes.
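
The downtime window follows directly from how failover detection has to work; a deliberately naive sketch (the intervals, thresholds, and function names are made up for illustration):

```python
import time

CHECK_INTERVAL = 5          # seconds between health checks; illustrative
MISSES_BEFORE_FAILOVER = 3  # tolerate brief blips before promoting

def is_master_alive():
    # Stub: a real check would attempt a connection or a trivial query.
    return False

def watchdog(promote_slave):
    misses = 0
    while True:
        if is_master_alive():
            misses = 0
        else:
            misses += 1
            if misses >= MISSES_BEFORE_FAILOVER:
                # Clients have already seen ~10-15s of downtime by now,
                # and nothing is watching this watchdog itself.
                promote_slave()
                return
        time.sleep(CHECK_INTERVAL)

watchdog(lambda: print("promoting slave to master"))
```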

  • How does Cassandra avoid the problem of manual sharding?

    Cassandra automates data distribution across nodes, removing the need for manual sharding and rebalancing.
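
The mechanism behind that answer is consistent hashing: each node owns ranges of a fixed token ring, and rows are routed by the hash of their partition key, so growing the cluster moves only the keys in the affected token ranges. A textbook consistent-hashing toy, not Cassandra's actual implementation (which uses Murmur3 hashing and virtual nodes):

```python
import bisect, hashlib

class Ring:
    """Toy consistent-hash ring: a key goes to the next node clockwise."""
    def __init__(self):
        self.tokens, self.nodes = [], []

    @staticmethod
    def _token(value):
        # Stable hash onto the ring (Cassandra itself uses Murmur3).
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, name):
        t = self._token(name)
        i = bisect.bisect(self.tokens, t)
        self.tokens.insert(i, t)   # keep tokens sorted
        self.nodes.insert(i, name)

    def node_for(self, key):
        i = bisect.bisect(self.tokens, self._token(key)) % len(self.nodes)
        return self.nodes[i]

ring = Ring()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)
print(ring.node_for("user:42"))  # routed by token, no manual shard map
ring.add_node("node-d")          # only keys in one token range move over;
                                 # no application-level reshard required
```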

Subtitles

  • 00:00:00
    [Music]
  • 00:00:07
    hi everyone I'm John Haddad and I am
  • 00:00:10
    Luke Tillman and we're Technical
  • 00:00:11
    Evangelists with DataStax today we're
  • 00:00:13
    going to be talking about an
  • 00:00:14
    introduction to Cassandra during the
  • 00:00:16
    talk today we're going to cover a few
  • 00:00:18
    things so first of all talk a little bit
  • 00:00:20
    about relational databases maybe some of
  • 00:00:22
    the problems that you run into when you
  • 00:00:24
    try to scale relational databases for
  • 00:00:26
    high availability then we'll also cover
  • 00:00:28
    some core concepts of Cassandra how
  • 00:00:29
    Cassandra works internally some of the
  • 00:00:31
    dials that it gives you as a developer
  • 00:00:33
    to kind of control its fault tolerance
  • 00:00:35
    and this notion of high availability and
  • 00:00:37
    then lastly we'll kind of talk about the
  • 00:00:40
    different distributions that you can
  • 00:00:41
    choose of Cassandra so is open source
  • 00:00:43
    Apache Cassandra right for you or should
  • 00:00:46
    you maybe think about using DataStax
  • 00:00:48
    Enterprise distribution DSE when you're
  • 00:00:50
    deploying Cassandra to production first
  • 00:00:52
    thing we're going to talk about is small
  • 00:00:53
    data this is when you've got maybe data
  • 00:00:55
    in text files or in a small SQLite
  • 00:00:57
    database and you're going to be using
  • 00:00:58
    utilities like sed and awk in order to
  • 00:01:01
    analyze this maybe you whip out the
  • 00:01:03
    Python script or something in Ruby but
  • 00:01:05
    in the end it generally means that it's
  • 00:01:07
    a one-off and you don't need any
  • 00:01:09
    concurrency you're not sharing this with
  • 00:01:10
    anybody this is running on your laptop
  • 00:01:13
    and it's okay like maybe the batch takes
  • 00:01:15
    like 30 seconds if it's a really really
  • 00:01:17
    big file but that's kind of what you're
  • 00:01:19
    looking at with small data we've talked
  • 00:01:20
    about small data so let's talk about
  • 00:01:21
    medium data now so probably if you're a
  • 00:01:23
    web application developer or web
  • 00:01:25
    developer of some kind this is probably
  • 00:01:26
    the typical data set that you're working
  • 00:01:28
    with so this is data that fits on a
  • 00:01:29
    single machine you're probably using a
  • 00:01:31
    relational database of some kind maybe
  • 00:01:34
    MySQL or Postgres
  • 00:01:36
    hundreds of concurrent users so you've
  • 00:01:38
    got some concurrency going on now and
  • 00:01:40
    the other kind of nice thing that we get
  • 00:01:41
    when we're working with a relational
  • 00:01:42
    database is these ACID guarantees ACID
  • 00:01:45
    standing for atomicity consistency
  • 00:01:46
    isolation and durability and as a
  • 00:01:49
    developer we've been taught for many
  • 00:01:51
    years how to develop on top of machines
  • 00:01:53
    like this one I go to put data into a
  • 00:01:55
    relational database with these ACID
  • 00:01:57
    guarantees I can kind of feel warm and
  • 00:01:58
    cuddly and I kind of know exactly what's
  • 00:02:00
    going to happen with my data when I put
  • 00:02:02
    it in the other thing to know about is
  • 00:02:03
    the way we try to scale these typically
  • 00:02:05
    first is by scaling vertically so we buy
  • 00:02:07
    more expensive hardware like more memory
  • 00:02:10
    or maybe a bigger processor that kind of
  • 00:02:13
    thing and this can get expensive really
  • 00:02:14
    really quickly the question we'll now ask
  • 00:02:16
    ourselves is can the relational database
  • 00:02:18
    work for big data first thing
  • 00:02:20
    we find when we start to use a
  • 00:02:21
    relational database to try and apply it
  • 00:02:23
    to big data is that ACID is a lie we're
  • 00:02:26
    no longer developing in that cocoon of
  • 00:02:28
    safety which is atomicity consistency
  • 00:02:31
    isolation and durability so let's take
  • 00:02:33
    our scenario over here we have a single
  • 00:02:35
    master and we have a client that's
  • 00:02:37
    talking to that master and we have a
  • 00:02:39
    read heavy workload and what we do is we
  • 00:02:41
    decide to add on replication so one of
  • 00:02:43
    the things that's important to know
  • 00:02:44
    about replication is that the data is
  • 00:02:46
    replicated asynchronously and this is
  • 00:02:48
    known as replication lag and so what
  • 00:02:50
    happens is when the client decides to do
  • 00:02:53
    a write to the master it takes a little
  • 00:02:54
    while to propagate over to the slave and
  • 00:02:56
    if the client decides to do a read to
  • 00:02:59
    the slave before the data has been
  • 00:03:01
    replicated it's going to get old data
  • 00:03:02
    back and what we find is that we have
  • 00:03:05
    lost our consistency completely in the
  • 00:03:08
    scope of our database so that whole
  • 00:03:10
    thing that we built our entire
  • 00:03:12
    application around that certainty that
  • 00:03:14
    we had that we were always going to get
  • 00:03:15
    up-to-date data is completely gone all
  • 00:03:17
    the operations that we do are no longer
  • 00:03:19
    in isolation they're definitely not
  • 00:03:21
    atomic so we have to recode our entire
  • 00:03:23
    app to take advantage or at least to
  • 00:03:26
    accommodate the fact that we have
  • 00:03:27
    replication and there is replication lag
  • 00:03:29
    another thing that we run into when we
  • 00:03:32
    start to deal with performance problems
  • 00:03:33
    on a relational database is something
  • 00:03:35
    like maybe this query that you see on
  • 00:03:37
    the right side of the slide here so
  • 00:03:38
    probably most of us that have worked
  • 00:03:40
    with relational databases see these
  • 00:03:41
    crazy queries lots of joins maybe it's
  • 00:03:44
    been generated by an ORM behind the
  • 00:03:46
    scenes kind of thing in fact at a
  • 00:03:48
    company I used to work for we had this
  • 00:03:50
    problem where every day at 1 o'clock
  • 00:03:51
    we'd have a lot of users try to log on
  • 00:03:53
    to the system and nobody would be able
  • 00:03:55
    to log in and when we actually went and
  • 00:03:57
    looked at what was going on behind the
  • 00:03:59
    scenes it was some crazy queries like
  • 00:04:01
    this with lots of joins essentially
  • 00:04:03
    locking up the database behind the
  • 00:04:05
    scenes so queries like this can cause
  • 00:04:07
    lots of problems it's kind of one of the
  • 00:04:09
    side effects of using third normal form
  • 00:04:11
    to do all of our data modeling and so
  • 00:04:14
    what we try to do when we're dealing
  • 00:04:16
    with queries like this that have
  • 00:04:18
    unpredictable performance or poor
  • 00:04:19
    performance a lot of times is we
  • 00:04:21
    denormalize so we'll create a table and
  • 00:04:23
    that table is built specifically to
  • 00:04:26
    answer that query so at write time what
  • 00:04:29
    we'll do is denormalize at write
  • 00:04:31
    time maybe write that data into that
  • 00:04:32
    table
  • 00:04:33
    specifically so that at read time we can
  • 00:04:35
    do sort of a select star a very simple
  • 00:04:37
    query that doesn't have a lot of
  • 00:04:38
    expensive joins in it and that means
  • 00:04:41
    that now we've probably got duplicate
  • 00:04:43
    copies of our data we kind of violated
  • 00:04:45
    this sort of third normal form that
  • 00:04:47
    we're used to using and that has been
  • 00:04:49
    drilled into our heads as developers for
  • 00:04:51
    a really long time as we continue to
  • 00:04:53
    scale our application the next thing
  • 00:04:54
    that we're going to have to do is
  • 00:04:55
    implement sharding sharding is when you
  • 00:04:57
    take your data and instead of having an
  • 00:04:59
    all-in-one database and one master you
  • 00:05:01
    split it up into multiple databases and
  • 00:05:03
    this works okay for a little while the
  • 00:05:06
    big problems with this is that now your
  • 00:05:07
    data is all over the place and even if
  • 00:05:10
    you were relying on let's say a single
  • 00:05:13
    master to do your OLAP queries you can't
  • 00:05:15
    do it anymore all of your joins all of
  • 00:05:17
    your aggregations all that stuff is
  • 00:05:18
    history you absolutely cannot do it and
  • 00:05:20
    you have to keep building different
  • 00:05:22
    denormalized views of the data that you
  • 00:05:24
    have in order to answer queries
  • 00:05:25
    efficiently we also find that as we
  • 00:05:27
    start to query secondary indexes that
  • 00:05:29
    doesn't scale well either so if we take
  • 00:05:31
    our servers and we say I'm going to
  • 00:05:33
    split my users into four different
  • 00:05:35
    shards and then I haven't sharded users
  • 00:05:39
    on something like state and I want to do
  • 00:05:40
    a query I want to say I want all the
  • 00:05:42
    users in the state of Massachusetts that
  • 00:05:44
    means I have to hit all the shards this
  • 00:05:46
    is extremely non-performant it means if
  • 00:05:48
    there was a hundred shards I have to do
  • 00:05:50
    a hundred queries this does not scale
  • 00:05:52
    well at all
  • 00:05:53
    as a result we end up denormalizing
  • 00:05:55
    again so now we store two copies of our
  • 00:05:57
    users one by user ID and another by
  • 00:05:59
    state whenever we decide to add shards
  • 00:06:02
    to our cluster if we want to double the
  • 00:06:03
    number from four to eight we now have to
  • 00:06:05
    write a tool that will manually move
  • 00:06:07
    everything over this requires a lot of
  • 00:06:08
    coordination between developers and
  • 00:06:10
    operations and is an absolute
  • 00:06:12
    nightmare because there's dozens of edge
  • 00:06:14
    cases that can come up when you're
  • 00:06:16
    moving your data around so you have to
  • 00:06:17
    think about all of them in your
  • 00:06:19
    application and your ops team has to be
  • 00:06:21
    aware of them as well the last thing
  • 00:06:22
    that we find is that managing your
  • 00:06:24
    schema is a huge pain if you have 20
  • 00:06:27
    shards all with the same schema on it
  • 00:06:29
    you have to now come up with tools to
  • 00:06:31
    apply schema changes to all the
  • 00:06:34
    different shards in your system and
  • 00:06:35
    remember it's not just a master that has
  • 00:06:37
    to take it but all of your slaves this
  • 00:06:39
    is a huge burden it is an absolute mess
  • 00:06:41
    at the end of the day you look like this
  • 00:06:43
    guy on the phone like he's just
  • 00:06:44
    absolutely out of his mind earlier John
  • 00:06:46
    mentioned
  • 00:06:47
    using master-slave replication to kind
  • 00:06:49
    of scale out when you have a read heavy
  • 00:06:50
    workload another reason why people will
  • 00:06:53
    introduce this sort of master slave
  • 00:06:54
    architecture with replication is for
  • 00:06:56
    high availability or maybe higher
  • 00:06:58
    availability the thing is when
  • 00:07:01
    you introduce this replication a lot of
  • 00:07:04
    times you have to decide how you're
  • 00:07:05
    going to do the failover so maybe you
  • 00:07:07
    build some sort of automatic process to
  • 00:07:10
    do the failover maybe it's a manual
  • 00:07:12
    process where somebody has to notice
  • 00:07:13
    that the database has gone down and push
  • 00:07:15
    a button to failover to the slave server
  • 00:07:17
    if you build an automatic process of
  • 00:07:19
    some kind then what's going to watch the
  • 00:07:21
    automatic process to make sure it
  • 00:07:22
    doesn't crash and ultimately not end up
  • 00:07:24
    being able to fail over your database
  • 00:07:26
    and in any scenario the problem is
  • 00:07:30
    that you still end up with downtime
  • 00:07:32
    because whether it's a manual failover
  • 00:07:33
    process or an automatic failover process
  • 00:07:36
    that implies that something's going
  • 00:07:37
    to have to detect that the database is
  • 00:07:39
    down and that you're having downtime
  • 00:07:40
    before the failover can kick in the
  • 00:07:42
    other thing is that trying to do this
  • 00:07:44
    with the relational database and do
  • 00:07:45
    multiple data centers is a disaster it's
  • 00:07:47
    really really hard to do and we're not
  • 00:07:49
    just talking about downtime as far as
  • 00:07:51
    unplanned downtime you know we know the
  • 00:07:54
    hardware fails Amazon reboots your
  • 00:07:56
    servers as a service sort of thing that
  • 00:07:59
    kind of stuff happens but then there's
  • 00:08:00
    also planned downtime as well so there's
  • 00:08:02
    things like OS upgrades or upgrades to
  • 00:08:05
    your database server software so you've
  • 00:08:07
    got a plan for those as well so it'd be
  • 00:08:09
    really nice to have some way to have
  • 00:08:10
    higher availability than what the
  • 00:08:12
    master/slave kind of architecture gives
  • 00:08:15
    to us let's summarize the ways the
  • 00:08:18
    relational database fails us handling
  • 00:08:20
    Big Data we know that scaling is an
  • 00:08:22
    absolute pain right we want to put
  • 00:08:24
    bigger and bigger hardware that costs a lot
  • 00:08:26
    of money we want to shard that's an
  • 00:08:28
    absolute mess we've got replication it's
  • 00:08:31
    falling behind we have to keep changing
  • 00:08:32
    our application to account for the
  • 00:08:34
    things that we give up in the relational
  • 00:08:36
    database ACID you know that cocoon of
  • 00:08:38
    safety we're not in that thing anymore
  • 00:08:40
    we are basically treating our relational
  • 00:08:42
    database pretty much like a glorified
  • 00:08:43
    key value store we know that when we
  • 00:08:45
    want to double the size of our cluster
  • 00:08:47
    and we have to reshard that is an
  • 00:08:49
    absolute nightmare to deal with nobody
  • 00:08:51
    wants to do this
  • 00:08:52
    it requires way too much coordination we
  • 00:08:54
    know that we're going to have to
  • 00:08:55
    denormalize all of our cool third normal
  • 00:08:57
    form queries that we'd love to do
  • 00:09:01
    they're things that we were so proud to
  • 00:09:02
    write in the first place they're gone
  • 00:09:04
    now we're just writing our data in a
  • 00:09:05
    bunch of different tables and some of
  • 00:09:08
    the times we're just going to JSON
  • 00:09:09
    serialize it and whatever it's an
  • 00:09:12
    absolute disaster high-availability
  • 00:09:14
    it's not happening right if you want to
  • 00:09:16
    do multi-DC with MySQL or Postgres
  • 00:09:19
    it is absolutely not happening unless
  • 00:09:21
    you go in and have an entire dedicated
  • 00:09:23
    engineering team to try and solve all
  • 00:09:25
    the problems that are going to come up
  • 00:09:26
    along the way so if we were to take some
  • 00:09:29
    of the lessons that we've learned you
  • 00:09:30
    know some of the points of failure that
  • 00:09:31
    John just summarized and apply it to a
  • 00:09:33
    new database if we were trying to build
  • 00:09:35
    something maybe from scratch that would
  • 00:09:37
    kind of be good for handling big data
  • 00:09:39
    what are some of the lessons that we've
  • 00:09:41
    learned from those failures so the first
  • 00:09:43
    thing is that consistency is not
  • 00:09:45
    practical this whole idea of ACID
  • 00:09:47
    consistency probably not practical in a
  • 00:09:48
    big distributed system so we're going to
  • 00:09:50
    give it up we also noticed that manual
  • 00:09:52
    sharding and rebalancing is really hard
  • 00:09:53
    right we had to write a lot of code just
  • 00:09:56
    to move data from place to place and
  • 00:09:57
    handle all these error conditions so
  • 00:09:59
    instead what we're going to do is push
  • 00:10:01
    that responsibility to our cluster our
  • 00:10:03
    dream database can go from 3 to 20
  • 00:10:06
    machines and we as developers don't have
  • 00:10:08
    to worry about it we don't have to write
  • 00:10:10
    any special code to accommodate that
  • 00:10:12
    next thing we know is that every moving
  • 00:10:14
    part that we add to the system so this
  • 00:10:15
    idea of master-slave replication that we
  • 00:10:17
    get in a lot of databases that makes
  • 00:10:19
    things more complex and all this
  • 00:10:20
    failover and processes to watch the
  • 00:10:23
    failover and everything so we want our
  • 00:10:25
    system to be as simple as possible as
  • 00:10:27
    few moving parts as possible none of
  • 00:10:29
    this master/slave architecture sort of
  • 00:10:30
    thing we also find that scaling up is
  • 00:10:32
    really expensive if you want to
  • 00:10:34
    vertically scale your database you're
  • 00:10:36
    going to have to put things like a SAN
  • 00:10:37
    in place you're going to have to get
  • 00:10:38
    bigger and bigger servers every time you
  • 00:10:40
    do it it's more and more expensive it's
  • 00:10:42
    a lot of money and it's really not worth
  • 00:10:45
    it in the end so what we would do in our
  • 00:10:48
    dream database is to only use commodity
  • 00:10:51
    hardware we want to spend five to ten
  • 00:10:53
    thousand dollars per machine instead of
  • 00:10:54
    a hundred thousand dollars and what we
  • 00:10:56
    want to do is buy more machines that way
  • 00:10:58
    when we want to double the capacity of
  • 00:11:00
    our cluster we're not going from a
  • 00:11:01
    hundred thousand dollar machine to a two
  • 00:11:03
    hundred thousand dollar machine we're
  • 00:11:04
    just doubling the number of cheap
  • 00:11:06
    machines that we use well sort of the last
  • 00:11:08
    lesson learned here is that scatter-
  • 00:11:10
    gather queries are not going to be
  • 00:11:12
    any good so we want to have something
  • 00:11:15
    that kind of tries to push us maybe in
  • 00:11:17
    its data modeling or something like that
  • 00:11:19
    towards data locality where queries will
  • 00:11:22
    only hit a single machine so that we're
  • 00:11:24
    efficient we don't introduce a whole
  • 00:11:25
    bunch of extra latency where instead of
  • 00:11:26
    doing a full table scan now
  • 00:11:28
    we're doing a full cluster scan sort of
  • 00:11:30
    thing
Tags
  • Cassandra
  • Relational Databases
  • Data Scaling
  • High Availability
  • Cluster Management
  • Data Distribution
  • Consistency
  • DataStax Enterprise
  • Replication
  • Sharding