00:00:07
Hi everyone, I'm Jon Haddad, and I'm Luke Tillman, and we're Technical Evangelists with DataStax. Today we're going to be giving an introduction to Cassandra. During the talk we're going to cover a few things. First of all, we'll talk a little bit about relational databases and some of the problems you run into when you try to scale a relational database for high availability. Then we'll cover some core concepts of Cassandra: how Cassandra works internally, and some of the dials it gives you as a developer to control its fault tolerance and this notion of high availability. And lastly, we'll talk about the different distributions of Cassandra you can choose from: is open source Apache Cassandra right for you, or should you maybe think about using the DataStax Enterprise distribution (DSE) when you're deploying Cassandra to production?
00:00:52
The first thing we're going to talk about is small data. This is when you've got data in text files, or in a small SQLite database, and you're using utilities like sed and awk to analyze it. Maybe you whip out a Python script, or something in Ruby, but in the end it's generally a one-off: you don't need any concurrency, you're not sharing this with anybody, and it's running on your laptop. And it's okay if the batch takes, say, 30 seconds on a really, really big file. That's what you're looking at with small data.
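As a hypothetical illustration of that kind of one-off script (the file name and columns here are invented, not from the talk), this is the whole "small data" workflow in a few lines:

```python
import csv
from collections import Counter

# A one-off, single-threaded pass over a small local file: no concurrency,
# nobody else is querying it, and 30 seconds is an acceptable runtime.
state_counts = Counter()
with open("users.csv") as f:
    for row in csv.DictReader(f):
        state_counts[row["state"]] += 1

print(state_counts.most_common(5))
```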
00:01:20
We've talked about small data, so let's talk about medium data. If you're a web application developer, or a web developer of some kind, this is probably the typical data set you're working with. This is data that fits on a single machine. You're probably using a relational database of some kind, maybe MySQL or Postgres, with hundreds of concurrent users, so you've got some concurrency going on now. The other nice thing we get when working with a relational database is the ACID guarantees, ACID standing for atomicity, consistency, isolation, and durability. As developers, we've been taught for many years how to develop on top of machines like this: when I go to put data into a relational database with these ACID guarantees, I can feel warm and cuddly, because I know exactly what's going to happen with my data when I put it in.
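As a minimal sketch of what those guarantees buy us, here's a classic transfer wrapped in a transaction, using Python's built-in sqlite3 module (the accounts table and amounts are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])

# Atomicity and isolation in action: both UPDATEs commit together, or
# (if anything raises) neither does -- no half-applied transfer.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")

print(conn.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# [(1, 50), (2, 50)]
```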
00:02:03
The other thing to know about is the way we typically try to scale these, which is first by scaling vertically: we buy more expensive hardware, like more memory or maybe a bigger processor, that kind of thing. And this can get expensive really, really quickly.
00:02:16
The question we'll now ask ourselves is: can the relational database work for big data? The first thing we find when we start to apply a relational database to big data is that ACID is a lie. We're no longer developing in that cocoon of safety that is atomicity, consistency, isolation, and durability.
00:02:33
So let's take this scenario: we have a single master, a client talking to that master, and a read-heavy workload, and we decide to add replication. One of the important things to know about replication is that the data is replicated asynchronously, and the resulting delay is known as replication lag. What happens is that when the client does a write to the master, it takes a little while to propagate over to the slave, and if the client reads from the slave before the data has been replicated, it gets old data back. What we find is that we have completely lost consistency within the scope of our database. That certainty we built our entire application around, that we were always going to get up-to-date data, is completely gone. The operations we do are no longer isolated, and they're definitely not atomic, so we have to recode our entire app to accommodate the fact that we have replication, and that there is replication lag.
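Here's a toy sketch of that stale-read window, modeling asynchronous replication with a timer (the classes, the key, and the half-second lag are all invented for illustration):

```python
import threading
import time

class Replica:
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value
    def read(self, key):
        # May return stale data (or nothing) if replication hasn't caught up.
        return self.data.get(key)

class Master:
    def __init__(self, replica, lag_seconds=0.5):
        self.data = {}
        self.replica = replica
        self.lag = lag_seconds
    def write(self, key, value):
        self.data[key] = value
        # Replication is asynchronous: the replica only sees the write
        # after the replication lag has elapsed.
        threading.Timer(self.lag, self.replica.apply, args=(key, value)).start()

replica = Replica()
master = Master(replica)
master.write("user:42", "new-email@example.com")
print(replica.read("user:42"))   # None -- the read-heavy client gets old data
time.sleep(1)
print(replica.read("user:42"))   # now the write has propagated
```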
00:03:29
Another thing we run into when we start dealing with performance problems on a relational database is a query like the one on the right side of this slide. Most of us who have worked with relational databases have seen these crazy queries with lots of joins, maybe generated by an ORM behind the scenes. In fact, at a company I used to work for, we had this problem where every day at 1 o'clock a lot of users would try to log on to the system and nobody could log in. When we actually went and looked at what was going on behind the scenes, it was crazy queries like this one, with lots of joins, essentially locking up the database. Queries like this can cause lots of problems; it's one of the side effects of using third normal form to do all of our data modeling.
00:04:14
So what we try to do when we're dealing with queries like this, queries with unpredictable or poor performance, is often to denormalize. We'll create a table that is built specifically to answer that query. At write time we denormalize, writing the data into that purpose-built table, so that at read time we can do a simple SELECT, a query that doesn't have a lot of expensive joins in it. That means we've now probably got duplicate copies of our data, and we've violated the third normal form that we're used to, the one that's been drilled into our heads as developers for a really long time.
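Here's a small sketch of that write-time denormalization pattern, again with sqlite3 and a hypothetical schema: the same fact lands in two tables on every write so that the read path never has to join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
-- A second table built specifically to answer "show this user's orders":
CREATE TABLE orders_by_user (user_id INTEGER, order_id INTEGER,
                             user_name TEXT, total REAL);
""")

def place_order(order_id, user_id, user_name, total):
    # Denormalize at write time: write into both the normalized table
    # and the query-specific copy.
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (order_id, user_id, total))
    conn.execute("INSERT INTO orders_by_user VALUES (?, ?, ?, ?)",
                 (user_id, order_id, user_name, total))

place_order(100, 1, "alice", 19.99)

# Read time is now a simple SELECT with no expensive joins.
print(conn.execute("SELECT * FROM orders_by_user WHERE user_id = ?",
                   (1,)).fetchall())
```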
00:04:53
As we continue to scale our application, the next thing we're going to have to do is implement sharding. Sharding is when you take your data and, instead of having it all in one database with one master, you split it up across multiple databases. This works okay for a little while. The big problem is that now your data is all over the place, and even if you were relying on, say, a single master to do your OLAP queries, you can't do that anymore. All of your joins, all of your aggregations, all that stuff is history; you absolutely cannot do it, and you have to keep building different denormalized views of your data in order to answer queries efficiently.
00:05:27
We also find that querying secondary indexes doesn't scale well either. Say I split my users across four different shards, and I haven't sharded the users on something like state, and I want to do a query like "give me all the users in the state of Massachusetts." That means I have to hit all the shards. This is extremely inefficient: if there were a hundred shards, I'd have to do a hundred queries. This does not scale well at all. As a result, we end up denormalizing again, so now we store two copies of our users, one by user ID and another by state.
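A toy sketch of both approaches, using in-memory dicts to stand in for shard databases (all names and the sharding function are invented for illustration):

```python
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a DB
users_by_state = {}                           # the second, denormalized copy

def save_user(user_id, name, state):
    shards[user_id % NUM_SHARDS][user_id] = {"name": name, "state": state}
    users_by_state.setdefault(state, []).append(name)

def users_in_state_scatter_gather(state):
    # Sharded by user_id, so a query on state must hit every shard:
    # with 100 shards that's 100 queries for one question.
    hits = []
    for shard in shards:
        hits += [u["name"] for u in shard.values() if u["state"] == state]
    return hits

def users_in_state_denormalized(state):
    # With the second copy, one lookup replaces N shard queries -- at the
    # cost of writing every user twice and keeping the copies in sync.
    return users_by_state.get(state, [])

save_user(7, "alice", "MA")
save_user(12, "bob", "CA")
print(users_in_state_scatter_gather("MA"), users_in_state_denormalized("MA"))
```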
00:06:02
Whenever we decide to add shards to our cluster, say to double the number from four to eight, we now have to write a tool that manually moves everything over. This requires a lot of coordination between developers and operations, and it's an absolute nightmare, because there are dozens of edge cases that can come up when you're moving your data around. You have to think about all of them in your application, and your ops team has to be aware of them as well.
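A sketch of what that manual rebalancing tool boils down to (toy data structures, invented for illustration), which hints at why it's so fraught: every row gets re-hashed, roughly half of them have to physically move, and the application team owns every edge case.

```python
def rebalance(old_shards, new_count):
    # Doubling from 4 to 8 shards re-homes rows wholesale. In real life this
    # tool also has to survive partial failures, writes arriving mid-move,
    # retries, and so on -- the edge cases mentioned above.
    new_shards = [dict() for _ in range(new_count)]
    for shard in old_shards:
        for user_id, row in shard.items():
            new_shards[user_id % new_count][user_id] = row
    return new_shards

old = [{0: "u0"}, {1: "u1"}, {2: "u2"}, {3: "u3"}]  # toy 4-shard layout
print([list(s) for s in rebalance(old, 8)])
```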
00:06:22
The last thing we find is that managing your schema is a huge pain. If you have 20 shards, all with the same schema, you now have to come up with tools to apply schema changes to all the different shards in your system. And remember, it's not just each master that has to take the change, but all of your slaves too. This is a huge burden; it is an absolute mess. At the end of the day you look like the guy on the phone in this slide, absolutely out of his mind.
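A sketch of the kind of tooling this forces on you, simulating the 20 shards with in-memory SQLite databases (real shards would be separate servers, each with its own slaves that also need the change):

```python
import sqlite3

# Twenty "shards" with an identical schema.
shards = [sqlite3.connect(":memory:") for _ in range(20)]
for conn in shards:
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")

# One logical schema change now means touching every shard, in order,
# without missing any -- and coping with the ones that fail halfway.
for conn in shards:
    conn.execute("ALTER TABLE users ADD COLUMN last_login TIMESTAMP")
```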
00:06:47
Earlier, Jon mentioned using master-slave replication to scale out when you have a read-heavy workload. Another reason people introduce this sort of master-slave architecture with replication is for high availability, or at least higher availability. The thing is, when you introduce this replication, a lot of times you have to decide how you're going to do the failover. Maybe you build some sort of automatic process to do the failover, or maybe it's a manual process where somebody has to notice that the database has gone down and push a button to fail over to the slave server. If you build an automatic process of some kind, then what's going to watch the automatic process to make sure it doesn't crash and end up unable to fail over your database? And in any scenario, the problem is that you still end up with downtime, because whether it's a manual or an automatic failover process, something has to detect that the database is down, and you're having downtime before the failover can kick in.
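A toy sketch of that detection window (the health check, the promotion step, and the databases themselves are all stand-ins invented for illustration):

```python
import time

def is_healthy(db):
    # Stand-in for a real health check (e.g. attempting a connection).
    return db["up"]

def promote(replica):
    # Stand-in for promotion: reconfigure replication, repoint clients, etc.
    replica["role"] = "master"

def watchdog(master, replica, checks=3, interval=1.0):
    # Failover only begins after the failure is *detected*, so every check
    # interval that passes with the master down is user-visible downtime.
    for _ in range(checks):
        if not is_healthy(master):
            promote(replica)
            return "failed over"
        time.sleep(interval)
    return "master healthy"

master = {"up": False}
replica = {"up": True, "role": "slave"}
print(watchdog(master, replica))  # ...and what watches the watchdog?
```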
00:07:42
The other thing is that trying to do this with a relational database across multiple data centers is a disaster; it's really, really hard to do. And we're not just talking about unplanned downtime: we know that hardware fails, Amazon reboots your servers as a service, that kind of stuff happens. But there's also planned downtime, things like OS upgrades or upgrades to your database server software, so you've got to plan for those as well. So it would be really nice to have some way to get higher availability than what the master-slave architecture gives us.
00:08:18
Let's summarize the ways the relational database fails us in handling big data. We know that scaling is an absolute pain: we can buy bigger and bigger hardware that costs a lot of money, or we can shard, which is an absolute mess. We've got replication that keeps falling behind, and we have to keep changing our application to account for the things we give up in the relational database. ACID, that cocoon of safety? We're not in it anymore; we're basically treating our relational database like a glorified key-value store. We know that when we want to double the size of our cluster we have to reshard, and that's an absolute nightmare to deal with; nobody wants to do it, and it requires way too much coordination. We know we're going to have to denormalize: all those cool third normal form queries we loved, the ones we were so proud to write in the first place, are gone. Now we're just writing our data into a bunch of different tables, and sometimes we're just going to JSON-serialize it, whatever; it's an absolute disaster. And high availability? It's not happening. If you want to do multi-DC with MySQL or Postgres, it is absolutely not happening unless you have an entire dedicated engineering team to try to solve all the problems that are going to come up along the way.
00:09:29
So if we were to take some of the lessons we've learned, some of those points of failure that Jon just summarized, and apply them to a new database, if we were trying to build something from scratch that would be good at handling big data, what are some of the lessons we've learned from those failures? The first is that consistency is not practical: this whole idea of ACID consistency is probably not practical in a big distributed system, so we're going to give it up.
00:09:52
We also noticed that manual sharding and rebalancing is really hard: we had to write a lot of code just to move data from place to place and handle all the error conditions. So instead, we're going to push that responsibility to our cluster. Our dream database can go from 3 to 20 machines, and we as developers don't have to worry about it; we don't have to write any special code to accommodate that.
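As a sketch of what that looks like from the application's side, here's the DataStax Python driver for Cassandra (the addresses, keyspace, and table are hypothetical, not from the talk): the client names only a couple of contact points and the driver discovers the rest of the ring, so growing the cluster requires no application changes.

```python
from cassandra.cluster import Cluster

# The application only names a few contact points; the driver discovers the
# rest of the ring from them. Going from 3 to 20 nodes needs no code change.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("my_keyspace")

rows = session.execute("SELECT name FROM users WHERE user_id = %s", (42,))
```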
00:10:12
The next thing we know is that every moving part we add to the system, like the master-slave replication we get in a lot of databases, makes things more complex: all this failover, and processes to watch the failover, and everything. So we want our system to be as simple as possible, with as few moving parts as possible, and none of this master-slave architecture sort of thing.
00:10:32
We also find that scaling up is really expensive. If you want to vertically scale your database, you're going to have to put things like a SAN in place, and get bigger and bigger servers; every time you do it, it's more and more expensive, it's a lot of money, and it's really not worth it in the end. So what we would do in our dream database is use only commodity hardware. We want to spend five or ten thousand dollars per machine instead of a hundred thousand dollars, and what we want to do is buy more machines. That way, when we want to double the capacity of our cluster, we're not going from a hundred-thousand-dollar machine to a two-hundred-thousand-dollar machine; we're just doubling the number of cheap machines we use.
00:11:08
The last lesson learned here is that scatter-gather queries are not going to be any good. So we want something that pushes us, maybe through its data modeling, towards data locality, where queries only hit a single machine, so that we're efficient and don't introduce a whole bunch of extra latency. Otherwise, where we were doing a full table scan before, we're now doing a full cluster scan.
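A hedged sketch of what that data locality looks like in Cassandra's data model, again using the DataStax Python driver with a hypothetical table: because state is the partition key, the query below is routed to the replicas that own that one partition rather than scanning the whole cluster.

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("my_keyspace")

# "state" is the partition key, so all users for a given state live
# together on one set of replicas.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_state (
        state text,
        user_id uuid,
        name text,
        PRIMARY KEY (state, user_id))
""")

# Routed straight to the partition that owns 'MA': one machine's worth of
# work instead of a full cluster scan.
rows = session.execute(
    "SELECT user_id, name FROM users_by_state WHERE state = %s", ("MA",))
```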