00:00:00
It's no secret that I'm a fan of Redis,
00:00:03
or at least I was a fan of Redis until
00:00:06
they went and forked things up. These
00:00:09
days, I'm now pinning pictures of
00:00:10
Valkey on my wall instead. In any
00:00:13
case, I've been using Redis for a good
00:00:16
number of years, and through that time,
00:00:18
I've always been fascinated with how
00:00:20
incredibly fast both Redis and its
00:00:22
forks, such as Valkey, can be. In
00:00:25
fact, by just running a single instance
00:00:27
of Valkey on a bare metal machine, you
00:00:30
can easily exceed 100,000 requests per
00:00:33
second, which is a lot of throughput.
00:00:36
Whilst 100,000 requests per second is
00:00:39
pretty impressive, I wanted to see how
00:00:41
difficult it would be to push this even
00:00:43
further, say 10 times further to 1
00:00:46
million requests per second. So, how
00:00:49
does one go about increasing the number
00:00:51
of requests per second, or RPS, that
00:00:54
Redis can handle to greater than 1
00:00:57
million? Well, as it turns out, there
00:00:59
are a number of different ways to do so,
00:01:02
each with their own pros and cons
00:01:04
depending on the situation that you find
00:01:06
yourself in. The first approach I
00:01:09
decided to take when it came to scaling
00:01:10
Redis, or Valkey in this case, was to just
00:01:13
run it on a bigger machine. This is
00:01:16
known as vertical scaling. And whilst it
00:01:19
can be effective for some software, when
00:01:22
it comes to Redis, it's a little bit
00:01:23
more complicated. To show what I mean, I
00:01:26
decided to deploy a Valkey instance onto
00:01:28
three separate machines. The first being
00:01:31
a Beelink Mini S12, which has a low-powered
00:01:34
4-core CPU, the Intel
00:01:37
N95, which is the least powerful of the
00:01:40
three. The second machine that I
00:01:42
installed Valkey on, I've defined as
00:01:44
the mid machine, which is yet another
00:01:46
Beelink mini PC. This one, the SER6, which
00:01:50
runs an AMD 7735 CPU with eight cores
00:01:55
and 16 threads. So, it's a little bit
00:01:58
more powerful. The final machine in this
00:02:00
testing is the big boy. This is the AMD
00:02:03
Ryzen Threadripper
00:02:04
3970X with a massive 32 cores and 64
00:02:08
threads, as well as boasting 128 gigs of
00:02:11
RAM. To test how much throughput each of
00:02:13
these machines can handle, I'm going to
00:02:15
go ahead and use the following
00:02:17
memtier_benchmark command, which is pretty much
00:02:19
the standard tool when it comes to
00:02:21
benchmarking Redis and its forks.
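For reference, a basic invocation looks roughly like the line below; the exact thread, client, and duration values used on screen aren't captured in the transcript, so treat them as placeholders.

memtier_benchmark -s <valkey-host> -p 6379 --threads 4 --clients 50 --ratio 1:1 --test-time 60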
00:02:24
Additionally, I'm running this tool on
00:02:25
another machine and sending the commands
00:02:27
over the network in order to simulate
00:02:29
real-world usage. And when I run this
00:02:32
command, you can see that the mini PC is
00:02:35
handling a huge amount of requests per
00:02:37
second, around 200,000, which honestly
00:02:41
is pretty impressive. This high number
00:02:44
is possible because I'm running
00:02:45
on my local network and running the
00:02:47
Valkey instance on bare metal, which both
00:02:50
Redis and Valkey handle really well. If
00:02:53
instead I was making use of a couple of
00:02:55
VPS instances, one to host the instance
00:02:57
and one to actually perform the test in
00:03:00
the same data center, you'll see that
00:03:02
the request count drops significantly.
00:03:05
And if I try to go ahead and benchmark
00:03:06
this from my machine to one of these VPS
00:03:08
instances over the internet, then it
00:03:11
drops even further. So that's one thing
00:03:13
to consider if you want to scale Redis:
00:03:16
bare metal is best, but also
00:03:18
colocating or running in the same data
00:03:21
center is really important. If you're
00:03:24
sending commands halfway across the
00:03:25
world, then you're going to have a bad
00:03:27
time. In any case, going back to my
00:03:29
local area network setup, the four-core
00:03:31
machine did pretty well. So, let's go
00:03:34
ahead and see how it works when it comes
00:03:35
to the mid-level machine. As you can see,
00:03:38
we're actually getting around the same
00:03:40
number of operations per second, which
00:03:43
at first may feel a little surprising.
00:03:46
This, however, is actually expected when
00:03:48
it comes to Redis. This is because
00:03:51
Redis and some of its forks, such as
00:03:53
Valkey, only make use of a single thread
00:03:55
when it comes to handling commands.
00:03:58
Meaning that more cores or more CPUs
00:04:00
isn't going to increase performance. In
00:04:03
fact, in some cases, it can actually
00:04:05
hinder it. For example, if I go ahead
00:04:07
and run the benchmark against my
00:04:09
Valkey instance on the 32-core
00:04:11
machine, you can see now we're actually
00:04:13
producing fewer operations, down to around
00:04:16
a quarter of what we were before. This is because
00:04:19
the 32-core Threadripper machine
00:04:21
actually has worse single core
00:04:23
performance than the other two. And
00:04:25
because Redis is so dependent on single
00:04:27
core performance, it's having a
00:04:29
negative impact. Now there are both
00:04:32
forks of Redis and Redis-compatible
00:04:34
solutions that are better able to make
00:04:36
use of multiple cores on a machine with
00:04:38
one such solution being Dragonfly DB, who
00:04:42
are also the sponsors of today's video.
00:04:44
We'll talk a little bit more about
00:04:45
Dragonfly DB later on and how they
00:04:48
provide increased performance on
00:04:50
multi-core machines. However, for the
00:04:52
meantime, we're going to go ahead and
00:04:53
focus on Redis or Valkey and see how we
00:04:56
can get it to perform millions of
00:04:58
requests per second using a single
00:05:00
threaded instance. One approach to doing
00:05:03
so is to make use of something called
00:05:06
pipelining. Pipelining is a technique
00:05:08
for improving performance by issuing
00:05:10
multiple commands at once without
00:05:12
waiting for the response to each
00:05:14
individual command to return. It's very
00:05:16
similar to the concept of batching. To
00:05:19
show the performance improvement of
00:05:20
using pipelining, I'll go ahead and use
00:05:22
the benchmark tool again, this time
00:05:24
passing in the --pipeline flag, setting
00:05:27
it to 2. So, we're sending a pipeline of two commands at once.
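For reference, this is just the same benchmark invocation with the pipeline flag appended, something along these lines (host is a placeholder):

memtier_benchmark -s <valkey-host> --pipeline=2 --test-time 60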
00:05:29
As you
00:05:32
can see, we're now effectively doubling
00:05:33
our throughput without making any
00:05:35
hardware changes to our instance. This
00:05:38
is pretty great, but how far can we
00:05:40
actually push it? Well, if I go ahead
00:05:42
and set a pipeline to be an order of
00:05:44
magnitude higher from 2 to 10, you can
00:05:47
see now we're pushing over 1 million
00:05:49
requests a second, give or take. Pretty
00:05:52
cool. However, we don't just have to
00:05:53
stop here and we can actually push
00:05:55
pipelining a little further. Let's say
00:05:57
we go ahead and set a pipeline of 100.
00:06:00
This time we're now maxing out at about
00:06:03
2.5 million requests a second. Whilst
00:06:06
this is incredibly fast, you'll notice
00:06:08
it's noticeably less than we would expect.
00:06:10
Given that we were hitting 1 million
00:06:12
operations when it came to a pipeline
00:06:13
size of 10, we should expect to see 10
00:06:16
million operations when it came to a
00:06:17
pipeline size of 100. Unfortunately,
00:06:19
however, we're actually hitting a
00:06:21
bottleneck, which is caused by the
00:06:23
available bandwidth on my local network.
00:06:26
Either way, reaching 2.5 million
00:06:28
operations per second is impressive. And
00:06:30
pipelining is an effective way to
00:06:32
achieve this, especially as it's
00:06:34
supported by most Redis clients
00:06:35
already, and it's pretty easy to
00:06:38
perform. In Go, you can achieve this by
00:06:40
using the pipeline method as follows.
00:06:42
Then sending commands to this pipeline
00:06:44
followed by executing it, and all of the
00:06:46
responses will be in the same order you
00:06:48
pass them in.
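For reference, here's a minimal sketch of that pattern using the go-redis client; the transcript doesn't name a specific client library, so treat this as one illustrative option, with placeholder keys and values.

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Queue commands on the pipeline without waiting for individual replies.
	pipe := rdb.Pipeline()
	setCmd := pipe.Set(ctx, "key:1", "value-1", 0)
	getCmd := pipe.Get(ctx, "key:1")

	// Exec sends the whole batch in a single round trip; replies come back
	// in the same order the commands were queued.
	if _, err := pipe.Exec(ctx); err != nil {
		panic(err)
	}
	fmt.Println(setCmd.Val(), getCmd.Val())
}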
00:06:50
Despite being simple to implement, there is unfortunately a
00:06:52
catch. Pipelining isn't always possible
00:06:55
for every use case when it comes to
00:06:57
working with Redis, as it requires you
00:06:59
to have a batch of commands in order to
00:07:01
send up. In some situations, this is
00:07:04
actually going to be the case. For
00:07:06
example, if you want to send multiple
00:07:07
commands at once, such as if you're
00:07:09
enriching a bunch of data and need to
00:07:11
perform a get command for a large number
00:07:13
of keys, you can effectively send all of
00:07:16
these keys up to Redis in a batch of
00:07:18
say 100. However, for most cases when it
00:07:21
comes to Redis, pipelining just doesn't
00:07:23
really make sense as you don't have a
00:07:25
batch of commands that you can send at
00:07:27
once. So, whilst it is a great way to
00:07:29
improve performance, it doesn't work for
00:07:31
every case. Not only this, but there's
00:07:33
also some other limitations when it
00:07:34
comes to Redis that pipelining can't
00:07:36
resolve, specifically when it comes to
00:07:39
resources. As we've seen already, Redis
00:07:41
uses a single thread when it comes to
00:07:43
handling commands, which means even
00:07:46
though it's incredibly fast, there's
00:07:47
going to be an upper limit as to what a
00:07:49
single instance can do. Not only this,
00:07:52
but the network stack itself can also be
00:07:54
a bottleneck, especially when dealing
00:07:56
with operations over the internet that
00:07:58
we saw already. Lastly, one thing that's
00:08:00
really important when it comes to Redis
00:08:02
and Valkey is system memory, given that
00:08:05
Redis is an in-memory data store and
00:08:07
more memory means more data stored and
00:08:09
fewer evictions. Therefore, whilst
00:08:12
pipelining is a great way to get more
00:08:13
performance out of your Redis instance,
00:08:15
it's not going to work when it comes to
00:08:17
true scaling. So, how can we achieve 1
00:08:20
million operations per second without
00:08:22
needing to use pipelining? Well, that's
00:08:25
where another approach comes in.
00:08:27
Horizontal scaling. Horizontal scaling
00:08:30
is where you increase the performance of
00:08:32
a system by scaling the number of
00:08:34
instances rather than the amount of
00:08:36
resources per instance. Basically,
00:08:38
you're deploying multiple instances of
00:08:40
Redis or Valkey across multiple
00:08:43
machines. However, just deploying these
00:08:45
across multiple machines doesn't really
00:08:47
do that much by itself. Instead, you
00:08:49
need to couple these deployments with a
00:08:51
horizontal scaling strategy in order to
00:08:54
determine how data is both stored and
00:08:56
retrieved. When it comes to Redis, and
00:08:58
indeed most data storage applications,
00:09:01
there are two horizontal scaling
00:09:02
strategies that you can take. The first
00:09:05
horizontal scaling strategy is known as
00:09:08
read
00:09:09
replication. This is where you deploy a
00:09:11
single instance known as the primary and
00:09:14
a number of other instances called
00:09:16
replicas. These replicas are constrained
00:09:20
to read-only operations, with the primary
00:09:23
being the only instance that allows data
00:09:25
to be written to it. When data is
00:09:28
then written to this primary instance,
00:09:30
it's then synchronized to the other
00:09:31
replicas in the replica set. To set up
00:09:34
read replication in Redis and Valkey is
00:09:37
actually incredibly simple. You can
00:09:39
either do so in the configuration or by
00:09:42
sending the REPLICAOF command, which you
00:09:44
can do through the CLI. In my case, I
00:09:46
decided to set this up on both my small
00:09:48
instance and on my Threadripper to
00:09:50
become replicas of the mid instance. As
00:09:53
you can see, I'm using the REPLICAOF
00:09:55
command to achieve this, passing in the mid machine's hostname.
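For reference, the command looks roughly like this from valkey-cli on each replica (the hostname and port are placeholders for the mid machine's address):

REPLICAOF mid-machine.local 6379

(Running REPLICAOF NO ONE later turns a replica back into a standalone primary.)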
00:09:57
If your primary instance
00:10:00
requires authentication, then there are
00:10:02
a couple of other steps you need to
00:10:04
take. I recommend reading the Redis or
00:10:06
Valkey documentation for whichever
00:10:08
version you're using. In any case, upon
00:10:10
executing these commands, replication is
00:10:12
now set up. And if I go ahead and make a
00:10:15
write to my primary instance, you'll see
00:10:17
that this key is now available on the
00:10:19
two replicas as well. Therefore, we can
00:10:22
now go ahead and make use of these in
00:10:24
order to improve the throughput of our
00:10:26
Redis deployment. In order to test how
00:10:28
much throughput we now have, we can
00:10:30
modify our benchmark command so that we
00:10:33
only write data to the primary and read
00:10:35
from the replicas, which is done using
00:10:38
the following three commands, setting
00:10:40
the ratio for the primary to be write-only
00:10:42
and setting the ratio for the replicas to be read-only.
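For reference, the three invocations look something like this (hostnames are placeholders; memtier_benchmark's --ratio flag takes a SET:GET ratio, so 1:0 is write-only and 0:1 is read-only):

memtier_benchmark -s mid-machine.local --ratio=1:0 --test-time 60
memtier_benchmark -s small-machine.local --ratio=0:1 --test-time 60
memtier_benchmark -s threadripper.local --ratio=0:1 --test-time 60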
00:10:44
Now, if I go
00:10:46
ahead and run this for about 60 seconds,
00:10:48
you can see the performance is pretty
00:10:50
good. We're hitting around 300,000
00:10:53
requests per second using three
00:10:55
instances with two replicas. So, all it
00:10:57
would take to reach 1 million would be
00:10:59
adding in maybe another seven. Whilst
00:11:02
this is perfectly achievable, when it
00:11:04
comes to realworld setups, read
00:11:06
replication isn't always viable. For
00:11:09
starters, whilst our total performance
00:11:11
across the three nodes has increased, we
00:11:13
haven't actually improved our write
00:11:15
performance, only our reads. This is
00:11:17
because we're constrained to only being
00:11:19
able to write to a single node, which
00:11:21
means our performance is constrained to
00:11:23
this one instance. In some setups, this
00:11:26
is actually okay, especially when it
00:11:29
comes to more read-heavy workflows, where
00:11:31
having multiple instances or multiple
00:11:33
replicas where you can read from will
00:11:35
directly scale performance. However,
00:11:38
there are still some trade-offs when it
00:11:39
comes to using this horizontal scaling
00:11:41
strategy. For one thing, this approach
00:11:44
is more expensive when it comes to
00:11:46
memory as we're effectively having to
00:11:48
replicate our entire data set across
00:11:50
multiple nodes. This means when it comes
00:11:52
to scaling the actual storage or the
00:11:54
amount of memory that we have to store
00:11:56
keys in our instances, we're back to
00:11:59
vertical scaling. And we can only
00:12:01
increase this capacity by increasing
00:12:03
the amount of memory available on each of
00:12:05
our nodes. Additionally, replication
00:12:08
also comes with another caveat called
00:12:10
lag. This is the time delta between data
00:12:13
being written to the primary and it being
00:12:15
available on the replicas, and in some
00:12:17
cases can cause data integrity issues.
00:12:20
This means that read replication has
00:12:22
what's known as eventual consistency,
00:12:25
which is something you don't normally
00:12:26
have to worry about when running in
00:12:28
standalone mode. Not only this, but
00:12:30
there's also a single point of failure
00:12:32
when it comes to this setup, the
00:12:35
primary. If this instance happens to go
00:12:37
down, then we're no longer able to write
00:12:40
any data to our entire Redis system.
00:12:43
Fortunately, there is a solution
00:12:45
provided for this by both Redis and
00:12:47
Valkey, which is known as Sentinel. This
00:12:50
solution monitors both the replicas and
00:12:53
the primary and will promote a replica
00:12:55
in the event that a primary goes down.
00:12:58
Sentinel is actually really awesome when
00:13:00
it comes to ensuring high availability
00:13:02
on a Redis installation. So much so
00:13:04
that it actually deserves its own
00:13:06
dedicated video. In any case, whilst
00:13:08
read replication is a simple approach to
00:13:11
horizontal scaling and is really
00:13:13
powerful when it comes to more read
00:13:15
heavy workflows, it still doesn't solve
00:13:17
some of the other issues that we've
00:13:19
mentioned. Therefore, this is where a
00:13:21
second approach to horizontally scaling
00:13:23
Redis comes in, known as Redis Cluster.
00:13:27
Redis or Valkey cluster provides a way
00:13:30
to run an installation where data is
00:13:32
automatically sharded across multiple
00:13:34
nodes. This allows you to distribute
00:13:36
your requests across multiple instances
00:13:39
which means you can effectively scale
00:13:41
CPU, memory, and networking to an infinite
00:13:44
amount. Not really. There is still a
00:13:47
finite limit. Regardless, the way that
00:13:50
this is achieved is by sharding data
00:13:52
across multiple nodes. Meaning rather
00:13:54
than each node having the full data set,
00:13:56
it splits across each node inside of the
00:13:58
cluster. This is done by using a
00:14:00
sharding algorithm. The way the
00:14:03
algorithm works is actually kind of
00:14:05
simple. The idea is that the cluster has
00:14:09
16,384 different hash slots. You can
00:14:12
think of these as being a bucket of
00:14:15
keys. Each of these slots or buckets is
00:14:18
then distributed across the nodes
00:14:20
evenly. Then in order to determine which
00:14:22
bucket or slot a key belongs to, the key
00:14:26
itself is then hashed using
00:14:28
CRC16, followed by taking that
00:14:30
result modulo the number
00:14:32
of hash slots, i.e.
00:14:36
16,384. This then returns the slot that
00:14:39
the key belongs to, allowing you to then
00:14:41
distribute it to the node that owns that hash slot.
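As a rough sketch of that calculation in Go (this follows the CRC16-XMODEM variant Redis and Valkey document for cluster key slots; the real implementation also only hashes the hash-tag portion of a key, which comes up a little later):

package main

import "fmt"

// crc16 implements the CRC16-XMODEM checksum (polynomial 0x1021) used by
// Redis/Valkey cluster to map keys to hash slots.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// hashSlot returns the cluster slot (0-16383) a key belongs to.
func hashSlot(key string) uint16 {
	return crc16([]byte(key)) % 16384
}

func main() {
	fmt.Println(hashSlot("user:1001")) // prints the slot for an example key
}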
00:14:43
Whilst setting up cluster
00:14:45
mode is a little more involved than read
00:14:47
replication, it's still not that
00:14:49
difficult. To do so, you first need to
00:14:52
add in the following three options into
00:14:54
your Valkey/Redis configuration. These are
00:14:58
cluster-enabled, cluster-config-file, and cluster-node-timeout.
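For reference, the configuration snippet looks something like this (the timeout value is a placeholder; pick whatever suits your network):

cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000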
00:15:00
With those
00:15:03
three configuration options applied for
00:15:05
each of the Valky instances you wish to
00:15:07
clusterize, all that remains is to use
00:15:10
the following cluster create command on
00:15:12
one of the nodes, passing in the host
00:15:14
port combinations of all of the
00:15:16
instances you wish to form the cluster with.
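For reference, the create command looks roughly like this, with placeholder addresses for the three machines:

valkey-cli --cluster create 192.168.1.10:6379 192.168.1.11:6379 192.168.1.12:6379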
00:15:18
This will then present you with
00:15:21
the following screen which will show you
00:15:23
the distribution of hash slots across
00:15:25
each of the nodes as well as prompting
00:15:27
you for confirmation. Upon doing so, the
00:15:30
cluster will then be set up hopefully
00:15:32
and should let you know when everything
00:15:34
is working, which we can then go ahead
00:15:36
and confirm using the cluster info
00:15:38
command on one or all of our instances.
00:15:41
With the cluster setup, if I now go
00:15:43
ahead and run the memtier_benchmark
00:15:45
command again, this time making sure
00:15:47
it's set to cluster mode by using the
00:15:49
following flag, you can see that both
00:15:51
the read and write throughput is now
00:15:54
substantially increased, hitting around
00:15:56
400,000 requests a second. Very cool.
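For reference, cluster mode in memtier_benchmark is switched on with the --cluster-mode flag, so the run looks roughly like this (the node address is a placeholder):

memtier_benchmark -s 192.168.1.10 --cluster-mode --test-time 60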
00:16:00
As you can see, this is similar to the
00:16:02
throughput we were getting when it came
00:16:03
to using read replicas. However, the
00:16:06
benefit here is that we're able to both
00:16:08
read from and write to all three of our
00:16:11
instances instead of just only being
00:16:13
able to write to one. Not only does this
00:16:16
mean that we have improved performance
00:16:17
when it comes to our write operations,
00:16:19
but it also means we can make use of the
00:16:21
available resources on each of our
00:16:23
machines by deploying multiple Valky
00:16:26
instances on each node. For example,
00:16:29
here I've gone and deployed another
00:16:30
seven instances on my midtier machine,
00:16:33
bringing the total number of nodes in my
00:16:35
Valkey cluster to 10. This means that
00:16:37
the cluster should be making better use
00:16:39
of all of the available hardware on my
00:16:41
mid-tier machine, which if I now go
00:16:43
ahead and run a benchmark test against,
00:16:45
you can see I'm hitting 1 million
00:16:47
requests per second. Hooray. Of course,
00:16:51
this is in perfect conditions, running
00:16:54
on my local network on bare metal. So,
00:16:58
in order to really complete this
00:16:59
challenge, then we're going to want to
00:17:01
take a look at how we can achieve 1
00:17:03
million operations per second whilst
00:17:05
running on the cloud using VPS
00:17:08
instances. Before we do that, however,
00:17:10
let's first talk about some of the
00:17:11
caveats associated with cluster mode
00:17:14
because there are a few. The first of
00:17:16
which is that in order to be able to
00:17:18
send commands to it, you need to use a
00:17:21
cluster-aware client. Now, to be fair,
00:17:23
this isn't too much of an issue as most
00:17:26
Redis clients provide support for
00:17:27
cluster mode. However, it does mean that
00:17:29
any existing code that makes use of
00:17:31
Redis does need to be modified at least
00:17:34
slightly. For example, if I try to
00:17:35
connect to an instance in my cluster
00:17:37
using the Valkey CLI and try to pull out
00:17:40
the following key, you'll see I get an
00:17:42
error letting me know that the key has
00:17:43
been moved. Therefore, in order to be
00:17:46
able to use the Valkey CLI to send
00:17:48
commands to the Valkey cluster, I would
00:17:51
need to make use of the -c flag in
00:17:53
order to connect in cluster mode, and my
00:17:56
client will then be routed to the correct node that contains this key.
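For reference (the key name and node address are placeholders):

valkey-cli -h 192.168.1.10 -p 6379 get mykey      # may return a MOVED error
valkey-cli -c -h 192.168.1.10 -p 6379 get mykey   # -c follows the redirect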
00:17:57
In
00:18:00
addition to ensuring that the client
00:18:01
connects to the cluster correctly,
00:18:03
there are some other caveats as well.
00:18:06
Specifically, when it comes to multi-key
00:18:08
operations, such as working with
00:18:10
transactions, pipelines, or Redis Lua
00:18:13
scripts, in each of these cases, you
00:18:15
need to ensure that any related keys
00:18:17
will belong to the same hash slot.
00:18:20
Otherwise, any scripts or transactions
00:18:22
won't be able to be used. Fortunately,
00:18:24
Reddus and Valky provide a way to
00:18:26
achieve this, which is to make use of a
00:18:29
hashtag. Not the social media kind of
00:18:31
hashtags. Instead, a hashtag is defined
00:18:34
within an actual key as follows,
00:18:37
specifying the ID that you want to be
00:18:39
hashed using the following syntax. In
00:18:42
this case, the part being hashed is 123
00:18:45
rather than the actual full key itself.
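For reference, a pair of keys sharing a hash tag looks something like this (names and values are illustrative); only the 123 between the braces is hashed, so both land in the same slot:

SET user:{123}:name "alice"
SET user:{123}:email "alice@example.com"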
00:18:48
By doing this, it means that any keys
00:18:50
that share the same hashtag will be
00:18:52
placed inside of the same hash slot,
00:18:54
which means you're then able to perform
00:18:55
any multi-key operations such as
00:18:58
transactions or scripts. Whilst hashtags
00:19:01
solve the issue of key distribution,
00:19:03
there are still many other problems when
00:19:05
it comes to running a distributed system
00:19:07
such as a Valky cluster with perhaps the
00:19:10
most major one being how to ensure
00:19:12
reliability in the event that a node
00:19:14
goes down. Now, to be fair, Redis
00:19:16
cluster does provide some resilience
00:19:18
when it comes to availability. If a node
00:19:21
is dropped from a cluster, then those
00:19:22
hash slots will be redistributed.
00:19:24
However, the data can be lost.
00:19:27
Fortunately, cluster mode also provides
00:19:29
the ability to set up replication.
00:19:31
However, this differs slightly from the
00:19:34
replication we saw before in that rather
00:19:36
than replicating the entire data set,
00:19:39
these replicas instead contain a copy of
00:19:41
their respective shards or different
00:19:43
hash slots. Additionally, this means you
00:19:45
can have multiple replicas per shard,
00:19:48
providing you high availability and
00:19:50
reducing the risk of data loss in
00:19:52
cluster mode. This does mean however
00:19:54
that you'll want to ensure that each of
00:19:55
these replicas is on a different machine
00:19:58
than the primary and ideally from each
00:20:00
other as well and ultimately means your
00:20:02
total Redis system is going to become
00:20:05
more complex. Fortunately, there are
00:20:07
tools out there such as IaC, Kubernetes,
00:20:10
or managed providers to help make this
00:20:12
complexity more manageable. In my case,
00:20:15
when it came to deploying a cluster onto
00:20:17
a number of VPS instances, I ended up
00:20:20
writing the following Terraform
00:20:21
configuration. Well, actually, it's an
00:20:23
OpenTofu configuration, which I'm now
00:20:26
pinning on my wall instead. Regardless,
00:20:28
this configuration allows me to deploy a
00:20:30
Valkey cluster onto one of two different
00:20:33
providers, either DigitalOcean or
00:20:35
Hetzner, which I used to see if I could
00:20:38
reach 1 million operations per second.
00:20:41
As it turned out, it was a little bit
00:20:43
more challenging than I thought.
00:20:45
However, before we take a look at
00:20:46
whether or not I was able to achieve
00:20:48
this on the public cloud, let's take a
00:20:50
quick look at another way to achieve 1
00:20:53
million requests per second. One that is
00:20:55
actually a lot simpler than setting
00:20:57
up a Valkey cluster. This is through
00:20:59
using the sponsor of today's video,
00:21:02
Dragonfly DB, who, as I mentioned
00:21:04
before, provide a drop-in replacement
00:21:06
for Redis that boasts greater
00:21:08
performance. To show how much
00:21:09
of a performance improvement Dragonfly DB
00:21:11
offers, if I go ahead and deploy an
00:21:13
instance of it onto my small machine,
00:21:16
followed by performing the same
00:21:17
benchmark test we've been using before,
00:21:19
you can see I'm hitting about 250,000
00:21:22
requests per second, which isn't that
00:21:24
much of an improvement compared to the
00:21:26
existing instance I was using before.
00:21:28
However, if I go ahead and now deploy
00:21:29
this on my mid machine, you can see the
00:21:32
performance improvement is now
00:21:34
substantial. This time I'm running about
00:21:37
twice as fast as I was before, which
00:21:39
makes a lot of sense given there's twice
00:21:40
as many cores on this machine. As you
00:21:43
can see, by using Dragonfly DB, which
00:21:45
makes better use of the available
00:21:46
resources on a system, we're able to now
00:21:49
vertically scale our system compared to
00:21:51
just using a single-core implementation
00:21:53
like we were before. So, let's see what
00:21:55
happens if we run Dragonfly on the big
00:21:58
boy Thread Ripper. Can we hit 1 million
00:22:00
requests per second by just running a
00:22:03
single instance in standalone mode?
00:22:05
Turns out we can, by just a hair. So, 1
00:22:09
million requests achieved by just using
00:22:12
vertical scaling. And this number can
00:22:14
actually go even further. In fact, the
00:22:17
team at Dragonfly managed to reach 6
00:22:19
million requests per second when it came
00:22:21
to running on a VPS. All of this is
00:22:23
achieved without having to worry about
00:22:25
setting up multiple instances or any of
00:22:28
the caveats that come from using cluster
00:22:30
mode. That being said, there are still
00:22:33
some good reasons to embrace the
00:22:34
complexity of clustering, such as when
00:22:37
you want to distribute your key set
00:22:38
across multiple machines, either for
00:22:40
scaling memory or for redundancy.
00:22:43
Dragonfly itself does provide a way to
00:22:45
horizontally scale using their own swarm
00:22:48
mode, which was announced a short while
00:22:50
before filming this video. So, if you're
00:22:53
interested in Dragonfly DB as a drop-in
00:22:55
replacement for Redis that offers
00:22:57
greater performance, then check them out
00:22:59
using the link below. Okay, so now that
00:23:02
we've seen how to reach 1 million
00:23:03
requests per second in perfect
00:23:05
conditions, let's take a look at how
00:23:07
difficult it is to achieve this on
00:23:09
something less perfect, the cloud. As I
00:23:12
mentioned before, I'd managed to set up
00:23:13
a Terraform, or OpenTofu, configuration
00:23:16
for both DigitalOcean and Hetzner, which
00:23:19
you can actually download yourself from
00:23:20
GitHub if you want to try it out. Just
00:23:23
remember, don't leave these instances
00:23:25
running for a long time, or you'll end
00:23:27
up with a large bill at the end of the
00:23:30
month. So there are some instructions on
00:23:32
how to actually deploy this Terraform
00:23:33
configuration on the actual repo itself,
00:23:36
but at a high level, you mainly just
00:23:38
need to set your API token and SSH key
00:23:40
values in the tfvars file for whichever
00:23:43
cloud provider you want to use, followed
00:23:46
by then running the tofu apply command.
00:23:49
Once you're done with your testing, in
00:23:51
order to clean everything up, go ahead
00:23:52
and use the tofu destroy command just to make sure you don't go bankrupt.
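For reference, the workflow is roughly the following (the actual variable names to set live in the repo's example tfvars file):

tofu init
tofu apply
# ...run your benchmarks...
tofu destroy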
00:23:55
As I
00:23:57
mentioned, I ended up writing an
00:23:59
OpenTofu configuration to deploy a Valkey
00:24:02
cluster to either DigitalOcean or
00:24:04
Hetzner. However, this wasn't my original
00:24:07
plan as I had intended to only use one
00:24:10
of these providers, Hetzner. But for some
00:24:14
reason, I couldn't seem to get anywhere
00:24:16
close to 1 million operations per
00:24:18
second, instead barely exceeding
00:24:22
100,000 no matter what I did or how I
00:24:24
had the cluster configured. Overall, it
00:24:28
was kind of strange and my best guess as
00:24:31
to why this was happening was because I
00:24:33
was using shared vCPUs. So, I went about
00:24:36
migrating to dedicated ones instead.
00:24:39
However, because my account was too new
00:24:41
to request a limit increase, I was
00:24:43
unable to provision any more dedicated
00:24:45
vCPUs. So when it came to hitting a
00:24:48
million Valkey operations per second
00:24:50
using Hetzner, I was out of moves.
00:24:53
Therefore, I instead decided to try with
00:24:55
DigitalOcean, first reaching out to
00:24:58
support in order to get access to the
00:25:00
larger instance sizes so that I could
00:25:02
run benchmarking without having any
00:25:04
bottlenecking. Once I had my compute
00:25:06
limits increased, I then deployed the
00:25:08
cluster using tofu apply and SSHed into my
00:25:11
benchmark box. Once I had everything set
00:25:14
up and the cluster was deployed, I then
00:25:17
went about running the memtier_benchmark
00:25:19
command. And on my initial attempt,
00:25:22
which had nine Valkey nodes inside of
00:25:23
the cluster and a 16-vCPU machine
00:25:26
for the benchmarking tool, I was hitting
00:25:28
around 450,000 requests per second. Not
00:25:32
bad. After confirming that it was the
00:25:34
benchmark tool that was bottlenecking, I
00:25:36
decided to scale up the instance it was
00:25:38
running on to one with 32 vCPUs. This
00:25:42
time when I ran the benchmarking tool
00:25:43
with 32 threads, I was getting around
00:25:47
900,000 off the get-go. However, as
00:25:50
time went on, this number would start to
00:25:52
decrease down to about
00:25:54
800,000. So, I decided to go all-in and
00:25:58
scaled up the number of Valkey nodes I
00:26:00
had from 9 to 15. After doing a quick
00:26:03
tofu apply and seeing all of the
00:26:05
instances come through on the DigitalOcean
00:26:07
dashboard, I SSHed in and ran the
00:26:10
memtier_benchmark command
00:26:13
again. Success. I was now hitting a
00:26:16
sustained 1 million operations per
00:26:19
second.
00:26:22
With that, my goal had been achieved,
00:26:24
and all it took was for me to deploy 15
00:26:26
Valkey instances on premium Intel nodes,
00:26:30
which, had I left them running, would have only
00:26:31
cost me around $1,100 a month. Yeah, a
00:26:35
little out of my infrastructure budget.
00:26:38
Just for fun, I decided to see how much
00:26:40
throughput I could get by using a
00:26:42
pipeline of 100 when it came to this
00:26:44
setup, which ended up producing around
00:26:47
14 million operations per second. So
00:26:50
yeah, that just goes to show how much
00:26:51
improvement you can get when it comes to
00:26:53
reducing the round trip time by using
00:26:55
pipelining with Redis. In any case, all
00:26:58
that remained was to tear down my setup
00:27:00
using tofu destroy. And with that, I had
00:27:03
managed to achieve my goal, hitting 1
00:27:05
million requests per second, both using bare metal
00:27:08
on my home network, which honestly is
00:27:09
kind of cheating, and through using a
00:27:12
VPS instances on DigitalOcean, using
00:27:15
private networking in the data center.
00:27:18
In any case, I want to give a big thank
00:27:19
you to Dragonfly DB for sponsoring this
00:27:22
video and making all of this happen.
00:27:24
Without them, I wouldn't have been able
00:27:25
to spend so much cash on VPS instances
00:27:28
for testing. Additionally, if you want
00:27:30
to use a high-performance Redis
00:27:32
alternative without managing your own
00:27:34
infrastructure, then Dragonfly also
00:27:36
offers a fully managed service,
00:27:38
Dragonfly Cloud, which runs the same
00:27:41
code as if you were self-hosting, but
00:27:43
just handles all of the operational
00:27:45
heavy lifting for you. So, if you want a
00:27:48
hassle-free caching solution, then check
00:27:50
it out using the link in the description
00:27:52
below. Otherwise, I want to give a big
00:27:55
thank you for watching, and I'll see you
00:27:56
on the next one.