00:00:00
You've heard everywhere that data modeling is important if you're a data engineer, and I've somewhat unintentionally created a little miniseries on it. I've got two other videos you may have seen (if not, you can watch them after this one) where I give an intro to data modeling, cover important concepts, and go over things like fact tables and dimension tables. Now we're going to continue the miniseries by talking about historical data: how to actually model data for history and why it matters. But before diving in, hey everyone, welcome back to another video with me, Ben Rogojan, aka the Seattle Data Guy. As I said, today we're talking about data modeling, specifically slowly changing dimensions and other methods to capture historical data. So let's dive in.
00:00:55
Recently I also saw that Zach Wilson put out a video where he discussed a few other table types, including cumulative tables, which are another way you can definitely model historical data. We're not going to talk about that today, though; we're going to talk about facts and dimensions. If you think about historical fact data, generally speaking, fact data (at least the way you might be thinking about it) is already capturing historical data, because you're capturing events. Say you have a fact table: you might have a date or date ID so it can connect to a date dimension table, an event timestamp to capture the actual time the event happened, and then some other dimension IDs, maybe a customer purchasing a product (let me change this field to customer, since I think that'll be more consistent). Every event that occurs can be thought of as an insertion; as much as the diagram says "process," I'm just viewing each event as a row. So here's event one for customer one, and then you just keep appending: there's event two, then another customer ID comes in with its events, and so on. Historical data is captured simply because each event lives in its own row. We'll also talk about a different way to capture this kind of event data at the end of the video, so don't think appending is the only way to handle fact data. There's at least one other way I've used, and again, these are just approaches I've personally seen and used successfully across dozens of companies; I'm sure there are others.
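As a minimal sketch of that append-only pattern (the table name fact_purchase and its columns are illustrative assumptions, not something shown in the video):

```sql
-- Hypothetical append-only fact table for customer purchase events.
CREATE TABLE fact_purchase (
    event_id        BIGINT,        -- unique ID for the event
    date_id         INT,           -- connects to the date dimension table
    event_timestamp TIMESTAMP,     -- actual time the purchase happened
    customer_id     BIGINT,        -- connects to dim_customer
    product_id      BIGINT,        -- connects to dim_product
    amount          DECIMAL(10, 2)
);

-- Every new event is simply appended as another row; nothing is updated
-- or deleted, so the table carries its own history.
INSERT INTO fact_purchase VALUES
    (1, 20230101, TIMESTAMP '2023-01-01 10:15:00', 1, 42, 19.99),
    (2, 20230101, TIMESTAMP '2023-01-01 11:02:00', 2, 17,  5.49);
```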
00:02:35
Now, a quick pause: this is actually me from the future. One thing I realized as I was talking about fact data is that I said all that data just comes in and you append it, but honestly it can really depend on how the data arrives. For example, I worked at a company where we had to load data in a few different ways. There's the easy way, which is just appending: every new event you get, you append. Sometimes this even takes care of things like reverse-outs, where you have something like a return, or, if you're in healthcare, maybe one of the bills was disputed, so a negative $50 gets appended onto that bill. That's one pattern where everything is appended. Alternatively, you might get a replacement: the same event ID comes in again, and the new record fully replaces the prior one, so you delete the old data, and instead of subtracting $50, the row simply carries the new amount owed. So it's important to note that although the table might always end up looking the same, just a bunch of events in rows, how the data gets inserted can vary, and that will change how you develop your system on the other end.
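Here's a rough sketch of those two load patterns against a hypothetical fact_billing table; the event IDs and dollar amounts are made up for illustration:

```sql
-- Pattern 1: pure append, including "reverse outs".
-- A disputed $50 arrives as a new negative row; the original row is untouched.
INSERT INTO fact_billing (event_id, customer_id, amount, event_timestamp)
VALUES (101, 1, -50.00, TIMESTAMP '2023-02-01 09:00:00');

-- Pattern 2: replacement by ID.
-- A corrected version of the same event fully replaces the prior row
-- instead of appending an adjustment.
DELETE FROM fact_billing WHERE event_id = 101;

INSERT INTO fact_billing (event_id, customer_id, amount, event_timestamp)
VALUES (101, 1, 450.00, TIMESTAMP '2023-02-01 09:00:00');
```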
00:03:49
Just wanted to include that quick caveat; now we can jump back to past Ben. That's how you can think about a fact table: it's very straightforward in the sense that as each row comes in, you're capturing fact information and you're not losing it. Not losing it is the key. The problem comes when you've got a dimension table, so let's put one together; I usually think a good example is dim_customer. You've got a customer ID, a city where they live (a varchar of some kind), maybe a date created, and their name (you can ignore the last column, and a couple of these would probably be swapped). Now you've got this dim_customer table, and if we dive into it, it doesn't have to show all the columns; we really only need a few for the example: customer ID, customer name, and city. Here's what you'll have: ID 1, a customer named John, who lives in New York City, and, just for the example, another row for Jane in Seattle. We've captured this data, pulled it out of the database, and this is how the data set is currently modeled: customer ID, name, and city. The problem is that someday in the future this customer will likely move. Say John also moves to Seattle. Now that he's moved, suppose management asks a question like: how many sales are we getting from customers who live in different areas? Not by store location, but by customer: how much do customers from different states purchase from us? When you write that query against this customer data, John will now show up as Seattle. The fact that he used to live in New York isn't captured anywhere; this table simply cannot hold that information. You've lost where he lived in the past, so you're losing information, and this is where you actually have to think about how you model historical data.
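A minimal sketch of why this plain version is lossy, assuming the columns described above; the UPDATE is the overwrite that throws the old city away:

```sql
-- A plain dim_customer with no history tracking.
CREATE TABLE dim_customer (
    customer_id   BIGINT,
    customer_name VARCHAR(100),
    city          VARCHAR(100)
);

INSERT INTO dim_customer VALUES
    (1, 'John', 'NYC'),
    (2, 'Jane', 'Seattle');

-- When John moves, the in-place overwrite (essentially SCD type 1) destroys
-- the fact that he ever lived in New York; past sales by city become wrong.
UPDATE dim_customer
SET city = 'Seattle'
WHERE customer_id = 1;
```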
00:06:14
When you hear "slowly changing dimensions," what that really means is trying to capture and model how these dimension data sets change over time. In general there are a few different types of slowly changing dimensions, and I've gone over them in the past. Just overwriting the value in place, switching NYC to Seattle or Seattle to NYC, is arguably type 1. Generally speaking, though, I've really only seen type 2, or some level of type 6, in practice, and I don't think I've ever seen type 6 perfectly implemented. Essentially, you can think of it as this table having at least two more columns (and we'll add one more row for the example): a start date, basically a start effective date (they might call it the effective date or something similar), and an end date; I'm simplifying here. What this does is, for that NYC row, capture when that version starts and stops, essentially when the customer lived there versus somewhere else. Say it starts 2023-01-01 and ends 2024-01-01 (let's make this easy on me). Then, when the updated data comes in, you add a new row that starts on, say, 2024-01-01 with a null end date, and the rows that haven't changed yet work the same way: their end dates stay null. What this does is capture dimensional data through history. As things change, you should generally only pull rows where the end date is null, so when you write the query you'll say: select from dim_customer where end_date is null. That way you only pull the most recent information, and you're not getting all the historical duplicate rows. That's one part of how this pattern is used.
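Sketching that out, a type-2-style dim_customer might look roughly like this; the column names effective_date and end_date are just one common convention:

```sql
-- dim_customer with effective/end dates so each address gets its own row.
CREATE TABLE dim_customer (
    customer_id    BIGINT,
    customer_name  VARCHAR(100),
    city           VARCHAR(100),
    effective_date DATE,
    end_date       DATE          -- NULL marks the current row
);

INSERT INTO dim_customer VALUES
    (1, 'John', 'NYC',     DATE '2023-01-01', DATE '2024-01-01'),
    (1, 'John', 'Seattle', DATE '2024-01-01', NULL),
    (2, 'Jane', 'Seattle', DATE '2023-01-01', NULL);

-- Pull only the most recent version of each customer.
SELECT *
FROM dim_customer
WHERE end_date IS NULL;
```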
00:07:59
Then, if you're trying to report historically, you might actually bring in that missing data: the various places John lived over time. When you join this data, you'll do something like a BETWEEN, likely joining on the date field, so that for any given purchase you only consider the dimension row whose date range contains it. That way, when you pull in the city for a specific customer, the correct version is used. You'll obviously end up with a bunch of duplicate customer fields, but each of them belongs to its specific events, because now you're working at the sales level rather than the customer level. So you'll have multiple instances of John, and in some cases John will be living in New York City while in others he'll be living in Seattle, which makes sense: once he moved, you don't want to report the old city going forward, but you do want to accurately report the past.
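One way that historical join could look, reusing the assumed fact_purchase and dim_customer sketches from above; the range condition plays the role of the BETWEEN described here:

```sql
-- Join each sale to the customer row that was valid when the sale happened,
-- so John's older purchases report as NYC and his newer ones as Seattle.
SELECT
    f.event_timestamp,
    f.amount,
    c.customer_name,
    c.city
FROM fact_purchase AS f
JOIN dim_customer AS c
  ON  f.customer_id = c.customer_id
  AND f.event_timestamp >= c.effective_date
  AND (f.event_timestamp < c.end_date OR c.end_date IS NULL);
```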
00:08:54
Now, this is just one way to capture this kind of data set. I say that because at Facebook we actually did it slightly differently. What I've described is your standard slowly changing dimension, but the way we did it at Facebook (let me grab a box here) was essentially with partitions. Somewhere in a folder there would be, in effect, a file for 2023-01-01: on that date we took a snapshot of all employee information at that time. This is really a folder system, so you can picture some folder called employee above it, because it lives in a Hive metastore on top of an HDFS folder structure. From there, for every subsequent date you'd just have another partition, and we'd store all the data each time. The first one would be all the data snapped on that day, a picture of that specific day; for 01-02 we'd capture the data the same way, and likewise for 01-03. So each partition is all the employee data as of that specific date.
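As an assumed approximation of that layout (the video doesn't show the actual Facebook setup), a daily snapshot table partitioned by date might look like this in Hive-style SQL:

```sql
-- Each day's full snapshot of the employee dimension lands in its own
-- partition, e.g. warehouse/employee/ds=2023-01-01/ on HDFS.
CREATE TABLE employee (
    employee_id BIGINT,
    name        STRING,
    role        STRING,
    city        STRING
)
PARTITIONED BY (ds STRING);   -- 'ds' = the snapshot date

-- A daily job writes one partition containing everything as of that day.
INSERT OVERWRITE TABLE employee PARTITION (ds = '2023-01-02')
SELECT employee_id, name, role, city
FROM source_employee_system;
```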
00:10:10
Part of this was possible because storage was fairly cheap, and it could make things pretty easy for analysts, because on the query side they'd do something like select star (obviously don't actually do select star; it's just for the example) from employee, and we had macros for the date. Pretty much every table had a standard partition column we called ds, and there were a few macros you could use, one of which was basically "current ds." I don't remember the exact name (someone who works at Facebook can maybe comment below), but it was something that effectively meant today's date. You could also use your own macros in dashboards. So say you end up having to join this to some other table, say employee learnings, because there was a learnings table of the courses you go through internally. You'd join on employee ID (e.employee_id = l.employee_id; you can be mad at me later for the one-letter aliases), and you also have to join on the ds, because you want to make sure you aren't mixing snapshot dates: e.ds = l.ds. Then maybe the dashboard had a date filter, so it would automatically populate, say, 2023-01-03, and as that propagated through the join it would constrain e.ds, and the join on l.ds would automatically pick only that one date. You're not going to get duplicate data; you only get that one day's snapshot, and only the learnings data from the same snapshot, if that's your goal (it might not be, this is just an example). That way you could have your own historical approach: if you want a different date, you just filter to a different ds as you're looking through the overall report.
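A rough sketch of the kind of query being described; employee_learnings is a hypothetical table name, and '<DATEID>' stands in for whatever the real macro or dashboard filter was, since the exact macro name isn't recalled in the video:

```sql
-- Query a single day's snapshot and join two snapshot tables on the same ds,
-- so rows from different snapshot dates never get mixed together.
SELECT *
FROM employee AS e
JOIN employee_learnings AS l
  ON  e.employee_id = l.employee_id
  AND e.ds = l.ds
WHERE e.ds = '<DATEID>';   -- macro or dashboard filter, e.g. '2023-01-03'
```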
00:12:19
That could still be difficult to use sometimes, though. Imagine we're looking at this example again, but now you want to think about when employees change roles. So instead of city, you've got employee ID, name (which stays the same), and role, which is the thing that changes: maybe they're a data engineer here, still a data engineer here, and now they've become a project manager. One way you could answer questions about this is to keep hundreds of date partitions and figure out exactly where the change happened. But if you want something like the average number of days between someone switching roles at Facebook, that's going to be really difficult to calculate that way, with a lot of manual work, whereas really all you'd have to do is query a single employee table, count the number of days between the start and end of each role, and take the average, which is far more straightforward than digging through all those partitions.
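For comparison, here's roughly what that calculation could look like against an assumed dim_employee table that has the effective/end date columns from the earlier slowly changing dimension sketch:

```sql
-- Average number of days an employee spends in a role before switching,
-- computed directly from each row's start/end dates rather than by diffing
-- hundreds of daily snapshots. (datediff(end, start) is Hive/MySQL-style;
-- adjust for your engine.)
SELECT AVG(DATEDIFF(end_date, effective_date)) AS avg_days_in_role
FROM dim_employee
WHERE end_date IS NOT NULL;   -- only completed stints, i.e. actual role changes
```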
00:13:14
So we actually still ended up needing to create that kind of table. I often find you still sometimes need to build some of these tables: maybe the other method is more performant in general, but you end up making the trade-off and building the specific slowly changing dimension table anyway, because it's just more succinct. Instead of keeping every date partition ever (and we did have a retention period, 90 days was the default, so this data would generally be deleted after 90 days), you could have essentially two rows, and that's all you really need to know about John. You don't need 90 days of data; you need two rows. So we still sometimes switched to this approach when required. That's just something I like to compare against: yes, you'll see certain things done at large companies, and there are other benefits, which we'll get to, in terms of why a company might do it that way. There are performance reasons, and it's easier to query: instead of having to do the BETWEEN logic, you can just ask for the current data, and oftentimes current data was what we really cared about. Also, when data gets really big, sometimes you don't want to store as much; you just want the current date's view rather than a very large data set. And that's where I'm going to go into the next example.
00:14:25
For that, I'm going to go to Roblox; they covered this really well. One of the things we talked about earlier was fact data, and one type of fact you might capture is page visits: how many times people visit a page, maybe how many times they visit it in a day. You might get six, seven, eight, or a hundred page visits per user in a day, and that gets really big. If you think about something like Facebook, you could have millions of visits in a day, and over time that becomes enormous, yet you may still want to query that data efficiently and talk about historical data, at least in an aggregated fashion. You can view each of these rows as a user, and I'll treat it as how often someone shows up to the website on a given day; that's already somewhat aggregated, but you can imagine those five visits a day sitting in some very granular table somewhere, and that's going to be very hard, very big, very expensive, and very slow to query, especially if you only want a specific subset of users. So another way to model it is: if we already know you're going to filter by users, why not create one very wide table with what we often call a date list? There are a few ways to implement it; some people use bits, zeros and ones, to indicate whether someone showed up, but another way is to use a dictionary that says which days people showed up and how many times. That way, if you have a billion users but only want a specific set of them, maybe you're only interested in US data at a given time, you can get to it really quickly. You're not scanning billions and billions of event rows just to pull that data back; you only have to go through however many billion people are on your service, say two billion, which isn't that big as data sets go, whereas the raw event data for two billion users is huge. You've shrunk that down to at most three or four billion rows to filter, and on top of that, all your data is nicely aggregated in a date list, so you can simply say: go into that field and tell me how many visits there were in this window of time. You've also got the first and last visit dates, so you can quickly tell whether someone showed up to the site in a given period.
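A sketch of that date-list idea using a map column, in Hive/Spark-style SQL; the table and column names here are assumptions rather than Roblox's actual implementation:

```sql
-- One row per user, with a dictionary of date -> visit count instead of one
-- row per visit. Billions of raw events collapse to roughly one row per user.
CREATE TABLE user_visit_datelist (
    user_id          BIGINT,
    first_visit_date DATE,
    last_visit_date  DATE,
    visits_by_date   MAP<STRING, INT>   -- e.g. {'2023-01-01': 5, '2023-01-03': 2}
);

-- How many visits did one user make on a specific day?
SELECT visits_by_date['2023-01-03'] AS visits
FROM user_visit_datelist
WHERE user_id = 42;
```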
00:16:35
That drastically changes the way you track this event data. It's generally more pertinent to companies with very large data sets, because it does add a bit of technical complexity: you could argue the raw table is simpler for most analysts to query, whereas with a dictionary there's an extra layer to deal with. So there's a trade-off. The granular table is very easy to query ("I just want a certain date range, summing this field"), but at a certain point you may need this kind of summarized data, and that's one reason you might see data modeled this way to track historical information. It can also lose a little bit of information, since you're not keeping as much granularity, but again, there are pros and cons to both approaches.
00:17:21
Hoping that's helpful for those of you trying to understand how to model historical data. We've hopefully covered why it's important: if you have reports to build, it helps you add context and ensures you don't lose information, but there are also performance reasons you might model your data differently. Half the reason I think we do slowly changing dimensions is that you used to have limited space and couldn't just keep every date partition. As easy as it sounds to take a snapshot every day and capture everything, not everyone can afford that; not everyone has unlimited compute and storage. So it's really important to make sure your data model fits not just your business needs but also your technical needs. You generally can't store infinite data on-prem unless you're a large company with its own internal cloud, like Facebook. Hopefully that was helpful for all of y'all; I went through a few different ways you can store historical data, and with that, I'll see you all in the next one. Thanks, goodbye.