Data Modeling Where Theory Meets Reality - How Different Companies I Worked At Modeled Their Data

00:18:13
https://www.youtube.com/watch?v=rqLbn1PQPKA

Summary

TLDR: In this video, Ben Rogan discusses why data modeling matters for capturing historical data, especially for data engineers. He introduces slowly changing dimensions, which allow changes in data to be tracked over time without losing historical context. Ben explains how fact tables capture events and therefore inherently store historical information, and how dimension tables can lose history when attributes change, such as a customer's address; Slowly Changing Dimensions Type 2 addresses this by maintaining start and end dates for each version of a row. He also covers strategies used at companies like Facebook and Roblox for managing large historical datasets: Facebook snapshots dimension data into daily partitions, while Roblox summarizes page visits into compact date lists to avoid enormous data volumes. Ben emphasizes the balance between granular detail and technical feasibility, helping viewers choose methods for modeling historical data that fit both business and technical needs.
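
As a rough illustration of the append-only fact pattern described above, here is a minimal SQL sketch; the table and column names (fct_sales, date_id, and so on) are illustrative rather than taken from the video.

    -- A fact table captures events as they happen; history is kept by appending rows.
    CREATE TABLE fct_sales (
        date_id         INT,            -- joins to a date dimension
        event_timestamp TIMESTAMP,      -- exact time of the event
        customer_id     INT,            -- joins to dim_customer
        product_id      INT,
        amount          DECIMAL(10, 2)
    );

    -- Every new event is appended as its own row.
    INSERT INTO fct_sales VALUES (20240101, TIMESTAMP '2024-01-01 10:15:00', 1, 42, 19.99);

    -- Some feeds append corrections (e.g. a -50.00 reversal) while others send a
    -- replacement row for the same event ID; the table looks the same either way,
    -- but the load logic differs.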

Key Takeaways

  • 🗂️ Understanding data modeling is crucial for data engineers.
  • 📈 Fact tables inherently capture historical data by recording events.
  • 🔄 Slowly changing dimensions help track changes over time without data loss.
  • ⚙️ Data modeling aids in answering business queries effectively with preserved historical context.
  • 🏢 Methods for capturing historical data vary across companies like Facebook and Roblox.
  • 💾 Large datasets require efficient storage strategies to manage historical data.
  • 📊 Effective data modeling balances business needs with technical constraints.
  • 🛠️ Different approaches cater to varying scenarios, such as customer moves or page visits.
  • 📜 Type 2 SCD keeps track of historical changes by using start and end dates.
  • 🌐 Not all companies can afford extensive data storage solutions; efficiency is key.

Timeline

  • 00:00:00 - 00:05:00

    Ben Rogan introduces the importance of data modeling for data engineers, highlighting previous videos in a miniseries that cover foundational concepts. In this installment, he focuses on modeling historical data, specifically slowly changing dimensions and other methods for capturing history without losing information. The talk begins by outlining how fact tables capture events as they happen, using date IDs and timestamps, and explains how appending versus replacing data affects what history is retained, particularly when changes occur, such as returns or corrected amounts.

  • 00:05:00 - 00:10:00

    Ben illustrates the problem with dimension tables using a 'dim customer' example: when a customer changes cities, updating the row in place silently discards where they used to live. To address this, he introduces slowly changing dimensions, specifically Type 2, which adds start and end date columns so that changes are tracked over time and past information isn't lost by future updates. Each change is logged as a new row, and current data is queried by filtering for a null end date, keeping an accurate record of both past and current states (see the table sketch after this timeline).

  • 00:10:00 - 00:18:13

    The video discusses alternative methods used by large companies like Facebook and Roblox for capturing historical data, such as creating partitions to snapshot data or summarizing large volumes of event data with wide date-list tables. These strategies aim to optimize performance and resource usage despite complexities, requiring a balance between granularity and system efficiency. Ben emphasizes the importance of adapting data models to fit both business and technical needs, teaching viewers how to manage large datasets and prevent data loss while ensuring system performance.
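
A minimal sketch of the Type 2 dimension described in the second segment, assuming illustrative names (dim_customer, start_date, end_date): when a customer moves, the current row is closed out with an end date and a new row is appended, rather than overwriting the city in place.

    CREATE TABLE dim_customer (
        customer_id   INT,
        customer_name VARCHAR(100),
        city          VARCHAR(100),
        start_date    DATE,            -- when this version became effective
        end_date      DATE             -- NULL while this version is current
    );

    -- John (customer 1) moves from NYC to Seattle on 2024-01-01:
    -- close the current row, then append a new one.
    UPDATE dim_customer
    SET    end_date = DATE '2024-01-01'
    WHERE  customer_id = 1 AND end_date IS NULL;

    INSERT INTO dim_customer (customer_id, customer_name, city, start_date, end_date)
    VALUES (1, 'John', 'Seattle', DATE '2024-01-01', NULL);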


Video Q&A

  • What are slowly changing dimensions in data modeling?

    They are a way of modeling dimension data so that changes over time are recorded without losing historical information.

  • How can you model data to capture historical changes?

    By using Slowly Changing Dimensions Type 2, or through other methods like storing snapshots of data at different times.

  • Why is data modeling important for historical data?

    Because dimension attributes change over time, such as a customer moving to a new city, and without deliberate modeling that history is overwritten and lost.

  • How does Slowly Changing Dimensions Type 2 work?

    Type 2 keeps historical data by adding a new row to the dimension table for each change and maintaining a start and end date for each version (see the query sketch after this Q&A list).

  • How do fact tables capture historical data?

    A fact table appends a new row for each event, capturing events as they happen, which inherently preserves historical data.

  • What alternative method does Facebook use for storing historical data?

    Ben describes partitioning the data by date and storing a full snapshot each day, which is feasible when storage is cheap (see the partition sketch after this Q&A list).

  • How do you decide the best method for modeling historical data?

    By balancing the need for detailed information with the technical constraints and performance of the database.

  • How can you efficiently model page visit frequency in large datasets?

    Instead of storing every visit event, aggregate the data into a date list per user (a dictionary or bit encoding) that tracks presence and visit counts compactly (see the date-list sketch after this Q&A list).
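
To make the Type 2 answers concrete, here is a sketch of the two query patterns the video describes, using the illustrative dim_customer and fct_sales tables from the earlier sketches: filter on a NULL end date for the current state, and join on the date range for historical reporting.

    -- Current state only: one row per customer.
    SELECT customer_id, customer_name, city
    FROM   dim_customer
    WHERE  end_date IS NULL;

    -- Historical reporting: attribute each sale to the city the customer
    -- lived in when the event happened.
    SELECT d.city, SUM(f.amount) AS total_sales
    FROM   fct_sales f
    JOIN   dim_customer d
      ON   f.customer_id = d.customer_id
     AND   f.event_timestamp >= d.start_date
     AND   (d.end_date IS NULL OR f.event_timestamp < d.end_date)
    GROUP  BY d.city;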
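
A sketch of the snapshot-per-partition approach described for Facebook, assuming Hive-style partitioning on a ds (date stamp) column; the table names dim_employee_snapshot and employee_learnings and the column course_id are illustrative, and the video does not name the exact internal macro used to resolve the latest ds.

    -- A full copy of the dimension is written each day into its own ds partition.
    CREATE TABLE dim_employee_snapshot (
        employee_id   INT,
        employee_name VARCHAR(100),
        role          VARCHAR(100)
    )
    PARTITIONED BY (ds STRING);        -- '2023-01-01', '2023-01-02', ...

    CREATE TABLE employee_learnings (
        employee_id INT,
        course_id   INT
    )
    PARTITIONED BY (ds STRING);

    -- Analysts pin every table in the query to the same snapshot date, so the
    -- join cannot fan out across partitions and no deduplication is needed.
    SELECT e.employee_id, e.role, l.course_id
    FROM   dim_employee_snapshot e
    JOIN   employee_learnings l
      ON   e.employee_id = l.employee_id
     AND   e.ds = l.ds
    WHERE  e.ds = '2023-01-03';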
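
A sketch of the date-list idea from the last answer, written in Spark-style SQL with a MAP column as an assumption (the video also mentions bit/zero-one encodings as an alternative); all table and column names here (fct_page_visits, user_visits_datelist) are illustrative.

    -- Assumed raw event table: one row per page visit.
    CREATE TABLE fct_page_visits (
        user_id         BIGINT,
        event_timestamp TIMESTAMP
    );

    -- One row per user, with visits folded into a date -> count map,
    -- instead of one row per page-visit event.
    CREATE TABLE user_visits_datelist (
        user_id     BIGINT,
        first_date  DATE,
        last_date   DATE,
        visit_dates MAP<STRING, INT>   -- e.g. {'2024-01-01': 5, '2024-01-03': 2}
    );

    INSERT OVERWRITE TABLE user_visits_datelist
    SELECT user_id,
           MIN(visit_date) AS first_date,
           MAX(visit_date) AS last_date,
           map_from_entries(collect_list(struct(CAST(visit_date AS STRING), visits))) AS visit_dates
    FROM (
        SELECT user_id,
               CAST(event_timestamp AS DATE) AS visit_date,
               COUNT(*)                      AS visits
        FROM   fct_page_visits
        GROUP  BY user_id, CAST(event_timestamp AS DATE)
    ) daily
    GROUP  BY user_id;

    -- Reading it back touches one row per user rather than billions of event rows.
    SELECT user_id, visit_dates['2024-01-03'] AS visits_that_day
    FROM   user_visits_datelist
    WHERE  last_date >= DATE '2024-01-03';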

Transcript (English)
  • 00:00:00
    so you've heard so much about data
  • 00:00:02
    modeling everywhere that it is important
  • 00:00:04
    if you are a data engineer and in fact
  • 00:00:06
    I've kind of unintentionally created
  • 00:00:08
    this little miniseries uh I've got two
  • 00:00:10
    other videos that you may have seen if
  • 00:00:12
    not you can watch them after this video
  • 00:00:14
    where I kind of give an intro to data
  • 00:00:16
    modeling we talk about important
  • 00:00:17
    Concepts as well as go over things like
  • 00:00:19
    fact tables and dimension tables and now
  • 00:00:21
    we're going to continue in this
  • 00:00:23
    miniseries to talk about historical data
  • 00:00:26
    or how to actually model data for
  • 00:00:28
    history and why it's important but
  • 00:00:30
    before diving in hey everyone welcome
  • 00:00:32
    back to another video with me Ben Rogan
  • 00:00:34
    aka the Seattle Data Guy as I said today we're
  • 00:00:37
    going to be talking about data modeling
  • 00:00:38
    specifically discussing things like
  • 00:00:40
    slowly changing dimensions and other
  • 00:00:42
    methods to essentially capture
  • 00:00:46
    historical data so let's dive
  • 00:00:49
    [Music]
  • 00:00:55
    in recently I also saw that uh Zach
  • 00:00:58
    Wilson uh put out a video where he also
  • 00:01:01
    discussed a few other videos including
  • 00:01:03
    tables such as cumulative tables which
  • 00:01:05
    is another way you can definitely model
  • 00:01:07
    uh historical data um but we're going to
  • 00:01:09
    be not talking about that today uh we're
  • 00:01:12
    going be talking about facts and
  • 00:01:13
    dimensions so if you're thinking about
  • 00:01:14
    historical fact data generally speaking
  • 00:01:16
    fact data really at least the way you
  • 00:01:18
    might be thinking about it is already
  • 00:01:20
    kind of capturing historical data right
  • 00:01:22
    because you're capturing events so let's
  • 00:01:23
    say you have this fact table here right
  • 00:01:26
    uh you might have a date uh date or date
  • 00:01:28
    ID here so it can connect date Dimension
  • 00:01:30
    table might have an event time stamp to
  • 00:01:32
    like actually capture the actual time
  • 00:01:34
    stamp of when it happened you know maybe
  • 00:01:36
    it's a user maybe it's a customer
  • 00:01:37
    purchasing a product and uh then you
  • 00:01:39
    have some other dimension IDs so
  • 00:01:40
    actually let me change this to customer
  • 00:01:41
    cuz I think that'll be more consistent so
  • 00:01:43
    you know maybe you're talking about a
  • 00:01:44
    customer buying a product and then every
  • 00:01:47
    event that occurs right you can think of
  • 00:01:48
    that just as like an insertion as
  • 00:01:50
    much as that says process I am just
  • 00:01:52
    viewing this as a row so you can just
  • 00:01:54
    you know this is event one here so this
  • 00:01:56
    is you know customer
  • 00:01:58
    one uh event one for them and then
  • 00:02:01
    you're just going to keep appending to
  • 00:02:02
    that right like it's just going to be an
  • 00:02:04
    appending appending situation right okay
  • 00:02:06
    there's event two now you've got another
  • 00:02:08
    maybe customer data uh another customer
  • 00:02:11
    ID coming in and servers event Etc so
  • 00:02:13
    historical data is captured just by the
  • 00:02:16
    data being in a row State and we are
  • 00:02:18
    going to talk about a different way you
  • 00:02:19
    can capture kind of this event data uh
  • 00:02:21
    at the end of this video so don't think
  • 00:02:23
    that's the only way you can essentially
  • 00:02:24
    capture things like fact data there's at
  • 00:02:26
    least one other way that I've used and
  • 00:02:27
    again we're just talking about ways that
  • 00:02:28
    I've used I'm sure there are other ways
  • 00:02:29
    people have used but this is what
  • 00:02:30
    I've seen across dozens of
  • 00:02:33
    companies and and used successfully so
  • 00:02:35
    now quick pause this is actually Ben
  • 00:02:37
    from the future one of the things I
  • 00:02:38
    realized uh as I was talking about fact
  • 00:02:41
    data you I was saying all that data just
  • 00:02:43
    kind of comes in and you append it uh
  • 00:02:45
    honestly it can really depend on how the
  • 00:02:48
    data comes in for example I worked at a
  • 00:02:50
    company where we would have to append
  • 00:02:52
    data at least in a few different ways
  • 00:02:54
    because there's the easy way which is
  • 00:02:55
    just appending it so every new event uh
  • 00:02:58
    you get you just append it so sometimes
  • 00:03:00
    this even takes care of things like
  • 00:03:01
    reverse outs or or kind of when you
  • 00:03:03
    maybe have something like a return or
  • 00:03:05
    maybe if you're like in healthcare maybe
  • 00:03:07
    one of the bills was disputed so you
  • 00:03:09
    have like a negative $50 on that bill so
  • 00:03:12
    you'll actually get that appended onto
  • 00:03:13
    it so that's one way where it's like
  • 00:03:15
    everything's appended you could just
  • 00:03:17
    have a replacement so sometimes you
  • 00:03:19
    might have the same ID so there might be
  • 00:03:21
    whatever that event ID and a new event
  • 00:03:23
    ID might come in and that might actually
  • 00:03:25
    just fully replace the prior one where
  • 00:03:27
    you delete the old data instead of
  • 00:03:29
    minusing $50 it's just the new amount
  • 00:03:31
    that you now owe or have to spend for
  • 00:03:33
    whatever it might be again could be
  • 00:03:35
    Healthcare so it is important to note
  • 00:03:37
    that although maybe how it looks like in
  • 00:03:39
    the table might always look the same
  • 00:03:41
    where it's just a bunch of events in a
  • 00:03:42
    row how it gets inserted can vary uh and
  • 00:03:46
    will change how you actually develop
  • 00:03:47
    your system on the other end so just
  • 00:03:49
    wanted to have that quick caveat and now
  • 00:03:50
    we can jump back to uh past Ben that's
  • 00:03:53
    kind of how you can think by a fact it's
  • 00:03:54
    very straightforward in that manner that
  • 00:03:56
    it's you know as each row comes in
  • 00:03:58
    you're capturing fact information
  • 00:04:00
    you're not losing it that's the key is
  • 00:04:02
    you're not losing it now the problem is
  • 00:04:03
    let's say you have got another dimension
  • 00:04:04
    table let's put together a dimension
  • 00:04:06
    table so let's say uh usually I think a
  • 00:04:10
    good one is like dim customer uh looks
  • 00:04:12
    like they're doing this way so I'm just
  • 00:04:13
    going to stick with this way dim
  • 00:04:16
    customer you know you've got customer
  • 00:04:19
    ID you know City let's see City where
  • 00:04:22
    they live it's going to be a varchar of
  • 00:04:24
    some
  • 00:04:25
    kind maybe date created
  • 00:04:30
    their name right I then you can just
  • 00:04:32
    ignore this last one so I'm just going
  • 00:04:33
    to
  • 00:04:34
    that so then actually it would probably
  • 00:04:37
    be
  • 00:04:39
    swapped so now you've got this uh dim
  • 00:04:42
    customer table and if we were to look
  • 00:04:44
    into this table so let's you know dive
  • 00:04:47
    in so let's say you've got this
  • 00:04:50
    table right doesn't have all the columns
  • 00:04:52
    but we really just need a few to show
  • 00:04:53
    you the example so you've got customer
  • 00:04:55
    ID customer name and City and this is
  • 00:04:58
    really all you need so what you'll have
  • 00:05:01
    happen here right is id1 customer named
  • 00:05:03
    John and they live in New York City
  • 00:05:06
    right uh just for the example we'll do
  • 00:05:08
    another one Jane
  • 00:05:12
    Seattle so we've captured this data
  • 00:05:15
    right we've pull this out from the
  • 00:05:16
    database and this is how you've
  • 00:05:18
    currently modeled the data set you've
  • 00:05:19
    got customer ID your name and City the
  • 00:05:21
    problem occurs that it someday in the
  • 00:05:23
    future this customer will likely move so
  • 00:05:26
    now let's say John also moves to Seattle
  • 00:05:28
    so now that they've moved and you would
  • 00:05:31
    like to maybe answer a question for
  • 00:05:33
    management let's say ask a question like
  • 00:05:35
    you know how many uh sales are we
  • 00:05:37
    getting from our customers that live uh
  • 00:05:40
    essentially in different areas right
  • 00:05:42
    like let's not consider where stores are
  • 00:05:43
    let's consider like they're doing this
  • 00:05:44
    by customer want to know how much uh
  • 00:05:47
    different customers from different
  • 00:05:48
    states purchase from our product now uh
  • 00:05:51
    when you do that report and you write
  • 00:05:53
    that query see this customer data John
  • 00:05:57
    will now go to Seattle right it's not
  • 00:05:59
    going to be captured that this customer
  • 00:06:01
    uh lived in New York that you don't have
  • 00:06:03
    that anywhere this this table cannot
  • 00:06:05
    capture that information and so you've
  • 00:06:07
    lost where they have lived in the past
  • 00:06:09
    and so you're losing information and so
  • 00:06:11
    this is where you actually have to think
  • 00:06:12
    about how you model historical data so
  • 00:06:14
    when you hear slowly changing Dimensions
  • 00:06:16
    what that really is is you trying to
  • 00:06:18
    capture and model change over time of
  • 00:06:21
    these Dimension uh data sets and in
  • 00:06:23
    general there there are a few different
  • 00:06:25
    types of slowly changing Dimensions I've
  • 00:06:27
    gone over them in the past this is
  • 00:06:28
    arguably the just switching you know
  • 00:06:30
    switching from Seattle to NYC or NYC to
  • 00:06:33
    Seattle is arguably type one generally
  • 00:06:35
    speaking I really have only seen either
  • 00:06:38
    type two or kind of type six some level
  • 00:06:40
    of type six I I don't think I've ever
  • 00:06:42
    seen it perfectly implemented for for
  • 00:06:43
    type six which means uh that type
  • 00:06:46
    essentially you could just think about
  • 00:06:48
    it that this would have at least two
  • 00:06:50
    more columns and we'll just add one more
  • 00:06:52
    Row for example and you've got your
  • 00:06:54
    start basically kind of like a start
  • 00:06:56
    effective date they might call it
  • 00:06:58
    effective date or something like that
  • 00:06:59
    and then end date I'm just doing this to
  • 00:07:01
    simplify it so what that will do is you
  • 00:07:03
    know for that
  • 00:07:05
    NYC you'll actually capture where that
  • 00:07:07
    person stops and starts essentially when
  • 00:07:09
    they live here and when they live
  • 00:07:10
    somewhere else so 2023-01-01 and let's say
  • 00:07:12
    they end at 2024 let's make this easy on
  • 00:07:15
    me
  • 00:07:16
    01-01 and then when you end
  • 00:07:20
    up re-adding in this data you'll
  • 00:07:23
    actually add a new row so this started
  • 00:07:26
    let's say 2024-01-01 and then this will be
  • 00:07:28
    null
  • 00:07:30
    and again you'd have the same thing here
  • 00:07:34
    where since this hasn't changed yet
  • 00:07:36
    they'd be null so what this does is
  • 00:07:38
    actually capture dimensional data
  • 00:07:40
    through history right so as as things
  • 00:07:42
    change you know you should only pull
  • 00:07:44
    data where there is essentially a null
  • 00:07:46
    end date so when you write this query
  • 00:07:48
    you'll say select from dim customer
  • 00:07:50
    where end date is null so you only pull
  • 00:07:53
    the most recent uh information so you're
  • 00:07:55
    not getting you know all this historical
  • 00:07:56
    duplicate data that's one kind of part
  • 00:07:59
    here with this Rule and then if you are
  • 00:08:01
    doing something where let's say you're
  • 00:08:03
    trying to report historically you might
  • 00:08:06
    actually add in this missing data right
  • 00:08:08
    John when he lived here in these various
  • 00:08:10
    places that way when you join this data
  • 00:08:12
    you're going to do something where you
  • 00:08:13
    do a between and likely you end up
  • 00:08:15
    joining it uh here on the date field so
  • 00:08:18
    in between you you consider all those
  • 00:08:19
    purchases for that specific user in that
  • 00:08:22
    set of dates so that it doesn't get
  • 00:08:23
    confused so that when you pull in city
  • 00:08:25
    for that specific User it's correctly um
  • 00:08:28
    being used and then obviously you're
  • 00:08:29
    going to have a bunch of duplicate
  • 00:08:30
    Fields but each of those should be for
  • 00:08:32
    the specific events so now you're going
  • 00:08:34
    at the like sales uh level versus the
  • 00:08:37
    customer level so you're going to have
  • 00:08:38
    multiple instances of John but in some
  • 00:08:40
    cases John will be living in New York City
  • 00:08:42
    and in other cases they'll be living in
  • 00:08:44
    Seattle which would make sense right
  • 00:08:46
    because when they moved you don't want
  • 00:08:48
    to report that data U moving forward but
  • 00:08:50
    you want to make sure you accurately
  • 00:08:51
    report the past now this is just one way
  • 00:08:54
    essentially to capture this specific
  • 00:08:56
    data set um I say that because at
  • 00:08:57
    Facebook we actually did it slightly
  • 00:08:59
    differently now at Facebook what we
  • 00:09:00
    ended up doing so this is again this is
  • 00:09:02
    one way you can kind of your standard
  • 00:09:03
    slowly changing Dimensions but at
  • 00:09:05
    Facebook the way we ended up doing it
  • 00:09:06
    let me just get a box I want a box so
  • 00:09:08
    let me get this the way we ended up
  • 00:09:11
    doing it is you could kind of say let's
  • 00:09:13
    say I want to change this over here so
  • 00:09:15
    this was essentially uh just a partition
  • 00:09:18
    where somewhere in a folder this was
  • 00:09:20
    essentially a file somewhere that was
  • 00:09:22
    like 2023-01-01 so on this date we had
  • 00:09:25
    all the data we just took a snapshot of
  • 00:09:27
    all employee information at that time
  • 00:09:30
    so let's just say 2023 and then again
  • 00:09:31
    this is kind of a folder system so you
  • 00:09:33
    can think about like above this is like
  • 00:09:36
    some folder called employee somewhere
  • 00:09:38
    because it's in like a Hive uh metastore
  • 00:09:41
    somewhere because it's in an HDFS
  • 00:09:43
    folder system somewhere and then
  • 00:09:45
    essentially what you can think from
  • 00:09:46
    there is that each one of these then had
  • 00:09:49
    you know for every date that continued
  • 00:09:50
    you would just have another one and we
  • 00:09:52
    would just store all the
  • 00:09:53
    data so each of the you know this would
  • 00:09:56
    be this first one here would be all the
  • 00:09:58
    data snapped on that day took a picture
  • 00:10:00
    of that specific day uh so with this one
  • 00:10:04
    right like for O2 that's how we capture
  • 00:10:05
    that data and 03 that would be that so
  • 00:10:07
    let's say it's all employee data from
  • 00:10:08
    that specific date now part of this is
  • 00:10:10
    possible because storage was rather
  • 00:10:12
    cheap and this kind of could make it
  • 00:10:14
    somewhat easy for analyst right because
  • 00:10:16
    what they end up doing on the query side
  • 00:10:18
    let me just take a little here is they
  • 00:10:20
    do something like
  • 00:10:21
    select you know obviously don't do
  • 00:10:23
    select star but we're doing select star
  • 00:10:25
    for an example uh select from uh
  • 00:10:27
    employee and we had a macro
  • 00:10:29
    that essentially let me just pin this
  • 00:10:31
    over here uh where we would often call
  • 00:10:35
    it DS was kind of your standard on every
  • 00:10:38
    table had it and we had a macro that
  • 00:10:39
    would essentially be that you could use
  • 00:10:42
    um there's a few of them one of them
  • 00:10:44
    which is basically your current DS but I
  • 00:10:45
    don't remember if it's I don't think
  • 00:10:46
    it's current DS I don't know what the
  • 00:10:47
    macro was someone who works at Facebook
  • 00:10:49
    maybe commented below it was it was
  • 00:10:51
    something that basically said today's
  • 00:10:52
    date um but you could also use uh your
  • 00:10:56
    own macros in dashboard so let's say for
  • 00:10:59
    example you you end up having to join
  • 00:11:00
    this to some other table join trying to
  • 00:11:03
    think of a good table that would be here
  • 00:11:05
    you know what let's do um what would you
  • 00:11:06
    do like employee learnings because
  • 00:11:08
    there's like a learnings table of like
  • 00:11:11
    like the courses you go through
  • 00:11:12
    internally so you know L whatever you
  • 00:11:14
    join this on both employee
  • 00:11:17
    ID uh you know L do employee ID you can
  • 00:11:20
    be mad at me for doing a one-letter uh alias
  • 00:11:23
    later uh and you have to join this
  • 00:11:27
    on um some sort of like what the current
  • 00:11:29
    DS was right you want to make sure you
  • 00:11:31
    weren't going beyond that um so e.DS
  • 00:11:35
    equals l.DS
  • 00:11:37
    and then maybe you had a uh dashboard
  • 00:11:40
    kind of filter here so literally it
  • 00:11:42
    would go back it would automatically
  • 00:11:43
    populate hey we're looking at you know
  • 00:11:45
    2023-01-03 and so as it went back and
  • 00:11:48
    proliferated through this join right
  • 00:11:49
    like it would automatically look at e.DS
  • 00:11:51
    let's say and when it did this join
  • 00:11:53
    up here right it would only pick one of
  • 00:11:54
    the dates automatically right like so
  • 00:11:56
    l.DS is only going to pick one date so
  • 00:11:57
    you're not going to get duplicate data
  • 00:11:58
    you're only only going to get let's say
  • 00:12:00
    again 202 this data set here uh and then
  • 00:12:02
    only the learn data from that data set
  • 00:12:04
    if that's your goal like that that might
  • 00:12:05
    have not been your goal this is just an
  • 00:12:07
    example um and that way you could kind
  • 00:12:09
    of have your own historical approach
  • 00:12:11
    right like well now if you want to
  • 00:12:12
    filter to a different date you just
  • 00:12:14
    filter it to a different DS right like
  • 00:12:16
    as you're kind of looking through this
  • 00:12:17
    uh overall report that still was
  • 00:12:19
    sometimes difficult to use though as you
  • 00:12:21
    can imagine like let's say we're kind of
  • 00:12:23
    looking at this example here
  • 00:12:25
    again uh and you're instead wanting to
  • 00:12:28
    think about like when employees change
  • 00:12:30
    roles so instead of this you have got
  • 00:12:32
    employee
  • 00:12:34
    ID name sure stays the same role is
  • 00:12:36
    probably the difference right like maybe
  • 00:12:38
    they're a data engineer here this is
  • 00:12:41
    another data engineer and this is now
  • 00:12:43
    like they become a project manager now
  • 00:12:45
    one way you could do this obviously
  • 00:12:46
    right is um have hundreds of of data
  • 00:12:50
    partitions and you have to figure out
  • 00:12:51
    where exactly that change right like if
  • 00:12:52
    you want to figure out let's say like a
  • 00:12:54
    count of how many days are in between
  • 00:12:56
    someone switching roles at Facebook like
  • 00:12:57
    the average count it's going to be
  • 00:12:58
    really difficult to to kind of calculate
  • 00:13:01
    here and have to like do a lot of manual
  • 00:13:03
    things where really all you'd have to do
  • 00:13:04
    here is do something where you say you
  • 00:13:06
    know from this employee table uh count
  • 00:13:08
    the number of days in between here and
  • 00:13:10
    then you can figure out the average from
  • 00:13:11
    there which would be far more
  • 00:13:12
    straightforward than trying to go
  • 00:13:13
    through all these partitions so we
  • 00:13:14
    actually still ended up needing to
  • 00:13:15
    create this um I often find you still
  • 00:13:18
    sometimes need to make some of these
  • 00:13:19
    tables where you're like a this maybe
  • 00:13:20
    this other method is more performant but
  • 00:13:22
    sometimes you have to still make this
  • 00:13:23
    trade-off where you still end up making
  • 00:13:25
    this this specific slowly changing
  • 00:13:27
    Dimension table to capture it cuz it
  • 00:13:28
    just just is more succinct like instead
  • 00:13:31
    of having every date partition ever even
  • 00:13:34
    though we did have a certain retention
  • 00:13:35
    period 90 days was like default so this
  • 00:13:37
    data would generally be deleted after 90
  • 00:13:39
    days but you've got all of that data or
  • 00:13:41
    you could just have you know essentially
  • 00:13:43
    two rows and that's all you really need
  • 00:13:44
    to know about John right you don't need
  • 00:13:46
    90 days of data you need two rows um of
  • 00:13:50
    data so we still often sometimes switch
  • 00:13:52
    to this when required that's just
  • 00:13:54
    something I like to compare against like
  • 00:13:56
    you know yes you'll see certain things
  • 00:13:58
    done at Large companies and there are
  • 00:13:59
    other benefits that we we'll talk about
  • 00:14:01
    here um in terms of like why a company
  • 00:14:03
    might do this there are performance
  • 00:14:04
    reasons it can be it's easier to query
  • 00:14:06
    right like instead of having to do this
  • 00:14:07
    between thing you can just write you
  • 00:14:09
    know tell me what the current data is
  • 00:14:11
    and oftentimes that's what we really
  • 00:14:12
    cared about was like current data um
  • 00:14:14
    also when data gets really big sometimes
  • 00:14:16
    you don't want to store as much and you
  • 00:14:18
    just want to store like the current date
  • 00:14:20
    and you don't necessarily want a large
  • 00:14:22
    data set and that's where I'm going to
  • 00:14:23
    go into the next example for that I'm
  • 00:14:25
    going to go to Roblox they covered this
  • 00:14:26
    really well so one of the examples we
  • 00:14:28
    talked about was fact data right and one
  • 00:14:30
    of the types of facts that you might try
  • 00:14:31
    to capture is something like page visits
  • 00:14:33
    like how many times people visit a page
  • 00:14:35
    um maybe you want to capture how many
  • 00:14:36
    times they visit that page in a day
  • 00:14:38
    right you got let's say six or seven or
  • 00:14:41
    eight or 100 uh page visits in a day and
  • 00:14:44
    that gets really big right when you
  • 00:14:46
    think about like Facebook you could have
  • 00:14:47
    millions of visits in a day and over
  • 00:14:50
    time that's going to get very very big
  • 00:14:52
    and maybe you want to be able to query
  • 00:14:53
    that data very effectively but still be
  • 00:14:55
    able to talk about historical data at
  • 00:14:57
    least in an aggregated fashion so as you
  • 00:14:59
    can see here like you can almost view
  • 00:15:01
    each of these almost like a user I'm
  • 00:15:03
    just going to treat this as like how
  • 00:15:04
    often someone maybe shows up to the
  • 00:15:06
    website on that day and this is already
  • 00:15:08
    kind of uh aggregated as well but you
  • 00:15:10
    can imagine like there's maybe five
  • 00:15:12
    visits in a day on some very very
  • 00:15:14
    granular table somewhere and that's
  • 00:15:16
    going to be very hard to query it's
  • 00:15:17
    going to be very big it's going to be
  • 00:15:18
    very expensive it's going to take a long
  • 00:15:19
    time especially if you only want to look
  • 00:15:21
    at a specific subset of users so another
  • 00:15:23
    way you can do this to model it is like
  • 00:15:24
    hey if we already know you're going to
  • 00:15:26
    filter by users why not create one very
  • 00:15:28
    wide table where you create what we
  • 00:15:30
    often reference as a date list and so
  • 00:15:33
    there's a few different ways you can do
  • 00:15:34
    it some people will implement it using
  • 00:15:36
    kind of bits so zeros and ones to kind
  • 00:15:38
    of Define often if someone showed up but
  • 00:15:40
    this is another way where you can use a
  • 00:15:42
    dictionary to say hey which days people
  • 00:15:43
    show up and how many times do they show
  • 00:15:45
    up so that way if you want to have you
  • 00:15:47
    know a billion users but you really only
  • 00:15:49
    want to pull out a specific set of them
  • 00:15:51
    right like maybe you're only interested
  • 00:15:52
    in us data at a time you can one really
  • 00:15:54
    quickly get that you're not having to go
  • 00:15:56
    through you know billions and billions
  • 00:15:57
    of rows just to pull all that data back
  • 00:15:59
    you just have to go through however many
  • 00:16:00
    billion people there are on your on your
  • 00:16:02
    service so 2 billion which is not that
  • 00:16:04
    big for for data sets versus you know as
  • 00:16:07
    you can imagine if you've got two
  • 00:16:08
    billion users the amount of data that
  • 00:16:10
    that has for events is huge you've now
  • 00:16:12
    shrunk that down uh to at most three
  • 00:16:15
    or four billion rows uh to filter out
  • 00:16:17
    and then you filter that filter out what
  • 00:16:19
    you want and then on top of that uh you
  • 00:16:21
    have all your data kind of nicely
  • 00:16:22
    aggregated um in a date list so you can
  • 00:16:25
    already just say hey now go into that
  • 00:16:27
    field and tell me how many visits for
  • 00:16:29
    this section of time right you see like
  • 00:16:31
    got first and last date for this section
  • 00:16:32
    of time did someone show up to our site
  • 00:16:35
    so it just drastically changes the way
  • 00:16:37
    you kind of track this event data this
  • 00:16:39
    is generally more pertinent to companies
  • 00:16:40
    with very large data sets right like cuz
  • 00:16:42
    this does add a technical bit of
  • 00:16:44
    complexity right you could argue that
  • 00:16:46
    this might be simpler to query for most
  • 00:16:48
    analysts whereas now if you have a
  • 00:16:50
    dictionary you have a little bit of
  • 00:16:51
    extra layer of technical complexity so
  • 00:16:53
    it is this trade-off where this is very
  • 00:16:55
    easy to query like hey I just want to
  • 00:16:57
    see you know certain dates certain date
  • 00:16:59
    range and you know some this field but
  • 00:17:01
    at a certain point there can be you know
  • 00:17:04
    this need to essentially have summarized
  • 00:17:07
    data that's one reason you might see
  • 00:17:08
    this data model this way um to track
  • 00:17:10
    historical information it also can lose
  • 00:17:12
    a little bit of information in this in
  • 00:17:14
    this manner because you know you're not
  • 00:17:15
    getting as much granularity but again
  • 00:17:17
    there's pros and cons to to both um
  • 00:17:19
    approaches here so hoping that's
  • 00:17:21
    helpful for those of you who are trying
  • 00:17:22
    to understand how to model historical
  • 00:17:24
    data one we've hopefully gone over why
  • 00:17:25
    it's important right like if you'll have
  • 00:17:27
    reports to do it helps you add a little
  • 00:17:29
    more context it helps ensure that you
  • 00:17:31
    don't lose information but also there
  • 00:17:32
    are reasons in terms of performance that
  • 00:17:33
    you might model your data differently
  • 00:17:35
    like half the reason that I think we do
  • 00:17:36
    slowly changing dimensions is because you
  • 00:17:38
    used to have limited space and you
  • 00:17:39
    couldn't just do you know every date
  • 00:17:41
    partition and so as much as it seems
  • 00:17:43
    like oh that's very easy just take a
  • 00:17:44
    snapshot every day and capture that
  • 00:17:46
    information not everyone can afford that
  • 00:17:48
    not everyone has unlimited compute
  • 00:17:50
    and storage and so it's really important
  • 00:17:52
    to always make sure your data model
  • 00:17:54
    doesn't just fit your needs in terms of
  • 00:17:56
    business needs but also your technical
  • 00:17:58
    needs you can't store uh infinite
  • 00:17:59
    data on Prem generally unless you are a
  • 00:18:02
    large company that has their own
  • 00:18:03
    internal cloud like Facebook but
  • 00:18:06
    hopefully that was helpful for all yall
  • 00:18:07
    um I went through a few different ways
  • 00:18:08
    you can kind of store historical data
  • 00:18:10
    and with that guys I'll see you all in
  • 00:18:11
    the next one thanks all goodbye
Tags
  • Data Modeling
  • Historical Data
  • Slowly Changing Dimensions
  • Fact Tables
  • Dimension Tables
  • Data Engineering
  • Facebook
  • Roblox
  • Data Storage
  • Data Aggregation