Data Modeling Where Theory Meets Reality - How Different Companies I Worked At Modeled Their Data

00:18:13
https://www.youtube.com/watch?v=rqLbn1PQPKA

摘要

TLDRIn this video, Ben Rogan discusses the importance of data modeling for capturing historical data, especially for data engineers. He introduces concepts such as slowly changing dimensions, which allow for tracking changes in data over time without losing historical context. Ben explains the use of fact tables in capturing events, which inherently stores historical information. He also talks about handling dimension tables that might lose history due to data changes, such as a customer's address, using Slowly Changing Dimensions Type 2 to maintain history with start and end dates. Additionally, he discusses the strategies employed by companies like Facebook and Roblox for managing large historical datasets. Facebook might use a snapshot method to store data at specific intervals, while Roblox can store page visits efficiently using date lists to avoid large data volumes. Ben emphasizes the balance needed between granular detail and technical feasibility, helping viewers understand different methods for modeling historical data and tailoring solutions to both business and technical needs.

心得

  • 🗂️ Understanding data modeling is crucial for data engineers.
  • 📈 Fact tables inherently capture historical data by recording events.
  • 🔄 Slowly changing dimensions help track changes over time without data loss.
  • ⚙️ Data modeling aids in answering business queries effectively with preserved historical context.
  • 🏢 Methods for capturing historical data vary across companies like Facebook and Roblox.
  • 💾 Large datasets require efficient storage strategies to manage historical data.
  • 📊 Effective data modeling balances business needs with technical constraints.
  • 🛠️ Different approaches cater to varying scenarios, such as customer moves or page visits.
  • 📜 Type 2 SCD keeps track of historical changes by using start and end dates.
  • 🌐 Not all companies can afford extensive data storage solutions; efficiency is key.

时间轴

  • 00:00:00 - 00:05:00

    Ben Rogan introduces the importance of data modeling for data engineers, highlighting previous videos in a miniseries that discuss crucial concepts. In this installment, he focuses on historical data modeling, specifically slowly changing dimensions and other methods for capturing history in data without losing information. The talk begins by outlining how fact tables capture events as they happen, using date IDs and timestamps, and explain how appending or replacing data can affect data retention, particularly when changes occur, such as customer movements.

  • 00:05:00 - 00:10:00

    Ben explains issues with dimension tables using an example of a 'dim customer' table, where capturing historical changes like customers changing cities can be problematic. To address this, he introduces the concept of slowly changing dimensions, specifically type two, which involves adding columns for start and end dates to track changes over time and ensure past information isn't lost during future updates. Techniques discussed include using new rows to log changes and querying current data with a filter for null in end dates to maintain an accurate record of past and current states.

  • 00:10:00 - 00:18:13

    The video discusses alternative methods used by large companies like Facebook and Roblox for capturing historical data, such as creating partitions to snapshot data or summarizing large volumes of event data with wide date-list tables. These strategies aim to optimize performance and resource usage despite complexities, requiring a balance between granularity and system efficiency. Ben emphasizes the importance of adapting data models to fit both business and technical needs, teaching viewers how to manage large datasets and prevent data loss while ensuring system performance.

思维导图

视频问答

  • What are slowly changing dimensions in data modeling?

    It involves creating a data structure that allows for changes in data over time without losing historical information.

  • How can you model data to capture historical changes?

    By using Slowly Changing Dimensions Type 2, or through other methods like storing snapshots of data at different times.

  • Why is data modeling important for historical data?

    To capture the historical data of dimensions when they change over time, such as a customer moving to a new city.

  • How does Slowly Changing Dimensions Type 2 work?

    Type 2 keeps historical data by adding a new row to the dimension table for each change and maintaining a start and end date for each period.

  • How do fact tables capture historical data?

    It appends new rows for each event, capturing events as they happen which inherently keeps historical data.

  • What alternative method does Facebook use for storing historical data?

    Ben discusses partitioning data and storing snapshots daily, which is feasible with large storage capacity.

  • How do you decide the best method for modeling historical data?

    By balancing the need for detailed information with the technical constraints and performance of the database.

  • How can you efficiently model page visit frequency in large datasets?

    Instead of storing every visit event, aggregate data into a dictionary or bit format to track presence and visits efficiently.

查看更多视频摘要

即时访问由人工智能支持的免费 YouTube 视频摘要!
字幕
en
自动滚动:
  • 00:00:00
    so you've heard so much about data
  • 00:00:02
    modeling everywhere that it is important
  • 00:00:04
    if you are a data engineer and in fact
  • 00:00:06
    I've kind of unintentionally created
  • 00:00:08
    this little minseries uh I've got two
  • 00:00:10
    other videos that you may have seen if
  • 00:00:12
    not you can watch them after this video
  • 00:00:14
    where I kind of give an intro to data
  • 00:00:16
    modeling we talk about important
  • 00:00:17
    Concepts as well as go over things like
  • 00:00:19
    back tables and dimension tables and now
  • 00:00:21
    we're going to continue in this
  • 00:00:23
    miniseries to talk about historical data
  • 00:00:26
    or how to actually model data for
  • 00:00:28
    history and why it's important but
  • 00:00:30
    before diving in hey everyone welcome
  • 00:00:32
    back to another video with me Ben Rogan
  • 00:00:34
    aka the C dat guy as I said today we're
  • 00:00:37
    going to be talking about data modeling
  • 00:00:38
    specifically discussing things like
  • 00:00:40
    slowly changing dimensions and other
  • 00:00:42
    methods to essentially capture
  • 00:00:46
    historical data so let's dive
  • 00:00:49
    [Music]
  • 00:00:55
    in recently I also saw that uh Zach
  • 00:00:58
    Wilson uh put out a video where he also
  • 00:01:01
    discussed a few other videos including
  • 00:01:03
    tables such as cumulative tables which
  • 00:01:05
    is another way you can definitely model
  • 00:01:07
    uh historical data um but we're going to
  • 00:01:09
    be not talking about that today uh we're
  • 00:01:12
    going be talking about facts and
  • 00:01:13
    dimension so if you're think about
  • 00:01:14
    historical fact data generally speaking
  • 00:01:16
    fact data really at least the way you
  • 00:01:18
    might be thinking about it is already
  • 00:01:20
    kind of capturing historical data right
  • 00:01:22
    because you're capturing events so let's
  • 00:01:23
    say you have this fact table here right
  • 00:01:26
    uh you might have a date uh date or date
  • 00:01:28
    ID here so it can connect date Dimension
  • 00:01:30
    table might have an event time stamp to
  • 00:01:32
    like actually capture the actual time
  • 00:01:34
    stamp of when it happened you know maybe
  • 00:01:36
    it's a user maybe it's a customer
  • 00:01:37
    purchasing a product and uh then you
  • 00:01:39
    have some other dimension IDs so
  • 00:01:40
    actually let me change this to customer
  • 00:01:41
    CU I think that'll be more consistent so
  • 00:01:43
    you know maybe you're talking about a
  • 00:01:44
    customer buying a product and then every
  • 00:01:47
    event that occurs right you can think of
  • 00:01:48
    that just as like an iner insertion as
  • 00:01:50
    much as that says process I am just
  • 00:01:52
    viewing this as a row so you can just
  • 00:01:54
    you know this is event one here so this
  • 00:01:56
    is you know customer
  • 00:01:58
    one uh event one for them and then
  • 00:02:01
    you're just going to keep appending to
  • 00:02:02
    that right like it's just going to be an
  • 00:02:04
    appending appending situation right okay
  • 00:02:06
    there's event two now you've got another
  • 00:02:08
    maybe customer data uh another customer
  • 00:02:11
    ID coming in and servers event Etc so
  • 00:02:13
    historical data is captured just by the
  • 00:02:16
    data being in a row State and we are
  • 00:02:18
    going to talk about a different way you
  • 00:02:19
    can capture kind of this event data uh
  • 00:02:21
    at the end of this video so don't think
  • 00:02:23
    that's the only way you can essentially
  • 00:02:24
    capture things like fact data there's at
  • 00:02:26
    least one other way that I've used and
  • 00:02:27
    again we're just talking about ways that
  • 00:02:28
    I've used I'm sure there are other ways
  • 00:02:29
    people people have used but this is what
  • 00:02:30
    I've seen I've seen across dozens of
  • 00:02:33
    companies and and used successfully so
  • 00:02:35
    now quick pause this has actually been
  • 00:02:37
    from the future one of the things I
  • 00:02:38
    realized uh as I was talking about fact
  • 00:02:41
    data you I was saying all that data just
  • 00:02:43
    kind of comes in and you append it uh
  • 00:02:45
    honestly it can really depend on how the
  • 00:02:48
    data comes in for example I worked at a
  • 00:02:50
    company where we would have to append
  • 00:02:52
    data at least in a few different ways
  • 00:02:54
    because there's the easy way which is
  • 00:02:55
    just appending it so every new event uh
  • 00:02:58
    you get you just append it so sometimes
  • 00:03:00
    this even takes care of things like
  • 00:03:01
    reverse outs or or kind of when you
  • 00:03:03
    maybe have something like a return or
  • 00:03:05
    maybe if you're like in healthcare maybe
  • 00:03:07
    one of the bills was disputed so you
  • 00:03:09
    have like a negative $50 on that bill so
  • 00:03:12
    you'll actually get that appended onto
  • 00:03:13
    it so that's one way where it's like
  • 00:03:15
    everything's appended you could just
  • 00:03:17
    have a replacement so sometimes you
  • 00:03:19
    might have the same ID so there might be
  • 00:03:21
    whatever that event ID and a new event
  • 00:03:23
    ID might come in and that might actually
  • 00:03:25
    just fully replace the prior one where
  • 00:03:27
    you delete the old data instead of
  • 00:03:29
    minusing $50 it's just the new amount
  • 00:03:31
    that you now owe or have to spend for
  • 00:03:33
    whatever it might be again could be
  • 00:03:35
    Healthcare so it is important to note
  • 00:03:37
    that although maybe how it looks like in
  • 00:03:39
    the table might always look the same
  • 00:03:41
    where it's just a bunch of events in a
  • 00:03:42
    row how it gets inserted can vary uh and
  • 00:03:46
    will change how you actually develop
  • 00:03:47
    your system on the other end so just
  • 00:03:49
    wanted to have that quick caveat and now
  • 00:03:50
    we can jump back to uh pass bed that's
  • 00:03:53
    kind of how you can think by a fact it's
  • 00:03:54
    very straightforward in that manner that
  • 00:03:56
    it's you know as each row comes in
  • 00:03:58
    you're capturing fact in information
  • 00:04:00
    you're not losing it that's the key is
  • 00:04:02
    you're not losing it now the problem is
  • 00:04:03
    let's say you have got another dimension
  • 00:04:04
    table let's put together a dimension
  • 00:04:06
    table so let's say uh usually I think a
  • 00:04:10
    good one is like dim customer uh looks
  • 00:04:12
    like they're doing this way so I'm just
  • 00:04:13
    going to stick with this way dim
  • 00:04:16
    customer you know you've got customer
  • 00:04:19
    ID you know City let's see City where
  • 00:04:22
    they live it's going to be a varchar of
  • 00:04:24
    some
  • 00:04:25
    kind maybe date created
  • 00:04:30
    their name right I then you can just
  • 00:04:32
    ignore this last one so I'm just going
  • 00:04:33
    to
  • 00:04:34
    that so then actually it would probably
  • 00:04:37
    be
  • 00:04:39
    swapped so now you've got this uh dim
  • 00:04:42
    customer table and if we were to look
  • 00:04:44
    into this table so let's you know dive
  • 00:04:47
    in so let's say you've got this
  • 00:04:50
    table right doesn't have all the columns
  • 00:04:52
    but we really just need a few to show
  • 00:04:53
    you the example so you've got customer
  • 00:04:55
    ID customer name and City and this is
  • 00:04:58
    really all you need so what you'll have
  • 00:05:01
    happen here right is id1 customer named
  • 00:05:03
    John and they live in New York City
  • 00:05:06
    right uh just for the example we'll do
  • 00:05:08
    another one Jane
  • 00:05:12
    Seattle so we've captured this data
  • 00:05:15
    right we've pull this out from the
  • 00:05:16
    database and this is how you've
  • 00:05:18
    currently modeled the data set you've
  • 00:05:19
    got customer ID your name and City the
  • 00:05:21
    problem occurs that it someday in the
  • 00:05:23
    future this customer will likely move so
  • 00:05:26
    now let's say John also moves to Seattle
  • 00:05:28
    so now that they've moved and you would
  • 00:05:31
    like to maybe answer a question for
  • 00:05:33
    management let's say ask a question like
  • 00:05:35
    you know how many uh sales are we
  • 00:05:37
    getting from our customers that live uh
  • 00:05:40
    essentially in different areas right
  • 00:05:42
    like let's not consider where stores are
  • 00:05:43
    let's consider like they're doing this
  • 00:05:44
    by customer want to know how much uh
  • 00:05:47
    different customers from different
  • 00:05:48
    states purchase from our product now uh
  • 00:05:51
    when you do that report and you write
  • 00:05:53
    that query SE this customer data John
  • 00:05:57
    will now go to Seattle right it's not
  • 00:05:59
    going to be captured that this customer
  • 00:06:01
    uh lived in New York that you don't have
  • 00:06:03
    that anywhere this this table cannot
  • 00:06:05
    capture that information and so you've
  • 00:06:07
    lost where they have lived in the past
  • 00:06:09
    and so you're losing information and so
  • 00:06:11
    this is where you actually have to think
  • 00:06:12
    about how you model historical data so
  • 00:06:14
    when you hear slowly changing Dimensions
  • 00:06:16
    what that really is is you trying to
  • 00:06:18
    capture and model change over time of
  • 00:06:21
    these Dimension uh data sets and in
  • 00:06:23
    general there there are a few different
  • 00:06:25
    types of slowly changing Dimensions I've
  • 00:06:27
    gone over them in the past this is
  • 00:06:28
    arguably the just switching you know
  • 00:06:30
    switching from Seattle to NYC or NYC to
  • 00:06:33
    Seattle is arguably type one generally
  • 00:06:35
    speaking I really have only seen either
  • 00:06:38
    type two or kind of type six some level
  • 00:06:40
    of type six I I don't think I've ever
  • 00:06:42
    seen it perfectly implemented for for
  • 00:06:43
    type six which means uh that type
  • 00:06:46
    essentially you could just think about
  • 00:06:48
    it that this would have at least two
  • 00:06:50
    more columns and we'll just add one more
  • 00:06:52
    Row for example and you've got your
  • 00:06:54
    start basically kind of like a start
  • 00:06:56
    effective date they might call it
  • 00:06:58
    effective date or something like that
  • 00:06:59
    and then end date I'm just doing this to
  • 00:07:01
    simplify it so what that will do is you
  • 00:07:03
    know for that
  • 00:07:05
    NYC you'll actually capture where that
  • 00:07:07
    person stops and starts essentially when
  • 00:07:09
    they live here and when they live
  • 00:07:10
    somewhere else so 2023 at1 and let's say
  • 00:07:12
    they end at 2024 let's make this easy on
  • 00:07:15
    me
  • 00:07:16
    a11 and then when you end
  • 00:07:20
    up read addding in this data you'll
  • 00:07:23
    actually add a new row so this started
  • 00:07:26
    let's say 20241 and then this will be
  • 00:07:28
    null
  • 00:07:30
    and again you'd have the same thing here
  • 00:07:34
    where since this hasn't changed yet
  • 00:07:36
    they'd be n so what this does is
  • 00:07:38
    actually capture dimensional data
  • 00:07:40
    through history right so as as things
  • 00:07:42
    change you know you should only pull
  • 00:07:44
    data where there is essentially a null
  • 00:07:46
    end date so when you write this query
  • 00:07:48
    you'll say select from dim customer
  • 00:07:50
    where end date is null so you only pull
  • 00:07:53
    the most recent uh information so you're
  • 00:07:55
    not getting you know all this historical
  • 00:07:56
    duplicate data that's one kind of part
  • 00:07:59
    here with this Rule and then if you are
  • 00:08:01
    doing something where let's say you're
  • 00:08:03
    trying to report historically you might
  • 00:08:06
    actually add in this missing data right
  • 00:08:08
    John when he lived here in these various
  • 00:08:10
    places that way when you join this data
  • 00:08:12
    you're going to do something where you
  • 00:08:13
    do a between and likely you end up
  • 00:08:15
    joining it uh here on the date field so
  • 00:08:18
    in between you you consider all those
  • 00:08:19
    purchases for that specific user in that
  • 00:08:22
    set of dates so that it doesn't get
  • 00:08:23
    confused so that when you pull in city
  • 00:08:25
    for that specific User it's correctly um
  • 00:08:28
    being used and then obviously you're
  • 00:08:29
    going to have a bunch of duplicate
  • 00:08:30
    Fields but each of those should be for
  • 00:08:32
    the specific events so now you're going
  • 00:08:34
    at the like sales uh level versus the
  • 00:08:37
    customer level so you're going to have
  • 00:08:38
    multiple instances of JN but in some
  • 00:08:40
    cases JN will be living in New York City
  • 00:08:42
    and in other cases they'll be living in
  • 00:08:44
    Seattle which would make sense right
  • 00:08:46
    because when they moved you don't want
  • 00:08:48
    to report that data U moving forward but
  • 00:08:50
    you want to make sure you accurately
  • 00:08:51
    report the past now this is just one way
  • 00:08:54
    essentially to capture this specific
  • 00:08:56
    data set um I say that because at
  • 00:08:57
    Facebook we actually did it slightly
  • 00:08:59
    differently now at Facebook what we
  • 00:09:00
    ended up doing so this is again this is
  • 00:09:02
    one way you can kind of your standard
  • 00:09:03
    slowly changing Dimensions but at
  • 00:09:05
    Facebook the way we ended up doing it
  • 00:09:06
    let me just get a box I want a box so
  • 00:09:08
    let me get this the way we ended up
  • 00:09:11
    doing it is you could kind of say let's
  • 00:09:13
    say I want to change this over here so
  • 00:09:15
    this was essentially uh just a partition
  • 00:09:18
    where somewhere in a folder this was
  • 00:09:20
    essentially a file somewhere that was
  • 00:09:22
    like 2023 01-01 so on this date we had
  • 00:09:25
    all the data we just took a snapshot of
  • 00:09:27
    all employee information at that time
  • 00:09:30
    so let's just say 2023 and then again
  • 00:09:31
    this is kind of a folder system so you
  • 00:09:33
    can think about like above this is like
  • 00:09:36
    some folder called employee somewhere
  • 00:09:38
    because it's in like a hive uh metast
  • 00:09:41
    store somewhere because it's in hdfs
  • 00:09:43
    folder system some somewhere and then
  • 00:09:45
    essentially what you can think from
  • 00:09:46
    there is that each one of these then had
  • 00:09:49
    you know for every date that continued
  • 00:09:50
    you would just have another one and we
  • 00:09:52
    would just store all the
  • 00:09:53
    data so each of the you know this would
  • 00:09:56
    be this first one here would be all the
  • 00:09:58
    data snapped on that day took a picture
  • 00:10:00
    of that specific day uh so with this one
  • 00:10:04
    right like for O2 that's how we capture
  • 00:10:05
    that data and 03 that would be that so
  • 00:10:07
    let's say it's all employee data from
  • 00:10:08
    that specific date now part of this is
  • 00:10:10
    possible because storage was rather
  • 00:10:12
    cheap and this kind of could make it
  • 00:10:14
    somewhat easy for analyst right because
  • 00:10:16
    what they end up doing on the query side
  • 00:10:18
    let me just take a little here is they
  • 00:10:20
    do something like
  • 00:10:21
    select you know obviously don't do
  • 00:10:23
    select star but we're doing select star
  • 00:10:25
    for an example uh select from uh
  • 00:10:27
    employee and we had a macro
  • 00:10:29
    that essentially let me just pin this
  • 00:10:31
    over here uh where we would often call
  • 00:10:35
    it DS was kind of your standard on every
  • 00:10:38
    table had it and we had a macro that
  • 00:10:39
    would essentially be that you could use
  • 00:10:42
    um there's a few of them one of them
  • 00:10:44
    which is basically your current DS but I
  • 00:10:45
    don't remember if it's I don't think
  • 00:10:46
    it's current DS I don't know what the
  • 00:10:47
    macro was someone who works at Facebook
  • 00:10:49
    maybe commented below it was it was
  • 00:10:51
    something that basically said today's
  • 00:10:52
    date um but you could also use uh your
  • 00:10:56
    own macros in dashboard so let's say for
  • 00:10:59
    example you you end up having to join
  • 00:11:00
    this to some other table join trying to
  • 00:11:03
    think of a good table that would be here
  • 00:11:05
    you know what let's do um what would you
  • 00:11:06
    do like employee learnings because
  • 00:11:08
    there's like a learnings table of like
  • 00:11:11
    like the courses you go through
  • 00:11:12
    internally so you know L whatever you
  • 00:11:14
    join this on both employee
  • 00:11:17
    ID uh you know L do employee ID you can
  • 00:11:20
    be mad at me for doing a onlet uh alas
  • 00:11:23
    later uh and you have to join this
  • 00:11:27
    on um some sort of like what the current
  • 00:11:29
    DS was right you want to make sure you
  • 00:11:31
    weren't going beyond that um so e. DS
  • 00:11:35
    equals l.
  • 00:11:37
    DS and then maybe you had a uh dashboard
  • 00:11:40
    kind of filter here so literally it
  • 00:11:42
    would go back it would automatically
  • 00:11:43
    populate hey we're looking at you know
  • 00:11:45
    2023 0103 and so as it went back and
  • 00:11:48
    proliferated through this join right
  • 00:11:49
    like it would automatically look at e.
  • 00:11:51
    DS let's say and when it did this join
  • 00:11:53
    up here right it would only pick one of
  • 00:11:54
    the dates automatically right like so
  • 00:11:56
    LDS is only going to pick one date so
  • 00:11:57
    you're not going to get duplicate data
  • 00:11:58
    you're only only going to get let's say
  • 00:12:00
    again 202 this data set here uh and then
  • 00:12:02
    only the learn data from that data set
  • 00:12:04
    if that's your goal like that that might
  • 00:12:05
    have not been your goal this is just an
  • 00:12:07
    example um and that way you could kind
  • 00:12:09
    of have your own historical approach
  • 00:12:11
    right like well now if you want to
  • 00:12:12
    filter to a different date you just
  • 00:12:14
    filter it to a different DS right like
  • 00:12:16
    as you're kind of looking through this
  • 00:12:17
    uh overall report that still was
  • 00:12:19
    sometimes difficult to use though as you
  • 00:12:21
    can imagine like let's say we're kind of
  • 00:12:23
    looking at this example here
  • 00:12:25
    again uh and you're instead wanting to
  • 00:12:28
    think about like when employees change
  • 00:12:30
    roles so instead of this you have got
  • 00:12:32
    employee
  • 00:12:34
    ID name sure stays the same role is
  • 00:12:36
    probably the difference right like maybe
  • 00:12:38
    they're a data engineer here this is
  • 00:12:41
    another data engineer and this is now
  • 00:12:43
    like they become a project manager now
  • 00:12:45
    one way you could do this obviously
  • 00:12:46
    right is um have hundreds of of data
  • 00:12:50
    partitions and you have to figure out
  • 00:12:51
    where exactly that change right like if
  • 00:12:52
    you want to figure out let's say like a
  • 00:12:54
    count of how many days are in between
  • 00:12:56
    someone switching roles at Facebook like
  • 00:12:57
    the average count it's going to be
  • 00:12:58
    really difficult to to kind of calculate
  • 00:13:01
    here and have to like do a lot of manual
  • 00:13:03
    things where really all you'd have to do
  • 00:13:04
    here is do something where you say you
  • 00:13:06
    know from this employee table uh count
  • 00:13:08
    the number of days in between here and
  • 00:13:10
    then you can figure out the average from
  • 00:13:11
    there which would be far more
  • 00:13:12
    straightforward than trying to go
  • 00:13:13
    through all these partitions so we
  • 00:13:14
    actually still ended up needing to
  • 00:13:15
    create this um I often find you still
  • 00:13:18
    sometimes need to make some of these
  • 00:13:19
    tables where you're like a this maybe
  • 00:13:20
    this other method is more performant but
  • 00:13:22
    sometimes you have to still make this
  • 00:13:23
    trade-off where you still end up making
  • 00:13:25
    this this specific slowly changing
  • 00:13:27
    Dimension table to capture it cuz it
  • 00:13:28
    just just is more succinct like instead
  • 00:13:31
    of having every date partition ever even
  • 00:13:34
    though we did have a certain retention
  • 00:13:35
    period 90 days was like default so this
  • 00:13:37
    data would generally be deleted after 90
  • 00:13:39
    days but you've got all of that data or
  • 00:13:41
    you could just have you know essentially
  • 00:13:43
    two rows and that's all you really need
  • 00:13:44
    to know about John right you don't need
  • 00:13:46
    90 days of data you need two rows um of
  • 00:13:50
    data so we still often sometimes switch
  • 00:13:52
    to this when required that's just
  • 00:13:54
    something I like to compare against like
  • 00:13:56
    you know yes you'll see certain things
  • 00:13:58
    done at Large companies and there are
  • 00:13:59
    other benefits that we we'll talk about
  • 00:14:01
    here um in terms of like why a company
  • 00:14:03
    might do this there are performance
  • 00:14:04
    reasons it can be it's easier to query
  • 00:14:06
    right like instead of having to do this
  • 00:14:07
    between thing you can just write you
  • 00:14:09
    know tell me what the current data is
  • 00:14:11
    and oftentimes that's what we really
  • 00:14:12
    cared about was like current data um
  • 00:14:14
    also when data gets really big sometimes
  • 00:14:16
    you don't want to store as much and you
  • 00:14:18
    just want to store like the current date
  • 00:14:20
    and you don't necessarily want a large
  • 00:14:22
    data set and that's where I'm going to
  • 00:14:23
    go into the next example for that I'm
  • 00:14:25
    going to go to Roblox they covered this
  • 00:14:26
    really well so one of the examples we
  • 00:14:28
    talked about was fact dat right and one
  • 00:14:30
    of the types of facts that you might try
  • 00:14:31
    to capture is something like page visits
  • 00:14:33
    like how many times people visit a page
  • 00:14:35
    um maybe you want to capture how many
  • 00:14:36
    times they visit that page in a day
  • 00:14:38
    right you got let's say six or seven or
  • 00:14:41
    eight or 100 uh page visits in a day and
  • 00:14:44
    that gets really big right when you
  • 00:14:46
    think about like Facebook you could have
  • 00:14:47
    millions of visits in a day and over
  • 00:14:50
    time that's going to get very very big
  • 00:14:52
    and maybe you want to be able to query
  • 00:14:53
    that data very effectively but still be
  • 00:14:55
    able to talk about historical data at
  • 00:14:57
    least in an aggregated fashion so as you
  • 00:14:59
    can see here like you can almost view
  • 00:15:01
    each of these almost like a user I'm
  • 00:15:03
    just going to treat this as like how
  • 00:15:04
    often someone maybe shows up to the
  • 00:15:06
    website on that day and this is already
  • 00:15:08
    kind of uh aggregated as well but you
  • 00:15:10
    can imagine like there's maybe five
  • 00:15:12
    visits in a day on some very very
  • 00:15:14
    granular table somewhere and that's
  • 00:15:16
    going to be very hard to query it's
  • 00:15:17
    going to be very big it's going to be
  • 00:15:18
    very expensive it's going to take a long
  • 00:15:19
    time especially if you only want to look
  • 00:15:21
    at a specific subset of users so another
  • 00:15:23
    way you can do this to model it is like
  • 00:15:24
    hey if we already know you're going to
  • 00:15:26
    filter by users why not create one very
  • 00:15:28
    wide table where you create what we
  • 00:15:30
    often reference as a date list and so
  • 00:15:33
    there's a few different ways you can do
  • 00:15:34
    it some people will implement it using
  • 00:15:36
    kind of bits so zeros and ones to kind
  • 00:15:38
    of Define often if someone showed up but
  • 00:15:40
    this is another way where you can use a
  • 00:15:42
    dictionary to say hey which days people
  • 00:15:43
    show up and how many times do they show
  • 00:15:45
    up so that way if you want to have you
  • 00:15:47
    know a billion users but you really only
  • 00:15:49
    want to pull out a specific set of them
  • 00:15:51
    right like maybe you're only interested
  • 00:15:52
    in us data at a time you can one really
  • 00:15:54
    quickly get that you're not having to go
  • 00:15:56
    through you know billions and billions
  • 00:15:57
    of rows just to pull all that data back
  • 00:15:59
    you just have to go through however many
  • 00:16:00
    billion people there are on your on your
  • 00:16:02
    service so 2 billion which is not that
  • 00:16:04
    big for for data sets versus you know as
  • 00:16:07
    you can imagine if you've got two
  • 00:16:08
    billion users the amount of data that
  • 00:16:10
    that has for events is huge you've now
  • 00:16:12
    shrunk that down uh to as at most three
  • 00:16:15
    or four billion rows uh to filter out
  • 00:16:17
    and then you filter that filter out what
  • 00:16:19
    you want and then on top of that uh you
  • 00:16:21
    have all your data kind of nicely
  • 00:16:22
    aggregated um in a date list so you can
  • 00:16:25
    already just say hey now go into that
  • 00:16:27
    field and tell me how many visits for
  • 00:16:29
    this section of time right you see like
  • 00:16:31
    got first and last date for this section
  • 00:16:32
    of time did someone show up to our site
  • 00:16:35
    so it just drastically changes the way
  • 00:16:37
    you kind of track this event data this
  • 00:16:39
    is generally more pertinent to companies
  • 00:16:40
    with very large data sets right like cuz
  • 00:16:42
    this does add a technical bit of
  • 00:16:44
    complexity right you could argue that
  • 00:16:46
    this might be simpler to query for most
  • 00:16:48
    analysts whereas now if you have a
  • 00:16:50
    dictionary you have a little bit of
  • 00:16:51
    extra layer of technical complexity so
  • 00:16:53
    it is this trade-off where this is very
  • 00:16:55
    easy to query like hey I just want to
  • 00:16:57
    see you know certain dates certain date
  • 00:16:59
    range and you know some this field but
  • 00:17:01
    at a certain point there can be you know
  • 00:17:04
    this need to essentially have summarized
  • 00:17:07
    data that's one reason you might see
  • 00:17:08
    this data model this way um to track
  • 00:17:10
    historical information it also can lose
  • 00:17:12
    a little bit of information in this in
  • 00:17:14
    this manner because you know you're not
  • 00:17:15
    getting as much granularity but again
  • 00:17:17
    there's pros and cons to to both um
  • 00:17:19
    approaches here so hopeing that's
  • 00:17:21
    helpful for those of you who are trying
  • 00:17:22
    to understand how to model historical
  • 00:17:24
    data one we've hopefully gone over why
  • 00:17:25
    it's important right like if you'll have
  • 00:17:27
    reports to do it helps you add a little
  • 00:17:29
    more context it helps ensure that you
  • 00:17:31
    don't lose information but also there
  • 00:17:32
    are reasons in terms of performance that
  • 00:17:33
    you might model your data differently
  • 00:17:35
    like half the reason that I think we do
  • 00:17:36
    SL changing Dimensions is because you
  • 00:17:38
    used to have limited space and you
  • 00:17:39
    couldn't just do you know every date
  • 00:17:41
    partition and so as much as it seems
  • 00:17:43
    like oh that's very easy just take a
  • 00:17:44
    snapshot every day and capture that
  • 00:17:46
    information not everyone can afford that
  • 00:17:48
    not everyone has in unlimited compute
  • 00:17:50
    and storage and so it's really important
  • 00:17:52
    to always make sure your data model
  • 00:17:54
    doesn't just fit your needs in terms of
  • 00:17:56
    business needs but also your technical
  • 00:17:58
    needs you can't store or uh infinite
  • 00:17:59
    data on Prem generally unless you are a
  • 00:18:02
    large company that has their own
  • 00:18:03
    internal cloud like Facebook but
  • 00:18:06
    hopefully that was helpful for all yall
  • 00:18:07
    um I went through a few different ways
  • 00:18:08
    you can kind of store historical data
  • 00:18:10
    and with that guys I'll see you all in
  • 00:18:11
    the next one thanks all goodbye
标签
  • Data Modeling
  • Historical Data
  • Slowly Changing Dimensions
  • Fact Tables
  • Dimension Tables
  • Data Engineering
  • Facebook
  • Roblox
  • Data Storage
  • Data Aggregation