How And Why Data Engineers Need To Care About Data Quality Now - And How To Implement It

00:16:06
https://www.youtube.com/watch?v=wvUiRHd47M0

Summary

TLDR: The video highlights the significance of maintaining data quality in businesses, especially with the increasing dependence on AI technologies. Poor data quality can lead to substantial financial losses and operational errors, as illustrated by the Google Bard example, where erroneous output affected the company's stock price. The video explains various data quality checks, such as range checks for numerical values, category checks for predefined value sets, and freshness checks for data timeliness. It also discusses systems for implementing these checks, ranging from simple solutions like Slack alerts to comprehensive platforms offered by companies such as DCube. The speaker emphasizes the importance of reliable, high-quality data for building trust and effectiveness in data-driven decision-making. Viewers are encouraged to assess their needs and resources when setting up these systems, understanding that data quality is essential for accurate business insights.

Takeaways

  • 📉 Poor data quality can lead to up to 20% revenue loss in companies.
  • 🛠 Range checks identify anomalies in numeric data ranges.
  • 🔠 Category checks prevent errors by ensuring expected value sets.
  • ⏰ Freshness checks maintain data relevance and timeliness.
  • 📊 Volume checks watch data input consistency to spot significant changes.
  • 🚀 Implementing data quality systems can be simple or complex.
  • 🔍 Specialized platforms can streamline data quality efforts.
  • 🔄 Continuous data quality tracking over time is vital.
  • 📡 Slack notifications offer a cost-effective data quality alert system.
  • 🎯 Custom systems can use scripts (e.g., Python) to automate checks.

Timeline

  • 00:00:00 - 00:05:00

    The discussion highlights the critical importance of data quality, especially as companies increasingly rely on AI. Bad data quality can result in significant financial losses. Examples like Google Bard show how inaccurate data can negatively affect stock prices. The speaker outlines various data quality checks, focusing on range and category checks to validate data consistency. They emphasize that these checks not only identify bad data but also help align processes to prevent erroneous actions.

  • 00:05:00 - 00:10:00

    The speaker continues to detail different types of data quality checks, such as data type, freshness, volume, and null checks. These checks ensure that data is timely, arrives in expected volumes, and is free of unexpected nulls, which helps maintain system accuracy. They discuss implementing these tests using simple tools like Slack messages for alerts, or developing customized data quality systems depending on a company's requirements and resources.

  • 00:10:00 - 00:16:06

    A deeper dive into more advanced systems like data quality dashboards and automated DQ operators, which provide continuous data oversight. The speaker explains building custom systems using SQL-based and code-based solutions for broader checks and tracking over time. They mention tools like DBT that offer built-in tests. As reliance on AI grows, maintaining high data quality becomes crucial to avoid making decisions based on incorrect data. The speaker encourages setting up data quality systems tailored to specific needs, enhancing data fidelity.


Frequently Asked Questions

  • Why is data quality important?

    Data quality is crucial because poor data can cause significant financial losses for companies and lead to decision-making errors, especially in systems that rely on AI.

  • What are range checks in data quality?

    Range checks involve ensuring numeric data falls within a specified range; if it doesn't, it may signal an error in data entry or processing.
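    As a rough illustration, here is a minimal range-check sketch in Python; the bounds and sample values are hypothetical placeholders you would replace with data from your own system.

```python
# Minimal range-check sketch (illustrative; bounds and sample data are hypothetical).

EXPECTED_MIN = 0        # transactions should never be negative
EXPECTED_MAX = 10_000   # anything above this is worth a human look

def check_range(values, low=EXPECTED_MIN, high=EXPECTED_MAX):
    """Return the values that fall outside the expected numeric range."""
    return [v for v in values if v < low or v > high]

if __name__ == "__main__":
    amounts = [120.50, 980.00, 200_000.00, 45.99]  # stand-in for real data
    outliers = check_range(amounts)
    if outliers:
        print(f"Range check failed: {len(outliers)} value(s) out of bounds: {outliers}")
```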

  • How can I implement data quality checks with minimal resources?

    You can use simple Slack notifications that send alerts if data quality checks fail. This approach is cost-effective and straightforward for small datasets.
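    A minimal sketch of this approach, assuming you have already created a Slack incoming-webhook URL (the URL below is a placeholder) and collected the names of any failed checks:

```python
import requests  # third-party; pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert_on_failures(failed_checks):
    """Post a single Slack message listing any failed data quality checks."""
    if not failed_checks:
        return
    message = "Data quality failures:\n" + "\n".join(f"- {name}" for name in failed_checks)
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    resp.raise_for_status()

# Example usage at the end of a pipeline run:
# alert_on_failures(["orders.amount range check", "customers.state category check"])
```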

  • What does a category check involve?

    A category check ensures that data falls within defined categories, such as state abbreviations or other limited sets of values, preventing anomalies.
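    For illustration, a small accepted-values check; the allowed set shown here is a truncated, hypothetical list of state abbreviations:

```python
# Accepted-values ("category") check sketch: flag any value outside a known set.
VALID_STATES = {"AL", "AK", "AZ", "CA", "NY", "TX", "WA"}  # truncated for brevity

def check_categories(values, allowed=VALID_STATES):
    """Return the distinct values that are not in the allowed set."""
    return sorted(set(values) - allowed)

if __name__ == "__main__":
    states = ["CA", "NY", "ZZ", "TX", "Q1"]  # stand-in for a column from the warehouse
    unexpected = check_categories(states)
    if unexpected:
        print(f"Category check failed, unexpected values: {unexpected}")
```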

  • Why are freshness checks necessary in data quality?

    Freshness checks ensure that the data is up-to-date, which is crucial for accurate reporting and timely decision-making.
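    A minimal freshness-check sketch; the 24-hour threshold and the last-load timestamp are illustrative stand-ins for values you would pull from your warehouse:

```python
# Freshness check sketch: warn if the newest record is older than a threshold.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # tune per table / per SLA

def check_freshness(last_loaded_at: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the data is fresh enough, False if it is stale."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age

if __name__ == "__main__":
    # Stand-in for something like: SELECT MAX(loaded_at) FROM key_table
    last_loaded_at = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
    if not check_freshness(last_loaded_at):
        print(f"Freshness check failed: last load was {last_loaded_at.isoformat()}")
```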

  • What is a volume check?

    Volume checks monitor the amount of data being processed or received daily to identify significant deviations that might indicate an issue.
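    A rough sketch of a volume check that compares today's row count against a recent daily average; the tolerance factors are arbitrary and would need tuning per table:

```python
# Volume check sketch: compare today's row count against a recent daily average.
def check_volume(todays_rows: int, recent_daily_counts: list[int],
                 low_factor: float = 0.5, high_factor: float = 3.0) -> bool:
    """Return True if today's volume sits within a tolerable band of the recent average."""
    baseline = sum(recent_daily_counts) / len(recent_daily_counts)
    return low_factor * baseline <= todays_rows <= high_factor * baseline

if __name__ == "__main__":
    history = [98_000, 102_000, 99_500, 101_200, 100_300]  # stand-in for past daily loads
    if not check_volume(todays_rows=10, recent_daily_counts=history):
        print("Volume check failed: today's row count deviates sharply from the baseline")
```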

  • How can you create a system for data quality checks?

    Systems can be created using scripting languages like Python to automate checks and corresponding alerts, or by employing dedicated data quality platforms.
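    One lightweight pattern, sketched below, is to register each check as a named Python function and loop over them after a job finishes; the table, fields, and checks here are hypothetical:

```python
from datetime import date

# Each check is a named function that takes a table's rows and returns True on pass.
def no_future_dates(rows):
    return all(r["order_date"] <= date.today() for r in rows)

def positive_amounts(rows):
    return all(r["amount"] > 0 for r in rows)

CHECKS = {
    "orders: no future dates": no_future_dates,
    "orders: positive amounts": positive_amounts,
}

def run_checks(rows):
    """Run every registered check and return the names of the ones that failed."""
    return [name for name, check in CHECKS.items() if not check(rows)]

if __name__ == "__main__":
    sample = [
        {"order_date": date(2024, 1, 2), "amount": 120.0},
        {"order_date": date(2024, 1, 3), "amount": -5.0},
    ]
    failed = run_checks(sample)
    if failed:
        print("Failed checks:", failed)  # hand this list to your alerting, e.g. Slack
```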

  • What are null tests in data quality?

    Null tests ensure that fields expected to contain data do not have null values or only allow a certain percentage of nulls.
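    A minimal null-test sketch with a configurable allowed-null percentage; the field name and threshold are illustrative:

```python
# Null-test sketch: fail if the share of missing values in a field exceeds a threshold.
def check_nulls(values, max_null_fraction: float = 0.0) -> bool:
    """Return True if the fraction of None values is within the allowed threshold."""
    if not values:
        return False  # an empty column is treated as a failure here
    null_fraction = sum(v is None for v in values) / len(values)
    return null_fraction <= max_null_fraction

if __name__ == "__main__":
    emails = ["a@example.com", None, "b@example.com", None]
    # Allow up to 10% nulls in this hypothetical field
    if not check_nulls(emails, max_null_fraction=0.10):
        print("Null check failed: too many missing emails")
```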

  • Can existing platforms help with data quality checks?

    Yes, platforms like DCube provide features for data observability and governance, helping automate and standardize data quality checks.

  • What are the advantages of using a specialized data quality platform?

    Specialized platforms offer comprehensive solutions, including tracking data quality over time, and help teams without extensive resources maintain high data quality.

Transcript
  • 00:00:00
    Depending on what metric you look up, companies are supposedly losing as much as 20%, if not more, due to bad data quality. We all know everyone loves to talk about how important data quality is, especially as we head into a world that seems to be gung-ho about AI, yet we have many examples, like the Google Bard example, where erroneous data was put out and it impacted the company's stock price heavily. So in this video I wanted to talk about data quality and how we actually implement data quality checks into systems. We're going to talk about different types of data quality checks, how you create systems that check data quality, how you design those systems, and why you might use one over the other, so you could even build some of your own. The point here is that data quality, in my opinion, is very important, but we often rush past it because it's not as sexy as building a data pipeline.
  • 00:00:56
    Let's first talk about how you can actually check your data. If you've been in the data world for a while you may have seen some of these checks; maybe you haven't, but here are some of the key ones you'll generally see. First, range checks, or at least that's what I call them: data range checks. These are generally focused on numeric data sets where you assume the numbers should fall within a certain range. There are a few different ways you might implement this check. For example, you might use a model that detects anomalies, so if a value exceeds or drops below what's typical, it flags it. Say you're used to only seeing $1,000 transactions in your system and you suddenly see a million-dollar transaction; you'd want to check that out. In fact, this actually happened to me at one company, where we suddenly started seeing $200,000 transactions. It wasn't even bad data in that case; the process was wrong, and someone was being allowed to spend $200,000 where they shouldn't have. So data quality checks don't always catch bad data; sometimes they catch erroneous steps in the process that need to be checked. That's your essential range check: there's a range from X to Y where you expect numbers to fall, and if they exceed those expectations, you should probably investigate.
  • 00:02:19
    Another common check is what I call category checks. When I say category, I mean there are entities or values that belong to a category, or maybe you're just expecting a certain set of values that never changes. The clearest example I always think of is states, or state abbreviations. They're not necessarily categories, but they're a limited set of values, and you shouldn't get anything outside that set. I say that because, as someone who has implemented a state abbreviation check, I've actually received states that were non-existent. I assumed the system on the other side did some sort of data quality check or had a dropdown, so how could you possibly get anything that wasn't a state abbreviation? And yet we got at least a handful, I think four or five abbreviations, that didn't exist, and they came into our system from the operational system. So you often need a check for this. I call these category checks because sometimes the way to think about it is that there are certain types of actions or events that can occur in your system, and you only expect those events. PTO is another example: we had various types of PTO, and every once in a while we would get a new type because the operational team wouldn't tell us they had created it, no matter what we asked. So we had to create a check that basically yelled at us whenever we saw something that didn't match a category we were expecting.
  • 00:03:47
    Quick pause, everyone. I just want to say thank you so much to our sponsor today, DCube. DCube is a platform that helps companies build trust and governance in their data and AI products. As we gear towards LLMs and AI products, high-quality and trustworthy data are extremely critical to meet business goals and objectives. Don't solve the problem in silos; review the overarching goal of how we as data teams can deliver trusted data access across the value chain. With DCube's lineage and incident details, know where an incident took place and understand its impact on downstream assets. DCube is not limited to databases or warehouses; it observes the data pipeline, including dbt jobs, Fivetran, and Airflow, to extract job runs, lineage, and incidents along with logs. DCube is a truly unified data platform which manages data observability, discovery, and governance all in one. Their tagline is "DCube: observe, discover, govern." Start your journey towards trust and governance today; you can visit dcu.org.
  • 00:05:00
    The next common check is the data type check: you know that should be a date field, that should be an integer field. It's really important that you check your data types, because sometimes, especially if you're loading data from headerless files, so files that don't have a header row, you might not get the right column order every time. I've had a company that I worked with heavily to send me a CSV the same way every time, and for six months it was fine: we got the same data every time, in the same set of rows and columns, and everything was perfect. Then, for some reason, someone quits or someone changes the automated script, and suddenly you're getting a different field where you expected a date. So it's great to have a quick data type check to confirm: are these dates, are these the values I'm expecting? People do ship these files around, and you do not want to figure that out after you've loaded the data into staging; hopefully you can catch it at raw.
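    To make this concrete, here is a minimal sketch of a data type check for a headerless CSV; the file name, column order, and expected formats are assumptions you would adapt to your own feed.

```python
# Data-type check sketch for a headerless CSV, where column order can silently drift.
import csv
from datetime import datetime

def is_date(value):
    """Assumed date format: YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False

def is_decimal(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

# Expected column order for the headerless feed (hypothetical schema).
EXPECTED = [("order_date", is_date), ("order_id", is_int), ("amount", is_decimal)]

def check_types(path, sample_rows=100):
    """Return the column names whose sampled values do not match the expected type."""
    bad = set()
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= sample_rows:
                break
            for (name, is_valid), value in zip(EXPECTED, row):
                if not is_valid(value):
                    bad.add(name)
    return sorted(bad)

# Example: if check_types("daily_feed.csv") is non-empty, the column order or
# formats have likely drifted, so stop before loading past the raw layer.
```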
  • 00:05:54
    Another common check is what is known as a freshness check. This is interesting: if you bring up the pillars of data quality, one of the things often referenced is timeliness, because data isn't just "accurate" in the sense of being right as of yesterday. In a lot of companies, people know that certain transactions have occurred, and if your system isn't as fresh or up to date as an executive expects, they're going to be a little frustrated when they look at a report and see that a number hasn't been updated in two days. More likely, they won't even know that's the cause; they'll just know they don't see a transaction that should be captured, and they'll say your system isn't working correctly because it didn't capture a transaction that happened 30 seconds ago. That's why there are data freshness checks that tell you how fresh the data is. Usually you'll have some sort of warning that fires if freshness goes beyond a certain level, and on the other side, on your dashboards, you should show an "updated as of" date so people at least know what the data represents.
  • 00:07:03
    Another very useful test that I've seen pay off multiple times is the volume test. This essentially says: have a check that looks at how much data is actually being loaded, simply the count of rows loaded per day. Generally, a company should see a similar number of rows on most days, except perhaps on weekends depending on the type of company, and if you suddenly see three, four, or five times that number, there's a problem somewhere. I've seen that happen multiple times across multiple data sets: a system somewhere goes wrong, generally an operational system where something changed, and suddenly you're seeing massively more data. That should set off alarm bells: why am I seeing more data than yesterday, is something wrong in an operational system? In fact, we've seen a new feature get released and suddenly we're getting ten times the amount of data coming in when we shouldn't be, ten times the transactions, because something in the operational system wasn't created correctly and was firing off way too many requests that we were now tracking. This is a great sanity check: if you're seeing a wildly higher or lower number of rows, something is more than likely wrong.
  • 00:08:15
    For now I'll cover one more test that you should definitely implement, which is the null test. Generally you'll implement it one of two ways: either you set it so there can't be any nulls at all, or you allow a certain percentage of the field to be null, and then maybe you put some sort of filler value into that field when required, depending on the field, because nulls act weird and you need to make sure you understand that. So null tests are also super valuable.
  • 00:08:42
    Now that we've talked about some of the types of tests you might implement, let's talk about how you actually implement them: what system do you create that lets you know something has gone wrong? The truth is that different companies need different levels of checks. Some companies I've worked for are fine with the very light checks I'm going to discuss, very simple ways you can implement this; others want to develop entire systems built around data quality; and still others look for out-of-the-box solutions, like our sponsor today, to cover a lot of this because they don't have an army of engineers to build these solutions. So let's talk about some of these options.
  • 00:09:16
    Let's start with the easiest thing to implement, I think, which is Slack messages. I've had to do this for some of my clients where they didn't want a complex system; they knew they had a very limited set of data sets, so instead of building a complex system that would cost them a lot of money, we created an automated set of checks that would run at the end of all of their key jobs. It would simply run all of the checks, and if any of them failed, it would send a Slack message with a list of failures so the data engineer on call could see them. It's not super fancy; honestly, it was just a bunch of unions for all of these checks, and it's also not very generalized. I had to write each one individually, versus what we'll talk about later where you create a generalized system. But if you only have ten checks, that's all you're really running, you're not planning to add more, and it supports everything you need, that could be sufficient. I always think it's important to recognize there are costs and trade-offs, and you need to find the system that works best for you based on what decisions are being made off that data and how much budget you're willing to spend.
  • 00:10:22
    Another way a lot of people implement data quality checks, and this one can be combined with the Slack messages, is a data quality dashboard. I see this a lot with volume checks, freshness checks, and things of that nature, where you have some high-level metrics on key tables letting you know which key tables have been loaded and when, so you can see red flashing indicators if a table goes beyond, say, 24 hours without an update, and volume checks showing how many rows you're getting for those key tables, so you can confirm you're still seeing the right number. This matters because the data might be updated, but maybe the update is only 10 rows and you're expecting 100,000. You want a few different ways to see on this data quality dashboard where something could have gone wrong, and you might include some of the other checks we talked about before, like null checks, all on the dashboard. The problem is that you have to know where to look and actually go through it all, so it's nice to combine it with the Slack alerts: one place where checks surface when they occur, and another place where, if you need to dig in, you have a more live interface with the dashboard. One is automated, and one is more of a live view that's running all the time.
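    As a rough sketch of what feeds such a dashboard, the query below pulls a last-load timestamp and today's row count for each key table; it uses sqlite3 only so the example is self-contained, and the `loaded_at` column and table names are assumptions.

```python
# Sketch of the metrics behind a simple data quality dashboard:
# last-load time and today's row count per key table.
import sqlite3

KEY_TABLES = ["orders", "customers"]  # hypothetical key tables

def dashboard_metrics(conn):
    """Return (table, last_loaded_at, rows_today) for each key table."""
    metrics = []
    for table in KEY_TABLES:
        last_loaded, rows_today = conn.execute(
            f"SELECT MAX(loaded_at), "
            f"SUM(CASE WHEN DATE(loaded_at) = DATE('now') THEN 1 ELSE 0 END) "
            f"FROM {table}"
        ).fetchone()
        metrics.append((table, last_loaded, rows_today or 0))
    return metrics

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Tiny stand-in tables so the sketch runs end to end.
    for table in KEY_TABLES:
        conn.execute(f"CREATE TABLE {table} (id INTEGER, loaded_at TEXT)")
        conn.execute(f"INSERT INTO {table} VALUES (1, DATETIME('now'))")
    for table, last_loaded, rows_today in dashboard_metrics(conn):
        print(f"{table}: last load {last_loaded}, rows today {rows_today}")
```

    The dashboard layer would then flag tables whose last load is older than, say, 24 hours or whose row count today is far below the usual volume.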
  • 00:11:31
    Taking a step further, as I said, you can develop full data quality systems, and earlier I mentioned these are very generalized. In my experience building them, what you essentially end up with is some sort of Python script, or another script you've built, that abstracts away the fact that you're checking something. You write SQL queries that you can pass in, and you keep a whole list of them, either stored in a folder somewhere or in your database. You'd record what type of check it was, say a range check, and maybe how many failures you'd allow, as columns on each row in your system. That way, when the automated system runs, it picks up these queries, and instead of creating an individual job for each one, you've already abstracted that away; every time you add a new check you just add a new row to your table, rather than writing a new query and changing code in the system. You're just inserting it into a table somewhere. That's one way I see a lot of people do it. The other thing they may or may not do is track the output, and that's the big thing the systems you pay for, especially a lot of vendors, do: they track the change over time, how well your data is behaving over time. You'll see this in many systems where they track how healthy a table is and how many failures it typically has, which you may or may not do if you build your own system, because it takes more time. But that's generally what it is: a SQL-based system with some sort of code-based wrapper. I said Python earlier, but it could be any code you end up using; we were honestly using PowerShell. That wrapper runs everything, loops through it all, tracks it all, and generally saves or outputs the results somewhere you can see them later. That's more of the in-house developed system.
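    A minimal sketch of this metadata-driven pattern, where check definitions live as rows in a table and a small wrapper loops over them; sqlite3 and the table/column names here are used only to keep the example self-contained.

```python
# Sketch of the "checks live in a table, not in code" pattern described above.
# Each row defines a SQL query returning a failure count plus a tolerated maximum;
# adding a new check means inserting a row, not changing the runner code.
import sqlite3

def setup_demo(conn):
    conn.executescript("""
        CREATE TABLE dq_checks (name TEXT, check_sql TEXT, max_failures INTEGER);
        CREATE TABLE orders (id INTEGER, amount REAL, state TEXT);
        INSERT INTO orders VALUES (1, 120.0, 'CA'), (2, -5.0, 'ZZ');
        INSERT INTO dq_checks VALUES
          ('orders: negative amounts', 'SELECT COUNT(*) FROM orders WHERE amount < 0', 0),
          ('orders: unknown state',
           'SELECT COUNT(*) FROM orders WHERE state NOT IN (''CA'',''NY'',''TX'')', 0);
    """)

def run_registered_checks(conn):
    """Execute every registered check and return (name, failures) for those over budget."""
    results = []
    for name, check_sql, max_failures in conn.execute("SELECT * FROM dq_checks"):
        failures = conn.execute(check_sql).fetchone()[0]
        if failures > max_failures:
            results.append((name, failures))
    return results

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    setup_demo(conn)
    for name, failures in run_registered_checks(conn):
        print(f"FAILED {name}: {failures} offending rows")
```

    The runner's output could then be written back to a results table to track failures per table over time, which is the part vendors typically add on top.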
  • 00:13:22
    Another way you can run data quality checks, and this is how we did it at Facebook, is to have a set of DQ operators, short for data quality operators. If you've used Airflow, everything is referenced as operators, and Facebook uses something similar to Airflow, so everything there is referenced as operators too. Data quality operators are essentially pre-built tasks that you can run, and they automatically feed into the tracking system. I mentioned earlier you'd have to build that tracking yourself if you did it on your own, but we had abstracted it to the point where you could just reference a DQ operator and it would automatically feed into our whole data catalog: you could see the data quality checks, you could see the health, and it was all there in one place. It was abstracted to the point where you pretty much just put the query there, set expectations, and it would run and tell you; you just had to set whether the check should fail or succeed based on certain parameters. That's a great approach, especially once you're far enough along and have enough engineers to build your own system.
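    The Facebook tooling described here is internal, but since the speaker compares it to Airflow, here is a hedged sketch of the same idea expressed as a plain Airflow task: a check step that runs right after a load and fails the pipeline when it finds offending rows. The DAG name, task names, and placeholder check logic are assumptions, not the speaker's actual setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_orders():
    """Placeholder for the actual load step."""
    pass

def dq_check_orders():
    """Placeholder check: count offending rows and fail the task if any are found."""
    offending_rows = 0  # e.g. run a SQL check here and count failures
    if offending_rows > 0:
        raise ValueError(f"DQ check failed: {offending_rows} offending rows")

with DAG(
    dag_id="orders_pipeline",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,            # run manually for this sketch
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    check = PythonOperator(task_id="dq_check_orders", python_callable=dq_check_orders)
    load >> check                      # the check runs immediately after the load
```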
  • 00:14:23
    Of course, there are also solutions like dbt that come with dbt tests, and you can check some of this out even there. There are some basic ones you can include; for example, they have tests such as unique, not_null, and accepted_values, which, as the names suggest, check whether a value is unique, cover the not-null factor we talked about earlier, and so on. You can also create your own generic tests, so as these tables are built you run dbt tests as well, if you're building these models in dbt. Again, that is limited by the fact that you're using dbt; that's the only place it's going to work.
  • 00:14:56
    As referenced before, data quality is becoming more and more important as we want to make big decisions and build automated systems in the future with all of this technology, AI, and so on. It pushes the need for higher data quality, because we don't want bad things to happen when we rely on these systems more and more in the real world; that data needs to be of the highest quality. As we discussed, there's a ton of data quality checks you can run, everything from uniqueness checks to range checks to not-null checks, and there are a lot of different ways you can implement those systems, whether it's just a few Slack messages that yell at you to let you know something's gone wrong, an entire system you've built, or one you've purchased, like our sponsor today; again, thank you DCube for sponsoring this video. With that, I really hope you've learned how you can set up your own data quality systems, whether you're using something like SQL or Python to build one out quickly, finding one you can purchase, or just setting up a few Slack notifications. Thanks so much for watching this video, and I'll see you in the next one. Thanks, all. Goodbye.
Tags
  • Data Quality
  • AI
  • Range Checks
  • Category Checks
  • Freshness Checks
  • Null Tests
  • Volume Checks
  • Data Governance
  • DCube
  • Data Systems