Depending on what metric you look up, companies are supposedly losing as much as 20%, if not more, due to bad data quality. Now, we all know that everyone loves to talk about how important data quality is, especially as we're heading into a world that seems to be gung-ho about AI, yet we have many examples, like the Google Bard example, where erroneous output impacted their stock price heavily. So in this video, what I wanted to talk about is data quality and how we actually implement data quality checks into systems. We're going to talk about different types of data quality checks, how you create systems that check data quality, how you design those systems, and why you might use one over the other. We're really going to be talking about exactly how you could build some of your own systems. And again, the point here is that data quality, in my opinion, is very important, but we often rush past it because it's not as sexy as building a data pipeline.
So let's first talk about how you can actually check your data. If you've been in the data world for a while, you may have seen some of these checks; maybe you haven't. Here are some of the key ones you'll generally see.

First, let's talk about range checks, or at least that's what I call them: data range checks. These are generally focused on numeric data sets where you assume there is a certain range those numbers should generally fall within. When implemented, there are a few different ways you might put this check in. For example, you might have some model that detects anomalies, so if you exceed or go below a number that's typical, you flag it. Let's say you're used to only seeing $1,000 transactions in your system and you suddenly see a million-dollar transaction; you'd think, hey, we should probably check that out. In fact, this actually happened to me at one company, where we suddenly started seeing $200,000 transactions. It wasn't even bad data in that case; the process was wrong, and someone was being allowed to spend $200,000 where they shouldn't have. So data quality checks don't always catch bad data; sometimes they catch erroneous steps in the process that need to be checked. That's your essential range check: there's a certain range, from X to Y, where you should expect numbers, and if they exceed those expectations, you should probably check that.
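To make that concrete, here's a minimal sketch of what a range check can look like in code. The table name, column, bounds, and the SQLite connection are placeholder assumptions for illustration, not any particular company's setup.

```python
# A minimal range-check sketch. The table/column names, bounds, and the
# sqlite3 connection are illustrative assumptions, not a specific product's API.
import sqlite3

RANGE_CHECK_SQL = """
SELECT COUNT(*) AS violations
FROM transactions
WHERE amount < :min_amount OR amount > :max_amount
"""

def run_range_check(conn, min_amount=0, max_amount=100_000):
    """Return the number of rows whose amount falls outside the expected range."""
    cur = conn.execute(RANGE_CHECK_SQL, {"min_amount": min_amount, "max_amount": max_amount})
    return cur.fetchone()[0]

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection
    bad_rows = run_range_check(conn)
    if bad_rows > 0:
        print(f"Range check failed: {bad_rows} transactions outside expected bounds")
```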
Another common check that you will often see is what I call category checks. When I say category, I mean there are generally entities or values that either relate to a category or where you're just expecting a certain set of values that never changes. The clearest example I always think of is states, or state abbreviations. They're not necessarily categories, but they're some sort of limited set of values, and you shouldn't get anything other than that limited set. I say that because, as someone who has implemented a state abbreviation check, I've actually gotten states that were non-existent. I had assumed that, more than likely, the system on the other side did some sort of data quality check or had a dropdown, so how could you possibly get anything that wasn't a state abbreviation? And yet we got at least a handful, I think four or five abbreviations that didn't exist, and they came into our system from the operational system. So you often need a check for this. I call these category checks because sometimes the way to think about it is that you have certain types of actions or events that can occur in your system, and you only expect those events to occur. PTO is another example: we had various types of PTO, and every once in a while we would get a new type because the operational team wouldn't tell us they had created it, no matter what we asked them to do. So we just had to create a check that basically yelled at us whenever we saw something that didn't match a category we were expecting.
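Here's a sketch of what a category (accepted-values) check could look like; the table, column, and allowed set are hypothetical examples, not taken from any real system.

```python
# Accepted-values ("category") check sketch. The table, column, and allowed
# set below are hypothetical examples.
import sqlite3

ALLOWED_PTO_TYPES = {"vacation", "sick", "parental", "bereavement"}

def find_unexpected_categories(conn, table="employee_time_off", column="pto_type"):
    """Return distinct values in `column` that are not in the allowed set."""
    rows = conn.execute(f"SELECT DISTINCT {column} FROM {table}").fetchall()
    observed = {r[0] for r in rows}
    return observed - ALLOWED_PTO_TYPES

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection
    unexpected = find_unexpected_categories(conn)
    if unexpected:
        print(f"Category check failed, unexpected PTO types: {sorted(unexpected)}")
```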
Quick pause, everyone. I just want to say thank you so much to our sponsor today, Decube. Decube is a platform that helps companies build trust and governance in their data and AI products. As we gear up for LLMs and AI products, high-quality and trustworthy data are extremely critical to meeting business goals and objectives. Don't solve the problem in silos; review the overarching goal of how we as data teams can deliver trusted data access across the value chain. With Decube's lineage and incident details, you know where an incident took place and understand its impact on downstream assets. Decube is not limited to databases or warehouses; in fact, Decube observes the data pipeline — dbt jobs, Fivetran, Airflow — to extract job runs, lineage, and incidents along with logs. Decube is a truly unified data platform that manages data observability, discovery, and governance all in one. Their tagline is: Decube — observe, discover, govern. Start your journey towards trust and governance today; you can visit dcu.org.
Another check worth calling out is the data type check: you know that that should be a date field, that should be an integer field. It's really important that you check your data types, because sometimes, especially if you're loading data from files, and headerless files in particular (files that don't have a header row), you might not get the right column order every time. I've had a company where I worked with them heavily to send me a CSV the same way every time, and for six months it might be fine: we get the same data every time, in the same set of rows and columns, and everything's perfect. Great. Then, for some reason, either someone quits or someone changes the automated script, and now suddenly you're getting a different field where you expected a date. So it's great to have a quick data type check to make sure: hey, are these dates, or are these the values I'm expecting? Because again, sometimes people do ship these files around, and you do not want to figure that out after you've loaded the data into the staging layer; hopefully you can catch it at raw.
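As a sketch, here's one way to sanity-check column types on an incoming headerless CSV before loading it. The file name, expected column layout, and date format are assumptions for illustration.

```python
# Data type check sketch for a headerless CSV. The file name, column order,
# and date format are assumptions for illustration.
import csv
from datetime import datetime

EXPECTED_COLUMNS = ["transaction_id", "transaction_date", "amount"]  # assumed order

def check_csv_types(path, date_format="%Y-%m-%d", max_errors=10):
    """Return a list of (row_number, message) for rows whose fields don't parse."""
    errors = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f), start=1):
            if len(row) != len(EXPECTED_COLUMNS):
                errors.append((i, f"expected {len(EXPECTED_COLUMNS)} columns, got {len(row)}"))
            else:
                try:
                    datetime.strptime(row[1], date_format)  # second column should be a date
                    float(row[2])                            # third column should be numeric
                except ValueError as e:
                    errors.append((i, str(e)))
            if len(errors) >= max_errors:
                break
    return errors

if __name__ == "__main__":
    for row_num, msg in check_csv_types("daily_transactions.csv"):
        print(f"Row {row_num}: {msg}")
```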
Another common check is what's known as a freshness check. This one is interesting: if you bring up the pillars of data quality, one of the things that's often referenced is timeliness, because data isn't just accurate in the sense of "is the data right as of yesterday?" At a lot of companies, people might know that certain data or certain transactions have occurred, and if your system isn't refreshed or up to date to what an executive expects, they're going to be a little frustrated when they look at a report and see that a number hasn't been updated in two days when they were expecting it. Or, more likely, they won't even know that's what caused it; they'll just know they don't see a transaction that should be captured, and they're going to say, hey, your system's not working correctly because it didn't capture this transaction that happened 30 seconds ago. That's often why there are data freshness checks: to tell you how fresh the data is. Usually you'll have some sort of warning that fires if the data goes beyond a certain age, and on the other side, on your dashboards, you should put a little note that says "updated as of" a certain date. That way people at least know what that data represents.
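Here's a minimal freshness-check sketch; the table name, timestamp column, 24-hour threshold, and the assumption that timestamps are stored as ISO-8601 strings in UTC are all placeholders.

```python
# Freshness check sketch: warn if the newest row in a table is older than a
# threshold. Table name, timestamp column, and timestamp format are assumptions.
import sqlite3
from datetime import datetime, timedelta, timezone

def check_freshness(conn, table="orders", ts_column="loaded_at", max_age=timedelta(hours=24)):
    """Return (is_fresh, age) based on the most recent timestamp in the table."""
    row = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if row[0] is None:
        return False, None  # empty table counts as stale
    # Assumes naive ISO-8601 timestamps stored in UTC.
    latest = datetime.fromisoformat(row[0]).replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - latest
    return age <= max_age, age

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection
    fresh, age = check_freshness(conn)
    if not fresh:
        print(f"Freshness check failed: orders last loaded {age} ago")
```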
Another very useful test that I've seen prove useful multiple times is the volume test. This essentially says: have a check that looks at how much data is actually being loaded, just the count of rows loaded per day, because generally a company should see a similar number of rows on most days (except perhaps weekends, depending on what type of company you are). If you suddenly see three, four, five times that number, there's a problem somewhere. I've very much seen that happen multiple times in multiple data sets, where a system somewhere goes wrong. Generally it's an operational system where something changed, and suddenly you're seeing massively more data, and that should set off alarm bells: why am I seeing more data than I saw yesterday? Is there something wrong in an operational system? In fact, we've seen this happen where a new feature was released and suddenly we were getting ten times the amount of data coming in, and we shouldn't have been; we were seeing ten times the transactions because something in the operational system wasn't built correctly and was firing off way too many requests that we were now tracking. So this is a great check just for sanity: if you're seeing a crazy number of rows, or far fewer rows than usual, something is more than likely wrong.
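One way to sketch a volume test is to compare today's row count against a trailing average; the table name, date column, and the 3x/0.33x thresholds below are assumptions.

```python
# Volume check sketch: compare today's row count to the trailing 7-day average.
# Table name, date column, and thresholds are assumptions.
import sqlite3

VOLUME_SQL = """
SELECT
    SUM(CASE WHEN load_date = DATE('now') THEN 1 ELSE 0 END) AS today_rows,
    SUM(CASE WHEN load_date >= DATE('now', '-7 days')
              AND load_date <  DATE('now') THEN 1 ELSE 0 END) / 7.0 AS avg_rows
FROM events
"""

def check_volume(conn, max_ratio=3.0, min_ratio=0.33):
    """Flag the load if today's count is far above or below the recent average."""
    today_rows, avg_rows = conn.execute(VOLUME_SQL).fetchone()
    if not avg_rows:
        return True  # no history yet; nothing to compare against
    ratio = (today_rows or 0) / avg_rows
    return min_ratio <= ratio <= max_ratio

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection
    if not check_volume(conn):
        print("Volume check failed: today's row count is far outside the recent average")
```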
For now, I'll cover one more test that you should definitely implement, which is the null test. Generally, you'll implement it one of two ways: either you set it so there can't be any nulls at all, or you allow a certain percentage of the field to be null, and then maybe you put in some sort of filler value, depending on the field, when required. Nulls act weird, and you need to make sure you understand that, so null tests are also super valuable.
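A null test can be as simple as measuring the null fraction of a column and comparing it to a threshold; the table, column, and 1% threshold here are assumptions.

```python
# Null test sketch: fail if the share of NULLs in a column exceeds a threshold.
# Table, column, and the 1% threshold are assumptions.
import sqlite3

def null_fraction(conn, table="customers", column="email"):
    """Return the fraction of rows where `column` IS NULL (0.0 if the table is empty)."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    return (nulls or 0) / total if total else 0.0

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse connection
    frac = null_fraction(conn)
    if frac > 0.01:  # allow at most 1% nulls
        print(f"Null check failed: {frac:.1%} of customers.email is NULL")
```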
Now that we've talked about some of the types of tests you might implement, let's talk about how you actually implement them: what is the system you create that lets you know, hey, something's gone wrong? The truth is that different companies need different levels of checks. Some companies I've worked for are perfectly fine with the very light checks I'm going to discuss, the very simple ways you can implement this. Others want to develop entire systems built around data quality, while still others look for out-of-the-box solutions, like our sponsor today, to cover a lot of that because they don't have an army of engineers to build these solutions. So let's talk about some of these options.
Let's start with the easiest thing, I think, to implement, which is Slack messages. I've had to do this for some of my clients, where maybe they didn't want a complex system, but they knew they had a very limited set of data sets. So instead of building a complex system that would cost them a lot of money, we created an automated set of checks that would run at the end of all of their key jobs. It would just run all of these checks, and if any of them failed, it would send a Slack message with a list of failures so the data engineer on call could see them. It's not super fancy; honestly, it was just a bunch of UNIONs over all of these checks, and it's also not very generalized: I had to write each one individually, versus what we'll talk about later, where you create a generalized system. But if you only have ten checks, that's all you're really running, you're not planning to add more, and it supports everything you need, that could be sufficient. I always think it's important to recognize that there are costs and trade-offs, and you need to find the system that works best for you based on what decisions are being made off that data and how much budget you're willing to put toward it.
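A minimal sketch of that kind of setup might look like this: run a small list of SQL checks and post any failures to a Slack incoming webhook. The webhook URL, check queries, and table names are all placeholder assumptions.

```python
# Slack alert sketch: run a list of SQL checks and post any failures to a
# Slack incoming webhook. The webhook URL, check queries, and table names are
# placeholder assumptions.
import json
import sqlite3
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

# Convention: each check returns a single count of offending rows; 0 means pass.
CHECKS = {
    "negative order amounts": "SELECT COUNT(*) FROM orders WHERE amount < 0",
    "null customer emails": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
}

def run_checks(conn):
    failures = []
    for name, sql in CHECKS.items():
        bad_rows = conn.execute(sql).fetchone()[0]
        if bad_rows:
            failures.append(f"{name}: {bad_rows} rows")
    return failures

def post_to_slack(lines):
    body = json.dumps({"text": "Data quality failures:\n" + "\n".join(lines)}).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    failures = run_checks(sqlite3.connect("warehouse.db"))
    if failures:
        post_to_slack(failures)
```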
Another way a lot of people implement data quality checks, and this one can be implemented alongside the Slack messages, is a data quality dashboard. In particular, I see this a lot with volume checks, freshness checks, and things of that nature: maybe you've got some high-level metrics for key tables, letting you know which key tables have been loaded and when, so you can see something flash red if it goes beyond, say, 24 hours, and volume checks that tell you how many rows you're getting for those key tables and whether you're still seeing the right number. This matters because you might have the data updated, but maybe what's updated is only 10 rows when you're expecting 100,000. So you want a few different views on this data quality dashboard of where something could have gone wrong. You might also include some of the other checks we talked about before, like null checks, all on the dashboard. The problem is that you then have to know where to look, and you have to actually go through it all, which is why it's nice to combine it with the Slack alerts: you have one place where checks surface as they occur, and another place where, if you need to dig into what's going on, you can see a more live view with the dashboard. One's automated, and one's more of a live view that's running all the time.
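A dashboard like that is usually backed by a simple summary query per key table. Here's a sketch of the kind of freshness-plus-volume summary it might read from; the table list and column names are assumed.

```python
# Sketch of the summary a data quality dashboard might read from: latest load
# time and today's row count per key table. Table and column names are assumed.
import sqlite3

KEY_TABLES = ["orders", "customers", "events"]  # assumed list of key tables

SUMMARY_SQL = """
SELECT MAX(loaded_at) AS last_loaded,
       SUM(CASE WHEN DATE(loaded_at) = DATE('now') THEN 1 ELSE 0 END) AS rows_today
FROM {table}
"""

def dashboard_summary(conn):
    """Return one row of freshness/volume metrics per key table."""
    summary = []
    for table in KEY_TABLES:
        last_loaded, rows_today = conn.execute(SUMMARY_SQL.format(table=table)).fetchone()
        summary.append({"table": table, "last_loaded": last_loaded, "rows_today": rows_today or 0})
    return summary

if __name__ == "__main__":
    for row in dashboard_summary(sqlite3.connect("warehouse.db")):
        print(row)
```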
Now, taking a step further than that, like I said, you can develop full data quality systems, and earlier I mentioned that these are very generalized. In my experience, when I've built them, what you essentially end up with is some sort of script, Python or otherwise, that abstracts away the fact that you're checking something. You write a SQL query that you can essentially pass in, so you can have a whole list of them; often we would keep that list somewhere, either in a folder or in your database. You'd record what type of check it was (was it a range check, etc.) and maybe how many failures you'd allow, in the columns of the rows of your system. That way, when the automated system ran, it would pick up these queries, and instead of having to create an individual check for each of them, you'd already have abstracted them away. Every time you add a new check, you're just adding a new row to your table, rather than the other option, which would be writing a new query and then having to change code. You're not changing code in the system; you're just inserting a row into a table somewhere. This is one way I see a lot of people do it. The other thing that they might do, and may or may not do, is track the output over time. That's the big thing that the systems you pay for, especially a lot of vendors, do: they track the change over time, how well your data is behaving. You'll see this in many systems where they track how healthy a table is and how many failures you typically have on it. You may or may not do that if you build your own system, because it takes more time, but that's generally what it is: a SQL-based system with some sort of code-based wrapper — I said Python earlier, but it could be any language; we were honestly using PowerShell — that runs everything, loops through it all, tracks it all, and then saves or outputs the results somewhere you can see them later. That's the in-house developed system.
00:13:20
developed system now another way you can
00:13:22
often run data quality checks and this
00:13:24
is how we did it at Facebook um you
00:13:26
should have a bunch of DQ operators and
00:13:28
so C and for data quality operators um
00:13:31
so if you've used airflow there's
00:13:33
everything's references uh operators
00:13:35
Facebook uses something similar to air
00:13:37
flow so everything is referenced is
00:13:39
operators basically data quality
00:13:40
operators are pre-built um essentially
00:13:43
tasks that you can run and will
00:13:45
automatically actually interject into
00:13:48
the tracking system so I referenced
00:13:49
earlier you'd have to build your own if
00:13:51
you did it yourself but we had you know
00:13:53
abstracted it to a point where you could
00:13:54
just reference this DQ operator and that
00:13:57
would automatically feed into our whole
00:13:59
data catalog you could see the data
00:14:01
quality checks you could see the health
00:14:02
it all be there in one because it was
00:14:04
super abstracted to the point where you
00:14:06
just have to pretty much put the query
00:14:07
there set expectations and it would run
00:14:10
and tell you and and you just have to
00:14:12
set like should it fail or should it not
00:14:13
succeed based on um certain parameters
00:14:16
and so that's a great way um especially
00:14:18
once you start getting far enough along
00:14:19
and you have enough engergy ear to build
00:14:21
your own system and of course there are
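As a rough sketch of the idea in Airflow terms, here's a custom operator that runs a check query and fails the task if it exceeds the allowed failures. This is not Facebook's internal tooling or an official provider operator; the connection wiring (a plain callable returning a DB-API connection) is a deliberate simplification.

```python
# Rough sketch of a custom data quality operator in Airflow style. NOT an
# official provider operator; the get_conn callable is a placeholder you would
# replace with your own hook or connection factory.
from airflow.models import BaseOperator

class DataQualityCheckOperator(BaseOperator):
    """Run a SQL check that returns a single failure count; fail the task if it
    exceeds the allowed threshold."""

    def __init__(self, check_sql, get_conn, allowed_failures=0, **kwargs):
        super().__init__(**kwargs)
        self.check_sql = check_sql
        self.get_conn = get_conn          # callable returning a DB-API connection
        self.allowed_failures = allowed_failures

    def execute(self, context):
        cur = self.get_conn().cursor()
        cur.execute(self.check_sql)
        failures = cur.fetchone()[0] or 0
        self.log.info("Check %s found %s failures", self.task_id, failures)
        if failures > self.allowed_failures:
            raise ValueError(
                f"Data quality check failed: {failures} failures "
                f"(allowed {self.allowed_failures})"
            )
        return failures
```

In a real deployment, the operator would also push its result to whatever catalog or tracking system you use, which is the part that made this pattern so useful.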
And of course, there are solutions like dbt that come with dbt tests, and you can check some of this out even there. There are some basic built-in tests you can include; for example, unique, not_null, and accepted_values which, as the names suggest, check whether a value is unique, check the not-null factor we talked about earlier, and so on. You can also create your own generic tests, so as these models are built you run dbt tests as well, if you're building these models in dbt. Again, that is limited by the fact that you're using dbt, and that's the only place it's going to actually work.
So, as referenced before, data quality is becoming more and more important as we want to make these kinds of big decisions and build automated systems in the future with all of this technology, with AI, etc. It just pushes the need for higher data quality, because we do not want bad things to happen as we come to rely on these systems, I imagine, more and more in the real world. That data needs to be of the highest quality, meaning that, like we discussed, there's a ton of data quality checks you can run, everything from uniqueness checks to range checks to not-null checks, etc., and there are a lot of different ways you can implement those systems: a few Slack messages that yell at you when something's gone wrong, an entire system you build yourself, or one you purchase, like our sponsor today — and again, thank you, Decube, for sponsoring this video. With that, I really hope you've learned how you can set up your own data quality systems, whether you're using something like SQL or Python to build out your own system quickly, you need to find one you can purchase, or you just need to set up a few Slack notifications. Thanks so much for watching this video, and I'll see you in the next one. Thanks all, goodbye.