Netflix: A series of unfortunate events: Delivering high-quality analytics

00:49:15
https://www.youtube.com/watch?v=XpOwXMB8mTA

Summary

TLDR: Michelle Ufford, who leads a data engineering team at Netflix, describes the company's approach to handling the vast amounts of data generated by over 100 million subscribers globally. Netflix writes 700 billion events per day and processes them to ensure the reliability of its analytics. Every interaction with the Netflix service is logged as an event and captured in a Kafka-backed pipeline. This data is processed and stored in a cloud-based data warehouse using tools like Spark. Netflix embraces a mentality of expecting system and data failures rather than trying to prevent them outright, using anomaly detection to manage data quality. The company implements a push-based ETL system that cascades updates downstream efficiently, as opposed to traditional scheduled systems. Ensuring data quality is vital, as stakeholders from engineers to executives use this data in decisions that affect content investment and user experience strategy. Netflix has embraced data-driven decision-making to guide original content production, aiming to increase its share of unique content. Michelle emphasizes the complexity of managing Netflix's data at scale and the need for robust infrastructure and processes to give data consumers high availability and reliability. The talk underscores the importance of enabling confidence in data-driven decision-making and analytics-driven solutions.
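
To make the event-logging step concrete, here is a minimal sketch of publishing one playback interaction to a Kafka topic, assuming the kafka-python client; the broker address, topic name, and event fields are illustrative assumptions, not Netflix's actual schema.

```python
# Illustrative only: a minimal event producer using the kafka-python client.
# Broker address, topic name, and event fields are assumptions, not Netflix's schema.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],                      # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # events serialized as JSON
)

event = {
    "event_type": "play_start",        # hypothetical interaction type
    "member_id": 12345,
    "title_id": 67890,
    "device": "android_tablet",
    "ts": int(time.time() * 1000),
}

# Every interaction becomes one event on the stream; downstream consumers land
# the raw records in the ingestion layer (S3) for Spark to process.
producer.send("playback-events", value=event)
producer.flush()
```

Downstream, consumers of that topic land the raw records in the warehouse's ingestion layer, which is the flow the talk walks through in detail.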

Key takeaways

  • 📊 Netflix writes 700 billion events daily, using them for analytics and visualization.
  • 📈 Netflix employs a push-based ETL system for more efficient data processing.
  • 🚀 The company's data warehouse handles 60 petabytes, growing by 300 TB daily.
  • 🔎 They use anomaly detection to maintain data quality and prevent bad data.
  • 💡 Netflix relies heavily on data-driven decision-making, especially in content strategy.
  • 🌐 The company logs every interaction as data, from app usage to content consumption.
  • 🛠️ Spark and Kafka are pivotal tools for processing Netflix's massive data streams.
  • 🖥️ Data quality is ensured by statistical and metadata checks before the data is made visible.
  • 📅 Data updates can trigger automatic downstream process execution in a push system.
  • 🎬 Data has driven successful strategies, including investing in original content.

Timeline

  • 00:00:00 - 00:05:00

    Michelle Ufford introduces herself as the narrator, leading a data engineering team at Netflix focused on analytics and trusted data. She highlights Netflix's global presence and data scale, emphasizing the company's massive data collection and processing capabilities.

  • 00:05:00 - 00:10:00

    Michelle discusses the Netflix data architecture, including event data logging, ingestion pipelines, and a vast data warehouse built on open-source technologies. The company processes large volumes of data daily, enabling various company-wide applications.

  • 00:10:00 - 00:15:00

    The process of handling 'unfortunate data' is explored. Michelle explains events as user interactions, how Netflix logs them to a pipeline, processes them with big data technologies like Spark, and visualizes them with tools like Tableau for access across the company.

  • 00:15:00 - 00:20:00

    Problems with data reliability are framed as 'unfortunate data,' without implying fault. Visualization tools help identify issues. Michelle emphasizes the complexity of data processes, highlighting possible points of failure and the significant traffic Netflix handles.

  • 00:20:00 - 00:25:00

    She stresses the importance of data quality for decision-making, especially for a data-driven company like Netflix. Various roles within Netflix interact with data, including data engineers, analytics engineers, and visualization experts. Executives rely heavily on data for strategic decisions.

  • 00:25:00 - 00:30:00

    The impact of accurate data is crucial for roles like product managers and engineers at Netflix, who make decisions on content investment, algorithm adjustments, and user experience based on data insights. Netflix's strategic content decisions are deeply data-driven.

  • 00:30:00 - 00:35:00

    Michelle outlines strategies Netflix employs to ensure data quality, advocating for detection and response over prevention. She discusses utilizing data statistics and anomaly detection, explaining how unexpected data behavior is monitored and managed to prevent inaccurate reporting.

  • 00:35:00 - 00:40:00

    Handling data anomalies involves alert systems and user notifications when data is missing or suspect. Michelle highlights internal tools for data visibility and lineage, supporting efficient troubleshooting and maintaining user trust by clarifying data integrity.

  • 00:40:00 - 00:49:15

    Michelle explains that despite the complexities, Netflix's processes maintain confidence in the data, enabling impactful decisions like investing in original content. The talk concludes with strategic advice on managing data reliability and the challenges of data operations at scale.



FAQ

  • What is the role of a data engineer at Netflix?

    A data engineer at Netflix is essentially a software engineer specializing in data, focusing on distributed systems and making data consumable for the rest of the company.

  • How does Netflix handle data events?

    Netflix logs all interactions as events using Kafka, storing them in an AWS S3-backed data warehouse, processed by tools like Spark for analytics and visualization.

  • How much data does Netflix process daily?

    Netflix processes about 700 billion events daily, peaking over a trillion, with a warehouse currently at 60 petabytes, growing by 300 terabytes every day.

  • How does Netflix ensure data quality and trust?

    Netflix uses statistical checks, anomaly detection, and a process to prevent bad data from becoming visible to maintain data quality and trust.

  • What innovation has Netflix implemented for ETL processes?

    Netflix uses a push-based system where jobs notify downstream processes of new data, improving efficiency and accuracy in data handling.

  • Why is metadata important for Netflix?

    Metadata provides statistics about data, aiding in anomaly detection and ensuring data quality across streams and storage environments.

  • How does Netflix's data infrastructure impact its content strategy?

    Data analytics guide content investment decisions, like Netflix's move into original content, which now aims to comprise 50% of its catalog.

  • What challenges does Netflix face with data visualization?

    Challenges include ensuring timely data access, accurate data representation, and addressing performance limitations in visualization tools.

  • How does Netflix address bad data or unfortunate events?

    Netflix uses early detection and visibility strategies to prevent bad data from reaching reports, maintaining trust in data-driven decisions.

  • What role do data consumers have at Netflix?

    Data consumers at Netflix, including business analysts and data scientists, rely on accurate data to make strategic and operational decisions.


Transcript (English)
  • 00:00:01
    welcome my name is Michelle Ufford and I
  • 00:00:04
    am your humble narrator for today's talk
  • 00:00:07
    I also lead a team at Netflix focused on
  • 00:00:10
    data engineering innovation and
  • 00:00:11
    centralized solutions and I want to
  • 00:00:14
    share with you some of the really cool
  • 00:00:15
    things we're doing around analytics and
  • 00:00:18
    also how we ensure that the analytics
  • 00:00:21
    that we're delivering can be trusted I
  • 00:00:23
    also want you guys to understand that
  • 00:00:25
    this stuff is really really hard and I'm
  • 00:00:29
    trying to give you some ideas on ways
  • 00:00:31
    that you can deal with this in your own
  • 00:00:33
    environment Netflix was born on a cold
  • 00:00:41
    stormy night back in 1997 but seriously
  • 00:00:45
    it's 20 years old which most people
  • 00:00:47
    don't realize and as of q2 2017 we have
  • 00:00:52
    a hundred million members worldwide we
  • 00:00:56
    are on track to spend six billion
  • 00:00:59
    dollars on content this year and you
  • 00:01:03
    guys watch a lot of that content in fact
  • 00:01:05
    you watch a hundred and twenty five
  • 00:01:07
    million hours every single day these are
  • 00:01:13
    not peak numbers On January 8th of this
  • 00:01:16
    year you guys actually watched 250
  • 00:01:20
    million hours in a single day it's
  • 00:01:24
    impressive as of q2 2016 we are in 130
  • 00:01:31
    countries worldwide and we are on over I
  • 00:01:37
    think around 4,000 different devices now
  • 00:01:40
    there's a reason why I'm telling you
  • 00:01:41
    this we have a hundred million members
  • 00:01:43
    watching a hundred and twenty five
  • 00:01:45
    million hours of content every day in a
  • 00:01:48
    hundred and thirty countries on four
  • 00:01:50
    thousand different devices we have a lot
  • 00:01:54
    of data we write 700 billion events to
  • 00:02:00
    our streaming ingestion pipeline every
  • 00:02:02
    single day this is average we peak at
  • 00:02:06
    well over a trillion this data is
  • 00:02:09
    processed and landed in our
  • 00:02:13
    warehouse which is built entirely using
  • 00:02:15
    open-source Big Data technologies we're
  • 00:02:17
    currently sitting at around 60 petabytes
  • 00:02:20
    and growing at a rate of 300 terabytes a
  • 00:02:22
    day and this data is actively used
  • 00:02:26
    across the company I'll give you some
  • 00:02:28
    specific examples later but on average
  • 00:02:30
    we do about five petabytes of reads now
  • 00:02:35
    what I'm trying to demonstrate to you in
  • 00:02:38
    this talk is that we can do this and we
  • 00:02:40
    can use these principles at scale but
  • 00:02:42
    you don't need this type of environment
  • 00:02:45
    to still get value out of it it's just
  • 00:02:48
    showing you that it does work at scale
  • 00:02:49
    so now for the fun stuff the unfortunate
  • 00:02:54
    events and events is actually a play on
  • 00:02:57
    words here when I say events what I'm
  • 00:02:58
    really talking about is all of the
  • 00:03:01
    interactions you take across the service
  • 00:03:04
    so this could be authenticating into an
  • 00:03:08
    app it could be the content that you
  • 00:03:11
    receive as to you like what we recommend
  • 00:03:14
    for you to watch it could be when you
  • 00:03:16
    click on content or you pause it or you
  • 00:03:18
    stop it or when you click on that next
  • 00:03:21
    thing to watch all of those are
  • 00:03:23
    considered events for us this data is
  • 00:03:27
    written into our ingestion pipeline which
  • 00:03:28
    is backed by kafka and it's landed in a
  • 00:03:32
    raw ingestion layer inside of our data
  • 00:03:34
    warehouse so just for those who are
  • 00:03:37
    curious this is all 100% based in the
  • 00:03:39
    cloud everything you see on this it's
  • 00:03:41
    all using Amazon's AWS but you can think
  • 00:03:45
    of this s3 as really just a data
  • 00:03:47
    warehouse so we have a raw layer we use
  • 00:03:50
    a variety of big data processing
  • 00:03:52
    technologies most notably spark right
  • 00:03:54
    now to process that data transform it
  • 00:03:57
    and we land it into our data warehouse
  • 00:03:59
    and then we can also aggregate and
  • 00:04:03
    denormalize and summarize that data and
  • 00:04:04
    put it into a reporting layer a subset
  • 00:04:07
    of this data is moved over to a variety
  • 00:04:10
    of fast access storage engines
  • 00:04:13
    where that data is made available for
  • 00:04:17
    our data visualization tools tableau is
  • 00:04:20
    the most widely used visualization tool
  • 00:04:23
    at Netflix but we also have a variety of
  • 00:04:26
    other use cases
  • 00:04:26
    so we have other visualization tools
  • 00:04:29
    that we support and then this data is
  • 00:04:31
    also available to be queried and
  • 00:04:34
    interacted with using a variety of other
  • 00:04:36
    tools every person in the company has
  • 00:04:40
    access to our reports and to our data
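
The flow just described (raw Kafka-landed events in S3, cleaned with Spark into the warehouse, then summarized for the reporting layer and visualization tools) can be sketched roughly as below. Bucket paths, column names, and table layout are illustrative assumptions, not Netflix's actual jobs.

```python
# A hedged sketch of the flow described above; bucket paths, columns, and table
# names are illustrative assumptions, not Netflix's actual jobs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events_etl_sketch").getOrCreate()

# Raw ingestion layer: JSON events landed from the Kafka-backed pipeline.
raw = spark.read.json("s3://raw-ingestion/playback-events/2017-10-10/")

# Detail (warehouse) layer: keep well-formed events and partition by date.
detail = (
    raw.filter(F.col("event_type").isNotNull())
       .withColumn("event_date", F.to_date(F.from_unixtime(F.col("ts") / 1000)))
)
detail.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://warehouse/playback_detail/"
)

# Reporting layer: aggregated, denormalized summaries for tools like Tableau.
summary = detail.groupBy("event_date", "device").agg(
    F.count("*").alias("events"),
    F.countDistinct("member_id").alias("members"),
)
summary.write.mode("overwrite").parquet("s3://reporting/playback_daily_summary/")
```
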
  • 00:04:47
    but this is this is what happens when
  • 00:04:51
    everything works well this is what it
  • 00:04:53
    looks like let's talk about when things
  • 00:04:55
    go wrong when you have bad data but
  • 00:04:58
    really I don't like the term bad because
  • 00:05:00
    it implies intent and the data is not
  • 00:05:03
    trying to ruin your Monday it doesn't
  • 00:05:06
    want to create problems in your reports
  • 00:05:08
    so I like to think of it as unfortunate
  • 00:05:10
    data this is a visualization it's
  • 00:05:15
    actually I think one of the coolest
  • 00:05:16
    visualizations I've ever seen it's a
  • 00:05:18
    tool called visceral and it shows all of
  • 00:05:20
    the traffic that is coming into our
  • 00:05:22
    service and every single of those small
  • 00:05:25
    dots is an API and those large dots
  • 00:05:28
    represent various regions in our
  • 00:05:31
    AWS so if you look here you
  • 00:05:34
    might notice one of these circles is red
  • 00:05:37
    and that means that there's a problem
  • 00:05:38
    with that API what that means is that
  • 00:05:46
    we're receiving for the most part data
  • 00:05:49
    but we might not be receiving all data
  • 00:05:51
    we might not be receiving one type of
  • 00:05:53
    event so in this example let's say that
  • 00:05:56
    we can receive the events when you
  • 00:05:59
    click play on new content and when
  • 00:06:03
    you get to the end and you click that
  • 00:06:04
    click to play next episode we don't see
  • 00:06:08
    that right so let's just use that as a
  • 00:06:10
    hypothetical so now we have our bad data
  • 00:06:15
    our unfortunate data that is coming in
  • 00:06:17
    and let's say just for example purposes
  • 00:06:19
    that this is only affecting our tablet
  • 00:06:23
    devices maybe it's the latest release of
  • 00:06:25
    the Android SDK caused some sort of
  • 00:06:27
    compatibility issue and now we don't see
  • 00:06:29
    that one type of event it could be that
  • 00:06:32
    the event doesn't come through at all it
  • 00:06:34
    could be that the event comes through
  • 00:06:35
    but it's malformed it could be that it's
  • 00:06:38
    an empty payload
  • 00:06:39
    point is that we are somehow missing the
  • 00:06:41
    data that we're expecting and it doesn't
  • 00:06:43
    stop there and makes its way through our
  • 00:06:45
    entire system goes into our ingestion
  • 00:06:47
    pipeline it gets landed over into s3 it
  • 00:06:49
    ultimately makes its way to the data
  • 00:06:51
    warehouse it's going to be copied over
  • 00:06:53
    to our fast storage and if you're doing
  • 00:06:55
    extracts that's going to live there too
  • 00:06:57
    and ultimately this data is going to be
  • 00:07:01
    in front of our users and the problem
  • 00:07:04
    here is that you guys are the face of
  • 00:07:10
    the problem even though you had nothing
  • 00:07:12
    to do with it
  • 00:07:13
    you created the report and the user is
  • 00:07:16
    interacting with the report it is
  • 00:07:21
    generally not going to be your fault
  • 00:07:23
    I sometimes it's your fault but most of
  • 00:07:25
    the time what I've observed is that
  • 00:07:27
    reports issues with reports are quote
  • 00:07:30
    unquote upstream what does that mean
  • 00:07:32
    every single icon and every single arrow
  • 00:07:37
    on this diagram is a point of failure
  • 00:07:40
    and not just one or two possible things
  • 00:07:43
    that could go wrong a dozen or more
  • 00:07:45
    different things could go wrong and this
  • 00:07:48
    is a high-level view if we drilled down
  • 00:07:50
    you would see even more points of
  • 00:07:51
    failure so it's realistic to expect that
  • 00:07:54
    things will go wrong compounding this
  • 00:07:58
    problem for us is that according to
  • 00:07:59
    Sandvine Netflix accounts for 35% of all
  • 00:08:03
    peak traffic in North America
  • 00:08:20
    so I've described the problem we have
  • 00:08:23
    some bad data we have some unfortunate
  • 00:08:24
    data that is not the issue itself right
  • 00:08:27
    I mean what does it really matter if
  • 00:08:29
    I've got unfortunate data sitting in a
  • 00:08:32
    table somewhere the problem is when you
  • 00:08:35
    are using that data to make decisions
  • 00:08:38
    right that's the impact and it is my
  • 00:08:43
    personal belief that there is no more
  • 00:08:46
    there is no other company in the world
  • 00:08:48
    who is more data driven than Netflix
  • 00:08:50
    there are other companies who are as
  • 00:08:52
    data driven and they're not as big and
  • 00:08:54
    there are bigger companies and they're
  • 00:08:56
    not as data driven if if I'm mistaken
  • 00:08:59
    please see me afterwards I would love to
  • 00:09:01
    hear but when you're really a
  • 00:09:04
    data-driven company that means that you
  • 00:09:06
    are actively using and looking on that
  • 00:09:08
    looking at that data you're relying upon
  • 00:09:09
    it so how do we ensure they still have
  • 00:09:12
    confidence well first let's look at all
  • 00:09:15
    of the different roles we have from from
  • 00:09:18
    Netflix and why I think like we're so
  • 00:09:20
    data-driven we start with our data
  • 00:09:22
    engineers and and from my perspective
  • 00:09:23
    this is just a software engineer who
  • 00:09:25
    really specializes in data they
  • 00:09:27
    understand distributed systems they are
  • 00:09:29
    processing that 700 billion events and
  • 00:09:31
    making it consumable for the rest of the
  • 00:09:33
    company we also have our analytics
  • 00:09:35
    engineers who will usually pick up where
  • 00:09:37
    that data engineer left off they might
  • 00:09:39
    be doing some aggregations or creating
  • 00:09:42
    some summary tables they'll be creating
  • 00:09:45
    some visualizations and they might even
  • 00:09:46
    do some ad hoc analysis so we consider
  • 00:09:49
    them sort of full stack within this data
  • 00:09:51
    space and then we have people that
  • 00:09:52
    specialize in just data visualization
  • 00:09:55
    they are really really good at making
  • 00:09:59
    the data makes sense to people I'm
  • 00:10:02
    curious though how many of you would
  • 00:10:04
    consider yourself like an analytics
  • 00:10:06
    engineer you have to create tables as
  • 00:10:08
    well as the reports wow that's actually
  • 00:10:13
    more than I thought it's a pretty good
  • 00:10:15
    portion of the room how many of you only
  • 00:10:17
    do visualization show hands ok so more
  • 00:10:22
    people actually have to create the
  • 00:10:23
    tables than they do just the
  • 00:10:25
    visualizations interesting so those are
  • 00:10:28
    the people that can that I consider
  • 00:10:29
    these data producers or they're creating
  • 00:10:32
    these
  • 00:10:33
    these data objects for the rest of the
  • 00:10:34
    company to consume then we move into our
  • 00:10:36
    data consumers and this would be our
  • 00:10:38
    business analyst which are probably very
  • 00:10:39
    similar to your business analyst they
  • 00:10:42
    have really deep vertical expertise and
  • 00:10:44
    they are producing they're producing
  • 00:10:48
    analysis like what is the subscriber
  • 00:10:51
    forecast for EMEA we also have research
  • 00:10:54
    scientist and quantitative analyst in
  • 00:10:56
    our science and algorithms groups and
  • 00:10:58
    they are focused on answering really big
  • 00:11:02
    hard questions we have our data
  • 00:11:07
    scientist and machine learning scientist
  • 00:11:09
    and they are creating models to help us
  • 00:11:11
    predict behaviors or make better
  • 00:11:12
    decisions and these would be our
  • 00:11:15
    consumers so these people are affected
  • 00:11:18
    by the bad data but they're not there's
  • 00:11:20
    really no impact yet it's not until we
  • 00:11:22
    get to this top layer that we really
  • 00:11:24
    start to see impact and starts with our
  • 00:11:26
    executives they are looking at that data
  • 00:11:29
    to make decisions about the company's
  • 00:11:32
    strategy many companies say they're
  • 00:11:34
    data-driven
  • 00:11:35
    but what that really means is that I've
  • 00:11:37
    got an idea and I just need the data to
  • 00:11:39
    prove it oh that doesn't look good go
  • 00:11:42
    look over here instead until they can
  • 00:11:43
    find the data that proves their points
  • 00:11:46
    you can go off in the direction they
  • 00:11:47
    want being data-driven means that you
  • 00:11:49
    look at the data first and then you make
  • 00:11:50
    decisions so if we if we provide them
  • 00:11:54
    with bad data bad insights they could
  • 00:11:57
    make a really bad decision for the
  • 00:11:59
    company our product managers we have
  • 00:12:02
    these across every verticals but one
  • 00:12:03
    example would be our content team they
  • 00:12:06
    are asking the question what should we
  • 00:12:08
    spend that six billion dollars on what
  • 00:12:11
    titles should we license what titles
  • 00:12:14
    should we create and they do that by
  • 00:12:16
    relying upon predictive models built by
  • 00:12:19
    our data scientist saying here's what we
  • 00:12:21
    expect the audience to be for a title
  • 00:12:24
    and based upon that we can back into a
  • 00:12:28
    number that we're willing to pay this is
  • 00:12:30
    actually a really good model for this
  • 00:12:33
    because it allows us to support niche
  • 00:12:35
    audiences with small film titles but
  • 00:12:38
    also spend a lot of money on things like
  • 00:12:41
    the the Marvel and Disney partnerships
  • 00:12:43
    where we know it's going to have broad
  • 00:12:44
    appeal
  • 00:12:46
    we have our algorithm engineers who are
  • 00:12:49
    trying to decide what is the right
  • 00:12:50
    content to show you on the site we have
  • 00:12:53
    between 60 and 90 seconds for you to
  • 00:12:56
    find content before you leave and I know
  • 00:12:58
    it feels like longer than 60 or 90
  • 00:13:00
    seconds when you're clicking next next
  • 00:13:02
    next but that's about how long we have
  • 00:13:05
    before you go and spend your free time
  • 00:13:06
    doing something else and then we have
  • 00:13:10
    our software engineers who are trying to
  • 00:13:13
    just constantly experiment with things
  • 00:13:15
    they look at the data and they they roll
  • 00:13:18
    it out to everybody if it makes sense
  • 00:13:19
    and this could be everything from the
  • 00:13:21
    user experience that you actually see to
  • 00:13:23
    things that you don't see like what is
  • 00:13:25
    the optimal compression for for our
  • 00:13:29
    video encoding so that we can lower the
  • 00:13:32
    amount of bandwidth that you have to
  • 00:13:33
    spend while also preventing you from
  • 00:13:35
    having like a really bad video
  • 00:13:37
    experience so these are the the impact
  • 00:13:40
    is really at that top level so how do we
  • 00:13:46
    design for these unfortunate events I
  • 00:13:49
    mean we we have the data we've got lots
  • 00:13:51
    of data we've got lots of people who
  • 00:13:53
    want to look at it a lot of people who
  • 00:13:54
    are depending upon it you know and I
  • 00:13:57
    think that you have two options here the
  • 00:13:58
    first option is you can say we're going
  • 00:14:01
    to prevent anything from going wrong
  • 00:14:03
    we're gonna check for everything and
  • 00:14:05
    we're just gonna we're just gonna lock
  • 00:14:06
    it down and when something gets deployed
  • 00:14:08
    we're gonna make sure that that thing is
  • 00:14:10
    airtight and that works that that sounds
  • 00:14:12
    good in principle but the reality is
  • 00:14:14
    that usually these issues don't occur
  • 00:14:15
    when you deploy something usually
  • 00:14:18
    everything looks great
  • 00:14:19
    and then six months later there's a
  • 00:14:21
    problem right so it's a lot in my
  • 00:14:24
    perspective a lot better instead to
  • 00:14:26
    detect issues and respond to them than
  • 00:14:29
    it is to try to prevent them and when it
  • 00:14:32
    comes to detecting data quality now I'm
  • 00:14:35
    gonna get a little bit more technical
  • 00:14:36
    here please bear with me I think that
  • 00:14:38
    all this stuff that was really relevant
  • 00:14:39
    for you guys and I think that there's
  • 00:14:41
    some really good takeaways for you so
  • 00:14:42
    there's a reason I'm showing you this so
  • 00:14:44
    we're gonna drill down into this data
  • 00:14:45
    storage layer
  • 00:14:50
    and we're gonna look at at this concept
  • 00:14:53
    of a table right and so in Hadoop how
  • 00:14:55
    many of you actually work with Hadoop at
  • 00:14:57
    all how many of you work with only like
  • 00:15:01
    a you work with like an enterprise data
  • 00:15:02
    warehouse but it's on something else
  • 00:15:04
    like Teradata okay so in Teradata a
  • 00:15:09
    table is both a logical and a physical
  • 00:15:13
    construct you cannot separate the two in
  • 00:15:15
    Hadoop you can we have the data sitting
  • 00:15:17
    somewhere on storage and we have this
  • 00:15:19
    concept of a table which is really just
  • 00:15:20
    a pointer to that and we can choose to
  • 00:15:23
    point to the data or we can choose not to
  • 00:15:25
    one thing though that we've built is a
  • 00:15:28
    tool called Metacat and whenever we
  • 00:15:30
    write that data whenever we do that
  • 00:15:31
    pointing we are creating another logical
  • 00:15:34
    object we're creating a partition object
  • 00:15:36
    and this partition object has statistics
  • 00:15:40
    about the data that was just written we
  • 00:15:43
    can look at things like the row counts
  • 00:15:45
    and we can look at the number of nulls
  • 00:15:48
    in that file and say okay we're gonna
  • 00:15:51
    use this information to see if there's a
  • 00:15:53
    problem we can also drill down a little
  • 00:15:56
    bit deeper into the field level and we
  • 00:15:59
    can use this to say like some really
  • 00:16:01
    explicit checks we can say well I'm
  • 00:16:03
    checking for the max value of this
  • 00:16:05
    metric and the max value is zero and
  • 00:16:07
    that doesn't make sense unless it's some
  • 00:16:09
    sort of negative value for this field
  • 00:16:11
    but chances are this is either a brand
  • 00:16:12
    new field or there's a problem and so we
  • 00:16:15
    can check for these things you don't
  • 00:16:17
    have to do things the way that I'm
  • 00:16:19
    describing but the concept I think is
  • 00:16:21
    pretty transferable having statistics
  • 00:16:22
    about the data that you write enable a
  • 00:16:24
    lot more powerful things
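
A rough sketch of that partition-statistics idea, written against PySpark; this is not Metacat's actual API, and the demo columns are hypothetical.

```python
# Illustrative only: compute a few statistics about a partition at write time,
# in the spirit of the partition objects described above. Not Metacat's API;
# the demo columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition_stats_sketch").getOrCreate()


def partition_stats(df, metric_col):
    """Row count, per-column null counts, and the max of one metric column."""
    null_exprs = [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    return {
        "row_count": df.count(),
        "null_counts": df.agg(*null_exprs).first().asDict(),
        "max_" + metric_col: df.agg(F.max(metric_col)).first()[0],
    }


# Tiny demo partition; in practice this would be the partition just written.
demo = spark.createDataFrame(
    [("play", 120), ("pause", None), ("play", 0)],
    ["event_type", "watch_seconds"],
)
print(partition_stats(demo, "watch_seconds"))
# e.g. flag the partition if max_watch_seconds == 0 or a null count spikes
```
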
  • 00:16:32
    okay so now that we have the statistics
  • 00:16:34
    I mean we can use them in isolation and
  • 00:16:36
    say oh we got a zero that's a problem
  • 00:16:37
    but typically the issues are a little
  • 00:16:40
    bit more difficult to find than than
  • 00:16:43
    just that and so what we can do is take
  • 00:16:45
    that data in chart for example row
  • 00:16:47
    counts over time and you can see that
  • 00:16:49
    we've got peaks and valleys here and
  • 00:16:51
    this is really denoting that there's
  • 00:16:53
    some difference in behavior based upon
  • 00:16:56
    the day of week and so if we use a
  • 00:17:00
    standard normal deviate distribution we
  • 00:17:04
    can look for something that falls
  • 00:17:06
    outside of like a 90% confidence
  • 00:17:08
    interval and if it does we can be pretty
  • 00:17:11
    confident that maybe there's not a
  • 00:17:13
    problem but we definitely want someone
  • 00:17:14
    to go look to see if there's a problem
  • 00:17:16
    and so when we compare this for the same
  • 00:17:20
    day of week week over week for 30
  • 00:17:23
    periods we start to see that we have
  • 00:17:25
    some outliers we have some things that
  • 00:17:26
    might be problems we can also see that
  • 00:17:30
    the data that we wrote most
  • 00:17:32
    recently looks really suspect because I
  • 00:17:35
    wrote 10 billion rows and typically I
  • 00:17:41
    write between 80 and a hundred billion
  • 00:17:43
    rows right so chances are there's a
  • 00:17:46
    problem with this particular run of the
  • 00:17:48
    ETL
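
A minimal sketch of the check just described: compare the new run's row count with the same day of week over roughly 30 weekly periods and halt the run when it falls outside a ~90% confidence band, so the bad partition never becomes visible. The numbers and function names are made up for illustration.

```python
# Illustrative only: same-day-of-week row-count check with a ~90% confidence band.
import statistics


def audit_row_count(new_count, same_weekday_history, z_threshold=1.645):
    """Fail the run if the new partition falls outside a ~90% confidence band."""
    mean = statistics.mean(same_weekday_history)
    stdev = statistics.stdev(same_weekday_history)
    z = (new_count - mean) / stdev
    if abs(z) > z_threshold:
        raise RuntimeError(f"row count {new_count}B vs mean {mean:.1f}B (z={z:.1f})")


# ~30 prior weekly periods for the same day of week, in billions of rows (made up).
prior_mondays = [92, 88, 95, 91, 97, 85, 93, 90, 94, 89] * 3

try:
    audit_row_count(10, prior_mondays)   # today's suspicious 10-billion-row run
    print("publish the partition to the warehouse and reporting layers")
except RuntimeError as err:
    print(f"ETL halted before publish: {err}")
```
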
  • 00:17:53
    so we can detect the issues but that
  • 00:17:56
    doesn't really prevent the problem of
  • 00:17:58
    the impact the perennial question can I
  • 00:18:02
    trust this report can I trust this data
  • 00:18:05
    I have no idea
  • 00:18:07
    looking at this if there's a data
  • 00:18:08
    quality issue and what's really
  • 00:18:11
    problematic and what is really the issue
  • 00:18:14
    for you guys is when people look at
  • 00:18:17
    these reports they trust the reports and
  • 00:18:19
    then afterwards we tell them that data
  • 00:18:22
    is actually wrong we're gonna back it
  • 00:18:25
    out we're gonna fix it for you
  • 00:18:27
    there was no indication to them looking
  • 00:18:30
    at this report that they couldn't trust
  • 00:18:32
    it but now the next time they look at
  • 00:18:33
    this report guess what it's gonna be
  • 00:18:36
    there in the back of their mind is this
  • 00:18:37
    data good can I trust it so what we've
  • 00:18:44
    done is built a process that checks for
  • 00:18:47
    these before the data becomes visible
  • 00:18:48
    and all of the bad unfortunate stuff can
  • 00:18:51
    still happen we still have the data
  • 00:18:54
    coming in it's still landing in our
  • 00:18:56
    ingestion layer but before we write it
  • 00:18:59
    out to our data warehouse we were
  • 00:19:00
    checking for those standard deviations
  • 00:19:02
    and when we find exceptions we fail the
  • 00:19:05
    ETL we don't go any further in the
  • 00:19:07
    process we also check before we get to
  • 00:19:10
    our reporting layer same thing what this
  • 00:19:16
    means is that your user is not going to
  • 00:19:19
    see their data right they're gonna come
  • 00:19:21
    to the report and it looks like this
  • 00:19:26
    your user your business user is going to
  • 00:19:30
    see there's missing data and now they're
  • 00:19:32
    going to know there was a quality issue
  • 00:19:34
    and we don't want them to know that
  • 00:19:36
    right
  • 00:19:37
    wrong we want them to know there was a
  • 00:19:41
    problem because it's not your fault
  • 00:19:43
    there was a problem it's there's so many
  • 00:19:46
    things that could go wrong but simply by
  • 00:19:48
    showing them this explicitly that we
  • 00:19:51
    have no data they retain confidence they
  • 00:19:55
    know they're not making decisions
  • 00:19:57
    based on bad data and your
  • 00:20:00
    business should not be making major
  • 00:20:02
    decisions on a single day's worth of
  • 00:20:04
    data
  • 00:20:05
    where it becomes really problematic is
  • 00:20:07
    when you're doing trends and percent
  • 00:20:09
    changes and you know those things even
  • 00:20:11
    bad data can really have a big impact so
  • 00:20:18
    one thing that you do have to do to make
  • 00:20:20
    this work is you have to surface the
  • 00:20:23
    information so that users can really see
  • 00:20:25
    when was the data last loaded when did
  • 00:20:27
    we last validate it so
  • 00:20:30
    there's two things not showing them bad
  • 00:20:32
    data and providing visibility into the
  • 00:20:34
    current state
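
A tiny sketch of the kind of freshness information a report could surface; the metadata fields and thresholds are hypothetical.

```python
# Illustrative only: surface "last loaded / last validated" to report users.
from datetime import datetime, timedelta

table_meta = {
    "last_loaded": datetime(2017, 10, 10, 6, 42),
    "last_validated": datetime(2017, 10, 10, 6, 45),
}


def freshness_banner(meta, now, max_age=timedelta(hours=24)):
    age = now - meta["last_loaded"]
    status = "FRESH" if age <= max_age else "STALE"
    return (f"[{status}] loaded {meta['last_loaded']:%Y-%m-%d %H:%M}, "
            f"validated {meta['last_validated']:%Y-%m-%d %H:%M}")


print(freshness_banner(table_meta, now=datetime(2017, 10, 11, 9, 0)))
```
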
  • 00:20:35
    we're also this is a view of our big
  • 00:20:37
    data portal it's an internal tool that
  • 00:20:39
    we've developed I think there's other
  • 00:20:40
    third-party tools out there that might
  • 00:20:42
    do some of their things we're also
  • 00:20:43
    planning to add visibility to the actual
  • 00:20:45
    failures and alerts so that business
  • 00:20:47
    users can see those but so now we've
  • 00:20:52
    we've detected the issue we've prevented
  • 00:20:54
    there from being any negative impact but
  • 00:20:56
    we still have to fix the problem right
  • 00:20:59
    they still want the data at the end of
  • 00:21:01
    the day there's two components to fixing
  • 00:21:05
    the problem quickly the first one is as
  • 00:21:07
    I just mentioned visibility but this
  • 00:21:09
    time visibility for the people who need
  • 00:21:10
    to understand what the problem is and
  • 00:21:11
    they need to fix it so one of the things
  • 00:21:13
    that we're doing is surfacing this
  • 00:21:15
    information you know the question might
  • 00:21:17
    be why did my job suddenly spike in in
  • 00:21:20
    run time right why is this taking so
  • 00:21:22
    long and you can look here and you can
  • 00:21:25
    easily see oh it's because you received
  • 00:21:27
    a lot more data and then this becomes a
  • 00:21:30
    question well is that because somebody
  • 00:21:32
    deployed something upstream and now it's
  • 00:21:33
    duplicating everything I mean it gives
  • 00:21:36
    you a starting point to understand what
  • 00:21:37
    are the problems we also directly
  • 00:21:40
    display the failure message or
  • 00:21:42
    the failures and then give you a link to
  • 00:21:44
    go see the failure messages themselves
  • 00:21:45
    so that when users are trying to
  • 00:21:47
    troubleshoot it again we're just trying
  • 00:21:49
    to make it easier and faster for them to
  • 00:21:51
    get there and we are exposing the
  • 00:21:58
    relationships between data sets so this
  • 00:22:01
    is the lineage data you know how do
  • 00:22:03
    these things relate what things are
  • 00:22:05
    waiting on me to fix this and who do I
  • 00:22:08
    need to notify that there's a problem
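
Lineage data like this can be walked programmatically; below is a minimal sketch, with a hypothetical dependency graph, of finding everything downstream of a table you just fixed so you know what to re-trigger or whom to notify.

```python
# Illustrative only: walk a table-level lineage graph to find all downstream tables.
from collections import deque

lineage = {                      # table -> tables that read directly from it (hypothetical)
    "playback_detail": ["daily_agg", "device_agg"],
    "daily_agg": ["exec_dashboard_extract", "region_summary"],
    "device_agg": ["qoe_report"],
    "region_summary": [],
    "exec_dashboard_extract": [],
    "qoe_report": [],
}


def downstream_of(table, graph):
    seen, queue = set(), deque([table])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


print(downstream_of("playback_detail", lineage))
# -> everything that must be reflowed, or whose owners must be notified
```
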
  • 00:22:12
    all right now I'm gonna cover this real
  • 00:22:14
    quick this is about scheduling and
  • 00:22:18
    pulling versus pushing
  • 00:22:20
    not something that you guys here would
  • 00:22:23
    implement but something that you should
  • 00:22:24
    be having conversations with your
  • 00:22:26
    infrastructure teams about traditionally
  • 00:22:28
    we use a schedule based system where we
  • 00:22:33
    say ok it's 6 o'clock my job's gonna run
  • 00:22:34
    and I'm gonna take that 700 billion
  • 00:22:37
    events and I'm gonna create this really
  • 00:22:38
    clean detailed table and then at 7
  • 00:22:41
    o'clock I'm gonna have another job run
  • 00:22:43
    and it's gonna aggregate that data at 8
  • 00:22:47
    o'clock I'm gonna have another process
  • 00:22:50
    that runs and it's going to normalize it
  • 00:22:51
    to get it ready for my report I'm gonna
  • 00:22:55
    copy it over to my fast access layer at
  • 00:22:57
    8:30 and by 9 o'clock my report should
  • 00:23:00
    be ready in a push based system you
  • 00:23:04
    might still have some scheduling
  • 00:23:05
    component to it you might say well I
  • 00:23:07
    want everything to start at 6:00 a.m.
  • 00:23:08
    but the difference is that once this job
  • 00:23:11
    is done it notifies the aggregate job
  • 00:23:14
    that it's ready to run because there's
  • 00:23:15
    new data which notifies the
  • 00:23:17
    de-normalized job that it's ready to run
  • 00:23:19
    which notifies or just executes the
  • 00:23:22
    extract over to your fast access layer
  • 00:23:25
    and your report becomes available to
  • 00:23:27
    everybody by maybe 7:42 you could see
  • 00:23:33
    the benefits here and being able to get
  • 00:23:35
    to data and getting the data out faster
  • 00:23:37
    to your users this is not why you guys
  • 00:23:39
    care about this this is probably why
  • 00:23:41
    your business users might care about it
  • 00:23:43
    why you guys care about it is because
  • 00:23:45
    things don't always work perfectly in
  • 00:23:50
    fact they usually don't and when that
  • 00:23:52
    happens you're gonna have to reflow and
  • 00:23:53
    fix things so this is an actual table
  • 00:23:56
    that we have we have one table that
  • 00:24:00
    populates six tables which populates 38
  • 00:24:03
    tables which populates 586 this is a
  • 00:24:07
    pretty run-of-the-mill table for us I
  • 00:24:10
    have one table that by the third level
  • 00:24:11
    of dependency has 2000 table
  • 00:24:15
    dependencies so how do we fix the data
  • 00:24:18
    when the data has started off on one
  • 00:24:20
    place and has been propagated to all of
  • 00:24:22
    these other places in a full system you
  • 00:24:26
    rerun your job and my my detailed data
  • 00:24:30
    my aggregate my
  • 00:24:32
    normalised view all of these views get
  • 00:24:34
    updated and my report is good but the
  • 00:24:37
    other 582 tables are kind of left
  • 00:24:41
    hanging and you could notify them if you
  • 00:24:45
    have visibility to who these people are
  • 00:24:47
    that are consuming your data but they
  • 00:24:50
    still have to go take action and what's
  • 00:24:52
    gonna happen is you're gonna tell them
  • 00:24:53
    hey we've had this data quality issue
  • 00:24:55
    and we reflowed and it's really important
  • 00:24:58
    that you rerun your job and they're
  • 00:24:59
    gonna think okay yeah but I deprecated
  • 00:25:01
    that and yeah I might have forgot to
  • 00:25:03
    turn off my ETL and they have no idea
  • 00:25:05
    that somebody else I started to rely
  • 00:25:07
    upon that data for the report right
  • 00:25:09
    happens all the time people don't feel
  • 00:25:11
    particularly incentivized to go rerun
  • 00:25:14
    and clean up things unless they know
  • 00:25:15
    what the impact is in a push system we
  • 00:25:20
    fix the one table it notifies the next
  • 00:25:23
    tables that there's new data which
  • 00:25:25
    notifies those tables that there's new
  • 00:25:26
    data and everything gets fixed
  • 00:25:29
    downstream this is a perfect world it's
  • 00:25:32
    very idealistic you this is like a very
  • 00:25:35
    pure push type system what you should be
  • 00:25:39
    having discussions with your your
  • 00:25:41
    internal infrastructure team is is that
  • 00:25:43
    you should not need to know that there
  • 00:25:45
    is an upstream issue you should just be
  • 00:25:47
    able to rely upon the fact that when
  • 00:25:50
    there is a problem
  • 00:25:51
    your jobs will be executed for you so
  • 00:25:53
    that you can rely upon that data nobody
  • 00:25:55
    should have to go do that manually it
  • 00:25:57
    doesn't scale part four
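
A minimal sketch of that push idea: register which tables each job reads and writes, and let a completed job trigger its dependents instead of waiting on a clock. Job and table names are hypothetical.

```python
# Illustrative only: a toy push-based dependency trigger.
from collections import defaultdict

jobs = {}                        # job name -> {"reads": [...], "writes": "table"}
downstream = defaultdict(list)   # table -> jobs to trigger when it gets new data


def register(name, reads, writes):
    jobs[name] = {"reads": reads, "writes": writes}
    for table in reads:
        downstream[table].append(name)


def notify(table):
    """Push: new data in `table` immediately kicks off every dependent job."""
    for name in downstream[table]:
        run_job(name)


def run_job(name):
    spec = jobs[name]
    print(f"running {name}: {spec['reads']} -> {spec['writes']}")
    # ... the actual Spark/SQL transformation would run here ...
    notify(spec["writes"])       # cascade to whatever reads this table


register("build_detail", reads=["raw_events"], writes="detail")
register("build_aggregate", reads=["detail"], writes="aggregate")
register("build_report_extract", reads=["aggregate"], writes="report_extract")

# One upstream fix (or the 6 a.m. landing of raw events) cascades through the
# whole chain; nobody has to remember to rerun the downstream jobs by hand.
notify("raw_events")
```
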
  • 00:26:05
    so we've gone through this and we talked
  • 00:26:10
    about some of the different ways that
  • 00:26:11
    that unfortunate data would impact us
  • 00:26:13
    but that's not the reality for us the
  • 00:26:16
    reality is that our users do have
  • 00:26:19
    confidence in the data we produce that
  • 00:26:21
    doesn't mean there's not quality issues
  • 00:26:22
    but overall generally speaking they have
  • 00:26:25
    confidence that what we produce and what
  • 00:26:26
    we provide to them is good and because
  • 00:26:29
    of that were able to do some really cool
  • 00:26:30
    things our executives actually looked up
  • 00:26:34
    the data for how content was being used
  • 00:26:37
    and the efficiency of it and they made a
  • 00:26:39
    decision based upon the data to start
  • 00:26:43
    investing in originals and over the next
  • 00:26:47
    few years we went from I think it was a
  • 00:26:51
    handful of hours in 2012 to about a
  • 00:26:54
    thousand hours of content in 2017 we've
  • 00:26:58
    ramped this up very very quickly and we
  • 00:27:01
    have set a goal of by 2020 50% of our
  • 00:27:06
    content will be original content 50% of
  • 00:27:08
    that six billion dollars that were
  • 00:27:10
    spending will be on new content that we
  • 00:27:11
    create but this was a strategy decision
  • 00:27:14
    that was informed by the data that we
  • 00:27:16
    had we also have our product managers
  • 00:27:21
    here are looking at the data and they
  • 00:27:24
    are I mean they're making some pretty
  • 00:27:27
    good selections with what content we
  • 00:27:30
    should be purchasing on the service
  • 00:27:31
    we've had some pretty good ones and
  • 00:27:32
    they're going to continue to use the
  • 00:27:34
    data to decide what is the next best
  • 00:27:37
    thing for us to buy we have our software
  • 00:27:41
    engineers who have built out constantly
  • 00:27:49
    evolving user interfaces and user
  • 00:27:54
    experience for us and this doesn't
  • 00:27:56
    happen all at once this isn't like a
  • 00:27:57
    monolithic project where they just roll
  • 00:28:00
    out these big changes instead they are
  • 00:28:04
    testing these they're making small
  • 00:28:06
    changes they're testing them
  • 00:28:07
    incrementally they're making the
  • 00:28:08
    decision to roll it out we we have about
  • 00:28:10
    like a hundred different tests going on
  • 00:28:11
    right this moment to see what is the
  • 00:28:13
    best thing that we should do and so you
  • 00:28:15
    can see what we looked like in 2017 or
  • 00:28:17
    2016
  • 00:28:18
    here's what we look like in 2017
  • 00:28:50
    so you can imagine for a moment the
  • 00:28:54
    amount of complexity and different
  • 00:28:57
    systems that are involved in making
  • 00:28:58
    something like that happen and before we
  • 00:29:01
    really make that investment we go and
  • 00:29:03
    test it is our theory correct one thing
  • 00:29:06
    that I think is really interesting is
  • 00:29:07
    that we find oftentimes we can predict
  • 00:29:12
    what people will like if those people
  • 00:29:13
    are exactly like us and usually people
  • 00:29:17
    are not exactly like us and so instead
  • 00:29:19
    we just throw it out there whenever we
  • 00:29:20
    have ideas we test them and then we we
  • 00:29:23
    respond to the results and then we have
  • 00:29:27
    our algorithm engineers these are the
  • 00:29:30
    people who are responsible for just
  • 00:29:33
    putting the intelligence in our system
  • 00:29:34
    and making it fast and and seamless and
  • 00:29:37
    so the most well-known case is our
  • 00:29:39
    recommendation systems I could talk
  • 00:29:41
    about it but I actually think this video
  • 00:29:43
    is a little bit more interesting
  • 00:29:48
    [Music]
  • 00:30:03
    [Music]
  • 00:30:12
    [Music]
  • 00:30:26
    [Music]
  • 00:30:45
    just
  • 00:30:49
    [Applause]
  • 00:30:53
    [Music]
  • 00:30:55
    now we did not create the video
  • 00:30:58
    but we provided the data behind the
  • 00:31:01
    video 80% of people watch content
  • 00:31:05
    through recommendations 80% we did not
  • 00:31:10
    start off there at that number it was
  • 00:31:13
    only through constant iteration looking
  • 00:31:15
    at the data and getting or responding to
  • 00:31:19
    the things that had the positive results
  • 00:31:20
    that we got to this place okay so we'll
  • 00:31:24
    start wrapping up the key takeaways
  • 00:31:27
    obviously I don't expect you guys to go
  • 00:31:31
    back and do this in your environment but
  • 00:31:34
    I think that there are some key
  • 00:31:35
    principles that really make sense for a
  • 00:31:38
    lot of people outside of just what we're
  • 00:31:40
    doing here at Netflix the first one is
  • 00:31:43
    that expecting failure is more efficient
  • 00:31:47
    than trying to prevent it and this is
  • 00:31:51
    true for your your data teams but this
  • 00:31:53
    is also true for you as data
  • 00:31:55
    visualization people how can you expect
  • 00:31:59
    and respond to failures and to issues
  • 00:32:01
    with the data and with your reports
  • 00:32:03
    rather than trying to prevent them so
  • 00:32:06
    shift your mind mindset and say I know
  • 00:32:09
    it's gonna happen what are we gonna do
  • 00:32:11
    when it happens stale data you know I
  • 00:32:20
    have never heard someone tell me I would
  • 00:32:23
    have rather had the incomplete data
  • 00:32:25
    faster than I had the stale accurate
  • 00:32:28
    data I am sure there are cases out there
  • 00:32:30
    where that is true
  • 00:32:31
    but it is almost never the case people
  • 00:32:35
    don't make decisions based upon one hour
  • 00:32:38
    or one day's worth of data they might
  • 00:32:41
    want to know what's happening they might
  • 00:32:42
    say I just launched something and I
  • 00:32:44
    really want to know how it's performing
  • 00:32:45
    I mean that's it that's a natural human
  • 00:32:47
    trait that curiosity but it is not
  • 00:32:52
    impactful right and so ask yourself
  • 00:32:56
    would they rather see data faster and
  • 00:32:59
    have it be wrong or would they rather
  • 00:33:01
    know that when they do finally see the
  • 00:33:03
    data and maybe a few minutes later maybe
  • 00:33:05
    an hour later
  • 00:33:06
    that it's right and this is actually I
  • 00:33:11
    know this is really hard it's easy to
  • 00:33:13
    tell you guys this I know that you have
  • 00:33:14
    to go back to your business users and
  • 00:33:15
    tell them this I realized that but you
  • 00:33:18
    know if you explain it to them in this
  • 00:33:19
    way usually they can begin to understand
  • 00:33:23
    and then the last thing this is really
  • 00:33:27
    validation for you guys I've had a lot
  • 00:33:30
    of people ask me how do you guys do this
  • 00:33:33
    and we have all these problems and the
  • 00:33:35
    reality is that we have those problems
  • 00:33:37
    too
  • 00:33:38
    this stuff is really really hard it's
  • 00:33:41
    hard to get right it's hard to do well
  • 00:33:44
    it's hard for us and I think we'd do it
  • 00:33:47
    pretty well I think there's a lot of
  • 00:33:48
    things we can do better though so don't
  • 00:33:51
    just know that you're not alone it's not
  • 00:33:53
    you it's not your environment take this
  • 00:33:55
    slide back to your boss and show them
  • 00:33:58
    like this stuff is really really hard to
  • 00:34:00
    do right ok please when you are done
  • 00:34:07
    complete the session survey and that's
  • 00:34:11
    my talk
  • 00:34:13
    [Applause]
  • 00:34:27
    anybody have questions
  • 00:34:33
    and if you have questions and you're not
  • 00:34:34
    able to stay right now feel free to find
  • 00:34:36
    me afterwards I'm happy to answer them
  • 00:34:41
    hi thanks for the very insightful talk I
  • 00:34:45
    had one question I was curious when you
  • 00:34:47
    showed the dependency diagram let's say
  • 00:34:49
    you have a mainstream table that has a
  • 00:34:51
    couple of hundred or let's say a
  • 00:34:53
    thousand tables feeding downstream
  • 00:34:55
    dependencies and if you have to make a
  • 00:34:57
    change I'm sure like no table design is
  • 00:35:00
    constant so for the upcoming changes how
  • 00:35:02
    do you manage to like ensure that you
  • 00:35:05
    are still flexible for those changes and
  • 00:35:07
    also are fulfilling these downstream
  • 00:35:09
    dependencies for the historical data
  • 00:35:11
    also so the question is how do you
  • 00:35:15
    identify and ensure that there are no
  • 00:35:17
    issues and your downstream dependencies
  • 00:35:18
    when you have thousands of tables no
  • 00:35:21
    let's say no it's rather like if you had
  • 00:35:23
    to make a change how do you make a quick
  • 00:35:25
    enough change and also make sure that
  • 00:35:26
    you have the data historically to for a
  • 00:35:30
    major upstream table that you are you
  • 00:35:34
    wanting the change applied to your
  • 00:35:36
    downstream tables yeah oh okay good
  • 00:35:38
    question so how do you how do you
  • 00:35:40
    quickly evolve schema probably and apply
  • 00:35:43
    that downstream you know the first thing
  • 00:35:46
    is you have to have lineage data without
  • 00:35:48
    lineage data you can't really get
  • 00:35:50
    anywhere so the first step is always to
  • 00:35:52
    make sure you have lineage data the
  • 00:35:53
    second thing is around automation and
  • 00:35:56
    your ability to understand the data and
  • 00:35:57
    understand the schema so if you can
  • 00:35:59
    understand schema evolution and changes
  • 00:36:01
    there then you can start to apply them
  • 00:36:02
    programmatically at any point where you
  • 00:36:04
    can automate so automate automation of
  • 00:36:07
    ingestion and the export of data would
  • 00:36:10
    be an optimal place beyond that you
  • 00:36:12
    could start to you know it becomes a
  • 00:36:16
    little bit more pragmatic it depends on
  • 00:36:18
    how complex your scripts are right so if
  • 00:36:19
    you've got pretty simple like select
  • 00:36:20
    from here group by block you could
  • 00:36:24
    automate that if you had all the data
  • 00:36:26
    sitting in like a centralized code base
  • 00:36:29
    but that becomes a lot more tricky and
  • 00:36:31
    we actually have not gone that route
  • 00:36:32
    because we we want people to have
  • 00:36:34
    control instead what we've done is we've
  • 00:36:35
    notified them that there is going to be
  • 00:36:37
    a change we notify them when the change
  • 00:36:40
    is made and
  • 00:36:40
    it's on them to actually apply the
  • 00:36:43
    changes and then everything that we can
  • 00:36:44
    automate because there's a simple select
  • 00:36:46
    move from here to there we've automated
  • 00:36:47
    so they don't have to worry about that
  • 00:36:49
    good question one more question yes so
  • 00:36:51
    with the rising speech based AI
  • 00:36:54
    assistants so in the future if Netflix
  • 00:36:56
    gets support of let's say Google home or
  • 00:36:59
    Siri how do we think that the how do we
  • 00:37:10
    think the data flow is going to be
  • 00:37:11
    affected by home systems and speech
  • 00:37:14
    recognition yeah I mean there'll be a
  • 00:37:16
    lot more data than we are used to right
  • 00:37:18
    right now yes I don't have a specific
  • 00:37:23
    answer but I would imagine that it's
  • 00:37:25
    just gonna be another endpoint for us
  • 00:37:27
    and that it will just be writing out to
  • 00:37:28
    those ingestions the you know typically
  • 00:37:32
    what's happening in those cases is that
  • 00:37:34
    there's like another service out there
  • 00:37:35
    on Alexa or Google that is doing that
  • 00:37:37
    translation so we probably wouldn't get
  • 00:37:39
    visibility that it would be more of like
  • 00:37:41
    you know this phrase was made and then
  • 00:37:44
    here was the content that was shown and
  • 00:37:45
    we'd probably logged that in the same
  • 00:37:46
    way we log pretty much everything but I
  • 00:37:48
    don't actually know that for sure thanks
  • 00:37:50
    I I have a few questions but I'll try to
  • 00:37:56
    put it in one I missed the first part of
  • 00:38:01
    your presentation where you're drawing
  • 00:38:03
    the architecture and you're showing us
  • 00:38:04
    how data reaches the tableau dashboard
  • 00:38:07
    you mentioned later on that you're using
  • 00:38:09
    tableau data extracts when you're
  • 00:38:11
    building extracts or and feeding the
  • 00:38:14
    dashboard at a certain time my question
  • 00:38:17
    is do you have live connections as the
  • 00:38:19
    data is flowing through your ETL process
  • 00:38:21
    checking the quality and running
  • 00:38:23
    analytics for people to actually do
  • 00:38:25
    visualize and see and make a change
  • 00:38:30
    based on what they see where you live
  • 00:38:32
    connection to the data sets and running
  • 00:38:35
    analytics on top of that using tableau
  • 00:38:38
    or are you guys using tableau data
  • 00:38:44
    extracts I mean this is actually I'll
  • 00:38:47
    defer this to my colleague Jason sitting
  • 00:38:50
    over here who works more with the
  • 00:38:51
    tableau stuff and
  • 00:38:52
    yeah I I think what you're getting at
  • 00:38:54
    you're talking about the data itself on
  • 00:38:57
    the quality of the data are we
  • 00:39:00
    visualizing that in Tableau is the
  • 00:39:02
    question the data itself not the metadata
  • 00:39:05
    how many notes you got and all the
  • 00:39:07
    performance metrics that she showed sure
  • 00:39:09
    really looking at the quality of the
  • 00:39:11
    data and something for a product owner
  • 00:39:13
    to make a decision on hey how do I look
  • 00:39:16
    as the process is going on using live
  • 00:39:18
    connections you're talking about a lot
  • 00:39:20
    of data and how do you stream all of
  • 00:39:22
    that a lot of data onto the live
  • 00:39:24
    connection and not affect your table
  • 00:39:26
    dashboards performance because you
  • 00:39:28
    really have to go back and look through
  • 00:39:30
    terabytes of data to be able to get a
  • 00:39:32
    visualization yeah what I would say
  • 00:39:34
    there is a you know the the data around
  • 00:39:37
    the quality checks that are happening is
  • 00:39:39
    surface not so much to be a tableau live
  • 00:39:43
    connection today but I think there's
  • 00:39:45
    real opportunity with this is actually
  • 00:39:47
    having some sort of aspect of your
  • 00:39:52
    dashboard that surfaces that information
  • 00:39:54
    so in a more kind of easy-to-understand
  • 00:39:56
    way having just an indicator on your
  • 00:39:59
    dashboard hey you know this dashboard
  • 00:40:00
    has some stale data it's being worked on
  • 00:40:03
    that's probably more realistic you know
  • 00:40:06
    thing that we would do is kind of
  • 00:40:08
    surface it ad on the dashboard rather
  • 00:40:10
    than you know the business user maybe
  • 00:40:12
    they won't have as much context to
  • 00:40:14
    understand the complexity of why the
  • 00:40:16
    data quality you know is not good so I
  • 00:40:19
    could see surfacing that kind of like
  • 00:40:21
    high level data point in a dashboard
  • 00:40:22
    okay
  • 00:40:24
    I think is a long line but perhaps yeah
  • 00:40:27
    why don't you swing around you can dig
  • 00:40:28
    it more and we'll stick around to guys
  • 00:40:30
    so if you don't have a chance to ask
  • 00:40:32
    your question yeah appreciate it thanks
  • 00:40:34
    I heard you um good question how do you
  • 00:40:38
    attribute viewing to the recommendation
  • 00:40:41
    engine versus anything that you can't
  • 00:40:44
    measure off like off platform like so
  • 00:40:46
    your traditional ads word of mouth like
  • 00:40:48
    someone has told me about Luke Cage and
  • 00:40:51
it got recommended to me so then I
  • 00:40:53
    watched it but my first discovery moment
  • 00:40:55
was really the word of mouth of my
  • 00:40:57
    friend that's a great question how do we
  • 00:40:59
    attribute
  • 00:41:01
    a view from like a search engine versus
  • 00:41:04
    a recommendation and the answer is I
  • 00:41:05
    don't know I'm an engineer so I don't
  • 00:41:07
    actually have that insight but that's a
  • 00:41:09
    good question thank you sorry hello my
  • 00:41:15
    question is about the data validation
  • 00:41:16
    step you are saying that you take a
  • 00:41:18
    normal distribution you say look we're
  • 00:41:20
    uploading normally 80 to 100 million
  • 00:41:22
    rows and so we kind of find this and
  • 00:41:24
    this is what we think we should be able
  • 00:41:25
    to see there are obviously lots of
  • 00:41:27
    events that will cause things to be
  • 00:41:29
outside of those normal bounds so I mean
  • 00:41:32
just say a catastrophic event like the
  • 00:41:32
    power outage that hit in 2003 and shut
  • 00:41:37
    down the eastern seaboard I'm gonna go
  • 00:41:39
    out on a limb and say viewership dropped
  • 00:41:41
    well outside of your normal distribution
  • 00:41:44
    if that happened today so my question is
  • 00:41:46
when you're doing this and you're
  • 00:41:47
    saying this data is not good and you've
  • 00:41:50
    set up this automated process that's
  • 00:41:52
alerting on that how do you then
  • 00:41:54
intervene to say well actually no there's
  • 00:41:55
    a reason for this or we've looked at it
  • 00:41:57
and we actually think this is a good
  • 00:41:58
    data packet and then push it back
  • 00:42:00
through great question so to
  • 00:42:03
    reiterate it would be when there is a
  • 00:42:06
    data quality issue or when we find that
  • 00:42:08
there's exceptions how do we
  • 00:42:11
    communicate and notify that that really
  • 00:42:14
wasn't a problem or that it really was
  • 00:42:16
an issue outside of this and we wanted
  • 00:42:18
    our job to continue is that right yeah
  • 00:42:20
    so the first thing is that I really
  • 00:42:23
    glossed over that whole piece a lot so
  • 00:42:25
    normal distribution is one of the things
  • 00:42:27
that we can do we also use
  • 00:42:28
    Poisson distributions we use anomaly
  • 00:42:30
    detection so there's other more
  • 00:42:31
sophisticated ways that we can use to
  • 00:42:33
    ensure that there's not problems when we
  • 00:42:34
    do find that there are major problems we
  • 00:42:36
are working right now on a learning
  • 00:42:39
    service that will allow people to
  • 00:42:41
    communicate you know here's this issue
  • 00:42:44
    I'm gonna acknowledge the issue I'm
  • 00:42:46
    going to annotate what the problem was
  • 00:42:47
    and then I'm gonna let my ETL move on so
  • 00:42:50
    we're working on that right now but at
  • 00:42:52
    the moment it would just be a matter of
  • 00:42:54
    going off and like manually releasing
  • 00:42:57
    the job to continue to move on to the
  • 00:42:58
    next step
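    A minimal sketch of the kind of audit and manual-release flow described above, assuming a simple z-score check of a daily row count against recent history; all names are hypothetical and this is not Netflix's actual service:

```python
# Hypothetical sketch: audit a daily row count against recent history and only
# let the ETL step continue if the check passes or a human has released it.
import statistics

def row_count_ok(history, todays_count, z_threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = abs(todays_count - mean) / stdev if stdev else 0.0
    return z <= z_threshold

def release_step(history, todays_count, acknowledged=False, annotation=None):
    """'continue' if the audit passes or the anomaly was acknowledged, else 'hold'."""
    if row_count_ok(history, todays_count):
        return "continue"
    if acknowledged:                 # manual release, e.g. a known real-world event
        print(f"Anomaly acknowledged: {annotation}")
        return "continue"
    return "hold"                    # alert the owning team and pause downstream jobs

# Normally ~80-100M rows; today only 40M arrived
recent = [93e6, 88e6, 97e6, 85e6, 91e6, 90e6, 95e6]
print(release_step(recent, 40e6))                                        # hold
print(release_step(recent, 40e6, acknowledged=True, annotation="regional outage"))
```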
  • 00:42:59
    does that power sit in one person or a
  • 00:43:02
    small group of people that sit down and
  • 00:43:04
    say yes we think this is good we've
  • 00:43:06
    discussed this really quickly in a
  • 00:43:07
    30-minute session we think the ETL can
  • 00:43:10
    continue or is it not that critical for
  • 00:43:12
    you guys to be that fast or how does
  • 00:43:15
    that decision actually get made
  • 00:43:17
I'm gonna defer to my colleague
  • 00:43:20
    yeah no worries
  • 00:43:20
    so I'd say it comes down to the data
  • 00:43:23
    point and the team that owns that data
  • 00:43:25
    it's not really a centralized decision
  • 00:43:27
per se oftentimes you know the
  • 00:43:30
    most relevant team or person would get
  • 00:43:32
    alerted usually it's not just one
  • 00:43:34
    individual usually it's a team and they
  • 00:43:37
    would dig in see if there is actually a
  • 00:43:39
data issue if there is they would you
  • 00:43:41
    know fix that if not they would release
  • 00:43:43
    things and hopefully improve the audit
  • 00:43:46
    so that that same type of anomaly
  • 00:43:48
doesn't trip the audit again time
  • 00:43:50
    frame for those types of decisions again
  • 00:43:53
    it can depend on the data set there are
  • 00:43:55
    some data sets that you know you want to
  • 00:43:58
    have up and running 24/7 and then there
  • 00:44:00
    are other data sets that you know run
  • 00:44:02
    once a day or once a week so it varies
  • 00:44:06
but I would say that you know
  • 00:44:08
    oftentimes that kind of like 24-hour
  • 00:44:11
    turnaround is a pretty normal kind of
  • 00:44:13
    time line cool thank you so much sure I
  • 00:44:18
have a question about the statistics that you
  • 00:44:21
    capture how did you determine what
  • 00:44:23
    metadata to capture for your data and
  • 00:44:25
part two if you find there's a new
  • 00:44:28
feature that you want to capture do you
  • 00:44:30
    ever have to go back through your
  • 00:44:32
    historical data and collect that good
  • 00:44:36
question so how do we collect
  • 00:44:38
    the statistics and then how do we
  • 00:44:42
    support evolution of those statistics
  • 00:44:44
    and backfill if necessary to answer your
  • 00:44:48
    question that statistic is collected
  • 00:44:49
it's collected as part of the
  • 00:44:51
    storage driver so whenever data is
  • 00:44:54
    written we are collecting that
  • 00:44:55
    information every time it's written from
  • 00:44:57
like spark or pig and we have the
  • 00:45:01
ability you know the whole
  • 00:45:03
    model here is collaboration so people
  • 00:45:06
    are welcome to create a brand new
  • 00:45:09
statistic actually we just had this
  • 00:45:11
    happen recently
  • 00:45:12
    and that statistic in this case it was
  • 00:45:15
    looking at approximate cardinality and
  • 00:45:17
    so that's something that not everybody
  • 00:45:19
    would want to turn on and so we can flag
  • 00:45:21
    it and say okay explicitly disable this
  • 00:45:24
    by default and explicitly enable this
  • 00:45:26
    for this one data set so people have
  • 00:45:27
that control where it's appropriate
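    A minimal sketch of write-time statistics collection with per-dataset flags, as described above; the statistic names, defaults, and storage hook are hypothetical:

```python
# Hypothetical sketch: collect table statistics when data is written, with
# per-dataset flags so expensive statistics stay off unless explicitly enabled.
DEFAULT_STATS = {"row_count": True, "null_count": True, "approx_cardinality": False}
DATASET_OVERRIDES = {"playback_events": {"approx_cardinality": True}}  # opt-in example

def stats_to_collect(dataset):
    flags = dict(DEFAULT_STATS)
    flags.update(DATASET_OVERRIDES.get(dataset, {}))
    return {name for name, enabled in flags.items() if enabled}

def on_write(dataset, rows):
    """Called by the storage layer whenever a batch of rows is written."""
    wanted = stats_to_collect(dataset)
    stats = {}
    if "row_count" in wanted:
        stats["row_count"] = len(rows)
    if "null_count" in wanted:
        stats["null_count"] = sum(v is None for row in rows for v in row.values())
    if "approx_cardinality" in wanted and rows:
        # exact distinct count here for simplicity; a real system would use
        # a probabilistic structure such as HyperLogLog
        stats["approx_cardinality"] = {col: len({row[col] for row in rows})
                                       for col in rows[0]}
    return stats  # persisted alongside the table/partition metadata

# Example usage
print(on_write("playback_events", [{"title": "Luke Cage", "ms": 1200},
                                   {"title": "Luke Cage", "ms": None}]))
```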
  • 00:45:29
and then we don't really worry about
  • 00:45:32
    backfilling of the statistics we have
  • 00:45:34
    logic that says okay if we don't have
  • 00:45:35
    enough data for this normal distribution
  • 00:45:38
    perhaps we should be using some other
  • 00:45:40
    detection method in the interim and once
  • 00:45:43
    we do then we'll switch over to a normal
  • 00:45:45
distribution or we'll just simply
  • 00:45:47
    invalidate that audit until we have
  • 00:45:49
enough data for the new statistic
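    A minimal sketch of the fallback logic just described — use a cruder interim check, or mark the audit as not yet valid, until a new statistic has enough history for a normal-distribution check; the thresholds are assumptions:

```python
# Hypothetical sketch: until a new statistic has enough history for a
# normal-distribution check, fall back to a cruder interim bound (or mark the
# audit as not yet valid) rather than backfilling old data.
import statistics

MIN_SAMPLES = 14  # assumption: roughly two weeks of history before trusting mean/stdev

def audit_value(history, value, z_threshold=3.0):
    if len(history) >= MIN_SAMPLES:
        mean, stdev = statistics.mean(history), statistics.stdev(history)
        ok = stdev == 0 or abs(value - mean) <= z_threshold * stdev
        return "pass" if ok else "fail"
    if history:
        # interim method: crude bound relative to the most recent observation
        return "pass" if 0.5 * history[-1] <= value <= 2.0 * history[-1] else "fail"
    return "not_enough_data"  # audit invalidated until the statistic accumulates history
```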
  • 00:45:52
but we wouldn't necessarily go back because it'd
  • 00:45:53
    be too expensive for us to do that means
  • 00:45:56
you have never done it you've never
  • 00:46:00
done the backfill or is it just not common
  • 00:46:01
we have never done the backfill
  • 00:46:03
    because it literally is part of like the
  • 00:46:04
    storage function and we would have to
  • 00:46:06
    either create something brand new
  • 00:46:08
    basically you'd have to go interrogate
  • 00:46:09
that data and for us that's very
  • 00:46:11
    very expensive so we want to touch the
  • 00:46:13
    data as little as we can which is why we
  • 00:46:14
    try to collect it whenever the data's
  • 00:46:16
    being written not that we couldn't come
  • 00:46:19
    up with a solution if we needed to but
  • 00:46:20
    we've never had a case where we really
  • 00:46:22
    needed to my question is regarding
  • 00:46:28
    tableau connectivity to interactive
  • 00:46:31
queries just to be specific like in
  • 00:46:35
order to do any complex
  • 00:46:36
    calculation in databases it's so tedious
  • 00:46:39
    like you want to calculate the standard
  • 00:46:40
deviation z-score for the last few
  • 00:46:42
    years based on the dynamic
  • 00:46:44
    parameterization you can't do it in
  • 00:46:46
    databases so easily so writing in a
  • 00:46:48
Python code or creating a web service
  • 00:46:50
    on top is so much easier but tableau
  • 00:46:53
    lacks a good way of connectivity to the
  • 00:46:56
Web API for live interactivity I mean
  • 00:47:00
do you guys try to I mean have you come
  • 00:47:03
    across these issues or are there any
  • 00:47:05
best implementation models for this
  • 00:47:08
    that I'm gonna defer
  • 00:47:13
    I'm sorry don't repeat the question
  • 00:47:15
    again I think I can answer sorry I when
  • 00:47:20
    it comes to the live connections that we
  • 00:47:23
    make use of I would say for tableau we
  • 00:47:26
    have primarily used tableau data
  • 00:47:29
    extracts in the past so for cases like
  • 00:47:31
    that where we need to go out and pull in
  • 00:47:34
    data from an external service and kind
  • 00:47:36
    of make that available we would usually
  • 00:47:38
bring it into our data warehouse and
  • 00:47:40
    materialize that into a table and then
  • 00:47:42
pull it into tableau
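    A minimal sketch of that pattern — snapshot an external service into a warehouse table and let Tableau extract from the table rather than hitting the API live; the URL, output path, and use of pandas/Parquet are illustrative assumptions, not Netflix's pipeline:

```python
# Hypothetical sketch: snapshot an external API into a warehouse table so that
# Tableau extracts read from the table instead of hitting the service live.
import json
import urllib.request

import pandas as pd

def materialize_api_snapshot(api_url, out_path):
    """Fetch a JSON payload and persist it as a columnar table for BI tools."""
    with urllib.request.urlopen(api_url) as resp:
        records = json.load(resp)
    df = pd.DataFrame(records)
    df.to_parquet(out_path)  # the table a scheduled Tableau extract refresh reads
    return df

# materialize_api_snapshot("https://example.internal/quality-metrics", "quality_metrics.parquet")
```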
  • 00:47:45
for something more like what you're talking about where you
  • 00:47:47
    have to go out and hit an external API
  • 00:47:48
    we actually use some custom
  • 00:47:51
    visualization solutions where we do
  • 00:47:53
    those types of things the Netflix person
  • 00:47:58
sorry I'll just add on to that I think
  • 00:47:59
actually the feature that they demoed this
  • 00:48:00
    morning the extensions API is sort of an
  • 00:48:03
    interesting idea like now that that's
  • 00:48:04
    available because we've built this as an
  • 00:48:06
    API driven service potentially we could
  • 00:48:08
    hook into the alert and pull through the
  • 00:48:10
    extension and place an annotation that
  • 00:48:12
    says this data that's building this
  • 00:48:15
    dashboard is suspect at the moment yes
  • 00:48:17
    there's some enhancements that maybe we
  • 00:48:19
can look at to bring in that
  • 00:48:20
    notification based on the alerting
  • 00:48:22
that's running in the ETL process
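    A minimal sketch of the idea just raised — since the alerting runs as an API-driven service, a dashboard extension (or any client) could poll an audit-status endpoint and place a "data is suspect" annotation; Flask and the route shape are assumptions for illustration only:

```python
# Hypothetical sketch: expose audit status from an API-driven alerting service
# so a dashboard extension (or any client) can show a "data is suspect" note.
from flask import Flask, jsonify

app = Flask(__name__)

# In practice this would be looked up from the auditing service, not a dict.
AUDIT_STATUS = {"daily_viewing_agg": {"suspect": True, "reason": "row count anomaly"}}

@app.route("/datasets/<name>/status")
def dataset_status(name):
    return jsonify(AUDIT_STATUS.get(name, {"suspect": False}))

# A dashboard extension could poll GET /datasets/daily_viewing_agg/status and
# place an annotation on the dashboard whenever "suspect" is true.
```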
  • 00:48:25
it's kind of out of the box but the
  • 00:48:27
    whole world is now looking into
  • 00:48:28
    evolution of the web services where
  • 00:48:30
Amazon is promoting them as a primary data
  • 00:48:33
    source for everything capturing your
  • 00:48:35
enterprise data layer as a service base
  • 00:48:37
    and a lot of BI tools are not you know
  • 00:48:40
good enough to talk to these APIs so do
  • 00:48:43
    you think of any intermediate solutions
  • 00:48:45
    like that could get to that connectivity
  • 00:48:48
or should we look at it as a separate
  • 00:48:50
use case and go as an extension real quick
  • 00:48:54
    I'm going to pause there's a really bad
  • 00:48:56
    echo so what I'm gonna do instead is
  • 00:48:57
    invite you guys all up to the front if
  • 00:48:59
    you have questions and there's several
  • 00:49:00
    of us so you'll get your question
  • 00:49:01
    answered faster but just the Netflix
  • 00:49:04
    Group is over here and if you don't have
  • 00:49:07
    time to stay and get your question
  • 00:49:08
    answered again feel free to reach out to
  • 00:49:09
    any of us during the conference thanks
  • 00:49:11
    for coming thank you
  • 00:49:13
    [Applause]
Tags
  • Netflix
  • Data Engineering
  • Analytics
  • Data Quality
  • ETL
  • Big Data
  • Data Warehousing
  • Data Visualization
  • Cloud Infrastructure
  • Content Strategy