Netflix: A series of unfortunate events: Delivering high-quality analytics
Summary
TL;DR: Michelle Ufford, who leads a data engineering team at Netflix, explains the company's approach to handling the vast amounts of data generated by over 100 million subscribers globally. Netflix writes 700 billion events per day and processes them to ensure the reliability of its analytics. Every interaction across the Netflix service is logged as an event and captured in a Kafka-backed pipeline. This data is then processed with tools like Spark and stored in a cloud-based data warehouse. Netflix embraces a mentality of expecting system and data failures rather than trying to prevent them outright, using anomaly detection to manage data quality. The company implements a push-based ETL system that cascades updates downstream efficiently, as opposed to traditional schedule-based systems. Data quality is vital because stakeholders from engineers to executives use this data in decisions that shape content investment and user experience strategy. Netflix has embraced data-driven decision-making to guide original content production, aiming to grow its share of unique content. Ufford emphasizes the complexity of managing Netflix's data at scale and the robust infrastructure and processes needed to keep data highly available and reliable for consumers. The talk underscores the importance of enabling confidence in data-driven decision-making and innovative analytics-driven solutions.
Takeaways
- 📊 Netflix writes 700 billion events daily, using them for analytics and visualization.
- 📈 Netflix employs a push-based ETL system for more efficient data processing.
- 🚀 The company's data warehouse handles 60 petabytes, growing by 300 TB daily.
- 🔎 They use anomaly detection to maintain data quality and prevent bad data.
- 💡 Netflix relies heavily on data-driven decision-making, especially in content strategy.
- 🌐 The company logs every interaction as data, from app usage to content consumption.
- 🛠️ Spark and Kafka are pivotal tools for processing Netflix's massive data streams.
- 🖥️ Data quality is ensured by statistical and metadata checks before data becomes visible.
- 📅 Data updates can trigger automatic downstream process execution in a push system.
- 🎬 Data has driven successful strategies, including investing in original content.
Timeline
- 00:00:00 - 00:05:00
Michelle Ufford introduces herself as the narrator, leading a data engineering team at Netflix focused on analytics and trusted data. She highlights Netflix's global presence and data scale, emphasizing the company's massive data collection and processing capabilities.
- 00:05:00 - 00:10:00
Michelle discusses the Netflix data architecture, including event data logging, ingestion pipelines, and a vast data warehouse built on open-source technologies. The company processes large volumes of data daily, enabling various company-wide applications.
- 00:10:00 - 00:15:00
The process of handling 'unfortunate data' is explored. Michelle explains events as user interactions, how Netflix logs them to a Kafka pipeline, processes them with big data technologies like Spark, and visualizes them using tools like Tableau for access across the company.
- 00:15:00 - 00:20:00
Problems with data reliability are addressed as 'unfortunate data,' without implying fault. Visualization tools help identify issues. Michelle emphasizes the complexity of data processes, highlighting possible points of failure and the significant traffic Netflix deals with.
- 00:20:00 - 00:25:00
She stresses the importance of data quality for decision-making, especially for a data-driven company like Netflix. Various roles within Netflix interact with data, including data engineers, analytics engineers, and visualization experts. Executives rely heavily on data for strategic decisions.
- 00:25:00 - 00:30:00
The impact of accurate data is crucial for roles like product managers and engineers at Netflix, who make decisions on content investment, algorithm adjustments, and user experience based on data insights. Netflix's strategic content decisions are deeply data-driven.
- 00:30:00 - 00:35:00
Michelle outlines strategies Netflix employs to ensure data quality, advocating for detection and response over prevention. She discusses utilizing data statistics and anomaly detection, explaining how unexpected data behavior is monitored and managed to prevent inaccurate reporting.
- 00:35:00 - 00:40:00
Handling data anomalies involves alert systems and user notifications when data is missing or suspect. Michelle highlights internal tools for data visibility and lineage, supporting efficient troubleshooting and maintaining user trust by clarifying data integrity.
- 00:40:00 - 00:49:15
Michelle reveals that despite the complexities, Netflix's processes maintain data confidence, enabling impactful decisions like content investment in originals. The talk concludes with strategic advice on managing data reliability and the challenges of data operations at scale.
Frequently Asked Questions
What is the role of a data engineer at Netflix?
A data engineer at Netflix is essentially a software engineer specializing in data, focusing on distributed systems and making data consumable for the rest of the company.
How does Netflix handle data events?
Netflix logs all interactions as events using Kafka, storing them in an AWS S3-backed data warehouse, processed by tools like Spark for analytics and visualization.
How much data does Netflix process daily?
Netflix processes about 700 billion events daily, peaking at over a trillion, with a warehouse currently at 60 petabytes and growing by 300 terabytes every day.
How does Netflix ensure data quality and trust?
Netflix uses statistical checks, anomaly detection, and a process to prevent bad data from becoming visible to maintain data quality and trust.
What innovation has Netflix implemented for ETL processes?
Netflix uses a push-based system where jobs notify downstream processes of new data, improving efficiency and accuracy in data handling.
Why is metadata important for Netflix?
Metadata provides statistics about data, aiding in anomaly detection and ensuring data quality across streams and storage environments.
How does Netflix's data infrastructure impact its content strategy?
Data analytics guide content investment decisions, such as Netflix's move into original content, which is targeted to make up 50% of its catalog by 2020.
What challenges does Netflix face with data visualization?
Challenges include ensuring timely data access, accurate data representation, and addressing performance limitations in visualization tools.
How does Netflix address bad data or unfortunate events?
Netflix uses early detection and visibility strategies to prevent bad data from reaching reports, maintaining trust in data-driven decisions.
What role do data consumers have at Netflix?
Data consumers at Netflix, including business analysts and data scientists, rely on accurate data to make strategic and operational decisions.
- 00:00:01 Welcome! My name is Michelle Ufford, and I am your humble narrator for today's talk. I lead a team at Netflix focused on data engineering innovation and centralized solutions, and I want to share with you some of the really cool things we're doing around analytics, and also how we ensure that the analytics we're delivering can be trusted. I also want you to understand that this stuff is really, really hard, and I'm trying to give you some ideas on ways you can deal with it in your own environment.
- 00:00:33 Netflix was born on a cold, stormy night back in 1997. But seriously, it's 20 years old, which most people don't realize. As of Q2 2017 we have a hundred million members worldwide. We are on track to spend six billion dollars on content this year, and you watch a lot of that content: in fact, you watch a hundred and twenty-five million hours every single day. And these are not peak numbers. On January 8th of this year, you actually watched 250 million hours in a single day. It's impressive. As of Q2 2016 we are in 130 countries worldwide, and we are on, I think, around 4,000 different devices.
- 00:01:40 Now, there's a reason why I'm telling you this. We have a hundred million members watching a hundred and twenty-five million hours of content every day, in a hundred and thirty countries, on four thousand different devices. We have a lot of data. We write 700 billion events to our streaming ingestion pipeline every single day. That's the average; we peak at well over a trillion. This data is processed and landed in our warehouse, which is built entirely using open-source big data technologies. We're currently sitting at around 60 petabytes and growing at a rate of 300 terabytes a day, and this data is actively used across the company. I'll give you some specific examples later, but on average we do about five petabytes of reads.
- 00:02:35 Now, what I'm trying to demonstrate to you in this talk is that we can do this, and we can use these principles at scale, but you don't need this type of environment to still get value out of them. It's just showing you that it does work at scale.
- 00:02:49 So now for the fun stuff: the unfortunate events. And "events" is actually a play on words here. When I say events, what I'm really talking about is all of the interactions you take across the service. This could be authenticating into an app; it could be the content that you receive, as in what we recommend for you to watch; it could be when you click on content, or you pause it, or you stop it, or when you click on that next thing to watch. All of those are considered events for us.
- 00:03:23 This data is written into our ingestion pipeline, which is backed by Kafka, and it's landed in a raw ingestion layer inside of our data warehouse. Just for those who are curious, this is all 100% based in the cloud; everything you see here is all using Amazon's AWS, but you can think of this S3 as really just a data warehouse. So we have a raw layer; we use a variety of big data processing technologies, most notably Spark right now, to process and transform that data, and we land it into our data warehouse. Then we can also aggregate, denormalize, and summarize that data and put it into a reporting layer. A subset of this data is moved over to a variety of fast-access storage engines, where it is made available for our data visualization tools. Tableau is the most widely used visualization tool at Netflix, but we also have a variety of other use cases, so we have other visualization tools that we support, and this data is also available to be queried and interacted with using a variety of other tools. Every person in the company has access to our reports and to our data.
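To make that flow concrete, here is a minimal sketch of emitting a single interaction event into a Kafka-backed ingestion pipeline, using the open-source kafka-python client. The topic name and event fields are hypothetical illustrations, not Netflix's actual schema.

```python
# Minimal sketch: emitting one playback interaction to a Kafka-backed pipeline.
# The topic name and event fields are hypothetical, not Netflix's real schema.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_type": "play_next_episode",  # e.g. play, pause, stop, play_next_episode
    "device_class": "tablet",
    "title_id": 80100172,               # illustrative ID
    "ts": int(time.time() * 1000),      # client timestamp, epoch millis
}

# Fire-and-forget write; the raw event lands in the ingestion layer (e.g. S3)
# downstream, where batch jobs such as Spark transform it for the warehouse.
producer.send("playback-events", value=event)
producer.flush()
```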
- 00:04:47 But this is what happens when everything works well; this is what it looks like. Let's talk about when things go wrong, when you have bad data. Really, though, I don't like the term "bad," because it implies intent, and the data is not trying to ruin your Monday; it doesn't want to create problems in your reports. So I like to think of it as unfortunate data.
- 00:05:10 This is a visualization from a tool called Vizceral; I think it's one of the coolest visualizations I've ever seen. It shows all of the traffic that is coming into our service. Every single one of those small dots is an API, and those large dots represent various regions in our AWS. So if you look here, you might notice one of these circles is red, and that means there's a problem with that API. What that means is that we're receiving, for the most part, data, but we might not be receiving all data; we might not be receiving one type of event. So in this example, let's say that we receive the events when you click play on new content, but when you get to the end and you click "play next episode," we don't see that. Let's just use that as a hypothetical.
- 00:06:10 So now we have our bad data, our unfortunate data, coming in, and let's say, just for example purposes, that this is only affecting our tablet devices. Maybe the latest release of the Android SDK caused some sort of compatibility issue, and now we don't see that one type of event. It could be that the event doesn't come through at all; it could be that the event comes through but it's malformed; it could be that it's an empty payload. The point is that we are somehow missing the data that we're expecting. And it doesn't stop there: it makes its way through our entire system. It goes into our ingestion pipeline, it gets landed over into S3, it ultimately makes its way to the data warehouse, it's going to be copied over to our fast storage, and if you're doing extracts, it's going to live there too. Ultimately, this data is going to be in front of our users.
- 00:07:01 And the problem here is that you are the face of the problem, even though you had nothing to do with it. You created the report, and the user is interacting with the report. It is generally not going to be your fault. Sometimes it's your fault, but most of the time what I've observed is that issues with reports are, quote-unquote, "upstream." What does that mean? Every single icon and every single arrow on this diagram is a point of failure, and not just one or two possible things that could go wrong: a dozen or more different things could go wrong. And this is a high-level view; if we drilled down, you would see even more points of failure. So it's realistic to expect that things will go wrong. Compounding this problem for us is that, according to Sandvine, Netflix accounts for 35% of all peak traffic in North America.
- 00:08:20 So I've described the problem: we have some bad data, some unfortunate data. That is not the issue itself, right? I mean, what does it really matter if I've got unfortunate data sitting in a table somewhere? The problem is when you are using that data to make decisions. That's the impact. And it is my personal belief that there is no other company in the world that is more data-driven than Netflix. There are other companies who are as data-driven, and they're not as big; and there are bigger companies, and they're not as data-driven. If I'm mistaken, please see me afterwards; I would love to hear about it. But when you're really a data-driven company, that means that you are actively using and looking at that data; you're relying upon it. So how do we ensure people still have confidence?
- 00:09:12 Well, first let's look at all of the different roles we have at Netflix, and why I think we're so data-driven. We start with our data engineers, and from my perspective this is just a software engineer who really specializes in data. They understand distributed systems; they are processing those 700 billion events and making them consumable for the rest of the company. We also have our analytics engineers, who will usually pick up where the data engineer left off. They might be doing some aggregations or creating some summary tables; they'll be creating some visualizations, and they might even do some ad hoc analysis, so we consider them sort of full-stack within this data space. And then we have people who specialize in just data visualization; they are really, really good at making the data make sense to people.
- 00:10:02 I'm curious, though: how many of you would consider yourself an analytics engineer, where you have to create tables as well as the reports? Wow, that's actually more than I thought; it's a pretty good portion of the room. How many of you only do visualization? Show of hands. OK, so more people actually have to create the tables than do just the visualizations. Interesting. So those are the people that I consider data producers; they're creating these data objects for the rest of the company to consume.
- 00:10:36 Then we move into our data consumers, and this would be our business analysts, who are probably very similar to your business analysts. They have really deep vertical expertise, and they are producing analysis like: what is the subscriber forecast for EMEA? We also have research scientists and quantitative analysts in our science and algorithms groups, and they are focused on answering really big, hard questions. We have our data scientists and machine learning scientists, and they are creating models to help us predict behaviors or make better decisions. These would be our consumers. So these people are affected by the bad data, but there's really no impact yet.
- 00:11:22 It's not until we get to this top layer that we really start to see impact, and it starts with our executives. They are looking at that data to make decisions about the company's strategy. Many companies say they're data-driven, but what that really means is: "I've got an idea, and I just need the data to prove it. Oh, that doesn't look good? Go look over here instead," until they can find the data that proves their point and they can go off in the direction they want. Being data-driven means that you look at the data first and then you make decisions. So if we provide them with bad data, bad insights, they could make a really bad decision for the company.
- 00:11:59 Our product managers: we have these across every vertical, but one example would be our content team. They are asking the question: what should we spend that six billion dollars on? What titles should we license? What titles should we create? And they do that by relying upon predictive models built by our data scientists, saying, "Here's what we expect the audience to be for a title," and based upon that we can back into a number that we're willing to pay. This is actually a really good model for this, because it allows us to support niche audiences with small film titles, but also spend a lot of money on things like the Marvel and Disney partnerships, where we know it's going to have broad appeal.
- 00:12:46 We have our algorithm engineers, who are trying to decide what is the right content to show you on the site. We have between 60 and 90 seconds for you to find content before you leave, and I know it feels like longer than 60 or 90 seconds when you're clicking next, next, next, but that's about how long we have before you go and spend your free time doing something else. And then we have our software engineers, who are constantly experimenting with things. They look at the data, and they roll it out to everybody if it makes sense. This could be everything from the user experience that you actually see to things that you don't see, like: what is the optimal compression for our video encoding, so that we can lower the amount of bandwidth that you have to spend while also preventing you from having a really bad video experience? So the impact is really at that top level.
- 00:13:43 So how do we design for these unfortunate events? I mean, we have the data, we've got lots of data, we've got lots of people who want to look at it, a lot of people who are depending upon it. And I think that you have two options here. The first option is you can say: we're going to prevent anything from going wrong. We're going to check for everything, we're just going to lock it down, and when something gets deployed, we're going to make sure that that thing is airtight. That sounds good in principle, but the reality is that usually these issues don't occur when you deploy something. Usually everything looks great, and then six months later there's a problem. So from my perspective, it's a lot better to detect issues and respond to them than it is to try to prevent them.
- 00:14:32 And when it comes to detecting data quality, I'm going to get a little bit more technical here, so please bear with me. I think all of this stuff is really relevant for you, and I think there are some really good takeaways here, so there's a reason I'm showing you this. We're going to drill down into this data storage layer.
- 00:14:50 And we're going to look at this concept of a table. So, in Hadoop: how many of you actually work with Hadoop at all? How many of you work with an enterprise data warehouse, but it's on something else, like Teradata? OK. So in Teradata, a table is both a logical and a physical construct; you cannot separate the two. In Hadoop, you can. We have the data sitting somewhere on storage, and we have this concept of a table, which is really just a pointer to that data, and we can choose to point to the data or choose not to.
- 00:15:25 One thing that we've built is a tool called Metacat, and whenever we write that data, whenever we do that pointing, we are creating another logical object: a partition object. This partition object has statistics about the data that was just written. We can look at things like the row counts, and we can look at the number of nulls in that file, and say: OK, we're going to use this information to see if there's a problem. We can also drill down a little bit deeper, into the field level, and we can use this for some really explicit checks. We can say: well, I'm checking for the max value of this metric, and the max value is zero, and that doesn't make sense unless it's some sort of negative value for this field; chances are this is either a brand-new field or there's a problem. So we can check for these things. You don't have to do things the way that I'm describing, but the concept, I think, is pretty transferable: having statistics about the data that you write enables a lot more powerful things.
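As a rough illustration of that idea (not Metacat's actual implementation, which the talk only describes at a high level), here is a minimal PySpark sketch that computes partition-level statistics at write time; the S3 path and column names are made up.

```python
# Minimal sketch of collecting write-time partition statistics: the general
# idea behind what the talk describes, with hypothetical paths and columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-stats").getOrCreate()

df = spark.read.parquet("s3://bucket/events/dateint=20170901/")  # hypothetical

stats = df.agg(
    F.count("*").alias("row_count"),
    F.sum(F.col("title_id").isNull().cast("long")).alias("title_id_nulls"),
    F.max("watch_seconds").alias("watch_seconds_max"),  # a max of 0 is suspicious
).first()

# Persist alongside the partition's metadata (here: just print; a real system
# would register this with a metadata service when the partition is committed).
print({
    "partition": "dateint=20170901",
    "row_count": stats["row_count"],
    "title_id_nulls": stats["title_id_nulls"],
    "watch_seconds_max": stats["watch_seconds_max"],
})
```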
- 00:16:32 OK, so now that we have the statistics, we could use them in isolation and say, "Oh, we got a zero, that's a problem," but typically the issues are a little more difficult to find than just that. So what we can do is take that data and chart, for example, row counts over time. You can see that we've got peaks and valleys here, and this is really denoting that there's some difference in behavior based upon the day of week. So if we use a standard normal distribution, we can look for something that falls outside of, say, a 90% confidence interval. If something does, we can't be sure there's a problem, but we definitely want someone to go look and see. And so when we compare this for the same day of week, week over week, for 30 periods, we start to see that we have some outliers, some things that might be problems. We can also see that the data we wrote most recently looks really suspect, because I wrote 10 billion rows, and typically I write between 80 and a hundred billion rows. So chances are there's a problem with this particular run of the ETL.
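A minimal version of that check, comparing a new run's row count against the same weekday's history and flagging values outside a confidence interval, might look like the following; the numbers are illustrative, and the talk's Q&A notes that Poisson models and other anomaly detection methods are used as well.

```python
# Minimal sketch: flag a partition whose row count falls outside the
# expected range for that day of week. Numbers here are illustrative.
from statistics import mean, stdev

# Row counts (in billions) for the last several same-weekday runs.
history = [92, 88, 97, 85, 90, 95, 83, 91, 89, 94]
todays_count = 10  # the suspicious run from the talk's example

mu, sigma = mean(history), stdev(history)
z = (todays_count - mu) / sigma

# ~90% two-sided confidence interval for a normal distribution: |z| < 1.645.
if abs(z) > 1.645:
    print(f"AUDIT FAILED: row count {todays_count}B is {z:.1f} sigma from "
          f"the weekday mean {mu:.1f}B; halting ETL for investigation")
else:
    print("Row count within expected range; continuing pipeline")
```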
- 00:17:53 So we can detect the issues, but that doesn't really prevent the problem of the impact. The perennial question: can I trust this report? Can I trust this data? I have no idea, looking at this, whether there's a data quality issue. And what's really problematic, what is really the issue for you, is when people look at these reports, they trust the reports, and then afterwards we tell them the data is actually wrong: "We're going to back it out; we're going to fix it for you." There was no indication to them, looking at the report, that they couldn't trust it. But now, the next time they look at this report, guess what? It's going to be there in the back of their mind: is this data good? Can I trust it?
- 00:18:42 So what we've done is built a process that checks for these issues before the data becomes visible. All of the bad, unfortunate stuff can still happen; we still have the data coming in, and it's still landing in our ingestion layer. But before we write it out to our data warehouse, we are checking for those standard deviations, and when we find exceptions, we fail the ETL; we don't go any further in the process. We also check before we get to our reporting layer, same thing. What this means is that your user is not going to see their data right away; they're going to come to the report, and it looks like this. Your business user is going to see there's missing data, and now they're going to know there was a quality issue. And we don't want them to know that, right?
- 00:19:37 Wrong. We want them to know there was a problem, because it's not your fault there was a problem; there are so many things that could go wrong. But simply by showing them explicitly that we have no data, they retain confidence. They know they're not making decisions based on bad data, and your business should not be making major decisions on a single day's worth of data anyway. Where it becomes really problematic is when you're doing trends and percent changes; in those cases, bad data can really have a big impact.
- 00:20:17 Now, one thing that you do have to do to make this work is surface the information so that users can really see: when was the data last loaded? When did we last validate it? So there are two things: not showing them bad data, and providing visibility into the current state.
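The shape of that gate (stage the data, audit it, and only publish the partition if the audits pass) can be sketched as below; all function and field names are hypothetical placeholders, not Netflix's pipeline code.

```python
# Minimal sketch of a write-audit-publish gate: data is staged, audited,
# and only made visible to the warehouse/reporting layer if checks pass.
# All names here are hypothetical placeholders.

class AuditFailure(Exception):
    pass

def run_audits(stats: dict) -> list[str]:
    """Return a list of human-readable audit failures (empty = clean)."""
    failures = []
    if stats["row_count"] == 0:
        failures.append("row_count is zero")
    if stats["row_count"] < stats["expected_min"]:
        failures.append("row_count below expected weekday range")
    return failures

def swap_partition_pointer(partition: str) -> None:
    print(f"{partition} is now visible to consumers")

def publish(partition: str, stats: dict) -> None:
    failures = run_audits(stats)
    if failures:
        # Fail the ETL here: downstream jobs never see the suspect data,
        # so reports show "no data yet" instead of silently wrong numbers.
        raise AuditFailure(f"{partition}: " + "; ".join(failures))
    swap_partition_pointer(partition)

try:
    publish("dateint=20170901", {"row_count": 10e9, "expected_min": 80e9})
except AuditFailure as e:
    print("ETL halted:", e)
```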
- 00:20:35 This is a view of our big data portal, an internal tool that we've developed; I think there are third-party tools out there that might do some of these things. We're also planning to add visibility into the actual failures and alerts so that business users can see those. So now we've detected the issue, and we've prevented there from being any negative impact, but we still have to fix the problem, right? They still want the data at the end of the day.
- 00:21:01 There are two components to fixing the problem quickly. The first one is, as I just mentioned, visibility, but this time visibility for the people who need to understand what the problem is and fix it. One of the things that we're doing is surfacing this information. The question might be: why did my job suddenly spike in runtime? Why is this taking so long? You can look here and easily see: oh, it's because you received a lot more data. And then this becomes a question: well, is that because somebody deployed something upstream and it's now duplicating everything? It gives you a starting point to understand what the problems are. We also directly display the failure messages and give you a link to go see the failures themselves, so that when users are trying to troubleshoot, we're making it easier and faster for them to get there.
- 00:21:58 And we are exposing the relationships between data sets. This is the lineage data: how do these things relate? What things are waiting on me to fix this, and who do I need to notify that there's a problem?
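Lineage data is what makes "who do I need to notify?" answerable programmatically. A toy version, assuming the lineage is available as a simple table-to-consumers mapping, is a breadth-first walk:

```python
# Toy sketch: given table-level lineage (who reads from whom), find every
# downstream table affected by a bad partition, so owners can be notified.
from collections import deque

# Hypothetical lineage: table -> tables that consume it.
lineage = {
    "events_detail": ["events_agg", "device_summary"],
    "events_agg": ["exec_dashboard", "content_report"],
    "device_summary": ["content_report"],
}

def downstream_of(table: str) -> set[str]:
    seen, queue = set(), deque([table])
    while queue:
        for consumer in lineage.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

print(downstream_of("events_detail"))
# {'events_agg', 'device_summary', 'exec_dashboard', 'content_report'}
```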
- 00:22:12 All right, now I'm going to cover this real quick. This is about scheduling, and pulling versus pushing. It's not something that you would implement yourselves, but it is something that you should be having conversations with your infrastructure teams about. Traditionally, we use a schedule-based system, where we say: OK, it's 6 o'clock, my job is going to run, and I'm going to take those 700 billion events and create this really clean, detailed table. Then at 7 o'clock I'm going to have another job run, and it's going to aggregate that data. At 8 o'clock I'm going to have another process that runs, and it's going to denormalize it to get it ready for my report. I'm going to copy it over to my fast-access layer at 8:30, and by 9 o'clock my report should be ready.
- 00:23:00 In a push-based system, you might still have some scheduling component; you might say, well, I want everything to start at 6:00 a.m. But the difference is that once this job is done, it notifies the aggregate job that it's ready to run because there's new data, which notifies the denormalize job that it's ready to run, which notifies, or just executes, the extract over to your fast-access layer, and your report becomes available to everybody by maybe 7:42. You can see the benefits here in being able to get the data out faster to your users. But this is not why you care about this; this is probably why your business users might care about it. Why you care about it is because things don't always work perfectly; in fact, they usually don't. And when that happens, you're going to have to reflow and fix things.
- 00:23:53 So this is an actual table that we have. We have one table that populates six tables, which populate 38 tables, which populate 586. This is a pretty run-of-the-mill table for us; I have one table that by the third level of dependency has 2,000 table dependencies. So how do we fix the data, when the data has started off in one place and has been propagated to all of these other places? In a pull system, you rerun your job, and my detailed data, my aggregate, my denormalized view all get updated, and my report is good. But the other 582 tables are kind of left hanging. You could notify them, if you have visibility into who these people are that are consuming your data, but they still have to go take action. And what's going to happen is you're going to tell them, "Hey, we've had this data quality issue and we reflowed, and it's really important that you rerun your job," and they're going to think: "OK, yeah, but I deprecated that... and yeah, I might have forgotten to turn off my ETL." They have no idea that somebody else has started to rely upon that data for their report. It happens all the time. People don't feel particularly incentivized to go rerun and clean things up unless they know what the impact is.
- 00:25:16 In a push system, we fix the one table; it notifies the next tables that there's new data, which notifies those tables that there's new data, and everything gets fixed downstream. This is a perfect world; it's very idealistic, a very pure push-type system. What you should be discussing with your internal infrastructure team is that you should not need to know that there is an upstream issue. You should just be able to rely upon the fact that when there is a problem, your jobs will be executed for you, so that you can rely upon that data. Nobody should have to go do that manually; it doesn't scale.
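As a toy illustration of the difference, here is a push-style dependency graph in which completing (or re-running) a job automatically triggers its dependents; contrast this with cron-style scheduling, where each job runs at a fixed time whether or not its input was fixed. All names are hypothetical.

```python
# Toy sketch of push-based ETL: finishing a job notifies its dependents,
# so a backfill of the root table automatically reflows everything downstream.

downstream = {                       # hypothetical dependency graph
    "events_detail": ["events_agg"],
    "events_agg": ["events_denorm"],
    "events_denorm": ["fast_access_extract"],
    "fast_access_extract": [],
}

def run_job(name: str) -> None:
    print(f"running {name}")
    # ... job logic would go here ...
    for dependent in downstream[name]:
        run_job(dependent)           # "push": new data triggers the next job

# A scheduled system would need each of these four jobs re-run by hand (or
# wait for tomorrow's cron slot); in a push system one call reflows them all.
run_job("events_detail")
```

A production system would track completion state and run jobs in parallel rather than recursing synchronously; the point is only that the dependency graph, not the clock, drives execution.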
- 00:25:57 Part four.
- 00:26:05 So we've gone through this, and we talked about some of the different ways that unfortunate data could impact us, but that's not the reality for us. The reality is that our users do have confidence in the data we produce. That doesn't mean there aren't quality issues, but overall, generally speaking, they have confidence that what we produce and provide to them is good, and because of that we're able to do some really cool things. Our executives actually looked at the data for how content was being used, and the efficiency of it, and they made a decision, based upon the data, to start investing in originals. Over the next few years we went from, I think it was a handful of hours in 2012, to about a thousand hours of content in 2017. We've ramped this up very, very quickly, and we have set a goal that by 2020, 50% of our content will be original content; 50% of that six billion dollars we're spending will be on new content that we create. This was a strategy decision that was informed by the data that we had.
- 00:27:18 We also have our product managers, who are looking at the data, and they are making some pretty good selections on what content we should be purchasing for the service. We've had some pretty good ones, and they're going to continue to use the data to decide what is the next best thing for us to buy. We have our software engineers, who have built out a constantly evolving user interface and user experience for us. This doesn't happen all at once; this isn't a monolithic project where they just roll out big changes. Instead, they are making small changes, testing them incrementally, and then making the decision to roll them out. We have about a hundred different tests going on right at this moment to see what is the best thing we should do. So you can see what we looked like in 2016, and here's what we look like in 2017.
- 00:28:50 So you can imagine for a moment the amount of complexity and the different systems that are involved in making something like that happen, and before we really make that investment, we go and test it: is our theory correct? One thing that I think is really interesting is that we find we can often predict what people will like, if those people are exactly like us; and usually people are not exactly like us. So instead, whenever we have ideas, we just throw them out there, we test them, and then we respond to the results.
- 00:29:23 And then we have our algorithm engineers. These are the people who are responsible for putting the intelligence in our system and making it fast and seamless. The most well-known case is our recommendation system. I could talk about it, but I actually think this video is a little bit more interesting.
- 00:29:48 [Music]
- 00:30:49 [Applause]
- 00:30:55 Now, we did not create the video, but we provided the data behind the video. 80% of people watch content through recommendations. 80%. We did not start off at that number; it was only through constant iteration, looking at the data and responding to the things that had positive results, that we got to this place.
- 00:31:22 OK, so we'll start wrapping up with the key takeaways. Obviously I don't expect you to go back and rebuild all of this in your environment, but I think there are some key principles that really make sense for a lot of people outside of just what we're doing here at Netflix. The first one is that expecting failure is more efficient than trying to prevent it. This is true for your data teams, but it is also true for you as data visualization people: how can you expect, and respond to, failures and issues with the data and with your reports, rather than trying to prevent them? So shift your mindset and say: I know it's going to happen; what are we going to do when it happens?
- 00:32:15 Stale data. You know, I have never heard someone tell me, "I would rather have had the incomplete data faster than the stale, accurate data." I am sure there are cases out there where that is true, but it is almost never the case. People don't make decisions based upon one hour or one day's worth of data. They might want to know what's happening; they might say, "I just launched something and I really want to know how it's performing." That's a natural human trait, that curiosity, but it is not impactful. And so ask yourself: would they rather see data faster and have it be wrong, or would they rather know that when they do finally see the data, maybe a few minutes later, maybe an hour later, it's right? I know this is really hard. It's easy for me to tell you this; I realize you have to go back to your business users and tell them. But if you explain it to them in this way, usually they can begin to understand.
- 00:33:23 And then the last thing: this is really validation for you. I've had a lot of people ask me, "How do you do this? We have all these problems." And the reality is that we have those problems too. This stuff is really, really hard. It's hard to get right; it's hard to do well. It's hard for us, and I think we do it pretty well, though I think there are a lot of things we could do better. So know that you're not alone; it's not you, and it's not your environment. Take this slide back to your boss and show them: this stuff is really, really hard to do right.
- 00:34:00 OK, please, when you are done, complete the session survey. And that's my talk.
- 00:34:13 [Applause]
- 00:34:27 Anybody have questions? If you have questions and you're not able to stay right now, feel free to find me afterwards; I'm happy to answer them.
- 00:34:41 Hi, thanks for the very insightful talk. I had one question. I was curious: when you showed the dependency diagram, let's say you have a mainstream table that has a couple of hundred, or say a thousand, tables feeding downstream dependencies, and you have to make a change. No table design is constant, so for upcoming changes, how do you manage to stay flexible for those changes while also fulfilling those downstream dependencies for the historical data?
- 00:35:11 So the question is: how do you identify and ensure that there are no issues in your downstream dependencies when you have thousands of tables?
- 00:35:21 No, it's rather: if you had to make a change, how do you make it quickly enough, and also make sure that you have the data historically, for a major upstream table whose change you want applied to your downstream tables?
- 00:35:38 Ah, OK, good question. So: how do you quickly evolve schema and apply that downstream? The first thing is you have to have lineage data; without lineage data you can't really get anywhere, so the first step is always to make sure you have lineage data. The second thing is around automation, and your ability to understand the data and understand the schema. If you can understand schema evolution and the changes there, then you can start to apply them programmatically at any point where you can automate; automation of ingestion and of the export of data would be an optimal place. Beyond that, it becomes a little more pragmatic; it depends on how complex your scripts are. If you've got something pretty simple, like "select from here, group by," you could automate that if you had all the code sitting in a centralized code base, but that becomes a lot more tricky, and we actually have not gone that route, because we want people to have control. Instead, what we've done is notify them that there is going to be a change, notify them when the change is made, and it's on them to actually apply the changes. Everything that we can automate, because it's a simple "select, move from here to there," we've automated, so they don't have to worry about that.
- 00:36:49 Good question. One more question? Yes.
- 00:36:51 With the rise of speech-based AI assistants, if Netflix in the future gets support for, let's say, Google Home or Siri, how do we think the data flow is going to be affected by home systems and speech recognition? There will be a lot more data than we're used to right now.
- 00:37:16 Yes. I don't have a specific answer, but I would imagine that it's just going to be another endpoint for us, and that it will just be writing out to those ingestion pipelines. Typically what's happening in those cases is that there's another service out there, on Alexa or Google, that is doing that translation, so we probably wouldn't get visibility into it. It would be more like: this phrase was spoken, and here was the content that was shown, and we'd probably log that in the same way we log pretty much everything. But I don't actually know that for sure. Thanks.
- 00:37:50 I have a few questions, but I'll try to put them in one. I missed the first part of your presentation, where you were drawing the architecture and showing us how data reaches the Tableau dashboard. You mentioned later on that you're using Tableau data extracts, building extracts and feeding the dashboard at a certain time. My question is: do you have live connections as the data is flowing through your ETL process, checking the quality and running analytics, so that people can actually visualize it and make a change based on what they see? Do you have a live connection to the data sets, running analytics on top of that using Tableau, or are you using Tableau data extracts?
- 00:38:44 This is actually one I'll defer to my colleague Jason over here, who works more with the Tableau stuff.
- 00:38:52 Yeah, I think what you're getting at: you're talking about the data itself, the quality of the data. Are we visualizing that in Tableau?
- 00:39:02 The data itself, not the metadata (how many nodes you've got, and all the performance metrics that she showed), but really looking at the quality of the data: something for a product owner to make a decision on, "hey, how do I look?", as the process is going on, using live connections. You're talking about a lot of data, so how do you stream all of that onto a live connection and not affect your Tableau dashboard's performance? Because you'd really have to go back and look through terabytes of data to be able to get a visualization.
- 00:39:32 Yeah, what I would say there is: the data around the quality checks that are happening is surfaced, but not so much through a Tableau live connection today. I think there's real opportunity, though, in having some aspect of your dashboard that surfaces that information in a more easy-to-understand way: just an indicator on your dashboard saying, "hey, this dashboard has some stale data; it's being worked on." That's probably the more realistic thing that we would do, surfacing it on the dashboard, because the business user may not have as much context to understand the complexity of why the data quality is not good. So I could see surfacing that kind of high-level data point in a dashboard.
- 00:40:24 OK. I think there's a long line, but perhaps, yeah, why don't you swing around and you can dig into it more; we'll stick around, so if you don't have a chance to ask your question now... Yeah, appreciate it, thanks.
- 00:40:34 Good question: how do you attribute viewing to the recommendation engine versus anything that you can't measure off-platform, like your traditional ads or word of mouth? Say someone has told me about Luke Cage, and it also got recommended to me, so then I watched it; but my first discovery moment was really the word of mouth of my friend.
- 00:40:57 That's a great question: how do we attribute a view from, say, a search engine versus a recommendation? And the answer is: I don't know. I'm an engineer, so I don't actually have that insight, but that's a good question. Thank you.
- 00:41:13 Hello, my question is about the data validation step. You're saying that you take a normal distribution and say, "Look, we're normally uploading 80 to 100 million rows, so this is what we think we should be able to see." There are obviously lots of events that will cause things to fall outside of those normal bounds. Just take a catastrophic event, say the power outage that shut down the Eastern Seaboard in 2003: I'm going to go out on a limb and say viewership would have dropped well outside of your normal distribution if that happened today. So my question is: when you're doing this, and you're saying "this data is not good," and you've set up this automated process that's alerting on that, how do you then interact with it to say, "Well, actually, no, there's a reason for this," or, "We've looked at it, and we actually think this is a good data packet," and push it back through?
- 00:42:00 Great question. So, to reiterate: when there is a data quality issue, or when we find exceptions, how do we communicate and note that it really wasn't a problem, or that it was an issue outside of the data, and that we want our job to continue? Is that right? Yeah. So the first thing is that I really glossed over that whole piece. A normal distribution is one of the things we can use; we also use Poisson distributions, and we use anomaly detection, so there are other, more sophisticated methods to ensure that there aren't problems. And when we do find that there are major problems: we are working right now on a service that will allow people to communicate, "Here's this issue; I'm going to acknowledge the issue, annotate what the problem was, and then let my ETL move on." We're working on that right now, but at the moment it would just be a matter of going off and manually releasing the job to continue on to the next step.
- 00:42:59 Does that power sit with one person, or a small group of people, who sit down and say, "Yes, we think this is good; we've discussed it really quickly in a 30-minute session; we think the ETL can continue"? Or is it not that critical for you to be that fast? How does that decision actually get made?
- 00:43:17 I'm going to defer again to my colleague.
- 00:43:20 Yeah, no worries. I'd say it comes down to the data point and the team that owns that data; it's not really a centralized decision per se. Oftentimes the most relevant team or person would get alerted (usually it's not just one individual, it's a team), and they would dig in and see if there is actually a data issue. If there is, they would fix it; if not, they would release things, and hopefully improve the audit so that the same type of anomaly doesn't trip the audit again.
- 00:43:48 And the time frame for those types of decisions?
- 00:43:53 Again, it can depend on the data set. There are some data sets that you want up and running 24/7, and there are other data sets that run once a day or once a week. So it varies, but I would say that a 24-hour turnaround is a pretty normal timeline.
- 00:44:13 Cool, thank you so much. Sure.
- 00:44:18 I have a question about the statistics that you capture. How did you determine what metadata to capture for your data? And, part two: if you find there's a new feature that you want to capture, do you ever have to go back through your historical data and collect it?
- 00:44:33 Good question. So: how do we collect the statistics, and how do we support evolution of those statistics, and backfill if necessary? To answer your question, the statistics are collected as part of the storage driver, so whenever data is written, from Spark or Pig, we are collecting that information every time. And the whole model here is collaboration, so people are welcome to create a brand-new statistic. We actually just had this happen recently; in that case the statistic was looking at approximate cardinality, and that's something that not everybody would want to turn on, so we can flag it: explicitly disable it by default, and explicitly enable it for this one data set. So people have that control where it's appropriate. And then we don't really worry about backfilling the statistics. We have logic that says: OK, if we don't have enough data for this normal distribution, perhaps we should be using some other detection method in the interim; once we do have enough, we'll switch over to the normal distribution, or we'll simply invalidate that audit until we have enough data for the new statistic. But we wouldn't necessarily go back, because it'd be too expensive for us to do that.
- 00:45:56 That means you have never done it? You've never done the backfill, or is it just not common?
- 00:46:01 We have never done the backfill, because it literally is part of the storage function, and we would have to create something brand new; you'd basically have to go interrogate that data, and for us that's very, very expensive. We want to touch the data as little as we can, which is why we try to collect it whenever the data's being written. Not that we couldn't come up with a solution if we needed to, but we've never had a case where we really needed to.
- 00:46:28 My question is regarding Tableau connectivity to interactive queries. To be specific: computing any complex calculation in databases is tedious. Say you want to calculate a standard-deviation z-score over the last few years based on dynamic parameterization; you can't do that in databases easily, so writing Python code or creating a web service on top is much easier. But Tableau lacks a good way of connecting to a Web API with live interactivity. Have you come across these issues, and are there any implementation models for this?
- 00:47:08 I'm going to defer on this one.
- 00:47:13 Sorry, could you repeat the question? Actually, I think I can answer.
- 00:47:18 When it comes to the live connections that we make use of, I would say that for Tableau we have primarily used Tableau data extracts in the past. So for cases like that, where we need to go out, pull in data from an external service, and make it available, we would usually bring it into our data warehouse, materialize it into a table, and then pull it into Tableau. For something more like what you're talking about, where you have to go out and hit an external API, we actually use some custom visualization solutions where we do those types of things.
- 00:47:58 Sorry, I'll just add on to that. I think the feature that they demoed this morning, the Extensions API, is an interesting idea. Now that that's available, and because we've built this as an API-driven service, potentially we could hook into the alert and, through the extension, place an annotation that says the data building this dashboard is suspect at the moment. So there are some enhancements that maybe we can look at to bring in that notification, based on the alerting that's running in the ETL process.
- 00:48:25 It's kind of out of the box, but the whole world is now looking into the evolution of web services, where Amazon is promoting them as a primary data source for everything, capturing your enterprise data layer as a service, and a lot of BI tools are not good enough to talk to these APIs. So do you think there are any intermediate solutions that could get to that connectivity, or should we look at it as a separate use case and go with an extension?
- 00:48:54 Real quick, I'm going to pause: there's a really bad echo, so what I'm going to do instead is invite you all up to the front if you have questions. There are several of us, so you'll get your questions answered faster. The Netflix group is over here, and if you don't have time to stay and get your question answered, again, feel free to reach out to any of us during the conference. Thanks for coming. Thank you.
- 00:49:13 [Applause]
- Netflix
- Data Engineering
- Analytics
- Data Quality
- ETL
- Big Data
- Data Warehousing
- Data Visualization
- Cloud Infrastructure
- Content Strategy