Netflix: A series of unfortunate events: Delivering high-quality analytics

00:49:15
https://www.youtube.com/watch?v=XpOwXMB8mTA

Summary

TLDR: Michelle Ufford, who leads a data engineering team at Netflix, describes the company's approach to handling the vast amounts of data generated by over 100 million subscribers globally. Netflix writes 700 billion events per day and processes them to ensure the reliability of its analytics. Every interaction with the Netflix service is logged as an event and captured in a Kafka-backed pipeline. This data is processed and stored in a cloud-based data warehouse using tools like Spark. Netflix embraces a mentality of expecting system and data failures rather than trying to prevent them outright, using anomaly detection to manage data quality. The company implements a push-based ETL system that cascades updates downstream efficiently, as opposed to traditional scheduled systems. Ensuring data quality is vital, as stakeholders from engineers to executives use this data in decisions that affect content investment and user experience strategy. Netflix has embraced data-driven decision-making to guide original content production, aiming to increase its share of unique content. Michelle emphasizes the complexity of managing Netflix's data at scale and the need for robust infrastructure and processes to give data consumers high availability and reliability. The talk underscores the importance of enabling confidence in data-driven decision-making and analytics-driven solutions.
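
To make the event-logging step concrete, here is a minimal sketch of publishing one playback interaction to a Kafka topic, assuming the kafka-python client; the broker address, topic name, and event fields are illustrative assumptions, not Netflix's actual schema.

```python
# Illustrative only: a minimal event producer using the kafka-python client.
# Broker address, topic name, and event fields are assumptions, not Netflix's schema.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],                      # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # events serialized as JSON
)

event = {
    "event_type": "play_start",        # hypothetical interaction type
    "member_id": 12345,
    "title_id": 67890,
    "device": "android_tablet",
    "ts": int(time.time() * 1000),
}

# Every interaction becomes one event on the stream; downstream consumers land
# the raw records in the ingestion layer (S3) for Spark to process.
producer.send("playback-events", value=event)
producer.flush()
```

Downstream, consumers of that topic land the raw records in the warehouse's ingestion layer, which is the flow the talk walks through in detail.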

Key takeaways

  • 📊 Netflix writes 700 billion events daily, using them for analytics and visualization.
  • 📈 Netflix employs a push-based ETL system for more efficient data processing.
  • 🚀 The company's data warehouse handles 60 petabytes, growing by 300 TB daily.
  • 🔎 They use anomaly detection to maintain data quality and prevent bad data.
  • 💡 Netflix relies heavily on data-driven decision-making, especially in content strategy.
  • 🌐 The company logs every interaction as data, from app usage to content consumption.
  • 🛠️ Spark and Kafka are pivotal tools for processing Netflix's massive data streams.
  • 🖥️ Data quality is ensured by statistical and metadata checks before the data is made visible.
  • 📅 Data updates can trigger automatic downstream process execution in a push system.
  • 🎬 Data has driven successful strategies, including investing in original content.

Timeline

  • 00:00:00 - 00:05:00

    Michelle Ufford introduces herself as the narrator, leading a data engineering team at Netflix focused on analytics and trusted data. She highlights Netflix's global presence and data scale, emphasizing the company's massive data collection and processing capabilities.

  • 00:05:00 - 00:10:00

    Michelle discusses the Netflix data architecture, including event data logging, ingestion pipelines, and a vast data warehouse built on open-source technologies. The company processes large volumes of data daily, enabling various company-wide applications.

  • 00:10:00 - 00:15:00

    The process of handling 'unfortunate data' is explored. Michelle explains events as user interactions, how Netflix logs them to a pipeline, processes them with big data technologies like Spark, and visualizes them with tools like Tableau for access across the company.

  • 00:15:00 - 00:20:00

    Problems with data reliability are framed as 'unfortunate data,' without implying fault. Visualization tools help identify issues. Michelle emphasizes the complexity of data processes, highlighting possible points of failure and the significant traffic Netflix handles.

  • 00:20:00 - 00:25:00

    She stresses the importance of data quality for decision-making, especially for a data-driven company like Netflix. Various roles within Netflix interact with data, including data engineers, analytics engineers, and visualization experts. Executives rely heavily on data for strategic decisions.

  • 00:25:00 - 00:30:00

    The impact of accurate data is crucial for roles like product managers and engineers at Netflix, who make decisions on content investment, algorithm adjustments, and user experience based on data insights. Netflix's strategic content decisions are deeply data-driven.

  • 00:30:00 - 00:35:00

    Michelle outlines strategies Netflix employs to ensure data quality, advocating for detection and response over prevention. She discusses utilizing data statistics and anomaly detection, explaining how unexpected data behavior is monitored and managed to prevent inaccurate reporting.

  • 00:35:00 - 00:40:00

    Handling data anomalies involves alert systems and user notifications when data is missing or suspect. Michelle highlights internal tools for data visibility and lineage, supporting efficient troubleshooting and maintaining user trust by clarifying data integrity.

  • 00:40:00 - 00:49:15

    Michelle explains that despite the complexities, Netflix's processes maintain confidence in the data, enabling impactful decisions like investing in original content. The talk concludes with strategic advice on managing data reliability and the challenges of data operations at scale.



FAQ

  • What is the role of a data engineer at Netflix?

    A data engineer at Netflix is essentially a software engineer specializing in data, focusing on distributed systems and making data consumable for the rest of the company.

  • How does Netflix handle data events?

    Netflix logs all interactions as events using Kafka, storing them in an AWS S3-backed data warehouse, processed by tools like Spark for analytics and visualization.

  • How much data does Netflix process daily?

    Netflix processes about 700 billion events daily, peaking over a trillion, with a warehouse currently at 60 petabytes, growing by 300 terabytes every day.

  • How does Netflix ensure data quality and trust?

    Netflix uses statistical checks, anomaly detection, and a process to prevent bad data from becoming visible to maintain data quality and trust.

  • What innovation has Netflix implemented for ETL processes?

    Netflix uses a push-based system where jobs notify downstream processes of new data, improving efficiency and accuracy in data handling.

  • Why is metadata important for Netflix?

    Metadata provides statistics about data, aiding in anomaly detection and ensuring data quality across streams and storage environments.

  • How does Netflix's data infrastructure impact its content strategy?

    Data analytics guide content investment decisions, like Netflix's move into original content, which now aims to comprise 50% of its catalog.

  • What challenges does Netflix face with data visualization?

    Challenges include ensuring timely data access, accurate data representation, and addressing performance limitations in visualization tools.

  • How does Netflix address bad data or unfortunate events?

    Netflix uses early detection and visibility strategies to prevent bad data from reaching reports, maintaining trust in data-driven decisions.

  • What role do data consumers have at Netflix?

    Data consumers at Netflix, including business analysts and data scientists, rely on accurate data to make strategic and operational decisions.


Transcript (English)
  • 00:00:01
    welcome my name is Michelle Ufford and I
  • 00:00:04
    am your humble narrator for today's talk
  • 00:00:07
    I also lead a team at Netflix focused on
  • 00:00:10
    data engineering innovation and
  • 00:00:11
    centralized solutions and I want to
  • 00:00:14
    share with you some of the really cool
  • 00:00:15
    things we're doing around analytics and
  • 00:00:18
    also how we ensure that the analytics
  • 00:00:21
    that we're delivering can be trusted I
  • 00:00:23
    also want you guys to understand that
  • 00:00:25
    this stuff is really really hard and I'm
  • 00:00:29
    trying to give you some ideas on ways
  • 00:00:31
    that you can deal with this in your own
  • 00:00:33
    environment Netflix was born on a cold
  • 00:00:41
    stormy night back in 1997 but seriously
  • 00:00:45
    it's 20 years old which most people
  • 00:00:47
    don't realize and as of q2 2017 we have
  • 00:00:52
    a hundred million members worldwide we
  • 00:00:56
    are on track to spend six billion
  • 00:00:59
    dollars on content this year and you
  • 00:01:03
    guys watch a lot of that content in fact
  • 00:01:05
    you watch a hundred and twenty five
  • 00:01:07
    million hours every single day these are
  • 00:01:13
    not peak numbers On January 8th of this
  • 00:01:16
    year you guys actually watched 250
  • 00:01:20
    million hours in a single day it's
  • 00:01:24
    impressive as of q2 2016 we are in 130
  • 00:01:31
    countries worldwide and we are on over I
  • 00:01:37
    think around 4,000 different devices now
  • 00:01:40
    there's a reason why I'm telling you
  • 00:01:41
    this we have a hundred million members
  • 00:01:43
    watching a hundred and twenty five
  • 00:01:45
    million hours of content every day in a
  • 00:01:48
    hundred and thirty countries on four
  • 00:01:50
    thousand different devices we have a lot
  • 00:01:54
    of data we write 700 billion events to
  • 00:02:00
    our streaming ingestion pipeline every
  • 00:02:02
    single day this is average we peak at
  • 00:02:06
    well over a trillion this data is
  • 00:02:09
    processed and landed in our
  • 00:02:13
    warehouse which is built entirely using
  • 00:02:15
    open-source Big Data technologies we're
  • 00:02:17
    currently sitting at around 60 petabytes
  • 00:02:20
    and growing at a rate of 300 terabytes a
  • 00:02:22
    day and this data is actively used
  • 00:02:26
    across the company I'll give you some
  • 00:02:28
    specific examples later but on average
  • 00:02:30
    we do about five petabytes of reads now
  • 00:02:35
    what I'm trying to demonstrate to you in
  • 00:02:38
    this talk is that we can do this and we
  • 00:02:40
    can use these principles at scale but
  • 00:02:42
    you don't need this type of environment
  • 00:02:45
    to still get value out of it it's just
  • 00:02:48
    showing you that it does work at scale
  • 00:02:49
    so now for the fun stuff the unfortunate
  • 00:02:54
    events and events is actually a play on
  • 00:02:57
    words here when I say events what I'm
  • 00:02:58
    really talking about is all of the
  • 00:03:01
    interactions you take across the service
  • 00:03:04
    so this could be authenticating into an
  • 00:03:08
    app it could be the content that you
  • 00:03:11
    receive as to you like what we recommend
  • 00:03:14
    for you to watch it could be when you
  • 00:03:16
    click on content or you pause it or you
  • 00:03:18
    stop it or when you click on that next
  • 00:03:21
    thing to watch all of those are
  • 00:03:23
    considered events for us this data is
  • 00:03:27
    written into our ingestion pipeline which
  • 00:03:28
    is backed by kafka and it's landed in a
  • 00:03:32
    raw ingestion layer inside of our data
  • 00:03:34
    warehouse so just for those who are
  • 00:03:37
    curious this is all 100% based in the
  • 00:03:39
    cloud everything you see on this it's
  • 00:03:41
    all using Amazon's AWS but you can think
  • 00:03:45
    of this s3 as really just a data
  • 00:03:47
    warehouse so we have a raw layer we use
  • 00:03:50
    a variety of big data processing
  • 00:03:52
    technologies most notably spark right
  • 00:03:54
    now to process that data transform it
  • 00:03:57
    and we land it into our data warehouse
  • 00:03:59
    and then we can also aggregate and
  • 00:04:03
    denormalize and summarize that data and
  • 00:04:04
    put it into a reporting layer a subset
  • 00:04:07
    of this data is moved over to a variety
  • 00:04:10
    of fast access storage engines
  • 00:04:13
    where that data is made available for
  • 00:04:17
    our data visualization tools tableau is
  • 00:04:20
    the most widely used visualization tool
  • 00:04:23
    at Netflix but we also have a variety of
  • 00:04:26
    other use cases
  • 00:04:26
    so we have other visualization tools
  • 00:04:29
    that we support and then this data is
  • 00:04:31
    also available to be queried and
  • 00:04:34
    interacted with using a variety of other
  • 00:04:36
    tools every person in the company has
  • 00:04:40
    access to our reports and to our data
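
The flow just described (raw Kafka-landed events in S3, cleaned with Spark into the warehouse, then summarized for the reporting layer and visualization tools) can be sketched roughly as below. Bucket paths, column names, and table layout are illustrative assumptions, not Netflix's actual jobs.

```python
# A hedged sketch of the flow described above; bucket paths, columns, and table
# names are illustrative assumptions, not Netflix's actual jobs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events_etl_sketch").getOrCreate()

# Raw ingestion layer: JSON events landed from the Kafka-backed pipeline.
raw = spark.read.json("s3://raw-ingestion/playback-events/2017-10-10/")

# Detail (warehouse) layer: keep well-formed events and partition by date.
detail = (
    raw.filter(F.col("event_type").isNotNull())
       .withColumn("event_date", F.to_date(F.from_unixtime(F.col("ts") / 1000)))
)
detail.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://warehouse/playback_detail/"
)

# Reporting layer: aggregated, denormalized summaries for tools like Tableau.
summary = detail.groupBy("event_date", "device").agg(
    F.count("*").alias("events"),
    F.countDistinct("member_id").alias("members"),
)
summary.write.mode("overwrite").parquet("s3://reporting/playback_daily_summary/")
```
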
  • 00:04:47
    but this is this is what happens when
  • 00:04:51
    everything works well this is what it
  • 00:04:53
    looks like let's talk about when things
  • 00:04:55
    go wrong when you have bad data but
  • 00:04:58
    really I don't like the term bad because
  • 00:05:00
    it implies intent and the data is not
  • 00:05:03
    trying to ruin your Monday it doesn't
  • 00:05:06
    want to create problems in your reports
  • 00:05:08
    so I like to think of it as unfortunate
  • 00:05:10
    data this is a visualization it's
  • 00:05:15
    actually I think one of the coolest
  • 00:05:16
    visualizations I've ever seen it's a
  • 00:05:18
    tool called visceral and it shows all of
  • 00:05:20
    the traffic that is coming into our
  • 00:05:22
    service and every single of those small
  • 00:05:25
    dots is an API and those large dots
  • 00:05:28
    represent various regions in our
  • 00:05:31
    AWS so if you look here you
  • 00:05:34
    might notice one of these circles is red
  • 00:05:37
    and that means that there's a problem
  • 00:05:38
    with that API what that means is that
  • 00:05:46
    we're receiving for the most part data
  • 00:05:49
    but we might not be receiving all data
  • 00:05:51
    we might not be receiving one type of
  • 00:05:53
    event so in this example let's say that
  • 00:05:56
    we can receive the events when you
  • 00:05:59
    click play on new content and when
  • 00:06:03
    you get to the end and you click that
  • 00:06:04
    click to play next episode we don't see
  • 00:06:08
    that right so let's just use that as a
  • 00:06:10
    hypothetical so now we have our bad data
  • 00:06:15
    our unfortunate data that is coming in
  • 00:06:17
    and let's say just for example purposes
  • 00:06:19
    that this is only affecting our tablet
  • 00:06:23
    devices maybe it's the latest release of
  • 00:06:25
    the Android SDK caused some sort of
  • 00:06:27
    compatibility issue and now we don't see
  • 00:06:29
    that one type of event it could be that
  • 00:06:32
    the event doesn't come through at all it
  • 00:06:34
    could be that the event comes through
  • 00:06:35
    but it's malformed it could be that it's
  • 00:06:38
    an empty payload
  • 00:06:39
    point is that we are somehow missing the
  • 00:06:41
    data that we're expecting and it doesn't
  • 00:06:43
    stop there and makes its way through our
  • 00:06:45
    entire system goes into our ingestion
  • 00:06:47
    pipeline it gets landed over into s3 it
  • 00:06:49
    ultimately makes its way to the data
  • 00:06:51
    warehouse it's going to be copied over
  • 00:06:53
    to our fast storage and if you're doing
  • 00:06:55
    extracts that's going to live there too
  • 00:06:57
    and ultimately this data is going to be
  • 00:07:01
    in front of our users and the problem
  • 00:07:04
    here is that you guys are the face of
  • 00:07:10
    the problem even though you had nothing
  • 00:07:12
    to do with it
  • 00:07:13
    you created the report and the user is
  • 00:07:16
    interacting with the report it is
  • 00:07:21
    generally not going to be your fault
  • 00:07:23
    I sometimes it's your fault but most of
  • 00:07:25
    the time what I've observed is that
  • 00:07:27
    reports issues with reports are quote
  • 00:07:30
    unquote upstream what does that mean
  • 00:07:32
    every single icon and every single arrow
  • 00:07:37
    on this diagram is a point of failure
  • 00:07:40
    and not just one or two possible things
  • 00:07:43
    that could go wrong a dozen or more
  • 00:07:45
    different things could go wrong and this
  • 00:07:48
    is a high-level view if we drilled down
  • 00:07:50
    you would see even more points of
  • 00:07:51
    failure so it's realistic to expect that
  • 00:07:54
    things will go wrong compounding this
  • 00:07:58
    problem for us is that according to
  • 00:07:59
    Sandvine Netflix accounts for 35% of all
  • 00:08:03
    peak traffic in North America
  • 00:08:20
    so I've described the problem we have
  • 00:08:23
    some bad data we have some unfortunate
  • 00:08:24
    data that is not the issue itself right
  • 00:08:27
    I mean what does it really matter if
  • 00:08:29
    I've got unfortunate data sitting in a
  • 00:08:32
    table somewhere the problem is when you
  • 00:08:35
    are using that data to make decisions
  • 00:08:38
    right that's the impact and it is my
  • 00:08:43
    personal belief that there is no more
  • 00:08:46
    there is no other company in the world
  • 00:08:48
    who is more data driven than Netflix
  • 00:08:50
    there are other companies who are as
  • 00:08:52
    data driven and they're not as big and
  • 00:08:54
    there are bigger companies and they're
  • 00:08:56
    not as data driven if if I'm mistaken
  • 00:08:59
    please see me afterwards I would love to
  • 00:09:01
    hear but when you're really a
  • 00:09:04
    data-driven company that means that you
  • 00:09:06
    are actively using and looking on that
  • 00:09:08
    looking at that data you're relying upon
  • 00:09:09
    it so how do we ensure they still have
  • 00:09:12
    confidence well first let's look at all
  • 00:09:15
    of the different roles we have from from
  • 00:09:18
    Netflix and why I think like we're so
  • 00:09:20
    data-driven we start with our data
  • 00:09:22
    engineers and and from my perspective
  • 00:09:23
    this is just a software engineer who
  • 00:09:25
    really specializes in data they
  • 00:09:27
    understand distributed systems they are
  • 00:09:29
    processing that 700 billion events and
  • 00:09:31
    making it consumable for the rest of the
  • 00:09:33
    company we also have our analytics
  • 00:09:35
    engineers who will usually pick up where
  • 00:09:37
    that data engineer left off they might
  • 00:09:39
    be doing some aggregations or creating
  • 00:09:42
    some summary tables they'll be creating
  • 00:09:45
    some visualizations and they might even
  • 00:09:46
    do some ad hoc analysis so we consider
  • 00:09:49
    them sort of full stack within this data
  • 00:09:51
    space and then we have people that
  • 00:09:52
    specialize in just data visualization
  • 00:09:55
    they are really really good at making
  • 00:09:59
    the data makes sense to people I'm
  • 00:10:02
    curious though how many of you would
  • 00:10:04
    consider yourself like an analytics
  • 00:10:06
    engineer you have to create tables as
  • 00:10:08
    well as the reports wow that's actually
  • 00:10:13
    more than I thought it's a pretty good
  • 00:10:15
    portion of the room how many of you only
  • 00:10:17
    do visualization show hands ok so more
  • 00:10:22
    people actually have to create the
  • 00:10:23
    tables than they do just the
  • 00:10:25
    visualizations interesting so those are
  • 00:10:28
    the people that can that I consider
  • 00:10:29
    these data producers or they're creating
  • 00:10:32
    these
  • 00:10:33
    these data objects for the rest of the
  • 00:10:34
    company to consume then we move into our
  • 00:10:36
    data consumers and this would be our
  • 00:10:38
    business analyst which are probably very
  • 00:10:39
    similar to your business analyst they
  • 00:10:42
    have really deep vertical expertise and
  • 00:10:44
    they are producing they're producing
  • 00:10:48
    analysis like what is the subscriber
  • 00:10:51
    forecast for EMEA we also have research
  • 00:10:54
    scientist and quantitative analyst in
  • 00:10:56
    our science and algorithms groups and
  • 00:10:58
    they are focused on answering really big
  • 00:11:02
    hard questions we have our data
  • 00:11:07
    scientist and machine learning scientist
  • 00:11:09
    and they are creating models to help us
  • 00:11:11
    predict behaviors or make better
  • 00:11:12
    decisions and these would be our
  • 00:11:15
    consumers so these people are affected
  • 00:11:18
    by the bad data but they're not there's
  • 00:11:20
    really no impact yet it's not until we
  • 00:11:22
    get to this top layer that we really
  • 00:11:24
    start to see impact and starts with our
  • 00:11:26
    executives they are looking at that data
  • 00:11:29
    to make decisions about the company's
  • 00:11:32
    strategy many companies say they're
  • 00:11:34
    data-driven
  • 00:11:35
    but what that really means is that I've
  • 00:11:37
    got an idea and I just need the data to
  • 00:11:39
    prove it oh that doesn't look good go
  • 00:11:42
    look over here instead until they can
  • 00:11:43
    find the data that proves their points
  • 00:11:46
    you can go off in the direction they
  • 00:11:47
    want being data-driven means that you
  • 00:11:49
    look at the data first and then you make
  • 00:11:50
    decisions so if we if we provide them
  • 00:11:54
    with bad data bad insights they could
  • 00:11:57
    make a really bad decision for the
  • 00:11:59
    company our product managers we have
  • 00:12:02
    these across every verticals but one
  • 00:12:03
    example would be our content team they
  • 00:12:06
    are asking the question what should we
  • 00:12:08
    spend that six billion dollars on what
  • 00:12:11
    titles should we license what titles
  • 00:12:14
    should we create and they do that by
  • 00:12:16
    relying upon predictive models built by
  • 00:12:19
    our data scientist saying here's what we
  • 00:12:21
    expect the audience to be for a title
  • 00:12:24
    and based upon that we can back into a
  • 00:12:28
    number that we're willing to pay this is
  • 00:12:30
    actually a really good model for this
  • 00:12:33
    because it allows us to support niche
  • 00:12:35
    audiences with small film titles but
  • 00:12:38
    also spend a lot of money on things like
  • 00:12:41
    the the Marvel and Disney partnerships
  • 00:12:43
    where we know it's going to have broad
  • 00:12:44
    appeal
  • 00:12:46
    we have our algorithm engineers who are
  • 00:12:49
    trying to decide what is the right
  • 00:12:50
    content to show you on the site we have
  • 00:12:53
    between 60 and 90 seconds for you to
  • 00:12:56
    find content before you leave and I know
  • 00:12:58
    it feels like longer than 60 or 90
  • 00:13:00
    seconds when you're clicking next next
  • 00:13:02
    next but that's about how long we have
  • 00:13:05
    before you go and spend your free time
  • 00:13:06
    doing something else and then we have
  • 00:13:10
    our software engineers who are trying to
  • 00:13:13
    just constantly experiment with things
  • 00:13:15
    they look at the data and they they roll
  • 00:13:18
    it out to everybody if it makes sense
  • 00:13:19
    and this could be everything from the
  • 00:13:21
    user experience that you actually see to
  • 00:13:23
    things that you don't see like what is
  • 00:13:25
    the optimal compression for for our
  • 00:13:29
    video encoding so that we can lower the
  • 00:13:32
    amount of bandwidth that you have to
  • 00:13:33
    spend while also preventing you from
  • 00:13:35
    having like a really bad video
  • 00:13:37
    experience so these are the the impact
  • 00:13:40
    is really at that top level so how do we
  • 00:13:46
    design for these unfortunate events I
  • 00:13:49
    mean we we have the data we've got lots
  • 00:13:51
    of data we've got lots of people who
  • 00:13:53
    want to look at it a lot of people who
  • 00:13:54
    are depending upon it you know and I
  • 00:13:57
    think that you have two options here the
  • 00:13:58
    first option is you can say we're going
  • 00:14:01
    to prevent anything from going wrong
  • 00:14:03
    we're gonna check for everything and
  • 00:14:05
    we're just gonna we're just gonna lock
  • 00:14:06
    it down and when something gets deployed
  • 00:14:08
    we're gonna make sure that that thing is
  • 00:14:10
    airtight and that works that that sounds
  • 00:14:12
    good in principle but the reality is
  • 00:14:14
    that usually these issues don't occur
  • 00:14:15
    when you deploy something usually
  • 00:14:18
    everything looks great
  • 00:14:19
    and then six months later there's a
  • 00:14:21
    problem right so it's a lot in my
  • 00:14:24
    perspective a lot better instead to
  • 00:14:26
    detect issues and respond to them than
  • 00:14:29
    it is to try to prevent them and when it
  • 00:14:32
    comes to detecting data quality now I'm
  • 00:14:35
    gonna get a little bit more technical
  • 00:14:36
    here please bear with me I think that
  • 00:14:38
    all this stuff that was really relevant
  • 00:14:39
    for you guys and I think that there's
  • 00:14:41
    some really good takeaways for you so
  • 00:14:42
    there's a reason I'm showing you this so
  • 00:14:44
    we're gonna drill down into this data
  • 00:14:45
    storage layer
  • 00:14:50
    and we're gonna look at at this concept
  • 00:14:53
    of a table right and so in Hadoop how
  • 00:14:55
    many of you actually work with Hadoop at
  • 00:14:57
    all how many of you work with only like
  • 00:15:01
    a you work with like an enterprise data
  • 00:15:02
    warehouse but it's on something else
  • 00:15:04
    like Teradata okay so in Teradata a
  • 00:15:09
    table is both a logical and a physical
  • 00:15:13
    construct you cannot separate the two in
  • 00:15:15
    Hadoop you can we have the data sitting
  • 00:15:17
    somewhere on storage and we have this
  • 00:15:19
    concept of a table which is really just
  • 00:15:20
    a pointer to that and we can choose to
  • 00:15:23
    point to the data or we can choose not to
  • 00:15:25
    one thing though that we've built is a
  • 00:15:28
    tool called Metacat and whenever we
  • 00:15:30
    write that data whenever we do that
  • 00:15:31
    pointing we are creating another logical
  • 00:15:34
    object we're creating a partition object
  • 00:15:36
    and this partition object has statistics
  • 00:15:40
    about the data that was just written we
  • 00:15:43
    can look at things like the row counts
  • 00:15:45
    and we can look at the number of nulls
  • 00:15:48
    in that file and say okay we're gonna
  • 00:15:51
    use this information to see if there's a
  • 00:15:53
    problem we can also drill down a little
  • 00:15:56
    bit deeper into the field level and we
  • 00:15:59
    can use this to say like some really
  • 00:16:01
    explicit checks we can say well I'm
  • 00:16:03
    checking for the max value of this
  • 00:16:05
    metric and the max value is zero and
  • 00:16:07
    that doesn't make sense unless it's some
  • 00:16:09
    sort of negative value for this field
  • 00:16:11
    but chances are this is either a brand
  • 00:16:12
    new field or there's a problem and so we
  • 00:16:15
    can check for these things you don't
  • 00:16:17
    have to do things the way that I'm
  • 00:16:19
    describing but the concept I think is
  • 00:16:21
    pretty transferable having statistics
  • 00:16:22
    about the data that you write enable a
  • 00:16:24
    lot more powerful things
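
A rough sketch of that partition-statistics idea, written against PySpark; this is not Metacat's actual API, and the demo columns are hypothetical.

```python
# Illustrative only: compute a few statistics about a partition at write time,
# in the spirit of the partition objects described above. Not Metacat's API;
# the demo columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition_stats_sketch").getOrCreate()


def partition_stats(df, metric_col):
    """Row count, per-column null counts, and the max of one metric column."""
    null_exprs = [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    return {
        "row_count": df.count(),
        "null_counts": df.agg(*null_exprs).first().asDict(),
        "max_" + metric_col: df.agg(F.max(metric_col)).first()[0],
    }


# Tiny demo partition; in practice this would be the partition just written.
demo = spark.createDataFrame(
    [("play", 120), ("pause", None), ("play", 0)],
    ["event_type", "watch_seconds"],
)
print(partition_stats(demo, "watch_seconds"))
# e.g. flag the partition if max_watch_seconds == 0 or a null count spikes
```
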
  • 00:16:32
    okay so now that we have the statistics
  • 00:16:34
    I mean we can use them in isolation and
  • 00:16:36
    say oh we got a zero that's a problem
  • 00:16:37
    but typically the issues are a little
  • 00:16:40
    bit more difficult to find than than
  • 00:16:43
    just that and so what we can do is take
  • 00:16:45
    that data in chart for example row
  • 00:16:47
    counts over time and you can see that
  • 00:16:49
    we've got peaks and valleys here and
  • 00:16:51
    this is really denoting that there's
  • 00:16:53
    some difference in behavior based upon
  • 00:16:56
    the day of week and so if we use a
  • 00:17:00
    standard normal deviate distribution we
  • 00:17:04
    can look for something that falls
  • 00:17:06
    outside of like a 90% confidence
  • 00:17:08
    interval and if it does we can be pretty
  • 00:17:11
    confident that maybe there's not a
  • 00:17:13
    problem but we definitely want someone
  • 00:17:14
    to go look to see if there's a problem
  • 00:17:16
    and so when we compare this for the same
  • 00:17:20
    day of week week over week for 30
  • 00:17:23
    periods we start to see that we have
  • 00:17:25
    some outliers we have some things that
  • 00:17:26
    might be problems we can also see that
  • 00:17:30
    the data that we wrote most
  • 00:17:32
    recently looks really suspect because I
  • 00:17:35
    wrote 10 billion rows and typically I
  • 00:17:41
    write between 80 and a hundred billion
  • 00:17:43
    rows right so chances are there's a
  • 00:17:46
    problem with this particular run of the
  • 00:17:48
    ETL
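
A minimal sketch of the check just described: compare the new run's row count with the same day of week over roughly 30 weekly periods and halt the run when it falls outside a ~90% confidence band, so the bad partition never becomes visible. The numbers and function names are made up for illustration.

```python
# Illustrative only: same-day-of-week row-count check with a ~90% confidence band.
import statistics


def audit_row_count(new_count, same_weekday_history, z_threshold=1.645):
    """Fail the run if the new partition falls outside a ~90% confidence band."""
    mean = statistics.mean(same_weekday_history)
    stdev = statistics.stdev(same_weekday_history)
    z = (new_count - mean) / stdev
    if abs(z) > z_threshold:
        raise RuntimeError(f"row count {new_count}B vs mean {mean:.1f}B (z={z:.1f})")


# ~30 prior weekly periods for the same day of week, in billions of rows (made up).
prior_mondays = [92, 88, 95, 91, 97, 85, 93, 90, 94, 89] * 3

try:
    audit_row_count(10, prior_mondays)   # today's suspicious 10-billion-row run
    print("publish the partition to the warehouse and reporting layers")
except RuntimeError as err:
    print(f"ETL halted before publish: {err}")
```
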
  • 00:17:53
    so we can detect the issues but that
  • 00:17:56
    doesn't really prevent the problem of
  • 00:17:58
    the impact the perennial question can I
  • 00:18:02
    trust this report can I trust this data
  • 00:18:05
    I have no idea
  • 00:18:07
    looking at this if there's a data
  • 00:18:08
    quality issue and what's really
  • 00:18:11
    problematic and what is really the issue
  • 00:18:14
    for you guys is when people look at
  • 00:18:17
    these reports they trust the reports and
  • 00:18:19
    then afterwards we tell them that data
  • 00:18:22
    is actually wrong we're gonna back it
  • 00:18:25
    out we're gonna fix it for you
  • 00:18:27
    there was no indication to them looking
  • 00:18:30
    at this report that they couldn't trust
  • 00:18:32
    it but now the next time they look at
  • 00:18:33
    this report guess what it's gonna be
  • 00:18:36
    there in the back of their mind is this
  • 00:18:37
    data good can I trust it so what we've
  • 00:18:44
    done is built a process that checks for
  • 00:18:47
    these before the data becomes visible
  • 00:18:48
    and all of the bad unfortunate stuff can
  • 00:18:51
    still happen we still have the data
  • 00:18:54
    coming in it's still landing in our
  • 00:18:56
    ingestion layer but before we write it
  • 00:18:59
    out to our data warehouse we were
  • 00:19:00
    checking for those standard deviations
  • 00:19:02
    and when we find exceptions we fail the
  • 00:19:05
    ETL we don't go any further in the
  • 00:19:07
    process we also check before we get to
  • 00:19:10
    our reporting layer same thing what this
  • 00:19:16
    means is that your user is not going to
  • 00:19:19
    see their data right they're gonna come
  • 00:19:21
    to the report and it looks like this
  • 00:19:26
    your user your business user is going to
  • 00:19:30
    see there's missing data and now they're
  • 00:19:32
    going to know there was a quality issue
  • 00:19:34
    and we don't want them to know that
  • 00:19:36
    right
  • 00:19:37
    wrong we want them to know there was a
  • 00:19:41
    problem because it's not your fault
  • 00:19:43
    there was a problem it's there's so many
  • 00:19:46
    things that could go wrong but simply by
  • 00:19:48
    showing them this explicitly that we
  • 00:19:51
    have no data they retain confidence they
  • 00:19:55
    know they're not making decisions
  • 00:19:57
    based on bad data and your
  • 00:20:00
    business should not be making major
  • 00:20:02
    decisions on a single day's worth of
  • 00:20:04
    data
  • 00:20:05
    where it becomes really problematic is
  • 00:20:07
    when you're doing trends and percent
  • 00:20:09
    changes and you know those things even
  • 00:20:11
    bad data can really have a big impact so
  • 00:20:18
    one thing that you do have to do to make
  • 00:20:20
    this work is you have to surface the
  • 00:20:23
    information so that users can really see
  • 00:20:25
    when was the data last loaded when did
  • 00:20:27
    we last validate it so
  • 00:20:30
    there's two things not showing them bad
  • 00:20:32
    data and providing visibility into the
  • 00:20:34
    current state
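
A tiny sketch of the kind of freshness information a report could surface; the metadata fields and thresholds are hypothetical.

```python
# Illustrative only: surface "last loaded / last validated" to report users.
from datetime import datetime, timedelta

table_meta = {
    "last_loaded": datetime(2017, 10, 10, 6, 42),
    "last_validated": datetime(2017, 10, 10, 6, 45),
}


def freshness_banner(meta, now, max_age=timedelta(hours=24)):
    age = now - meta["last_loaded"]
    status = "FRESH" if age <= max_age else "STALE"
    return (f"[{status}] loaded {meta['last_loaded']:%Y-%m-%d %H:%M}, "
            f"validated {meta['last_validated']:%Y-%m-%d %H:%M}")


print(freshness_banner(table_meta, now=datetime(2017, 10, 11, 9, 0)))
```
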
  • 00:20:35
    we're also this is a view of our big
  • 00:20:37
    data portal it's an internal tool that
  • 00:20:39
    we've developed I think there's other
  • 00:20:40
    third-party tools out there that might
  • 00:20:42
    do some of their things we're also
  • 00:20:43
    planning to add visibility to the actual
  • 00:20:45
    failures and alerts so that business
  • 00:20:47
    users can see those but so now we've
  • 00:20:52
    we've detected the issue we've prevented
  • 00:20:54
    there from being any negative impact but
  • 00:20:56
    we still have to fix the problem right
  • 00:20:59
    they still want the data at the end of
  • 00:21:01
    the day there's two components to fixing
  • 00:21:05
    the problem quickly the first one is as
  • 00:21:07
    I just mentioned visibility but this
  • 00:21:09
    time visibility for the people who need
  • 00:21:10
    to understand what the problem is and
  • 00:21:11
    they need to fix it so one of the things
  • 00:21:13
    that we're doing is surfacing this
  • 00:21:15
    information you know the question might
  • 00:21:17
    be why did my job suddenly spike in in
  • 00:21:20
    run time right why is this taking so
  • 00:21:22
    long and you can look here and you can
  • 00:21:25
    easily see oh it's because you received
  • 00:21:27
    a lot more data and then this becomes a
  • 00:21:30
    question well is that because somebody
  • 00:21:32
    deployed something upstream and now it's
  • 00:21:33
    duplicating everything I mean it gives
  • 00:21:36
    you a starting point to understand what
  • 00:21:37
    are the problems we also directly
  • 00:21:40
    display the failure message or
  • 00:21:42
    the failures and then give you a link to
  • 00:21:44
    go see the failure messages themselves
  • 00:21:45
    so that when users are trying to
  • 00:21:47
    troubleshoot it again we're just trying
  • 00:21:49
    to make it easier and faster for them to
  • 00:21:51
    get there and we are exposing the
  • 00:21:58
    relationships between data sets so this
  • 00:22:01
    is the lineage data you know how do
  • 00:22:03
    these things relate what things are
  • 00:22:05
    waiting on me to fix this and who do I
  • 00:22:08
    need to notify that there's a problem
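
Lineage data like this can be walked programmatically; below is a minimal sketch, with a hypothetical dependency graph, of finding everything downstream of a table you just fixed so you know what to re-trigger or whom to notify.

```python
# Illustrative only: walk a table-level lineage graph to find all downstream tables.
from collections import deque

lineage = {                      # table -> tables that read directly from it (hypothetical)
    "playback_detail": ["daily_agg", "device_agg"],
    "daily_agg": ["exec_dashboard_extract", "region_summary"],
    "device_agg": ["qoe_report"],
    "region_summary": [],
    "exec_dashboard_extract": [],
    "qoe_report": [],
}


def downstream_of(table, graph):
    seen, queue = set(), deque([table])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


print(downstream_of("playback_detail", lineage))
# -> everything that must be reflowed, or whose owners must be notified
```
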
  • 00:22:12
    all right now I'm gonna cover this real
  • 00:22:14
    quick this is about scheduling and
  • 00:22:18
    pulling versus pushing
  • 00:22:20
    not something that you guys here would
  • 00:22:23
    implement but something that you should
  • 00:22:24
    be having conversations with your
  • 00:22:26
    infrastructure teams about traditionally
  • 00:22:28
    we use a schedule based system where we
  • 00:22:33
    say ok it's 6 o'clock my job's gonna run
  • 00:22:34
    and I'm gonna take that 700 billion
  • 00:22:37
    events and I'm gonna create this really
  • 00:22:38
    clean detailed table and then at 7
  • 00:22:41
    o'clock I'm gonna have another job run
  • 00:22:43
    and it's gonna aggregate that data at 8
  • 00:22:47
    o'clock I'm gonna have another process
  • 00:22:50
    that runs and it's going to normalize it
  • 00:22:51
    to get it ready for my report I'm gonna
  • 00:22:55
    copy it over to my fast access layer at
  • 00:22:57
    8:30 and by 9 o'clock my report should
  • 00:23:00
    be ready in a push based system you
  • 00:23:04
    might still have some scheduling
  • 00:23:05
    component to it you might say well I
  • 00:23:07
    want everything to start at 6:00 a.m.
  • 00:23:08
    but the difference is that once this job
  • 00:23:11
    is done it notifies the aggregate job
  • 00:23:14
    that it's ready to run because there's
  • 00:23:15
    new data which notifies the
  • 00:23:17
    de-normalized job that it's ready to run
  • 00:23:19
    which notifies or just executes the
  • 00:23:22
    extract over to your fast access layer
  • 00:23:25
    and your report becomes available to
  • 00:23:27
    everybody by maybe 7:42 you could see
  • 00:23:33
    the benefits here and being able to get
  • 00:23:35
    to data and getting the data out faster
  • 00:23:37
    to your users this is not why you guys
  • 00:23:39
    care about this this is probably why
  • 00:23:41
    your business users might care about it
  • 00:23:43
    why you guys care about it is because
  • 00:23:45
    things don't always work perfectly in
  • 00:23:50
    fact they usually don't and when that
  • 00:23:52
    happens you're gonna have to reflow and
  • 00:23:53
    fix things so this is an actual table
  • 00:23:56
    that we have we have one table that
  • 00:24:00
    populates six tables which populates 38
  • 00:24:03
    tables which populates 586 this is a
  • 00:24:07
    pretty run-of-the-mill table for us I
  • 00:24:10
    have one table that by the third level
  • 00:24:11
    of dependency has 2000 table
  • 00:24:15
    dependencies so how do we fix the data
  • 00:24:18
    when the data has started off on one
  • 00:24:20
    place and has been propagated to all of
  • 00:24:22
    these other places in a full system you
  • 00:24:26
    rerun your job and my my detailed data
  • 00:24:30
    my aggregate my
  • 00:24:32
    normalised view all of these views get
  • 00:24:34
    updated and my report is good but the
  • 00:24:37
    other 582 tables are kind of left
  • 00:24:41
    hanging and you could notify them if you
  • 00:24:45
    have visibility to who these people are
  • 00:24:47
    that are consuming your data but they
  • 00:24:50
    still have to go take action and what's
  • 00:24:52
    gonna happen is you're gonna tell them
  • 00:24:53
    hey we've had this data quality issue
  • 00:24:55
    and we reflowed and it's really important
  • 00:24:58
    that you rerun your job and they're
  • 00:24:59
    gonna think okay yeah but I deprecated
  • 00:25:01
    that and yeah I might have forgot to
  • 00:25:03
    turn off my ETL and they have no idea
  • 00:25:05
    that somebody else I started to rely
  • 00:25:07
    upon that data for the report right
  • 00:25:09
    happens all the time people don't feel
  • 00:25:11
    particularly incentivized to go rerun
  • 00:25:14
    and clean up things unless they know
  • 00:25:15
    what the impact is in a push system we
  • 00:25:20
    fix the one table it notifies the next
  • 00:25:23
    tables that there's new data which
  • 00:25:25
    notifies those tables that there's new
  • 00:25:26
    data and everything gets fixed
  • 00:25:29
    downstream this is a perfect world it's
  • 00:25:32
    very idealistic you this is like a very
  • 00:25:35
    pure push type system what you should be
  • 00:25:39
    having discussions with your your
  • 00:25:41
    internal infrastructure team is is that
  • 00:25:43
    you should not need to know that there
  • 00:25:45
    is an upstream issue you should just be
  • 00:25:47
    able to rely upon the fact that when
  • 00:25:50
    there is a problem
  • 00:25:51
    your jobs will be executed for you so
  • 00:25:53
    that you can rely upon that data nobody
  • 00:25:55
    should have to go do that manually it
  • 00:25:57
    doesn't scale part four
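
A minimal sketch of that push idea: register which tables each job reads and writes, and let a completed job trigger its dependents instead of waiting on a clock. Job and table names are hypothetical.

```python
# Illustrative only: a toy push-based dependency trigger.
from collections import defaultdict

jobs = {}                        # job name -> {"reads": [...], "writes": "table"}
downstream = defaultdict(list)   # table -> jobs to trigger when it gets new data


def register(name, reads, writes):
    jobs[name] = {"reads": reads, "writes": writes}
    for table in reads:
        downstream[table].append(name)


def notify(table):
    """Push: new data in `table` immediately kicks off every dependent job."""
    for name in downstream[table]:
        run_job(name)


def run_job(name):
    spec = jobs[name]
    print(f"running {name}: {spec['reads']} -> {spec['writes']}")
    # ... the actual Spark/SQL transformation would run here ...
    notify(spec["writes"])       # cascade to whatever reads this table


register("build_detail", reads=["raw_events"], writes="detail")
register("build_aggregate", reads=["detail"], writes="aggregate")
register("build_report_extract", reads=["aggregate"], writes="report_extract")

# One upstream fix (or the 6 a.m. landing of raw events) cascades through the
# whole chain; nobody has to remember to rerun the downstream jobs by hand.
notify("raw_events")
```
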
  • 00:26:05
    so we've gone through this and we talked
  • 00:26:10
    about some of the different ways that
  • 00:26:11
    that unfortunate data would impact us
  • 00:26:13
    but that's not the reality for us the
  • 00:26:16
    reality is that our users do have
  • 00:26:19
    confidence in the data we produce that
  • 00:26:21
    doesn't mean there's not quality issues
  • 00:26:22
    but overall generally speaking they have
  • 00:26:25
    confidence that what we produce and what
  • 00:26:26
    we provide to them is good and because
  • 00:26:29
    of that were able to do some really cool
  • 00:26:30
    things our executives actually looked up
  • 00:26:34
    the data for how content was being used
  • 00:26:37
    and the efficiency of it and they made a
  • 00:26:39
    decision based upon the data to start
  • 00:26:43
    investing in originals and over the next
  • 00:26:47
    few years we went from I think it was a
  • 00:26:51
    handful of hours in 2012 to about a
  • 00:26:54
    thousand hours of content in 2017 we've
  • 00:26:58
    ramped this up very very quickly and we
  • 00:27:01
    have set a goal of by 2020 50% of our
  • 00:27:06
    content will be original content 50% of
  • 00:27:08
    that six billion dollars that were
  • 00:27:10
    spending will be on new content that we
  • 00:27:11
    create but this was a strategy decision
  • 00:27:14
    that was informed by the data that we
  • 00:27:16
    had we also have our product managers
  • 00:27:21
    here are looking at the data and they
  • 00:27:24
    are I mean they're making some pretty
  • 00:27:27
    good selections with what content we
  • 00:27:30
    should be purchasing on the service
  • 00:27:31
    we've had some pretty good ones and
  • 00:27:32
    they're going to continue to use the
  • 00:27:34
    data to decide what is the next best
  • 00:27:37
    thing for us to buy we have our software
  • 00:27:41
    engineers who have built out constantly
  • 00:27:49
    evolving user interfaces and user
  • 00:27:54
    experience for us and this doesn't
  • 00:27:56
    happen all at once this isn't like a
  • 00:27:57
    monolithic project where they just roll
  • 00:28:00
    out these big changes instead they are
  • 00:28:04
    testing these they're making small
  • 00:28:06
    changes they're testing them
  • 00:28:07
    incrementally they're making the
  • 00:28:08
    decision to roll it out we we have about
  • 00:28:10
    like a hundred different tests going on
  • 00:28:11
    right this moment to see what is the
  • 00:28:13
    best thing that we should do and so you
  • 00:28:15
    can see what we looked like in 2017 or
  • 00:28:17
    2016
  • 00:28:18
    here's what we look like in 2017
  • 00:28:50
    so you can imagine for a moment the
  • 00:28:54
    amount of complexity and different
  • 00:28:57
    systems that are involved in making
  • 00:28:58
    something like that happen and before we
  • 00:29:01
    really make that investment we go and
  • 00:29:03
    test it is our theory correct one thing
  • 00:29:06
    that I think is really interesting is
  • 00:29:07
    that we find oftentimes we can predict
  • 00:29:12
    what people will like if those people
  • 00:29:13
    are exactly like us and usually people
  • 00:29:17
    are not exactly like us and so instead
  • 00:29:19
    we just throw it out there whenever we
  • 00:29:20
    have ideas we test them and then we we
  • 00:29:23
    respond to the results and then we have
  • 00:29:27
    our algorithm engineers these are the
  • 00:29:30
    people who are responsible for just
  • 00:29:33
    putting the intelligence in our system
  • 00:29:34
    and making it fast and and seamless and
  • 00:29:37
    so the most well-known case is our
  • 00:29:39
    recommendation systems I could talk
  • 00:29:41
    about it but I actually think this video
  • 00:29:43
    is a little bit more interesting
  • 00:29:48
    [Music]
  • 00:30:03
    [Music]
  • 00:30:12
    [Music]
  • 00:30:26
    [Music]
  • 00:30:45
    just
  • 00:30:49
    [Applause]
  • 00:30:53
    [Music]
  • 00:30:55
    now we did not create the video
  • 00:30:58
    but we provided the data behind the
  • 00:31:01
    video 80% of people watch content
  • 00:31:05
    through recommendations 80% we did not
  • 00:31:10
    start off there at that number it was
  • 00:31:13
    only through constant iteration looking
  • 00:31:15
    at the data and getting or responding to
  • 00:31:19
    the things that had the positive results
  • 00:31:20
    that we got to this place okay so we'll
  • 00:31:24
    start wrapping up the key takeaways
  • 00:31:27
    obviously I don't expect you guys to go
  • 00:31:31
    back and do this in your environment but
  • 00:31:34
    I think that there are some key
  • 00:31:35
    principles that really make sense for a
  • 00:31:38
    lot of people outside of just what we're
  • 00:31:40
    doing here at Netflix the first one is
  • 00:31:43
    that expecting failure is more efficient
  • 00:31:47
    than trying to prevent it and this is
  • 00:31:51
    true for your your data teams but this
  • 00:31:53
    is also true for you as data
  • 00:31:55
    visualization people how can you expect
  • 00:31:59
    and respond to failures and to issues
  • 00:32:01
    with the data and with your reports
  • 00:32:03
    rather than trying to prevent them so
  • 00:32:06
    shift your mind mindset and say I know
  • 00:32:09
    it's gonna happen what are we gonna do
  • 00:32:11
    when it happens stale data you know I
  • 00:32:20
    have never heard someone tell me I would
  • 00:32:23
    have rather had the incomplete data
  • 00:32:25
    faster than I had the stale accurate
  • 00:32:28
    data I am sure there are cases out there
  • 00:32:30
    where that is true
  • 00:32:31
    but it is almost never the case people
  • 00:32:35
    don't make decisions based upon one hour
  • 00:32:38
    or one day's worth of data they might
  • 00:32:41
    want to know what's happening they might
  • 00:32:42
    say I just launched something and I
  • 00:32:44
    really want to know how it's performing
  • 00:32:45
    I mean that's it that's a natural human
  • 00:32:47
    trait that curiosity but it is not
  • 00:32:52
    impactful right and so ask yourself
  • 00:32:56
    would they rather see data faster and
  • 00:32:59
    have it be wrong or would they rather
  • 00:33:01
    know that when they do finally see the
  • 00:33:03
    data and maybe a few minutes later maybe
  • 00:33:05
    an hour later
  • 00:33:06
    that it's right and this is actually I
  • 00:33:11
    know this is really hard it's easy to
  • 00:33:13
    tell you guys this I know that you have
  • 00:33:14
    to go back to your business users and
  • 00:33:15
    tell them this I realized that but you
  • 00:33:18
    know if you explain it to them in this
  • 00:33:19
    way usually they can begin to understand
  • 00:33:23
    and then the last thing this is really
  • 00:33:27
    validation for you guys I've had a lot
  • 00:33:30
    of people ask me how do you guys do this
  • 00:33:33
    and we have all these problems and the
  • 00:33:35
    reality is that we have those problems
  • 00:33:37
    too
  • 00:33:38
    this stuff is really really hard it's
  • 00:33:41
    hard to get right it's hard to do well
  • 00:33:44
    it's hard for us and I think we'd do it
  • 00:33:47
    pretty well I think there's a lot of
  • 00:33:48
    things we can do better though so don't
  • 00:33:51
    just know that you're not alone it's not
  • 00:33:53
    you it's not your environment take this
  • 00:33:55
    slide back to your boss and show them
  • 00:33:58
    like this stuff is really really hard to
  • 00:34:00
    do right ok please when you are done
  • 00:34:07
    complete the session survey and that's
  • 00:34:11
    my talk
  • 00:34:13
    [Applause]
  • 00:34:27
    anybody have questions
  • 00:34:33
    and if you have questions and you're not
  • 00:34:34
    able to stay right now feel free to find
  • 00:34:36
    me afterwards I'm happy to answer them
  • 00:34:41
    hi thanks for the very insightful talk I
  • 00:34:45
    had one question I was curious when you
  • 00:34:47
    showed the dependency diagram let's say
  • 00:34:49
    you have a mainstream table that has a
  • 00:34:51
    couple of hundred or let's say a
  • 00:34:53
    thousand tables feeding downstream
  • 00:34:55
    dependencies and if you have to make a
  • 00:34:57
    change I'm sure like no table design is
  • 00:35:00
    constant so for the upcoming changes how
  • 00:35:02
    do you manage to like ensure that you
  • 00:35:05
    are still flexible for those changes and
  • 00:35:07
    also are fulfilling these downstream
  • 00:35:09
    dependencies for the historical data
  • 00:35:11
    also so the question is how do you
  • 00:35:15
    identify and ensure that there are no
  • 00:35:17
    issues and your downstream dependencies
  • 00:35:18
    when you have thousands of tables no
  • 00:35:21
    let's say no it's rather like if you had
  • 00:35:23
    to make a change how do you make a quick
  • 00:35:25
    enough change and also make sure that
  • 00:35:26
    you have the data historically to for a
  • 00:35:30
    major upstream table that you are you
  • 00:35:34
    wanting the change applied to your
  • 00:35:36
    downstream tables yeah oh okay good
  • 00:35:38
    question so how do you how do you
  • 00:35:40
    quickly evolve schema probably and apply
  • 00:35:43
    that downstream you know the first thing
  • 00:35:46
    is you have to have lineage data without
  • 00:35:48
    lineage data you can't really get
  • 00:35:50
    anywhere so the first step is always to
  • 00:35:52
    make sure you have lineage data the
  • 00:35:53
    second thing is around automation and
  • 00:35:56
    your ability to understand the data and
  • 00:35:57
    understand the schema so if you can
  • 00:35:59
    understand schema evolution and changes
  • 00:36:01
    there then you can start to apply them
  • 00:36:02
    programmatically at any point where you
  • 00:36:04
    can automate so automate automation of
  • 00:36:07
    ingestion and the export of data would
  • 00:36:10
    be an optimal place beyond that you
  • 00:36:12
    could start to you know it becomes a
  • 00:36:16
    little bit more pragmatic it depends on
  • 00:36:18
    how complex your scripts are right so if
  • 00:36:19
    you've got pretty simple like select
  • 00:36:20
    from here group by block you could
  • 00:36:24
    automate that if you had all the data
  • 00:36:26
    sitting in like a centralized code base
  • 00:36:29
    but that becomes a lot more tricky and
  • 00:36:31
    we actually have not gone that route
  • 00:36:32
    because we we want people to have
  • 00:36:34
    control instead what we've done is we've
  • 00:36:35
    notified them that there is going to be
  • 00:36:37
    a change we notify them when the change
  • 00:36:40
    is made and
  • 00:36:40
    it's on them to actually apply the
  • 00:36:43
    changes and then everything that we can
  • 00:36:44
    automate because there's a simple select
  • 00:36:46
    move from here to there we've automated
  • 00:36:47
    so they don't have to worry about that
  • 00:36:49
    good question one more question yes so
  • 00:36:51
    with the rising speech based AI
  • 00:36:54
    assistants so in the future if Netflix
  • 00:36:56
    gets support of let's say Google home or
  • 00:36:59
    Siri how do we think that the how do we
  • 00:37:10
    think the data flow is going to be
  • 00:37:11
    affected by home systems and speech
  • 00:37:14
    recognition yeah I mean there'll be a
  • 00:37:16
    lot more data than we are used to right
  • 00:37:18
    right now yes I don't have a specific
  • 00:37:23
    answer but I would imagine that it's
  • 00:37:25
    just gonna be another endpoint for us
  • 00:37:27
    and that it will just be writing out to
  • 00:37:28
    those ingestions the you know typically
  • 00:37:32
    what's happening in those cases is that
  • 00:37:34
    there's like another service out there
  • 00:37:35
    on Alexa or Google that is doing that
  • 00:37:37
    translation so we probably wouldn't get
  • 00:37:39
    visibility that it would be more of like
  • 00:37:41
    you know this phrase was made and then
  • 00:37:44
    here was the content that was shown and
  • 00:37:45
    we'd probably logged that in the same
  • 00:37:46
    way we log pretty much everything but I
  • 00:37:48
    don't actually know that for sure thanks
  • 00:37:50
    I I have a few questions but I'll try to
  • 00:37:56
    put it in one I missed the first part of
  • 00:38:01
    your presentation where you're drawing
  • 00:38:03
    the architecture and you're showing us
  • 00:38:04
    how data reaches the tableau dashboard
  • 00:38:07
    you mentioned later on that you're using
  • 00:38:09
    tableau data extracts when you're
  • 00:38:11
    building extracts or and feeding the
  • 00:38:14
    dashboard at a certain time my question
  • 00:38:17
    is do you have live connections as the
  • 00:38:19
    data is flowing through your ETL process
  • 00:38:21
    checking the quality and running
  • 00:38:23
    analytics for people to actually do
  • 00:38:25
    visualize and see and make a change
  • 00:38:30
    based on what they see where you live
  • 00:38:32
    connection to the data sets and running
  • 00:38:35
    analytics on top of that using tableau
  • 00:38:38
    or are you guys using tableau data
  • 00:38:44
    extracts I mean this is actually I'll
  • 00:38:47
    defer this to my colleague Jason sitting
  • 00:38:50
    over here who works more with the
  • 00:38:51
    tableau stuff and
  • 00:38:52
    yeah I I think what you're getting at
  • 00:38:54
    you're talking about the data itself on
  • 00:38:57
    the quality of the data are we
  • 00:39:00
    visualizing that in Tableau is the
  • 00:39:02
    question the data itself not the metadata
  • 00:39:05
    how many notes you got and all the
  • 00:39:07
    performance metrics that she showed sure
  • 00:39:09
    really looking at the quality of the
  • 00:39:11
    data and something for a product owner
  • 00:39:13
    to make a decision on hey how do I look
  • 00:39:16
    as the process is going on using live
  • 00:39:18
    connections you're talking about a lot
  • 00:39:20
    of data and how do you stream all of
  • 00:39:22
    that a lot of data onto the live
  • 00:39:24
    connection and not affect your table
  • 00:39:26
    dashboards performance because you
  • 00:39:28
    really have to go back and look through
  • 00:39:30
    terabytes of data to be able to get a
  • 00:39:32
    visualization yeah what I would say
  • 00:39:34
    there is a you know the the data around
  • 00:39:37
    the quality checks that are happening is
  • 00:39:39
    surface not so much to be a tableau live
  • 00:39:43
    connection today but I think there's
  • 00:39:45
    real opportunity with this is actually
  • 00:39:47
    having some sort of aspect of your
  • 00:39:52
    dashboard that surfaces that information
  • 00:39:54
    so in a more kind of easy-to-understand
  • 00:39:56
    way having just an indicator on your
  • 00:39:59
    dashboard hey you know this dashboard
  • 00:40:00
    has some stale data it's being worked on
  • 00:40:03
    that's probably more realistic you know
  • 00:40:06
    thing that we would do is kind of
  • 00:40:08
    surface it ad on the dashboard rather
  • 00:40:10
    than you know the business user maybe
  • 00:40:12
    they won't have as much context to
  • 00:40:14
    understand the complexity of why the
  • 00:40:16
    data quality you know is not good so I
  • 00:40:19
    could see surfacing that kind of like
  • 00:40:21
    high level data point in a dashboard
  • 00:40:22
    okay
  • 00:40:24
    I think is a long line but perhaps yeah
  • 00:40:27
    why don't you swing around you can dig
  • 00:40:28
    it more and we'll stick around to guys
  • 00:40:30
    so if you don't have a chance to ask
  • 00:40:32
    your question yeah appreciate it thanks
  • 00:40:34
    I heard you um good question how do you
  • 00:40:38
    attribute viewing to the recommendation
  • 00:40:41
    engine versus anything that you can't
  • 00:40:44
    measure off like off platform like so
  • 00:40:46
    your traditional ads word of mouth like
  • 00:40:48
    someone has told me about Luke Cage and
  • 00:40:51
it got recommended to me so then I
  • 00:40:53
    watched it but my first discovery moment
  • 00:40:55
was really the word of mouth of my
  • 00:40:57
    friend that's a great question how do we
  • 00:40:59
    attribute
  • 00:41:01
    a view from like a search engine versus
  • 00:41:04
    a recommendation and the answer is I
  • 00:41:05
    don't know I'm an engineer so I don't
  • 00:41:07
    actually have that insight but that's a
  • 00:41:09
    good question thank you sorry hello my
  • 00:41:15
    question is about the data validation
  • 00:41:16
    step you are saying that you take a
  • 00:41:18
    normal distribution you say look we're
  • 00:41:20
    uploading normally 80 to 100 million
  • 00:41:22
    rows and so we kind of find this and
  • 00:41:24
    this is what we think we should be able
  • 00:41:25
    to see there are obviously lots of
  • 00:41:27
    events that will cause things to be
  • 00:41:29
outside of those normal bounds so I mean
  • 00:41:32
just say a catastrophic event like the
  • 00:41:32
    power outage that hit in 2003 and shut
  • 00:41:37
    down the eastern seaboard I'm gonna go
  • 00:41:39
    out on a limb and say viewership dropped
  • 00:41:41
    well outside of your normal distribution
  • 00:41:44
    if that happened today so my question is
  • 00:41:46
when you're doing this and you're
  • 00:41:47
    saying this data is not good and you've
  • 00:41:50
    set up this automated process that's
  • 00:41:52
alerting on that how do you then
  • 00:41:54
intervene to say well actually no there's
  • 00:41:55
    a reason for this or we've looked at it
  • 00:41:57
and we actually think this is a good
  • 00:41:58
    data packet and then push it back
  • 00:42:00
through great question so to
  • 00:42:03
    reiterate it would be when there is a
  • 00:42:06
    data quality issue or when we find that
  • 00:42:08
there's exceptions how do we
  • 00:42:11
    communicate and notify that that really
  • 00:42:14
wasn't a problem or that it really was
  • 00:42:16
an issue outside of this and we wanted
  • 00:42:18
    our job to continue is that right yeah
  • 00:42:20
    so the first thing is that I really
  • 00:42:23
    glossed over that whole piece a lot so
  • 00:42:25
    normal distribution is one of the things
  • 00:42:27
that we can do we also use
  • 00:42:28
    Poisson distributions we use anomaly
  • 00:42:30
    detection so there's other more
  • 00:42:31
sophisticated ways that we can use to
  • 00:42:33
    ensure that there's not problems when we
  • 00:42:34
    do find that there are major problems we
  • 00:42:36
are working right now on a learning
  • 00:42:39
    service that will allow people to
  • 00:42:41
    communicate you know here's this issue
  • 00:42:44
    I'm gonna acknowledge the issue I'm
  • 00:42:46
    going to annotate what the problem was
  • 00:42:47
    and then I'm gonna let my ETL move on so
  • 00:42:50
    we're working on that right now but at
  • 00:42:52
    the moment it would just be a matter of
  • 00:42:54
    going off and like manually releasing
  • 00:42:57
    the job to continue to move on to the
  • 00:42:58
    next step
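    A minimal sketch of the kind of audit and manual-release flow described above, assuming a simple z-score check of a daily row count against recent history; all names are hypothetical and this is not Netflix's actual service:

```python
# Hypothetical sketch: audit a daily row count against recent history and only
# let the ETL step continue if the check passes or a human has released it.
import statistics

def row_count_ok(history, todays_count, z_threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = abs(todays_count - mean) / stdev if stdev else 0.0
    return z <= z_threshold

def release_step(history, todays_count, acknowledged=False, annotation=None):
    """'continue' if the audit passes or the anomaly was acknowledged, else 'hold'."""
    if row_count_ok(history, todays_count):
        return "continue"
    if acknowledged:                 # manual release, e.g. a known real-world event
        print(f"Anomaly acknowledged: {annotation}")
        return "continue"
    return "hold"                    # alert the owning team and pause downstream jobs

# Normally ~80-100M rows; today only 40M arrived
recent = [93e6, 88e6, 97e6, 85e6, 91e6, 90e6, 95e6]
print(release_step(recent, 40e6))                                        # hold
print(release_step(recent, 40e6, acknowledged=True, annotation="regional outage"))
```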
  • 00:42:59
    does that power sit in one person or a
  • 00:43:02
    small group of people that sit down and
  • 00:43:04
    say yes we think this is good we've
  • 00:43:06
    discussed this really quickly in a
  • 00:43:07
    30-minute session we think the ETL can
  • 00:43:10
    continue or is it not that critical for
  • 00:43:12
    you guys to be that fast or how does
  • 00:43:15
    that decision actually get made
  • 00:43:17
I'm gonna defer to my colleague
  • 00:43:20
    yeah no worries
  • 00:43:20
    so I'd say it comes down to the data
  • 00:43:23
    point and the team that owns that data
  • 00:43:25
    it's not really a centralized decision
  • 00:43:27
per se oftentimes you know the
  • 00:43:30
    most relevant team or person would get
  • 00:43:32
    alerted usually it's not just one
  • 00:43:34
    individual usually it's a team and they
  • 00:43:37
    would dig in see if there is actually a
  • 00:43:39
data issue if there is they would you
  • 00:43:41
    know fix that if not they would release
  • 00:43:43
    things and hopefully improve the audit
  • 00:43:46
    so that that same type of anomaly
  • 00:43:48
doesn't trip the audit again time
  • 00:43:50
    frame for those types of decisions again
  • 00:43:53
    it can depend on the data set there are
  • 00:43:55
    some data sets that you know you want to
  • 00:43:58
    have up and running 24/7 and then there
  • 00:44:00
    are other data sets that you know run
  • 00:44:02
    once a day or once a week so it varies
  • 00:44:06
but I would say that you know
  • 00:44:08
    oftentimes that kind of like 24-hour
  • 00:44:11
    turnaround is a pretty normal kind of
  • 00:44:13
    time line cool thank you so much sure I
  • 00:44:18
have a question about the statistics that you
  • 00:44:21
    capture how did you determine what
  • 00:44:23
    metadata to capture for your data and
  • 00:44:25
part two if you find there's a new
  • 00:44:28
feature that you want to capture do you
  • 00:44:30
    ever have to go back through your
  • 00:44:32
    historical data and collect that good
  • 00:44:36
question so how do we collect
  • 00:44:38
    the statistics and then how do we
  • 00:44:42
    support evolution of those statistics
  • 00:44:44
    and backfill if necessary to answer your
  • 00:44:48
    question that statistic is collected
  • 00:44:49
it's collected as part of the
  • 00:44:51
    storage driver so whenever data is
  • 00:44:54
    written we are collecting that
  • 00:44:55
    information every time it's written from
  • 00:44:57
like spark or pig and we have the
  • 00:45:01
ability you know the whole
  • 00:45:03
    model here is collaboration so people
  • 00:45:06
    are welcome to create a brand new
  • 00:45:09
statistic actually we just had this
  • 00:45:11
    happen recently
  • 00:45:12
    and that statistic in this case it was
  • 00:45:15
    looking at approximate cardinality and
  • 00:45:17
    so that's something that not everybody
  • 00:45:19
    would want to turn on and so we can flag
  • 00:45:21
    it and say okay explicitly disable this
  • 00:45:24
    by default and explicitly enable this
  • 00:45:26
    for this one data set so people have
  • 00:45:27
that control where it's appropriate
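    A minimal sketch of write-time statistics collection with per-dataset flags, as described above; the statistic names, defaults, and storage hook are hypothetical:

```python
# Hypothetical sketch: collect table statistics when data is written, with
# per-dataset flags so expensive statistics stay off unless explicitly enabled.
DEFAULT_STATS = {"row_count": True, "null_count": True, "approx_cardinality": False}
DATASET_OVERRIDES = {"playback_events": {"approx_cardinality": True}}  # opt-in example

def stats_to_collect(dataset):
    flags = dict(DEFAULT_STATS)
    flags.update(DATASET_OVERRIDES.get(dataset, {}))
    return {name for name, enabled in flags.items() if enabled}

def on_write(dataset, rows):
    """Called by the storage layer whenever a batch of rows is written."""
    wanted = stats_to_collect(dataset)
    stats = {}
    if "row_count" in wanted:
        stats["row_count"] = len(rows)
    if "null_count" in wanted:
        stats["null_count"] = sum(v is None for row in rows for v in row.values())
    if "approx_cardinality" in wanted and rows:
        # exact distinct count here for simplicity; a real system would use
        # a probabilistic structure such as HyperLogLog
        stats["approx_cardinality"] = {col: len({row[col] for row in rows})
                                       for col in rows[0]}
    return stats  # persisted alongside the table/partition metadata

# Example usage
print(on_write("playback_events", [{"title": "Luke Cage", "ms": 1200},
                                   {"title": "Luke Cage", "ms": None}]))
```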
  • 00:45:29
and then we don't really worry about
  • 00:45:32
    backfilling of the statistics we have
  • 00:45:34
    logic that says okay if we don't have
  • 00:45:35
    enough data for this normal distribution
  • 00:45:38
    perhaps we should be using some other
  • 00:45:40
    detection method in the interim and once
  • 00:45:43
    we do then we'll switch over to a normal
  • 00:45:45
distribution or we'll just simply
  • 00:45:47
    invalidate that audit until we have
  • 00:45:49
enough data for the new statistic
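    A minimal sketch of the fallback logic just described — use a cruder interim check, or mark the audit as not yet valid, until a new statistic has enough history for a normal-distribution check; the thresholds are assumptions:

```python
# Hypothetical sketch: until a new statistic has enough history for a
# normal-distribution check, fall back to a cruder interim bound (or mark the
# audit as not yet valid) rather than backfilling old data.
import statistics

MIN_SAMPLES = 14  # assumption: roughly two weeks of history before trusting mean/stdev

def audit_value(history, value, z_threshold=3.0):
    if len(history) >= MIN_SAMPLES:
        mean, stdev = statistics.mean(history), statistics.stdev(history)
        ok = stdev == 0 or abs(value - mean) <= z_threshold * stdev
        return "pass" if ok else "fail"
    if history:
        # interim method: crude bound relative to the most recent observation
        return "pass" if 0.5 * history[-1] <= value <= 2.0 * history[-1] else "fail"
    return "not_enough_data"  # audit invalidated until the statistic accumulates history
```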
  • 00:45:52
but we wouldn't necessarily go back because it'd
  • 00:45:53
    be too expensive for us to do that means
  • 00:45:56
you have never done it you've never
  • 00:46:00
done the backfill or is it just not common
  • 00:46:01
we have never done the backfill
  • 00:46:03
    because it literally is part of like the
  • 00:46:04
    storage function and we would have to
  • 00:46:06
    either create something brand new
  • 00:46:08
    basically you'd have to go interrogate
  • 00:46:09
that data and for us that's very
  • 00:46:11
    very expensive so we want to touch the
  • 00:46:13
    data as little as we can which is why we
  • 00:46:14
    try to collect it whenever the data's
  • 00:46:16
    being written not that we couldn't come
  • 00:46:19
    up with a solution if we needed to but
  • 00:46:20
    we've never had a case where we really
  • 00:46:22
    needed to my question is regarding
  • 00:46:28
    tableau connectivity to interactive
  • 00:46:31
queries just to be specific like in
  • 00:46:35
order to do any complex
  • 00:46:36
    calculation in databases it's so tedious
  • 00:46:39
    like you want to calculate the standard
  • 00:46:40
deviation z-score for the last few
  • 00:46:42
    years based on the dynamic
  • 00:46:44
    parameterization you can't do it in
  • 00:46:46
    databases so easily so writing in a
  • 00:46:48
Python code or creating a web service
  • 00:46:50
    on top is so much easier but tableau
  • 00:46:53
    lacks a good way of connectivity to the
  • 00:46:56
Web API for live interactivity I mean
  • 00:47:00
do you guys try to I mean have you come
  • 00:47:03
    across these issues or are there any
  • 00:47:05
best implementation models for this
  • 00:47:08
    that I'm gonna defer
  • 00:47:13
    I'm sorry don't repeat the question
  • 00:47:15
    again I think I can answer sorry I when
  • 00:47:20
    it comes to the live connections that we
  • 00:47:23
    make use of I would say for tableau we
  • 00:47:26
    have primarily used tableau data
  • 00:47:29
    extracts in the past so for cases like
  • 00:47:31
    that where we need to go out and pull in
  • 00:47:34
    data from an external service and kind
  • 00:47:36
    of make that available we would usually
  • 00:47:38
bring it into our data warehouse and
  • 00:47:40
    materialize that into a table and then
  • 00:47:42
pull it into tableau
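    A minimal sketch of that pattern — snapshot an external service into a warehouse table and let Tableau extract from the table rather than hitting the API live; the URL, output path, and use of pandas/Parquet are illustrative assumptions, not Netflix's pipeline:

```python
# Hypothetical sketch: snapshot an external API into a warehouse table so that
# Tableau extracts read from the table instead of hitting the service live.
import json
import urllib.request

import pandas as pd

def materialize_api_snapshot(api_url, out_path):
    """Fetch a JSON payload and persist it as a columnar table for BI tools."""
    with urllib.request.urlopen(api_url) as resp:
        records = json.load(resp)
    df = pd.DataFrame(records)
    df.to_parquet(out_path)  # the table a scheduled Tableau extract refresh reads
    return df

# materialize_api_snapshot("https://example.internal/quality-metrics", "quality_metrics.parquet")
```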
  • 00:47:45
for something more like what you're talking about where you
  • 00:47:47
    have to go out and hit an external API
  • 00:47:48
    we actually use some custom
  • 00:47:51
    visualization solutions where we do
  • 00:47:53
    those types of things the Netflix person
  • 00:47:58
sorry I'll just add on to that I think
  • 00:47:59
actually the feature that they demoed this
  • 00:48:00
    morning the extensions API is sort of an
  • 00:48:03
    interesting idea like now that that's
  • 00:48:04
    available because we've built this as an
  • 00:48:06
    API driven service potentially we could
  • 00:48:08
    hook into the alert and pull through the
  • 00:48:10
    extension and place an annotation that
  • 00:48:12
    says this data that's building this
  • 00:48:15
    dashboard is suspect at the moment yes
  • 00:48:17
    there's some enhancements that maybe we
  • 00:48:19
can look at to bring in that
  • 00:48:20
    notification based on the alerting
  • 00:48:22
that's running in the ETL process
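    A minimal sketch of the idea just raised — since the alerting runs as an API-driven service, a dashboard extension (or any client) could poll an audit-status endpoint and place a "data is suspect" annotation; Flask and the route shape are assumptions for illustration only:

```python
# Hypothetical sketch: expose audit status from an API-driven alerting service
# so a dashboard extension (or any client) can show a "data is suspect" note.
from flask import Flask, jsonify

app = Flask(__name__)

# In practice this would be looked up from the auditing service, not a dict.
AUDIT_STATUS = {"daily_viewing_agg": {"suspect": True, "reason": "row count anomaly"}}

@app.route("/datasets/<name>/status")
def dataset_status(name):
    return jsonify(AUDIT_STATUS.get(name, {"suspect": False}))

# A dashboard extension could poll GET /datasets/daily_viewing_agg/status and
# place an annotation on the dashboard whenever "suspect" is true.
```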
  • 00:48:25
it's kind of out of the box but the
  • 00:48:27
    whole world is now looking into
  • 00:48:28
    evolution of the web services where
  • 00:48:30
Amazon is promoting them as a primary data
  • 00:48:33
    source for everything capturing your
  • 00:48:35
enterprise data layer as a service base
  • 00:48:37
    and a lot of BI tools are not you know
  • 00:48:40
good enough to talk to these APIs so do
  • 00:48:43
    you think of any intermediate solutions
  • 00:48:45
    like that could get to that connectivity
  • 00:48:48
or should we look at it as a separate
  • 00:48:50
use case and go as an extension real quick
  • 00:48:54
    I'm going to pause there's a really bad
  • 00:48:56
    echo so what I'm gonna do instead is
  • 00:48:57
    invite you guys all up to the front if
  • 00:48:59
    you have questions and there's several
  • 00:49:00
    of us so you'll get your question
  • 00:49:01
    answered faster but just the Netflix
  • 00:49:04
    Group is over here and if you don't have
  • 00:49:07
    time to stay and get your question
  • 00:49:08
    answered again feel free to reach out to
  • 00:49:09
    any of us during the conference thanks
  • 00:49:11
    for coming thank you
  • 00:49:13
    [Applause]
Tags
  • Netflix
  • Data Engineering
  • Analytics
  • Data Quality
  • ETL
  • Big Data
  • Data Warehousing
  • Data Visualization
  • Cloud Infrastructure
  • Content Strategy