ExpertTalks2017 - ABC of Distributed Data Processing by Piyush Verma

00:46:14
https://www.youtube.com/watch?v=tKuCoKDPPEQ

Summary

TL;DR: This talk explores the challenges and complexity of distributed data processing, defining when data is classified as big data, a term associated with immense volume, rapid velocity, and diverse varieties. It discusses the shift from small databases to large-scale systems that integrate multiple data sources, emphasizing operational (OLTP) and analytical (OLAP) processes in data management. The speaker delves into architectural components of data warehouses and highlights the significance of data consistency and integrity while handling large datasets. Challenges such as complex joins, scalability, and the need for efficient querying are addressed, alongside the distinctions between batch and streaming processing, detailing the advantages and caveats of each. Finally, the importance of utilizing tools and technologies to tackle issues of data ingestion, storage, and analysis in modern data landscapes is underlined.

Takeaways

  • Understanding when data is considered 'big data'
  • Difference between OLTP and OLAP systems
  • Challenges of traditional data processing
  • Importance of star schema architecture
  • Workloads: Descriptive, Predictive, and Prescriptive
  • Distinguishing between streaming and batch processing
  • Utilizing bloom filters for data deduplication
  • Data cubes for efficient data retrieval
  • Handling out-of-order processing
  • Comparison of Spark and Flink in streaming contexts

Timeline

  • 00:00:00 - 00:05:00

    The discussion begins with the concept of distributed data processing, emphasizing the importance of understanding when data becomes 'big data'. A small business scenario illustrates how simple data needs can evolve as businesses grow, leading to complexities in data management.

  • 00:05:00 - 00:10:00

    As businesses scale, data sources multiply and the volume of transactions increases. The speaker highlights the limitations of traditional databases in handling vast data amounts and the need for different work types, specifically operational and analytical data processing systems.

  • 00:10:00 - 00:15:00

    'Big Data' is defined using three characteristics: volume, velocity, and variety. High volume datasets, such as those generated by companies like Facebook and the airline industry, showcase the immense scale of data produced and the challenges it creates in processing and analyzing such datasets.

  • 00:15:00 - 00:20:00

    Velocity describes the pace of incoming data, particularly within the telecom and banking sectors, where real-time analysis is necessary for operations like call quality assessment and fraud detection. This characteristic emphasizes the need for more advanced data handling compared to traditional systems.

  • 00:20:00 - 00:25:00

    Variety indicates the different forms of data encountered, such as structured, semi-structured, and unstructured formats. This range necessitates careful storage and processing strategies, as businesses aim to retain all generated data for future analysis of unknown needs.

  • 00:25:00 - 00:30:00

    The significance of data engineering is highlighted, where workloads can be categorized into descriptive, predictive, and prescriptive analysis. Each type serves a distinct purpose, influencing business decisions and future strategies based on historical data patterns.

  • 00:30:00 - 00:35:00

    Challenges in traditional data architecture are discussed, including vertical scaling and complex data relationships. The speaker notes that as data grows, systems must efficiently manage data storage while minimizing costs and ensuring performance, particularly through complex join operations in databases.

  • 00:35:00 - 00:40:00

    Anatomy of data warehouse systems is examined from hardware aspects to file systems, storage formats, and processing frameworks. Tools for distributed processing, efficient querying, and data access layers are emphasized as key components in building robust data architectures.

  • 00:40:00 - 00:46:14

    Clarification of concepts such as snowflake and star schemas illustrates how data is organized for optimal querying performance in data warehouses. This leads into discussing deduplication, handling of evolving dimensions, and the challenges of maintaining data integrity across various points of sale during integration.



Video Q&A

  • What defines big data?

    Big data is defined by its volume, velocity, and variety. When data surpasses traditional processing capabilities, it becomes big data.

  • What is the difference between OLAP and OLTP?

    OLTP (Online Transaction Processing) focuses on managing transaction-oriented applications, whereas OLAP (Online Analytical Processing) is geared towards data analysis and complex queries.

  • What are the challenges of traditional data processing systems?

    Traditional systems face challenges like scalability, complex joins, and inefficiencies in handling large datasets.

  • What is a star schema?

    A star schema is a type of database schema that allows for efficient data retrieval, where a central fact table is connected to dimension tables.

  • What are the different types of workloads in data processing?

    The three main workloads are descriptive, predictive, and prescriptive analyses.

  • How does streaming differ from batch processing?

    Streaming processes data in real-time, while batch processing deals with large volumes of data at once, often retrospectively.

  • What is a bloom filter?

    A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.

  • What is the purpose of data cubes in data warehouses?

    Data cubes are used to optimize retrieval of multi-dimensional data for analysis purposes (see the sketch after this list).

  • How can out-of-order processing be managed?

    Out-of-order processing can be handled by defining a time window for arrival and only processing data within that window.

  • What is the difference between Spark and Flink for streaming?

    Spark uses micro-batching for streaming, while Flink is designed for true streaming applications.
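
Following up on the data-cube answer above: a minimal Python sketch (not from the talk) of pre-aggregating a fact table into a small cube, assuming pandas is available and using invented store/product/month columns.

```python
import pandas as pd

facts = pd.DataFrame({
    "store":   ["S1", "S1", "S2", "S2", "S1"],
    "product": ["milk", "bread", "milk", "milk", "bread"],
    "month":   ["2017-01", "2017-01", "2017-01", "2017-02", "2017-02"],
    "amount":  [120.0, 40.0, 90.0, 110.0, 55.0],
})

# The "cube": one pre-aggregated row per (store, product, month) combination,
# computed once so later queries avoid scanning the raw fact table.
cube = facts.groupby(["store", "product", "month"], as_index=False)["amount"].sum()

# A typical analytical question becomes a cheap lookup on the cube,
# e.g. milk revenue per month across all stores.
print(cube[cube["product"] == "milk"].groupby("month")["amount"].sum())
```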

Subtitles
  • 00:00:04
    yeah so we are talking about distributed
  • 00:00:08
    data processing so how many of you deal
  • 00:00:10
    with data or big data like engineering
  • 00:00:13
    data in any manner mine analyze etcetera
  • 00:00:16
    alright cool so the most common
  • 00:00:20
    questions that I get asked around data
  • 00:00:22
    is like when does my data become big
  • 00:00:25
    data I think this must be a common side
  • 00:00:28
    that you've seen at the back of the
  • 00:00:29
    trucks so I hope you guys can see that
  • 00:00:31
    back or you know like we show off data
  • 00:00:36
    my data is bigger than the audit so well
  • 00:00:40
    it's a common thing so let's start with
  • 00:00:43
    how did we reach a point of it being a
  • 00:00:46
    big data or data being so massive where
  • 00:00:48
    we have to even talk about it
  • 00:00:50
    they come in sis data why do you need to
  • 00:00:51
    talk about it so let's start with a
  • 00:00:55
    small business you have a point-of-sale
  • 00:00:56
    system which is connected to it by
  • 00:01:00
    itself yeah it's connected to a small
  • 00:01:01
    MySQL database you are selling stuff
  • 00:01:04
    you so the problem that we are trying to
  • 00:01:06
    solve here is let's say we have to
  • 00:01:07
    identify the most profitable products in
  • 00:01:11
    form of the least shelf-life so the
  • 00:01:15
    product that stays for the least amount
  • 00:01:17
    of time on the shelf before it is
  • 00:01:19
    procured like after it's procured and before
  • 00:01:21
    it is sold so fairly easy to solve in this
  • 00:01:24
    setup like you just have a small DB just
  • 00:01:26
    connect to it and figure out your
  • 00:01:27
    answers you go big in your business now
  • 00:01:30
    you have multiple point-of-sale systems
  • 00:01:32
    in the same departmental store still
  • 00:01:34
    connected to the same database we have
  • 00:01:37
    problems of failover recovery etc all
  • 00:01:38
    standard problems we go one step bigger
  • 00:01:41
    now you attach it to an ERP solution
  • 00:01:44
    where the inventory coming in you're
  • 00:01:46
    tied to that as well because you realize
  • 00:01:48
    that calculating profitability is not
  • 00:01:51
    just computing that point of procurement
  • 00:01:53
    versus point of sale you also need to
  • 00:01:54
    take into account taxations
  • 00:01:56
    or what it meant to get that product on
  • 00:02:00
    that shelf out there because the
  • 00:02:02
    standard answer that you would find from
  • 00:02:04
    this is that the dairy products or the
  • 00:02:06
    poultry products are the most profitable
  • 00:02:07
    products but apparently they are not
  • 00:02:09
    because there's at least shelf life so
  • 00:02:11
    there could be something more now
  • 00:02:13
    imagine this problem over entire
  • 00:02:16
    like you are now at the scale of a huge
  • 00:02:19
    departmental store you have headquarters
  • 00:02:21
    located somewhere in the middle of
  • 00:02:22
    nowhere and you have sales point
  • 00:02:24
    everywhere now if this information was
  • 00:02:27
    still to be computed would you be able
  • 00:02:29
    to do it at the same pace by just
  • 00:02:30
    connected to a MySQL database now
  • 00:02:32
    millions of transactions are happening
  • 00:02:34
    every day probably every second as well
  • 00:02:36
    there are multiple points of interaction
  • 00:02:38
    of sales and purchases you could be
  • 00:02:40
    buying via Amazon you could be buying
  • 00:02:42
    over phone or email or whatever now all
  • 00:02:45
    of this has to be ingested and you need
  • 00:02:47
    to crunch stock now
  • 00:02:49
    absolutely not feasible in the standard
  • 00:02:52
    ways now this is how we learn into
  • 00:02:54
    problem that no what do we do about it
  • 00:02:56
    so one clear distinction that we'll try
  • 00:02:59
    to make here is type of workloads that
  • 00:03:02
    we usually have any dealing with data
  • 00:03:04
    problems well one our operation and the
  • 00:03:08
    other is analytical so that's the
  • 00:03:10
    standard OLTP model that we always used
  • 00:03:12
    to have and on the right is the OLAP model
  • 00:03:14
    so most of the OLTP stuff is categorized
  • 00:03:18
    as online now by online I don't mean web
  • 00:03:21
    online online means that you make a
  • 00:03:23
    purchase an entry is made whatever
  • 00:03:25
    logistical checks that you have to do
  • 00:03:27
    around the data is it sanitized is it
  • 00:03:30
    forming enough locks you don't wanna
  • 00:03:31
    sell items more than what you have all
  • 00:03:34
    that is tackled by the OLTP system
  • 00:03:36
    and the stuff that we're just talking
  • 00:03:38
    about at the back now analyzing and
  • 00:03:40
    crunching is my business making money is
  • 00:03:42
    my business not making money which are
  • 00:03:44
    the product that I should invest in or
  • 00:03:45
    should not invest in is usually done
  • 00:03:47
    sitting at a pile of data and it's done
  • 00:03:50
    later so you can clearly call it offline
  • 00:03:52
    access so these are the typical
  • 00:03:55
    workloads or how I really like to split it
  • 00:03:57
    out and as we go further we will
  • 00:03:59
    be actually talking more about the
  • 00:04:00
    offline systems and not the online
  • 00:04:02
    systems because you know I'm assuming
  • 00:04:04
    that most of you have already mastered
  • 00:04:06
    those things I assume so when do we call
  • 00:04:12
    our data to be big enough now this is
  • 00:04:16
    not my terminology it's a standard
  • 00:04:18
    terminology which is used in the data
  • 00:04:19
    engineering textbooks as well now and a
  • 00:04:23
    lot of people who pioneered the same
  • 00:04:25
    came up with this information so we
  • 00:04:27
    define it as three
  • 00:04:30
    one is volume now at any point of time you
  • 00:04:35
    have a problem if you have two of these
  • 00:04:38
    three anyone in isolation is easy to
  • 00:04:41
    address and solve using traditional ways
  • 00:04:43
    but if you have any two of these you
  • 00:04:45
    usually would look beyond your
  • 00:04:47
    traditional systems of what you have
  • 00:04:48
    we'll come back we'll later have a look
  • 00:04:51
    at what traditional systems look like
  • 00:04:52
    and what challenges do they run into now
  • 00:04:55
    let's talk about volume of data for a
  • 00:04:56
    moment now facebook produces 500
  • 00:04:58
    terabytes of data a day stats from
  • 00:05:01
    facebook I'm not making them up and with
  • 00:05:04
    that personalized data you need to
  • 00:05:05
    look into where the advertisements have
  • 00:05:07
    to go now it's not just these online
  • 00:05:11
    businesses like every time you sit on a
  • 00:05:12
    Boeing flight which goes from the west
  • 00:05:14
    coast to the east coast or vice versa
  • 00:05:16
    one flight produces 240 terabytes of
  • 00:05:20
    data like and like really it's just too
  • 00:05:24
    much huge this is just too huge a data
  • 00:05:26
    to process and that's not one flight
  • 00:05:28
    then 18,000 and 800 flights going across
  • 00:05:31
    west coast and east coast every day and
  • 00:05:33
    you can imagine the amount of data
  • 00:05:34
    they're producing so and we've reached
  • 00:05:37
    this point because we have this tendency
  • 00:05:39
    of you know having data so we will save
  • 00:05:41
    it like might as well save it we are
  • 00:05:43
    producing it so yeah on the right we
  • 00:05:48
    have velocity of data velocity of data
  • 00:05:50
    is when your data is coming in at a pace
  • 00:05:54
    where you really cannot process it the
  • 00:05:57
    standard tools that you used to one of
  • 00:06:00
    the biggest consumers of velocity of
  • 00:06:02
    data is telecom industry now every time
  • 00:06:05
    you are making that voice call or a
  • 00:06:07
    cellular call the what happens is that
  • 00:06:10
    these telecom operators are processing
  • 00:06:12
    your call drops quality of voice
  • 00:06:15
    analyzing frequency see where the drop
  • 00:06:17
    is where the quality is going bad users
  • 00:06:20
    on the fly are going to get shuffled as
  • 00:06:22
    well now this is done also taken into
  • 00:06:24
    account previous considerations of a
  • 00:06:26
    zone someone don't like my presentation
  • 00:06:30
    right so you have so these are being
  • 00:06:35
    analyzed at a rate of around 220 million
  • 00:06:38
    calls per minute now this is by no means a
  • 00:06:42
    small feat you know like this
  • 00:06:43
    you have to ingest the data which is
  • 00:06:45
    beyond your traditional computation
  • 00:06:48
    boundaries it's also used in a lot of
  • 00:06:50
    other companies as well fraud detection every
  • 00:06:52
    time you make a transaction you get a
  • 00:06:54
    call from HDFC bank saying that did you
  • 00:06:56
    make this transaction well someone is
  • 00:06:58
    analyzing it someone is processing it at
  • 00:07:00
    least you feel good about the fact that
  • 00:07:01
    someone is taking care then there is a
  • 00:07:05
    break in like someone tries to access
  • 00:07:07
    your account you get a notification from
  • 00:07:09
    Google that you know someone's trying to
  • 00:07:11
    log in with an IP address which doesn't
  • 00:07:13
    really look like yours is it that really
  • 00:07:15
    you or do you want to change all your
  • 00:07:16
    passwords prediction rewards credit all
  • 00:07:20
    of the financial industry largely works
  • 00:07:23
    around these variety of data well as we
  • 00:07:25
    are going forward we do not have any
  • 00:07:27
    single source of data consumer producers
  • 00:07:30
    there are logs unstructured data semi
  • 00:07:33
    structured data structured data
  • 00:07:35
    audio/video this data coming in every
  • 00:07:37
    format as I said like you know we have
  • 00:07:39
    this tendency we're in age where we want
  • 00:07:41
    to produce it and we'll say that would
  • 00:07:43
    bother about it later we'll look and
  • 00:07:44
    this largely comes from one of the
  • 00:07:46
    previous talks I think someone named
  • 00:07:48
    mister lally that's right and he was
  • 00:07:51
    making a point about that there are
  • 00:07:54
    unknowns which you do not know about so
  • 00:07:56
    you save data for what you want to
  • 00:07:59
    analyze but what about when you don't know
  • 00:08:01
    what you want to analyze someone had asked that
  • 00:08:03
    question from the audience well it
  • 00:08:06
    should have been in my talk but anyway
  • 00:08:07
    it has already been asked so when you
  • 00:08:10
    have that you rather save everything
  • 00:08:12
    and later bother about okay this is what
  • 00:08:14
    I need to pull out and this what I need
  • 00:08:16
    to look at for that the super set has to
  • 00:08:17
    be really huge so anyway pick two of
  • 00:08:20
    these three if you have them you really
  • 00:08:22
    have a problem and you should attend the
  • 00:08:23
    rest of this talk and otherwise as well
  • 00:08:26
    this day so why bother with data
  • 00:08:28
    engineering like why do we analyze data
  • 00:08:30
    to begin with most of it is covered in
  • 00:08:32
    the previous talk but I still quickly go
  • 00:08:34
    through it I didn't have time to change
  • 00:08:36
    the slides on the fly there three kind
  • 00:08:39
    of workloads that you usually have in
  • 00:08:40
    analytical processing one of them is
  • 00:08:42
    descriptive descriptive when you are
  • 00:08:45
    looking at traditional data you want to
  • 00:08:47
    see that how did I behave in the past
  • 00:08:49
    one year five years ten years twenty
  • 00:08:51
    years or as long as your business has
  • 00:08:53
    been around and you want to generate
  • 00:08:55
    behavior laid out
  • 00:08:57
    and you wanna see that okay you make
  • 00:08:59
    pretty graphs out of it at the end of it
  • 00:09:01
    nothing else then there's predictive
  • 00:09:03
    analysis predictive analysis is using
  • 00:09:05
    the descriptive data you use the same
  • 00:09:07
    patterns and now you want to superimpose
  • 00:09:10
    data and see that okay can I make a
  • 00:09:12
    prediction on what is it going to look
  • 00:09:13
    like they can I make a guess that maybe
  • 00:09:15
    this brand of toothpaste is going to
  • 00:09:17
    sell more in this particular store and
  • 00:09:19
    not the other one so coming back to the
  • 00:09:20
    original problem of a departmental store
  • 00:09:22
    now you are trying to analyze that maybe
  • 00:09:24
    I don't even need to ship poultry in the
  • 00:09:26
    times of let's say in september/october
  • 00:09:29
    people people don't buy it so this can
  • 00:09:32
    help you make good guesses and at the
  • 00:09:34
    end of the day you can make more money out
  • 00:09:36
    of it because you are taking the right
  • 00:09:37
    decisions in your business and then this
  • 00:09:39
    prescriptive prescriptive is absolutely
  • 00:09:41
    guesswork I don't know how people do
  • 00:09:43
    that but well they do so as we move from
  • 00:09:47
    left to the right what you have to make
  • 00:09:48
    sure is that this is deterministic
  • 00:09:50
    behavior and you're moving towards
  • 00:09:51
    probabilistic mathematics there now that
  • 00:09:53
    is like purely into the game of machine
  • 00:09:55
    learning I mean of course there as well
  • 00:09:57
    but now you're really making guess works
  • 00:09:59
    based on a lot of in like big
  • 00:10:01
    polynomials that you really won't be
  • 00:10:03
    able to solve on a piece of paper just a
  • 00:10:07
    quick walk through descriptive is mostly
  • 00:10:09
    done on historical deterministic data
  • 00:10:11
    it's sort of inferential in manner whatever
  • 00:10:14
    managers usually make these pretty
  • 00:10:16
    graphs out of it a friend of mine makes
  • 00:10:18
    a lot of money just doing this so yeah
  • 00:10:21
    predictive predictive is future it's
  • 00:10:24
    probabilistic in nature it's based on
  • 00:10:27
    descriptive and you try to take that
  • 00:10:29
    information and try to guess what's
  • 00:10:31
    going to happen later prescriptive is
  • 00:10:33
    well Google assist and all kinds of
  • 00:10:35
    things most common example that I could
  • 00:10:37
    think of so now let's try to solve that
  • 00:10:39
    architecture as round 1 as we have been
  • 00:10:41
    doing mostly so all along let's say
  • 00:10:46
    these are the sources of information
  • 00:10:47
    that are coming in there is this
  • 00:10:49
    direction happening there are events
  • 00:10:51
    happening the semi structured data
  • 00:10:52
    happening somewhere and the sources of
  • 00:10:55
    information are kind of like 1 2 & 3 and
  • 00:10:58
    we're not trying to save them in some
  • 00:11:00
    storage one of the storage choices now
  • 00:11:03
    if there's a talk there has to be
  • 00:11:05
    in a big data conversation that's why I
  • 00:11:06
    put it there so there's my sink well you
  • 00:11:10
    would say that okay my
  • 00:11:11
    put it to elasticsearch cluster and I
  • 00:11:13
    searched it the other day a friend of
  • 00:11:15
    mine was asking me like what did
  • 00:11:17
    traditional systems look like the big
  • 00:11:18
    data system like everything had ended at
  • 00:11:20
    elasticsearch was big data so now you
  • 00:11:23
    have Elasticsearch added and you try
  • 00:11:25
    to build a small query engine around it
  • 00:11:26
    where you are able to map and relate the
  • 00:11:29
    data in all different sources storage
  • 00:11:32
    choice - you say that ok you know I
  • 00:11:34
    don't really care about all these
  • 00:11:35
    different choice of data sources one
  • 00:11:37
    database is good enough so I'll take my
  • 00:11:39
    existing solution like MySQL or Postgres
  • 00:11:42
    because they scale well up to say 6
  • 00:11:44
    terabytes of data I have read that in
  • 00:11:46
    documentation it works well I've seen it
  • 00:11:47
    somewhere as well so let's dump
  • 00:11:49
    everything into that database and will
  • 00:11:51
    later bother about it now over here
  • 00:11:53
    we're putting everything into that
  • 00:11:54
    simple thing and we call it a sort of a
  • 00:11:57
    data warehouse now where does it start
  • 00:12:01
    breaking apart oh you don't have the
  • 00:12:04
    latest version of the slide but then
  • 00:12:06
    that's ok so I'll come back to the slide
  • 00:12:08
    later
  • 00:12:08
    so challenges what are the challenges
  • 00:12:11
    that we face with the regular
  • 00:12:13
    architecture that we have well one is
  • 00:12:15
    vertical scaling obviously now you are
  • 00:12:18
    pumping in more and more data and now
  • 00:12:21
    this way you have to make an interesting
  • 00:12:22
    point if you do not categorize the
  • 00:12:24
    sources of your data and put everything
  • 00:12:28
    into the same bucket what you are
  • 00:12:30
    effectively doing is you're paying for
  • 00:12:33
    something that you want to access at low
  • 00:12:35
    latency and while you're putting
  • 00:12:37
    everything aside as the system is going
  • 00:12:39
    to grow your costs are going to grow as
  • 00:12:41
    well like just a comparative study like
  • 00:12:43
    for let's say a gigabyte of data you pay
  • 00:12:46
    around $10 on a hosted system now
  • 00:12:49
    if you were to put everything into that
  • 00:12:51
    system when it becomes 2 terabyte your
  • 00:12:53
    cost is linear its multiplied by that
  • 00:12:55
    not all data is of the same nature so
  • 00:12:59
    you have to categorize selectively
  • 00:13:01
    saying that ok this is something that I
  • 00:13:03
    need now this is something that I need
  • 00:13:05
    later
  • 00:13:06
    this is something that I need more often
  • 00:13:07
    than not while the other stuff well it's
  • 00:13:09
    ok if I can get it hard like say every
  • 00:13:12
    hour or every one day as well sometimes
  • 00:13:15
    that works as well now so vertical
  • 00:13:18
    scaling off a particular single database
  • 00:13:20
    installation is going to be a new
  • 00:13:21
    challenge archival
  • 00:13:24
    because not all data is going to be
  • 00:13:27
    relevant at all points of time so you
  • 00:13:29
    want to store them in manners which is
  • 00:13:31
    retrievable later albeit a little slowly
  • 00:13:33
    with a bit of effort but that's ok
  • 00:13:36
    garbage collection while we are pulling in
  • 00:13:41
    data from left right center every
  • 00:13:43
    corner that we can most of it is
  • 00:13:46
    not going to be relevant after we have
  • 00:13:48
    reached a majority of an analytical
  • 00:13:50
    system where we say that ok these are
  • 00:13:51
    the things that we really do not care
  • 00:13:52
    about ever in the lifecycle of my entire
  • 00:13:55
    analytical processing now this seems to
  • 00:13:57
    be taking a lot of space or maybe it is
  • 00:14:00
    redundant because too many streams are
  • 00:14:03
    sending in the same data let's start
  • 00:14:05
    purging it and has anyone ever tried
  • 00:14:09
    doing a delete operation on a large
  • 00:14:12
    MySQL table or an alter table you
  • 00:14:16
    guys can relate with this problem in
  • 00:14:17
    that case complex joins
  • 00:14:21
    you have left joins right joins and
  • 00:14:25
    outer left joins outer right joins
  • 00:14:27
    other joins and as you grow you know
  • 00:14:30
    they get really complex to tackle in
  • 00:14:31
    your relational sense well
  • 00:14:34
    relationships will get complex as well over a
  • 00:14:36
    period of time and your queries become
  • 00:14:39
    more and more complex and long now
  • 00:14:41
    coming back to the old slide and the
  • 00:14:43
    group by operations now most of the
  • 00:14:45
    traditional systems start failing at a
  • 00:14:48
    group by operation now if you aren't
  • 00:14:50
    using your indexes well then you
  • 00:14:52
    probably would be doing a full table
  • 00:14:54
    scan operation imagine a full table scan
  • 00:14:56
    operation across billions of rows which
  • 00:14:58
    probably spans across multiple discs as
  • 00:15:02
    well or maybe if you have done a
  • 00:15:03
    partition or sharding then across the
  • 00:15:05
    shards as well each of your cluster and
  • 00:15:07
    the shard is actually going to be bogged
  • 00:15:09
    down with one single query now you have
  • 00:15:11
    to join them on a single driver and that
  • 00:15:13
    driver goes out of memory as well so a
  • 00:15:15
    group by or a joint operation is going
  • 00:15:17
    to kill your existing system so how do
  • 00:15:23
    we solve these problems but before we
  • 00:15:25
    solve this problem I would actually like
  • 00:15:27
    to take a step back on anatomy of what a
  • 00:15:29
    data warehouse system looks like or the
  • 00:15:31
    layers at which it operates so the
  • 00:15:36
    lowest layer that I usually categorize
  • 00:15:38
    it
  • 00:15:38
    processing network and storage layer
  • 00:15:40
    this is the underlying hardware now this
  • 00:15:43
    is the hardware which you pick for and
  • 00:15:46
    you obviously each layer works
  • 00:15:48
    independently of the other layer well
  • 00:15:52
    this is my own work so if any one of you
  • 00:15:54
    do not agree with this please feel free
  • 00:15:55
    to talk to me about it
  • 00:15:57
    I would like to improve this so if you
  • 00:16:00
    look at each layer each layer is
  • 00:16:02
    independent of what the underlying layer
  • 00:16:03
    is about or what the consumer layer is
  • 00:16:05
    going to look like now as you add more
  • 00:16:08
    and more computation you obviously need
  • 00:16:10
    more and more servers you obviously need
  • 00:16:12
    more and more drives because storage is
  • 00:16:13
    going to increase as well you want to
  • 00:16:15
    make those switches and hubs and routers
  • 00:16:17
    faster as well for faster
  • 00:16:19
    network because it is going to move
  • 00:16:20
    across nodes so this is the lowest
  • 00:16:23
    Hardware layer and this is the challenge
  • 00:16:26
    where infrastructure and operations
  • 00:16:28
    usually come around and this is where
  • 00:16:30
    you need cluster managers like if you
  • 00:16:32
    have heard names like YARN okay now from
  • 00:16:35
    this point onward I'll start putting and
  • 00:16:37
    throwing in some popular names of tools
  • 00:16:39
    and technologies that exists out there
  • 00:16:41
    just so that you can map them to what
  • 00:16:43
    exists and what can be associated where
  • 00:16:45
    so the next time you hear a new tool and
  • 00:16:47
    saying that ok how does this fit like
  • 00:16:49
    I've heard of spark where does flink fit
  • 00:16:51
    you can associate with ok what layer is
  • 00:16:53
    it supposed to operate in and you can
  • 00:16:55
    take a more conscious and educated guess
  • 00:16:57
    on ok do I really need this technology
  • 00:16:59
    or does it solve my problem like what
  • 00:17:01
    exactly is my problem
  • 00:17:02
    so is this my problem then I need to
  • 00:17:06
    look at a cluster manager something
  • 00:17:07
    which can help me bring up or bring down
  • 00:17:11
    nodes disks network on demand or
  • 00:17:15
    whenever I need it
  • 00:17:17
    or wherever the system needs it a layer
  • 00:17:19
    above that is file systems file systems
  • 00:17:22
    because now remember we are talking
  • 00:17:24
    about distributed systems not a one
  • 00:17:26
    single system what that means is that we
  • 00:17:29
    need file systems which are aware of
  • 00:17:30
    where is my data going to be saved
  • 00:17:33
    now is this going to be saved on node
  • 00:17:35
    one or node two or node three now this
  • 00:17:37
    distribution of events is where a file
  • 00:17:39
    system like HDFS or CephFS or NFS and
  • 00:17:43
    those kind of systems will come into the
  • 00:17:45
    picture now the difference there would
  • 00:17:46
    be that HDFS is mostly in user space
  • 00:17:48
    whereas
  • 00:17:52
    just telling the password
  • 00:18:01
    right now computers a problem sir
  • 00:18:09
    I'll remember to move the swipe my
  • 00:18:11
    finger on that thing yeah so where was I
  • 00:18:16
    yeah CephFS so the level at which the
  • 00:18:20
    file system operates might change but
  • 00:18:22
    effectively all of them will be the
  • 00:18:24
    exact same problem now we go one level
  • 00:18:26
    above that to the storage format now data
  • 00:18:30
    has to be saved in a file system now
  • 00:18:32
    this is where you will make a conscious
  • 00:18:33
    choice of what do I save it as because
  • 00:18:35
    you want data to be accessible faster
  • 00:18:39
    you wanna compress the data because at
  • 00:18:41
    rest you would wanted that okay maybe
  • 00:18:44
    I'll just inflate it back when I receive
  • 00:18:46
    it so compression decompression all
  • 00:18:48
    tackled into it or you might also want
  • 00:18:51
    to look how many of you are aware ever
  • 00:18:52
    of aware of column storages okay who
  • 00:18:57
    doesn't know column storage and would like
  • 00:18:58
    me to quickly cover that is there
  • 00:19:00
    anybody please raise your hand okay the
  • 00:19:02
    ample hands all right now so I'll just
  • 00:19:05
    quickly cover this up let's say
  • 00:19:07
    traditionally we save data in okay that
  • 00:19:12
    one going off so you save data as rows
  • 00:19:15
    like each data is a row a row with the
  • 00:19:18
    schema now you have let's say you're
  • 00:19:20
    saving an information of inventory you
  • 00:19:22
    would have an item item code price label
  • 00:19:24
    brand etc all of these would be saved in
  • 00:19:27
    per row basis now one effective store
  • 00:19:32
    that you can also uses is by inverting
  • 00:19:34
    the storage so what you can do is store
  • 00:19:38
    in columns so if one column is called
  • 00:19:41
    item code the entire storage is going to
  • 00:19:44
    be around an ID and a value
  • 00:19:48
    so in a large system like this a column
  • 00:19:51
    storage makes more sense at times
  • 00:19:54
    again it's not a general rule of thumb
  • 00:19:55
    because it depends on your requirement
  • 00:19:57
    but why it would make sense is because
  • 00:19:59
    column storage allows you to be flexible
  • 00:20:02
    on the schema because in a row based
  • 00:20:04
    system if you had placed let's say if
  • 00:20:06
    you're saving this all in a traditional
  • 00:20:08
    table of a my sequel and a new column
  • 00:20:11
    was added later you would have to run an
  • 00:20:13
    alter table or add a new column to it in
  • 00:20:17
    the new schema and again imagine running
  • 00:20:20
    this on a my sequel table which has
  • 00:20:22
    around terabytes of data
  • 00:20:23
    your system will actually choke it's
  • 00:20:24
    going to lock it down and you won't be
  • 00:20:26
    able to perform any other operation
  • 00:20:27
    around on it now an alternate way would
  • 00:20:30
    be if we were using a column based
  • 00:20:33
    storage where the storage has been
  • 00:20:35
    inverted in form of that new column will
  • 00:20:38
    now only have the new IDs and the values
  • 00:20:40
    according to it the older storage
  • 00:20:42
    doesn't have to be altered
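
A tiny illustration (not from the talk) of the row-versus-column layout just described, using plain Python structures with made-up inventory fields; adding a new attribute only touches the new column.

```python
# Row layout: every record carries every field.
rows = [
    {"item_code": "A1", "price": 40, "brand": "Acme"},
    {"item_code": "B7", "price": 25, "brand": "Zenith"},
]

# Column layout: one array per field, positions line up across arrays.
columns = {
    "item_code": ["A1", "B7"],
    "price":     [40, 25],
    "brand":     ["Acme", "Zenith"],
}

# Adding a new attribute later touches nothing that already exists:
# just add one more column array (old rows simply have no value yet).
columns["discount"] = [None, None]

# Scanning a single attribute reads one contiguous array instead of
# every whole row, which is why analytical scans favour columnar layouts.
print(sum(columns["price"]) / len(columns["price"]))
```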
  • 00:20:45
    so if that quickly explains what column storage is
  • 00:20:47
    I'll be more than happy to answer your
  • 00:20:48
    questions in detail so please catch me
  • 00:20:50
    outside or any one of my team members
  • 00:20:51
    who are around here now so there are
  • 00:20:55
    multiple storage formats which exists in
  • 00:20:57
    form of HFile for a system like HBase
  • 00:21:00
    or Parquet or Avro and all these are
  • 00:21:04
    storage formats that have been optimized
  • 00:21:06
    for such larger workloads
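
As a concrete example of a columnar, compressed storage format, here is a minimal sketch (not from the talk) assuming pandas with a Parquet engine such as pyarrow installed; the file name and columns are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "item_code": ["A1", "B7"],
    "price":     [40, 25],
    "brand":     ["Acme", "Zenith"],
})

# Columnar layout plus compression at rest; inflated again on read.
df.to_parquet("inventory.parquet", compression="snappy")
print(pd.read_parquet("inventory.parquet", columns=["price"]))
```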
  • 00:21:11
    now over and above this you need a distributed
  • 00:21:13
    processing framework as well because
  • 00:21:16
    every time I'm supposed to ask a
  • 00:21:18
    question that to the to the data system
  • 00:21:20
    that tell me how many people were buying
  • 00:21:23
    this product around Diwali or around
  • 00:21:26
    Christmas or around New Year or around
  • 00:21:28
    the 15th whenever now this question would
  • 00:21:32
    probably span across multiple nodes and
  • 00:21:36
    across the data would be stored in any
  • 00:21:38
    one of these components down there you
  • 00:21:41
    need something which can massively
  • 00:21:44
    parallelize this operation you really don't
  • 00:21:46
    have to carry this operation as one
  • 00:21:49
    single operation so this sort of goes
  • 00:21:53
    back into roots of functional
  • 00:21:54
    programming and what you are basically
  • 00:21:56
    trying to do is divide an operation into
  • 00:21:59
    smaller units and spread it around and
  • 00:22:03
    an aggregation of that would give you
  • 00:22:05
    the result it's like saying if you have
  • 00:22:07
    to multiply 48 by 62 so what you do
  • 00:22:12
    is this 62 into 48 well you know what 62 into
  • 00:22:16
    one is you know what 62 into two is you
  • 00:22:18
    know what now once you have computed
  • 00:22:20
    two times 62 you can double
  • 00:22:23
    that up that makes it 62 times 4 now
  • 00:22:26
    you can keep adding these until it becomes 62 times 48
  • 00:22:28
    and you can distribute it around so you
  • 00:22:30
    get the product faster not that you
  • 00:22:33
    should do mathematics that way
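
A toy Python sketch of that divide-and-aggregate idea (not from the talk): split one big computation into independent chunks, compute partial results in parallel, then combine them; the "nodes" here are just local worker processes.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # "map" step: each worker handles one independent slice of the data
    return sum(chunk)

def split(data, parts):
    # cut the input into roughly equal pieces
    size = (len(data) + parts - 1) // parts
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1, 1_000_001))              # stand-in for a huge dataset
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(partial_sum, split(data, 4)))
    print(sum(partials))                          # "reduce" step: 500000500000
```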
  • 00:22:36
    now this layer over here is where your standard
  • 00:22:38
    stuff like spark Map Reduce
  • 00:22:41
    or Flink and also Samza and all these
  • 00:22:45
    will fit now over and above that is a
  • 00:22:48
    query engine now query engine is is a
  • 00:22:53
    layer which basically understands what
  • 00:22:56
    the query is and how is it supposed to
  • 00:22:58
    be distributed on my computer aim book
  • 00:23:01
    so query engines are exists in form of
  • 00:23:05
    like these are very internal to the
  • 00:23:07
    they're very tightly coupled with the
  • 00:23:09
    distributed processing framework as well
  • 00:23:11
    but like graph query engines are one
  • 00:23:13
    example spark has its own query engine
  • 00:23:15
    in form of things called Tachyon and
  • 00:23:18
    their others as well now over and above
  • 00:23:21
    that the most the highest layer of this
  • 00:23:24
    is the query language this is the manner
  • 00:23:26
    in which you interact with your system
  • 00:23:29
    now query language one of the famous
  • 00:23:31
    ones is SQL that's a query language
  • 00:23:33
    sequel is another query language GraphQL
  • 00:23:36
    is another query language so these
  • 00:23:38
    are query languages are basically
  • 00:23:40
    declarative languages most of them
  • 00:23:42
    declarative they will always not be
  • 00:23:44
    declarative it depends on the language
  • 00:23:45
    these languages are as a standard going
  • 00:23:50
    to be used for your consumer tools and
  • 00:23:53
    entire tool chain that you have around
  • 00:23:55
    it so your existing MySQL client
  • 00:24:00
    if it speaks SQL and your query engine
  • 00:24:04
    and the distributed processing framework
  • 00:24:06
    uses the exact same thing then you can
  • 00:24:09
    technically use the same MySQL client
  • 00:24:11
    to query your data like one such common
  • 00:24:13
    example of a query engine is thrift
  • 00:24:14
    thrift is a common name in the
  • 00:24:17
    Hadoop system if you have used so next
  • 00:24:20
    time you run into a problem of a new
  • 00:24:23
    name first understand first is see where
  • 00:24:27
    exactly is your problem and then try to
  • 00:24:31
    map the proper the tool to okay does it
  • 00:24:33
    fit the layer in which I am having that
  • 00:24:36
    problem and see if it solves your
  • 00:24:37
    problem now coming now this is basically
  • 00:24:42
    where we set up warehouses etcetera now
  • 00:24:45
    I want to come to the challenges of what
  • 00:24:48
    challenge
  • 00:24:49
    do you face in warehouse system as well
  • 00:24:53
    this goes slightly into the concepts of
  • 00:24:56
    data modeling do you guys understand
  • 00:25:00
    snowflake schema a star schema should I
  • 00:25:02
    cover that quickly raise your hand if you
  • 00:25:05
    would like me to quickly talk about that
  • 00:25:07
    okay that's a decent amount of hand so
  • 00:25:09
    I'll quickly say so data right usually
  • 00:25:11
    comes in is heavily normalized and what
  • 00:25:14
    you get is you know this is how I save
  • 00:25:16
    data in relational tables you know like
  • 00:25:18
    there is okay so before we do that
  • 00:25:20
    actually I like to cover another point
  • 00:25:21
    in all data analytical systems one
  • 00:25:25
    distinction that we always have to make
  • 00:25:26
    is is something a fact or a dimension
  • 00:25:29
    now this is the underlying ground that
  • 00:25:32
    everything has to work around dimension
  • 00:25:35
    is something which is going to remain
  • 00:25:37
    static like an item when I say static as
  • 00:25:41
    in they don't change they really
  • 00:26:43
    do not come in as events an event is a fact
  • 00:25:45
    that happens around something let's take
  • 00:25:49
    an example of a car sale and one factual
  • 00:25:54
    process like a car is a dimension a sale
  • 00:25:58
    record is a fact around a dimension so
  • 00:26:02
    whatever you call as metrics or events
  • 00:26:05
    are actually facts like earth is flat
  • 00:26:09
    is this a fact or a dimension well Earth
  • 00:26:12
    is a dimension and earth is
  • 00:26:14
    flat or round is a fact it has changed
  • 00:26:16
    over time like the facts can change but
  • 00:26:19
    dimensions will pretty much remain inherently
  • 00:26:21
    these are the entities or actors of
  • 00:26:23
    your relational system so if this
  • 00:26:27
    is a dimension a dimension may depend on
  • 00:26:29
    another dimension like a car may belong
  • 00:26:33
    to a brand a car might have an engine or
  • 00:26:35
    gearbox or tires now all of these are
  • 00:26:39
    dimensions which are also connected to
  • 00:26:41
    other dimensions like a tire is not
  • 00:26:43
    going to change like it's a dimension
  • 00:26:44
    like it's a Pirelli tire or a Bridgestone
  • 00:26:47
    tire and there are facts around it now
  • 00:26:50
    one single common fact would be at the
  • 00:26:52
    center let's say sale happened now if
  • 00:26:56
    you were to make this query on a system
  • 00:26:59
    like this
  • 00:27:01
    most of the time your query engine is
  • 00:27:03
    actually going to be busy making these
  • 00:27:05
    joints now remember we said that joints
  • 00:27:08
    are expensive so this system is really
  • 00:27:11
    not going to optimize it for you so what
  • 00:27:14
    do you have to do instead is you convert
  • 00:27:16
    snowflake schema why is it called
  • 00:27:18
    snowflake is because it looks like a
  • 00:27:19
    snowflake because each of these have new
  • 00:27:22
    tentacles which are connected to
  • 00:27:23
    different dimensions and you convert
  • 00:27:25
    this into a star schema so star schema
  • 00:27:28
    is basically the primary point
  • 00:27:31
    the genesis of where data warehouse
  • 00:27:32
    systems start and this is what you need
  • 00:27:35
    as a bare minimum to start doing your
  • 00:27:37
    analytical workloads now this is where
  • 00:27:40
    you have a common fact and dimensions
  • 00:27:41
    have been merged so like this dimension
  • 00:27:45
    if it was a car then it would have all
  • 00:27:47
    the attributes of tire gearbox etc all
  • 00:27:49
    flattened out into one single dimension
  • 00:27:52
    so it's not the true dimension it's a
  • 00:27:53
    transformed dimension and the reason why
  • 00:27:56
    this is done this way is because now you
  • 00:27:58
    have to do a lesser number of joins when
  • 00:28:00
    you have to make a query
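
A minimal sketch of a star-schema query (not from the talk), assuming pandas and invented tables: one central fact table and one flattened car dimension, so a single join answers the question.

```python
import pandas as pd

dim_car = pd.DataFrame({
    "car_id": [1, 2],
    "brand":  ["Maruti", "Honda"],
    "engine": ["1.2L", "1.5L"],
    "tyre":   ["MRF", "Bridgestone"],
})

fact_sales = pd.DataFrame({
    "sale_id": [10, 11, 12],
    "car_id":  [1, 2, 1],
    "amount":  [500000, 900000, 520000],
})

# One join from the fact to the flattened dimension answers the question;
# in a snowflake layout the same query would also join separate brand,
# engine and tyre tables.
report = fact_sales.merge(dim_car, on="car_id").groupby("brand")["amount"].sum()
print(report)
```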
  • 00:28:04
    it leads to a very interesting problem though which
  • 00:28:08
    I'll cover very soon but if you guys can
  • 00:28:12
    see it then just hold on to that problem
  • 00:28:14
    another thing is deduplication like when
  • 00:28:18
    data comes in and mostly these are
  • 00:28:21
    asynchronous systems which are sending
  • 00:28:23
    data from different sources that's the other
  • 00:28:27
    problem of the retail market itself
  • 00:28:28
    point-of-sale terminals
  • 00:28:30
    now these point-of-sale terminals are
  • 00:28:32
    remotely connected internet may or may
  • 00:28:34
    not be there so they eventually sync
  • 00:28:36
    with the original data warehouse now
  • 00:28:39
    they may re-sync like the application
  • 00:28:43
    developer who has written that code on
  • 00:28:44
    the other side and it's seeding the
  • 00:28:46
    data into the warehouse this periodic
  • 00:28:49
    sync in case of a network failure can't
  • 00:28:51
    happen over and over again now there is
  • 00:28:54
    no way to watermark your point where you
  • 00:28:57
    say that you know okay I have synced
  • 00:28:59
    till point X now you can't do this on
  • 00:29:02
    every single record of your transaction
  • 00:29:04
    so you would do it in some batches you
  • 00:29:05
    know like I'll send 10 records at a time
  • 00:29:07
    now what happens if the network fails
  • 00:29:09
    while those 10 are being sent so you
  • 00:29:12
    would invariably
  • 00:29:14
    regardless of no matter how good code
  • 00:29:16
    you write which is a fault-tolerant or
  • 00:29:19
    you know has all hysterics and all kinds
  • 00:29:22
    of things inbuilt into it you will run
  • 00:29:25
    into a problem where there will be some
  • 00:29:26
    duplicates which are sent these
  • 00:29:28
    duplicates have to be handled at the
  • 00:29:31
    ingestion time of the warehouse system
  • 00:29:34
    one of the standard ways to do this
  • 00:29:37
    is something called bloom filters and
  • 00:29:39
    cuckoo filters which is an alternate
  • 00:29:42
    implementation of bloom filters now how
  • 00:29:45
    basically they work is that given a
  • 00:29:48
    stream of data which is coming in you
  • 00:29:51
    keep an array of what has been seen one
  • 00:29:54
    interesting property of these filters is
  • 00:29:56
    that they can never tell you whether
  • 00:29:57
    something exists or not
  • 00:29:59
    but they can certainly tell you that
  • 00:30:01
    this is something not present inside it
  • 00:30:04
    so the nature of the question that you
  • 00:30:06
    ask you the filter sort of changes so
  • 00:30:08
    you can never ask it have you seen this
  • 00:30:10
    value earlier the bloom filter can only
  • 00:30:13
    tell you with the probability that I may
  • 00:30:15
    or may not have seen this but have you
  • 00:30:18
    never seen this value they can tell you
  • 00:30:20
    with a certainty that yes I have never
  • 00:30:22
    seen this value so in a sample
  • 00:30:25
    implementation like this like these are
  • 00:30:27
    the objects that has already seen inside
  • 00:30:28
    it does a multi hashing and saves it
  • 00:30:31
    across a fixed storage space where the
  • 00:30:35
    index locations can actually be
  • 00:30:37
    duplicated and when you get a new object
  • 00:30:39
    you see that have you not seen this and
  • 00:30:41
    if the bloom filter says that yes I have
  • 00:30:42
    not seen this only then you actually go
  • 00:30:44
    forward and ingested into the warehouse
  • 00:30:46
    so this very effectively handles your
  • 00:30:51
    deduplications but one thing to understand
  • 00:30:53
    is deduplication most of these
  • 00:30:56
    things can only be done across a window
  • 00:30:58
    of time you cannot say that you
  • 00:31:00
    deduplicate over an entire year because if
  • 00:31:02
    your entire year had two terabytes of
  • 00:31:04
    data a new event comes in you can't search
  • 00:31:06
    over the entire terabytes or petabytes
  • 00:31:08
    of data so you will have to do it in a
  • 00:31:10
    finite window that you define let's say
  • 00:31:13
    find duplicates in the past say a
  • 00:31:17
    fortnight or a week or ten days I mean
  • 00:31:19
    that's a call that you have to take on
  • 00:31:21
    how your system is implemented
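
A toy bloom-filter sketch for ingest-time deduplication over a bounded window, along the lines described above; this is illustrative only (real systems would use a tuned library), and the record IDs are invented.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per "bit", kept simple on purpose

    def _positions(self, key):
        # derive several positions from salted hashes of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def maybe_seen(self, key):
        # False means "definitely never seen"; True only means "possibly seen"
        return all(self.bits[pos] for pos in self._positions(key))

seen = BloomFilter()
for record_id in ["txn-1", "txn-2", "txn-1"]:   # txn-1 gets retransmitted
    if not seen.maybe_seen(record_id):
        seen.add(record_id)
        print("ingest", record_id)
    else:
        print("skip probable duplicate", record_id)
```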
  • 00:31:25
    now if you come back to the star schema where
  • 00:31:27
    we said that you know in a snowflake
  • 00:31:30
    dimensions are actually merged into a
  • 00:31:33
    common dimension while ingesting into
  • 00:31:34
    the DB now there is a probability that a
  • 00:31:38
    dimension has changed over time now this
  • 00:31:41
    is one of the most common and the most
  • 00:31:43
    challenging problem that a warehouse
  • 00:31:45
    system can face primarily because you
  • 00:31:49
    dump data today you do analysis day
  • 00:31:53
    after tomorrow now when you look back at
  • 00:31:56
    the data and you want to do an analysis
  • 00:31:59
    and you do the cross joints across
  • 00:32:03
    tables and if these tables are only
  • 00:32:05
    carrying today's value you can never do
  • 00:32:08
    a point-in-time check so if I have to
  • 00:32:10
    find let's say certain goods let's say
  • 00:32:16
    that the same example again I'm trying
  • 00:32:18
    to find the most profitable goods which
  • 00:32:20
    have the most profitable shelf life
  • 00:32:22
    now this shelf life is dependent on a
  • 00:32:24
    lot of things on transportation of the
  • 00:32:25
    goods as well
  • 00:32:26
    transportation is dependent on oil
  • 00:32:28
    prices oil prices change almost every
  • 00:32:31
    day well not until recently they were
  • 00:32:33
    fixed but they change like globally they
  • 00:32:35
    change every day so when you want to do
  • 00:32:38
    that analysis you want historical data
  • 00:32:41
    that for that particular fact I wanna do
  • 00:32:44
    a join on the dimension value that
  • 00:32:47
    existed back then so what you are trying
  • 00:32:51
    to track is a dimensions journey through
  • 00:32:54
    time as well that dimension value looked
  • 00:32:56
    like this back then like a listing a
  • 00:32:59
    there an example of the car thing itself
  • 00:33:01
    like a tire used to cost 10 rupees back
  • 00:33:04
    then but today if it cost thousand
  • 00:33:05
    rupees or 10,000 rupees when you do an
  • 00:33:07
    analysis today of ten years of data you
  • 00:33:10
    only want to look at that 10 rupees not
  • 00:33:13
    ten thousand rupees
  • 00:33:14
    as it exists today so the way to
  • 00:33:17
    implement this this problem is called
  • 00:33:19
    slowly changing dimensions or the acronym
  • 00:33:21
    SCD they have multiple ways to solve
  • 00:33:25
    this one is as always you ignore it and
  • 00:33:28
    you do not solve it so like you only get
  • 00:33:31
    today's data no I mean it's actually
  • 00:33:33
    theoretically legit as well they have
  • 00:33:35
    called it they call it type 0 which is
  • 00:33:36
    ignoring the entire thing that it
  • 00:33:39
    doesn't change
  • 00:33:41
    then there are type ones where you say
  • 00:33:42
    that I can overwrite the values or not
  • 00:33:46
    change the value at all I'll keep it as
  • 00:33:47
    it is and I'll never change it or you
  • 00:33:49
    can add certain time filters next to it
  • 00:33:52
    where you say that start time of this
  • 00:33:55
    value was this and end time was this and
  • 00:33:57
    next time when you're making a join you
  • 00:33:59
    can see when the fact had arrived and do
  • 00:34:01
    a join on anything which is less than the end
  • 00:34:04
    time and greater than start time of the
  • 00:34:05
    value yeah this is fairly explained or
  • 00:34:09
    very well across the Wikipedia article
  • 00:34:11
    as well usually I find Wikipedia too
  • 00:34:14
    overwhelming and not explaining at all
  • 00:34:15
    but for one this rare exception they
  • 00:34:17
    have done a good job so this maybe
  • 00:34:20
    you can just look this up they'll really
  • 00:34:22
    thoroughly explain to you what the
  • 00:34:23
    solutions are as well
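
A small sketch of the start-time/end-time (SCD Type 2) approach just described, assuming pandas and invented tyre prices: the fact is joined to the dimension row whose validity window contains the fact's timestamp.

```python
import pandas as pd

dim_tyre_price = pd.DataFrame({
    "tyre_id":    [1, 1],
    "price":      [10, 10000],
    "valid_from": pd.to_datetime(["2007-01-01", "2016-01-01"]),
    "valid_to":   pd.to_datetime(["2015-12-31", "2099-12-31"]),
})

fact_sales = pd.DataFrame({
    "tyre_id": [1, 1],
    "sold_at": pd.to_datetime(["2010-06-01", "2017-03-01"]),
})

# Point-in-time join: keep only the dimension version whose validity
# window contains the fact's timestamp.
joined = fact_sales.merge(dim_tyre_price, on="tyre_id")
joined = joined[(joined["sold_at"] >= joined["valid_from"]) &
                (joined["sold_at"] <= joined["valid_to"])]
print(joined[["sold_at", "price"]])   # the 2010 sale sees 10, the 2017 sale sees 10000
```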
  • 00:34:27
    batching versus streaming now this is one of the most
  • 00:34:29
    common questions that I get asked and
  • 00:34:31
    probably because as I promised this talk
  • 00:34:34
    is about buzzwords so I had to add this
  • 00:34:36
    in now batch versus stream these are two
  • 00:34:39
    inherent different kinds of problem and
  • 00:34:42
    different kind of datasets now what you
  • 00:34:46
    can do against the streaming data is
  • 00:34:48
    what you what's that buzzer
  • 00:34:57
    Oh am I over time okay are all
  • 00:35:06
    projectors down can you guys see over
  • 00:35:08
    there
  • 00:35:08
    you guys can okay so should I continue
  • 00:35:10
    what about the guys in the front or you
  • 00:35:11
    don't care so back to batching versus
  • 00:35:19
    streaming like one of the most important
  • 00:35:22
    questions that I get asked and probably
  • 00:35:23
    you will face as well when you try to go
  • 00:35:25
    about starting on your Big Data projects
  • 00:35:28
    or trying to implement what tool and
  • 00:35:30
    what framework is good now inherently
  • 00:35:32
    there are two different kind of problems
  • 00:35:33
    both streaming and batching batching is
  • 00:35:37
    done on huge loads of data and always
  • 00:35:41
    done usually in retrospect you can use
  • 00:35:44
    batch analysis to emit something like
  • 00:35:50
    cuboids of data I'm going to cover that
  • 00:35:52
    very soon or pre computations which are
  • 00:35:56
    next going to assist your streaming like
  • 00:35:59
    now to put this into perspective let's
  • 00:36:01
    say a real-time transaction is happening
  • 00:36:04
    in your banking system you want to check
  • 00:36:06
    whether this is fraud or not what other
  • 00:36:09
    easy ways to check fraud like one easy
  • 00:36:11
    way is by looking at all historical data
  • 00:36:14
    you can take quick patterns on that here
  • 00:36:18
    when transactions happen after 11:00
  • 00:36:21
    p.m. from this IP address they can be
  • 00:36:24
    fraudulent now you can use this as
  • 00:36:26
    retrospective analysis feed this as a
  • 00:36:29
    quick filter to your stream which is
  • 00:36:31
    coming in and do a quick analysis
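
A toy sketch of batch output assisting a stream, along the lines of the fraud example above; the rule, IP prefixes and events are invented for illustration.

```python
from datetime import datetime

# Pretend this came out of a nightly batch job over historical data:
# "transactions after 23:00 from these networks look fraudulent".
batch_rules = {"suspicious_prefixes": ("203.0.113.",), "late_hour": 23}

def looks_fraudulent(event):
    # cheap per-event check that is fast enough for a real-time stream
    late = datetime.fromisoformat(event["time"]).hour >= batch_rules["late_hour"]
    bad_net = event["ip"].startswith(batch_rules["suspicious_prefixes"])
    return late and bad_net

stream = [
    {"id": 1, "time": "2017-07-01T23:30:00", "ip": "203.0.113.7"},
    {"id": 2, "time": "2017-07-01T10:05:00", "ip": "198.51.100.2"},
]
for event in stream:
    print(event["id"], "flag" if looks_fraudulent(event) else "ok")
```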
  • 00:36:33
    so these are two absolutely disconnected
  • 00:36:36
    processing events and but still the
  • 00:36:38
    question comes as one versus the other and
  • 00:36:40
    so we must be able to separate out that
  • 00:36:43
    while we are doing streaming you cannot
  • 00:36:45
    go check the entire earth for is this a
  • 00:36:47
    valid transaction or not well you can if
  • 00:36:51
    there is no one waiting for the
  • 00:36:52
    transaction on the other side but for an
  • 00:36:55
    IOT device or a real-time system where
  • 00:36:57
    events are happening like for let's say
  • 00:37:00
    a flight tracker
  • 00:37:03
    this happened in Nice as well when the
  • 00:37:06
    bus incident happened now when
  • 00:37:10
    unfortunately the
  • 00:37:11
    bus had rammed into a bunch of
  • 00:37:13
    pedestrians the alarm that went off was
  • 00:37:16
    actually three hours later and people
  • 00:37:18
    thought that was a separate attack
  • 00:37:19
    altogether so this is where you have to
  • 00:37:23
    understand that streaming has to happen
  • 00:37:24
    faster all analysis have to happen
  • 00:37:26
    faster in a way where you might take
  • 00:37:29
    assistance from any of the other
  • 00:37:31
    analysis that you have proceeded by
  • 00:37:33
    doing any batch analysis earlier out of
  • 00:37:37
    order processing now if you have a
  • 00:37:41
    asynchronous system where you are
  • 00:37:44
    sending events it is quite tough to sort
  • 00:37:47
    of guarantee the order in which it
  • 00:37:50
    arrives like one event might arrive
  • 00:37:53
    after the first one where it was
  • 00:37:54
    supposed to be in order of T1, T2, T3
  • 00:37:56
    it might arrive as T2, T3, and T1 now
  • 00:37:59
    imagine if you're generating reports and
  • 00:38:02
    you are computing these nice pretty
  • 00:38:04
    graphs where you're doing some analysis
  • 00:38:07
    and probably some thresholds as well
  • 00:38:08
    actionable alerts on that and after all
  • 00:38:12
    that is done an hour later an event
  • 00:38:14
    which was supposed to arrive an
  • 00:38:16
    hour back arrives you will have to go
  • 00:38:19
    through that entire chain of pipelining
  • 00:38:21
    again you know like transform
  • 00:38:22
    deduplicate recompute the entire thing
  • 00:38:25
    so out-of-order processing is something
  • 00:38:28
    that is very crucial to your warehouse
  • 00:38:30
    and there are n number of ways to solve
  • 00:38:34
    this but as a challenge it's worth
  • 00:38:36
    putting this out on the table because
  • 00:38:38
    there's no like exact way you know do
  • 00:38:41
    this and this will solve your problem
  • 00:38:42
    but one thing that you can apply here is
  • 00:38:45
    by taking a window of arrival by
  • 00:38:49
    deferring your compute slightly and say
  • 00:38:52
    that I'm going to only process after
  • 00:38:54
    five minutes of buffer this is sort of
  • 00:38:56
    where buffers and buffering etc will
  • 00:38:58
    also start coming into the picture that
  • 00:39:00
    I'm going to stack these up for a window
  • 00:39:01
    of five minutes and every five minutes
  • 00:39:05
    I'll do an analysis of what comes in in
  • 00:39:07
    what order and whatever might arrive
  • 00:39:10
    after this I am not going to process
  • 00:39:11
    at all then that could be one very quick
  • 00:39:13
    way of solving this.
  • 00:39:17
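A minimal sketch of that buffered-window idea, assuming each event carries an event timestamp and that anything arriving after its window has already been flushed is simply dropped. This is illustrative plain Python, not any particular framework's API.

```python
import heapq
import time

WINDOW_SECONDS = 300  # illustrative: the five-minute buffer described above

class ArrivalWindow:
    """Buffer events for a fixed period, emit them in event-time order,
    and drop anything arriving after its slot has already been flushed."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.heap = []                  # min-heap of (event_time, seq, payload)
        self.seq = 0                    # tie-breaker so payloads are never compared
        self.watermark = float("-inf")  # latest event time already flushed past

    def add(self, event_time, payload):
        if event_time <= self.watermark:
            return False                # too late: its window was already processed
        heapq.heappush(self.heap, (event_time, self.seq, payload))
        self.seq += 1
        return True

    def flush(self, now=None):
        """Emit, in event-time order, everything older than now - window."""
        now = time.time() if now is None else now
        out = []
        while self.heap and self.heap[0][0] <= now - self.window:
            event_time, _, payload = heapq.heappop(self.heap)
            self.watermark = max(self.watermark, event_time)
            out.append(payload)
        return out
```

Calling `flush` on a five-minute timer gives the behaviour described: events inside the window are processed in order, and stragglers that miss their window are dropped rather than forcing the whole pipeline to recompute.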
    while we are talking about data warehouses data cubes are a
  • 00:39:19
    very important aspect of it I will
  • 00:39:24
    very quickly cover cubes and data
  • 00:39:26
    marts as well
  • 00:39:28
    a data mart is actually nothing but just a
  • 00:39:30
    siloed view of a warehouse which is
  • 00:39:32
    given to teams with only read-only access
  • 00:39:34
    the warehouse is mostly read-only anyway but
  • 00:39:36
    this is a further narrowed down version
  • 00:39:38
    of it let's say your entire warehouse consists
  • 00:39:41
    of inventory sales salaries profit loss
  • 00:39:48
    whatever everything is inside that
  • 00:39:49
    system to the finance department you
  • 00:39:54
    would rather only expose the salary view
  • 00:39:56
    which is one data mart for them you
  • 00:39:58
    might have another view which is a very
  • 00:40:02
    siloed view of let's say inventory
  • 00:40:04
    management to the procurement team which
  • 00:40:07
    will have and that silo will be called a data mart.
  • 00:40:09
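As a toy illustration of that siloing, here is a sketch using Python's built-in sqlite3. The table, columns, and the `finance_salary_mart` view are invented names standing in for the narrowed, read-only slice handed to the finance team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- toy warehouse table holding everything the business records
    CREATE TABLE employee_facts (
        emp_id INTEGER, department TEXT, salary REAL, performance_note TEXT
    );
    INSERT INTO employee_facts VALUES
        (1, 'engineering', 90000, 'confidential'),
        (2, 'sales',       70000, 'confidential');

    -- the "salary mart": a narrowed, read-only view exposed to finance
    CREATE VIEW finance_salary_mart AS
        SELECT emp_id, department, salary FROM employee_facts;
""")

# finance only ever queries the view, never the underlying warehouse table
for row in conn.execute("SELECT * FROM finance_salary_mart"):
    print(row)
```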
    cubes are just a view of your
  • 00:40:13
    data which could be multi-dimensional as
  • 00:40:16
    well why cubes are done in the first place is
  • 00:40:20
    for efficiency of retrieval
  • 00:40:22
    because warehouses as they are massive
  • 00:40:25
    they also suffer the same problem that
  • 00:40:27
    you cannot ask a query quickly like it
  • 00:40:29
    has a certain threshold you can optimize
  • 00:40:31
    that but that's going to cost you a lot
  • 00:40:33
    to have a single-digit
  • 00:40:35
    millisecond warehouse is almost next
  • 00:40:38
    to impossible but if you want something
  • 00:40:39
    like that you want to build a real-time
  • 00:40:41
    analytics graph out of it or a stream
  • 00:40:43
    out of it you will have to build cubes
  • 00:40:45
    cubes are pre-computed they could be
  • 00:40:48
    computed on-demand as well in terms of
  • 00:40:50
    SQL databases think of them as either
  • 00:40:51
    views or materialized views and these
  • 00:40:54
    are snapshots that you create which get
  • 00:40:57
    pre-computed and pre-populated now let's
  • 00:40:59
    take a sample example of item location
  • 00:41:02
    day and quantity this table if you're
  • 00:41:06
    familiar can serve a lot of queries like
  • 00:41:09
    on this day how many items were sold for
  • 00:41:12
    this location or how many times is this
  • 00:41:15
    item sold on that day so it's sort of a
  • 00:41:18
    multi-dimensional query view for you
  • 00:41:19
    effectively just a table of SQL but if
  • 00:41:23
    you have to visualize it you can
  • 00:41:24
    visualize it like this like if this
  • 00:41:26
    was the day axis this is the location
  • 00:41:28
    axis this is the item axis then each
  • 00:41:30
    point here is actually the quantity
  • 00:41:33
    which is being sold so you can actually
  • 00:41:34
    get faster results using this the standard
  • 00:41:38
    operations that you use on this are called
  • 00:41:39
    slice, dice, and rotate.
  • 00:41:43
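As a rough illustration, here is a pure-Python sketch that pre-computes such a cube from item/location/day/quantity facts and then slices and dices it. The rows and helper names are invented for the example; a real system would do this inside the warehouse or a cube store rather than in memory.

```python
from collections import defaultdict

# Illustrative fact rows: (item, location, day, quantity), as in the example table.
facts = [
    ("pen",  "pune",   "2017-06-01", 10),
    ("pen",  "mumbai", "2017-06-01",  4),
    ("book", "pune",   "2017-06-02",  7),
    ("pen",  "pune",   "2017-06-02",  3),
]

# Pre-compute the cube: every cell is keyed by all three dimensions.
cube = defaultdict(int)
for item, location, day, qty in facts:
    cube[(item, location, day)] += qty

def slice_by_day(day):
    """Slice: fix one dimension (day) and look at the remaining two."""
    return {(i, l): q for (i, l, d), q in cube.items() if d == day}

def dice(items, locations):
    """Dice: restrict several dimensions to subsets of their values."""
    return {k: q for k, q in cube.items() if k[0] in items and k[1] in locations}

print(slice_by_day("2017-06-01"))   # sales per item/location on that day
print(dice({"pen"}, {"pune"}))      # pen sales in pune across all days
```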
    if you revisit the architecture very quickly this is one
  • 00:41:45
    sample of the implementation that we had
  • 00:41:48
    done so you have an event queue coming in
  • 00:41:52
    there are workers these workers are
  • 00:41:55
    responsible for sanitization of data
  • 00:41:58
    star schema conversion deduplication out
  • 00:42:02
    of order processing and sorting.
  • 00:42:04
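As a rough sketch of what one such worker stage might look like, assuming simple dict-shaped events; the field names are invented, and the in-memory set used for deduplication is only a stand-in (a bloom filter or an external store would be more realistic at scale).

```python
import hashlib

seen = set()  # dedup ledger; a real pipeline might use a bloom filter instead

def sanitize(event):
    """Drop null fields and normalise everything to trimmed strings."""
    return {k: str(v).strip() for k, v in event.items() if v is not None}

def to_star_schema(event):
    """Reduce the raw event to a fact row keyed by dimension values."""
    return {
        "item_id": event.get("item", "unknown"),
        "location_id": event.get("location", "unknown"),
        "day": event.get("timestamp", "")[:10],
        "quantity": int(event.get("qty", 0)),
    }

def deduplicate(fact):
    """Return True the first time we see this fact, False on replays."""
    digest = hashlib.sha1(repr(sorted(fact.items())).encode()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

def worker(raw_event):
    fact = to_star_schema(sanitize(raw_event))
    if deduplicate(fact):
        return fact          # next step: append it to the cheap blob storage
    return None
```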
    then you take that data and very
  • 00:42:06
    quickly save that into a cheap storage
  • 00:42:08
    which is done for effective archival
  • 00:42:11
    which could be something like a blob
  • 00:42:14
    storage like S3 which can be mounted as an
  • 00:42:17
    HDFS system basically this acts as a
  • 00:42:20
    file system now try to map this to the
  • 00:42:22
    same anatomy this is the file system
  • 00:42:24
    layer next you want a compute
  • 00:42:26
    framework which can actually access
  • 00:42:28
    this and do your queries and
  • 00:42:30
    analysis etcetera you can use something
  • 00:42:33
    like Spark here or Apache Flink or Mahout
  • 00:42:37
    or well you could use a hosted EMR and
  • 00:42:40
    you will find tons of other solutions
  • 00:42:42
    out there as well now this is where the
  • 00:42:45
    compute framework will do its job
  • 00:42:47
    generate transformations for you create
  • 00:42:50
    time tables and finally output it into a
  • 00:42:52
    cube system one such cube system which
  • 00:42:54
    is very popular for data analysis is called
  • 00:42:57
    Druid Apache Druid you can use that
  • 00:43:00
    as well and Druid now has this
  • 00:43:02
    capability of serving very fast cube views
  • 00:43:05
    for analysis purposes
  • 00:43:07
    and if one aspect of the consumption of
  • 00:43:09
    warehouse was data visualization you can
  • 00:43:11
    use a tool like Tableau to analyze your
  • 00:43:14
    data that would be it for my talk if I
  • 00:43:18
    am well in time
  • 00:43:20
    [Applause]
  • 00:43:30
    we'll take about two questions and if
  • 00:43:34
    anybody has any more please reach out to
  • 00:43:36
    Piyush we're gonna have a coffee
  • 00:43:37
    break in some time anyway
  • 00:43:39
    questions anyone I see a question there
  • 00:43:46
    Piyush hello just a quick question do
  • 00:43:50
    you have any observations there's stuff
  • 00:43:53
    on the kind of projects we're doing on
  • 00:43:56
    the distinction between Spark Streaming
  • 00:43:58
    and Flink so any information yes
  • 00:44:02
    I'll probably take one step back to
  • 00:44:05
    address your question and I hope I'll be
  • 00:44:07
    able to answer it and that is how
  • 00:44:10
    streaming effectively is underneath a
  • 00:44:14
    window operation like you can't apply
  • 00:44:17
    something on an infinite stream you have
  • 00:44:19
    to yeah so stream is actually a window
  • 00:44:28
    operation so you define a small window
  • 00:44:31
    and you say that I'll take that now this
  • 00:44:33
    window could be bound by two things
  • 00:44:36
    either size of the window or time of the
  • 00:44:38
    window as well and you say that you know
  • 00:44:41
    for these five minutes and five minutes
  • 00:44:43
    is actually too long a time for streaming
  • 00:44:45
    you probably will bring it down to five
  • 00:44:47
    seconds or sometimes 500 milliseconds as
  • 00:44:49
    well depending on the rate of ingestion
  • 00:44:50
    and this window operation is what will
  • 00:44:53
    define what is going to be processed as a batch.
  • 00:44:56
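A small sketch of that size-or-time window in plain Python; it only mimics the micro-batching idea and is not Spark's or Flink's actual API. The limits are checked as each event arrives.

```python
import time

def micro_batches(source, max_size=1000, max_wait=0.5):
    """Group an unbounded event source into small batches, closing each
    window once it holds max_size events or max_wait seconds have passed
    (checked as each event arrives)."""
    batch, started = [], time.monotonic()
    for event in source:
        batch.append(event)
        if len(batch) >= max_size or time.monotonic() - started >= max_wait:
            yield batch
            batch, started = [], time.monotonic()
    if batch:
        yield batch  # flush whatever is left when the source ends

# Usage: each emitted list can be handed to existing batch-processing code.
for b in micro_batches(iter(range(10)), max_size=4):
    print(b)
```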
    now Spark Streaming underneath
  • 00:44:59
    is micro-batching so they use the exact
  • 00:45:02
    same batching code that they wrote for
  • 00:45:05
    earlier when there used to be a batch
  • 00:45:07
    processing system they have taken that
  • 00:45:09
    to Spark Streaming now whereas Flink is
  • 00:45:13
    actually streaming-first they don't reuse
  • 00:45:14
    the same code so what that means is that
  • 00:45:17
    Spark Streaming, and I'm
  • 00:45:20
    actually answering this for versions before
  • 00:45:21
    the newest release so the newest release
  • 00:45:24
    doesn't have this problem they've
  • 00:45:26
    revamped a lot of code, had a problem of
  • 00:45:28
    back pressure because if your stream is
  • 00:45:30
    not processing fast enough you are
  • 00:45:33
    actually going to fall behind and then
  • 00:45:35
    again you'll have stream overflow so
  • 00:45:37
    Spark
  • 00:45:38
    used to have that problem whereas Flink
  • 00:45:40
    was done as streaming-first so for
  • 00:45:42
    Flink it's the other way around batch is
  • 00:45:44
    stream-first
  • 00:45:45
    so you basically pick up a stream and
  • 00:45:48
    then you batch it for an operation of
  • 00:45:50
    yours so that is probably the difference
  • 00:45:53
    that you're looking for at the crux
  • 00:45:55
    of it of course there's an API
  • 00:45:56
    difference like you can do
  • 00:45:57
    transformations on a stream faster in
  • 00:45:59
    Flink any other questions
  • 00:46:10
    okay thank you thank you
  • 00:46:13
    [Applause]
Tags
  • Big Data
  • Data Processing
  • OLTP
  • OLAP
  • Data Warehousing
  • Streaming
  • Batch Processing
  • Data Architecture
  • Descriptive Analysis
  • Predictive Analysis