ExpertTalks2017 - ABC of Distributed Data Processing by Piyush Verma
Summary
TLDR: This talk explores the challenges and complexity of distributed data processing, defining when data is classified as big data: a term associated with immense volume, rapid velocity, and diverse varieties. It discusses the shift from small databases to large-scale systems that integrate multiple data sources, emphasizing operational (OLTP) and analytical (OLAP) processes in data management. The speaker delves into architectural components of data warehouses and highlights the significance of data consistency and integrity while handling large datasets. Challenges such as complex joins, scalability, and the need for efficient querying are addressed, alongside the distinctions between batch and streaming processing, detailing the advantages and caveats of each. Finally, the importance of utilizing tools and technologies to tackle issues of data ingestion, storage, and analysis in modern data landscapes is underlined.
Conclusions
- Understanding when data is considered 'big data'
- Difference between OLTP and OLAP systems
- Challenges of traditional data processing
- Importance of star schema architecture
- Workloads: Descriptive, Predictive, and Prescriptive
- Distinguishing between streaming and batch processing
- Utilizing Bloom filters for data deduplication
- Data cubes for efficient data retrieval
- Handling out-of-order processing
- Comparison of Spark and Flink in streaming contexts
Timeline
- 00:00:00 - 00:05:00
The discussion begins with the concept of distributed data processing, emphasizing the importance of understanding when data becomes 'big data'. A small business scenario illustrates how simple data needs can evolve as businesses grow, leading to complexities in data management.
- 00:05:00 - 00:10:00
As businesses scale, data sources multiply and the volume of transactions increases. The speaker highlights the limitations of traditional databases in handling vast data amounts and the need for different work types, specifically operational and analytical data processing systems.
- 00:10:00 - 00:15:00
'Big Data' is defined using three characteristics: volume, velocity, and variety. High volume datasets, such as those generated by companies like Facebook and the airline industry, showcase the immense scale of data produced and the challenges it creates in processing and analyzing such datasets.
- 00:15:00 - 00:20:00
Velocity describes the pace of incoming data, particularly within the telecom and banking sectors, where real-time analysis is necessary for operations like call quality assessment and fraud detection. This characteristic emphasizes the need for more advanced data handling compared to traditional systems.
- 00:20:00 - 00:25:00
Variety indicates the different forms of data encountered, such as structured, semi-structured, and unstructured formats. This range necessitates careful storage and processing strategies, as businesses aim to retain all generated data for future analysis of unknown needs.
- 00:25:00 - 00:30:00
The significance of data engineering is highlighted, where workloads can be categorized into descriptive, predictive, and prescriptive analysis. Each type serves a distinct purpose, influencing business decisions and future strategies based on historical data patterns.
- 00:30:00 - 00:35:00
Challenges in traditional data architecture are discussed, including vertical scaling and complex data relationships. The speaker notes that as data grows, systems must efficiently manage data storage while minimizing costs and ensuring performance, particularly through complex join operations in databases.
- 00:35:00 - 00:40:00
Anatomy of data warehouse systems is examined from hardware aspects to file systems, storage formats, and processing frameworks. Tools for distributed processing, efficient querying, and data access layers are emphasized as key components in building robust data architectures.
- 00:40:00 - 00:46:14
Clarification of concepts such as snowflake and star schemas illustrates how data is organized for optimal querying performance in data warehouses. This leads into discussing deduplication, handling of evolving dimensions, and the challenges of maintaining data integrity across various points of sale during integration.
Video Q&A
What defines big data?
Big data is defined by its volume, velocity, and variety. When data surpasses traditional processing capabilities, it becomes big data.
What is the difference between OLAP and OLTP?
OLTP (Online Transaction Processing) focuses on managing transaction-oriented applications, whereas OLAP (Online Analytical Processing) is geared towards data analysis and complex queries.
What are the challenges of traditional data processing systems?
Traditional systems face challenges like scalability, complex joins, and inefficiencies in handling large datasets.
What is a star schema?
A star schema is a type of database schema that allows for efficient data retrieval, where a central fact table is connected to dimension tables.
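A minimal sketch of the idea in plain Python, using invented product and store data; the table and field names here are illustrative, not from the talk:

```python
# Minimal star-schema sketch: one fact table referencing two dimension
# tables by key. Table and column names are hypothetical.
from collections import defaultdict

dim_product = {  # dimension: mostly static attributes
    1: {"name": "milk", "brand": "DairyCo"},
    2: {"name": "bread", "brand": "BakeWell"},
}
dim_store = {10: {"city": "Pune"}, 11: {"city": "Mumbai"}}

fact_sales = [  # facts: events that reference dimensions by key
    {"product_id": 1, "store_id": 10, "qty": 3, "amount": 120.0},
    {"product_id": 2, "store_id": 11, "qty": 1, "amount": 40.0},
    {"product_id": 1, "store_id": 11, "qty": 2, "amount": 80.0},
]

# A typical warehouse query: revenue per brand, resolved with one join
# from the central fact table to the product dimension.
revenue_by_brand = defaultdict(float)
for row in fact_sales:
    brand = dim_product[row["product_id"]]["brand"]
    revenue_by_brand[brand] += row["amount"]

print(dict(revenue_by_brand))  # {'DairyCo': 200.0, 'BakeWell': 40.0}
```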
What are the different types of workloads in data processing?
The three main workloads are descriptive, predictive, and prescriptive analyses.
How does streaming differ from batch processing?
Streaming processes data in real-time, while batch processing deals with large volumes of data at once, often retrospectively.
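A toy illustration of the difference, with made-up events and function names; it is a conceptual sketch, not how any particular engine works:

```python
# Same question answered two ways: once over the full stored dataset
# (batch), once incrementally as events arrive (streaming).
events = [{"user": "a", "amount": 10}, {"user": "b", "amount": 5},
          {"user": "a", "amount": 7}]

# Batch: the whole dataset is available; compute in one pass, in retrospect.
def batch_total(all_events):
    return sum(e["amount"] for e in all_events)

# Streaming: events arrive one at a time; keep running state and update
# it per event so an answer is always available "now".
class RunningTotal:
    def __init__(self):
        self.total = 0
    def on_event(self, event):
        self.total += event["amount"]
        return self.total

stream_state = RunningTotal()
for e in events:            # pretend these arrive over time
    latest = stream_state.on_event(e)

assert batch_total(events) == latest == 22
```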
What is a bloom filter?
A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.
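A minimal pure-Python sketch of a Bloom filter; the bit-array size, hash count, and item names are illustrative assumptions, and a real pipeline would use a tuned library implementation:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely not seen' vs 'possibly seen'."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False => certainly never seen; True => possibly seen before.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("txn-001")
print(bf.might_contain("txn-001"))  # True: possibly seen, so skip re-ingesting
print(bf.might_contain("txn-999"))  # False: definitely never seen, safe to ingest
```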
What is the purpose of data cubes in data warehouses?
Data cubes are used to optimize retrieval of multi-dimensional data for analysis purposes.
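A toy sketch of pre-computing a small cube over three dimensions so later lookups become cheap dictionary reads; the data and dimension names are invented:

```python
from collections import defaultdict
from itertools import combinations

# Raw fact rows: (item, location, day, quantity) -- illustrative data.
facts = [
    ("soap", "pune", "2017-07-01", 3),
    ("soap", "mumbai", "2017-07-01", 5),
    ("milk", "pune", "2017-07-02", 2),
]

# Pre-compute totals for every subset of the three dimensions, so queries
# like "how many soaps on 2017-07-01?" become a lookup instead of a scan.
dims = ("item", "location", "day")
cube = defaultdict(int)
for item, location, day, qty in facts:
    record = {"item": item, "location": location, "day": day}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            key = tuple((d, record[d]) for d in group)
            cube[key] += qty

print(cube[(("item", "soap"), ("day", "2017-07-01"))])  # 8
print(cube[()])                                          # 10 (grand total)
```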
How can out-of-order processing be managed?
Out-of-order processing can be handled by defining a time window for arrival and only processing data within that window.
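A minimal sketch of such a window with a fixed lateness budget; the window length, variable names, and the drop-late-events policy are illustrative assumptions:

```python
# Toy event-time windows: events are buffered, a window is finalized once
# the watermark (max event time seen minus the lateness budget) passes its
# end, and anything older than the watermark is dropped.
WINDOW = 300          # 5-minute windows, in seconds
LATENESS = 300        # how long we wait for stragglers

buffer = {}           # window_start -> list of events
max_seen = 0

def on_event(ts, payload):
    global max_seen
    max_seen = max(max_seen, ts)
    watermark = max_seen - LATENESS
    if ts < watermark:
        return                        # too late: its window is closed, drop it
    window_start = ts - (ts % WINDOW)
    buffer.setdefault(window_start, []).append(payload)
    # Finalize every window whose end is at or before the watermark.
    for start in sorted(buffer):
        if start + WINDOW <= watermark:
            finalize(start, buffer.pop(start))

def finalize(start, events):
    print(f"window {start}: {len(events)} events")

for ts, payload in [(100, "a"), (130, "b"), (720, "c"), (90, "late"), (1500, "d")]:
    on_event(ts, payload)   # prints window 0 with 2 events, window 600 with 1
```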
What is the difference between Spark and Flink for streaming?
Spark uses micro-batching for streaming, while Flink is designed for true streaming applications.
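A toy contrast of the two execution styles on an invented stream; this is a conceptual sketch of micro-batching versus per-event processing, not how either engine is actually implemented:

```python
import itertools

def event_source():
    yield from range(10)  # stand-in for an unbounded stream

# Micro-batching (Spark-Streaming style): chop the stream into small
# batches and run ordinary batch logic on each slice.
def micro_batches(source, batch_size=4):
    it = iter(source)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        yield sum(batch)          # a tiny batch job per slice

# Event-at-a-time (Flink style): update state per record as it arrives.
def per_event(source):
    total = 0
    for record in source:
        total += record
        yield total               # a result is available after every event

print(list(micro_batches(event_source())))  # [6, 22, 17]
print(list(per_event(event_source()))[-1])  # 45
```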
- 00:00:04yeah so we are talking about distributed
- 00:00:08data processing so how many of you deal
- 00:00:10with data or big data like engineering
- 00:00:13data in any manner mine analyze etcetera
- 00:00:16alright cool so the most common
- 00:00:20questions that I get asked around data
- 00:00:22is like when does my data become big
- 00:00:25data I think this must be a common sign
- 00:00:28that you've seen at the back of the
- 00:00:29trucks so I hope you guys can see that
- 00:00:31back or you know like we show off data
- 00:00:36my data is bigger than the audit so well
- 00:00:40it's a common thing so let's start with
- 00:00:43how did we reach a point of it being a
- 00:00:46big data or data being so massive where
- 00:00:48we have to even talk about it
- 00:00:50they come in sis data why do you need to
- 00:00:51talk about it so let's start with a
- 00:00:55small business you have a point-of-sale
- 00:00:56system which is connected to it by
- 00:01:00itself yeah it's connected to a small
- 00:01:01MySQL database you are selling stuff
- 00:01:04you so the problem that we are trying to
- 00:01:06solve here is let's say we have to
- 00:01:07identify the most profitable products in
- 00:01:11form of the least shelf-life so the
- 00:01:15product that stays for the least amount
- 00:01:17of time on the shelf before it is
- 00:01:19procured like after it's procured and before
- 00:01:21it is sold so fairly easy to solve in this
- 00:01:24setup like you just have a small DB just
- 00:01:26connect to it and figure out your
- 00:01:27answers you go big in your business now
- 00:01:30you have multiple point-of-sale systems
- 00:01:32in the same departmental store still
- 00:01:34connected to the same database we have
- 00:01:37problems of failover recovery etc all
- 00:01:38standard problems we go one step bigger
- 00:01:41now you attach it to an ERP solution
- 00:01:44where the inventory coming in you're
- 00:01:46tied to that as well because you realize
- 00:01:48that calculating profitability is not
- 00:01:51just computing that point of procurement
- 00:01:53versus point of sale you also need to
- 00:01:54take into account taxations
- 00:01:56or what it meant to get that product on
- 00:02:00that shelf out there because the
- 00:02:02standard answer that you would find from
- 00:02:04this is that the dairy products or the
- 00:02:06poultry products are the most profitable
- 00:02:07products but apparently they are not
- 00:02:09because there's at least shelf life so
- 00:02:11there could be something more now
- 00:02:13imagine this problem over entire
- 00:02:16like you now are at the scale of a huge
- 00:02:19departmental store you have headquarters
- 00:02:21located somewhere in the middle of
- 00:02:22nowhere and you have sales point
- 00:02:24everywhere now if this information was
- 00:02:27still to be computed would you be able
- 00:02:29to do it at the same pace by just
- 00:02:30connecting to that MySQL database now
- 00:02:32millions of transactions are happening
- 00:02:34every day probably every second as well
- 00:02:36there are multiple points of interaction
- 00:02:38of sales and purchases you could be
- 00:02:40buying via Amazon you could be buying
- 00:02:42over phone or email or whatever now all
- 00:02:45of this has to be ingested and you need
- 00:02:47to crunch stock now
- 00:02:49absolutely not feasible in the standard
- 00:02:52ways now this is how we land into the
- 00:02:54problem that now what do we do about it
- 00:02:56so one clear distinction that we'll try
- 00:02:59to make here is type of workloads that
- 00:03:02we usually have any dealing with data
- 00:03:04problems well one our operation and the
- 00:03:08other is analytical so that's the
- 00:03:10standard OLTP model that we always used
- 00:03:12to have and on the right is the OLAP model
- 00:03:14so most of the OLTP stuff is categorized
- 00:03:18as online now by online I don't mean web
- 00:03:21online online means that you make a
- 00:03:23purchase an entry is made whatever
- 00:03:25logistical checks that you have to do
- 00:03:27around the data is it sanitized is it
- 00:03:30forming enough locks you don't wanna
- 00:03:31sell items more than what you have all
- 00:03:34that is tackled by the OLTP system
- 00:03:36and the stuff that we're just talking
- 00:03:38about at the back now analyzing and
- 00:03:40crunching is my business making money is
- 00:03:42my business not making money which are
- 00:03:44the product that I should invest in or
- 00:03:45should not invest in is usually done
- 00:03:47sitting at a pile of data and it's done
- 00:03:50later so you can clearly call it offline
- 00:03:52access so these are the typical
- 00:03:55workloads all I really like to split it
- 00:03:57out to us and as we go further we will
- 00:03:59be actually talking more about the
- 00:04:00offline systems and not the online
- 00:04:02systems because you know I'm assuming
- 00:04:04that most of you have already mastered
- 00:04:06those things I assume so when do we call
- 00:04:12our data to be big enough now this is
- 00:04:16not my terminology it's a standard
- 00:04:18terminology which is used in the data
- 00:04:19engineering textbooks as well now and a
- 00:04:23lot of people who pioneered the same
- 00:04:25came up with this information so we
- 00:04:27define as three
- 00:04:30one is volume now at any point of time you
- 00:04:35have a problem if you have two of these
- 00:04:38three anyone in isolation is easy to
- 00:04:41address and solve using traditional ways
- 00:04:43but if you have any two of these you
- 00:04:45usually would look beyond your
- 00:04:47traditional systems of what you have
- 00:04:48we'll come back we'll later have a look
- 00:04:51at what traditional systems look like
- 00:04:52and what challenges do they run into now
- 00:04:55let's talk about volume of data for a
- 00:04:56moment now facebook produces 500
- 00:04:58terabytes of data a day stats from
- 00:05:01facebook I'm not making them up and with
- 00:05:04that person ordered data you need to
- 00:05:05look into where the advertisements have
- 00:05:07to go now it's not just these online
- 00:05:11businesses like every time you sit on a
- 00:05:12Boeing flight which goes from the west
- 00:05:14coast to the east coast or vice versa
- 00:05:16one flight produces 240 terabytes of
- 00:05:20data like and like really it's just too
- 00:05:24much huge this is just too huge a data
- 00:05:26to process and that's not one flight
- 00:05:28then 18,000 and 800 flights going across
- 00:05:31west coast and east coast every day and
- 00:05:33you can imagine the amount of data
- 00:05:34they're producing so and we've reached
- 00:05:37this point because we have this tendency
- 00:05:39of you know having data so we will save
- 00:05:41it like might as well save it we are
- 00:05:43producing it so yeah on the right we
- 00:05:48have velocity of data velocity of data
- 00:05:50is when your data is coming in at a pace
- 00:05:54where you really cannot process it the
- 00:05:57standard tools that you used to one of
- 00:06:00the biggest consumers of velocity of
- 00:06:02data is telecom industry now every time
- 00:06:05you are making that voice call or a
- 00:06:07cellular call the what happens is that
- 00:06:10these telecom operators are processing
- 00:06:12your call drops quality of voice
- 00:06:15analyzing frequency see where the drop
- 00:06:17is where the quality is going bad users
- 00:06:20on the fly are going to get shuffled as
- 00:06:22well now this is done also taken into
- 00:06:24account previous considerations of a
- 00:06:26zone someone don't like my presentation
- 00:06:30right so you have so these are being
- 00:06:35analyzed at a rate of around 220 million
- 00:06:38calls per minute now this is no mean
- 00:06:42feat you know like this
- 00:06:43you have to ingest the data which is
- 00:06:45beyond your traditional computation
- 00:06:48boundaries it's also used in a lot of
- 00:06:50other companies as well fraud of every
- 00:06:52time you make a transaction you get a
- 00:06:54call from HDFC bank saying that did you
- 00:06:56make this transaction well someone is
- 00:06:58analyzing it someone is processing it at
- 00:07:00least you feel good about the fact that
- 00:07:01someone is taking care then there is a
- 00:07:05break in like someone tries to access
- 00:07:07your account you get a notification from
- 00:07:09Google that you know someone's trying to
- 00:07:11log in with an IP address which doesn't
- 00:07:13really look like yours is it that really
- 00:07:15you or do you want to change all your
- 00:07:16passwords prediction rewards credit all
- 00:07:20of the financial industry largely works
- 00:07:23around these variety of data well as we
- 00:07:25are going forward we do not have any
- 00:07:27single source of data consumer producers
- 00:07:30there are logs unstructured data semi
- 00:07:33structured data structured data
- 00:07:35audio/video this data coming in every
- 00:07:37format as I said like you know we have
- 00:07:39this tendency we're in age where we want
- 00:07:41to produce it and we'll say that would
- 00:07:43bother about it later we'll look and
- 00:07:44this largely comes from one of the
- 00:07:46previous talks I think someone named
- 00:07:48mister lally that's right and he was
- 00:07:51making a point about that there are
- 00:07:54unknowns which you do not know about so
- 00:07:56you save data for what you want to
- 00:07:59analyze but what about when you don't know
- 00:08:01what you are going to analyze someone asked that
- 00:08:03question from the audience well it
- 00:08:06should have been in my talk but anyway
- 00:08:07it has already been asked so when you
- 00:08:10have that you rather save everything
- 00:08:12and later bother about okay this is what
- 00:08:14I need to pull out and this what I need
- 00:08:16to look at for that the super set has to
- 00:08:17be really huge so anyway pick two of
- 00:08:20these three if you have them you really
- 00:08:22have a problem and you should attend the
- 00:08:23rest of this talk and otherwise as well
- 00:08:26this day so why bother with data
- 00:08:28engineering like why do we analyze data
- 00:08:30to begin with most of it is covered in
- 00:08:32the previous talk but I still quickly go
- 00:08:34through it I didn't have time to change
- 00:08:36the slides on the fly there are three kinds
- 00:08:39of workloads that you usually have in
- 00:08:40analytical processing one of them is
- 00:08:42descriptive descriptive when you are
- 00:08:45looking at traditional data you want to
- 00:08:47see that how did I behave in the past
- 00:08:49one year five years ten years twenty
- 00:08:51years or as long as your business has
- 00:08:53been around and you want to generate
- 00:08:55behavior laid out
- 00:08:57and you wanna see that okay you make
- 00:08:59pretty graphs out of it at the end of it
- 00:09:01nothing else then there's predictive
- 00:09:03analysis predictive analysis is using
- 00:09:05the descriptive data you use the same
- 00:09:07patterns and now you want to superimpose
- 00:09:10data and see that okay can I make a
- 00:09:12prediction on what is it going to look
- 00:09:13like they can I make a guess that maybe
- 00:09:15this brand of toothpaste is going to
- 00:09:17sell more in this particular store and
- 00:09:19not the other one so coming back to the
- 00:09:20original problem of a departmental store
- 00:09:22now you are trying to analyze that maybe
- 00:09:24I don't even need to ship poultry in the
- 00:09:26times of let's say in september/october
- 00:09:29people people don't buy it so this can
- 00:09:32help you make good guesses and at the
- 00:09:34end of the day you can make more money out
- 00:09:36of it because you are taking the right
- 00:09:37decisions in your business and then this
- 00:09:39prescriptive prescriptive is absolutely
- 00:09:41guesswork I don't know how people do
- 00:09:43that but well they do so as we move from
- 00:09:47left to the right what you have to make
- 00:09:48sure is that this is deterministic
- 00:09:50behavior and you're moving towards
- 00:09:51probabilistic mathematics there now that
- 00:09:53is like purely into the game of machine
- 00:09:55learning I mean of course there as well
- 00:09:57but now you're really making guess works
- 00:09:59based on a lot of in like big
- 00:10:01polynomials that you really won't be
- 00:10:03able to solve on a piece of paper just a
- 00:10:07quick walk through descriptive is mostly
- 00:10:09done on historical deterministic data
- 00:10:11it's of the manner of inferential how
- 00:10:14ever managers usually make these pretty
- 00:10:16graphs out of it a friend of mine makes
- 00:10:18a lot of money just doing this so yeah
- 00:10:21predictive predictive is future it's
- 00:10:24probabilistic in nature it's based on
- 00:10:27descriptive and you try to take that
- 00:10:29information and try to guess what's
- 00:10:31going to happen later prescriptive is
- 00:10:33well Google assist and all kinds of
- 00:10:35things most common example that I could
- 00:10:37think of so now let's try to solve that
- 00:10:39architecture as round 1 as we have been
- 00:10:41doing mostly so all along let's say
- 00:10:46these are the sources of information
- 00:10:47that are coming in there is this
- 00:10:49direction happening there are events
- 00:10:51happening the semi structured data
- 00:10:52happening somewhere and the sources of
- 00:10:55information are kind of like 1 2 & 3 and
- 00:10:58we're now trying to save them in some
- 00:11:00storage one of the storage choices now
- 00:11:03if there's a talk there has to be
- 00:11:05in a big data conversation that's why I
- 00:11:06put it there so there's my sink well you
- 00:11:10would say that okay I'll
- 00:11:11put it into an Elasticsearch cluster and I
- 00:11:13search it the other day a friend of
- 00:11:15mine was asking me what do
- 00:11:17traditional systems look like versus big
- 00:11:18data systems like everything that ended at
- 00:11:20Elasticsearch was big data so now you
- 00:11:23have Elasticsearch added and you try
- 00:11:25to build a small query engine around it
- 00:11:26where you are able to map and relate the
- 00:11:29data in all different sources storage
- 00:11:32choice - you say that ok you know I
- 00:11:34don't really care about all these
- 00:11:35different choice of data sources one
- 00:11:37database is good enough so I'll take my
- 00:11:39existing solution like MySQL or Postgres
- 00:11:42because they scale well up to say 6
- 00:11:44terabytes of data I have read that in
- 00:11:46documentation it works well I've seen it
- 00:11:47somewhere as well so let's dump
- 00:11:49everything into that database and will
- 00:11:51later bother about it now over here
- 00:11:53we're putting everything into that
- 00:11:54simple thing and we call it a sort of a
- 00:11:57data warehouse now where does it start
- 00:12:01breaking apart oh you don't have the
- 00:12:04latest version of the slide but then
- 00:12:06that's ok so I'll come back to the slide
- 00:12:08later
- 00:12:08so challenges what are the challenges
- 00:12:11that we face with the regular
- 00:12:13architecture that we have well one is
- 00:12:15vertical scaling obviously now you are
- 00:12:18pumping in more and more data and now
- 00:12:21this way you have to make an interesting
- 00:12:22point if you do not categorize the
- 00:12:24sources of your data and put everything
- 00:12:28into the same bucket what you are
- 00:12:30effectively doing is you're paying for
- 00:12:33something that you want to access at low
- 00:12:35latency and while you're putting
- 00:12:37everything aside as the system is going
- 00:12:39to grow your costs are going to grow as
- 00:12:41well like just a comparative study like
- 00:12:43for let's say a gigabyte of data you pay
- 00:12:46around $10 on a hosted system now
- 00:12:49if you were to put everything into that
- 00:12:51system when it becomes 2 terabyte your
- 00:12:53cost is linear its multiplied by that
- 00:12:55not all data is of the same nature so
- 00:12:59you have to categorize selectively
- 00:13:01saying that ok this is something that I
- 00:13:03need now this is something that I need
- 00:13:05later
- 00:13:06this is something that I need more often
- 00:13:07than not while the other stuff well it's
- 00:13:09ok if I can get it hard like say every
- 00:13:12hour or every one day as well sometimes
- 00:13:15that works as well now so vertical
- 00:13:18scaling off a particular single database
- 00:13:20installation is going to be a new
- 00:13:21challenge archival
- 00:13:24because not all data is going to be
- 00:13:27relevant at all points of time so you
- 00:13:29want to store them in manners which is
- 00:13:31retrievable later albeit a little slowly
- 00:13:33with a bit of effort but that's ok
- 00:13:36garbage collection while we are pulling in
- 00:13:41data from left right center every
- 00:13:43corner that we can most of it is
- 00:13:46not going to be relevant after we have
- 00:13:48reached a majority of an analytical
- 00:13:50system where we say that ok these are
- 00:13:51the things that we really do not care
- 00:13:52about ever in the lifecycle of my entire
- 00:13:55analytical processing now this seems to
- 00:13:57be taking a lot of space or maybe it is
- 00:14:00redundant because too many streams are
- 00:14:03sending in the same data let's start
- 00:14:05purging it and has anyone ever tried
- 00:14:09doing a delete operation on a large
- 00:14:12MySQL table or an ALTER TABLE you
- 00:14:16guys can relate with this problem in
- 00:14:17that case complex joins
- 00:14:21you have left joins right joins and
- 00:14:25outer left joins outer right joins
- 00:14:27other joins and as you grow you know
- 00:14:30they get really complex to tackle in
- 00:14:31your relational sense well
- 00:14:34relationships will get complex as well over a
- 00:14:36period of time and your queries become
- 00:14:39more and more complex and long now
- 00:14:41coming back to the old slide and the
- 00:14:43group by operations now most of the
- 00:14:45traditional systems start failing at a
- 00:14:48group by operation now if you aren't
- 00:14:50using your indexes well you
- 00:14:52probably would be doing a full table
- 00:14:54scan operation imagine a full table scan
- 00:14:56operation across billions of rows which
- 00:14:58probably spans across multiple discs as
- 00:15:02well or maybe if you have done a
- 00:15:03partition or sharding then across the
- 00:15:05shards as well each of your cluster and
- 00:15:07the shard is actually going to be bogged
- 00:15:09down with one single query now you have
- 00:15:11to join them on a single driver and that
- 00:15:13driver goes out of memory as well so a
- 00:15:15group by or a joint operation is going
- 00:15:17to kill your existing system so how do
- 00:15:23we solve these problems but before we
- 00:15:25solve this problem I would actually like
- 00:15:27to take a step back on anatomy of what a
- 00:15:29data warehouse system looks like or the
- 00:15:31layers at which it operates so the
- 00:15:36lowest layer that I usually categorize
- 00:15:38it
- 00:15:38processing network and storage layer
- 00:15:40this is the underlying hardware now this
- 00:15:43is the hardware which you pick for and
- 00:15:46you obviously each layer works
- 00:15:48independently of the other layer whether
- 00:15:52this is my own work so if any one of you
- 00:15:54do not agree with this please feel free
- 00:15:55to talk to me about it
- 00:15:57I would like to improve this so if you
- 00:16:00look at each layer each layer is
- 00:16:02independent of what the underlying layer
- 00:16:03is about or what the consumer layer is
- 00:16:05going to look like now as you add more
- 00:16:08and more computation you obviously need
- 00:16:10more and more servers you obviously need
- 00:16:12more and more dries because store is
- 00:16:13going to increase as well you want to
- 00:16:15make those switches and hubs and route
- 00:16:17as faster as faster as well for faster
- 00:16:19network because it is going to move
- 00:16:20across nodes so this is the lowest
- 00:16:23Hardware layer and this is the challenge
- 00:16:26where infrastructure and operations
- 00:16:28usually come around and this is where
- 00:16:30you need cluster managers like if you
- 00:16:32have heard names like YARN okay now from
- 00:16:35this point onward I'll start putting and
- 00:16:37throwing in some popular names of tools
- 00:16:39and technologies that exists out there
- 00:16:41just so that you can map them to what
- 00:16:43exists and what can be associated where
- 00:16:45so the next time you hear a new tool and
- 00:16:47saying that ok how does this fit like
- 00:16:49I've heard of spark where does flink fit
- 00:16:51you can associate with ok what layer is
- 00:16:53it supposed to operate in and you can
- 00:16:55take a more conscious and educated guess
- 00:16:57on ok do I really need this technology
- 00:16:59or does it solve my problem like what
- 00:17:01exactly is my problem
- 00:17:02so is this my problem then I need to
- 00:17:06look at a cluster manager something
- 00:17:07which can help me bring up or bring down
- 00:17:11nodes disks network on demand or
- 00:17:15whenever I need it
- 00:17:17or wherever the system needs it a layer
- 00:17:19above that is file systems file systems
- 00:17:22because now remember we are talking
- 00:17:24about distributed systems not a one
- 00:17:26single system what that means is that we
- 00:17:29need file systems which are aware of
- 00:17:30where is my data going to be saved
- 00:17:33now is this going to be saved on node
- 00:17:35one or node two or node three now this
- 00:17:37distribution of events is where a file
- 00:17:39system like HDFS or CephFS or NFS and
- 00:17:43those kind of systems will come into the
- 00:17:45picture now the difference there would
- 00:17:46be that HDFS is mostly in user space
- 00:17:48whereas
- 00:17:52just telling the password
- 00:18:01right now computers a problem sir
- 00:18:09I'll remember to move the swipe my
- 00:18:11finger on that thing yeah so where was I
- 00:18:16yeah CephFS so the level at which the
- 00:18:20file system operates might change but
- 00:18:22effectively all of them will be the
- 00:18:24exact same problem now we go one level
- 00:18:26above that now storage format now data
- 00:18:30has to be saved in a file system now
- 00:18:32this is where you will make a conscious
- 00:18:33choice of what do I save it as because
- 00:18:35you want data to be accessible faster
- 00:18:39you wanna compress the data because at
- 00:18:41rest you would wanted that okay maybe
- 00:18:44I'll just inflate it back when I receive
- 00:18:46it so compression decompression all
- 00:18:48tackled into it or you might also want
- 00:18:51to look how many of you are aware
- 00:18:52of column storages okay who
- 00:18:57doesn't know column storage and would like
- 00:18:58me to quickly cover that is there
- 00:19:00anybody please raise your hand okay the
- 00:19:02ample hands all right now so I'll just
- 00:19:05quickly cover this up let's say
- 00:19:07traditionally we save data in okay that
- 00:19:12one's going off so you save data as rows
- 00:19:15like each record is a row a row with the
- 00:19:18schema now you have let's say you're
- 00:19:20saving an information of inventory you
- 00:19:22would have an item item code price label
- 00:19:24brand etc all of these would be saved in
- 00:19:27per row basis now one effective store
- 00:19:32that you can also uses is by inverting
- 00:19:34the storage so what you can do is store
- 00:19:38in columns so if one column is called
- 00:19:41item code the entire storage is going to
- 00:19:44be around an ID and a value
- 00:19:48so in a large system like this a column
- 00:19:51storage makes more sense at times
- 00:19:54again it's not a general rule of thumb
- 00:19:55because it depends on your requirement
- 00:19:57but why it would make sense is because
- 00:19:59column storage allows you to be flexible
- 00:20:02on the schema because in a row based
- 00:20:04system if you had placed let's say if
- 00:20:06you're saving this all in a traditional
- 00:20:08table of MySQL and a new column
- 00:20:11was added later you would have to run an
- 00:20:13alter table or add a new column to it in
- 00:20:17the new schema and again imagine running
- 00:20:20this on a MySQL table which has
- 00:20:22around terabytes of data
- 00:20:23your system will actually choke it's
- 00:20:24going to lock it down and you won't be
- 00:20:26able to perform any other operation
- 00:20:27around on it now an alternate way would
- 00:20:30be if we were using a column based
- 00:20:33storage where the storage has been
- 00:20:35inverted in form of that new column will
- 00:20:38now only have the new IDs and the val
- 00:20:40according to it the older storage
- 00:20:42doesn't have to be altered so if that
- 00:20:45quickly explains what column storage is
- 00:20:47I'll be more than happy to answer your
- 00:20:48questions in detail so please catch me
- 00:20:50outside or any one of my team members
- 00:20:51who are around here now
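To make the row-versus-column contrast concrete, here is a minimal sketch in plain Python; the record fields and numbers are invented for illustration and are not from the talk.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"item_code": "A1", "price": 40, "brand": "Acme"},
    {"item_code": "B2", "price": 25, "brand": "Bolt"},
]
row_total = sum(r["price"] for r in rows)   # touches every field of every row

# Column layout: one array per column, aligned by position. Scanning a
# single column touches far less data, and adding a new column later does
# not rewrite the existing records.
columns = {
    "item_code": ["A1", "B2"],
    "price":     [40, 25],
    "brand":     ["Acme", "Bolt"],
}
columns["discount"] = [5, 0]          # "schema change" touches only this column
col_total = sum(columns["price"])      # column scan: reads prices only

print(row_total, col_total)            # 65 65
```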
- 00:20:55so there are multiple storage formats which exist in
- 00:20:57form of HFile for a system like HBase
- 00:21:00or Parquet or Avro and all these are
- 00:21:04storage formats that have been optimized
- 00:21:06for such larger workloads now over and
- 00:21:11above this now you need a distributed
- 00:21:13processing framework as well because
- 00:21:16every time I'm supposed to ask a
- 00:21:18question that to the to the data system
- 00:21:20that tell me how many people were buying
- 00:21:23this product around Diwali or around
- 00:21:26Christmas or around New Year or around
- 00:21:28$15 whenever now this question would
- 00:21:32probably span across multiple nodes and
- 00:21:36across the data would be stored in any
- 00:21:38one of these components down there you
- 00:21:41need something which can massively
- 00:21:44parallelize this operation you really don't
- 00:21:46have to carry this operation as one
- 00:21:49single operation so this sort of goes
- 00:21:53back into roots of functional
- 00:21:54programming and what you are basically
- 00:21:56trying to do is divide an operation into
- 00:21:59smaller units and spread it around and
- 00:22:03an aggregation of that would give you
- 00:22:05the result it's like saying if you have
- 00:22:07to multiply 48 by 62 so what you do
- 00:22:12is 62 into 48 well 62 into
- 00:22:16one is you know what 62 into two is you
- 00:22:18know what now once you have computed 62
- 00:22:20times two you can double
- 00:22:23that up that makes it 62 times 4 now
- 00:22:26you can keep adding these until it becomes 48
- 00:22:28and you can distribute it around so you
- 00:22:30get the product faster not that you
- 00:22:33should do mathematics that way
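As a rough sketch of the divide-and-conquer idea the speaker describes, the toy function below splits one multiplication into chunks that could be computed independently and then aggregated; the function name and the chunking scheme are illustrative assumptions, not from the talk.

```python
from functools import reduce

# Split the multiplier into chunks, compute each small product
# independently (these could run on different workers), then aggregate.
def distributed_multiply(x, times, workers=4):
    per_worker, remainder = divmod(times, workers)
    chunks = [per_worker] * workers
    chunks[0] += remainder
    partials = [x * chunk for chunk in chunks]   # "map" step
    return reduce(lambda a, b: a + b, partials)  # "reduce" step

assert distributed_multiply(62, 48) == 62 * 48 == 2976
```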
- 00:22:36now this layer over here is where your standard
- 00:22:38stuff like Spark MapReduce
- 00:22:41or Flink and Samza and all these
- 00:22:45will fit now over and above that is a
- 00:22:48query engine now query engine is is a
- 00:22:53layer which basically understands what
- 00:22:56the query is and how is it supposed to
- 00:22:58be distributed on my compute framework
- 00:23:01so query engines exist in form of
- 00:23:05like these are very internal to the
- 00:23:07they're very tightly coupled with the
- 00:23:09distributed processing framework as well
- 00:23:11but like graph query engines are one
- 00:23:13example spark has its own query engine
- 00:23:15in form of things called Tachyon and
- 00:23:18there are others as well now over and above
- 00:23:21that the most the highest layer of this
- 00:23:24is the query language this is the manner
- 00:23:26in which you interact with your system
- 00:23:29now query language one of the famous
- 00:23:31ones is SQL that's a query language
- 00:23:33sequel is another query language
- 00:23:36GraphQL is another query language so these
- 00:23:38are query languages are basically
- 00:23:40declarative languages most of them
- 00:23:42declarative they will always not be
- 00:23:44declarative it depends on the language
- 00:23:45these languages are as a standard going
- 00:23:50to be used for your consumer tools and
- 00:23:53entire tool chain that you have around
- 00:23:55it so your existing MySQL client is
- 00:24:00if it speaks SQL and your query engine
- 00:24:04and the distributed processing framework
- 00:24:06uses the exact same thing then you can
- 00:24:09technically use the same MySQL client
- 00:24:11to query your data like one such common
- 00:24:13example of a query engine is thrift
- 00:24:14Thrift is a common name in the
- 00:24:17Hadoop system if you have used it so next
- 00:24:20time you run into a problem of a new
- 00:24:23name first understand first is see where
- 00:24:27exactly is your problem and then try to
- 00:24:31map the proper the tool to okay does it
- 00:24:33fit the layer in which I am having that
- 00:24:36problem and see if it solves your
- 00:24:37problem now coming now this is basically
- 00:24:42where we set up warehouses etcetera now
- 00:24:45I want to come to the challenges of what
- 00:24:48challenge
- 00:24:49do you face in warehouse system as well
- 00:24:53this goes slightly into the concepts of
- 00:24:56data modeling do you guys understand
- 00:25:00snowflake schema a star schema should I
- 00:25:02cover that quickly raise your hand if you
- 00:25:05would like me to quickly talk about that
- 00:25:07okay that's a decent amount of hand so
- 00:25:09I'll quickly say so data right usually
- 00:25:11comes in is heavily normalized and what
- 00:25:14you get is you know this is how I save
- 00:25:16data in relational tables you know like
- 00:25:18there is okay so before we do that
- 00:25:20actually I like to cover another point
- 00:25:21in all data analytical systems one
- 00:25:25distinction that we always have to make
- 00:25:26is is something a fact or a dimension
- 00:25:29now this is the underlying ground that
- 00:25:32everything has to work around dimension
- 00:25:35is something which is going to remain
- 00:25:37static like an item when I say static as
- 00:25:41in they don't change they really
- 00:25:43do not come in as events an event is a fact
- 00:25:45that happens around something let's take
- 00:25:49an example of a car sale and one factual
- 00:25:54process like a car is a dimension a sale
- 00:25:58record is a fact around a dimension so
- 00:26:02whatever you call as metrics or events
- 00:26:05are actually facts like earth is flat
- 00:26:09is this a fact or a dimension well earth
- 00:26:12is a dimension and earth is
- 00:26:14flat or round is a fact it has changed
- 00:26:16over time like the facts can change but
- 00:26:19dimensions will pretty much inherently
- 00:26:21these are the entities or actors of your
- 00:26:23above your relational system so if this
- 00:26:27is a dimension a dimension may depend on
- 00:26:29another dimension like a car may belong
- 00:26:33to a brand a car might have an engine or
- 00:26:35gearbox or tires now all of these are
- 00:26:39dimensions which are also connected to
- 00:26:41other dimensions like a tire is not
- 00:26:43going to change like it's a dimension
- 00:26:44like it's a purely tire or a blister on
- 00:26:47tire and there are facts around it now
- 00:26:50one single common fact would be at the
- 00:26:52center let's say sale happened now if
- 00:26:56you were to make this query on a system
- 00:26:59like this
- 00:27:01most of the time your query engine is
- 00:27:03actually going to be busy making these
- 00:27:05joins now remember we said that joins
- 00:27:08are expensive so this system is really
- 00:27:11not going to optimize it for you so what
- 00:27:14do you have to do instead is you convert
- 00:27:16snowflake schema why is it called
- 00:27:18snowflake is because it looks like a
- 00:27:19snowflake because each of these have new
- 00:27:22tentacles which are connected to
- 00:27:23different dimensions and you convert
- 00:27:25this into a star schema so star schema
- 00:27:28is basically the primary point
- 00:27:31of where data warehouse
- 00:27:32systems start and this is what you need
- 00:27:35as a bare minimum to start doing your
- 00:27:37analytical workloads now this is where
- 00:27:40you have a common fact and dimensions
- 00:27:41have been merged so like this dimension
- 00:27:45if it was a car then it would have all
- 00:27:47the attributes of tire gearbox etc all
- 00:27:49flattened out into one single dimension
- 00:27:52so these are not the true dimensions but
- 00:27:53transformed dimensions and the reason why
- 00:27:56this is done this way is because now you
- 00:27:58have to do a lesser number of joins when
- 00:28:00you have to make a query it leads to a
- 00:28:04very interesting problem though which
- 00:28:08I'll cover very soon but if you guys can
- 00:28:12see it then just hold on to that problem
- 00:28:14another thing is deduplication like when
- 00:28:18data comes in and mostly these are
- 00:28:21asynchronous systems which are sending
- 00:28:23data from different locations that's the other
- 00:28:27problem of the retail market itself
- 00:28:28point-of-sale terminals
- 00:28:30now these point-of-sale terminals are
- 00:28:32remotely connected internet may or may
- 00:28:34not be there so they eventually sync
- 00:28:36with the original data warehouse now
- 00:28:39they may re-sync like the application
- 00:28:43developer who has written that code on
- 00:28:44the other side and if it's seeding the
- 00:28:46data into the warehouse this periodical
- 00:28:49sync in case of a network failure can't
- 00:28:51happen over and over again now there is
- 00:28:54no way to watermark your point where you
- 00:28:57say that you know okay I have synced
- 00:28:59till point X now you can do this on
- 00:29:02every single record of your transaction
- 00:29:04so you would do it in some batches you
- 00:29:05know like I'll send 10 records at a time
- 00:29:07now what happens if the network fails
- 00:29:09while those 10 are being sent so you
- 00:29:12would invariably
- 00:29:14regardless of no matter how good code
- 00:29:16you write which is a fault-tolerant or
- 00:29:19you know has all Hystrix and all kinds
- 00:29:22of things inbuilt into it you will run
- 00:29:25into a problem where there will be some
- 00:29:26duplicates which are said these
- 00:29:28duplicates have to be handled at the
- 00:29:31ingestion time of the warehouse system
- 00:29:34one of the standard ways to do this
- 00:29:37is something called bloom filters and
- 00:29:39cuckoo filters which is an alternate
- 00:29:42implementation of bloom filters now how
- 00:29:45basically they work is that given a
- 00:29:48stream of data which is coming in you
- 00:29:51keep an array of what has been seen one
- 00:29:54interesting property of these filters is
- 00:29:56that they can never tell you whether
- 00:29:57something exists or not
- 00:29:59but they can certainly tell you that
- 00:30:01this is something not present inside it
- 00:30:04so the nature of the question that you
- 00:30:06ask you the filter sort of changes so
- 00:30:08you can never ask it have you seen this
- 00:30:10value earlier the bloom filter can only
- 00:30:13tell you with the probability that I may
- 00:30:15or may not have seen this but have you
- 00:30:18never seen this value they can tell you
- 00:30:20with a certainty that yes I have never
- 00:30:22seen this value so in a sample
- 00:30:25implementation like this like these are
- 00:30:27the objects that it has already seen inside
- 00:30:28it does a multi-hashing and saves it
- 00:30:31across a fixed storage space where the
- 00:30:35index locations can actually be
- 00:30:37duplicated and when you get a new object
- 00:30:39you see that have you not seen this and
- 00:30:41if the bloom filter says that yes I have
- 00:30:42not seen this only then you actually go
- 00:30:44forward and ingest it into the warehouse
- 00:30:46so this very effectively handles your
- 00:30:51deduplication but one thing to understand
- 00:30:53is deduplication most of these
- 00:30:56things can only be done across a window
- 00:30:58of time you cannot say
- 00:31:00deduplicate over an entire year because if
- 00:31:02your entire year had two terabytes of
- 00:31:04data a new event comes in you can't search
- 00:31:06over the entire terabytes or petabytes
- 00:31:08of data so you will have to do it in a
- 00:31:10finite window that you define let's say
- 00:31:13find duplicates in the past
- 00:31:17fortnight or a week or ten days I mean
- 00:31:19that's a call that you have to take on
- 00:31:21how your system is implemented now if
- 00:31:25you come back to the star schema where
- 00:31:27we said that you know in a snowflake
- 00:31:30dimensions are actually merged into a
- 00:31:33common dimension while ingesting into
- 00:31:34the DB now there is a probability that a
- 00:31:38dimension has changed over time now this
- 00:31:41is one of the most common and the most
- 00:31:43challenging problem that a warehouse
- 00:31:45system can face primarily because you
- 00:31:49dump data today you do analysis day
- 00:31:53after tomorrow now when you look back at
- 00:31:56the data and you want to do an analysis
- 00:31:59and you do the cross joints across
- 00:32:03tables and if these tables are only
- 00:32:05carrying today's value you can never do
- 00:32:08a point-in-time check so if I have to
- 00:32:10find let's say certain goods let's say
- 00:32:16that the same example again I'm trying
- 00:32:18to find the most profitable goods which
- 00:32:20have the most profitable shelf life
- 00:32:22now this shelf life is dependent on a
- 00:32:24lot of things on transportation of the
- 00:32:25goods as well
- 00:32:26transportation is dependent on oil
- 00:32:28prices oil prices change almost every
- 00:32:31day well not until recently they were
- 00:32:33fixed but they change like globally they
- 00:32:35change every day so when you want to do
- 00:32:38that analysis you want historical data
- 00:32:41that for that particular fact I wanna do
- 00:32:44a join on the dimension value that
- 00:32:47existed back then so what you are trying
- 00:32:51to track is a dimensions journey through
- 00:32:54time as well that dimension value looked
- 00:32:56like this back then like a listing a
- 00:32:59there an example of the car thing itself
- 00:33:01like a tire used to cost 10 rupees back
- 00:33:04then but today if it cost thousand
- 00:33:05rupees or 10,000 rupees when you do an
- 00:33:07analysis today of ten years of data you
- 00:33:10only want to look at that 10 rupees not
- 00:33:13ten thousand rupees
- 00:33:14as it exists today so the way to
- 00:33:17implement this the problem is called
- 00:33:19slowly changing dimensions or the acronym
- 00:33:21SCD there are multiple ways to solve
- 00:33:25this one is as always you ignore it and
- 00:33:28you do not solve it so like you only get
- 00:33:31today's data no I mean it's actually a
- 00:33:33theoretically liquid inside they have
- 00:33:35called they call it type 0 which is
- 00:33:36ignoring the the entire thing that it
- 00:33:39doesn't change
- 00:33:41then there are type ones where you say
- 00:33:42that I can overwrite the values or not
- 00:33:46change the value at all I'll keep it as
- 00:33:47it is and I'll never change it or you
- 00:33:49can add certain time filters next to it
- 00:33:52where you say that start time of this
- 00:33:55value was this and end time was this and
- 00:33:57next time when you're making a join you
- 00:33:59can see when the fact had arrived and do
- 00:34:01a join on anything which is less than end
- 00:34:04time and greater than start time of the
- 00:34:05value yeah this is fairly explained or
- 00:34:09very well across the Wikipedia article
- 00:34:11as well usually I find Wikipedia too
- 00:34:14overwhelming and not explaining at all
- 00:34:15but for one this rare exception they
- 00:34:17have done a good job so this spin maybe
- 00:34:20you can just look this up they'll really
- 00:34:22thoroughly explain you what the
- 00:34:23solutions are as well
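A minimal sketch of the type-2 style the speaker just described, keeping a validity interval per dimension value so facts can join against the value that was current at the time; the tyre prices, dates, and field names are invented for illustration.

```python
import datetime as dt

# Type-2 style dimension rows: each change to the tyre price gets a new
# row with a validity interval instead of overwriting the old value.
tyre_price_history = [
    {"value": 10,    "start": dt.date(2007, 1, 1), "end": dt.date(2015, 1, 1)},
    {"value": 10000, "start": dt.date(2015, 1, 1), "end": dt.date(9999, 1, 1)},
]

def price_as_of(history, fact_date):
    # Join rule: pick the dimension row whose interval covers the fact's date.
    for row in history:
        if row["start"] <= fact_date < row["end"]:
            return row["value"]
    raise LookupError("no dimension row covers this date")

print(price_as_of(tyre_price_history, dt.date(2010, 6, 1)))  # 10
print(price_as_of(tyre_price_history, dt.date(2017, 6, 1)))  # 10000
```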
- 00:34:27batching versus streaming now this is one of the most
- 00:34:29common question that I get asked and
- 00:34:31probably because as I promised this talk
- 00:34:34is about buzz word so I had to add this
- 00:34:36in now batch versus stream these are two
- 00:34:39inherently different kinds of problems and
- 00:34:42different kind of datasets now what you
- 00:34:46can do against the streaming data is
- 00:34:48what you what's that buzzer
- 00:34:57Oh am I over time okay are all
- 00:35:06projectors down can you guys see over
- 00:35:08there
- 00:35:08you guys can okay so should I continue
- 00:35:10what about the guys in the front or you
- 00:35:11don't care so back to batching versus
- 00:35:19streaming like one of the most important
- 00:35:22questions that I get asked and probably
- 00:35:23you will face as well when you try to go
- 00:35:25about starting on your Big Data projects
- 00:35:28or trying to implement what tool and
- 00:35:30what framework is good now inherently
- 00:35:32there are two different kind of problems
- 00:35:33both screaming and batching batching is
- 00:35:37done on huge loads of data and always
- 00:35:41done usually in retrospect you can use
- 00:35:44batch analysis to emit of something like
- 00:35:50cuboids of data I'm going to cover that
- 00:35:52very soon or pre computations which are
- 00:35:56next going to assist your streaming like
- 00:35:59now to put this into perspective let's
- 00:36:01say a real-time transaction is happening
- 00:36:04in your banking system you want to check
- 00:36:06whether this is fraud or not what other
- 00:36:09easy ways to check fraud like one easy
- 00:36:11way is by looking at all historical data
- 00:36:14you can take quick patterns on that here
- 00:36:18when transactions happen after 11:00
- 00:36:21p.m. from this IP address they can be
- 00:36:24fraudulent now you can use this as
- 00:36:26retrospective analysis feed this as a
- 00:36:29quick filter to your stream which is
- 00:36:31coming in and do a quick analysis
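A toy sketch of the pattern just described: a crude rule mined offline from historical transactions, then applied per event in the stream. All field names, thresholds, and data here are invented for illustration.

```python
# Step 1 (batch, retrospective): mine historical transactions for a
# simple fraud rule.
history = [
    {"hour": 23, "ip": "1.2.3.4", "fraud": True},
    {"hour": 10, "ip": "5.6.7.8", "fraud": False},
    {"hour": 23, "ip": "1.2.3.4", "fraud": True},
]
risky_ips = {t["ip"] for t in history if t["fraud"] and t["hour"] >= 23}

# Step 2 (streaming): apply the pre-computed rule to each live event,
# which is cheap enough to do while someone is waiting on the transaction.
def looks_fraudulent(txn):
    return txn["hour"] >= 23 and txn["ip"] in risky_ips

print(looks_fraudulent({"hour": 23, "ip": "1.2.3.4"}))  # True
print(looks_fraudulent({"hour": 11, "ip": "1.2.3.4"}))  # False
```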
- 00:36:33so these are two absolutely disconnected
- 00:36:36processing events and but still the
- 00:36:38question has on one versus the other and
- 00:36:40so we must be able to separate out that
- 00:36:43while we are doing streaming you cannot
- 00:36:45go check the entire earth for is this a
- 00:36:47valid transaction or not well you can if
- 00:36:51there is no one waiting of the
- 00:36:52transition on the other side but for an
- 00:36:55IOT device or a real-time system where
- 00:36:57events are happening like for let's say
- 00:37:00a flight tracker
- 00:37:03this happened in nice as well when the
- 00:37:06bus incident happened now
- 00:37:10unfortunately
- 00:37:11when the bus had rammed into a bunch of
- 00:37:13pedestrians the alarm that went off was
- 00:37:16actually three hours later and people
- 00:37:18thought that was a separate attack
- 00:37:19altogether so this is where you have to
- 00:37:23understand that streaming has to happen
- 00:37:24faster all analysis have to happen
- 00:37:26faster in a way where you might take
- 00:37:29assistance from any of the other
- 00:37:31analysis that you have proceeded by
- 00:37:33doing any batch analysis earlier out of
- 00:37:37order processing now if you have a
- 00:37:41asynchronous system where you are
- 00:37:44sending events it is quite tough to sort
- 00:37:47of guarantee the order in which it
- 00:37:50arrives like one event might I arrive
- 00:37:53after the first one where it was
- 00:37:54supposed to be an order of T 1 T 2 T 3
- 00:37:56it might arrive as T 2 T 3 and T 1 now
- 00:37:59imagine if you're generating reports and
- 00:38:02you are computing these nice pretty
- 00:38:04graphs where you're doing some analysis
- 00:38:07and probably some thresholds as well
- 00:38:08actionable alerts on that and after all
- 00:38:12that is done an hour later an event
- 00:38:14which was supposed to be done for an
- 00:38:16hour back arrives you will have to go
- 00:38:19through that entire chain of pipelining
- 00:38:21again you know like transform
- 00:38:22deduplicate recompute the entire thing
- 00:37:25so out of order processing is something
- 00:38:28that is very crucial to your warehouse
- 00:38:30and there are n number of ways to solve
- 00:38:34this but as a challenge it's worth
- 00:38:36putting this out on the table because
- 00:38:38there's no like exact way you know do
- 00:38:41this and this will solve your problem
- 00:38:42but one thing that you can apply here is
- 00:38:45by taking a window of arrival by
- 00:38:49deferring your compute slightly and say
- 00:38:52that I'm going to only process after
- 00:38:54five minutes of buffer this is sort of
- 00:38:56where buffers and this buffer etc will
- 00:38:58also start coming into the picture that
- 00:39:00I'm going to stack these up for a window
- 00:39:01of five minutes and every five minutes
- 00:39:05I'll do an analysis of what comes in in
- 00:39:07what order and whatever might arrive
- 00:39:10after this are not going to process this
- 00:39:11at all then that could be one very quick
- 00:39:13way of solving this while we are talking
- 00:39:17about data warehouses data cubes are a
- 00:39:19very important aspect of it because
- 00:39:24cubes and data marts as well I would
- 00:39:26very quickly covered that as well
- 00:39:28data Mart is actually nothing but just a
- 00:39:30siloed view of a warehouse which is
- 00:39:32given to teams as only read-only access
- 00:39:34but warehouse is mostly read-only but
- 00:39:36this is a further narrowed down version
- 00:39:38of lesser your entire warehouse consists
- 00:39:41of inventory sales salaries profit loss
- 00:39:48whatever everything is inside that
- 00:39:49systems to the finance department you
- 00:39:54would rather only expose the salary view
- 00:39:56which is one data Mart for them you
- 00:39:58might have another view which is a very
- 00:40:02siloed view of let's say inventory
- 00:40:04management to the procurement team which
- 00:40:07will have well that silo will be called a
- 00:40:09data mart cubes are just a view of your
- 00:40:13data which could be multi-dimensional as
- 00:40:16well why cubes are done in the first place is
- 00:40:20for efficiency of retrieval
- 00:40:22because warehouses as they are massive
- 00:40:25they also suffer the same problem that
- 00:40:27you cannot ask a query quickly like it
- 00:40:29has a certain threshold you can optimize
- 00:40:31that but that's going to take your lot
- 00:40:33of cost to have a single-digit
- 00:40:35millisecond warehouse it's almost next
- 00:40:38to impossible but if you want something
- 00:40:39like that you want to build a real-time
- 00:40:41analytics graph out of it or a stream
- 00:40:43out of it you will have to build cubes
- 00:40:45cubes are pre computed they could be
- 00:40:48computed on-demand as well in terms of
- 00:40:50SQL databases think of them as either
- 00:40:51views or materialized views and these
- 00:40:54are snapshots that you create which get
- 00:40:57pre computed and pre populated now let's
- 00:40:59take a sample example of item location
- 00:41:02day and quantity this table if you're
- 00:41:06familiar can serve a lot of queries like
- 00:41:09on this day how many items were sold for
- 00:41:12this location or how many times is this
- 00:41:15item sold on that day so it's sort of a
- 00:41:18multi-dimensional query view for you
- 00:41:19effectively just a table of SQL but if
- 00:41:23you have to visualize it you can
- 00:41:24visualize it like this like if this
- 00:41:26was the day axis this is the location
- 00:41:28axis this is the item axis then each
- 00:41:30point here is actually the quantity
- 00:41:33which is being sold so you can actually
- 00:41:34get faster results using this standard
- 00:41:38operation that you use on this is called
- 00:41:39slice dice and rotate if you revisit the
- 00:41:43architecture very quickly this is one
- 00:41:45sample the implementation that we had
- 00:41:48done so you have an event Q coming in
- 00:41:52there are workers these workers are
- 00:41:55responsible for sanitization of data
- 00:41:58star schema conversion deduplication out
- 00:42:02of order processing and sorting them
- 00:42:04in order you take that data you very
- 00:42:06quickly save that into a cheap storage
- 00:42:08which is done for effective archival
- 00:42:11which could be something like a blob
- 00:42:14storage of s3 which can be mounted as a
- 00:42:17HDFS system basically this acts as a
- 00:42:20file system now try to map this to the
- 00:42:22same Anatomy this is the file system
- 00:42:24layer next says you want to compute
- 00:42:26framework which can actually access to
- 00:42:28this access this do your queries
- 00:42:30analysis etcetera you can use something
- 00:42:33like Spark here or Apache Flink and more
- 00:42:37or well you could use a hosted EMR and
- 00:42:40you will find tons of other solutions
- 00:42:42out there as well now this is where the
- 00:42:45compute framework will do its job
- 00:42:47generate transformations for you create
- 00:42:50time tables and finally output it into a
- 00:42:52cube system one such cube system which
- 00:42:54is very popular for data analysis called
- 00:42:57Druid Apache Druid you can use that
- 00:43:00as well and druid has now this
- 00:43:02capability of serving very fast cube views
- 00:43:05for analysis purposes
- 00:43:07and if one aspect of the consumption of
- 00:43:09warehouse was data visualization you can
- 00:43:11use a tool like tableau to analyze your
- 00:43:14data that would be it for my talk if I
- 00:43:18am Valentine
- 00:43:20[Applause]
- 00:43:30we'll take about two questions and if
- 00:43:34anybody has any more please reach out to
- 00:43:36Piyush we're gonna have a coffee
- 00:43:37break in some time anyway
- 00:43:39questions anyone I see a question there
- 00:43:46hello just a quick question do
- 00:43:50you have any observations
- 00:43:53on the kind of projects we're doing on
- 00:43:56the distinction between Spark Streaming
- 00:43:58and Flink so any information yes
- 00:44:02I'll probably take one step back to
- 00:44:05address your question and I hope I'll be
- 00:44:07able to answer it and that is so how
- 00:44:10streaming effectively is underneath
- 00:44:14window operation like you can't apply
- 00:44:17something on an infinite stream you have
- 00:44:19to yeah so stream is actually a window
- 00:44:28operation so you define a small window
- 00:44:31and you say that I'll take that now this
- 00:44:33window could be bound by two things
- 00:44:36either size of the window or time of the
- 00:44:38window as well and you say that you know
- 00:44:41for these five minutes and five minutes
- 00:44:43is actually too long a time for streaming
- 00:44:45you probably will bring it down to five
- 00:44:47seconds or sometimes 500 milliseconds as
- 00:44:49well depending on the rate of ingestion
- 00:44:50and this window operation is what will
- 00:44:53define what is going to be processed as
- 00:44:56a batch now spark streaming underneath
- 00:44:59is micro batching so they use the exact
- 00:45:02same batching code that they wrote for
- 00:45:05earlier when there used to be a batch
- 00:45:07processing system they have taken that
- 00:45:09to spark streaming now whereas flink is
- 00:45:13actually streaming first they don't reuse
- 00:45:14the same code so what that means is that
- 00:45:17spark streaming has a problem of I'm
- 00:45:20actually answering this question before
- 00:45:21the newest release so the newest release
- 00:45:24doesn't have this problem they've
- 00:45:26revamped a lot of code had a problem of
- 00:45:28back pressure because if your stream is
- 00:45:30not processing fast enough you are
- 00:45:33actually going to fall behind and the
- 00:45:35again we'll have stream overflow so
- 00:45:37Spark
- 00:45:38used to have that problem whereas flink
- 00:45:40was done as streaming first so for
- 00:45:42Flink it's the other way around batch is
- 00:45:44stream first
- 00:45:45so you basically pick up a stream and
- 00:45:48then you batch it for an operation of
- 00:45:50yours so that is probably the difference
- 00:45:53that you're looking for at at the crux
- 00:45:55of it of course there's an API
- 00:45:56difference like you can do
- 00:45:57transformations on stream faster in
- 00:45:59Flink any other questions
- 00:46:10okay thank you thank you
- 00:46:13[Applause]
- Big Data
- Data Processing
- OLTP
- OLAP
- Data Warehousing
- Streaming
- Batch Processing
- Data Architecture
- Descriptive Analysis
- Predictive Analysis