ExpertTalks2017 - ABC of Distributed Data Processing by Piyush Verma

00:46:14
https://www.youtube.com/watch?v=tKuCoKDPPEQ

Summary

TL;DR: This talk explores the challenges and complexity of distributed data processing, defining when data is classified as big data, a term associated with immense volume, rapid velocity, and diverse varieties. It discusses the shift from small databases to large-scale systems that integrate multiple data sources, emphasizing operational (OLTP) and analytical (OLAP) processes in data management. The speaker delves into architectural components of data warehouses and highlights the significance of data consistency and integrity while handling large datasets. Challenges such as complex joins, scalability, and the need for efficient querying are addressed, alongside the distinctions between batch and streaming processing, detailing the advantages and caveats of each. Finally, the importance of utilizing tools and technologies to tackle issues of data ingestion, storage, and analysis in modern data landscapes is underlined.

Takeaways

  • Understanding when data is considered 'big data'
  • Difference between OLTP and OLAP systems
  • Challenges of traditional data processing
  • Importance of star schema architecture
  • Workloads: Descriptive, Predictive, and Prescriptive
  • Distinguishing between streaming and batch processing
  • Utilizing bloom filters for data deduplication
  • Data cubes for efficient data retrieval
  • Handling out-of-order processing
  • Comparison of Spark and Flink in streaming contexts

Timeline

  • 00:00:00 - 00:05:00

    The discussion begins with the concept of distributed data processing, emphasizing the importance of understanding when data becomes 'big data'. A small business scenario illustrates how simple data needs can evolve as businesses grow, leading to complexities in data management.

  • 00:05:00 - 00:10:00

    As businesses scale, data sources multiply and the volume of transactions increases. The speaker highlights the limitations of traditional databases in handling vast data amounts and the need for different work types, specifically operational and analytical data processing systems.

  • 00:10:00 - 00:15:00

    'Big Data' is defined using three characteristics: volume, velocity, and variety. High volume datasets, such as those generated by companies like Facebook and the airline industry, showcase the immense scale of data produced and the challenges it creates in processing and analyzing such datasets.

  • 00:15:00 - 00:20:00

    Velocity describes the pace of incoming data, particularly within the telecom and banking sectors, where real-time analysis is necessary for operations like call quality assessment and fraud detection. This characteristic emphasizes the need for more advanced data handling compared to traditional systems.

  • 00:20:00 - 00:25:00

    Variety indicates the different forms of data encountered, such as structured, semi-structured, and unstructured formats. This range necessitates careful storage and processing strategies, as businesses aim to retain all generated data for future analysis of unknown needs.

  • 00:25:00 - 00:30:00

    The significance of data engineering is highlighted, where workloads can be categorized into descriptive, predictive, and prescriptive analysis. Each type serves a distinct purpose, influencing business decisions and future strategies based on historical data patterns.

  • 00:30:00 - 00:35:00

    Challenges in traditional data architecture are discussed, including vertical scaling and complex data relationships. The speaker notes that as data grows, systems must efficiently manage data storage while minimizing costs and ensuring performance, particularly through complex join operations in databases.

  • 00:35:00 - 00:40:00

    Anatomy of data warehouse systems is examined from hardware aspects to file systems, storage formats, and processing frameworks. Tools for distributed processing, efficient querying, and data access layers are emphasized as key components in building robust data architectures.

  • 00:40:00 - 00:46:14

    Clarification of concepts such as snowflake and star schemas illustrates how data is organized for optimal querying performance in data warehouses. This leads into discussing deduplication, handling of evolving dimensions, and the challenges of maintaining data integrity across various points of sale during integration.



Video Q&A

  • What defines big data?

    Big data is defined by its volume, velocity, and variety. When data surpasses traditional processing capabilities, it becomes big data.

  • What is the difference between OLAP and OLTP?

    OLTP (Online Transaction Processing) focuses on managing transaction-oriented applications, whereas OLAP (Online Analytical Processing) is geared towards data analysis and complex queries.

  • What are the challenges of traditional data processing systems?

    Traditional systems face challenges like scalability, complex joins, and inefficiencies in handling large datasets.

  • What is a star schema?

    A star schema is a type of database schema that allows for efficient data retrieval, where a central fact table is connected to dimension tables.

  • What are the different types of workloads in data processing?

    The three main workloads are descriptive, predictive, and prescriptive analyses.

  • How does streaming differ from batch processing?

    Streaming processes data in real-time, while batch processing deals with large volumes of data at once, often retrospectively.

  • What is a bloom filter?

    A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.

  • What is the purpose of data cubes in data warehouses?

    Data cubes are used to optimize retrieval of multi-dimensional data for analysis purposes (see the sketch after this list).

  • How can out-of-order processing be managed?

    Out-of-order processing can be handled by defining a time window for arrival and only processing data within that window.

  • What is the difference between Spark and Flink for streaming?

    Spark uses micro-batching for streaming, while Flink is designed for true streaming applications.
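
Following up on the data-cube answer above: a minimal Python sketch (not from the talk) of pre-aggregating a fact table into a small cube, assuming pandas is available and using invented store/product/month columns.

```python
import pandas as pd

facts = pd.DataFrame({
    "store":   ["S1", "S1", "S2", "S2", "S1"],
    "product": ["milk", "bread", "milk", "milk", "bread"],
    "month":   ["2017-01", "2017-01", "2017-01", "2017-02", "2017-02"],
    "amount":  [120.0, 40.0, 90.0, 110.0, 55.0],
})

# The "cube": one pre-aggregated row per (store, product, month) combination,
# computed once so later queries avoid scanning the raw fact table.
cube = facts.groupby(["store", "product", "month"], as_index=False)["amount"].sum()

# A typical analytical question becomes a cheap lookup on the cube,
# e.g. milk revenue per month across all stores.
print(cube[cube["product"] == "milk"].groupby("month")["amount"].sum())
```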

Subtitles
  • 00:00:04
    yeah so we are talking about distributed
  • 00:00:08
    data processing so how many of you deal
  • 00:00:10
    with data or big data like engineering
  • 00:00:13
    data in any manner mine analyze etcetera
  • 00:00:16
    alright cool so the most common
  • 00:00:20
    questions that I get asked around data
  • 00:00:22
    is like when does my data become big
  • 00:00:25
    data I think this must be a common side
  • 00:00:28
    that you've seen at the back of the
  • 00:00:29
    trucks so I hope you guys can see that
  • 00:00:31
    back or you know like we show off data
  • 00:00:36
    my data is bigger than the audit so well
  • 00:00:40
    it's a common thing so let's start with
  • 00:00:43
    how did we reach a point of it being a
  • 00:00:46
    big data or data being so massive where
  • 00:00:48
    we have to even talk about it
  • 00:00:50
    they come in sis data why do you need to
  • 00:00:51
    talk about it so let's start with a
  • 00:00:55
    small business you have a point-of-sale
  • 00:00:56
    system which is connected to it by
  • 00:01:00
    itself yeah it's connected to a small
  • 00:01:01
    MySQL database you are selling stuff
  • 00:01:04
    you so the problem that we are trying to
  • 00:01:06
    solve here is let's say we have to
  • 00:01:07
    identify the most profitable products in
  • 00:01:11
    form of the least shelf-life so the
  • 00:01:15
    product that stays for the least amount
  • 00:01:17
    of time on the shelf before it is
  • 00:01:19
    procured like after it's procured and before
  • 00:01:21
    it is sold so fairly easy to solve in this
  • 00:01:24
    setup like you just have a small DB just
  • 00:01:26
    connect to it and figure out your
  • 00:01:27
    answers you go big in your business now
  • 00:01:30
    you have multiple point-of-sale systems
  • 00:01:32
    in the same departmental store still
  • 00:01:34
    connected to the same database we have
  • 00:01:37
    problems of failover recovery etc all
  • 00:01:38
    standard problems we go one step bigger
  • 00:01:41
    now you attach it to an ERP solution
  • 00:01:44
    where the inventory coming in you're
  • 00:01:46
    tied to that as well because you realize
  • 00:01:48
    that calculating profitability is not
  • 00:01:51
    just computing that point of procurement
  • 00:01:53
    versus point of sale you also need to
  • 00:01:54
    take into account taxations
  • 00:01:56
    or what it meant to get that product on
  • 00:02:00
    that shelf out there because the
  • 00:02:02
    standard answer that you would find from
  • 00:02:04
    this is that the dairy products or the
  • 00:02:06
    poultry products are the most profitable
  • 00:02:07
    products but apparently they are not
  • 00:02:09
    because there's at least shelf life so
  • 00:02:11
    there could be something more now
  • 00:02:13
    imagine this problem over entire
  • 00:02:16
    like you are now at the scale of a huge
  • 00:02:19
    departmental store you have headquarters
  • 00:02:21
    located somewhere in the middle of
  • 00:02:22
    nowhere and you have sales point
  • 00:02:24
    everywhere now if this information was
  • 00:02:27
    still to be computed would you be able
  • 00:02:29
    to do it at the same pace by just
  • 00:02:30
    connected to a MySQL database now
  • 00:02:32
    millions of transactions are happening
  • 00:02:34
    every day probably every second as well
  • 00:02:36
    there are multiple points of interaction
  • 00:02:38
    of sales and purchases you could be
  • 00:02:40
    buying via Amazon you could be buying
  • 00:02:42
    over phone or email or whatever now all
  • 00:02:45
    of this has to be ingested and you need
  • 00:02:47
    to crunch stock now
  • 00:02:49
    absolutely not feasible in the standard
  • 00:02:52
    ways now this is how we learn into
  • 00:02:54
    problem that no what do we do about it
  • 00:02:56
    so one clear distinction that we'll try
  • 00:02:59
    to make here is type of workloads that
  • 00:03:02
    we usually have any dealing with data
  • 00:03:04
    problems well one our operation and the
  • 00:03:08
    other is analytical so that's the
  • 00:03:10
    standard OLTP model that we always used
  • 00:03:12
    to have and on the right is the OLAP model
  • 00:03:14
    so most of the OLTP stuff is categorized
  • 00:03:18
    as online now by online I don't mean web
  • 00:03:21
    online online means that you make a
  • 00:03:23
    purchase an entry is made whatever
  • 00:03:25
    logistical checks that you have to do
  • 00:03:27
    around the data is it sanitized is it
  • 00:03:30
    forming enough locks you don't wanna
  • 00:03:31
    sell items more than what you have all
  • 00:03:34
    that is tackled by the OLTP system
  • 00:03:36
    and the stuff that we're just talking
  • 00:03:38
    about at the back now analyzing and
  • 00:03:40
    crunching is my business making money is
  • 00:03:42
    my business not making money which are
  • 00:03:44
    the product that I should invest in or
  • 00:03:45
    should not invest in is usually done
  • 00:03:47
    sitting at a pile of data and it's done
  • 00:03:50
    later so you can clearly call it offline
  • 00:03:52
    access so these are the typical
  • 00:03:55
    workloads or how I really like to split it
  • 00:03:57
    out and as we go further we will
  • 00:03:59
    be actually talking more about the
  • 00:04:00
    offline systems and not the online
  • 00:04:02
    systems because you know I'm assuming
  • 00:04:04
    that most of you have already mastered
  • 00:04:06
    those things I assume so when do we call
  • 00:04:12
    our data to be big enough now this is
  • 00:04:16
    not my terminology it's a standard
  • 00:04:18
    terminology which is used in the data
  • 00:04:19
    engineering textbooks as well now and a
  • 00:04:23
    lot of people who pioneered the same
  • 00:04:25
    came up with this information so we
  • 00:04:27
    define it as three
  • 00:04:30
    one is volume now at any point of time you
  • 00:04:35
    have a problem if you have two of these
  • 00:04:38
    three anyone in isolation is easy to
  • 00:04:41
    address and solve using traditional ways
  • 00:04:43
    but if you have any two of these you
  • 00:04:45
    usually would look beyond your
  • 00:04:47
    traditional systems of what you have
  • 00:04:48
    we'll come back we'll later have a look
  • 00:04:51
    at what traditional systems look like
  • 00:04:52
    and what challenges do they run into now
  • 00:04:55
    let's talk about volume of data for a
  • 00:04:56
    moment now facebook produces 500
  • 00:04:58
    terabytes of data a day stats from
  • 00:05:01
    facebook I'm not making them up and with
  • 00:05:04
    that personalized data you need to
  • 00:05:05
    look into where the advertisements have
  • 00:05:07
    to go now it's not just these online
  • 00:05:11
    businesses like every time you sit on a
  • 00:05:12
    Boeing flight which goes from the west
  • 00:05:14
    coast to the east coast or vice versa
  • 00:05:16
    one flight produces 240 terabytes of
  • 00:05:20
    data like and like really it's just too
  • 00:05:24
    much huge this is just too huge a data
  • 00:05:26
    to process and that's not one flight
  • 00:05:28
    then 18,000 and 800 flights going across
  • 00:05:31
    west coast and east coast every day and
  • 00:05:33
    you can imagine the amount of data
  • 00:05:34
    they're producing so and we've reached
  • 00:05:37
    this point because we have this tendency
  • 00:05:39
    of you know having data so we will save
  • 00:05:41
    it like might as well save it we are
  • 00:05:43
    producing it so yeah on the right we
  • 00:05:48
    have velocity of data velocity of data
  • 00:05:50
    is when your data is coming in at a pace
  • 00:05:54
    where you really cannot process it the
  • 00:05:57
    standard tools that you used to one of
  • 00:06:00
    the biggest consumers of velocity of
  • 00:06:02
    data is telecom industry now every time
  • 00:06:05
    you are making that voice call or a
  • 00:06:07
    cellular call the what happens is that
  • 00:06:10
    these telecom operators are processing
  • 00:06:12
    your call drops quality of voice
  • 00:06:15
    analyzing frequency see where the drop
  • 00:06:17
    is where the quality is going bad users
  • 00:06:20
    on the fly are going to get shuffled as
  • 00:06:22
    well now this is done also taken into
  • 00:06:24
    account previous considerations of a
  • 00:06:26
    zone someone don't like my presentation
  • 00:06:30
    right so you have so these are being
  • 00:06:35
    analyzed at a rate of around 220 million
  • 00:06:38
    calls per minute now this is by no means a
  • 00:06:42
    small feat you know like this
  • 00:06:43
    you have to ingest the data which is
  • 00:06:45
    beyond your traditional computation
  • 00:06:48
    boundaries it's also used in a lot of
  • 00:06:50
    other companies as well fraud detection every
  • 00:06:52
    time you make a transaction you get a
  • 00:06:54
    call from HDFC bank saying that did you
  • 00:06:56
    make this transaction well someone is
  • 00:06:58
    analyzing it someone is processing it at
  • 00:07:00
    least you feel good about the fact that
  • 00:07:01
    someone is taking care then there is a
  • 00:07:05
    break in like someone tries to access
  • 00:07:07
    your account you get a notification from
  • 00:07:09
    Google that you know someone's trying to
  • 00:07:11
    log in with an IP address which doesn't
  • 00:07:13
    really look like yours is it that really
  • 00:07:15
    you or do you want to change all your
  • 00:07:16
    passwords prediction rewards credit all
  • 00:07:20
    of the financial industry largely works
  • 00:07:23
    around these variety of data well as we
  • 00:07:25
    are going forward we do not have any
  • 00:07:27
    single source of data consumer producers
  • 00:07:30
    there are logs unstructured data semi
  • 00:07:33
    structured data structured data
  • 00:07:35
    audio/video this data coming in every
  • 00:07:37
    format as I said like you know we have
  • 00:07:39
    this tendency we're in age where we want
  • 00:07:41
    to produce it and we'll say that would
  • 00:07:43
    bother about it later we'll look and
  • 00:07:44
    this largely comes from one of the
  • 00:07:46
    previous talks I think someone named
  • 00:07:48
    mister lally that's right and he was
  • 00:07:51
    making a point about that there are
  • 00:07:54
    unknowns which you do not know about so
  • 00:07:56
    you save data for what you want to
  • 00:07:59
    analyze but what about when you don't know
  • 00:08:01
    what you want to analyze someone had asked that
  • 00:08:03
    question from the audience well it
  • 00:08:06
    should have been in my talk but anyway
  • 00:08:07
    it has already been asked so when you
  • 00:08:10
    have that you rather save everything
  • 00:08:12
    and later bother about okay this is what
  • 00:08:14
    I need to pull out and this what I need
  • 00:08:16
    to look at for that the super set has to
  • 00:08:17
    be really huge so anyway pick two of
  • 00:08:20
    these three if you have them you really
  • 00:08:22
    have a problem and you should attend the
  • 00:08:23
    rest of this talk and otherwise as well
  • 00:08:26
    this day so why bother with data
  • 00:08:28
    engineering like why do we analyze data
  • 00:08:30
    to begin with most of it is covered in
  • 00:08:32
    the previous talk but I still quickly go
  • 00:08:34
    through it I didn't have time to change
  • 00:08:36
    the slides on the fly there three kind
  • 00:08:39
    of workloads that you usually have in
  • 00:08:40
    analytical processing one of them is
  • 00:08:42
    descriptive descriptive when you are
  • 00:08:45
    looking at traditional data you want to
  • 00:08:47
    see that how did I behave in the past
  • 00:08:49
    one year five years ten years twenty
  • 00:08:51
    years or as long as your business has
  • 00:08:53
    been around and you want to generate
  • 00:08:55
    behavior laid out
  • 00:08:57
    and you wanna see that okay you make
  • 00:08:59
    pretty graphs out of it at the end of it
  • 00:09:01
    nothing else then there's predictive
  • 00:09:03
    analysis predictive analysis is using
  • 00:09:05
    the descriptive data you use the same
  • 00:09:07
    patterns and now you want to superimpose
  • 00:09:10
    data and see that okay can I make a
  • 00:09:12
    prediction on what is it going to look
  • 00:09:13
    like they can I make a guess that maybe
  • 00:09:15
    this brand of toothpaste is going to
  • 00:09:17
    sell more in this particular store and
  • 00:09:19
    not the other one so coming back to the
  • 00:09:20
    original problem of a departmental store
  • 00:09:22
    now you are trying to analyze that maybe
  • 00:09:24
    I don't even need to ship poultry in the
  • 00:09:26
    times of let's say in september/october
  • 00:09:29
    people people don't buy it so this can
  • 00:09:32
    help you make good guesses and at the
  • 00:09:34
    end of the day you can make more money out
  • 00:09:36
    of it because you are taking the right
  • 00:09:37
    decisions in your business and then this
  • 00:09:39
    prescriptive prescriptive is absolutely
  • 00:09:41
    guesswork I don't know how people do
  • 00:09:43
    that but well they do so as we move from
  • 00:09:47
    left to the right what you have to make
  • 00:09:48
    sure is that this is deterministic
  • 00:09:50
    behavior and you're moving towards
  • 00:09:51
    probabilistic mathematics there now that
  • 00:09:53
    is like purely into the game of machine
  • 00:09:55
    learning I mean of course there as well
  • 00:09:57
    but now you're really making guess works
  • 00:09:59
    based on a lot of in like big
  • 00:10:01
    polynomials that you really won't be
  • 00:10:03
    able to solve on a piece of paper just a
  • 00:10:07
    quick walk through descriptive is mostly
  • 00:10:09
    done on historical deterministic data
  • 00:10:11
    it's sort of inferential in manner whatever
  • 00:10:14
    managers usually make these pretty
  • 00:10:16
    graphs out of it a friend of mine makes
  • 00:10:18
    a lot of money just doing this so yeah
  • 00:10:21
    predictive predictive is future it's
  • 00:10:24
    probabilistic in nature it's based on
  • 00:10:27
    descriptive and you try to take that
  • 00:10:29
    information and try to guess what's
  • 00:10:31
    going to happen later prescriptive is
  • 00:10:33
    well Google assist and all kinds of
  • 00:10:35
    things most common example that I could
  • 00:10:37
    think of so now let's try to solve that
  • 00:10:39
    architecture as round 1 as we have been
  • 00:10:41
    doing mostly so all along let's say
  • 00:10:46
    these are the sources of information
  • 00:10:47
    that are coming in there is this
  • 00:10:49
    direction happening there are events
  • 00:10:51
    happening the semi structured data
  • 00:10:52
    happening somewhere and the sources of
  • 00:10:55
    information are kind of like 1 2 & 3 and
  • 00:10:58
    we're not trying to save them in some
  • 00:11:00
    storage one of the storage choices now
  • 00:11:03
    if there's a talk there has to be
  • 00:11:05
    in a big data conversation that's why I
  • 00:11:06
    put it there so there's my sink well you
  • 00:11:10
    would say that okay my
  • 00:11:11
    put it to elasticsearch cluster and I
  • 00:11:13
    searched it the other day a friend of
  • 00:11:15
    mine was asking me like what did
  • 00:11:17
    traditional systems look like the big
  • 00:11:18
    data system like everything had ended at
  • 00:11:20
    elasticsearch was big data so now you
  • 00:11:23
    have Elasticsearch added and you try
  • 00:11:25
    to build a small query engine around it
  • 00:11:26
    where you are able to map and relate the
  • 00:11:29
    data in all different sources storage
  • 00:11:32
    choice - you say that ok you know I
  • 00:11:34
    don't really care about all these
  • 00:11:35
    different choice of data sources one
  • 00:11:37
    database is good enough so I'll take my
  • 00:11:39
    existing solution like MySQL or Postgres
  • 00:11:42
    because they scale well up to say 6
  • 00:11:44
    terabytes of data I have read that in
  • 00:11:46
    documentation it works well I've seen it
  • 00:11:47
    somewhere as well so let's dump
  • 00:11:49
    everything into that database and will
  • 00:11:51
    later bother about it now over here
  • 00:11:53
    we're putting everything into that
  • 00:11:54
    simple thing and we call it a sort of a
  • 00:11:57
    data warehouse now where does it start
  • 00:12:01
    breaking apart oh you don't have the
  • 00:12:04
    latest version of the slide but then
  • 00:12:06
    that's ok so I'll come back to the slide
  • 00:12:08
    later
  • 00:12:08
    so challenges what are the challenges
  • 00:12:11
    that we face with the regular
  • 00:12:13
    architecture that we have well one is
  • 00:12:15
    vertical scaling obviously now you are
  • 00:12:18
    pumping in more and more data and now
  • 00:12:21
    this way you have to make an interesting
  • 00:12:22
    point if you do not categorize the
  • 00:12:24
    sources of your data and put everything
  • 00:12:28
    into the same bucket what you are
  • 00:12:30
    effectively doing is you're paying for
  • 00:12:33
    something that you want to access at low
  • 00:12:35
    latency and while you're putting
  • 00:12:37
    everything aside as the system is going
  • 00:12:39
    to grow your costs are going to grow as
  • 00:12:41
    well like just a comparative study like
  • 00:12:43
    for let's say a gigabyte of data you pay
  • 00:12:46
    around $10 on a hosted system now
  • 00:12:49
    if you were to put everything into that
  • 00:12:51
    system when it becomes 2 terabyte your
  • 00:12:53
    cost is linear its multiplied by that
  • 00:12:55
    not all data is of the same nature so
  • 00:12:59
    you have to categorize selectively
  • 00:13:01
    saying that ok this is something that I
  • 00:13:03
    need now this is something that I need
  • 00:13:05
    later
  • 00:13:06
    this is something that I need more often
  • 00:13:07
    than not while the other stuff well it's
  • 00:13:09
    ok if I can get it hard like say every
  • 00:13:12
    hour or every one day as well sometimes
  • 00:13:15
    that works as well now so vertical
  • 00:13:18
    scaling off a particular single database
  • 00:13:20
    installation is going to be a new
  • 00:13:21
    challenge archival
  • 00:13:24
    because not all data is going to be
  • 00:13:27
    relevant at all points of time so you
  • 00:13:29
    want to store them in manners which is
  • 00:13:31
    retrievable later albeit a little slowly
  • 00:13:33
    with a bit of effort but that's ok
  • 00:13:36
    garbage collection while we are pulling in
  • 00:13:41
    data from left right center every
  • 00:13:43
    corner that we can most of it is
  • 00:13:46
    not going to be relevant after we have
  • 00:13:48
    reached a majority of an analytical
  • 00:13:50
    system where we say that ok these are
  • 00:13:51
    the things that we really do not care
  • 00:13:52
    about ever in the lifecycle of my entire
  • 00:13:55
    analytical processing now this seems to
  • 00:13:57
    be taking a lot of space or maybe it is
  • 00:14:00
    redundant because too many streams are
  • 00:14:03
    sending in the same data let's start
  • 00:14:05
    purging it and has anyone ever tried
  • 00:14:09
    doing a delete operation on a large
  • 00:14:12
    MySQL table or an alter table you
  • 00:14:16
    guys can relate with this problem in
  • 00:14:17
    that case complex joins
  • 00:14:21
    you have left joins right joins and
  • 00:14:25
    outer left joins outer right joins
  • 00:14:27
    other joins and as you grow you know
  • 00:14:30
    they get really complex to tackle in
  • 00:14:31
    your relational sense well
  • 00:14:34
    relationships will get complex as well over a
  • 00:14:36
    period of time and your queries become
  • 00:14:39
    more and more complex and long now
  • 00:14:41
    coming back to the old slide and the
  • 00:14:43
    group by operations now most of the
  • 00:14:45
    traditional systems start failing at a
  • 00:14:48
    group by operation now if you aren't
  • 00:14:50
    using your indexes well then you
  • 00:14:52
    probably would be doing a full table
  • 00:14:54
    scan operation imagine a full table scan
  • 00:14:56
    operation across billions of rows which
  • 00:14:58
    probably spans across multiple discs as
  • 00:15:02
    well or maybe if you have done a
  • 00:15:03
    partition or sharding then across the
  • 00:15:05
    shards as well each of your cluster and
  • 00:15:07
    the shard is actually going to be bogged
  • 00:15:09
    down with one single query now you have
  • 00:15:11
    to join them on a single driver and that
  • 00:15:13
    driver goes out of memory as well so a
  • 00:15:15
    group by or a joint operation is going
  • 00:15:17
    to kill your existing system so how do
  • 00:15:23
    we solve these problems but before we
  • 00:15:25
    solve this problem I would actually like
  • 00:15:27
    to take a step back on anatomy of what a
  • 00:15:29
    data warehouse system looks like or the
  • 00:15:31
    layers at which it operates so the
  • 00:15:36
    lowest layer that I usually categorize
  • 00:15:38
    it
  • 00:15:38
    processing network and storage layer
  • 00:15:40
    this is the underlying hardware now this
  • 00:15:43
    is the hardware which you pick for and
  • 00:15:46
    you obviously each layer works
  • 00:15:48
    independently of the other layer well
  • 00:15:52
    this is my own work so if any one of you
  • 00:15:54
    do not agree with this please feel free
  • 00:15:55
    to talk to me about it
  • 00:15:57
    I would like to improve this so if you
  • 00:16:00
    look at each layer each layer is
  • 00:16:02
    independent of what the underlying layer
  • 00:16:03
    is about or what the consumer layer is
  • 00:16:05
    going to look like now as you add more
  • 00:16:08
    and more computation you obviously need
  • 00:16:10
    more and more servers you obviously need
  • 00:16:12
    more and more drives because storage is
  • 00:16:13
    going to increase as well you want to
  • 00:16:15
    make those switches and hubs and routers
  • 00:16:17
    faster as well for faster
  • 00:16:19
    network because it is going to move
  • 00:16:20
    across nodes so this is the lowest
  • 00:16:23
    Hardware layer and this is the challenge
  • 00:16:26
    where infrastructure and operations
  • 00:16:28
    usually come around and this is where
  • 00:16:30
    you need cluster managers like if you
  • 00:16:32
    have heard names like YARN okay now from
  • 00:16:35
    this point onward I'll start putting and
  • 00:16:37
    throwing in some popular names of tools
  • 00:16:39
    and technologies that exists out there
  • 00:16:41
    just so that you can map them to what
  • 00:16:43
    exists and what can be associated where
  • 00:16:45
    so the next time you hear a new tool and
  • 00:16:47
    saying that ok how does this fit like
  • 00:16:49
    I've heard of spark where does flink fit
  • 00:16:51
    you can associate with ok what layer is
  • 00:16:53
    it supposed to operate in and you can
  • 00:16:55
    take a more conscious and educated guess
  • 00:16:57
    on ok do I really need this technology
  • 00:16:59
    or does it solve my problem like what
  • 00:17:01
    exactly is my problem
  • 00:17:02
    so is this my problem then I need to
  • 00:17:06
    look at a cluster manager something
  • 00:17:07
    which can help me bring up or bring down
  • 00:17:11
    nodes disks network on demand or
  • 00:17:15
    whenever I need it
  • 00:17:17
    or wherever the system needs it a layer
  • 00:17:19
    above that is file systems file systems
  • 00:17:22
    because now remember we are talking
  • 00:17:24
    about distributed systems not a one
  • 00:17:26
    single system what that means is that we
  • 00:17:29
    need file systems which are aware of
  • 00:17:30
    where is my data going to be saved
  • 00:17:33
    now is this going to be saved on node
  • 00:17:35
    one or node two or node three now this
  • 00:17:37
    distribution of events is where a file
  • 00:17:39
    system like HDFS or CephFS or NFS and
  • 00:17:43
    those kind of systems will come into the
  • 00:17:45
    picture now the difference there would
  • 00:17:46
    be that HDFS is mostly in user space
  • 00:17:48
    whereas
  • 00:17:52
    just telling the password
  • 00:18:01
    right now computers a problem sir
  • 00:18:09
    I'll remember to move the swipe my
  • 00:18:11
    finger on that thing yeah so where was I
  • 00:18:16
    yeah CephFS so the level at which the
  • 00:18:20
    file system operates might change but
  • 00:18:22
    effectively all of them will be the
  • 00:18:24
    exact same problem now we go one level
  • 00:18:26
    above that to the storage format now data
  • 00:18:30
    has to be saved in a file system now
  • 00:18:32
    this is where you will make a conscious
  • 00:18:33
    choice of what do I save it as because
  • 00:18:35
    you want data to be accessible faster
  • 00:18:39
    you wanna compress the data because at
  • 00:18:41
    rest you would wanted that okay maybe
  • 00:18:44
    I'll just inflate it back when I receive
  • 00:18:46
    it so compression decompression all
  • 00:18:48
    tackled into it or you might also want
  • 00:18:51
    to look how many of you are aware ever
  • 00:18:52
    of aware of column storages okay who
  • 00:18:57
    doesn't know column storage and would like
  • 00:18:58
    me to quickly cover that is there
  • 00:19:00
    anybody please raise your hand okay the
  • 00:19:02
    ample hands all right now so I'll just
  • 00:19:05
    quickly cover this up let's say
  • 00:19:07
    traditionally we save data in okay that
  • 00:19:12
    one going off so you save data as rows
  • 00:19:15
    like each data is a row a row with the
  • 00:19:18
    schema now you have let's say you're
  • 00:19:20
    saving an information of inventory you
  • 00:19:22
    would have an item item code price label
  • 00:19:24
    brand etc all of these would be saved in
  • 00:19:27
    per row basis now one effective store
  • 00:19:32
    that you can also uses is by inverting
  • 00:19:34
    the storage so what you can do is store
  • 00:19:38
    in columns so if one column is called
  • 00:19:41
    item code the entire storage is going to
  • 00:19:44
    be around an ID and a value
  • 00:19:48
    so in a large system like this a column
  • 00:19:51
    storage makes more sense at times
  • 00:19:54
    again it's not a general rule of thumb
  • 00:19:55
    because it depends on your requirement
  • 00:19:57
    but why it would make sense is because
  • 00:19:59
    column storage allows you to be flexible
  • 00:20:02
    on the schema because in a row based
  • 00:20:04
    system if you had placed let's say if
  • 00:20:06
    you're saving this all in a traditional
  • 00:20:08
    table of a my sequel and a new column
  • 00:20:11
    was added later you would have to run an
  • 00:20:13
    alter table or add a new column to it in
  • 00:20:17
    the new schema and again imagine running
  • 00:20:20
    this on a my sequel table which has
  • 00:20:22
    around terabytes of data
  • 00:20:23
    your system will actually choke it's
  • 00:20:24
    going to lock it down and you won't be
  • 00:20:26
    able to perform any other operation
  • 00:20:27
    around on it now an alternate way would
  • 00:20:30
    be if we were using a column based
  • 00:20:33
    storage where the storage has been
  • 00:20:35
    inverted in form of that new column will
  • 00:20:38
    now only have the new IDs and the values
  • 00:20:40
    according to it the older storage
  • 00:20:42
    doesn't have to be altered
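
A tiny illustration (not from the talk) of the row-versus-column layout just described, using plain Python structures with made-up inventory fields; adding a new attribute only touches the new column.

```python
# Row layout: every record carries every field.
rows = [
    {"item_code": "A1", "price": 40, "brand": "Acme"},
    {"item_code": "B7", "price": 25, "brand": "Zenith"},
]

# Column layout: one array per field, positions line up across arrays.
columns = {
    "item_code": ["A1", "B7"],
    "price":     [40, 25],
    "brand":     ["Acme", "Zenith"],
}

# Adding a new attribute later touches nothing that already exists:
# just add one more column array (old rows simply have no value yet).
columns["discount"] = [None, None]

# Scanning a single attribute reads one contiguous array instead of
# every whole row, which is why analytical scans favour columnar layouts.
print(sum(columns["price"]) / len(columns["price"]))
```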
  • 00:20:45
    so if that quickly explains what column storage is
  • 00:20:47
    I'll be more than happy to answer your
  • 00:20:48
    questions in detail so please catch me
  • 00:20:50
    outside or any one of my team members
  • 00:20:51
    who are around here now so there are
  • 00:20:55
    multiple storage formats which exists in
  • 00:20:57
    form of HFile for a system like HBase
  • 00:21:00
    or Parquet or Avro and all these are
  • 00:21:04
    storage formats that have been optimized
  • 00:21:06
    for such larger workloads
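
As a concrete example of a columnar, compressed storage format, here is a minimal sketch (not from the talk) assuming pandas with a Parquet engine such as pyarrow installed; the file name and columns are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "item_code": ["A1", "B7"],
    "price":     [40, 25],
    "brand":     ["Acme", "Zenith"],
})

# Columnar layout plus compression at rest; inflated again on read.
df.to_parquet("inventory.parquet", compression="snappy")
print(pd.read_parquet("inventory.parquet", columns=["price"]))
```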
  • 00:21:11
    now over and above this you need a distributed
  • 00:21:13
    processing framework as well because
  • 00:21:16
    every time I'm supposed to ask a
  • 00:21:18
    question that to the to the data system
  • 00:21:20
    that tell me how many people were buying
  • 00:21:23
    this product around Diwali or around
  • 00:21:26
    Christmas or around New Year or around
  • 00:21:28
    the 15th whenever now this question would
  • 00:21:32
    probably span across multiple nodes and
  • 00:21:36
    across the data would be stored in any
  • 00:21:38
    one of these components down there you
  • 00:21:41
    need something which can massively
  • 00:21:44
    parallelize this operation you really don't
  • 00:21:46
    have to carry this operation as one
  • 00:21:49
    single operation so this sort of goes
  • 00:21:53
    back into roots of functional
  • 00:21:54
    programming and what you are basically
  • 00:21:56
    trying to do is divide an operation into
  • 00:21:59
    smaller units and spread it around and
  • 00:22:03
    an aggregation of that would give you
  • 00:22:05
    the result it's like saying if you have
  • 00:22:07
    to multiply 48 by 62 so what you do
  • 00:22:12
    is this 62 into 48 well you know what 62 into
  • 00:22:16
    one is you know what 62 into two is you
  • 00:22:18
    know what now once you have computed
  • 00:22:20
    two times 62 you can double
  • 00:22:23
    that up that makes it 62 times 4 now
  • 00:22:26
    you can keep adding these until it becomes 62 times 48
  • 00:22:28
    and you can distribute it around so you
  • 00:22:30
    get the product faster not that you
  • 00:22:33
    should do mathematics that way
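
A toy Python sketch of that divide-and-aggregate idea (not from the talk): split one big computation into independent chunks, compute partial results in parallel, then combine them; the "nodes" here are just local worker processes.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # "map" step: each worker handles one independent slice of the data
    return sum(chunk)

def split(data, parts):
    # cut the input into roughly equal pieces
    size = (len(data) + parts - 1) // parts
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1, 1_000_001))              # stand-in for a huge dataset
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(partial_sum, split(data, 4)))
    print(sum(partials))                          # "reduce" step: 500000500000
```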
  • 00:22:36
    now this layer over here is where your standard
  • 00:22:38
    stuff like spark Map Reduce
  • 00:22:41
    or Flink and also Samza and all these
  • 00:22:45
    will fit now over and above that is a
  • 00:22:48
    query engine now query engine is is a
  • 00:22:53
    layer which basically understands what
  • 00:22:56
    the query is and how is it supposed to
  • 00:22:58
    be distributed on my computer aim book
  • 00:23:01
    so query engines are exists in form of
  • 00:23:05
    like these are very internal to the
  • 00:23:07
    they're very tightly coupled with the
  • 00:23:09
    distributed processing framework as well
  • 00:23:11
    but like graph query engines are one
  • 00:23:13
    example spark has its own query engine
  • 00:23:15
    in form of things called Tachyon and
  • 00:23:18
    their others as well now over and above
  • 00:23:21
    that the most the highest layer of this
  • 00:23:24
    is the query language this is the manner
  • 00:23:26
    in which you interact with your system
  • 00:23:29
    now query language one of the famous
  • 00:23:31
    ones is SQL that's a query language
  • 00:23:33
    sequel is another query language GraphQL
  • 00:23:36
    is another query language so these
  • 00:23:38
    are query languages are basically
  • 00:23:40
    declarative languages most of them
  • 00:23:42
    declarative they will always not be
  • 00:23:44
    declarative it depends on the language
  • 00:23:45
    these languages are as a standard going
  • 00:23:50
    to be used for your consumer tools and
  • 00:23:53
    entire tool chain that you have around
  • 00:23:55
    it so your existing MySQL client
  • 00:24:00
    if it speaks SQL and your query engine
  • 00:24:04
    and the distributed processing framework
  • 00:24:06
    uses the exact same thing then you can
  • 00:24:09
    technically use the same MySQL client
  • 00:24:11
    to query your data like one such common
  • 00:24:13
    example of a query engine is thrift
  • 00:24:14
    thrift is a common name in the
  • 00:24:17
    Hadoop system if you have used so next
  • 00:24:20
    time you run into a problem of a new
  • 00:24:23
    name first understand first is see where
  • 00:24:27
    exactly is your problem and then try to
  • 00:24:31
    map the proper the tool to okay does it
  • 00:24:33
    fit the layer in which I am having that
  • 00:24:36
    problem and see if it solves your
  • 00:24:37
    problem now coming now this is basically
  • 00:24:42
    where we set up warehouses etcetera now
  • 00:24:45
    I want to come to the challenges of what
  • 00:24:48
    challenge
  • 00:24:49
    do you face in warehouse system as well
  • 00:24:53
    this goes slightly into the concepts of
  • 00:24:56
    data modeling do you guys understand
  • 00:25:00
    snowflake schema a star schema should I
  • 00:25:02
    cover that quickly raise your hand if you
  • 00:25:05
    would like me to quickly talk about that
  • 00:25:07
    okay that's a decent amount of hand so
  • 00:25:09
    I'll quickly say so data right usually
  • 00:25:11
    comes in is heavily normalized and what
  • 00:25:14
    you get is you know this is how I save
  • 00:25:16
    data in relational tables you know like
  • 00:25:18
    there is okay so before we do that
  • 00:25:20
    actually I like to cover another point
  • 00:25:21
    in all data analytical systems one
  • 00:25:25
    distinction that we always have to make
  • 00:25:26
    is is something a fact or a dimension
  • 00:25:29
    now this is the underlying ground that
  • 00:25:32
    everything has to work around dimension
  • 00:25:35
    is something which is going to remain
  • 00:25:37
    static like an item when I say static as
  • 00:25:41
    in they don't change they really
  • 00:26:43
    do not come in as events an event is a fact
  • 00:25:45
    that happens around something let's take
  • 00:25:49
    an example of a car sale and one factual
  • 00:25:54
    process like a car is a dimension a sale
  • 00:25:58
    record is a fact around a dimension so
  • 00:26:02
    whatever you call as metrics or events
  • 00:26:05
    are actually facts like earth is flat
  • 00:26:09
    is this a fact or a dimension well Earth
  • 00:26:12
    is a dimension and earth is
  • 00:26:14
    flat or round is a fact it has changed
  • 00:26:16
    over time like the facts can change but
  • 00:26:19
    dimensions will pretty much remain inherently
  • 00:26:21
    these are the entities or actors of
  • 00:26:23
    your relational system so if this
  • 00:26:27
    is a dimension a dimension may depend on
  • 00:26:29
    another dimension like a car may belong
  • 00:26:33
    to a brand a car might have an engine or
  • 00:26:35
    gearbox or tires now all of these are
  • 00:26:39
    dimensions which are also connected to
  • 00:26:41
    other dimensions like a tire is not
  • 00:26:43
    going to change like it's a dimension
  • 00:26:44
    like it's a Pirelli tire or a Bridgestone
  • 00:26:47
    tire and there are facts around it now
  • 00:26:50
    one single common fact would be at the
  • 00:26:52
    center let's say sale happened now if
  • 00:26:56
    you were to make this query on a system
  • 00:26:59
    like this
  • 00:27:01
    most of the time your query engine is
  • 00:27:03
    actually going to be busy making these
  • 00:27:05
    joints now remember we said that joints
  • 00:27:08
    are expensive so this system is really
  • 00:27:11
    not going to optimize it for you so what
  • 00:27:14
    do you have to do instead is you convert
  • 00:27:16
    snowflake schema why is it called
  • 00:27:18
    snowflake is because it looks like a
  • 00:27:19
    snowflake because each of these have new
  • 00:27:22
    tentacles which are connected to
  • 00:27:23
    different dimensions and you convert
  • 00:27:25
    this into a star schema so star schema
  • 00:27:28
    is basically the primary point
  • 00:27:31
    the genesis of where data warehouse
  • 00:27:32
    systems start and this is what you need
  • 00:27:35
    as a bare minimum to start doing your
  • 00:27:37
    analytical workloads now this is where
  • 00:27:40
    you have a common fact and dimensions
  • 00:27:41
    have been merged so like this dimension
  • 00:27:45
    if it was a car then it would have all
  • 00:27:47
    the attributes of tire gearbox etc all
  • 00:27:49
    flattened out into one single dimension
  • 00:27:52
    so it's not the true dimension it's a
  • 00:27:53
    transformed dimension and the reason why
  • 00:27:56
    this is done this way is because now you
  • 00:27:58
    have to do a lesser number of joins when
  • 00:28:00
    you have to make a query
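
A minimal sketch of a star-schema query (not from the talk), assuming pandas and invented tables: one central fact table and one flattened car dimension, so a single join answers the question.

```python
import pandas as pd

dim_car = pd.DataFrame({
    "car_id": [1, 2],
    "brand":  ["Maruti", "Honda"],
    "engine": ["1.2L", "1.5L"],
    "tyre":   ["MRF", "Bridgestone"],
})

fact_sales = pd.DataFrame({
    "sale_id": [10, 11, 12],
    "car_id":  [1, 2, 1],
    "amount":  [500000, 900000, 520000],
})

# One join from the fact to the flattened dimension answers the question;
# in a snowflake layout the same query would also join separate brand,
# engine and tyre tables.
report = fact_sales.merge(dim_car, on="car_id").groupby("brand")["amount"].sum()
print(report)
```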
  • 00:28:04
    it leads to a very interesting problem though which
  • 00:28:08
    I'll cover very soon but if you guys can
  • 00:28:12
    see it then just hold on to that problem
  • 00:28:14
    another thing is deduplication like when
  • 00:28:18
    data comes in and mostly these are
  • 00:28:21
    asynchronous systems which are sending
  • 00:28:23
    data from different sources that's the other
  • 00:28:27
    problem of the retail market itself
  • 00:28:28
    point-of-sale terminals
  • 00:28:30
    now these point-of-sale terminals are
  • 00:28:32
    remotely connected internet may or may
  • 00:28:34
    not be there so they eventually sync
  • 00:28:36
    with the original data warehouse now
  • 00:28:39
    they may re-sync like the application
  • 00:28:43
    developer who has written that code on
  • 00:28:44
    the other side and it's seeding the
  • 00:28:46
    data into the warehouse this periodic
  • 00:28:49
    sync in case of a network failure can't
  • 00:28:51
    happen over and over again now there is
  • 00:28:54
    no way to watermark your point where you
  • 00:28:57
    say that you know okay I have synced
  • 00:28:59
    till point X now you can't do this on
  • 00:29:02
    every single record of your transaction
  • 00:29:04
    so you would do it in some batches you
  • 00:29:05
    know like I'll send 10 records at a time
  • 00:29:07
    now what happens if the network fails
  • 00:29:09
    while those 10 are being sent so you
  • 00:29:12
    would invariably
  • 00:29:14
    regardless of no matter how good code
  • 00:29:16
    you write which is a fault-tolerant or
  • 00:29:19
    you know has all hysterics and all kinds
  • 00:29:22
    of things inbuilt into it you will run
  • 00:29:25
    into a problem where there will be some
  • 00:29:26
    duplicates which are sent these
  • 00:29:28
    duplicates have to be handled at the
  • 00:29:31
    ingestion time of the warehouse system
  • 00:29:34
    one of the standard ways to do this
  • 00:29:37
    is something called bloom filters and
  • 00:29:39
    cuckoo filters which is an alternate
  • 00:29:42
    implementation of bloom filters now how
  • 00:29:45
    basically they work is that given a
  • 00:29:48
    stream of data which is coming in you
  • 00:29:51
    keep an array of what has been seen one
  • 00:29:54
    interesting property of these filters is
  • 00:29:56
    that they can never tell you whether
  • 00:29:57
    something exists or not
  • 00:29:59
    but they can certainly tell you that
  • 00:30:01
    this is something not present inside it
  • 00:30:04
    so the nature of the question that you
  • 00:30:06
    ask you the filter sort of changes so
  • 00:30:08
    you can never ask it have you seen this
  • 00:30:10
    value earlier the bloom filter can only
  • 00:30:13
    tell you with the probability that I may
  • 00:30:15
    or may not have seen this but have you
  • 00:30:18
    never seen this value they can tell you
  • 00:30:20
    with a certainty that yes I have never
  • 00:30:22
    seen this value so in a sample
  • 00:30:25
    implementation like this like these are
  • 00:30:27
    the objects that has already seen inside
  • 00:30:28
    it does a multi hashing and saves it
  • 00:30:31
    across a fixed storage space where the
  • 00:30:35
    index locations can actually be
  • 00:30:37
    duplicated and when you get a new object
  • 00:30:39
    you see that have you not seen this and
  • 00:30:41
    if the bloom filter says that yes I have
  • 00:30:42
    not seen this only then you actually go
  • 00:30:44
    forward and ingested into the warehouse
  • 00:30:46
    so this very effectively handles your
  • 00:30:51
    deduplications but one thing to understand
  • 00:30:53
    is deduplication most of these
  • 00:30:56
    things can only be done across a window
  • 00:30:58
    of time you cannot say that you
  • 00:31:00
    deduplicate over an entire year because if
  • 00:31:02
    your entire year had two terabytes of
  • 00:31:04
    data a new event comes in you can't search
  • 00:31:06
    over the entire terabytes or petabytes
  • 00:31:08
    of data so you will have to do it in a
  • 00:31:10
    finite window that you define let's say
  • 00:31:13
    find duplicates in the past say a
  • 00:31:17
    fortnight or a week or ten days I mean
  • 00:31:19
    that's a call that you have to take on
  • 00:31:21
    how your system is implemented
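
A toy bloom-filter sketch for ingest-time deduplication over a bounded window, along the lines described above; this is illustrative only (real systems would use a tuned library), and the record IDs are invented.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per "bit", kept simple on purpose

    def _positions(self, key):
        # derive several positions from salted hashes of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def maybe_seen(self, key):
        # False means "definitely never seen"; True only means "possibly seen"
        return all(self.bits[pos] for pos in self._positions(key))

seen = BloomFilter()
for record_id in ["txn-1", "txn-2", "txn-1"]:   # txn-1 gets retransmitted
    if not seen.maybe_seen(record_id):
        seen.add(record_id)
        print("ingest", record_id)
    else:
        print("skip probable duplicate", record_id)
```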
  • 00:31:25
    now if you come back to the star schema where
  • 00:31:27
    we said that you know in a snowflake
  • 00:31:30
    dimensions are actually merged into a
  • 00:31:33
    common dimension while ingesting into
  • 00:31:34
    the DB now there is a probability that a
  • 00:31:38
    dimension has changed over time now this
  • 00:31:41
    is one of the most common and the most
  • 00:31:43
    challenging problem that a warehouse
  • 00:31:45
    system can face primarily because you
  • 00:31:49
    dump data today you do analysis day
  • 00:31:53
    after tomorrow now when you look back at
  • 00:31:56
    the data and you want to do an analysis
  • 00:31:59
    and you do the cross joints across
  • 00:32:03
    tables and if these tables are only
  • 00:32:05
    carrying today's value you can never do
  • 00:32:08
    a point-in-time check so if I have to
  • 00:32:10
    find let's say certain goods let's say
  • 00:32:16
    that the same example again I'm trying
  • 00:32:18
    to find the most profitable goods which
  • 00:32:20
    have the most profitable shelf life
  • 00:32:22
    now this shelf life is dependent on a
  • 00:32:24
    lot of things on transportation of the
  • 00:32:25
    goods as well
  • 00:32:26
    transportation is dependent on oil
  • 00:32:28
    prices oil prices change almost every
  • 00:32:31
    day well not until recently they were
  • 00:32:33
    fixed but they change like globally they
  • 00:32:35
    change every day so when you want to do
  • 00:32:38
    that analysis you want historical data
  • 00:32:41
    that for that particular fact I wanna do
  • 00:32:44
    a join on the dimension value that
  • 00:32:47
    existed back then so what you are trying
  • 00:32:51
    to track is a dimensions journey through
  • 00:32:54
    time as well that dimension value looked
  • 00:32:56
    like this back then like a listing a
  • 00:32:59
    there an example of the car thing itself
  • 00:33:01
    like a tire used to cost 10 rupees back
  • 00:33:04
    then but today if it cost thousand
  • 00:33:05
    rupees or 10,000 rupees when you do an
  • 00:33:07
    analysis today of ten years of data you
  • 00:33:10
    only want to look at that 10 rupees not
  • 00:33:13
    ten thousand rupees
  • 00:33:14
    as it exists today so the way to
  • 00:33:17
    implement this this problem is called
  • 00:33:19
    slowly changing dimensions or the acronym
  • 00:33:21
    SCD they have multiple ways to solve
  • 00:33:25
    this one is as always you ignore it and
  • 00:33:28
    you do not solve it so like you only get
  • 00:33:31
    today's data no I mean it's actually
  • 00:33:33
    theoretically legit as well they have
  • 00:33:35
    called it they call it type 0 which is
  • 00:33:36
    ignoring the entire thing that it
  • 00:33:39
    doesn't change
  • 00:33:41
    then there are type ones where you say
  • 00:33:42
    that I can overwrite the values or not
  • 00:33:46
    change the value at all I'll keep it as
  • 00:33:47
    it is and I'll never change it or you
  • 00:33:49
    can add certain time filters next to it
  • 00:33:52
    where you say that start time of this
  • 00:33:55
    value was this and end time was this and
  • 00:33:57
    next time when you're making a join you
  • 00:33:59
    can see when the fact had arrived and do
  • 00:34:01
    a join on anything which is less than the end
  • 00:34:04
    time and greater than start time of the
  • 00:34:05
    value yeah this is fairly explained or
  • 00:34:09
    very well across the Wikipedia article
  • 00:34:11
    as well usually I find Wikipedia too
  • 00:34:14
    overwhelming and not explaining at all
  • 00:34:15
    but for one this rare exception they
  • 00:34:17
    have done a good job so this maybe
  • 00:34:20
    you can just look this up they'll really
  • 00:34:22
    thoroughly explain to you what the
  • 00:34:23
    solutions are as well
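
A small sketch of the start-time/end-time (SCD Type 2) approach just described, assuming pandas and invented tyre prices: the fact is joined to the dimension row whose validity window contains the fact's timestamp.

```python
import pandas as pd

dim_tyre_price = pd.DataFrame({
    "tyre_id":    [1, 1],
    "price":      [10, 10000],
    "valid_from": pd.to_datetime(["2007-01-01", "2016-01-01"]),
    "valid_to":   pd.to_datetime(["2015-12-31", "2099-12-31"]),
})

fact_sales = pd.DataFrame({
    "tyre_id": [1, 1],
    "sold_at": pd.to_datetime(["2010-06-01", "2017-03-01"]),
})

# Point-in-time join: keep only the dimension version whose validity
# window contains the fact's timestamp.
joined = fact_sales.merge(dim_tyre_price, on="tyre_id")
joined = joined[(joined["sold_at"] >= joined["valid_from"]) &
                (joined["sold_at"] <= joined["valid_to"])]
print(joined[["sold_at", "price"]])   # the 2010 sale sees 10, the 2017 sale sees 10000
```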
  • 00:34:27
    batching versus streaming now this is one of the most
  • 00:34:29
    common questions that I get asked and
  • 00:34:31
    probably because as I promised this talk
  • 00:34:34
    is about buzzwords so I had to add this
  • 00:34:36
    in now batch versus stream these are two
  • 00:34:39
    inherent different kinds of problem and
  • 00:34:42
    different kind of datasets now what you
  • 00:34:46
    can do against the streaming data is
  • 00:34:48
    what you what's that buzzer
  • 00:34:57
    Oh am I over time okay are all
  • 00:35:06
    projectors down can you guys see over
  • 00:35:08
    there
  • 00:35:08
    you guys can okay so should I continue
  • 00:35:10
    what about the guys in the front or you
  • 00:35:11
    don't care so back to batching versus
  • 00:35:19
    streaming like one of the most important
  • 00:35:22
    questions that I get asked and probably
  • 00:35:23
    you will face as well when you try to go
  • 00:35:25
    about starting on your Big Data projects
  • 00:35:28
    or trying to implement what tool and
  • 00:35:30
    what framework is good now inherently
  • 00:35:32
    there are two different kind of problems
  • 00:35:33
    both streaming and batching batching is
  • 00:35:37
    done on huge loads of data and always
  • 00:35:41
    done usually in retrospect you can use
  • 00:35:44
    batch analysis to emit something like
  • 00:35:50
    cuboids of data I'm going to cover that
  • 00:35:52
    very soon or pre computations which are
  • 00:35:56
    next going to assist your streaming like
  • 00:35:59
    now to put this into perspective let's
  • 00:36:01
    say a real-time transaction is happening
  • 00:36:04
    in your banking system you want to check
  • 00:36:06
    whether this is fraud or not what other
  • 00:36:09
    easy ways to check fraud like one easy
  • 00:36:11
    way is by looking at all historical data
  • 00:36:14
    you can take quick patterns on that here
  • 00:36:18
    when transactions happen after 11:00
  • 00:36:21
    p.m. from this IP address they can be
  • 00:36:24
    fraudulent now you can use this as
  • 00:36:26
    retrospective analysis feed this as a
  • 00:36:29
    quick filter to your stream which is
  • 00:36:31
    coming in and do a quick analysis
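
A toy sketch of batch output assisting a stream, along the lines of the fraud example above; the rule, IP prefixes and events are invented for illustration.

```python
from datetime import datetime

# Pretend this came out of a nightly batch job over historical data:
# "transactions after 23:00 from these networks look fraudulent".
batch_rules = {"suspicious_prefixes": ("203.0.113.",), "late_hour": 23}

def looks_fraudulent(event):
    # cheap per-event check that is fast enough for a real-time stream
    late = datetime.fromisoformat(event["time"]).hour >= batch_rules["late_hour"]
    bad_net = event["ip"].startswith(batch_rules["suspicious_prefixes"])
    return late and bad_net

stream = [
    {"id": 1, "time": "2017-07-01T23:30:00", "ip": "203.0.113.7"},
    {"id": 2, "time": "2017-07-01T10:05:00", "ip": "198.51.100.2"},
]
for event in stream:
    print(event["id"], "flag" if looks_fraudulent(event) else "ok")
```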
  • 00:36:33
    so these are two absolutely disconnected
  • 00:36:36
    processing events and but still the
  • 00:36:38
    question comes as one versus the other and
  • 00:36:40
    so we must be able to separate out that
  • 00:36:43
    while we are doing streaming you cannot
  • 00:36:45
    go check the entire earth for is this a
  • 00:36:47
    valid transaction or not well you can if
  • 00:36:51
    there is no one waiting for the
  • 00:36:52
    transaction on the other side but for an
  • 00:36:55
    IOT device or a real-time system where
  • 00:36:57
    events are happening like for let's say
  • 00:37:00
    a flight tracker
  • 00:37:03
    this happened in Nice as well when the
  • 00:37:06
    bus incident happened now when
  • 00:37:10
    unfortunately the
  • 00:37:11
    bus had rammed into a bunch of
  • 00:37:13
    pedestrians the alarm that went off was
  • 00:37:16
    actually three hours later and people
  • 00:37:18
    thought that was a separate attack
  • 00:37:19
    altogether so this is where you have to
  • 00:37:23
    understand that streaming has to happen
  • 00:37:24
    faster all analysis have to happen
  • 00:37:26
    faster in a way where you might take
  • 00:37:29
    assistance from any of the other
  • 00:37:31
    analysis that you have proceeded by
  • 00:37:33
    doing any batch analysis earlier out of
  • 00:37:37
    order processing now if you have a
  • 00:37:41
    asynchronous system where you are
  • 00:37:44
    sending events it is quite tough to sort
  • 00:37:47
    of guarantee the order in which it
  • 00:37:50
    arrives like one event might arrive
  • 00:37:53
    after the first one where it was
  • 00:37:54
    supposed to be in order of T1, T2, T3
  • 00:37:56
    it might arrive as T2, T3, and T1 now
  • 00:37:59
    imagine if you're generating reports and
  • 00:38:02
    you are computing these nice pretty
  • 00:38:04
    graphs where you're doing some analysis
  • 00:38:07
    and probably some thresholds as well
  • 00:38:08
    actionable alerts on that and after all
  • 00:38:12
    that is done an hour later an event
  • 00:38:14
    which was supposed to arrive an
  • 00:38:16
    hour back arrives you will have to go
  • 00:38:19
    through that entire chain of pipelining
  • 00:38:21
    again you know like transform
  • 00:38:22
    deduplicate recompute the entire thing
  • 00:38:25
    so out-of-order processing is something
  • 00:38:28
    that is very crucial to your warehouse
  • 00:38:30
    and there are n number of ways to solve
  • 00:38:34
    this but as a challenge it's worth
  • 00:38:36
    putting this out on the table because
  • 00:38:38
    there's no like exact way you know do
  • 00:38:41
    this and this will solve your problem
  • 00:38:42
    but one thing that you can apply here is
  • 00:38:45
    by taking a window of arrival by
  • 00:38:49
    deferring your compute slightly and say
  • 00:38:52
    that I'm going to only process after
  • 00:38:54
    five minutes of buffer this is sort of
  • 00:38:56
    where buffers and buffering etc will
  • 00:38:58
    also start coming into the picture that
  • 00:39:00
    I'm going to stack these up for a window
  • 00:39:01
    of five minutes and every five minutes
  • 00:39:05
    I'll do an analysis of what comes in in
  • 00:39:07
    what order and whatever might arrive
  • 00:39:10
    after this I am not going to process
  • 00:39:11
    at all then that could be one very quick
  • 00:39:13
    way of solving this.
  • 00:39:17
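A minimal sketch of that buffered-window idea, assuming each event carries an event timestamp and that anything arriving after its window has already been flushed is simply dropped. This is illustrative plain Python, not any particular framework's API.

```python
import heapq
import time

WINDOW_SECONDS = 300  # illustrative: the five-minute buffer described above

class ArrivalWindow:
    """Buffer events for a fixed period, emit them in event-time order,
    and drop anything arriving after its slot has already been flushed."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.heap = []                  # min-heap of (event_time, seq, payload)
        self.seq = 0                    # tie-breaker so payloads are never compared
        self.watermark = float("-inf")  # latest event time already flushed past

    def add(self, event_time, payload):
        if event_time <= self.watermark:
            return False                # too late: its window was already processed
        heapq.heappush(self.heap, (event_time, self.seq, payload))
        self.seq += 1
        return True

    def flush(self, now=None):
        """Emit, in event-time order, everything older than now - window."""
        now = time.time() if now is None else now
        out = []
        while self.heap and self.heap[0][0] <= now - self.window:
            event_time, _, payload = heapq.heappop(self.heap)
            self.watermark = max(self.watermark, event_time)
            out.append(payload)
        return out
```

Calling `flush` on a five-minute timer gives the behaviour described: events inside the window are processed in order, and stragglers that miss their window are dropped rather than forcing the whole pipeline to recompute.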
    while we are talking about data warehouses data cubes are a
  • 00:39:19
    very important aspect of it I will
  • 00:39:24
    very quickly cover cubes and data
  • 00:39:26
    marts as well
  • 00:39:28
    a data mart is actually nothing but just a
  • 00:39:30
    siloed view of a warehouse which is
  • 00:39:32
    given to teams with only read-only access
  • 00:39:34
    the warehouse is mostly read-only anyway but
  • 00:39:36
    this is a further narrowed down version
  • 00:39:38
    of it let's say your entire warehouse consists
  • 00:39:41
    of inventory sales salaries profit loss
  • 00:39:48
    whatever everything is inside that
  • 00:39:49
    system to the finance department you
  • 00:39:54
    would rather only expose the salary view
  • 00:39:56
    which is one data mart for them you
  • 00:39:58
    might have another view which is a very
  • 00:40:02
    siloed view of let's say inventory
  • 00:40:04
    management to the procurement team which
  • 00:40:07
    will have and that silo will be called a data mart.
  • 00:40:09
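As a toy illustration of that siloing, here is a sketch using Python's built-in sqlite3. The table, columns, and the `finance_salary_mart` view are invented names standing in for the narrowed, read-only slice handed to the finance team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- toy warehouse table holding everything the business records
    CREATE TABLE employee_facts (
        emp_id INTEGER, department TEXT, salary REAL, performance_note TEXT
    );
    INSERT INTO employee_facts VALUES
        (1, 'engineering', 90000, 'confidential'),
        (2, 'sales',       70000, 'confidential');

    -- the "salary mart": a narrowed, read-only view exposed to finance
    CREATE VIEW finance_salary_mart AS
        SELECT emp_id, department, salary FROM employee_facts;
""")

# finance only ever queries the view, never the underlying warehouse table
for row in conn.execute("SELECT * FROM finance_salary_mart"):
    print(row)
```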
    cubes are just a view of your
  • 00:40:13
    data which could be multi-dimensional as
  • 00:40:16
    well why cubes are done in the first place is
  • 00:40:20
    for efficiency of retrieval
  • 00:40:22
    because warehouses as they are massive
  • 00:40:25
    they also suffer the same problem that
  • 00:40:27
    you cannot ask a query quickly like it
  • 00:40:29
    has a certain threshold you can optimize
  • 00:40:31
    that but that's going to cost you a lot
  • 00:40:33
    to have a single-digit
  • 00:40:35
    millisecond warehouse is almost next
  • 00:40:38
    to impossible but if you want something
  • 00:40:39
    like that you want to build a real-time
  • 00:40:41
    analytics graph out of it or a stream
  • 00:40:43
    out of it you will have to build cubes
  • 00:40:45
    cubes are pre-computed they could be
  • 00:40:48
    computed on-demand as well in terms of
  • 00:40:50
    SQL databases think of them as either
  • 00:40:51
    views or materialized views and these
  • 00:40:54
    are snapshots that you create which get
  • 00:40:57
    pre-computed and pre-populated now let's
  • 00:40:59
    take a sample example of item location
  • 00:41:02
    day and quantity this table if you're
  • 00:41:06
    familiar can serve a lot of queries like
  • 00:41:09
    on this day how many items were sold for
  • 00:41:12
    this location or how many times is this
  • 00:41:15
    item sold on that day so it's sort of a
  • 00:41:18
    multi-dimensional query view for you
  • 00:41:19
    effectively just a table of SQL but if
  • 00:41:23
    you have to visualize it you can
  • 00:41:24
    visualize it like this like if this
  • 00:41:26
    was the day axis this is the location
  • 00:41:28
    axis this is the item axis then each
  • 00:41:30
    point here is actually the quantity
  • 00:41:33
    which is being sold so you can actually
  • 00:41:34
    get faster results using this the standard
  • 00:41:38
    operations that you use on this are called
  • 00:41:39
    slice, dice, and rotate.
  • 00:41:43
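As a rough illustration, here is a pure-Python sketch that pre-computes such a cube from item/location/day/quantity facts and then slices and dices it. The rows and helper names are invented for the example; a real system would do this inside the warehouse or a cube store rather than in memory.

```python
from collections import defaultdict

# Illustrative fact rows: (item, location, day, quantity), as in the example table.
facts = [
    ("pen",  "pune",   "2017-06-01", 10),
    ("pen",  "mumbai", "2017-06-01",  4),
    ("book", "pune",   "2017-06-02",  7),
    ("pen",  "pune",   "2017-06-02",  3),
]

# Pre-compute the cube: every cell is keyed by all three dimensions.
cube = defaultdict(int)
for item, location, day, qty in facts:
    cube[(item, location, day)] += qty

def slice_by_day(day):
    """Slice: fix one dimension (day) and look at the remaining two."""
    return {(i, l): q for (i, l, d), q in cube.items() if d == day}

def dice(items, locations):
    """Dice: restrict several dimensions to subsets of their values."""
    return {k: q for k, q in cube.items() if k[0] in items and k[1] in locations}

print(slice_by_day("2017-06-01"))   # sales per item/location on that day
print(dice({"pen"}, {"pune"}))      # pen sales in pune across all days
```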
    if you revisit the architecture very quickly this is one
  • 00:41:45
    sample of the implementation that we had
  • 00:41:48
    done so you have an event queue coming in
  • 00:41:52
    there are workers these workers are
  • 00:41:55
    responsible for sanitization of data
  • 00:41:58
    star schema conversion deduplication out
  • 00:42:02
    of order processing and sorting.
  • 00:42:04
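As a rough sketch of what one such worker stage might look like, assuming simple dict-shaped events; the field names are invented, and the in-memory set used for deduplication is only a stand-in (a bloom filter or an external store would be more realistic at scale).

```python
import hashlib

seen = set()  # dedup ledger; a real pipeline might use a bloom filter instead

def sanitize(event):
    """Drop null fields and normalise everything to trimmed strings."""
    return {k: str(v).strip() for k, v in event.items() if v is not None}

def to_star_schema(event):
    """Reduce the raw event to a fact row keyed by dimension values."""
    return {
        "item_id": event.get("item", "unknown"),
        "location_id": event.get("location", "unknown"),
        "day": event.get("timestamp", "")[:10],
        "quantity": int(event.get("qty", 0)),
    }

def deduplicate(fact):
    """Return True the first time we see this fact, False on replays."""
    digest = hashlib.sha1(repr(sorted(fact.items())).encode()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

def worker(raw_event):
    fact = to_star_schema(sanitize(raw_event))
    if deduplicate(fact):
        return fact          # next step: append it to the cheap blob storage
    return None
```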
    then you take that data and very
  • 00:42:06
    quickly save that into a cheap storage
  • 00:42:08
    which is done for effective archival
  • 00:42:11
    which could be something like a blob
  • 00:42:14
    storage like S3 which can be mounted as an
  • 00:42:17
    HDFS system basically this acts as a
  • 00:42:20
    file system now try to map this to the
  • 00:42:22
    same anatomy this is the file system
  • 00:42:24
    layer next you want a compute
  • 00:42:26
    framework which can actually access
  • 00:42:28
    this and do your queries and
  • 00:42:30
    analysis etcetera you can use something
  • 00:42:33
    like Spark here or Apache Flink or Mahout
  • 00:42:37
    or well you could use a hosted EMR and
  • 00:42:40
    you will find tons of other solutions
  • 00:42:42
    out there as well now this is where the
  • 00:42:45
    compute framework will do its job
  • 00:42:47
    generate transformations for you create
  • 00:42:50
    time tables and finally output it into a
  • 00:42:52
    cube system one such cube system which
  • 00:42:54
    is very popular for data analysis is called
  • 00:42:57
    Druid Apache Druid you can use that
  • 00:43:00
    as well and Druid now has this
  • 00:43:02
    capability of serving very fast cube views
  • 00:43:05
    for analysis purposes
  • 00:43:07
    and if one aspect of the consumption of
  • 00:43:09
    warehouse was data visualization you can
  • 00:43:11
    use a tool like Tableau to analyze your
  • 00:43:14
    data that would be it for my talk if I
  • 00:43:18
    am well in time
  • 00:43:20
    [Applause]
  • 00:43:30
    we'll take about two questions and if
  • 00:43:34
    anybody has any more please reach out to
  • 00:43:36
    Piyush we're gonna have a coffee
  • 00:43:37
    break in some time anyway
  • 00:43:39
    questions anyone I see a question there
  • 00:43:46
    Piyush hello just a quick question do
  • 00:43:50
    you have any observations there's stuff
  • 00:43:53
    on the kind of projects we're doing on
  • 00:43:56
    the distinction between Spark Streaming
  • 00:43:58
    and Flink so any information yes
  • 00:44:02
    I'll probably take one step back to
  • 00:44:05
    address your question and I hope I'll be
  • 00:44:07
    able to answer it and that is how
  • 00:44:10
    streaming effectively is underneath a
  • 00:44:14
    window operation like you can't apply
  • 00:44:17
    something on an infinite stream you have
  • 00:44:19
    to yeah so stream is actually a window
  • 00:44:28
    operation so you define a small window
  • 00:44:31
    and you say that I'll take that now this
  • 00:44:33
    window could be bound by two things
  • 00:44:36
    either size of the window or time of the
  • 00:44:38
    window as well and you say that you know
  • 00:44:41
    for these five minutes and five minutes
  • 00:44:43
    is actually too long a time for streaming
  • 00:44:45
    you probably will bring it down to five
  • 00:44:47
    seconds or sometimes 500 milliseconds as
  • 00:44:49
    well depending on the rate of ingestion
  • 00:44:50
    and this window operation is what will
  • 00:44:53
    define what is going to be processed as a batch.
  • 00:44:56
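A small sketch of that size-or-time window in plain Python; it only mimics the micro-batching idea and is not Spark's or Flink's actual API. The limits are checked as each event arrives.

```python
import time

def micro_batches(source, max_size=1000, max_wait=0.5):
    """Group an unbounded event source into small batches, closing each
    window once it holds max_size events or max_wait seconds have passed
    (checked as each event arrives)."""
    batch, started = [], time.monotonic()
    for event in source:
        batch.append(event)
        if len(batch) >= max_size or time.monotonic() - started >= max_wait:
            yield batch
            batch, started = [], time.monotonic()
    if batch:
        yield batch  # flush whatever is left when the source ends

# Usage: each emitted list can be handed to existing batch-processing code.
for b in micro_batches(iter(range(10)), max_size=4):
    print(b)
```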
    now Spark Streaming underneath
  • 00:44:59
    is micro-batching so they use the exact
  • 00:45:02
    same batching code that they wrote for
  • 00:45:05
    earlier when there used to be a batch
  • 00:45:07
    processing system they have taken that
  • 00:45:09
    to Spark Streaming now whereas Flink is
  • 00:45:13
    actually streaming-first they don't reuse
  • 00:45:14
    the same code so what that means is that
  • 00:45:17
    Spark Streaming, and I'm
  • 00:45:20
    actually answering this for versions before
  • 00:45:21
    the newest release so the newest release
  • 00:45:24
    doesn't have this problem they've
  • 00:45:26
    revamped a lot of code, had a problem of
  • 00:45:28
    back pressure because if your stream is
  • 00:45:30
    not processing fast enough you are
  • 00:45:33
    actually going to fall behind and then
  • 00:45:35
    again you'll have stream overflow so
  • 00:45:37
    Spark
  • 00:45:38
    used to have that problem whereas Flink
  • 00:45:40
    was done as streaming-first so for
  • 00:45:42
    Flink it's the other way around batch is
  • 00:45:44
    stream-first
  • 00:45:45
    so you basically pick up a stream and
  • 00:45:48
    then you batch it for an operation of
  • 00:45:50
    yours so that is probably the difference
  • 00:45:53
    that you're looking for at the crux
  • 00:45:55
    of it of course there's an API
  • 00:45:56
    difference like you can do
  • 00:45:57
    transformations on a stream faster in
  • 00:45:59
    Flink any other questions
  • 00:46:10
    okay thank you thank you
  • 00:46:13
    [Applause]
Tags
  • Big Data
  • Data Processing
  • OLTP
  • OLAP
  • Data Warehousing
  • Streaming
  • Batch Processing
  • Data Architecture
  • Descriptive Analysis
  • Predictive Analysis