00:00:13
Preface: Content of this lecture: In this lecture,
we will discuss the framework of Spark,
00:00:22
resilient distributed datasets, and also
the Spark execution.
00:00:28
First, the need for Spark.
00:00:30
Apache Spark is a big data analytics framework
that was originally developed at the University
00:00:37
of California, Berkeley, in the AMPLab, in 2012.
00:00:41
Since then, it has gained a lot of traction,
both in academia and in industry.
00:00:50
It is another system for big data analytics.
00:00:54
Now, before Spark, MapReduce was
already in use.
00:01:01
Now, why is MapReduce not good enough?
00:01:07
That is what we are going to explore, to understand
the need for the Spark system. Before that,
00:01:14
we have to understand that MapReduce simplifies
batch processing on large commodity
00:01:20
clusters.
00:01:21
So, in that process, the data file or dataset
which was the input, read through
00:01:36
the HDFS system, is divided into different splits, and
00:01:51
then the map function is applied to each of these
splits, which in turn generates
00:01:59
the intermediate values; these are shuffled,
stored in HDFS, and then
00:02:10
passed on to the reduce function, giving
the output. So this is the scenario of the
00:02:17
MapReduce execution framework: as you have
seen, the dataset is stored in the HDFS
00:02:34
file system.
00:02:36
Hence, this particular computation is done
as a batch processing job.
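To make the data flow just described concrete, here is a minimal local Scala sketch (plain Scala collections, not Hadoop code) that imitates the three phases of the word count; the sample lines are made up for illustration.

// Simulate MapReduce word count on a local collection:
// map phase, shuffle (group by key), then reduce phase.
val lines = Seq("a rose is a rose", "is a rose")

// Map phase: split each line and emit (word, 1) pairs.
val mapped = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

// Shuffle phase: group all pairs that share the same key (word).
val shuffled = mapped.groupBy(_._1)

// Reduce phase: sum the counts for each word.
val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

reduced.foreach(println)   // prints (a,3), (rose,3), (is,2) in some order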
00:02:49
So we have seen that this batch
processing was done using the MapReduce
00:02:56
system, which was in use earlier, before
the development of Spark.
00:03:02
Now let us see why this particular model
of MapReduce is not good enough.
00:03:09
Now, as you have seen in the previous slide,
the output of the map function will
00:03:17
be stored in HDFS, and this will ensure
fault tolerance.
00:03:42
So the output of the map function was stored
into the HDFS file system, and this ensures
00:03:53
fault tolerance.
00:03:55
So if there is a failure in between, since
the data is stored in the HDFS file system,
00:04:00
it is not lost and is completely saved;
that is why
00:04:04
it ensures fault tolerance, and because
of that fault tolerance, the intermediate
00:04:10
values or intermediate results of the
map function were stored into HDFS.
00:04:26
Now, these intermediate results
will be further fed into the reduce
00:04:37
function.
00:04:49
So that was the typical MapReduce execution
pipeline.
00:04:58
Now, what is the issue?
00:05:01
The problem mentioned over here is that
the major issue is performance.
00:05:15
So that means the performance is degraded,
and it becomes expensive to save to the
00:05:26
disk for this particular activity of fault
tolerance; hence this MapReduce
00:05:33
framework becomes quite slow for some
applications which are interactive and required
00:05:40
to be processed in real time, hence this
particular framework is not very useful
00:05:46
for certain applications.
00:05:49
Therefore, let us summarize: MapReduce
can be expensive for some applications,
00:05:57
for example interactive and iterative applications.
00:06:01
They are very expensive, and also this particular
framework, that is the MapReduce framework,
00:06:08
lacks efficient data sharing across the map
and reduce phases of operation, or across iterations.
00:06:16
It lacks efficient data sharing in the sense that
the data in the intermediate form, produced by the
00:06:22
map phase, is stored in the HDFS file system.
00:06:29
Hence, it is not shared in an efficient manner.
00:06:33
Hence, sharing the data across the
map and reduce phases has to go through the
00:06:40
disk, which is a slow operation, not an
efficient operation.
00:06:44
Hence the statement which says that MapReduce
lacks efficient data sharing
00:06:53
in the system. So, due to these drawbacks,
several specialized frameworks did
00:07:00
evolve for different programming models;
for example, the Pregel system was there for
00:07:09
graph processing, using bulk synchronous processing,
BSP, and then there is another framework,
00:07:25
which is called 'Iterative MapReduce',
that was also made, and these allowed different
00:07:32
specialized frameworks for
different applications to be supported.
00:07:38
So the drawbacks of MapReduce have given
way to several frameworks, and they involved
00:07:46
different programming models. The bulk synchronous
parallel processing
00:07:52
framework is about a synchronization
barrier: the processes, running at different speeds,
00:08:03
join at the synchronization barrier
and then proceed further on, so
00:08:09
this is called the 'BSP model'. This BSP
model allows fast processing for the
00:08:20
typical graph application, so graph processing
is done using BSP, to make it faster.
00:08:28
This is because MapReduce does not support
graph processing within it.
00:08:32
A graph processing framework requires that
data is taken up by the neighboring
00:08:43
nodes; basically these neighboring
nodes will collect the data, will gather the
00:08:49
data, then perform the computation, and
then again scatter this data to the other
00:08:57
nodes.
00:08:58
So this particular sequence of communication,
computation and then communication is to
00:09:09
be completed as one step, that is, in lockstep,
and hence bulk synchronous processing
00:09:16
comes into effect, where all three
operations, all three actions, are to be performed,
00:09:24
and only then will the next step take place,
using bulk synchronous processing.
00:09:30
Hence this kind of paradigm, which is called the
'Bulk Synchronous Processing framework', is
00:09:36
useful for graph processing. We will
see how these different programming
00:09:45
paradigms, such as bulk synchronous processing
and iterative MapReduce for iterative
00:09:54
applications (the iterations
of MapReduce can be seen, for example,
00:10:02
in machine learning algorithms), came about.
00:10:05
So these different frameworks evolved
out of these MapReduce drawbacks, and they
00:10:13
existed over a period of time, to fill up this particular
00:10:19
gap.
00:10:21
Now, seeing all these scenarios, and to provide
data sharing and to support applications
00:10:29
such as interactive and iterative applications,
there was a need for the Spark system.
00:10:37
So the solution which Spark has given
is in the form of an abstract data type, which
00:10:42
is called the 'Resilient Distributed Dataset'.
00:10:54
So Spark provides a new way of supporting
this abstract data type, which is called the resilient
00:11:01
distributed dataset, or in short, RDD.
00:11:05
Now, we will see in this part of the discussion
how these RDDs are going to solve the drawbacks
00:11:13
of the batch-processing MapReduce, provide
more efficient data sharing, and also
00:11:19
support iterative and interactive
applications.
00:11:33
Now, these RDDs, resilient distributed
datasets, are immutable; immutable means
00:11:44
we cannot change them.
00:11:59
An RDD cannot be changed, so that means once an
RDD is formed, it will be immutable.
00:12:04
Now, this immutable
resilient distributed dataset can be partitioned
00:12:11
in various ways across different cluster
nodes.
00:12:16
So it is a partitioned collection of records; partitioning
can happen, for example, if this is a
00:12:22
cluster of machines which are connected
with each other.
00:12:27
So the RDDs can be partitioned and stored
at different places, in different segments.
00:12:33
Hence, an immutable partitioned collection
of records is possible, and in this particular
00:12:40
scenario, that is what is called an 'RDD'.
00:12:42
Now, another thing is that once an RDD is formed,
it will be built,
00:12:51
RDDs will be built, through coarse-grained
transformations, such as map, join, and so
00:12:56
on.
00:12:57
Now, these RDDs can be cached for efficient
reuse, and we are going to see that a lot
00:13:04
of operations can be performed on them.
So again, let us summarize that Spark
00:13:12
has given a solution in the form of an abstract
data type, which is called an 'RDD',
00:13:27
and an RDD can be built using transformations
and can also be changed into
00:13:46
another form; that is, an RDD can become another
RDD by applying various transformations, such
00:13:53
as map, join, and so on.
00:13:56
That we will see in due course.
These RDDs are an immutable, partitioned collection
00:14:04
of records.
00:14:13
This means that once an RDD is formed, it is
immutable, meaning we cannot change it;
00:14:19
this entire collection of records can be
stored in a convenient manner onto the
00:14:27
cluster system.
00:14:29
Hence, this is an immutable partitioned collection
of records, and these RDDs can
00:14:39
be cached in memory, for efficient reuse.
00:14:46
So, as we have seen, MapReduce lacks
this data sharing, and now, using RDDs,
00:14:54
Spark will provide efficient sharing
of data in the form of these RDDs.
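A minimal Scala sketch of these ideas, assuming a local Spark installation; the application name, master URL, and data are placeholders chosen only for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder configuration: "local[*]" just runs Spark locally.
val conf = new SparkConf().setAppName("RDDBasics").setMaster("local[*]")
val sc   = new SparkContext(conf)

// An RDD is an immutable, partitioned collection of records;
// here it is split into 4 partitions across the (local) cluster.
val numbers = sc.parallelize(1 to 10, numSlices = 4)
println(numbers.getNumPartitions)          // 4

// Coarse-grained transformations such as map do not modify `numbers`;
// they build a new RDD from it, and it can be cached for reuse.
val squares = numbers.map(n => n * n).cache()
println(squares.collect().mkString(", "))  // 1, 4, 9, ..., 100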
00:15:04
Now let us see this through an example of word
count.
00:15:06
We will use the word count example to understand the need
for Spark.
00:15:19
So in the word count application, the dataset
is stored in the HDFS file system
00:15:29
and is read; after reading this particular
dataset from the file system, this particular
00:15:36
reading operation will build the RDDs,
which is shown here in block number
00:15:45
1.
00:15:46
So that means once the data is read, it will
build the RDDs from the word-count dataset.
00:16:06
Once these RDDs are built, they can be
stored at different places, so it is an immutable
00:16:11
partitioned collection, and now we can
perform various operations.
00:16:17
So we know that on these RDDs we can perform
various transformations, and the first transformation
00:16:23
which we are going to perform on these RDDs
is the one called the 'map function'.
00:16:28
The map function for the word count program
is applied on the different RDDs, which
00:16:36
are stored.
00:16:38
So after applying the map function, this particular
output will be stored in memory, and then
00:16:44
the reduce function will be applied
on these RDDs, which are the output,
00:16:49
or rather the transformed RDDs,
00:16:51
the RDDs transformed by the
map function.
00:16:57
On these, the reduce function will be applied.
00:16:59
And the result of this reduce
will remain in the cache,
00:17:05
so that it can be used by different applications.
So you can see this particular transformation,
00:17:14
which is changing the RDDs from one form
to another: after reading
00:17:23
from the file, it becomes an RDD; after
applying the map function, it changes
00:17:30
to another RDD; and after
applying the reduce function,
00:17:36
it changes to another form, and the final
output will remain in the cache memory.
00:17:45
The output will remain in the cache, so that
whatever application requires this output
00:17:51
can use it. So this particular pipeline,
which we have shown, is quite easy to understand
00:17:58
and is convenient to manage, to partition, and to
store as a partitioned collection
00:18:05
on this cluster computing system.
00:18:09
So we have seen that these RDDs have simplified
this particular task and have also made this
00:18:19
operation efficient.
00:18:21
Hence, an RDD is an immutable, partitioned collection
of records.
00:18:25
And they are built through the coarse-grained
transformations that we have seen in the previous
00:18:30
example.
00:18:31
Now, another question: MapReduce
was storing the intermediate results of the
00:18:39
map, before they are used in the reduce
function, into the HDFS file system, for ensuring
00:18:45
fault tolerance. Now, since
Spark is not using this intermediate
00:18:52
result storage through HDFS, but rather
in-memory storage,
00:19:00
how does Spark ensure fault tolerance?
That we have to understand now. The concept
00:19:06
which Spark uses for fault tolerance is called
'Lineage'.
00:19:33
So Spark uses lineage to achieve
fault tolerance, and that we have to understand
00:19:40
now. So what does Spark do?
00:19:43
It logs the coarse-grained operations which
are applied to the partitioned dataset.
00:19:50
Meaning to say that all the operations, like
reading a file, which becomes an RDD,
00:19:58
then making a transformation on an RDD
using the map function, and then again another
00:20:04
transformation of RDDs using the reduce function,
join function, and so on,
00:20:09
all these operations form the coarse-grained
operations, and they are logged
00:20:14
into a file before being applied.
00:20:18
So basically, if the data is
lost, or if the system or
00:20:25
the node crashes, Spark simply recomputes
the lost partitions whenever there is
00:20:30
a failure; if there is no failure, obviously
no extra cost is required in this process.
00:20:37
Let us see this through an example.
00:20:40
So again, let us explain that lineage is
in the form of the coarse-grained operations;
00:20:47
it is a log of the coarse-grained operations.
00:21:04
Now this particular lineage will keep a
record of all these coarse-grained
00:21:10
operations which are applied, and that will
be kept in a log file.
00:21:15
Now we will see how, using this lineage or
log file, fault tolerance can be achieved.
00:21:23
Let us see through this particular example:
00:21:27
let us take the word count example,
which we have seen in the last slide.
00:21:36
Now, for the same word count example, we have
to see that these RDDs
00:21:44
will keep track of the
graph of transformations that built them,
00:21:56
their lineage, to rebuild lost data.
00:22:00
So there will be a log of all the
coarse-grained operations which were performed
00:22:06
and which built these RDD transformations,
and this is called 'lineage'. Let us
00:22:14
see what happens: for example, an
RDD will be formed after the
00:22:21
read operation on the dataset, and then
on this particular data, this RDD, we have
00:22:29
performed the map operation, which transforms it into another
RDD, and this transformed RDD again is now
00:22:35
applied with the reduce function, to make
this particular RDD, and it is stored in the cache.
00:22:42
Now, consider that the node
which has stored this transformed RDD
00:22:48
has failed; obviously we have to trace back
to the previous RDD and consult this lineage,
00:22:57
which will tell us that this is the output of
the map function, this is the transformed
00:23:04
RDD, and when we apply the reduce
function on this RDD, it will recreate the same RDD which
00:23:11
was lost.
00:23:13
So let us see what is written here,
00:23:15
which is what we have just seen.
00:23:17
We simply recompute the lost partitions
whenever there is a failure: we
00:23:23
trace back and apply the same transformations
again on the RDD, and we can recompute
00:23:32
the RDD partition which was lost
due to the failure.
00:23:38
So now, using the lineage concept, we have seen
how fault tolerance is achieved in the
00:23:44
Spark system.
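Continuing the earlier Scala sketch (same SparkContext sc, and a hypothetical input path), the lineage that Spark records can actually be inspected with toDebugString:

// Word count pipeline; the HDFS path is a made-up placeholder.
val counts = sc.textFile("hdfs:///data/words.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .cache()

// Prints the chain of coarse-grained transformations (the lineage)
// that built this RDD; Spark replays this chain to recompute a lost partition.
println(counts.toDebugString)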
00:23:46
Now we will see what more we can do
here in Spark.
00:23:51
So RDDs provide various
operations, and all these operations are divided
00:23:57
into two different categories; the first category
is called 'Transformations',
00:24:01
which we can apply as RDD operations.
00:24:04
The second is called 'Actions', which we
can also perform as RDD operations. As far
00:24:09
as the transformations which RDDs support,
they are in the form of filter, join, map, group
00:24:16
by; all these are different transformations
which RDDs support in the Spark system. Now,
00:24:22
the other set of operations which RDDs support
is called 'Actions'. Actions in the
00:24:28
sense that whenever there is an output of some
operation, it is called an 'Action'.
00:24:36
For example, count, print and so on.
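As a small Scala illustration (same sc, and a hypothetical file path as before): transformations only describe the computation, and nothing runs until an action is called.

// Transformations (map, filter, join, groupBy, ...) are lazy:
// these lines only record the computation, no job runs yet.
val words     = sc.textFile("hdfs:///data/words.txt")
val longWords = words.filter(_.length > 5).map(_.toLowerCase)

// Actions (count, collect, take, foreach, ...) trigger execution.
println(longWords.count())          // runs a job and returns a number
longWords.take(5).foreach(println)  // runs another job for the first 5 words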
00:24:39
Now, another thing which Spark provides
is called 'Control Operations', at the
00:24:47
programmer level.
00:24:48
So there are two interesting controls which
are provided by Spark to the programmers.
00:24:54
The first is called 'Partitioning'.
00:24:57
So Spark gives you control over how you
can partition your RDDs across the different
00:25:03
cluster nodes.
00:25:06
And the second one is called 'Persistence'.
00:25:09
Persistence allows you to choose whether
you want to persist RDDs onto the disk or
00:25:14
not.
00:25:15
So by default an RDD is not persisted, but if
you choose to persist
00:25:20
RDDs, then the RDDs will be stored
on disk.
00:25:24
Hence, both the persistence and partitioning
controls are given to the programmer,
00:25:31
that is, the user, by the Spark system.
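A short Scala sketch of these two controls (again reusing sc; the partitioner size and storage level are arbitrary choices for illustration):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Partitioning control: choose how the (key, value) records of a pair RDD
// are spread across the cluster, here hashed into 8 partitions.
val pairs       = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(8))

// Persistence control: nothing is persisted by default; the programmer
// picks a storage level, e.g. keep in memory and spill to disk if needed.
partitioned.persist(StorageLevel.MEMORY_AND_DISK)
println(partitioned.count())   // the first action materializes and persists it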
00:25:36
There are various other Spark applications
where Spark can be used. These applications
00:25:42
include Twitter spam classification,
algorithms for traffic prediction, k-means
00:25:48
clustering algorithms, alternating least squares
matrix factorization, in-memory OLAP aggregation
00:25:56
on Hive data, and SQL on Spark.
00:26:00
These are some of the applications, and there is
reference material for further study
00:26:07
of the Spark system,
00:26:09
which is available at
https://spark.apache.org.
00:26:17
Now we will look at Spark execution.
00:26:20
So Spark execution is done in the form
of distributed programming; the first
00:26:26
operation is called 'Broadcast'.
00:26:28
So there are three different entities:
one is called the 'Driver' of
00:26:35
Spark, the other entity is called the 'Executors',
00:26:39
which are there on the different nodes, the data nodes,
and then there is the shuffle operation.
00:26:47
So let us see: the first operation is called
'Broadcast'.
00:26:51
So the driver will broadcast the different
commands to the different executors; that
00:26:58
is called the 'Broadcast Operation'.
00:27:08
The driver broadcasts to the executors.
00:27:17
Now, these executors will execute and give
the results back to the driver.
00:27:27
And then again the driver will give further
operations or functions.
00:27:34
And these particular functions go through
the shuffle and are again given to the executors.
00:27:43
All of this, the entire set of task
operations, will be performed using the directed
00:27:50
acyclic graph.
00:27:51
So the directed acyclic graph is the scheduler
for the Spark execution.
00:28:06
So, for Spark execution, as we have seen, RDDs
can be operated on using two kinds of operations.
00:28:19
One is called 'Transformations'.
00:28:22
The other one is called 'Actions'.
00:28:25
So, in this particular diagram of RDDs, we have shown
you the actions in the form of dotted
00:28:33
circles and the transformations in the form of
grey circles.
00:28:42
So you can see here that
this is an RDD, and from this, this particular
00:28:51
RDD is obtained using a transformation;
further, this particular RDD is now
00:29:02
performing the action part, and this is also
a transformation, transformation, transformation,
00:29:08
and these are the actions.
00:29:10
So these actions will give the output or the
results of the execution.
00:29:21
This complete schedule is available at the
driver, and using this particular scheduler,
00:29:35
the driver in turn will supply the different
actions and different operations on RDDs, in
00:29:44
the form of transformations and actions, as
defined in the directed acyclic graph
00:29:52
scheduler.
00:29:53
Therefore, this directed acyclic graph, or
DAG, will take either actions or transformations.
00:30:02
So the actions include count, take, foreach,
and the transformations include map,
00:30:09
reduceByKey, join, and groupByKey.
00:30:16
So, in the diagram I have already
explained, these are all RDDs, and these
00:30:23
arrows are the transformations
from one RDD to another RDD, while this
00:30:33
is an action,
00:30:36
which will be performed on these RDDs.
00:30:43
So let us see a simple word count application,
which is written for Spark, using Flume
00:30:52
Java.
00:30:53
So here we can see that we have to set a
master, which will take care of
00:31:02
running the entire DAG scheduler.
00:31:06
And then we have to create a Spark context,
we have to read the text file, and then
00:31:14
we have to do a flatMap, which will split the text into
different words, which are separated by
00:31:22
blanks.
00:31:23
After that flatMap, we have to do the
map operation of the word count, which will
00:31:30
emit each word with the value 1. Then
we will perform the reduceByKey; that means
00:31:38
for a particular word, all the
numbers or instances emitted
00:31:45
by the map function will now be
summed up.
00:31:50
And then it will take the word counts
and print them for each
00:32:01
key;
00:32:02
the word count for each word will be printed.
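The lecture's example uses a FlumeJava-style API; a roughly equivalent Scala sketch of the same steps (master, context, textFile, flatMap, map, reduceByKey, print), with a made-up input path, would look like this:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Set the master and create the Spark context, as described above.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/input.txt")  // assumed input path
      .flatMap(_.split("\\s+"))     // split each line into words
      .map(word => (word, 1))       // emit (word, 1) for every word
      .reduceByKey(_ + _)           // sum the 1s for each word

    // Print each (word, count) pair on the driver.
    counts.collect().foreach { case (word, n) => println(s"$word $n") }

    sc.stop()
  }
}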
00:32:06
So this way, once the program is written,
the DAG will be constructed
00:32:15
automatically, and the DAG will be given to
the master node.
00:32:24
And the master in turn will communicate
with the executors and the shuffle for this execution;
00:32:37
in this way, the execution is performed through the
DAG.
00:32:43
Let us see the Spark implementation in more
detail.
00:32:47
So the Spark idea is an expressive computing
system which is not limited to the MapReduce
00:32:54
model.
00:32:55
That means programming beyond MapReduce
can now also be done in Spark.
00:33:01
Now, Spark will make use of the system
memory, and it will avoid saving intermediate
00:33:07
results to the disk; it will cache them for
repetitive queries. That means the output
00:33:15
of the transformations
will remain in the cache, so that iterative
00:33:20
applications can make use of this fast
and efficient data sharing.
00:33:27
Spark is also compatible with the
Hadoop system.
00:33:31
The RDD is an abstraction, as I told you; it is
a resilient distributed dataset, a partitioned
00:33:36
collection of records, spread across
the cluster, and read-only; caching
00:33:41
datasets in memory is possible, and
different storage levels are possible.
00:33:47
As I told you, there are two kinds of operations
RDDs support, transformations and actions:
00:33:55
transformations include map, filter, join,
and they are lazy operations; actions include
00:34:03
count, collect, save, and they trigger
execution.
00:34:07
Spark components: let us go and discuss
00:34:11
the Spark components in more detail.
00:34:14
So, as you know, Spark is a distributed
computing framework.
00:34:31
So now we will see here what the different
components are
00:34:35
which together form
00:34:37
the computational environment of Spark
execution.
00:34:42
Now, we have seen there is a cluster manager,
there is a driver program, there are worker
00:34:50
nodes, and within the worker nodes you have
the executors, and within the executors there
00:34:57
are the different tasks
00:34:59
and the cache.
00:35:02
All these different components together
form the distributed computing framework,
00:35:06
which gives an efficient execution environment
for a Spark program or Spark application.
00:35:14
So here, whenever a Spark shell is prompted, that
00:35:33
will be inside the driver program, which will create
the Spark context.
00:35:38
Now, executing through the Spark context means that
it will communicate with the worker nodes,
00:35:45
and within the worker nodes, the executors; so
the Spark context will interact
00:35:50
or communicate with the executors.
00:35:55
And within the executors, the tasks will be
executed.
00:35:59
So within the executors the tasks
will be executed, and the executors will be running
00:36:05
on the worker nodes;
so the driver program then interacts with
00:36:12
the cluster manager, and the cluster manager in turn
will interact with these worker nodes. So
00:36:18
this all happens inside Spark, and
Spark will use its cluster manager.
00:36:24
Now, there is an option in Spark that
instead of going through its cluster manager,
00:36:30
you can also use YARN and other such
resource managers and schedulers.
00:36:38
Now, let us see what we mean by the
driver program, the Spark context, the
00:36:47
cluster manager, the worker nodes, the executors, and
the tasks; to understand this,
00:36:56
let us see a simple application.
00:37:00
Now, we have seen here the driver program,
and this driver program will create a Spark
00:37:07
context; it will create the Spark context, sc,
and this in turn will now communicate with
00:37:15
the executors, which are running inside,
running inside the worker nodes.
00:37:23
So this particular driver program in turn
knows how to communicate with the different workers,
00:37:33
and the Spark context will now communicate
with the executors.
00:37:43
And these executors in turn
will execute the tasks.
00:37:50
So these tasks are nothing but the various
transformations: the RDDs, through the DAG,
00:38:00
are being transformed, and this is done
through the tasks.
00:38:04
So on the different executors, various tasks are being
created and executed.
00:38:09
So this will create the job execution;
let us go back and see that,
00:38:18
in this particular way, the driver program
will execute the DAG, and the Spark context,
00:38:27
sc,
00:38:28
will be created, which in turn will communicate
inside the worker nodes with the executors,
00:38:33
and these executors in turn will execute the various
transformations and actions, and the result
00:38:41
will remain in the cache.
00:38:44
So that was about the Spark components.
00:38:49
So we have seen that, in this manner, the
operation of Spark is being performed.
00:38:57
Now, another view, the partition-level view:
we can see here the partitioning, so
00:39:06
that means RDDs are partitioned
00:39:08
and different tasks are executed on the partitions.
00:39:11
Similarly, job scheduling: that means once
the operations on an RDD are given,
00:39:19
it will automatically build the DAG, and the DAG
will be given to the DAG scheduler, and that
00:39:26
DAG scheduler will split the graph into
stages of tasks and submit each stage as it
00:39:32
is ready.
00:39:33
So a task set is created and given to the
task scheduler, which
00:39:39
launches the tasks via the
cluster manager and retries failed or straggling
00:39:45
tasks.
00:39:47
And these tasks are given to the workers, as
we have seen in the previous slide, and these
00:39:53
workers will create the threads
and execute them there.
00:39:58
There are different APIs which are available,
and you can write programs using these APIs in Java,
00:40:05
Scala, or Python.
00:40:07
There is also an interactive interpreter
available, accessed through Scala and Python.
00:40:15
Standalone applications: there are many such applications.
As for performance, Java and Scala
00:40:22
are faster, thanks to static typing.
00:40:26
Now let us see the hands-on session, on how
we can work with Spark.
00:40:36
So we can run Spark through Scala: a Spark
shell will be created.
00:40:42
And we can download the
dataset file; the application can then be built using
00:40:53
the package, and then this particular task
or data file can be submitted to
00:41:01
the master node of the Spark system.
00:41:07
So Spark can store it directly into the
system, and now it can perform various operations
00:41:16
on this particular dataset using a Scala
program.
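For instance, inside the Spark shell (where the context sc is already available), a hypothetical interactive session on an assumed local data file might look like this:

// spark-shell session; "data.txt" is an assumed example file.
val data = sc.textFile("data.txt")
println(data.count())                              // number of lines
println(data.filter(_.contains("spark")).count())  // lines mentioning "spark"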
00:41:20
Now the summary: the concept of MapReduce
is limited to single-pass computation, and MapReduce
00:41:30
is basically limiting various other applications;
Spark avoids
00:41:38
storing intermediate results
on the disk or on HDFS.
00:41:43
Also, speeding up computations is required
when reusing datasets, and all these features
00:41:51
are available as part of Spark, as
we have seen, using RDDs.
00:42:00
So using RDDs, Spark now provides not
only MapReduce operations; beyond MapReduce,
00:42:15
other operations can also be used. The second thing is
that it supports in-memory operations, so it is not necessary
00:42:26
to store the intermediate
results in HDFS.
00:42:32
So this way of in-memory computation
brings speed-up and efficient
00:42:42
data sharing across different iterations.
00:42:47
So iterative and interactive applications
are both easily supported, and MapReduce
00:42:51
and non-MapReduce applications are also supported
by the Spark system.
00:43:19
So all this is possible with the help of
RDDs and their operations.
00:43:24
So we have seen now that RDDs, and hence
Spark, are very much required, and all the
00:43:32
drawbacks of MapReduce and Hadoop are now
no longer there with Spark, and therefore
00:43:39
Spark now has various new applications.
00:43:41
For example, consider the Spark system:
00:43:48
Spark, as a core, can be used
for building different applications,
00:44:00
such as Spark MLlib, that is, machine
learning over Spark; then Spark Streaming,
00:44:13
that is, real-time applications over
Spark; and Spark GraphX, graph computation
00:44:25
over Spark.
00:44:27
So Spark may or may not use
HDFS; Spark is independent of it.
00:44:38
Therefore, let us conclude this discussion:
RDDs, resilient distributed datasets,
00:44:45
provide a simple and efficient programming
model for supporting various applications;
00:44:51
batch, interactive, and iterative
applications are all supported using this
00:45:03
concept, which is called 'RDDs'.
00:45:06
Now, this is generalized to a broad set of
applications.
00:45:10
And it leverages the coarse-grained nature
of parallel algorithms for fault recovery.
00:45:17
So that is why this is up to a hundred times faster
compared to the traditional MapReduce.
00:45:30
So Spark is up to 100 times faster; that is what
has been observed, in terms of performance, on
00:45:41
Spark production clusters.
00:45:47
Thank you.