noc19-cs33 Lec 10-Introduction to Spark

00:46:21
https://www.youtube.com/watch?v=-5yf4vCUsbQ

Summary

TL;DR: The lecture provides an overview of Apache Spark as a big data analytics framework that improves upon the limitations of MapReduce. It introduces RDDs (Resilient Distributed Datasets) as a fundamental data structure allowing for efficient in-memory processing and fault tolerance through lineage tracking. Spark's ability to handle interactive and iterative applications, and its execution model, which differs from traditional batch approaches, make it a powerful tool in the data processing landscape. Also highlighted are the operations that can be performed on RDDs, namely transformations and actions, as well as Spark's integration with existing systems like Hadoop.

Key Takeaways

  • 🚀 Spark is an advanced big data analytics framework.
  • 📦 RDDs (Resilient Distributed Datasets) are key to Spark's efficiency.
  • 📊 In-memory processing enhances performance compared to MapReduce.
  • ⏳ Spark simplifies iterative and interactive applications.
  • 🏗️ Fault tolerance in Spark is achieved through lineage tracking.
  • 💼 Spark integrates well with Hadoop and other systems.
  • 🔄 RDDs support various operations: transformations and actions.
  • 💡 Spark's execution model uses a Directed Acyclic Graph (DAG).
  • ⚡ Spark can be 100 times faster than traditional MapReduce.
  • 🌍 Applications include machine learning, real-time processing, and much more.

Timeline

  • 00:00:00 - 00:05:00

    This lecture introduces Spark, a big data analytics framework developed at UC Berkeley in 2012. It highlights the shortcomings of the MapReduce system, particularly in performance and efficient data sharing, which make it less suitable for interactive or iterative applications.

  • 00:05:00 - 00:10:00

    The limitations of MapReduce are discussed, emphasizing its expensive nature for certain applications and its lack of efficient data sharing. This led to the emergence of specialized frameworks to address these drawbacks, such as Pregel, which uses the Bulk Synchronous Parallel (BSP) model for graph processing.

  • 00:10:00 - 00:15:00

    The need for Spark arises from the limitations of MapReduce. Spark introduces Resilient Distributed Datasets (RDDs), a new data abstraction that addresses the data-sharing shortcomings of batch-only processing, offering efficient data sharing and support for interactive and iterative applications.

  • 00:15:00 - 00:20:00

    RDDs are described as immutable, partitioned collections of records, which are built through coarse-grained transformations like map and join. They can be cached for efficient reuse, improving the performance of applications that require repeated access to the same data.

  • 00:20:00 - 00:25:00

    An example of a word count application illustrates how Spark operates using RDDs. The process involves reading data from HDFS to build RDDs, applying transformations such as map and reduce, and caching the final output for further use, showcasing the efficiency of RDDs in simplifying operations.

  • 00:25:00 - 00:30:00

    Spark's approach to fault tolerance is based on lineage, where the history of coarse-grained operations is logged. In the event of a failure, lost partitions can be recomputed using the lineage information, ensuring data integrity without relying on intermediate results stored on disk.

  • 00:30:00 - 00:35:00

    RDD operations fall into two categories: transformations such as filter, join, map, and group by, and actions such as count and print. Actions trigger the execution of the accumulated transformations, highlighting the flexibility of RDD operations within Spark.

  • 00:35:00 - 00:40:00

    Spark supports control operations for developers, providing options for partitioning RDDs and choosing whether to persist them on disk. This flexibility further enhances the efficiency of data processing in Spark applications.

  • 00:40:00 - 00:46:21

    The lecture concludes by outlining Spark's components and execution model, emphasizing the use of a distributed computing framework that can communicate with cluster managers and execute tasks on multiple worker nodes, all while managing RDD transformations and actions effectively.



Video Q&A

  • What is Spark?

    Apache Spark is a big data analytics framework developed to improve performance and support iterative and interactive applications.

  • How does Spark improve on MapReduce?

    Spark provides in-memory data sharing and eliminates the need to write intermediate results to disk, thus speeding up processing.

  • What are RDDs?

    Resilient Distributed Datasets (RDDs) are immutable, partitioned collections of records used in Spark to enable efficient data processing.

  • How does Spark ensure fault tolerance?

    Spark uses lineage, a record of the series of transformations applied to RDDs, to recompute lost data if needed.

  • What types of applications can Spark support?

    Spark supports batch, interactive, and iterative applications, including machine learning and real-time processing.

  • What operations can be performed on RDDs?

    Operations on RDDs include transformations (like map and filter) and actions (like count and collect).

  • Can Spark work with other systems?

    Yes, Spark can integrate with Hadoop and other cluster managers.

  • What is the execution model in Spark?

    Spark execution involves a driver program communicating with executors to perform tasks coordinated through a Directed Acyclic Graph (DAG).

  • Why is Spark considered faster than MapReduce?

    Spark is faster due to in-memory computation and reduced disk I/O, making it up to 100 times faster than traditional MapReduce for some workloads.

  • What are some applications of Spark?

    Applications of Spark include Twitter sentiment analysis, traffic prediction algorithms, and machine learning with MLlib.


Transcript (English)
  • 00:00:13
    Preface: Content of this lecture: This lecture, we will discuss the ‘framework of Spark’.
  • 00:00:22
    Resilient distributed data sets and also, we will discuss the Spark execution.
  • 00:00:28
    The need for Spark.
  • 00:00:30
    Apache Spark is a big data analytics framework that was originally developed at the University
  • 00:00:37
    of California, Berkeley, at AMP Lab, in 2012.
  • 00:00:41
    Since then, it has gained a lot of traction, both in academia and in industry.
  • 00:00:50
    It is another system for big data analytics.
  • 00:00:54
    Now, before this Spark, the Map Reduce was already in use.
  • 00:01:01
    Now, why is MapReduce not good enough?
  • 00:01:07
    That is what we are going to explore, to understand the need for the Spark system. Before that,
  • 00:01:14
    we have to understand that, Map Reduce simplifies, the batch processing on a large commodity
  • 00:01:20
    clusters.
  • 00:01:21
    So, in that process, the data file or dataset, which was the input supplied through
  • 00:01:36
    the HDFS system, is split across different splits, and
  • 00:01:51
    then the map function is applied to these splits, which in turn will generate
  • 00:01:59
    the intermediate values, which are shuffled, stored in HDFS, and then
  • 00:02:10
    passed on to the reduce function, giving the output. So this is the scenario of the
  • 00:02:17
    MapReduce execution framework. As you have seen, the dataset is stored in the HDFS
  • 00:02:34
    file system.
  • 00:02:36
    Hence, this particular computation is done for the batch processing system.
  • 00:02:49
    So we have seen that, this particular batch processing, was done using the Map Reduce
  • 00:02:56
    system and that was in use earlier, before the development of a Spark.
  • 00:03:02
    Now let us see, why this particular model, of Map Reduce is not good enough to be there?
  • 00:03:09
    Now, as you have seen, in the previous slide that, the output of the Map function, will
  • 00:03:17
    be stored in HDFS and this will ensure the fault tolerance.
  • 00:03:42
    So the output of the map function was stored in the HDFS file system, and this ensures
  • 00:03:53
    the fault tolerance.
  • 00:03:55
    So in between, if there is a failure, when it is stored in the HDFS file system, still
  • 00:04:00
    the data is not lost and is completely saved, that is why?
  • 00:04:04
    It is ensuring the fault tolerance and because of that fault tolerance, the intermediate
  • 00:04:10
    values or intermediate results, out of the map functions were stored into HDFS file.
  • 00:04:26
    Now, these intermediate results will be further fed to the reduce
  • 00:04:37
    function.
  • 00:04:49
    So that was, the typical Map Reduce, execution pipeline.
  • 00:04:58
    Now, what is the issue?
  • 00:05:01
    The problem is, mentioned over here, that the major issue is the performance.
  • 00:05:15
    so that means, the performance is degraded and it has become expensive, to save to the
  • 00:05:26
    disk for this particular activity of fault tolerance, hence this particular Map Reduce,
  • 00:05:33
    framework becomes quite slow, to some of the applications, which are interactive and required
  • 00:05:40
    to be processed in the real time, hence this particular framework, is not very useful,
  • 00:05:46
    for certain applications.
  • 00:05:49
    Therefore, let us summarize that, Map Reduce can be expensive, for some applications and
  • 00:05:57
    for example, interactive and iterative application.
  • 00:06:01
    They are very expensive and also, this particular framework, that is the Map Reduce framework,
  • 00:06:08
    lacks efficient data sharing, across the map and reduce phase of operation or iterations.
  • 00:06:16
    It lacks efficient data sharing in the sense that the intermediate output of the
  • 00:06:22
    map phase is stored in the HDFS file system.
  • 00:06:29
    Hence, it is not shared, in an efficient manner.
  • 00:06:33
    Hence, this particular sharing the data across map and reduce phase, has to be through the
  • 00:06:40
    disk and which is a slow operation, not an efficient operation.
  • 00:06:44
    Hence the statement that MapReduce lacks efficient data sharing
  • 00:06:53
    in the system. Due to these drawbacks, several specialized frameworks
  • 00:07:00
    evolved for different programming models; for example, the Pregel system for
  • 00:07:09
    graph processing, using bulk synchronous parallel (BSP) processing, and then another framework,
  • 00:07:25
    called 'Iterative MapReduce', was also made, and they allowed different
  • 00:07:32
    specialized frameworks for different applications to be supported.
  • 00:07:38
    So the drawbacks of MapReduce have given way to several frameworks, and they involved
  • 00:07:46
    different programming models. The bulk synchronous parallel (BSP) processing
  • 00:07:52
    framework is about a synchronization barrier: processes running at different speeds
  • 00:08:03
    join at the synchronization barrier and then proceed further on, so
  • 00:08:09
    this is called the 'BSP model'. This BSP model also allows fast processing for
  • 00:08:20
    typical graph applications, so graph processing is done using BSP to make it faster,
  • 00:08:28
    because MapReduce does not support graph processing within it.
  • 00:08:32
    A graph processing framework requires that the data be taken up by the neighboring
  • 00:08:43
    nodes; basically, these neighboring nodes will collect the data, will gather the
  • 00:08:49
    data, then perform the computation, and then again scatter this data to the other
  • 00:08:57
    nodes.
  • 00:08:58
    So this particular sequence of communication, computation, and then communication again is to
  • 00:09:09
    be completed as one step, that is, in lockstep, and hence bulk synchronous processing
  • 00:09:16
    comes into effect, where all three actions are to be performed
  • 00:09:24
    and only then will the next step take place, using bulk synchronous processing.
  • 00:09:30
    Hence, this kind of paradigm, called the 'Bulk Synchronous Parallel framework', is
  • 00:09:36
    useful for graph processing. We will see how these different programming
  • 00:09:45
    paradigms, such as bulk synchronous processing and iterative MapReduce, support iterative
  • 00:09:54
    applications; for example, iterations of MapReduce are basically what you can see
  • 00:10:02
    in machine learning algorithms.
  • 00:10:05
    So these different frameworks evolved out of the MapReduce drawbacks, and they
  • 00:10:13
    existed over a period of time to fill this particular
  • 00:10:19
    gap.
  • 00:10:21
    Now, seeing all these scenarios and to provide the data sharing and to support the applications
  • 00:10:29
    such as interactive and iterative applications, there was a need for the SPARK system.
  • 00:10:37
    So the solution which Spark has given is in the form of an abstract data type, which
  • 00:10:42
    is called a 'Resilient Distributed Dataset'.
  • 00:10:54
    So Spark provides a new way of supporting this abstract data type, which is called a resilient
  • 00:11:01
    distributed dataset, or in short, an RDD.
  • 00:11:05
    Now, we will see in this part of the discussion how these RDDs are going to solve the drawbacks
  • 00:11:13
    of batch-processing MapReduce, provide more efficient data sharing, and also
  • 00:11:19
    support iterative and interactive applications.
  • 00:11:33
    Now, these RDDs, resilient distributed datasets, are immutable; immutable means
  • 00:11:44
    we cannot change them.
  • 00:11:59
    They cannot be changed, so that means once an RDD is formed, it is immutable.
  • 00:12:04
    Now, in this particular way, this immutable resilient distributed data set can be partitioned
  • 00:12:11
    in various ways, across different cluster nodes.
  • 00:12:16
    So it is a partitioned collection of records; partitioning can happen, for example, if this is a
  • 00:12:22
    cluster of machines which are connected with each other.
  • 00:12:27
    So the RDDs can be partitioned and stored, at different places, at different segments.
  • 00:12:33
    Hence, an immutable partitioned collection of records is possible, and in this particular
  • 00:12:40
    scenario, that is called an 'RDD'.
  • 00:12:42
    Now, another thing is how an RDD is formed:
  • 00:12:51
    RDDs will be built through coarse-grained transformations, such as map, join, and so
  • 00:12:56
    on.
  • 00:12:57
    Now, these RDDs can be cached for efficient reuse, and we are going to see that a lot
  • 00:13:04
    of new operations can be performed on them. So again, let us summarize that Spark
  • 00:13:12
    has given a solution in the form of an abstract data type, which is called an 'RDD',
  • 00:13:27
    and an RDD can be built using the transformations and can also be changed into
  • 00:13:46
    another form; that is, an RDD can become another RDD through various transformations, such
  • 00:13:53
    as map, join and so on.
  • 00:13:56
    That we will see in due course of action, these RDDs are, are immutable, partition collection
  • 00:14:04
    of record.
  • 00:14:13
    Means that, once an RDD is formed, so as an immutable, immutable means, we cannot change,
  • 00:14:19
    this entire collection of records, can be stored and in a convenient manner onto the,
  • 00:14:27
    onto the cluster system.
  • 00:14:29
    Hence, this is an immutable partition collection of the record, these RDDs can be cached, can
  • 00:14:39
    be cached in memory, for efficient reuse.
  • 00:14:46
    So, as we have seen, MapReduce lacks this data sharing, and now, using the RDDs,
  • 00:14:54
    Spark will provide efficient sharing of data in the form of RDDs.
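
    A minimal sketch of this in-memory data sharing in Spark's Scala API, assuming a Spark shell where sc is the SparkContext and the HDFS path is a placeholder: the RDD is computed once, cached, and then reused by two different actions without re-reading the file.

        // Build an RDD and mark it for in-memory reuse (placeholder path).
        val lines  = sc.textFile("hdfs:///user/demo/data.txt")
        val errors = lines.filter(line => line.contains("ERROR")).cache()

        // The first action materializes `errors` and keeps its partitions in memory.
        val total = errors.count()

        // The second action reuses the cached partitions instead of re-reading HDFS.
        val timeouts = errors.filter(_.contains("timeout")).count()
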
  • 00:15:04
    Now let us see, through an example of a word count.
  • 00:15:06
    Word count example, to understand the need of Spark.
  • 00:15:19
    So in the word count application, the dataset is stored in the HDFS file system
  • 00:15:29
    and is being read; after reading this particular dataset from the file system, this particular
  • 00:15:36
    reading operation will build the RDDs, which is shown here in block number
  • 00:15:45
    1.
  • 00:15:46
    So that means once, the data is read, it will build, the RDDs from the word-count data set.
  • 00:16:06
    Once these RDDs are built, then they can be stored at different places, so it's an immutable
  • 00:16:11
    partitioned collection and now various operations we can perform.
  • 00:16:17
    So we know that on these RDDs we can perform various transformations, and the first transformation
  • 00:16:23
    which we are going to perform on these RDDs is called the 'map function'.
  • 00:16:28
    The map function for the word count program is applied on the different RDDs, which
  • 00:16:36
    are stored.
  • 00:16:38
    So after applying the map function, this particular output, will be stored in memory and then
  • 00:16:44
    again, the reduce function will be applied on these RDDs, which are the output,
  • 00:16:49
    or the transformations,
  • 00:16:51
    that is, the RDDs transformed by the map function.
  • 00:16:57
    Again, the reduce function will be applied.
  • 00:16:59
    And the data and the result of this reduce, will remain in cache.
  • 00:17:05
    so that, it can be, used up by different application .so you can see that, this particular transformation,
  • 00:17:14
    which is changing the RDDs from one form, to another form that means after reading,
  • 00:17:23
    from the file, it will become an RDD and after applying the map function, it will change
  • 00:17:30
    to another RDD and map function and after applying the reduce function.
  • 00:17:36
    it will change to, another form and the final output, will be remained in the cache memory.
  • 00:17:45
    Output will remain in the cache, so that, whatever application requires, this output
  • 00:17:51
    can be used up, so this particular pipeline, which we have shown, is quite, easily, understandable
  • 00:17:58
    and is convenient to manage and part and to store, in the partition, collection manner,
  • 00:18:05
    in this cluster computing system.
  • 00:18:09
    So, we have seen that these RDDs have simplified this particular task and have also made this
  • 00:18:19
    operation efficient.
  • 00:18:21
    Hence, an RDD is an immutable, partitioned collection of records.
  • 00:18:25
    And they are built through the coarse grained transformation that we have seen in the previous
  • 00:18:30
    example.
  • 00:18:31
    Now, another question: since MapReduce was storing the intermediate results of a
  • 00:18:39
    map, before they are used in the reduce function, in the HDFS file system for ensuring
  • 00:18:45
    fault tolerance, and since Spark is not using this intermediate
  • 00:18:52
    result storage through HDFS, but rather in-memory storage, the question is
  • 00:19:00
    how Spark ensures fault tolerance. That we have to understand now; the concept
  • 00:19:06
    which Spark uses for fault tolerance is called 'Lineage'.
  • 00:19:33
    So Spark uses lineage to achieve fault tolerance, and that we have to understand
  • 00:19:40
    now, so what Spark does is?
  • 00:19:43
    It logs the coarse-grained operations which are applied to the partitioned dataset,
  • 00:19:50
    meaning to say that, all the operation like, reading of a file and that becomes an RDD
  • 00:19:58
    and then making a transformation on an RDD using map function and then again another
  • 00:20:04
    transformation, of RDDs using reduced function, join function and so on.
  • 00:20:09
    All these operations form the coarse-grained operations, and they are to be logged
  • 00:20:14
    into a file before applying them.
  • 00:20:18
    So basically, if the data is lost, or if the system crashes,
  • 00:20:25
    if a node crashes, Spark simply recomputes the lost partitions whenever there is
  • 00:20:30
    a failure; if there is no failure, obviously no extra cost is required in this process.
  • 00:20:37
    Let us see this, through an example.
  • 00:20:40
    So again, let us explain that lineage is, in essence,
  • 00:20:47
    a log of the coarse-grained operations.
  • 00:21:04
    Now this particular, lineage will, keep a record of all these operations, coarse-grained
  • 00:21:10
    operations, which are applied and that will be kept in a log file.
  • 00:21:15
    Now we will see, how using this lineage or a log files, the fault tolerance can be achieved.
  • 00:21:23
    Let us see, through this particular example.
  • 00:21:27
    That let us see, that the word count example, which we have seen in the last slide.
  • 00:21:36
    Now, with the same word count example, we have to see that these RDDs
  • 00:21:44
    will keep track of the graph of transformations that built them,
  • 00:21:56
    their lineage, to rebuild lost data.
  • 00:22:00
    So the, so there will be a log of all the coarse grained operation, which are performed
  • 00:22:06
    and which has, built these RDDs transformations. and this is called, ‘Lineage’. let us
  • 00:22:14
    see what happens: for example, this particular RDD will be formed after the
  • 00:22:21
    read operation on the dataset, and then on this particular data, this RDD, we have
  • 00:22:29
    performed the map operation, transforming it into another RDD, and this transformed RDD again is now
  • 00:22:35
    applied with the reduce function to make this particular RDD, which is stored in the cache.
  • 00:22:42
    Now, consider that this particular node, which has stored this transformed RDD: if it
  • 00:22:48
    has failed, obviously we have to trace back to the previous RDD and consult this lineage,
  • 00:22:57
    which will tell us that this is an output of the map function, this is a transformed
  • 00:23:04
    RDD; when we apply the reduce function to this RDD, it will recreate the same RDD which
  • 00:23:11
    is lost.
  • 00:23:13
    So let us see, what is written?
  • 00:23:15
    What we have just seen?
  • 00:23:17
    So we simply have to recompute the lost partition whenever there is a failure: we have
  • 00:23:23
    to trace back and apply the same transformations again on the RDD, and we can recompute
  • 00:23:32
    the RDD partition which was lost due to the failure.
  • 00:23:38
    So now, using the lineage concept, we have seen that fault tolerance is achieved in the
  • 00:23:44
    Spark system.
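
    A small sketch of how this lineage can be inspected from the Scala API, assuming a Spark shell where sc exists and a placeholder input path; toDebugString prints the chain of coarse-grained transformations Spark has recorded, which is what it replays to recompute a lost partition.

        // The chain mirrors the word-count lineage discussed above (placeholder path).
        val counts = sc.textFile("hdfs:///user/demo/words.txt")
                       .flatMap(_.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)

        // Print the recorded lineage: the graph of transformations that built `counts`.
        // If a partition is lost, Spark re-runs just this chain for that partition.
        println(counts.toDebugString)
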
  • 00:23:46
    Now we will see that, what more we can do here in the Spark?
  • 00:23:51
    So RDDs provide various operations, and all these operations are divided
  • 00:23:57
    into two different categories; the first category is called 'Transformations',
  • 00:24:01
    which we can apply as RDD operations.
  • 00:24:04
    The second is called 'Actions', which we can also perform as RDD operations. As far
  • 00:24:09
    as the transformations which an RDD supports, they are in the form of filter, join, map, and group
  • 00:24:16
    by; all these are different transformations which RDDs support in the Spark system. Now,
  • 00:24:22
    the other set of operations which RDDs support is called 'Actions'; actions, in the
  • 00:24:28
    sense that whenever there is an output of some operation, it is called an 'Action'.
  • 00:24:36
    For example, count, print and so on.
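
    A minimal sketch of the two categories in the Scala API, assuming sc is available: the transformation lines only describe new RDDs lazily, and nothing runs on the cluster until an action such as count or take is called.

        val nums = sc.parallelize(1 to 1000)

        // Transformations: lazily build new RDDs; nothing executes yet.
        val evens   = nums.filter(_ % 2 == 0)
        val squares = evens.map(n => n * n)

        // Actions: trigger execution and return results to the driver.
        println(squares.count())          // 500
        squares.take(3).foreach(println)  // 4, 16, 36
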
  • 00:24:39
    Now, then another thing which, Spark can provide is called, ‘Control Operations’, to the
  • 00:24:47
    programmer level.
  • 00:24:48
    So there, are two interesting control, which is being provided by the Spark, to the programmers.
  • 00:24:54
    The first is called, ‘partitioning’.
  • 00:24:57
    So, the Spark gives the control or how you can partition your RDDs, across different
  • 00:25:03
    cluster systems.
  • 00:25:06
    and second one is called the, ‘Persistence’.
  • 00:25:09
    Persistence allows you to choose, whether you want to persist RDDs on to the disk or
  • 00:25:14
    not.
  • 00:25:15
    So by default, it is not persisted, but if you choose to persist
  • 00:25:20
    RDDs, then the RDDs have to be stored in HDFS.
  • 00:25:24
    Hence, both the persistence and partitioning controls are given to the programmer,
  • 00:25:31
    that is, the user, by the Spark system.
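
    A hedged sketch of these two controls in the Scala API, assuming sc is available and a hypothetical dataset of key-value pairs; partitionBy controls how a pair RDD is spread across the cluster, and persist with a StorageLevel chooses how (and whether) its partitions are kept around.

        import org.apache.spark.HashPartitioner
        import org.apache.spark.storage.StorageLevel

        // Hypothetical key-value pairs.
        val pairs = sc.parallelize(Seq(("spark", 1), ("hadoop", 2), ("spark", 3)))

        // Control 1: partitioning -- decide how records are spread across the cluster.
        val partitioned = pairs.partitionBy(new HashPartitioner(8))

        // Control 2: persistence -- decide whether (and at what level) to keep the RDD.
        partitioned.persist(StorageLevel.MEMORY_AND_DISK)

        println(partitioned.reduceByKey(_ + _).collect().mkString(", "))
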
  • 00:25:36
    There are various other Spark applications where Spark can be used; these applications
  • 00:25:42
    include Twitter spam classification, algorithms for traffic prediction, k-means
  • 00:25:48
    clustering algorithms, alternating least squares matrix factorization, in-memory OLAP aggregation
  • 00:25:56
    on Hive data, and SQL on Spark.
  • 00:26:00
    These are some of the applications and these are the reference, material for further studies,
  • 00:26:07
    on the Spark system.
  • 00:26:09
    That is available at https://spark.apache.org.
  • 00:26:17
    Now we will see, about the Spark execution.
  • 00:26:20
    So a Spark execution, is done, in the form of distributed programming, that is, the first
  • 00:26:26
    operation is called, ‘Broadcast’.
  • 00:26:28
    So there is a, there are three different entities, one is called the, ‘Driver Entity’, of
  • 00:26:35
    the Spark, the other entity is called, ‘Executors’.
  • 00:26:39
    Which are there on different nodes, data nodes and then there is a shuffle operation.
  • 00:26:47
    So let us see the, first operation is called, ‘Broadcast’.
  • 00:26:51
    So the driver will broadcast, these different commands, to the different executors, that
  • 00:26:58
    is called a, ‘Broadcast Operation’.
  • 00:27:08
    We will broadcast, to the executors.
  • 00:27:17
    Now, these executors will execute and give the result back to the drivers.
  • 00:27:27
    And then again, the driver will give further operations or the functions.
  • 00:27:34
    And these, particular functions are used by the shuffle and again given back to the executors.
  • 00:27:43
    This all will be performed, the entire task operations, will be performed using, the directed
  • 00:27:50
    acyclic graph.
  • 00:27:51
    So the directed acyclic graph is the scheduler for the Spark execution.
  • 00:28:06
    So for Spark execution, as we have seen, RDDs can be executed using two operations.
  • 00:28:19
    One is called, ‘Transformations’.
  • 00:28:22
    The other one is called, ‘Actions’.
  • 00:28:25
    So, in this particular RDD, we have shown you the actions, in the form of, the dotted
  • 00:28:33
    circle and transformation in the form of a gray circle.
  • 00:28:42
    so you can see here, that, this shows that, this is an RDD and from this, this particular
  • 00:28:51
    RDD, is obtained using the transformations and further, this particular RDD is now giving
  • 00:29:02
    performing the action part and this is also a transformation, transformation, transformation
  • 00:29:08
    and these are the actions.
  • 00:29:10
    So these actions will give the output or the results, of the execution.
  • 00:29:21
    This complete schedule is, available at the driver and using this particular scheduler
  • 00:29:35
    the driver in turn will supply, the different actions, different operations on RDDs, in
  • 00:29:44
    the form of transformations and actions as, it is defined in the directed acyclic graph
  • 00:29:52
    scheduler.
  • 00:29:53
    Therefore, this directed acyclic graph or a DAG, will take either the actions or transformations.
  • 00:30:02
    So the actions include count, take, and foreach, and the transformations involve map,
  • 00:30:09
    reduceByKey, join, and groupByKey.
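
    A small sketch of such a DAG in the Scala API, assuming sc is available and two hypothetical pair RDDs; each transformation adds a node to the graph, and only the actions at the end make the DAG scheduler split it into stages and run it.

        // Two hypothetical pair RDDs: (pageId, views) and (pageId, title).
        val views  = sc.parallelize(Seq((1, 10), (2, 3), (1, 7)))
        val titles = sc.parallelize(Seq((1, "home"), (2, "about")))

        // Transformations only extend the DAG; nothing runs yet.
        val totals  = views.reduceByKey(_ + _)                  // needs a shuffle stage
        val joined  = totals.join(titles)                       // (pageId, (views, title))
        val byTitle = joined.map { case (_, (v, t)) => (t, v) }

        // Actions make the DAG scheduler break the graph into stages and execute them.
        println(byTitle.count())
        byTitle.take(2).foreach(println)
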
  • 00:30:16
    So these are all, a diagram I have already explained, that these are all RDDs and these
  • 00:30:23
    are all transformations, the arrows are transformations, from one RDDs to another RDDs and if this
  • 00:30:33
    is an, action.
  • 00:30:36
    On these RDDs, it will be performed on the action.
  • 00:30:43
    So let us see, a simple word count application, which is written in, a Spark, using Flume
  • 00:30:52
    Java.
  • 00:30:53
    So here, we can see that, we have to set a master, which will, which will take care of
  • 00:31:02
    running the entire dag scheduler.
  • 00:31:06
    And now then, we have to create a Spark context and we have to, read the text file and then
  • 00:31:14
    we have to do a flat map, which will split different words, which are separated by, the
  • 00:31:22
    blank.
  • 00:31:23
    After that flatMap, we have to do the map operation of the word count, which will
  • 00:31:30
    emit the word with the value 1. Then we will perform the reduceByKey; that means,
  • 00:31:38
    for a particular word, all such numbers or instances which are emitted
  • 00:31:45
    by the map function will now be summed up.
  • 00:31:50
    And then, the word count it will now take and then it will, print it for each value
  • 00:32:01
    of key.
  • 00:32:02
    The, the word count will be printed.
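
    A hedged reconstruction of the word-count program being described, written against the Scala RDD API rather than the FlumeJava-style code on the slide; the application name, master URL, and input path are placeholders.

        import org.apache.spark.{SparkConf, SparkContext}

        object WordCount {
          def main(args: Array[String]): Unit = {
            // Set the master and create the Spark context (placeholder master URL).
            val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
            val sc   = new SparkContext(conf)

            // Read the text file and split each line into words on blanks (flatMap).
            val words = sc.textFile("hdfs:///user/demo/input.txt")
                          .flatMap(line => line.split(" "))

            // Emit (word, 1) for each word, then sum the 1s per word with reduceByKey.
            val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

            // An action: collect the results and print the count for each word.
            counts.collect().foreach { case (word, n) => println(s"$word: $n") }

            sc.stop()
          }
        }
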
  • 00:32:06
    So this way, once the program is made, the DAG will be constructed
  • 00:32:15
    automatically, and the DAG will be given to the master node.
  • 00:32:24
    And the master in turn will communicate with the executors and shuffle for this execution;
  • 00:32:37
    in this way, the execution is performed via the DAG.
  • 00:32:43
    Let us, see the SPARK implementation, in more details.
  • 00:32:47
    So, Spark RDDs are an expressive computing system, which is not limited by the MapReduce
  • 00:32:54
    model.
  • 00:32:55
    That means, beyond Map Reduce also, the programming can be now done, in the SPARK.
  • 00:33:01
    Now, Spark will utilize the system memory, and it will avoid saving the intermediate
  • 00:33:07
    results to the disk; it will cache results for repetitive queries. That means the output
  • 00:33:15
    of the transformations will remain in the cache, so that iterative
  • 00:33:20
    applications can make use of this fast and efficient data sharing.
  • 00:33:27
    The SPARK is also compatible with, with the Hadoop system.
  • 00:33:31
    An RDD is an abstraction, as I told you; it is a resilient distributed dataset, a partitioned
  • 00:33:36
    collection of records; they are spread across the cluster, they are read-only, and caching
  • 00:33:41
    datasets in memory is possible, and different storage levels are possible.
  • 00:33:47
    As I told you, transformations and actions are the two operations that RDDs support;
  • 00:33:55
    transformations include map, filter, and join, and they are lazy operations, while actions include
  • 00:34:03
    count, collect, and save, and they trigger execution.
  • 00:34:07
    Spark components, let us go and discuss
  • 00:34:11
    The Spark components, in more details.
  • 00:34:14
    So, as you know that, is Spark, is a distributed computing framework.
  • 00:34:31
    So now, we will see here, what are the different components?
  • 00:34:35
    Which together will form?
  • 00:34:37
    The computational environment of the Spark execution.
  • 00:34:42
    Now, we have seen there is a cluster manager, there is a driver program, there are worker
  • 00:34:50
    nodes, and within the worker nodes you have the executors, and within the executors, what
  • 00:34:57
    are the different tasks,
  • 00:34:59
    and what is the cache?
  • 00:35:02
    All these different components together will form the distributed computing framework,
  • 00:35:06
    which will give an efficient, execution, environment, for the Spark program or a Spark applications.
  • 00:35:14
    So here, within a driver program, whenever a Spark shell is prompted, that
  • 00:35:33
    will be inside the driver program, which will create the Spark context.
  • 00:35:38
    Now executing the Spark context means that, it will communicate with the worker nodes,
  • 00:35:45
    within the worker nodes the executors, so a Spark context will, create will, interact
  • 00:35:50
    or communicate with the executors.
  • 00:35:55
    And within the executors, the tasks will be executed.
  • 00:35:59
    So within the executors the tasks will be executed, and the executors will be running
  • 00:36:05
    on the worker nodes; so the driver program then interacts with
  • 00:36:12
    the cluster manager, and the cluster manager in turn will interact with these worker nodes. So
  • 00:36:18
    this all will happen inside Spark, through the cluster manager.
  • 00:36:24
    Now, there is an option in Spark that, instead of going through its own cluster manager,
  • 00:36:30
    you can also use YARN and other such resource managers and schedulers.
  • 00:36:38
    Now, let us see, what do you mean, by the, driver program, Spark context, server and
  • 00:36:47
    cluster manager, worker node, executor, then tasks and through, this to understand this.
  • 00:36:56
    Let us see, a simple application.
  • 00:37:00
    Now, we have seen here the driver program, and this driver program will create a Spark
  • 00:37:07
    context; it will create a Spark context, sc, and this in turn will now communicate with
  • 00:37:15
    the executors, which are running inside the worker nodes.
  • 00:37:23
    So, so this particular driver program, intern knows the, the different worker program communication
  • 00:37:33
    and the Spark context, will now communicate to the executors.
  • 00:37:43
    And these executors in turn, will communicate or will, will execute the tasks.
  • 00:37:50
    So these tasks are, nothing but, the various transformations, the RDDs, through the dag,
  • 00:38:00
    they are being transformed and they are done through the tasks.
  • 00:38:04
    So different executors various tasks are being created and executing.
  • 00:38:09
    So this will create the job execution; let us go back and see this.
  • 00:38:18
    So in this particular way, the driver program will execute the DAG, and the Spark context,
  • 00:38:27
    sc,
  • 00:38:28
    will be created which in turn will communicate inside the worker node with the executors
  • 00:38:33
    and these executors in turn will execute various transformations and actions and the result
  • 00:38:41
    will remain in the cache.
  • 00:38:44
    So that was, about the Spark components.
  • 00:38:49
    So we have seen that, in this manner, the operation of Spark is being performed.
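
    A brief sketch of how these components are wired together from the application side, with placeholder values; the SparkConf/SparkContext created in the driver is what talks to a cluster manager (standalone master, YARN, or local mode), which in turn runs executors on the worker nodes.

        import org.apache.spark.{SparkConf, SparkContext}

        // Driver program: creating the SparkContext connects to a cluster manager.
        // "spark://master-host:7077" is a placeholder standalone master; "yarn" or
        // "local[*]" (a single JVM, for testing) are other common choices.
        val conf = new SparkConf()
          .setAppName("ComponentsDemo")
          .setMaster("spark://master-host:7077")
          .set("spark.executor.memory", "2g")   // resources requested per executor

        val sc = new SparkContext(conf)

        // Actions are split into tasks that the driver ships to the executors on the
        // worker nodes; cached partitions live in executor memory.
        val data = sc.parallelize(1 to 1000000).cache()
        println(data.sum())

        sc.stop()
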
  • 00:38:57
    Now, another view of partition level view, we can see here that, the partitioning so
  • 00:39:06
    that means, RDDs are partitioned
  • 00:39:08
    And different tasks are being executed.
  • 00:39:11
    Similarly, job scheduling: that means, once the operations on an RDD are given,
  • 00:39:19
    it will automatically build the DAG, and the DAG will be given to the DAG scheduler, and that
  • 00:39:26
    DAG scheduler will split the graph into stages of tasks and submit each stage as it
  • 00:39:32
    is ready.
  • 00:39:33
    So a task set is created and given to the task scheduler, and as far as the cluster manager
  • 00:39:39
    is concerned, it launches the tasks via the cluster manager and retries the failed or straggling
  • 00:39:45
    tasks.
  • 00:39:47
    And this task is, given to the worker that we have seen in the previous slide and this
  • 00:39:53
    particular workers will create the thread and execute them, there.
  • 00:39:58
    There are different APIs, which are available and you can write, these APIs is using Java
  • 00:40:05
    program, scala or a Python.
  • 00:40:07
    There is also an interactive interpreter available, accessed through Scala and Python.
  • 00:40:15
    Standalone applications are also possible; there are many applications. And for performance, we see that Java and Scala
  • 00:40:22
    are faster, thanks to the static typing.
  • 00:40:26
    Now let us see, the hands-on session, how we can perform, using Spark.
  • 00:40:36
    So with Spark, we can run Scala, so a Spark shell will be created.
  • 00:40:42
    And we can download the dataset file, and then it can be built using
  • 00:40:53
    the package, and then this particular task or data file can be submitted to
  • 00:41:01
    the master node of that Spark system.
  • 00:41:07
    So Spark can directly store it into the system, and now it can perform various operations
  • 00:41:16
    on this particular dataset, using a Scala program.
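
    A minimal sketch of such a hands-on session, assuming the Spark shell has been started so that sc already exists, with a placeholder path for the downloaded dataset; the same lines could also be packaged into a standalone application and submitted to the master.

        // Typed at the spark-shell prompt, where `sc` is already defined.
        val data = sc.textFile("/tmp/dataset.txt")             // placeholder path

        // Explore the dataset interactively.
        println(data.count())                                  // number of lines
        data.filter(_.nonEmpty).take(10).foreach(println)      // first non-empty lines

        // Cache it if further interactive queries will reuse the same data.
        data.cache()
        println(data.filter(_.contains("spark")).count())
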
  • 00:41:20
    Now the summary: the concepts of MapReduce are limited to a single pass; MapReduce
  • 00:41:30
    is basically limiting various other applications, and this particular concept avoids
  • 00:41:38
    storing intermediate results on the disk or on HDFS.
  • 00:41:43
    Also, speed-up of computations is required when reusing the datasets, and all these features
  • 00:41:51
    are available as part of Spark, as we have seen, using RDDs.
  • 00:42:00
    So using RDDs, Spark now provides not only MapReduce operations; beyond MapReduce,
  • 00:42:15
    it can also be used. The second thing is that it can do in-memory operations, not necessarily needing
  • 00:42:26
    HDFS to store the intermediate results.
  • 00:42:32
    So this way of in-memory computations, will make the speed-up and brings about the efficiency,
  • 00:42:42
    data sharing across different iterations.
  • 00:42:47
    So iterative and interactive applications both are, easily supported and Map Reduce
  • 00:42:51
    and non Map Reduce applications are also supported, by the Spark system.
  • 00:43:19
    So all this is possible, with the help of RDDs and their operations.
  • 00:43:24
    So we have seen now that RDDs, and hence Spark, are very much required, and all the
  • 00:43:32
    drawbacks of MapReduce and Hadoop are now not there with Spark, and therefore
  • 00:43:39
    Spark now has various new applications.
  • 00:43:41
    For example, the Spark system will, Spark.
  • 00:43:48
    So Spark, as a core, can be used for building different applications,
  • 00:44:00
    such as Spark MLlib, that is, machine learning over Spark; then Spark Streaming,
  • 00:44:13
    that is, real-time applications over Spark; and Spark GraphX, the graph computation
  • 00:44:25
    over Spark.
  • 00:44:27
    So, now Spark can use HDFS or may not use HDFS, Spark is independent.
  • 00:44:38
    Therefore, let us, conclude this discussion, that RDD, is resilient distributed datasets,
  • 00:44:45
    will provide a simple and efficient programming model, for different supporting various applications,
  • 00:44:51
    which are the batch and interactive and iterative applications, all are supported using this
  • 00:45:03
    concept, which is called, ‘RDDs’.
  • 00:45:06
    Now, this is generalized, to a broad set of applications.
  • 00:45:10
    And it will, leverage the coarse-grained nature of parallel algorithm, for fault recovery.
  • 00:45:17
    So that is why, this is a hundred times, faster, compared to the, traditional Map Reduce.
  • 00:45:30
    So, Spark is 100 times faster, that is what is compared, with the performance, by the
  • 00:45:41
    Spark production clusters.
  • 00:45:47
    Thank you.
Tags
  • Spark
  • RDD
  • MapReduce
  • big data
  • fault tolerance
  • data processing
  • transformations
  • actions
  • DAG
  • performance