Sunday, 8 February 2015


  I.     Introduction of Spark Driver

The Spark Driver controls the workflow, while Spark workers launch executors responsible for executing the parts of the job submitted to the driver across the cluster nodes. The Spark Driver has four components: 1) RDD, 2) Scheduler, 3) Serializer, and 4) Shuffle. A Spark Worker has two components: 1) Task and 2) Block Manager.
1) RDD: Resilient Distributed Datasets (RDDs) are the basic abstraction in Spark. An RDD represents a partitioned collection of elements that can be operated on in parallel.
2) Scheduler: Spark’s scheduler uses the lineage representation of RDDs. The scheduler assigns tasks to machines based on data locality using delay scheduling.
3) Serializer: Spark’s serializer by default uses Java’s built-in serialization. It is used to serialize objects into a stream and to read serialized objects back from that stream.
4) Shuffle: In Spark, the shuffle creates a large number of shuffle files (M*R). With consolidation, a shuffle file is maintained for each reduce partition per core C rather than per map task M, so every machine needs to handle only C*R shuffle files rather than M*R. A short code sketch of these components follows below.
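To make this division of labour concrete, the following minimal sketch (Scala, assuming a local master and the Spark 1.x-era API; the application name and data are illustrative) builds an RDD as a partitioned collection and triggers a shuffle that the driver schedules across the workers.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations on older 1.x releases

    object RddShuffleSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-shuffle-sketch").setMaster("local[4]"))

        // RDD: a partitioned collection of elements operated on in parallel (4 partitions here).
        val words = sc.parallelize(Seq("spark", "shuffle", "spark", "driver"), numSlices = 4)

        // A narrow transformation: no shuffle files are written.
        val pairs = words.map(w => (w, 1))

        // reduceByKey forces a shuffle: map tasks write shuffle files that the reduce-side
        // tasks of the next stage fetch through each worker's Block Manager.
        val counts = pairs.reduceByKey(_ + _)

        counts.collect().foreach(println)
        sc.stop()
      }
    }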


Fig. 1. Spark Driver Execution flow

II.     Optimization and Latency Hiding

A.    Optimization in Spark

In Apache Spark, optimization is implemented using shuffling techniques, and in this paper we use the shuffle as the basis for optimization. To optimize shuffle performance there are three possible approaches: 1) merge the intermediate shuffle files, 2) create large shuffle files, and 3) use columnar compression to shift the bottleneck to the CPU. Shuffling requires shuffle data; however, the compressibility of Spark’s shuffle data was low, and columnar compression involves significant computation and metadata maintenance for splitting data into columns on the map side and reconstructing them into rows on the reduce side.
One of the key optimization factors is the shuffle; the other two factors are the sorting algorithm and the external sorting service. In this paper we choose to optimize shuffle file performance in the Spark distributed computing platform. A minimal configuration sketch of the first two approaches is given below.
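As a hedged illustration, the spark-shell style configuration below turns on shuffle-file consolidation and map-output compression; the property names are those of the Spark 1.x line this paper targets and should be checked against the documentation of the version in use.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-optimization")
      // Keep one shuffle file per reduce partition per core (C*R files) instead of one
      // per map task per reduce partition (M*R files).
      .set("spark.shuffle.consolidateFiles", "true")
      // Compress map outputs before they are written as shuffle files.
      .set("spark.shuffle.compress", "true")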
Optimization Limitation:
·         Lack of iteration support
·         High latency due to persisting intermediate data onto disk

B.    Latency Hiding in Spark

Apache Spark has taken the strengths of Hadoop, improved on Hadoop’s weaknesses, and provides a more efficient batch processing framework with much lower latency. Prefetching is a latency-hiding technique: it increases the parallelism of tasks so as to keep the CPU busy by fetching data before it is required.
Prefetching depends on several factors. Emerging trends such as putting data and services into the cloud mean that services will suffer from latency. A real-time data prefetching algorithm is based on pushing shuffle files to reducers prior to the completion of the map phase, as the sketch below illustrates.
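The following is a purely illustrative sketch of that idea, not Spark’s implementation: every name in it (MapOutput, fetchBlock, onMapTaskCompleted) is hypothetical. It shows the shape of the technique, overlapping shuffle-file transfers with the remaining map work instead of waiting for the whole map phase to finish.

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    // Hypothetical description of one finished map task's shuffle output.
    case class MapOutput(mapId: Int, location: String)

    object ShufflePrefetcher {
      private implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

      // Hypothetical transfer of one shuffle file; a real system would read it over the
      // network from the remote Block Manager.
      private def fetchBlock(out: MapOutput): Array[Byte] = Array.emptyByteArray

      // Start fetching a map task's output the moment that task reports completion, so the
      // transfer overlaps the still-running map tasks (latency hiding).
      def onMapTaskCompleted(out: MapOutput): Future[Array[Byte]] = Future(fetchBlock(out))
    }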

 III.    Shuffle Overview

The shuffle phase is a component of the Spark Driver. A shuffle is a communication between an input RDD and an output RDD. Each shuffle has a fixed number of map tasks and a fixed number of reduce partitions. A shuffle writer and a shuffle reader handle the I/O for a particular task, operating on iterators of RDD elements. When a shuffle has an Aggregator, the Shuffle Manager and its readers and writers are responsible for external spilling.
The shuffle works in two stages: 1) shuffle writes, in which intermediate files are written to disk, and 2) shuffle fetches, in which those files are fetched by the tasks of the next stage. The lineage sketch below shows where this stage boundary appears.
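A small spark-shell sketch (assuming an existing SparkContext named sc; the data and key function are arbitrary) makes the stage boundary visible in the RDD lineage.

    import org.apache.spark.SparkContext._   // pair-RDD operations on older 1.x releases

    val pairs   = sc.parallelize(1 to 1000).map(i => (i % 10, i))
    val grouped = pairs.groupByKey()   // stage boundary: shuffle write, then shuffle fetch

    // toDebugString prints the lineage; the ShuffledRDD marks where one stage writes the
    // intermediate files and the next stage fetches them.
    println(grouped.toDebugString)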


The shuffle operation is implemented differently in Spark compared to Hadoop. The values of M and R in Hadoop are much smaller than in Spark. Since the number of shuffle files in Spark scales with M*R, a smaller number of map and reduce tasks would give more justification for the way Spark handles shuffle files on the map side; the worked example below shows how quickly the file count grows.
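A worked example with assumed job sizes (the concrete numbers below are illustrative, not measurements) shows the difference in scale between M*R and C*R.

    // M, R and C are illustrative values, not benchmark figures.
    val mapTasks     = 1000   // M
    val reduceTasks  = 1000   // R
    val coresPerNode = 16     // C

    val filesPerJob     = mapTasks * reduceTasks      // M*R = 1,000,000 shuffle files overall
    val filesPerMachine = coresPerNode * reduceTasks  // C*R = 16,000 files handled per machine

    println(s"without consolidation: $filesPerJob, with consolidation per machine: $filesPerMachine")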






Fig. 2. Shuffle Work

A.    Map Side Shuffle

Each map task in Spark writes out a shuffle file for every reducer. These files are not intermediate in the sense that Spark does not merge them into larger partitioned files. Since the scheduling overhead in Spark is much lower, the number of mappers (M) and reducers (R) is higher than in Hadoop. Thus, shipping M*R files to the respective reducers could result in significant overheads.
Spark also provides a parameter to specify the compression library used to compress map outputs; the default codec uses a 33 KB buffer for each open file and significantly reduces the risk of encountering out-of-memory errors. A hedged configuration sketch follows below.
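A spark-shell style sketch of those parameters; the exact property names and the default codec depend on the Spark version in use, so verify them against its configuration documentation.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")        // compress map outputs when writing shuffle files
      .set("spark.io.compression.codec", "snappy")  // codec choice; each open stream keeps a small buffer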

B.    Reduce Side Shuffle

A major difference between Hadoop and Spark is on the reduce side. Spark requires all shuffle data to fit in the memory of the corresponding reduce task; where a reduce task demands all shuffle data for a groupByKey or a reduceByKey operation and it does not fit, Spark throws an out-of-memory exception. The shuffle is a pull operation in Spark, compared to a push operation in Hadoop.
Each reducer must also maintain a network buffer to fetch map outputs. The size of this buffer is specified through a configuration parameter. The short sketch below shows why reduceByKey is gentler on reduce-side memory than groupByKey.
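The contrast is easiest to see on a pair RDD. In the sketch below, events: RDD[(String, Int)] is assumed to exist already; groupByKey pulls every value for a key into one reduce task's memory, while reduceByKey combines values on the map side first, so far less shuffle data has to fit in memory.

    // assumes: val events: RDD[(String, Int)] = ...
    val riskyTotals = events.groupByKey().mapValues(_.sum)   // all values per key held in memory; may OOM
    val safeTotals  = events.reduceByKey(_ + _)              // map-side combine, much smaller shuffle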

 IV.   Shuffle Techniques

Spark shuffling uses two techniques: 1) sort-based shuffle and 2) hash-based shuffle.

A.    Sort-based Shuffle

A sort-based shuffle can be more scalable than Spark’s current hash-based one because it does not require writing a separate file for each reduce task from each mapper.
Each map task will produce one or more output files sorted by the key’s partition ID and then merge-sort them into a single output file. Once the map tasks have produced their files, reducers will be able to request ranges of those files to get their particular data. An index file for each output file records where each partition is located, and the Block Manager is updated to support using this index.
Map tasks will write data through a SortedFileWriter that creates one or more sorted files, merges them, and then creates an index file for the merged file. The SortedFileWriter must reset the compression and serialization streams when writing each range.
The SortedFileWriter works as follows (a simplified sketch is given after the steps below):
     1) Given a stream of incoming key-value pairs, first write them into buckets in memory based on their partition ID. These buckets can be ArrayBuffers, one per partition ID.
     2) When the total size of the buckets gets too large, write the current in-memory output to a new file. This intermediate file will contain a header saying at which position each partition ID starts.


Fig. 3. Sort-Based Shuffle
3) After all intermediate files are written, merge-sort them into a final file.
4) When writing the final file, reset the serialization and compression streams after writing each partition and track the byte position of each partition to create an index file.
     Reduce tasks will fetch and hash together the data in the same way, using an ExternalAppendOnlyMap. They merge all spilled files at once if there is a huge number of files.
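The sketch below is a simplified, hypothetical rendering of steps 1-4; it is not Spark’s actual SortedFileWriter. Spilling of multiple intermediate files, their merge-sort, and the per-partition reset of serialization/compression streams are elided, and the record type is fixed to string pairs for brevity.

    import java.io.{DataOutputStream, File, FileOutputStream}
    import scala.collection.mutable.ArrayBuffer

    class SortedFileWriterSketch(numPartitions: Int) {
      // Step 1: in-memory buckets, one ArrayBuffer of records per partition ID.
      private val buckets = Array.fill(numPartitions)(ArrayBuffer.empty[(String, String)])

      def insert(partitionId: Int, key: String, value: String): Unit =
        buckets(partitionId) += ((key, value))

      // Steps 3-4: write one data file in partition-ID order plus an index of byte offsets,
      // so a reducer can ask the Block Manager for exactly its own range.
      def writePartitionedFile(dataFile: File, indexFile: File): Unit = {
        val out     = new DataOutputStream(new FileOutputStream(dataFile))
        val offsets = new Array[Long](numPartitions + 1)
        for (p <- 0 until numPartitions) {
          offsets(p) = out.size().toLong        // where partition p starts
          buckets(p).foreach { case (k, v) => out.writeUTF(k); out.writeUTF(v) }
        }
        offsets(numPartitions) = out.size().toLong  // end of the last partition
        out.close()

        val idx = new DataOutputStream(new FileOutputStream(indexFile))
        offsets.foreach(idx.writeLong)          // the index file consulted by reducers
        idx.close()
      }
    }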

B.    Hash-based Shuffle

A hash-based shuffle is the default for shuffling data, but starting in Spark 1.1 there is an experimental sort-based shuffle that is more memory-efficient in environments with small executors. Reducing the number of files each mapper writes is how the sort-based approach improves on the performance of the hash-based realization of the shuffle.
Hash shuffle files are hashed into a set of 64 subdirectories created on each disk. The hash shuffle produces a large number of random writes when each shuffle write is relatively small. A hash-based shuffle requires a hash shuffle reader to read the intermediate files from the mapper side, and it uses the BlockStoreShuffle to store the shuffle files and read them back into the shuffle. The configuration sketch below shows how to select between the two implementations.
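A hedged configuration sketch for selecting between the two implementations; the spark.shuffle.manager property and its "hash"/"sort" values apply to the Spark 1.1+ line discussed here and should be confirmed against the documentation of the version in use.

    import org.apache.spark.SparkConf

    val hashConf = new SparkConf().set("spark.shuffle.manager", "hash")  // the default described above
    val sortConf = new SparkConf().set("spark.shuffle.manager", "sort")  // experimental alternative in 1.1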



Fig. 4. Hash-Based Shuffle


Tuesday, 16 September 2014

Introduction of Spark

Chapter 1: Fundamental Elements and Purpose of Study
1.1 Identification and Area of Study:
In cluster computing, the cost of data storage per GB keeps falling, and the world as a whole now uses cluster computing to store on the order of 2.7 zettabytes of data. Much of this data is generated by services such as YouTube and Facebook. The leading big data solutions are Hadoop and Apache Spark; Apache Spark can be up to 100 times faster than Hadoop MapReduce and is used by companies such as Amazon, Yahoo, and Groupon.
Apache Spark is an open-source analytics cluster computing framework developed in the AMP Lab at UC Berkeley [8]. Apache Spark is a general-purpose cluster computing system with the goal of outperforming disk-based engines like Hadoop. Spark is an implementation of Resilient Distributed Datasets (RDDs) [5]. It provides parallel in-memory processing, whereas traditionally Hadoop focused on MapReduce and distributed storage. It provides high-level APIs in Java, Scala, and Python, and soon R. Spark enables applications in Hadoop clusters to run up to 100x faster in memory and 10x faster on disk. It comes with a built-in set of over 80 high-level operators. Spark executes MapReduce-style graphs, achieving high-performance batch processing in Hadoop. There are many mechanisms that can improve Apache Hadoop’s performance in a cluster computing system, and similar kinds of mechanisms can improve Apache Spark. Still, there are a few areas in which the performance of Spark can be improved.
1.2 Basic Concepts:
Spark is a computational engine that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads such as SQL or machine learning. Spark offers an integrated framework for advanced analytics including a machine learning library (MLlib), a graph engine (GraphX), a streaming analytics engine (Spark Streaming), and a fast interactive query tool (Shark) [6]. First of all, libraries and higher-level components in the stack benefit from improvements at the lower layers. For example, when Spark’s core engine adds an optimization, the SQL and machine learning libraries automatically speed up. Second, the costs associated with running the stack are minimized because, instead of running 5-10 independent software systems, an organization only needs to run one. This also means that each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. This changes the cost of trying out a new type of data analysis from downloading, deploying, and learning a new software project to simply upgrading Spark. A minimal sketch of this single-stack idea follows below.
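A minimal sketch of that single-stack point, assuming the Spark 1.3+ style DataFrame API (the data and names are illustrative): the same SparkContext that runs core RDD jobs also backs the higher-level SQL library, so an optimization in the core benefits both.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    object StackSketch {
      def main(args: Array[String]): Unit = {
        val sc  = new SparkContext(new SparkConf().setAppName("stack-sketch").setMaster("local[*]"))
        val sql = new SQLContext(sc)
        import sql.implicits._

        val people = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 45)))  // core engine (RDD)
        val adults = people.toDF().filter($"age" > 40)                            // SQL library on top
        adults.show()

        sc.stop()
      }
    }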
Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way [11]. It smoothly integrates features of object-oriented and functional languages: Scala is object-oriented, functional, and statically typed. Scala is the high-level language in which Spark’s API is written and used, as the small example below illustrates.
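A tiny, self-contained illustration of those three traits (the data is made up): a case class gives the object-oriented part, higher-order functions the functional part, and the compiler checks every type below statically, which is the same style Spark's RDD API relies on.

    case class Measurement(sensor: String, value: Double)

    val readings = List(Measurement("t1", 20.5), Measurement("t1", 21.0), Measurement("t2", 19.8))

    // Group, then reduce each group: the same map/reduce style of pipeline Spark exposes.
    val averageBySensor: Map[String, Double] =
      readings
        .groupBy(_.sensor)
        .map { case (sensor, ms) => sensor -> ms.map(_.value).sum / ms.size }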

Figure: Spark Stack