I. Introduction to the Spark Driver
The Spark Driver controls the workflow, and the Spark workers launch executors responsible for executing the part of the job submitted to the Spark driver through the cluster nodes. The Spark driver has four components: 1) RDD, 2) Scheduler, 3) Serializer, and 4) Shuffle. A Spark worker has two components: 1) Task and 2) Block Manager.
1) RDD: Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. An RDD represents a partitioned collection of elements that can be operated on in parallel.
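As a small illustration of this abstraction, the following Scala sketch (with made-up numbers and a local master, purely for demonstration) builds a partitioned RDD and operates on it in parallel.

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Local SparkContext, only for illustration.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-example").setMaster("local[4]"))

    // An RDD: a partitioned collection of elements (4 partitions here)
    // that can be operated on in parallel.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)
    val sumOfSquares = numbers.map(x => x.toLong * x).reduce(_ + _)

    println(s"sum of squares = $sumOfSquares")
    sc.stop()
  }
}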
2) Scheduler: Spark's scheduler operates on the graph representation of RDDs. The scheduler assigns tasks to machines based on data locality using delay scheduling.
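Delay scheduling means the scheduler briefly holds a task back in the hope of placing it next to its data before falling back to a less local slot. A minimal sketch of the corresponding configuration knobs is shown below; the wait values are examples only.

import org.apache.spark.SparkConf

object LocalityConfig {
  // How long the scheduler waits for a data-local slot before giving up
  // and launching the task on a less local one (delay scheduling).
  val conf = new SparkConf()
    .setAppName("locality-example")
    .set("spark.locality.wait", "3s")        // general fallback wait
    .set("spark.locality.wait.node", "3s")   // wait for node-local before rack-local
}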
3) Serializer: Spark's serializer uses Java's built-in serialization by default. It is used to read streams of serialized objects.
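The serializer is pluggable through configuration. The sketch below shows the relevant setting; the Java serializer is the default, and Kryo is mentioned only as the usual alternative, not as something discussed above.

import org.apache.spark.SparkConf

object SerializerConfig {
  val conf = new SparkConf()
    .setAppName("serializer-example")
    // Java's built-in serialization (Spark's default serializer).
    .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    // Alternative: .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
}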
4) Shuffle: In Spark, a shuffle creates a large number of shuffle files (M*R, for M map tasks and R reduce partitions). With shuffle file consolidation, a shuffle file is maintained for each reduce partition per core C rather than per map task M, so every machine needs to handle only C*R shuffle files rather than M*R.
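The difference between the two counts is easy to see with concrete (hypothetical) numbers:

object ShuffleFileCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical job size, purely to illustrate the M*R vs. C*R arithmetic above.
    val m = 1000 // map tasks
    val r = 200  // reduce partitions
    val c = 16   // cores per machine

    println(s"one file per (map task, reduce partition): ${m * r}") // 200000
    println(s"one file per (core, reduce partition):     ${c * r}") // 3200
  }
}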
Fig. 1. Spark Driver Execution flow
II. Optimization and Latency Hiding
A. Optimization in Spark
In Apache Spark, optimization is implemented using shuffling techniques, and in this paper we use a shuffling technique for optimization. To optimize shuffle performance, the possible approaches are: 1) to emulate Hadoop's behavior by merging intermediate files, 2) to create larger shuffle files, and 3) to use columnar compression to shift the bottleneck to the CPU. However, the compressibility of shuffle data was low, and columnar compression involves significant computation and metadata maintenance for splitting data into columns on the map side and reconstructing them into rows on the reduce side.
One of the key optimization factors is the shuffle; the other two factors are the sorting algorithm and the external sorting service. In this paper we choose to optimize shuffle file performance in the Spark distributed computing platform.
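Several of the approaches above map onto ordinary configuration knobs. The sketch below lists them; availability and defaults depend on the Spark version (spark.shuffle.consolidateFiles, for example, exists only in older 1.x releases), so treat it as a sketch rather than a recipe.

import org.apache.spark.SparkConf

object ShuffleTuning {
  val conf = new SparkConf()
    .set("spark.shuffle.compress", "true")         // compress map outputs
    .set("spark.shuffle.spill.compress", "true")   // compress data spilled during shuffles
    .set("spark.shuffle.consolidateFiles", "true") // merge map outputs into fewer, larger files
}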
Optimization limitations:
· Lack of iteration support
· High latency due to persisting intermediate data onto disk
B. Latency Hiding in Spark
Apache Spark has taken the strengths of Hadoop, improved on Hadoop's weaknesses, and provides a more efficient batch processing framework with much lower latency. Prefetching is a latency hiding technique: it increases the parallelism of tasks so as to keep the CPU busy by fetching data before it is required.
Prefetching is motivated by several factors. Emerging trends such as moving data and services into the cloud mean that services will suffer from latency. A real-time data prefetching algorithm hides this latency by pushing shuffle files prior to the completion of the map phase.
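The idea can be illustrated outside of Spark with a generic sketch: while one block of data is being processed, the next block is already being fetched in the background, so the fetch latency is hidden behind computation. The block sizes, delays, and helper names below are invented for the example.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object PrefetchSketch {
  // Stand-ins for a remote fetch and for useful work on the fetched data.
  def fetchBlock(id: Int): Array[Byte] = { Thread.sleep(50); Array.fill(1024)(id.toByte) }
  def process(block: Array[Byte]): Long = block.map(_.toLong).sum

  def main(args: Array[String]): Unit = {
    val blockIds = 0 until 10
    var next: Future[Array[Byte]] = Future(fetchBlock(blockIds.head)) // prefetch the first block
    var total = 0L
    for (i <- blockIds) {
      val current = Await.result(next, Duration.Inf)              // block fetched earlier
      if (i + 1 < blockIds.size) next = Future(fetchBlock(i + 1)) // start fetching the next block
      total += process(current)                                   // compute overlaps the fetch
    }
    println(s"total = $total")
  }
}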
III. Shuffle Overview
The shuffle phase is a component of the Spark driver. A shuffle is a communication between an input RDD and an output RDD. Each shuffle has a fixed number of map tasks and a fixed number of reduce partitions. The shuffle writer and shuffle reader handle the I/O for a particular task, operating on iterators of RDD elements. When a shuffle has an Aggregator, the shuffle manager and its readers and writers are responsible for external spilling.
The shuffle works in two stages: 1) the shuffle writes intermediate files to disk, and 2) these files are fetched by the next stage of tasks.
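A wide transformation such as reduceByKey shows both stages: the first stage writes intermediate shuffle files, and the next stage's tasks fetch them. The sketch below uses a local master and toy data purely for illustration.

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-stages").setMaster("local[2]"))

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    // reduceByKey introduces a shuffle boundary: the first stage writes shuffle
    // files to disk, and the next stage's tasks fetch them to aggregate by key.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}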
The shuffle operation is implemented differently in Spark compared to Hadoop. The values of M and R in Hadoop are much lower, and since the number of shuffle files in Spark scales with M*R, a smaller number of map and reduce tasks would provide more justification for the way Spark handles shuffle files on the map side.
Fig. 2. Shuffle Work
A. Map Side Shuffle
Each map task in
Spark writes outs a shuffle file for every reducer. These files are not
intermediately in the sense that Spark does not merge them into large
partition. Since scheduling overhead in Spark is much lesser the no. of mappers
(M) and reducers ® is higher than in Hadoop. Thus, shipping M*R files to the
respective reduces could result in significant overheads.
Spark also provides a parameter to specify the compression library used to compress map outputs. A codec that uses only about 33 KB of buffer for each open file significantly reduces the risk of encountering an out-of-memory error.
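The corresponding settings are sketched below. Which codec keeps the per-file buffer near 33 KB depends on the Spark version and codec implementation; "snappy" is shown only as an example value, not as a claim from the text above.

import org.apache.spark.SparkConf

object MapOutputCompression {
  val conf = new SparkConf()
    .set("spark.shuffle.compress", "true")        // compress map outputs
    .set("spark.io.compression.codec", "snappy")  // compression library used for them
}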
B. Reduce side Shuffle
A major difference
between Hadoop and Spark is on the reduce side. Spark requires to all shuffle
data to fit in memory of the corresponding reduce task. Where the reducer task
demands all shuffle data for a GroupByKey or a ReduceByKey operation. Spark
throws an out-of-memory exception. The Shuffle is a pull operation in spark
compared to push operation in Hadoop.
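One consequence is that the amount of data per reduce task matters. A hedged sketch: shuffling into more reduce partitions (the value 200 below is arbitrary) gives each reduce task a smaller share of the shuffle data to hold in memory.

import org.apache.spark.{SparkConf, SparkContext}

object ReduceSideMemory {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-side").setMaster("local[2]"))
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))

    // Each reduce task must hold its share of the shuffled data in memory;
    // more partitions means a smaller share per task.
    val grouped = pairs.groupByKey(numPartitions = 200)
    println(grouped.count())
    sc.stop()
  }
}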
Each reducer also maintains a network buffer to fetch map outputs. The size of this buffer is specified through a configuration parameter.
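In recent Spark versions that parameter is spark.reducer.maxSizeInFlight (older 1.x releases used spark.reducer.maxMbInFlight); the sketch below simply sets it to its documented default.

import org.apache.spark.SparkConf

object ReducerFetchBuffer {
  // Size of the in-flight network buffer used to fetch map outputs on the reduce side.
  val conf = new SparkConf().set("spark.reducer.maxSizeInFlight", "48m")
}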
IV. Shuffle Techniques
Spark shuffling uses two techniques: 1) sort-based shuffle and 2) hash-based shuffle.
A. Sort-based Shuffle
A sort-based shuffle can be more scalable than Spark's current hash-based one because it does not require writing a separate file for each reduce task from each mapper.
Each map task will produce one or more output files sorted by each key's partition ID, then merge-sort them into a single output file. Once the map tasks have produced their files, reducers will be able to request ranges of those files to get their particular data. An index file is written for each output file saying where each partition is located, and the Block Manager is updated to support using this index.
Map tasks will write data through a SortedFileWriter that creates one or more sorted files, merges them, and then creates an index file for the merged file. The SortedFileWriter must reset the compression and serialization streams when writing each range. The SortedFileWriter works as follows (a sketch of the procedure is given after the steps below):
1) Given a stream of incoming key-value pairs, first write them into in-memory buckets based on their partition ID. These buckets can be ArrayBuffers, one for each partition ID.
2) When the total size of the buckets gets too large, write the current in-memory output to a new file. This intermediate file will contain a header saying at which position each partition ID starts.
Fig. 3. Sort-based shuffle
3) After all intermediate files
are written, merge-sort them into a final file.
4) When writing the final file, reset the serialization and compression streams after writing each partition, and track the byte position of each partition to create an index file.
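The following self-contained sketch mirrors those four steps. It is not Spark's SortedFileWriter: the class and method names are invented, records are plain strings, within-partition sorting and stream compression are omitted, and the merge step reloads the spills instead of streaming a merge-sort over them.

import java.io._
import scala.collection.mutable.ArrayBuffer

class SketchSortedFileWriter(numPartitions: Int, spillThreshold: Int, dir: File) {
  // Step 1: one in-memory bucket (ArrayBuffer) per partition ID.
  private val buckets = Array.fill(numPartitions)(ArrayBuffer.empty[String])
  private var buffered = 0
  private val spillFiles = ArrayBuffer.empty[File]

  def insert(partitionId: Int, record: String): Unit = {
    buckets(partitionId) += record
    buffered += 1
    if (buffered >= spillThreshold) spill() // Step 2: spill when the buckets grow too large.
  }

  // Step 2: write the current in-memory buckets to a new intermediate file,
  // partition by partition, tagging each record with its partition ID.
  private def spill(): Unit = {
    val f = File.createTempFile("spill", ".dat", dir)
    val out = new PrintWriter(f)
    for (p <- 0 until numPartitions; rec <- buckets(p)) out.println(s"$p\t$rec")
    out.close()
    spillFiles += f
    buckets.foreach(_.clear())
    buffered = 0
  }

  // Steps 3 and 4: merge the intermediate files into one final data file and
  // record the byte offset at which each partition starts in an index file.
  def commit(dataFile: File, indexFile: File): Unit = {
    spill() // flush whatever is still in memory
    val perPartition = Array.fill(numPartitions)(ArrayBuffer.empty[String])
    for (f <- spillFiles; line <- scala.io.Source.fromFile(f).getLines()) {
      val Array(p, rec) = line.split("\t", 2)
      perPartition(p.toInt) += rec
    }
    val out = new FileOutputStream(dataFile)
    val index = new PrintWriter(indexFile)
    var written = 0L
    for (p <- 0 until numPartitions) {
      index.println(s"$p\t$written") // byte position where partition p begins
      val bytes = perPartition(p).map(_ + "\n").mkString.getBytes("UTF-8")
      out.write(bytes)
      written += bytes.length
    }
    out.close()
    index.close()
  }
}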
Reduce tasks will fetch and hash together data in the same way, using an ExternalAppendOnlyMap. They merge all spilled files at once, even when there is a huge number of files.
B. Hash-based Shuffle
A hash-based shuffle is the default for shuffling data, but starting in Spark 1.1 there is an experimental sort-based shuffle that is more memory-efficient in environments with small executors. How much the sort-based shuffle improves performance depends on how well the hash-based realization of the shuffle performs.
Hash shuffle writes its files into a set of 64 subdirectories created on each disk. It performs a large number of random writes, each of which is relatively small. A hash-based shuffle requires a hash shuffle reader to read the intermediate files from the mapper side, and it uses BlockStoreShuffle to store and fetch the shuffle files.
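Assuming a Spark 1.x release, the shuffle implementation is selected with the spark.shuffle.manager setting, as sketched below; "hash" was the original default and "sort" the alternative introduced in 1.1 (recent Spark versions have dropped the hash-based implementation).

import org.apache.spark.SparkConf

object ShuffleManagerChoice {
  val hashConf = new SparkConf().set("spark.shuffle.manager", "hash") // hash-based shuffle
  val sortConf = new SparkConf().set("spark.shuffle.manager", "sort") // sort-based shuffle
}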
Fig. 4. Hash-based shuffle