Tuesday, 16 September 2014

Introduction to Spark

Chapter 1: Fundamental Elements and Purpose of Study
1.1 Identification and Area of Study:
Cluster computing has driven down the cost of data storage per GB, and the world now
stores an estimated 2.7 zettabytes of data on such systems. Much of this data is generated by
services such as YouTube and Facebook. Hadoop and Apache Spark are two leading big data
solutions. Apache Spark can run up to 100 times faster than Hadoop MapReduce, and it is
used by companies such as Amazon, Yahoo, and Groupon.
Apache Spark is an open-source analytics cluster computing framework developed in
the AMPLab at UC Berkeley [8]. It is a general-purpose cluster computing system designed
to outperform disk-based engines such as Hadoop MapReduce. Spark is an implementation of
Resilient Distributed Datasets (RDDs) [5]: it provides parallel in-memory processing,
whereas Hadoop has traditionally focused on MapReduce and distributed storage. Spark
offers high-level APIs in Java, Scala, and Python, with R support coming soon. It enables
applications in Hadoop clusters to run up to 100x faster in memory and up to 10x faster on
disk, and it comes with a built-in set of over 80 high-level operators. Spark can also execute
MapReduce-style computation graphs, achieving high-performance batch processing on
Hadoop. Many mechanisms have been proposed to improve Apache Hadoop's performance
in cluster computing, and similar mechanisms can be applied to Apache Spark; there remain
several areas in which Spark's performance can be improved.
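The functional operators that the RDD API exposes (map, filter, reduce, and so on) can be illustrated with plain Scala collections. The sketch below mirrors the style of a Spark program but runs without a cluster; the object and value names are illustrative, not part of Spark's API.

```scala
// A minimal sketch of the RDD-style programming model, using a plain
// Scala collection in place of a distributed dataset. In real Spark
// code, `data` would be an RDD created via sc.parallelize(...) or
// sc.textFile(...), and each transformation would run in parallel
// across the cluster's worker machines.
object RddStyleSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

    // Transformations: in Spark these lazily describe the computation;
    // here they run eagerly on the local collection.
    val evens   = data.filter(_ % 2 == 0)  // keep even numbers
    val squares = evens.map(n => n * n)    // square each one

    // Action: in Spark, reduce brings a single result back to the driver.
    val total = squares.reduce(_ + _)      // 4 + 16 + 36 + 64 + 100

    println(total)                         // prints 220
  }
}
```

In actual Spark code the transformations are not executed until an action such as `reduce` is called, which is what lets the engine keep intermediate results in memory instead of writing them to disk between stages.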
1.2 Basic Concepts:
Spark is a computational engine that is responsible for scheduling, distributing, and
monitoring applications consisting of many computational tasks across many worker
machines or a computing cluster. Because the core engine of Spark is both fast and general purpose,
it powers multiple higher-level components specialized for various workloads such
as SQL or machine learning. Spark offers an integrated framework for advanced analytics
including a machine learning library (MLlib), a graph engine (GraphX), a streaming
analytics engine (Spark Streaming), and a fast interactive query tool (Shark) [6]. First, all
libraries and higher-level components in the stack benefit from improvements at the lower
layers: for example, when Spark's core engine adds an optimization, the SQL and machine
learning libraries automatically speed up. Second, the costs associated with running the stack
are minimized: instead of running 5-10 independent software systems, an organization
only needs to run one. This also means that each time a new component is added to the Spark
stack, every organization that uses Spark will immediately be able to try this new component.
This changes the cost of trying out a new type of data analysis from downloading, deploying,
and learning a new software project to upgrading Spark.
Scala is a modern multi-paradigm programming language designed to express
common programming patterns in a concise, elegant, and type-safe way [11]. It smoothly
integrates features of object-oriented and functional languages: Scala is object-oriented,
functional, and statically typed. Spark itself is written in Scala, and Scala is one of the
high-level APIs that Spark exposes.
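Scala's blend of object-oriented and functional features can be shown in a few lines. The sketch below is purely illustrative (the `Employee` class and `payrollAbove` function are invented for this example): a case class provides a concise, immutable, type-safe data definition, while higher-order collection methods support a functional style.

```scala
// Object-oriented: a case class is a concise, immutable, type-safe
// data definition with equality and a constructor generated for free.
case class Employee(name: String, salary: Double)

object ScalaDemo {
  // Functional: a pure function built from higher-order collection
  // operators, computing the total payroll of everyone earning at
  // least the given threshold.
  def payrollAbove(staff: List[Employee], threshold: Double): Double =
    staff.filter(_.salary >= threshold).map(_.salary).sum

  def main(args: Array[String]): Unit = {
    val staff = List(Employee("Ada", 120.0), Employee("Bob", 80.0))
    println(payrollAbove(staff, 100.0)) // prints 120.0
  }
}
```

The static type system catches mistakes (such as passing a `String` where a `Double` is expected) at compile time, which is one reason Spark's core is written in this style.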

Figure: The Spark stack
