Tuesday, 16 September 2014

Introduction of Spark

Chapter 1 : Fundamental Elements and purpose of Study
1.1 Identification and Area of study:
In cluster computing data storage cost per GB and whole world using cluster
computing 2.7 zettabytes to storage data. Visualization of this data is mostly generated by
YouTube, face book. Big data solutions are Hadoop and Apache Spark. Apache Spark is 100
times faster than hadoop and map reduces. Apache spark is used by Amazon, Yahoo, and
group on.
Apache Spark is an open-source analytics cluster computing framework developed in
AMP Lab at UC Berkeley [8]. Apache spark is general-purpose cluster computing system
with the goal of outperforming disk-based engine like Hadoop. Spark is an implementation of
Resilient Distributed Datasets (RDD)[5] .IT provides parallel in memory processing where as
traditionally Hadoop focused on Map Reduce and distributed Storage. It provides high-level
APIs in Java, Scala, and Python and soon R. Spark enables applications in Hadoop clusters to
run up to 100xs faster in memory and 10x faster running on disk. It comes with a built-in set
of over 80 high-level operators. Spark is executing Map Reduce graphs, achieving high
performance batch processing in Hadoop. There are many mechanisms which can improve
apache Hadoop performance in cluster computing system similarly we can improve their
types of mechanism in apache spark. Still there are few areas to improve the performance of
spark.
1.2 Basic Concepts:
Spark is a computational engine that is responsible for scheduling, distributing, and
monitoring applications consisting of many computational tasks across many worker
machines or a computing cluster. Because the core engine of Spark is both fast and general purpose,
it powers multiple higher-level components specialized for various workloads such
as SQL or machine learning. Spark offers an integrated framework for advanced analytics
including a machine learning library (MLLib), a graph engine (GraphX), a streaming
analytics engine (Spark Streaming) and a fast interactive query tool (Shark) [6].First of all
libraries and higher level components in the stack benefit from improvements at the lower
layers. For example, when Spark’s core engine adds an optimization, SQL and machine
learning libraries automatically speed up. Second, the costs associated with running the stack
are minimized because instead of running 5-10 independent software systems an organization
only needs to run one. This also means that each time a new component is added to the Spark
stack, every organization that uses Spark will immediately be able to try this new component.
This changes the cost of trying out a new type of data analysis from downloading, deploying,
and learning a new software project to upgrading Spark.
Scala is a modern multi-paradigm programming language designed to express
common programming patterns in a concise, elegant, and type-safe way [11]. It smoothly
integrates features of object-oriented and functional languages. Scala is object-oriented,
functional, statically typed. Scala is a high-level API which use in Spark.

Figure : Spark Stack