Apache Spark is an engine for large-scale data processing. It is optimized for running multiple parallel operations on the same data set, a pattern that occurs in many iterative machine learning tasks.
In Spark, data is stored as resilient distributed datasets (RDDs), i.e., immutable, persistent collections of objects that are partitioned across several computers. Two types of operations are defined on RDDs:
- Transformations (RDD x Function -> RDD): apply a function to an RDD and create a new RDD
- Actions (RDD x Function -> Object): compute a result from an RDD
Operations that create a new RDD are executed lazily: Spark records the sequence of operations performed on an RDD instead of running them immediately. The recorded operations are executed only when (i) an action is performed and a result has to be computed or (ii) the computation of an RDD is explicitly requested.
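The lazy model described above can be illustrated with a small single-machine sketch in plain Python. This is not Spark's API: the class `ToyRDD` and its internals are hypothetical names invented for illustration, and partitioning and fault tolerance are left out. It only shows that operations of the form RDD x Function -> RDD record a step, while operations of the form RDD x Function -> Object trigger the recorded steps.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not part of Spark)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable input data
        self._lineage = tuple(lineage)  # recorded steps, not yet executed

    # RDD x Function -> RDD: only extend the recorded lineage.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        # Replay the recorded steps over the source data.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # RDD x Function -> Object: run the lineage and return a result.
    def reduce(self, fn):
        return fold(fn, self._compute())

    def collect(self):
        return list(self._compute())


calls = []  # records every invocation of square()


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)         # still empty: nothing has run yet

total = evens.reduce(lambda a, b: a + b)  # running reduce executes the lineage
```

Each call to `map` or `filter` returns a new `ToyRDD` and leaves the original untouched, which mirrors the immutability of RDDs. `square` is first invoked when `reduce` runs, not when `map` is called.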
The execution of a Spark application starts with the driver program. The driver acquires resources from an external cluster management system such as YARN and instructs that system to start the processes of the application.
Newer versions of Spark include high-level APIs for working with data, including Spark SQL, DataFrames, and Datasets (the Dataset API is not currently available in Python). These APIs give Spark additional information about the structure of the data and of the computation, beyond what an RDD provides, which enables extra optimizations.
- Apache Spark - official web site
- Spark Cluster Overview
- Spark Programming Guide
- Spark SQL, DataFrames and Datasets Guide
- Wikipedia Article - Apache Spark
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10), 2010, pp. 10-10, USENIX Association, Berkeley, CA, USA