Introduction to Spark

In this post, we will briefly introduce Apache Spark.

Spark is a general engine for large-scale data processing. The main differentiating factor compared to the MapReduce framework is its ability to cache intermediate results in memory.

To start with Spark, simply download a pre-compiled version from the Spark downloads page (https://spark.apache.org/downloads.html) and type bin/spark-shell.
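
The in-memory caching mentioned above is easy to try out directly in the shell. A minimal sketch (the data here is made up for illustration):

scala> val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)
scala> squares.cache()   // mark the RDD to be kept in memory after it is first computed
scala> squares.count()   // first action computes the RDD and caches its partitions
scala> squares.count()   // subsequent actions reuse the cached data instead of recomputing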

Spark’s main abstraction is called the RDD (Resilient Distributed Dataset). Below we summarize the most commonly used operations and transformations on an RDD.

RDD Input
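
An RDD can be created from an external data source or from an in-memory collection. A minimal sketch (the file name input.txt is a placeholder):

scala> val lines = sc.textFile("input.txt")   // RDD[String], one element per line of the file
scala> val nums = sc.parallelize(1 to 100)    // RDD[Int] built from a local Scala collection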

RDD Inspection Operations

The following are a few operations that are helpful for inspecting the elements of an RDD (a short example follows the list):

  • aRDD.first() : get the first element of the RDD
  • aRDD.take(10) : take the first 10 elements
  • aRDD.sample(false, 0.1) : sample roughly 10% of the elements, here without replacement
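
Using the nums RDD created above (the sampled values will differ from run to run):

scala> nums.first()                       // Int = 1
scala> nums.take(10)                      // Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> nums.sample(false, 0.1).collect()  // roughly 10 random elements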

RDD Operations

There are many operations available on an RDD, of which we highlight only the most commonly used, i.e., map, reduce, reduceByKey, filter, groupBy, and join.

scala> val a = Array((1,2), (2,4), (3,5))
a: Array[(Int, Int)] = Array((1,2), (2,4), (3,5))

scala> val aRDD = sc.makeRDD(a)
aRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at makeRDD at <console>:23
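
A minimal sketch of the highlighted operations applied to aRDD (bRDD is introduced here just to demonstrate join; results are shown in the comments, though the ordering of pair outputs may vary):

scala> aRDD.map { case (k, v) => (k, v * 2) }.collect()    // Array((1,4), (2,8), (3,10))
scala> aRDD.filter { case (_, v) => v > 2 }.collect()      // Array((2,4), (3,5))
scala> aRDD.reduceByKey(_ + _).collect()                   // keys are unique here, so the result is unchanged
scala> aRDD.map(_._2).reduce(_ + _)                        // Int = 11, the sum of all values
scala> aRDD.groupBy { case (k, _) => k % 2 }.collect()     // two groups: odd and even keys
scala> val bRDD = sc.makeRDD(Array((1, "a"), (2, "b")))
scala> aRDD.join(bRDD).collect()                           // Array((1,(2,a)), (2,(4,b)))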

RDD Output
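
The results of a computation can either be brought back to the driver program or written out to storage. A minimal sketch (the output path is a placeholder; saveAsTextFile fails if the directory already exists):

scala> aRDD.collect()                     // Array[(Int, Int)] = Array((1,2), (2,4), (3,5))
scala> aRDD.count()                       // Long = 3
scala> aRDD.saveAsTextFile("out-dir")     // writes one text file per partition under out-dir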

Written on May 31, 2015