Apache Spark is all the rage in the Hadoop ecosystem right now. It recently became an Apache top-level project, and many people see it as a successor to MapReduce.
You probably want to try it out in your own cluster, but there are a few nuances of Spark that can be difficult to figure out. Here’s how you can get past them and start exploring what Spark is all about and what it could do for you in your cluster.
The Spark home page states:
“Apache Spark is a fast and general engine for large-scale data processing”
Spark can actually be run completely outside of a Hadoop cluster by simply starting a master server and various workers. The instructions on the Spark home page are quite clear for this, so I’ll focus instead on the other way to use Spark…by executing it inside of YARN to work on your HDFS data.
These instructions assume that you have a single node cluster set up using the instructions I have posted earlier. They should also work if you have a cluster that uses Hadoop 2.2 or later (I have tried both 2.2 and 2.3).
You can download Spark from the project downloads page. You can either download the source and build Spark yourself (see the next section), or download binaries that have been built for Hadoop 2. Both will work in this situation.
To use the pre-built binaries, do the following:
Download the package labeled “Download binaries for Hadoop 2”. This is a .tgz file that contains a copy of the source distribution with binaries included. Beware…it’s big (175 MB for 0.9.0, compared to 5.5 MB for just the source). Copy this file to your Hadoop node and extract the archive, then skip ahead to the Configuring Spark section below.
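For example, assuming the 0.9.0 package, the copy-and-extract steps look roughly like this (the hostname is just a placeholder for your own node, and the exact archive name depends on the release you grabbed):
scp spark-0.9.0-incubating-bin-hadoop2.tgz youruser@your-hadoop-node:~/
ssh youruser@your-hadoop-node
tar xzf spark-0.9.0-incubating-bin-hadoop2.tgz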
Build Spark (optional)
If you want to build Spark yourself, the build process is quite easy. For Hadoop 2.2, Spark can be built using Maven. Maven 3 is required for the build.
First, extract the source archive. It will create a directory named spark-0.9.0-incubating. The top level of this directory has the Maven pom.xml that you will build against.
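For example (the exact archive name may vary depending on which package you downloaded):
tar xzf spark-0.9.0-incubating.tgz
cd spark-0.9.0-incubating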
Before running the build, make sure you give Maven a little extra memory to work with, since the build process is quite intensive. The settings I used were:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Once you have MAVEN_OPTS set appropriately, you can run the Maven build command as follows:
mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package
Note: I tried building Spark against the Hadoop 2.3 libraries, but was unsuccessful. I received some NullPointerExceptions related to setting the Hadoop configuration when trying to run the examples. I did discover that I can build against the 2.2 libraries and run on either version 2.2 or version 2.3.
Maven has a profile that needs to be activated to run Spark in YARN (-Pyarn). Some of the tests do not work the very first time Spark is built, so use -DskipTests on your first build to skip over the unit tests.
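If you later want to run the test suite, you can drop -DskipTests and rerun the same build command (expect it to take quite a bit longer):
mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 clean package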
The build process will take a while (a few minutes), depending on the speed of your machine. When it is finished, you will see a nice report from Maven indicating the status of all of the component builds.
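A quick way to confirm the build produced what you need is to check for the two assembly jars that the rest of these instructions rely on (paths assume a Scala 2.10 build of 0.9.0-incubating):
ls assembly/target/scala-2.10/
ls examples/target/scala-2.10/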
Once the build finishes, you can either:
- Take the complete directory with all .class and .jar files that were generated and copy it to your Hadoop node (as you would have if you downloaded the pre-built binaries).
- Copy the source archive to your Hadoop node, then extract it and copy in only the Spark assembly file (located in assembly/target/scala-2.10) and the examples assembly (located in examples/target/scala-2.10), as sketched below. This significantly reduces the size of the files you need to copy, which helps if disk space is at a premium, especially if your single node setup is inside of a VM you are running on another box.
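Here is a rough sketch of that second option, assuming you can ssh from your build machine to the Hadoop node and have already extracted the source archive there (the hostname is just a placeholder):
ssh youruser@your-hadoop-node 'mkdir -p ~/spark-0.9.0-incubating/assembly/target/scala-2.10 ~/spark-0.9.0-incubating/examples/target/scala-2.10'
scp assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar \
  youruser@your-hadoop-node:spark-0.9.0-incubating/assembly/target/scala-2.10/
scp examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar \
  youruser@your-hadoop-node:spark-0.9.0-incubating/examples/target/scala-2.10/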
Configuring Spark
Whether you downloaded the pre-built binaries or built them yourself, you now have a Spark distribution on your Hadoop node. You only need to set a few environment variables and do one quick setup step, and you will be running Spark code.
First, there are two environment variables you need to set:
SPARK_HOME – This is not needed for Spark to function, but it is a useful shortcut.
SPARK_JAR – This is the location of your Spark assembly file.
Assuming you are using Spark 0.9.0-incubating and have placed the extracted directory in your home directory, you will set these variables to the following values:
export SPARK_HOME=~/spark-0.9.0-incubating
export SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
You can set these in your .profile so that you don’t need to set them again every time you log in. You can also add the bin directory under $SPARK_HOME to your PATH.
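For example, the lines you add to your .profile might look like this (same paths as above):
export SPARK_HOME=~/spark-0.9.0-incubating
export SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
export PATH=$PATH:$SPARK_HOME/bin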
Last, there’s one configuration step to perform…creating your log4j.properties file. You can create it from the template that is already in your conf directory:
cp $SPARK_HOME/conf/log4j.properties.template $SPARK_HOME/conf/log4j.properties
This gives you a log4j.properties file that logs at the INFO level to the console and dials back the settings for some of the noisier components.
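Once everything is working, you may find INFO a bit chatty. You can edit the copied file and lower the root logger level; in the 0.9.0 template the line to change should look something like this (switch INFO to WARN to quiet it down):
log4j.rootCategory=INFO, console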
That’s all you need to do in order to set up Spark. Notice that you didn’t even need to touch any Hadoop system files…Spark runs completely on top of YARN.
Running Your First Spark Example
Now it’s time to try out running Spark code. The Spark framework includes some helper scripts for running Spark applications.
You can try running the SparkPi example as follows:
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar $SPARK_HOME/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar \
  --class org.apache.spark.examples.SparkPi \
  --args yarn-standalone
This runs a simple calculation that requires no input files and uses random numbers to estimate the value of Pi. You can see the details on the Spark Examples page.
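If you want to control how much of the cluster the example uses, the YARN Client in 0.9.0 also takes sizing flags such as --num-workers, --worker-memory, --worker-cores, and --master-memory. The values here are just an illustration:
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar $SPARK_HOME/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar \
  --class org.apache.spark.examples.SparkPi \
  --args yarn-standalone \
  --num-workers 2 \
  --worker-memory 1g \
  --worker-cores 1 \
  --master-memory 1g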
You should see the application get submitted and start printing application reports from YARN. The final report should include:
yarnAppState: FINISHED
distributedFinalState: SUCCEEDED
If you see this, then your application has finished successfully. To check the output, go to the directory with the YARN logs in it (/usr/share/yarn/logs if you used the instructions on this site) and look under userlogs to find your application ID:
yarn> cd /usr/share/yarn/logs/userlogs
yarn> cd [APPLICATION_ID]
yarn> cat container_*/stdout
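If you have log aggregation enabled in YARN, you can also pull the same logs through the yarn command line once the application has finished:
yarn logs -applicationId [APPLICATION_ID]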
The SparkPi example has only a single line of output in the main container:
Pi is roughly 3.1376
Since random numbers are used for the example, the actual number will vary, but if you see this line, Spark ran successfully, and you now have a working Spark installation on your single node cluster!
In future posts I will discuss Spark in more detail and look at what you can do with it.