
Through promotional materials from Databricks earlier this year, I had heard about the upcoming Spark 2.3 release. I sat in a demo and was really excited to try out new features like Continuous Processing and MLlib in Structured Streaming. But my customer work is all done in Spark 2.2 at the moment, and none of my side projects really required distributed processing at scale. Then I went to Spark Summit earlier this summer, and after seeing more Spark 2.3 in action, I decided it was time to get my hands on it.
I went through the process of installing Spark 2.1 on my MacBook last year … and completely forgot how I did it. I don't know about you, but the thought of having to find and untangle multiple versions of configuration files and environment variables is usually enough to deter me from starting a project like upgrading Spark. And Spark has a lot of dependencies, too. But it turned out not to be that bad. The rest of this post walks through how I did it.
Preliminaries
HDFS

Although you can use Spark in Standalone mode, connecting it to Hadoop (HDFS) is important for a couple of reasons:
- Spark file input & output functions are generally designed to operate on HDFS, especially for the highly efficient Parquet file format
- Although Spark jobs are designed to execute entirely in memory, there might not be sufficient memory to do so, in which case Spark will temporarily spill some data to disk. HDFS is a good choice for that storage
That being said, HDFS stands for Hadoop Distributed File System, and there’s nothing distributed about storing data on a laptop, so HDFS is probably overkill for our purposes. Regardless, I installed it in order to make full use of Spark, following a great set of instructions.
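To make the Parquet point concrete, here's a minimal sketch of round-tripping a DataFrame through Parquet once Spark and PySpark are set up (the rest of this post). The hdfs://localhost:9000 URI and the /tmp path are placeholders that depend on how you configured HDFS; a plain local path works too if you skip HDFS entirely.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("parquetDemo").master("local[*]").getOrCreate()
# A toy DataFrame; in practice this would come from real input data
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
# Write to HDFS as Parquet and read it back (URI and path are placeholders)
df.write.mode("overwrite").parquet("hdfs://localhost:9000/tmp/demo.parquet")
spark.read.parquet("hdfs://localhost:9000/tmp/demo.parquet").show()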
Scala

Spark is written in Scala, so Scala is Spark's first language. You will need Scala installed, which is easiest to do through Homebrew on a Mac. Although the latest version of Scala is 2.12, Spark 2.3 only supports Scala up to 2.11, so you'll need to pin your Scala installation to 2.11:
brew install scala@2.11
Python/Jupyter

For data scientists, Python is typically a second language if not the first, and thankfully Spark has a Python API (PySpark) that has nearly caught up with the Scala API over time. What’s great about PySpark is that you can switch between Spark objects and native Python objects within the same script or notebook, and leverage all the usual Python libraries you love. I’m most interested in the Anaconda distribution of Python, so I’ll assume that we’re working with Python 3.6 in Anaconda. You certainly don’t have to use the Anaconda distribution, but it is strongly encouraged for integration with Jupyter notebook/lab later. Python installation is a sensitive topic that is best left to the reader ;)
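To make that concrete, here's a small sketch of hopping between pandas and Spark in the same notebook; it assumes a SparkSession named spark like the one built at the end of this post.
import pandas as pd
# A native pandas DataFrame
pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
# Promote it to a distributed Spark DataFrame and use the Spark API...
sdf = spark.createDataFrame(pdf)
filtered = sdf.filter(sdf.value > 0.15)
# ...then pull the (small) result back into pandas for plotting, scikit-learn, etc.
result = filtered.toPandas()
print(result)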
Spark Installation
There's a manual installation approach, but I think it's easier to install with Homebrew; that route works well, though it needs to be adapted for Spark 2.3. Since I already had Spark 2.1 installed, it was enough to go to Terminal and type
brew upgrade apache-spark
Voilà! Now just make sure that your .bashrc or .bash_profile sets the SPARK_HOME variable to the brew-installed Spark library. In my case, it's .bash_profile:
vim ~/.bash_profile
export SPARK_HOME="/usr/local/Cellar/apache-spark/2.3.0/libexec"
Apache Toree Installation
Apache Toree has one main goal: provide the foundation for interactive applications to connect to and use Apache Spark. While you can do interactive PySpark work in Jupyter without Toree, Toree makes it possible to also access the Scala, R and SQL APIs for Spark just as easily, simply by toggling the choice of kernel when launching the notebook. In developing Spark applications, it's extremely advantageous to work interactively in a notebook versus submitting jobs in batch mode only, for example to test edge cases, inspect intermediate output, and verify type conversions. The steps I followed to install Toree were:
cd /usr/local/bin
git clone https://github.com/apache/incubator-toree.git
cd incubator-toree
APACHE_SPARK_VERSION=2.3.0 make clean dist
make release
pip install dist/toree-pip/toree-0.3.0.dev1.tar.gz
jupyter toree install --spark_home=/usr/local/Cellar/apache-spark/2.3.0/libexec --interpreters=Scala,PySpark,SparkR,SQL --user
Tying up Loose Ends
PySpark Jupyter Kernel
People used to Jupyter notebook or Jupyter lab will enjoy the ability to integrate Spark development into their usual workflows. And you can even do it with the Scala API by leveraging the Apache Toree kernels. Although Toree does provide a PySpark kernel as well, I prefer calling PySpark as a package within my usual Python 3.6 kernel. Here’s how I set that up.
- Navigate to the kernel.json file associated with the Python 3 Jupyter kernel. In my case, that’s located at ~/anaconda/share/jupyter/kernels/python3
- Add some environment variables to the kernel.json file so the kernel knows where to look for PySpark (double-check that the py4j zip named in PYTHONPATH matches the version shipped under $SPARK_HOME/python/lib in your install):
"env": {
"SPARK_HOME": "/usr/local/Cellar/apache-spark/2.3.0/libexec",
"SPARK_CONF_DIR": "/usr/local/Cellar/apache-spark/2.3.0/libexec/conf",
"PYSPARK_PYTHON": "python",
"PYSPARK_DRIVER_PYTHON": "python",
"PYTHONSTARTUP": "/usr/local/Cellar/apache-spark/2.3.0/libexec/python/pyspark/shell.py",
"PYTHONPATH": "/usr/local/Cellar/apache-spark/2.3.0/libexec/python/lib/py4j-0.10.4-src.zip:/usr/local/Cellar/apache-spark/2.3.0/libexec/python/:",
"PYSPARK_SUBMIT_ARGS": "--name 'PySparkShell' 'pyspark-shell'"
}
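After saving kernel.json and restarting Jupyter, a quick sanity check along these lines in a new notebook confirms the kernel picked up the new environment (the exact version and path are whatever your install reports):
import os
import pyspark
# Both should reflect the brew-installed Spark 2.3.0 under /usr/local/Cellar
print(pyspark.__version__)
print(os.environ.get("SPARK_HOME"))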
PySpark should now work within Jupyter notebooks. To access the PySpark API, you'll need to import the relevant PySpark packages and build a SparkSession object in your running kernel. Here's a simple way to do that; note that Spark has tons of configuration options that are outside the scope of this post. For those, check out the Apache Spark Configuration Guide.
A typical PySpark notebook for structured data might start off like this:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Start (or reuse) a local SparkSession that uses all available cores
spark = SparkSession.builder.appName("myApp").master("local[*]").getOrCreate()
# Read a directory of CSV files into a DataFrame (pathToCSVFiles is a placeholder)
df = spark.read.format("csv").option("header", "true").load(pathToCSVFiles)
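If you do want to tweak configuration, most options can be passed through the builder before the session is first created. As one illustrative example (not something you're required to set), spark.sql.shuffle.partitions defaults to 200, which is more than a laptop needs:
from pyspark.sql import SparkSession
# Configs passed here take effect when the session is first created
spark = (SparkSession.builder
         .appName("myApp")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())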
Scala Jupyter Kernel
Getting the Scala (and R and SQL) kernels to work is a little trickier. I've set this up in both Linux and Mac environments and found it to be a consistent pain. Your results might differ, depending on things like where you installed Toree, what user permissions you enabled, how you installed Scala, etc. The main issue I had was that Toree installed the kernels in a location that Jupyter does not scan when launching a notebook from my default environment, so you have to tell Jupyter where to look. To rectify this, I had to add the following line to my .bash_profile:
export JUPYTER_PATH="/usr/local/share/jupyter:/Users/tim/Library/Jupyter"
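With JUPYTER_PATH exported and the shell reloaded, you can check whether Jupyter now sees the Toree kernels, either with jupyter kernelspec list or programmatically from Python; the kernel names below are roughly what Toree registers and may differ on your machine.
from jupyter_client.kernelspec import KernelSpecManager
# Maps kernel names to the directories holding their kernel.json files;
# expect 'python3' plus Toree entries such as 'apache_toree_scala'
specs = KernelSpecManager().find_kernel_specs()
for name, path in sorted(specs.items()):
    print(name, path)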
Conclusion
Apache Spark is great for making big data feel small, and accessing Spark through a Jupyter notebook can simplify your data science workflow even more. I hope you found this useful; if you have better ways of setting up any part of the environment described above, please let me know!