Briton Barker edited this page Jun 17, 2016 · 3 revisions

Welcome to the spark-tk wiki!

Getting Started

From source code

To get spark-tk off the ground, we have to establish its dependencies.

1. Spark - you need it.

Get it from GitHub (https://github.com/apache/spark) or use an existing installation such as CDH. spark-tk works with Spark version 1.6.

2. The .jars - build them.

From the root spark-tk folder, build the project (skipping the tests):

mvn clean install -DskipTests

You should see a core/target/core-1.0-SNAPSHOT.jar as well as a set of jars in core/target/dependencies

Now set an environment variable pointing to the location of the jars:

export SPARKTK_HOME=/full/path/to/spark-tk/core/target
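sparktk finds its jars through that variable, so a quick sanity check is to confirm the main jar is visible from $SPARKTK_HOME. A small sketch, using a temporary stand-in directory instead of your real core/target so it runs anywhere:

```python
import glob
import os
import tempfile

# Stand-in for the real build output; in practice SPARKTK_HOME points at
# /full/path/to/spark-tk/core/target. The temp dir is just for illustration.
target = tempfile.mkdtemp()
open(os.path.join(target, "core-1.0-SNAPSHOT.jar"), "w").close()
os.environ["SPARKTK_HOME"] = target

# If this comes back empty against your real target folder, the build
# (or the export) didn't work.
jars = glob.glob(os.path.join(os.environ["SPARKTK_HOME"], "*.jar"))
print(jars)
```

Against a real build you would expect core-1.0-SNAPSHOT.jar in that list.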

3. The python stuff

(If you're only interested in the Scala API, you can skip this one)

The python sparktk library lives in spark-tk/python/sparktk. It has a few dependencies that you may not have; look in spark-tk/python/requirements.txt to see what it needs.

Do pyspark first. pyspark usually sits inside your Spark installation. There are a couple of options: add the path to pyspark to $PYTHONPATH, or create a symlink to pyspark in your site-packages folder. Something like:

sudo ln -s /opt/cloudera/parcels/CDH/lib/spark/python/pyspark /usr/lib/python2.7/site-packages/pyspark
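If you'd rather not symlink, a third option is to add the Spark python folder to sys.path at runtime. A sketch, where the path is an assumption (the CDH parcel location from the symlink example above; substitute your own Spark install):

```python
import sys

# Hypothetical Spark location -- adjust to wherever your Spark lives.
spark_python = "/opt/cloudera/parcels/CDH/lib/spark/python"

# Prepending makes this copy of pyspark win over any other on the path.
if spark_python not in sys.path:
    sys.path.insert(0, spark_python)

# After this, `import pyspark` resolves against that directory.
```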

For your other dependencies, use pip2.7 to install them:

pip2.7 install decorator

or pip2.7 install -r /path/to/spark-tk/python/requirements.txt

(Note: ideally you should use the same py4j version that pyspark bundles)
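One way to see which py4j your pyspark bundles is to look at the zip it ships under python/lib in your Spark installation. A sketch of pulling the version out of that filename (the name below is a stand-in; Spark 1.6 shipped py4j 0.9, but check your own install):

```python
import re

def py4j_version(zip_name):
    """Extract the version from a bundled py4j zip name like 'py4j-0.9-src.zip'."""
    m = re.match(r"py4j-(.+)-src\.zip$", zip_name)
    return m.group(1) if m else None

# Stand-in filename; list your own $SPARK_HOME/python/lib to get the real one.
print(py4j_version("py4j-0.9-src.zip"))  # -> 0.9
```

Then `pip2.7 install py4j==<that version>` keeps the two in sync.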

If you start up your python interpreter from the spark-tk/python folder, the import will just work. Otherwise, sparktk needs to be on $PYTHONPATH or symlinked as shown above.
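To make that permanent, one option is to assemble $PYTHONPATH yourself from the two folders involved. A sketch, where both paths are assumptions to be replaced with your own locations:

```python
import os

# Hypothetical locations -- adjust to your machine.
spark_python = "/opt/cloudera/parcels/CDH/lib/spark/python"
sparktk_python = "/full/path/to/spark-tk/python"

# os.pathsep is ':' on Linux; any existing PYTHONPATH is preserved at the end.
entries = [spark_python, sparktk_python]
existing = os.environ.get("PYTHONPATH")
if existing:
    entries.append(existing)
pythonpath = os.pathsep.join(entries)
print(pythonpath)
```

Exporting that string as PYTHONPATH in your shell profile makes both pyspark and sparktk importable from anywhere.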