Home
Welcome to the spark-tk wiki!
To get spark-tk off the ground, we have to establish its dependencies.
First, install Spark. Get it from GitHub (https://github.com/apache/spark) or use a CDH installation. spark-tk targets Spark version 1.6.
From the root spark-tk folder, build the project (skipping the tests):
mvn clean install -DskipTests
You should see core/target/core-1.0-SNAPSHOT.jar, as well as its dependency jars in core/target/dependencies.
Now point the SPARKTK_HOME environment variable at the location of the jars:
export SPARKTK_HOME=/full/path/to/spark-tk/core/target
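As an illustrative sanity check (not part of spark-tk itself), you can confirm from Python that the build produced the expected jars under that folder. The helper below is a sketch; the demo uses a throwaway directory standing in for core/target, but in practice you would pass os.environ["SPARKTK_HOME"].

```python
# Illustrative check (not part of spark-tk): count the core and
# dependency jars that the Maven build should have produced.
import glob
import os
import tempfile

def count_jars(home):
    """Return (core jar count, dependency jar count) under home."""
    core = glob.glob(os.path.join(home, "*.jar"))
    deps = glob.glob(os.path.join(home, "dependencies", "*.jar"))
    return len(core), len(deps)

# Demo with a throwaway layout standing in for core/target;
# in practice, pass os.environ["SPARKTK_HOME"] instead.
home = tempfile.mkdtemp()
os.mkdir(os.path.join(home, "dependencies"))
open(os.path.join(home, "core-1.0-SNAPSHOT.jar"), "w").close()
open(os.path.join(home, "dependencies", "py4j.jar"), "w").close()
print(count_jars(home))  # -> (1, 1)
```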
(If you're only interested in the Scala API, you can skip this step.)
The Python sparktk library lives in spark-tk/python/sparktk. It has a few dependencies that you may not have; look in spark-tk/python/requirements.txt to see what it needs.
Handle pyspark first. It usually ships inside your Spark installation. There are a couple of options: add the path to pyspark to $PYTHONPATH, or create a symlink to pyspark in your site-packages folder. Something like:
sudo ln -s /opt/cloudera/parcels/CDH/lib/spark/python/pyspark /usr/lib/python2.7/site-packages/pyspark
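If you'd rather not symlink, the $PYTHONPATH route can also be done at runtime by prepending the folder that contains pyspark to sys.path before importing. A minimal sketch of the mechanism, using a throwaway module in place of the installation-specific pyspark path:

```python
# Sketch of the sys.path alternative to symlinking: prepend the folder
# that CONTAINS the package, then import as usual. The CDH path in the
# comment below is installation-specific; a throwaway module stands in
# for pyspark here so the sketch is self-contained.
import os
import sys
import tempfile

# In practice, something like:
#   sys.path.insert(0, "/opt/cloudera/parcels/CDH/lib/spark/python")
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "fake_pyspark.py"), "w") as f:
    f.write("status = 'importable'\n")

sys.path.insert(0, demo_dir)
import fake_pyspark
print(fake_pyspark.status)  # -> importable
```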
For your other dependencies, use pip2.7 to install:
pip2.7 install decorator
or pip2.7 install -r /path/to/spark-tk/python/requirements.txt
(Note: ideally you should use the same py4j that pyspark is using)
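To verify that the dependencies are visible to your interpreter, a quick check like the one below can help. It's a sketch, not part of spark-tk; the module names passed in are examples, and the authoritative list is spark-tk/python/requirements.txt.

```python
# Illustrative check (not part of spark-tk): report which of the given
# modules fail to import, so you know what still needs pip2.7 install.
import importlib

def missing_modules(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Example names; read the real list from spark-tk/python/requirements.txt.
print(missing_modules(["decorator", "py4j", "pyspark"]))
```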
If you start your Python interpreter from the spark-tk/python folder, you'll be fine. Otherwise, sparktk needs to be on $PYTHONPATH or symlinked as shown above.