sparkmllib_pipeline.py issue on Linux machine with recent Java version #91

@bit-scientist

Description

I was going through sparkmllib_pipeline.py in Chapter 3 and had trouble getting it to run at first. There are a couple of issues that should be addressed.

It turns out PySpark requires Java to run, but Chapter 3 does not mention this (it is mentioned only in Chapter 2, which is easy to miss).

It also does not work with newer versions of Java (such as Java 21 or 24).

I had to install OpenJDK 11 as follows:

sudo apt update
sudo apt install openjdk-11-jdk -y

Then set the environment variables:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
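
In case it helps, here is a small sanity check I used (not from the book, just a sketch) to confirm that the session will pick up the JDK 11 install:

import os
import subprocess

# JAVA_HOME should point at the JDK 11 install from the exports above.
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))

# `java -version` prints to stderr; OpenJDK 11 should report "11.0.x".
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip())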

Also, I was not able to get Spark to work using:

sc = SparkContext("local", "pipelines")
spark = SparkSession.builder.getOrCreate()

After some searching, I found that this does the job:

spark = SparkSession.builder \
        .appName("pipelines") \
        .master("local[*]") \
        .getOrCreate()
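
For what it's worth, if later parts of the script still expect an sc variable, it can be taken from the session instead of being constructed separately (a sketch, assuming the rest of the file only needs a plain SparkContext):

# Reuse the context owned by the session instead of building a second one.
sc = spark.sparkContext
print(sc.master)   # local[*]
print(sc.appName)  # pipelines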

After these changes, everything worked well!

Lastly, I ran spark-submit sparkmllib_pipeline.py instead of python sparkmllib_pipeline.py, because the plain Python interpreter apparently cannot launch the Spark JVM with the correct settings.
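
As an untested alternative (my own assumption, not something from the book), pointing the script at the JDK before pyspark is imported might also let plain python launch the JVM:

# Set JAVA_HOME before pyspark is imported, so the JVM launcher can find JDK 11.
import os
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pipelines")
    .master("local[*]")
    .getOrCreate()
)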
