@ysung ysung commented Oct 29, 2025

PySpark 3.5.4 Upgrade with OpenLineage, Kafka, and Iceberg Patch Support

Overview

This PR upgrades PySpark to version 2815!3.5.4+affirm.5 (dev version: 2815!3.5.4+affirm.dev14) so that the distribution bundles the dependencies for data lineage tracking (OpenLineage) and event streaming (Kafka) alongside the existing patched Iceberg implementation. With this upgrade, Affirm's data platform can track data lineage across Spark jobs and integrate with Kafka and our custom Iceberg patches.


Changes

1. pom.xml - Dependency Version Updates

  • Added two new dependency version properties: kafka.version (3.4.1) and openlineage.version (1.38.0)
  • OpenLineage 1.38.0: Enables automatic data lineage capture at the Spark level.
  • Kafka 3.4.1: Uses the existing Spark Kafka connector version. This update explicitly includes the Kafka clients JAR in the assembly, ensuring users have direct Kafka connectivity without additional setup.

2. assembly/pom.xml - Assembly Configuration

  • Added two new dependencies to the assembly configuration: OpenLineage Spark Agent, Kafka Clients
  • These dependencies are now explicitly declared in the assembly so they get packaged into the final Spark tarball

3. python/pyspark/version.py - Version String Update

File: spark/python/pyspark/version.py

  • Updated the __version__ string from the previous dev version to 2815!3.5.4+affirm.5
  • This version string is used by Python package managers for dependency resolution and installation
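The version string has the PEP 440 shape `epoch!release+local`. Installers compare the epoch first, so an epoch of 2815 sorts above any upstream pyspark release, which has an implicit epoch of 0. The sketch below splits the two version strings from this PR into those parts. It is a simplified illustration using only the standard library, not a full PEP 440 parser, and `split_version` is a hypothetical helper name.

```python
import re

# Simplified PEP 440 shape: [epoch!]release[+local].
# Pre-, post- and dev-release segments of the public version are
# deliberately not handled in this sketch.
_VERSION_RE = re.compile(
    r"^(?:(?P<epoch>\d+)!)?"
    r"(?P<release>\d+(?:\.\d+)*)"
    r"(?:\+(?P<local>[A-Za-z0-9.]+))?$"
)


def split_version(version: str) -> dict:
    """Split a version string into epoch, release tuple and local label."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return {
        "epoch": int(match.group("epoch") or 0),
        "release": tuple(int(p) for p in match.group("release").split(".")),
        "local": match.group("local"),
    }


if __name__ == "__main__":
    for v in ("2815!3.5.4+affirm.5", "2815!3.5.4+affirm.dev14"):
        print(v, "->", split_version(v))
```

Note that `dev14` lives inside the local label (`affirm.dev14`), so tooling does not treat it as a PEP 440 development release.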

Validation

Comparison: Dev vs Latest Prod

We validate the new dev version against the latest production version (2815!3.5.4+affirm.4) using the following approach:

# Download prod wheel from Artifactory
wget https://affirmprod.jfrog.io/artifactory/pypi-local/pyspark/2815\!3.5.4\%2Baffirm.4/pyspark-2815\!3.5.4\%2Baffirm.4-py2.py3-none-any.whl

# Download dev wheel from build artifacts
# (after uploading to Artifactory)

# Extract and compare
unzip -q pyspark-2815\!3.5.4\%2Baffirm.4-py2.py3-none-any.whl -d prod_extracted/
unzip -q pyspark-2815\!3.5.4\%2Baffirm.dev14-py2.py3-none-any.whl -d dev_extracted/

# Key validations:
# 1. Verify new dependencies are present
find dev_extracted -name "openlineage-spark*.jar"  # ✅ Present
find dev_extracted -name "kafka-clients*.jar"      # ✅ Present
find dev_extracted -name "iceberg-spark-runtime*.jar" # ✅ Present

# 2. Verify version string
grep __version__ dev_extracted/pyspark/version.py  # ✅ Shows 2815!3.5.4+affirm.dev14

# 3. Verify pom.xml changes
# (included in distribution for reference)
grep -A2 -E "iceberg\.version|kafka\.version|openlineage\.version" dev_extracted/pom.xml

# 4. Class availability check
# Extract and inspect JAR contents to ensure classes are present
unzip -l dev_extracted/openlineage-spark_2.12-1.38.0.jar | grep OpenLineageSparkListener
# ✅ Should show: io/openlineage/spark/agent/OpenLineageSparkListener.class
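The class availability check can also be scripted, which avoids hard-coding the JAR's location inside the extracted wheel. A minimal sketch using the standard `zipfile` module follows; the helper names `jar_has_class` and `find_jars_with_class` are hypothetical and not part of this PR.

```python
import zipfile
from pathlib import Path


def jar_has_class(jar_path, class_name: str) -> bool:
    """Return True if the JAR holds the .class entry for a fully qualified class name."""
    # A JAR is a zip archive; a.b.C is stored as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def find_jars_with_class(root, pattern: str, class_name: str) -> list:
    """Search root recursively for JARs matching pattern that contain class_name."""
    return sorted(
        path for path in Path(root).rglob(pattern) if jar_has_class(path, class_name)
    )
```

For example, `find_jars_with_class("dev_extracted", "openlineage-spark*.jar", "io.openlineage.spark.agent.OpenLineageSparkListener")` would be expected to return one path if the listener was packaged.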

Expected Results:

  • Dev version contains OpenLineage 1.38.0, Kafka 3.4.1, and Iceberg 1.8.1-PATCH.1
  • All JARs are properly packaged
  • Version string correctly shows affirm.dev14
  • No breaking changes compared to prod (only additions)
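The "only additions" expectation can be checked by comparing the JAR file names in the two extracted wheels. The following is a rough sketch under the assumption that comparing names is sufficient; it does not compare JAR contents, and `jar_names` and `diff_jars` are hypothetical helpers.

```python
from pathlib import Path


def jar_names(root) -> set:
    """Collect the file names of every JAR found under an extracted wheel."""
    return {path.name for path in Path(root).rglob("*.jar")}


def diff_jars(prod_root, dev_root) -> dict:
    """Report JAR names present in only one of the two extracted wheels."""
    prod, dev = jar_names(prod_root), jar_names(dev_root)
    return {"added": sorted(dev - prod), "removed": sorted(prod - dev)}
```

Calling `diff_jars("prod_extracted", "dev_extracted")` with an empty `removed` list would be consistent with a purely additive change. A JAR whose version was bumped shows up as one removed name and one added name, so any entry under `removed` deserves a closer look.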

Testing

Build

cd /workspace/spark
./build/mvn -DskipTests -Pkubernetes clean package -T 4
pyenv global 3.9.18
cd python
python -m build

Outputs:

  • spark/python/dist/pyspark-2815!3.5.4+affirm.dev14-py2.py3-none-any.whl (wheel distribution)

Upload to Artifactory

# Install twine (if not already installed)
pip install twine

# Upload wheel to JFrog Artifactory PyPI repository
twine upload \
  --repository-url https://affirmprod.jfrog.io/artifactory/api/pypi/pypi-local \
  --username <jfrog_username> \
  --password <jfrog_api_token> \
  dist/pyspark-2815\!3.5.4\+affirm.dev14-py2.py3-none-any.whl \
  --verbose

Testing in Stage

  1. Build a stage batch pod with the branch:

    affirm.batch.k8s shell -n capital-data-sourcing -c stage-live-jobs --release bdx-406_py3
    
  2. Run basic Spark tests:

    import pyspark
    from pyspark.sql import SparkSession
    
    # Verify version
    assert pyspark.__version__ == "2815!3.5.4+affirm.dev14"
    
    # Verify OpenLineage is available
    spark = SparkSession.builder \
        .appName("test") \
        .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
        .getOrCreate()
    
    # Run a simple query to verify Spark works with OpenLineage
    df = spark.range(100)
    df.show()

  3. Run Spark + OpenLineage + Kafka:

    Successfully validated that the execution pod spins up and emits events, tested in https://github.com/Affirm/all-the-things/pull/157428

python -m affirm.platform.batch.live.airflow.tasks.run affirm.bank.originations.service.airflow_tasks.etls.late_event.OriginationLateEventETL --total-worker-num-per-shard 1 --execution-date 2025-11-04
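The PR does not show the Spark configuration used for the OpenLineage-to-Kafka run. As an illustration only, the sketch below assembles the kind of settings the OpenLineage Spark integration documents for its Kafka transport. The key names reflect my understanding of those docs and should be checked against the 1.38.0 documentation; the function name and all argument values are hypothetical.

```python
def openlineage_kafka_conf(bootstrap_servers: str, topic: str, namespace: str) -> dict:
    """Build Spark conf entries that route OpenLineage events to a Kafka topic.

    Assumption: key names follow the OpenLineage Spark integration docs
    (spark.openlineage.transport.*); verify them for the bundled version.
    """
    serializer = "org.apache.kafka.common.serialization.StringSerializer"
    prefix = "spark.openlineage.transport"
    return {
        # Listener class name as used in the basic test in this PR.
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.namespace": namespace,
        f"{prefix}.type": "kafka",
        f"{prefix}.topicName": topic,
        # Entries under .properties are passed through to the Kafka producer.
        f"{prefix}.properties.bootstrap.servers": bootstrap_servers,
        f"{prefix}.properties.key.serializer": serializer,
        f"{prefix}.properties.value.serializer": serializer,
    }


if __name__ == "__main__":
    # Placeholder values; the real broker, topic and namespace are not in this PR.
    conf = openlineage_kafka_conf("localhost:9092", "openlineage.events", "example")
    for key, value in sorted(conf.items()):
        print(f"{key}={value}")
```

Each pair can then be applied with `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.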

Next Steps

After Approval + Successful Testing

Once this PR is approved, build the production version:

# Rebuild PySpark wheel
cd /workspace/spark
./build/mvn -DskipTests -Pkubernetes clean package -T 4
pyenv global 3.9.18
cd python
python -m build

# Upload to Artifactory (prod release)
twine upload \
  --repository-url https://affirmprod.jfrog.io/artifactory/api/pypi/pypi-local \
  --username <jfrog_username> \
  --password <jfrog_api_token> \
  dist/pyspark-2815\!3.5.4\+affirm.5-py2.py3-none-any.whl

@ysung ysung marked this pull request as ready for review November 13, 2025 21:28
@ysung ysung requested a review from a team November 13, 2025 21:42
@ysung ysung merged commit 0cdc433 into branch-3.5 Nov 13, 2025
23 of 26 checks passed