@dkruh1

@dkruh1 dkruh1 commented Dec 3, 2024

Resolves: [dbt-spark Issue #272], [dbt-databricks Issue #575]

Description

Pull Request Description

Summary
This PR adds support for a PySpark-based connection in the adapter. With it, dbt can run inside a running Databricks job cluster, which extends its use beyond SQL warehouses and all-purpose clusters.

Background
The Spark session functionality referenced here was first discussed in [dbt-spark Issue #272]. For Databricks specifically, the issue was raised in [dbt-databricks Issue #575].

Key Features

  1. PySpark-Based Connection:
    A new environment variable, DBT_DATABRICKS_SESSION_CONNECTION, has been introduced.

    • When this variable is set to True, a new DatabricksSessionConnectionManager is initialized.
    • This manager assumes that the dbt code is being executed in the context of an existing Spark session, making it possible to integrate with running Databricks job clusters.
  2. Testing:

    • A new pytest matrix dimension, session_support, was added to the unit tests. When session support is enabled, the DBT_DATABRICKS_SESSION_CONNECTION environment variable is set to True and the unit tests run against the new DatabricksSessionConnectionManager.
  3. Functional Testing:
    Functional tests were conducted using a Databricks notebook.

    • The notebook programmatically triggered dbt while ensuring the DBT_DATABRICKS_SESSION_CONNECTION variable was set to True.

    • These tests confirmed that dbt works seamlessly within a running Spark session.

    • Example notebook code:
      `import os
      from dbt.cli.main import dbtRunner

      # Enable the session-based connection manager before invoking dbt.
      os.environ["DBT_DATABRICKS_SESSION_CONNECTION"] = "True"

      res = dbtRunner().invoke(["run", "--profiles-dir", "/Workspace/Users/doron.kruh@yotpo.com/dbt-dbx-session-test/", "--project-dir", "/Workspace/Users/user@databrciks.com/dbt-models", "--target", "prod", "--select", "model_to_execute"])`
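
The selection behaviour described under Key Features could be sketched roughly as follows. This is not the PR's code: the two classes are hypothetical placeholders for the adapter's default manager and the new DatabricksSessionConnectionManager, and the set of accepted truthy strings is an assumption (the PR only states that the variable is "set to True").

```python
import os

# Hypothetical placeholders; the real classes live in dbt-databricks
# and are not reproduced here.
class DefaultConnectionManager:
    """Stands in for the existing SQL-connector-based manager."""

class SessionConnectionManager:
    """Stands in for a manager that reuses an already-running Spark session."""

# Assumption: which strings count as "True" is not specified in the PR.
_TRUTHY = {"1", "true", "yes"}

def session_mode_enabled(environ=None):
    """Return True when DBT_DATABRICKS_SESSION_CONNECTION holds a truthy value."""
    env = os.environ if environ is None else environ
    raw = env.get("DBT_DATABRICKS_SESSION_CONNECTION", "")
    return raw.strip().lower() in _TRUTHY

def pick_connection_manager(environ=None):
    """Choose the manager class based on the environment variable."""
    if session_mode_enabled(environ):
        return SessionConnectionManager
    return DefaultConnectionManager
```

Passing a plain dict as `environ` makes the toggle easy to exercise in a unit test without mutating the real process environment, which is one way a session_support test matrix could flip between the two managers.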

Why This Matters

  • Enables running dbt within existing Spark sessions, providing more flexibility for advanced Databricks workflows.
  • Expands the range of cluster types supported by dbt.
  • Supports integration with Databricks job clusters, ensuring compatibility with real-world use cases.

Next Steps

  • Document this feature for users who may need it.
  • Verify compatibility with additional Databricks environments as needed.

Checklist

  • [x] I have run this code in development and it appears to resolve the stated issue
  • [x] This PR includes tests, or tests are not required/relevant for this PR
  • [x] I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@benc-db
Collaborator

benc-db commented Dec 3, 2024

@dkruh1, we need to discuss internally if we want to take this feature. I appreciate the effort, and I understand why this feature would be valuable to users, but we need to decide whether we want to take on the maintenance burden of an additional connection mechanism. Will get back to you shortly.

@benc-db
Collaborator

benc-db commented Dec 5, 2024

@dkruh1 after discussion, we will not be taking this feature at this time. We are focused on ensuring that dbt-databricks provides the best experience for interacting with SQL Warehouses and serverless compute. As this is OSS, you are free to fork our repo and use your implementation that way.

@benc-db benc-db closed this Dec 5, 2024
@alexeyegorov

@dkruh1 did you try building this as a fork and using the forked package to run dbt as a Databricks job? @leo-schick did you make any progress on this topic, or would you be interested in supporting such a forked package?

@leo-schick

@alexeyegorov I am currently not using Databricks in my projects, but I am strongly in favor of getting this implemented!
I would prefer getting it merged there first instead of starting a separate fork.

@dkruh1 Is there a way to get this implemented into dbt-spark without much extra effort?

@alexeyegorov

@leo-schick I read your post last year and have now searched for possible solutions and mentions in dbt-databricks. Pretty silly that it is not supported. I stumbled upon this description:
https://gist.github.com/NodeJSmith/d2fc2e9a289360180ebaa9d7e452e285#gistcomment-5951230
I will search for that fork and maybe it is a working solution? Otherwise I will ask Claude to make a plan to implement it into dbt-spark. :D

btw, I worked briefly with Mara during my time at Lampenwelt a few years ago. :P

@alexeyegorov

alexeyegorov commented Jan 25, 2026

I have chatted with Claude and "we" worked out a plan to implement session mode for executing SQL and Python models via dbt-databricks. I will check how far I can get with it and maybe try it as a standalone forked package on our Databricks setup.

5 participants