Introduce PySpark Session support (enables the adapter usage for job clusters) #862
Conversation
@dkruh1, we need to discuss internally if we want to take this feature. I appreciate the effort, and I understand why this feature would be valuable to users, but we need to decide whether we want to take on the maintenance burden of an additional connection mechanism. Will get back to you shortly. |
@dkruh1 after discussion, we will not be taking this feature at this time. We are focused on ensuring that dbt-databricks provides the best experience for interacting with SQL Warehouses and serverless compute. As this is OSS, you are free to fork our repo and use your implementation that way. |
@dkruh1 did you try building this as a fork and using the forked package to run it as a Databricks job? @leo-schick did you make any progress on this topic, or do you have interest in supporting such a forked package?
@alexeyegorov I am currently not using Databricks in my projects, but I am strongly in favor of getting this implemented! @dkruh1 Is there a way to get this implemented in dbt-spark without much extra effort?
@leo-schick I read your post about a year ago. I have now searched for possible solutions and mentions in dbt-databricks; it's pretty silly that this is not supported. I stumbled upon this description. By the way, I worked briefly with Mara during my time at Lampenwelt a few years ago. :P
I have chatted with Claude and "we" worked out a plan to implement session mode for the execution of SQL and Python models via dbt-databricks. I will check how far I can get with it and maybe give it a try as a standalone forked package on our Databricks setup.
Resolves: [dbt-spark Issue #272], [dbt-databricks Issue #575]
Description
Summary
This PR introduces support for defining a PySpark-based connection when using the adapter. This enhancement allows dbt to run as part of a running Databricks job cluster, expanding its usage beyond SQL warehouses or all-purpose clusters.
Background
The Spark session functionality referenced here was first discussed in [dbt-spark Issue #272]. Specifically for Databricks, the issue was raised in [dbt-databricks Issue #575].
Key Features
PySpark-Based Connection:
A new environment variable, `DBT_DATABRICKS_SESSION_CONNECTION`, has been introduced. When set to `True`, a new `DatabricksSessionConnectionManager` is initialized.
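To illustrate the mechanism, here is a minimal sketch of what a session-based connection could look like. Only the environment variable and the `DatabricksSessionConnectionManager` name come from this PR; the cursor wrapper and function names below are hypothetical. The idea is to route dbt's SQL through the Spark session the job cluster already provides, instead of opening a network connection to a SQL warehouse.

```python
# A minimal sketch of the idea, assuming PySpark is available on the cluster.
# Only the environment variable is from this PR; everything else is illustrative.
import os

from pyspark.sql import SparkSession


def session_connection_enabled() -> bool:
    # The PR gates the new connection manager on this environment variable.
    return os.getenv("DBT_DATABRICKS_SESSION_CONNECTION", "").lower() == "true"


class SessionCursor:
    """Runs SQL on the in-process Spark session instead of over HTTP."""

    def __init__(self, spark: SparkSession) -> None:
        self._spark = spark
        self._df = None

    def execute(self, sql: str) -> None:
        self._df = self._spark.sql(sql)

    def fetchall(self):
        return self._df.collect() if self._df is not None else []


def open_cursor() -> SessionCursor:
    # Inside a Databricks job an active session already exists; fall back to
    # getOrCreate() so the sketch also runs in a local PySpark environment.
    spark = SparkSession.getActiveSession() or SparkSession.builder.getOrCreate()
    return SessionCursor(spark)
```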
Testing:
Functional Testing:
Functional tests were conducted using a Databricks notebook.
The notebook programmatically triggered dbt while ensuring the `DBT_DATABRICKS_SESSION_CONNECTION` variable was set to `True`. These tests confirmed that dbt works seamlessly within a running Spark session.
Example notebook code:

```python
import os

from dbt.cli.main import dbtRunner

# Enable the session-based connection before invoking dbt.
os.environ["DBT_DATABRICKS_SESSION_CONNECTION"] = "True"

res = dbtRunner().invoke(
    [
        "run",
        "--profiles-dir", "/Workspace/Users/doron.kruh@yotpo.com/dbt-dbx-session-test/",
        "--project-dir", "/Workspace/Users/user@databrciks.com/dbt-models",
        "--target", "prod",
        "--select", "model_to_execute",
    ]
)
```
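The same pattern should also work outside a notebook, for example as the entry point of a Python task in a Databricks job. The sketch below uses placeholder paths and checks the runner result; it relies only on dbt's documented programmatic API (`dbtRunner` and the `success` flag on its result, available since dbt-core 1.5).

```python
# Hypothetical job-task entry point; project dir, target, and selection are placeholders.
import os

from dbt.cli.main import dbtRunner

# Enable the session-based connection before dbt loads the adapter.
os.environ["DBT_DATABRICKS_SESSION_CONNECTION"] = "True"

result = dbtRunner().invoke(
    ["run", "--project-dir", "/Workspace/path/to/dbt-models", "--target", "prod"]
)

# dbtRunner returns a dbtRunnerResult; fail the job if the run failed.
if not result.success:
    raise SystemExit(1)
```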
Why This Matters
This allows dbt to run directly on an existing Databricks job cluster as part of a scheduled job, rather than requiring a SQL warehouse or all-purpose cluster.
Next Steps
Checklist
I have updated `CHANGELOG.md` and added information about my change to the "dbt-databricks next" section.