
Conversation


@codersky codersky commented Jul 9, 2025

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

dongjoon-hyun and others added 30 commits June 3, 2025 13:14
### What changes were proposed in this pull request?

This PR aims to update K8s docs to recommend K8s v1.31+ for Apache Spark 4.1.0.

### Why are the changes needed?

**1. K8s v1.30 entered maintenance on 2025-04-28 and will reach end of support on 2025-06-28**
- https://kubernetes.io/releases/patch-releases/#1-30

**2. Default K8s Versions in Public Cloud environments**

The default K8s versions of public cloud providers are already at K8s 1.31 or newer, as shown below.

- EKS: v1.32 (Default), v1.33 (Available)
- GKE: v1.32 (Stable), v1.32 (Regular), v1.33 (Rapid)
- AKS: v1.32 (Default), v1.33 (GA)

**3. End Of Support**

In addition, K8s 1.30 reached or will reach the end of standard support around Apache Spark 4.1.0 release.

| K8s  |   EKS   |   AKS   |   GKE   |
| ---- | ------- | ------- | ------- |
| 1.30 | 2025-07 | 2025-07 | 2025-09 |

- [EKS EOL Schedule](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar)
- [AKS EOL Schedule](https://docs.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar)
- [GKE EOL Schedule](https://cloud.google.com/kubernetes-engine/docs/release-schedule)

### Does this PR introduce _any_ user-facing change?

No, this is a documentation-only change about K8s versions.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51074 from dongjoon-hyun/SPARK-52389.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… resolution

### What changes were proposed in this pull request?

This PR introduces the `DataflowGraph`, a container for Declarative Pipelines datasets and flows, as described in the [Declarative Pipelines SPIP](https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0#heading=h.9g6a5f8v6xig). It also adds functionality for
- Constructing a graph by registering a set of graph elements in succession (`GraphRegistrationContext`)
- "Resolving" a graph, which means resolving each of the flows within a graph. Resolving a flow means:
  - Validating that its plan can be successfully analyzed
  - Determining the schema of the data it will produce
  - Determining what upstream datasets within the graph it depends on

It also introduces various secondary changes:
* Changes to `SparkBuild` to support declarative pipelines.
* Updates to the `pom.xml` for the module.
* New error conditions

### Why are the changes needed?

In order to implement Declarative Pipelines.

### Does this PR introduce _any_ user-facing change?

No changes to existing behavior.

### How was this patch tested?
New test suites:
- `ConnectValidPipelineSuite` – test cases where the graph can be successfully resolved
- `ConnectInvalidPipelineSuite` – test cases where the graph fails to be resolved

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51003 from aakash-db/graph-resolution.

Lead-authored-by: Aakash Japi <aakash.japi@databricks.com>
Co-authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Co-authored-by: Sandy Ryza <sandyryza@gmail.com>
Co-authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?

This PR makes `InvalidPlanInput` a user-facing error by:
1. Refactoring `InvalidPlanInput` to implement `SparkThrowable` interface
2. Adding a proper error condition (defaulting to `INTERNAL_ERROR`) and message parameter support
3. Updating error creation in `InvalidInputErrors` to use the new error format
4. Adding test coverage for the new error handling

### Why are the changes needed?

The PR follows the pattern of other Spark errors by implementing the `SparkThrowable` interface, which provides a standardized way to handle and display errors to users. This makes error messages more consistent and easier to understand across the Spark ecosystem.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`build/sbt "connect/testOnly *InvalidInputErrorsSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 0.50.7 (Universal)

Closes apache#51054 from heyihong/SPARK-52337.

Authored-by: Yihong He <heyihong.cn@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…tions

### What changes were proposed in this pull request?
This PR fixes a bug so that Spark Connect treats JDBC options case-insensitively.
Please refer to the comment at apache#50059 (comment).
In fact, the built-in Scala API ensures these parameters are lowercased.
https://github.com/apache/spark/blob/b18b956f967038db4b751a3845154f2b1d4f5f79/sql/connect/common/src/main/scala/org/apache/spark/sql/connect/DataFrameReader.scala#L126
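
For illustration, a minimal PySpark-over-Spark-Connect sketch (the JDBC URL and table below are hypothetical); with this fix, partitioning options passed through Connect are matched case-insensitively, just as in the classic API:

```py
from pyspark.sql import SparkSession

# Connect to a Spark Connect server (hypothetical endpoint).
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost/testdb")  # hypothetical URL
    .option("dbtable", "public.users")                    # hypothetical table
    .option("numPartitions", "4")                         # mixed-case option keys
    .option("partitionColumn", "id")
    .option("lowerBound", "0")
    .option("upperBound", "1000")
    .load()
)
```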

### Why are the changes needed?
Fixes a bug: Spark Connect should treat JDBC options case-insensitively.

### Does this PR introduce _any_ user-facing change?
Yes.
Restores the original behavior.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51068 from beliefer/SPARK-52384.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… when creating a `DataFrame`

### What changes were proposed in this pull request?

When creating a `DataFrame` from Python using `spark.createDataFrame`, infer the type of any `VariantVal` objects as `VariantType`. This is implemented by adding a case mapping `VariantVal` to `VariantType` in the `pyspark.sql.types._infer_type` function.

### Why are the changes needed?

Currently, when creating a `DataFrame` that includes locally-instantiated `VariantVal` objects in Python, the type is inferred as `struct<metadata:binary,value:binary>` rather than `VariantType`. This leads to unintended behavior when creating a `DataFrame` locally, or in certain situations like `df.rdd.map(...).toDF` which call `createDataFrame` under the hood. The bug only occurs when the schema of the `DataFrame` is not passed explicitly.
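
As a quick sketch (assuming a Spark 4.x session named `spark`), the inferred type should now be `variant` rather than a binary struct:

```py
# Obtain a local VariantVal by collecting a variant column, then feed it back
# through createDataFrame without an explicit schema.
row = spark.sql("SELECT parse_json('{\"a\": 1}') AS v").collect()[0]
local_variant = row["v"]  # a pyspark.sql.types.VariantVal
df = spark.createDataFrame([(local_variant,)], ["v"])
df.printSchema()  # expected: v: variant (nullable = true)
```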

### Does this PR introduce _any_ user-facing change?

Yes, fixes the bug described above.

### How was this patch tested?

Added a test in `python/pyspark/sql/tests/test_types.py` that checks the inferred type is `VariantType`, as well as ensuring the `VariantVal` has the correct `value` and `metadata` after inference.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51065 from austinrwarner/SPARK-52355.

Authored-by: Austin Warner <austin.richard.warner@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…cstring to include `VariantType` as valid input

### What changes were proposed in this pull request?

Updated the `pyspark.sql.functions.to_json` docstring to include `VariantType` as a valid input. This includes updates to the summary line, the `col` parameter description, and a new example.

### Why are the changes needed?

With the release of Spark 4.0, users of the new Variant Type will sometimes need to save out the JSON string representation when using PySpark. Before this change, the API docs falsely implied that `to_json` cannot be used for VariantType columns.
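
A small usage sketch of the kind of example the docstring now covers (assuming a Spark 4.x session named `spark`):

```py
from pyspark.sql import functions as F

df = spark.sql("SELECT parse_json('{\"a\": 1, \"b\": [2, 3]}') AS v")
df.select(F.to_json("v").alias("json_str")).show(truncate=False)
# expected: {"a":1,"b":[2,3]}
```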

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No tests added (docs-only change)

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51064 from austinrwarner/SPARK-52352.

Authored-by: Austin Warner <austin.richard.warner@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…rencing previous iterations in UnionLoop

### What changes were proposed in this pull request?

Modify the way that we write statistics and constraints in LogicalRDDs that refer to previous iterations in UnionLoopExec.

### Why are the changes needed?

LogicalRDD constraints are currently incorrectly written in the case where we have multiple columns using the same name in recursion. This leads to incorrectly pruning out filters which can lead to infinite recursion.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New Golden file test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51070 from Pajaraja/pavle-martinovic_data/ConstraintsFixII.

Authored-by: pavle-martinovic_data <pavle.martinovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

New iteration of single-pass Analyzer improvements
- Implement HAVING
- Remove excessive stack frames by flattening `withNewScope`-shaped methods
- Implement default view collation in single-pass Analyzer
- Move hive table resolution to MetadataResolver extensions
- Other bugfixes

### Why are the changes needed?

To replace the existing Spark Analyzer with the single-pass one.

### Does this PR introduce _any_ user-facing change?

No, single-pass Analyzer is not yet enabled.

### How was this patch tested?

CI with `ANALYZER_DUAL_RUN_LEGACY_AND_SINGLE_PASS_RESOLVER`.

### Was this patch authored or co-authored using generative AI tooling?

Yes.

Closes apache#51078 from vladimirg-db/vladimir-golubev_data/single-pass-analyzer/improvements.

Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…es name

### What changes were proposed in this pull request?
In this issue I propose to remove the `TempResolvedColumn` nodes when computing the name for `InheritAnalysisRules` nodes (they are not removed during the `ResolveAggregateFunctions` rule). This is the right behavior as `TempResolvedColumn` is an internal node and shouldn't be exposed to the users.

The following query:
```
SELECT sum(col1) FROM VALUES(1) GROUP BY ALL HAVING sum(ifnull(col1, 1)) = 1
```

Would have the following analyzed plans:
Before the change:
```
Project [sum(col1)#2L]
+- Filter (sum(ifnull(tempresolvedcolumn(col1), 1))#4L = cast(1 as bigint))
   +- Aggregate [sum(col1#0) AS sum(col1)#2L, sum(ifnull(tempresolvedcolumn(col1#0, col1, false), 1)) AS sum(ifnull(tempresolvedcolumn(col1), 1))#4L]
      +- LocalRelation [col1#0]
```

After the change:
```
Project [sum(col1)#2L]
+- Filter (sum(ifnull(col1, 1))#4L = cast(1 as bigint))
   +- Aggregate [sum(col1#0) AS sum(col1)#2L, sum(ifnull(tempresolvedcolumn(col1#0, col1, false), 1)) AS sum(ifnull(col1, 1))#4L]
      +- LocalRelation [col1#0]
```

### Why are the changes needed?
To improve (correct) `Alias` names.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51071 from mihailoale-db/trimtempresolvedcolumnforparameters.

Authored-by: mihailoale-db <mihailo.aleksic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR improves error handling in SparkConnectPlanner by:

1. Adding new error handling methods in `InvalidInputErrors.scala` for better error messages:
   - `invalidEnum` for handling invalid enum values
   - `invalidOneOfField` for handling invalid oneOf fields in protobuf messages
   - `cannotBeEmpty` for handling empty fields
   - `streamingQueryRunIdMismatch` and `streamingQueryNotFound` for streaming query errors

2. Replacing specific error messages with more generic ones. For example:
   - Replacing `unknownRelationNotSupported` with `invalidOneOfField`
   - Replacing `catalogTypeNotSupported` with `invalidOneOfField`
   - Replacing `functionIdNotSupported` with `invalidOneOfField`
   - Replacing `expressionIdNotSupported` with `invalidOneOfField`
   - Replacing `dataSourceIdNotSupported` with `invalidOneOfField`

3. Improving error handling for protobuf-related issues:
   - Better handling of oneOf fields in protobuf messages
   - More descriptive error messages for invalid enum values
   - Better handling of empty fields

### Why are the changes needed?

1. Provide more specific and descriptive error messages
2. Better handle protobuf-related issues that are common in the Connect API
3. Make error messages more consistent across different parts of the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`build/sbt "connect/testOnly *SparkConnectPlannerSuite"`
`build/sbt "connect/testOnly *InvalidInputErrorsSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 0.50.7 (Universal)

Closes apache#51062 from heyihong/SPARK-52383.

Authored-by: Yihong He <heyihong.cn@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…rt regenerating the `expectation.json` files using `SPARK_GENERATE_GOLDEN_FILES=1`

### What changes were proposed in this pull request?
This PR refactors the `HistoryServerSuite` to support regenerating the `expectation.json` files using `SPARK_GENERATE_GOLDEN_FILES=1`, and refreshes the `expectation.json` files with the new approach:

```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly org.apache.spark.deploy.history.RocksDBBackendHistoryServerSuite"
```

Meanwhile, this PR also cleans up some methods that are no longer needed.

### Why are the changes needed?
According to the documentation, the `HistoryServerSuite.main` method was previously used to regenerate golden files. However, there are currently 3 issues:

1. The `HistoryServerSuite.main` method generates the `expectation.json` file in the `target/scala-2.13/test-classes/HistoryServerExpectations` directory, rather than directly refreshing the content in the `core/src/test/resources/HistoryServerExpectations` directory.

2. The previous `expectation.json` files were likely based on data that had undergone re-formatting, rather than being the files generated by `HistoryServerSuite.main` stored directly in the `HistoryServerExpectations` directory.

3. The `HistoryServerSuite` generates golden files in a different way compared to other tests, and it does not support the `SPARK_GENERATE_GOLDEN_FILES` parameter, which is somewhat inconvenient.

Therefore, this PR enables support for `SPARK_GENERATE_GOLDEN_FILES`. When `SPARK_GENERATE_GOLDEN_FILES=1`, it directly refreshes the content in the `core/src/test/resources/HistoryServerExpectations` directory.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51072 from LuciferYang/refactor-HistoryServerSuite.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?

Currently, `structured_logging_style.py` assumes that it's being run from the `dev` directory. If it's run from the project root, it will scan all files in the parent directory, i.e. outside the project root. I encountered this and was surprised that it was taking minutes just in the globbing step.

This change avoids that assumption, so `python dev/structured_logging_style.py` will still produce correct results.
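
The fix amounts to anchoring the search at the repository root derived from the script's own location; a minimal sketch of the idea (not the actual script contents):

```py
import glob
import os

# Resolve the repo root from this file's location (dev/..) instead of assuming
# the script is invoked from the dev/ directory.
SPARK_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
python_files = glob.glob(os.path.join(SPARK_ROOT, "python", "**", "*.py"), recursive=True)
```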

### Why are the changes needed?

Make it easier for developers to run the structured logging checks locally.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
dev/structured_logging_style.py
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51087 from sryza/structured-logging-dir.

Lead-authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… with same operationId and plan reattaches

### What changes were proposed in this pull request?

In Spark Connect, queries can fail with the error INVALID_HANDLE.OPERATION_ALREADY_EXISTS, when a client retries an ExecutePlan RPC—often due to transient network issues—causing the server to receive the same request multiple times. Since each ExecutePlan request includes an operation_id, the server interprets the duplicate as an attempt to create an already existing operation, which results in the OPERATION_ALREADY_EXISTS exception. This behavior interrupts query execution and breaks the user experience under otherwise recoverable conditions.

To resolve this, the PR introduces idempotent handling of ExecutePlan on the server side. When a request with a previously seen operation_id and the same plan is received, instead of returning an error, the server now reattaches the response stream to the already running execution associated with that operation. This ensures that retries due to network flakiness no longer result in failed queries, thereby improving the resilience and robustness of query executions.
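
Conceptually, the server-side decision looks like the following sketch (illustrative only; the helpers and names here are hypothetical, not the actual Spark Connect service code):

```py
executions = {}  # operation_id -> (plan, running execution)

def handle_execute_plan(request):
    entry = executions.get(request.operation_id)
    if entry is None:
        execution = start_execution(request.plan)             # hypothetical helper
        executions[request.operation_id] = (request.plan, execution)
        return attach_response_stream(execution)              # hypothetical helper
    plan, execution = entry
    if plan == request.plan:
        # Same operation retried (e.g. after a transient network failure):
        # reattach to the running execution instead of failing.
        return attach_response_stream(execution)
    # A different plan under the same operation_id is still an error.
    raise OperationAlreadyExistsError(request.operation_id)   # hypothetical error type
```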

### Why are the changes needed?

It will improve the stability of Spark Connect in case of transient network issues.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51084 from xi-db/fix_operation_already_exists.

Authored-by: Xi Lyu <xi.lyu@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
…nference

### What changes were proposed in this pull request?

This pull request changes the `FOR` statement to infer column schemas from the query DataFrame, and no longer implicitly infers column schemas in SetVariable. This is necessary due to type mismatch errors with complex nested types, e.g. `ARRAY<STRUCT<..>>`.

### Why are the changes needed?

Bug fix for FOR statement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test that specifically targets problematic case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51053 from davidm-db/for_schema_inference.

Lead-authored-by: David Milicevic <david.milicevic@databricks.com>
Co-authored-by: David Milicevic <163021185+davidm-db@users.noreply.github.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Create the `artifacts` temporary directory under the temp directory instead of the current working directory.

### Why are the changes needed?
Currently the `artifacts` directory is created under the working directory, which may not be writable.

`artifacts` is a temporary directory, and although we have an exit cleanup mechanism, it's better to put it under the driver's tmpdir because `artifacts` may not be cleaned up after an OOM exit.

```java
org.apache.spark.sql.execution.QueryExecutionException: java.io.IOException: Failed to create a temp directory (under artifacts) after 10 attempts!
	at org.apache.spark.network.util.JavaUtils.createDirectory(JavaUtils.java:411)
	at org.apache.spark.util.SparkFileUtils.createDirectory(SparkFileUtils.scala:95)
	at org.apache.spark.util.SparkFileUtils.createDirectory$(SparkFileUtils.scala:94)
	at org.apache.spark.util.Utils$.createDirectory(Utils.scala:99)
	at org.apache.spark.util.Utils$.createTempDir(Utils.scala:249)
	at org.apache.spark.sql.artifact.ArtifactManager$.artifactRootDirectory$lzycompute(ArtifactManager.scala:468)
	at org.apache.spark.sql.artifact.ArtifactManager$.artifactRootDirectory(ArtifactManager.scala:467)
	at org.apache.spark.sql.artifact.ArtifactManager.artifactRootPath(ArtifactManager.scala:60)
	at org.apache.spark.sql.artifact.ArtifactManager.<init>(ArtifactManager.scala:70)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$artifactManager$2(BaseSessionStateBuilder.scala:395)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
local test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51083 from cxzl25/SPARK-52396.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In this PR I propose that we cover test gaps discovered during single-pass Analyzer implementation.

### Why are the changes needed?
To make Spark test coverage better.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51088 from mihailoale-db/goldenfiles1.

Authored-by: mihailoale-db <mihailo.aleksic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR removes unnecessary code for converting Variants in PySpark from local to arrow representation. This allows createDataFrame and Python Datasources to work seamlessly with Variants.

### Why are the changes needed?

[This PR](apache#45826) introduced code to convert Variants from internal representation to representation in Arrow (LocalDataToArrowConversion). However, the internal representation is assumed to be `dict` and the arrow representation is assumed to be `VariantVal` even though it should be the other way around. It appears the code written in that PR is not actually exercised by any tests.

This caused `createDataFrame` to not work with Variants and the [attempted fix](apache#49487) added a special case (`variants_as_dicts`) for this code, even though the special case was actually the only use case. This PR removes the old unnecessary code and only keeps the "special case" code as the main code for converting Variant from local (`VariantVal`) to Arrow (`dict`).

### Does this PR introduce _any_ user-facing change?

This will allow users to use Python datasources with Variants.

### How was this patch tested?

Existing tests should pass, and a new unit test for Python Datasources was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51082 from harshmotw-db/harsh-motwani_data/experimental_variant_fix.

Authored-by: Harsh Motwani <harsh.motwani@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…s to ensure the consistency of the content output to golden files

### What changes were proposed in this pull request?
This PR adds sorting to the JSON content when regenerating golden files in `HistoryServerSuite` to ensure that the JSON strings written to the golden files are stable. Additionally, this PR refreshes the `HistoryServerExpectations` directory with the new approach for all golden files.
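
The same idea in Python, for illustration (the suite itself does this in Scala when writing the golden files):

```py
import json

record = {"startTimeEpoch": 1642039450519, "endTimeEpoch": 1642039536564, "lastUpdatedEpoch": 0}
# Sorting keys makes the serialized output byte-stable across environments.
print(json.dumps(record, indent=2, sort_keys=True))
```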

### Why are the changes needed?
apache#51072 has fixed and enhanced the way `HistoryServerSuite` generates golden files. However, the current approach is susceptible to the execution environment and cannot guarantee strong consistency in the content output to golden files every time. For instance, when I switched to a different computer to run the command for refreshing golden files, I noticed that the results had changed:

```
diff --git a/core/src/test/resources/HistoryServerExpectations/application_list_json_expectation.json b/core/src/test/resources/HistoryServerExpectations/application_list_json_expectation.json
index e485c0a..500a748 100644
--- a/core/src/test/resources/HistoryServerExpectations/application_list_json_expectation.json
+++ b/core/src/test/resources/HistoryServerExpectations/application_list_json_expectation.json
@@ -9,9 +9,9 @@
     "sparkUser" : "lijunqing",
     "completed" : true,
     "appSparkVersion" : "3.3.0-SNAPSHOT",
+    "startTimeEpoch" : 1642039450519,
     "endTimeEpoch" : 1642039536564,
-    "lastUpdatedEpoch" : 0,
-    "startTimeEpoch" : 1642039450519
+    "lastUpdatedEpoch" : 0
   } ]
 }, {
   "id" : "application_1628109047826_1317105",
@@ -24,9 +24,9 @@
     "sparkUser" : "john",
     "completed" : true,
     "appSparkVersion" : "3.1.1.119",
+    "startTimeEpoch" : 1628637895333,
     "endTimeEpoch" : 1628638170208,
-    "lastUpdatedEpoch" : 0,
-    "startTimeEpoch" : 1628637895333
+    "lastUpdatedEpoch" : 0
   } ]
 }, {
...
```

As can be seen, due to differences in the execution environment, the ordering of the output fields has changed. Therefore, field sorting has been added to the output JSON to ensure that the output content is not affected by the environment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Executing

```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "core/testOnly org.apache.spark.deploy.history.RocksDBBackendHistoryServerSuite"
```
 in multiple environments has verified the consistency of the output content.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51093 from LuciferYang/SPARK-52386-FOLLOWUP.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
… event logging

### What changes were proposed in this pull request?

**See the flow chart describing the changes made in this PR: [flow chart link](https://lucid.app/lucidchart/c773b051-c634-4f0e-9a3c-a21e24ae540a/edit?viewport_loc=-4594%2C-78%2C5884%2C3280%2C0_0&invitationId=inv_3f036b9d-1a2a-4dd9-bf50-084cd90e5460)**

As described in the [Declarative Pipelines SPIP](https://docs.google.com/document/d/1PsSTngFuRVEOvUGzp_25CQL1yfzFHFr02XdMfQ7jOM4/edit?tab=t.0#heading=h.9g6a5f8v6xig), after we parse the user's code and represent datasets and dataflows in a `DataflowGraph` (from PR apache#51003), we execute the `DataflowGraph`. This PR implements this execution.

## Main execution steps inside a pipeline run

### Step 1: Initialize the raw `DataflowGraph`
In `PipelineExecution::runPipeline()`, we first initialize the dataflow graph by topologically sorting the dependencies and also figuring out the expected metadata (e.g., schema) for each dataset (`DataflowGraph::resolve()`). Also, we run some pre-flight validations to catch early errors such as circular dependencies, creating a streaming table from a batch data source, etc. (`DataflowGraph::validate()`).

### Step 2: Materialize datasets defined in the `DataflowGraph` to the catalog
After the graph is topologically sorted and validated and every dataset / flow has correct metadata populated, we publish the corresponding dataset in the catalog (which could be Hive, UC, or others) in `DatasetManager::materializeDatasets()`. For example, for each Materialized View and Table, it would register an empty table in the catalog with the correct metadata (e.g., table schema, table properties, etc). If the table already exists, we alter it to have the correct metadata.

### Step 3: Populate data to the registered tables by executing the `DataflowGraph`
After datasets have been registered to the catalog, inside `TriggeredGraphExecution`, we transform each dataflow defined in the `DataflowGraph` into an actual execution plan to run the actual workload and populate the data to the empty table (we transform `Flow` into `FlowExecution` through `FlowPlanner`).

Each `FlowExecution` will be executed in topological order based on the sorted `DataflowGraph`, and we parallelize the execution as much as possible. Depending on the type of error, failed flows may be retried as part of execution.
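
For intuition, a simplified Python sketch of triggered execution in topological order with retries (sequential for brevity; the real `TriggeredGraphExecution` is written in Scala and parallelizes independent flows):

```py
from collections import deque

def run_triggered(flows, upstreams, max_retries=2):
    """flows: name -> callable running the flow; upstreams: name -> set of upstream flow names."""
    indegree = {name: len(upstreams[name]) for name in flows}
    ready = deque(name for name, d in indegree.items() if d == 0)
    while ready:
        name = ready.popleft()
        for attempt in range(max_retries + 1):
            try:
                flows[name]()  # run the flow, e.g. write its output table
                break
            except Exception:
                if attempt == max_retries:
                    raise
        # Unlock downstream flows whose upstreams have all completed.
        for downstream, ups in upstreams.items():
            if name in ups:
                indegree[downstream] -= 1
                if indegree[downstream] == 0:
                    ready.append(downstream)
```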

## Main components of this PR:

- **Flow execution** represents the execution of an individual flow in the dataflow graph. Relevant classes:
  - `FlowExecution`
  - `StreamingFlowExecution`
  - `BatchFlowExecution`
  - `FlowPlanner` – constructs `FlowExecution`s from `Flow` objects
- **Graph execution** represents the execution of an entire dataflow graph, i.e. step 3 in the set of steps above. In the future, we will add a `ContinuousGraphExecution` class, which executes all the streams at once instead of in topological order. Relevant classes:
  - `GraphExecution`
  - `TriggeredGraphExecution` – executes flows in topological order, handles retries when necessary
  - `BackoffStrategy` – used for retries
  - `UncaughtExceptionHandler`
  - `PipelineConf` – a few configurations that control graph execution behavior
- **Pipeline execution** represents a full "run" including all three execution steps above: graph resolution, catalog materialization, and graph execution. Relevant classes:
  - `PipelineExecution`
  - `RunTerminationReason`
  - `PipelineUpdateContext` – represents the parameters to a pipeline execution
  - `PipelineUpdateContextImpl`
- **Catalog materialization** (step 2 in the execution steps described above): represents datasets in the dataflow graph in the catalog. Uses DSv2 APIs. Relevant classes:
  - `DatasetManager`
- **Graph filtration / selection** allows selecting just a subset of the graph to be executed. In a followup, we will add the plumbing that allows specifying this from the CLI. Relevant classes:
  - `GraphFilter`
- **Events** track the progress of a pipeline execution. The event messages are sent to the client for console logging, and the structured events are available for assertions inside tests. Eventually, these could power info in the Spark UI as well. Relevant classes:
  - `FlowProgressEventLogger`
  - `PipelineRunEventBuffer`
  - `StreamListener`
  - `ConstructPipelineEvent`

### Why are the changes needed?

This PR implements the core functionality for executing a Declarative Pipeline.

### Does this PR introduce _any_ user-facing change?

It introduces new behavior, but does not modify existing behavior.

### How was this patch tested?

New unit test suite:
- `TriggeredGraphExecutionSuite`: tests end-to-end executions of the pipeline under different scenarios (happy path, failure path, etc) and validates that the proper data has been written and the proper event log is emitted.
- `MaterializeTablesSuite`: tests the logic for registering datasets in the catalog.

Augment existing test suites:
- `ConstructPipelineEventSuite` and `PipelineEventSuite` to validate the new FlowProgress event log we're introducing.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51050 from SCHJonathan/graph-execution.

Lead-authored-by: Yuheng Chang <jonathanyuheng@gmail.com>
Co-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…ease workflow

### What changes were proposed in this pull request?

This PR proposes to redact sensitive information in log files at release workflow.

### Why are the changes needed?

The output files are already protected by ZipCrypto but this PR makes it even safer by redacting sensitive information in log files.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in https://github.com/HyukjinKwon/spark/actions/runs/15481551209/job/43588217131

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51102 from HyukjinKwon/redact-sensitive-info-in-logs.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…unctions and variables into an abstract base class for Scala and Python

### What changes were proposed in this pull request?

Refactor the TWS Exec code to extract the common functions/variables and move them to a base abstract class such that it can be shared by both scala exec and python exec.

### Why are the changes needed?

Code elegance: less duplicate code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No functionality change; existing UTs should provide test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51077 from huanliwang-db/refactor-tws.

Authored-by: huanliwang-db <huanli.wang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR is just extending the existing V2JDBCTest that is used for testing different pushdowns for JDBC connectors.

JDBC Options support `numPartitions`, `lowerBound`, `upperBound`, and `partitionColumn` options. The idea is to test reading data from JDBC data sources when these options are used. Using these options will disable some of the pushdowns, for example Offset with Limit or Sort with Limit. Other pushdowns shouldn't be regressed (like Limit or Aggregation) and these are all tested.

### Why are the changes needed?
Testing if there is a correctness issue when using multiple partitions for reading the data from JDBC data sources.

### Does this PR introduce _any_ user-facing change?
No, since this is a test-only PR.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#51098 from PetarVasiljevic-DB/test_jdbc_parallel_read.

Authored-by: Petar Vasiljevic <petar.vasiljevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…w tests

### What changes were proposed in this pull request?
Fix non-ANSI test breakage as per apache#50959 (review)

### Why are the changes needed?
Many V2 Expressions are only converted successfully in ANSI mode, so this test for V2 Expression only makes sense in that mode.

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
Run test in NON-ANSI mode

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51092 from szehon-ho/SPARK-52235-follow.

Lead-authored-by: Szehon Ho <szehon.apache@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…stingVersions in HiveExternalCatalogVersionsSuite

### What changes were proposed in this pull request?

Fix the version parsing logic in `HiveExternalCatalogVersionsSuite` to properly handle new artifact paths in https://dist.apache.org/repos/dist/release/spark/ so that "backward compatibility" test can be run.

This change creates a constant `val SparkVersionPattern = """<a href="spark-(\d.\d.\d)/">""".r` for more precise version matching, and removes redundant `.filterNot(_.contains("preview"))` which is no longer needed.

### Why are the changes needed?

The suite is failing to execute the "backward compatibility" test due to parsing errors with testing versions. The current implementation fails to parse versions when encountering new paths like `spark-connect-swift-0.1.0/` and `spark-kubernetes-operator-0.1.0/` in https://dist.apache.org/repos/dist/release/spark/.

This leads to `PROCESS_TABLES.testingVersions` being empty, and in turn to a logError: "Exception encountered when invoking run on a nested suite - Fail to get the latest Spark versions to test". As a result, the condition to run the test is not met.
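
For illustration, the effect of the stricter pattern (shown with Python's `re`; the suite itself uses a Scala `Regex`):

```py
import re

listing = [
    '<a href="spark-3.5.5/">',
    '<a href="spark-connect-swift-0.1.0/">',
    '<a href="spark-kubernetes-operator-0.1.0/">',
]
pattern = re.compile(r'<a href="spark-(\d\.\d\.\d)/">')
versions = [m.group(1) for link in listing if (m := pattern.search(link))]
print(versions)  # ['3.5.5']: helper project links no longer break version parsing
```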

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Executed local build and test for `HiveExternalCatalogVersionsSuite`:

`build/mvn -pl sql/hive -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite test-compile scalatest:test`

Verified that the reported error no longer appears, "backward compatibility" test runs successfully, and `PROCESS_TABLES.testingVersions` now correctly contains "3.5.5" when printed out, which was previously empty.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#50989 from efaracci018/fix-testingVersions.

Lead-authored-by: Emilie Faracci <efaracci@amazon.com>
Co-authored-by: efaracci018 <efaracci@amazon.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
…ANSI mode is on

### What changes were proposed in this pull request?
Avoid INVALID_ARRAY_INDEX in `split`/`rsplit` when ANSI mode is on

### Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.

### Does this PR introduce _any_ user-facing change?
Yes. INVALID_ARRAY_INDEX no longer fails `split`/`rsplit` when ANSI mode is on

```py
>>> spark.conf.get("spark.sql.ansi.enabled")
'true'
>>> import pandas as pd
>>> pser = pd.Series(["hello-world", "short"])
>>> psser = ps.from_pandas(pser)
```

FROM
```py
>>> psser.str.split("-", n=1, expand=True)
25/05/28 14:52:10 ERROR Executor: Exception in task 10.0 in stage 2.0 (TID 15)
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX] The index 1 is out of bounds. The array has 1 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. SQLSTATE: 22003
== DataFrame ==
"__getitem__" was called from
<stdin>:1
...
```
TO
```py
>>> psser.str.split("-", n=1, expand=True)
       0      1
0  hello  world
1  short   None
```

### How was this patch tested?
Unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51006 from xinrong-meng/arr_idx_enable.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
…I enabled

### What changes were proposed in this pull request?
Enable divide-by-zero for boolean mod/rmod with ANSI enabled

### Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52169.

### Does this PR introduce _any_ user-facing change?
Yes, divide-by-zero is enabled when ANSI is on, as shown below:

```
>>> ps.set_option("compute.fail_on_ansi_mode", False)
>>> pser = pd.Series([True, False])
>>> psser = ps.from_pandas(pser)

>>> ps.set_option("compute.ansi_mode_support", True)
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> 1 % psser
0    0.0
1    NaN
dtype: float64

# Same as ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> 1 % psser
0    0.0
1    NaN
dtype: float64

```

### How was this patch tested?
Unit tests, and

```py
(dev3.10) spark (divide_0_tests) % SPARK_ANSI_SQL_MODE=true  ./python/run-tests --python-executables=python3.10 --testnames "pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTests.test_mod"

Running PySpark tests. Output is in /Users/xinrong.meng/spark/python/unit-tests.log
...
Finished test(python3.10): pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTests.test_mod (5s)
Tests passed in 5 seconds

(dev3.10) spark (divide_0_tests) % SPARK_ANSI_SQL_MODE=true  ./python/run-tests --python-executables=python3.10 --testnames "pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTests.test_rmod"
Running PySpark tests. Output is in /Users/xinrong.meng/spark/python/unit-tests.log
...
Finished test(python3.10): pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTests.test_rmod (4s)
Tests passed in 4 seconds
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51058 from xinrong-meng/bool_mod.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
MaxGekk and others added 30 commits July 5, 2025 20:52
### What changes were proposed in this pull request?
In the PR, I propose to rename the internal method `timeToMicros()` to `makeTime()` because:
1. It produces nanosecond values, not microsecond values.
2. It actually makes a TIME value from time fields; it does not convert TIME values to micros.

### Why are the changes needed?
To improve code maintenance and avoid confusing other devs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the related test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51380 from MaxGekk/rename-timeToMicros.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

The Declarative Pipelines SPIP included a "name" field in the pipeline spec, but we left that out in the earlier implementation. This adds it in.

The name field is required. This matches behavior for similar systems, like dbt.

### Why are the changes needed?

See above.

### Does this PR introduce _any_ user-facing change?
Yes, but only to unreleased code.

### How was this patch tested?

Updated existing tests, and added tests for proper error when the name is missing.

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#51353 from sryza/pipeline-name.

Authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…filer 4.0

### What changes were proposed in this pull request?

Bump ap-loader 4.0-10

### Why are the changes needed?

ap-loader 4.0 (v10) has already been released, which adds support for async-profiler 4.0. See the release notes: [Loader for 4.0 (v10): Heatmaps and Native memory profiling](https://github.com/jvm-profiling-tools/ap-loader/releases/tag/4.0-10)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51257 from wForget/SPARK-52560.

Authored-by: wforget <643348094@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…he getting started guide

### What changes were proposed in this pull request?

I changed two links to python references in the [Getting Started](https://spark.apache.org/docs/latest/sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations) guide to point to working documents. One was to the Dataframe reference, and one to the DataFrame functions reference.

### Why are the changes needed?

1. Currently the links I updated were broken, and led to a 404 error.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I tested the link change by building the docs locally and then checking the links worked correctly, and pointed to the right document.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51384 from carlotran4/master.

Authored-by: Carlo Tran <carlotran4@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Interrupt hanging ML handlers in tests:

Recently some ML Connect tests hang randomly.
We need to add code to interrupt hanging handler threads and print their stack traces, for debuggability.

### Why are the changes needed?

Recently some ML Connect tests hang randomly.
We need to add code to interrupt hanging handler threads and print their stack traces, for debuggability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51364 from WeichenXu123/ml-connect-hang-debugger.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
Add a new ExternalCatalog and SessionCatalog API alterTableSchema that will supersede alterTableDataSchema.

### Why are the changes needed?
Because ExternalCatalog::alterTableDataSchema takes dataSchema only (without partition columns), we lose the context of the partition columns. This makes it impossible to support column orders where partition columns are not at the end.

See apache#51342 for context

More generally, this is a more intuitive API than alterTableDataSchema, because the caller no longer needs to strip out partition columns. Also, it is not immediately obvious that "data schema" means the schema without partition columns.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test, move test for alterTableDataSchema to the new API

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51373 from szehon-ho/alter_table_schema.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

- renames two methods in ExtractPythonUDFs, and adds docstrings explaining the parallel fusing and chaining concepts

### Why are the changes needed?

- In my experience, new developers find the planning code hard to understand without sufficient explanation. The current method naming is confusing, as `canChainUDF` is actually used to decide eligibility for fusing parallel UDF invocations like `udf1(), udf2()` (see the sketch below).
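
A quick PySpark illustration of the two shapes (hypothetical UDFs, assuming a SparkSession named `spark`):

```py
from pyspark.sql.functions import col, udf

@udf("int")
def udf1(x):
    return x + 1

@udf("int")
def udf2(x):
    return x * 2

df = spark.range(3)
# Parallel ("fused") invocations: independent UDFs in one projection can be
# evaluated in a single Python worker pass.
df.select(udf1(col("id")), udf2(col("id")))
# Chained invocation: one UDF's output feeds another.
df.select(udf1(udf2(col("id"))))
```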

### Does this PR introduce _any_ user-facing change?

- No

### How was this patch tested?

- Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#50867 from benrobby/SPARK-52082.

Authored-by: Ben Hurdelhey <ben.hurdelhey@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

The proposed changes are largely based on commit 0674327, which added caching support for TIMESTAMP_NTZ. This PR makes the same changes, but for the TIME type.

### Why are the changes needed?

To support caching the TIME type, e.g.:
```
CACHE TABLE v1 AS SELECT TIME'22:00:00';
```
### Does this PR introduce _any_ user-facing change?

No. The TIME type is not released yet.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51385 from bersprockets/time_cache.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

This PR aims to make Maven plugins up-to-date.

### Why are the changes needed?

To prepare Apache Maven 4.0.

### Does this PR introduce _any_ user-facing change?

No, this is a build-only change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51390 from dongjoon-hyun/SPARK-52697.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR adds tests for the TIME data type in Spark's Java API, covering Dataset and UDF functionality with `java.time.LocalTime`:
  - Adding a Dataset filter operation with `java.time.LocalTime` (there is no similar one for `TimestampType`).
  - UDF registration and execution with TimeType in `udf8Test`, the same as `udf7Test()` for `TimestampType`.
  - `testLocalTimeEncoder()` already existed for `TimestampType` parity.

### Why are the changes needed?

As part of the TIME data type SPIP (SPARK-51162), we need test coverage in the Java API for Datasets and UDFs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

  Added new test methods:
  - `JavaDatasetSuite.testLocalTimeFilter` - Tests Dataset filter with `LocalTime`
  - `JavaUDFSuite.udf8Test` - Tests UDF registration and execution with `LocalTime`

![Screenshot 2025-07-07 at 4 00 58 AM](https://github.com/user-attachments/assets/160f207e-e3e5-45a2-ac5d-35b55a76215e)
![Screenshot 2025-07-07 at 3 59 24 AM](https://github.com/user-attachments/assets/c18bf26d-a6ef-4445-95b0-a67566ac9fa0)

  I executed the tests themselves locally and also ran `./build/mvn test-compile -pl sql/core` to test compilation.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51387 from fartzy/SPARK-51557_Add_tests_for_TIME_data_type_in_Java_API.

Authored-by: Mike Artz <fartzy@hotmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

This PR improves the type annotations in python/pyspark/sql/datasource.py to use Python 3.10 typing syntax and built-in types instead of their typing module equivalents.
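
For example, the change is roughly of this shape (the signatures below are hypothetical, not the actual `datasource.py` ones):

```py
# Before: typing-module generics (pre-3.10 style)
from typing import Dict, List, Optional, Tuple, Union

def partitions_old(options: Dict[str, str]) -> List[Tuple[int, int]]: ...
def read_old(value: Optional[Union[str, int]]) -> str: ...

# After: built-in generics and union syntax (Python 3.10+)
def partitions_new(options: dict[str, str]) -> list[tuple[int, int]]: ...
def read_new(value: str | int | None) -> str: ...
```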

### Why are the changes needed?

Follows current Python typing recommendations and best practices.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51392 from allisonwang-db/spark-52698-type-hint.

Authored-by: Allison Wang <allison.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR adds UDT write support for the XML file format

### Why are the changes needed?

IllegalArgumentException is being thrown while writing UDT values

### Does this PR introduce _any_ user-facing change?

Yes. If the UDT's sqlType is compatible with the XML file format, it becomes writable.

### How was this patch tested?
new test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#51388 from yaooqinn/SPARK-52695.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?

Some pipeline runs result in wiping out and replacing all the data for a table:
- Every run of a materialized view
- Runs of streaming tables that have the "full refresh" flag

In the current implementation, this "wipe out and replace" is implemented by:
- Truncating the table
- Altering the table to drop/update/add columns that don't match the columns in the DataFrame for the current run

The reason we originally wanted to truncate + alter instead of drop / recreate is that dropping has some undesirable effects. E.g. it interrupts readers of the table and wipes away things like ACLs.

However, we discovered that not all catalogs support dropping columns (e.g. Hive does not), and there’s no way to tell whether a catalog supports dropping columns or not. So this PR changes the implementation to drop/recreate the table instead of truncate/alter.

### Why are the changes needed?

See section above.

### Does this PR introduce _any_ user-facing change?

Yes, see section above. No releases contained the old behavior.

### How was this patch tested?

- Tests in MaterializeTablesSuite
- Ran the tests in MaterializeTablesSuite with Hive instead of the default catalog

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51280 from sryza/drop-on-full-refresh.

Authored-by: Sandy Ryza <sandy.ryza@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to support the `+` and `-` operators over TIME and DAY-TIME INTERVAL.

#### Syntax
```
exprA + exprB, exprB + exprA
exprA - exprB
```
where
- **exprA** - an expression of the TIME data type of any valid precision [0, 6].
- **exprB** - an expression of the DAY-TIME INTERVAL type with any start and end fields `SECOND`, `MINUTE`, `HOUR`, `DAY`.

#### Returns

The result has the TIME(n) data type, or the error `DATETIME_OVERFLOW` is raised if the result is out of the valid range `[00:00, 24:00)`. If the result is valid, its precision `n` is the maximum of the input time's precision `m` and the day-time interval's precision `i`: `n = max(m, i)`, where `i = 6` for the end interval field `SECOND` and `0` for the other fields `MINUTE`, `HOUR`, `DAY`.
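
For example, via `spark.sql` (a sketch assuming a build that includes the unreleased TIME type):

```py
spark.sql("SELECT TIME'22:00:00' + INTERVAL '90' MINUTE AS t").show()
# expected: 23:30:00
spark.sql("SELECT TIME'23:30:00' + INTERVAL '1' HOUR AS t").show()
# expected: DATETIME_OVERFLOW, since 24:30 is outside the valid range [00:00, 24:00)
```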

### Why are the changes needed?
To conform to the ANSI SQL standard:
<img width="867" alt="Screenshot 2025-07-07 at 09 41 49" src="https://github.com/user-attachments/assets/808a3bad-70a6-4c28-b23d-83e8399bd0e9" />

### Does this PR introduce _any_ user-facing change?
No. The TIME data type hasn't been released yet.

### How was this patch tested?
By running new tests and affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z time.sql"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51383 from MaxGekk/time-add-interval.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

Update `InternalRow#getWriter` to return the correct writer for the TIME type.

### Why are the changes needed?

Without this PR, aggregating the TIME type in interpreted mode fails. Consider the below query:
```
set spark.sql.codegen.factoryMode=NO_CODEGEN;

create or replace temp view v1(col1) as values
(time'22:33:01'),
(time'01:33:01'),
(null);

select max(col1), min(col1) from v1;
```
Without this change, the query fails with the following error:
```
Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkUnsupportedOperationException: [UNSUPPORTED_CALL.WITHOUT_SUGGESTION] Cannot call the method "update" of the class "org.apache.spark.sql.catalyst.expressions.UnsafeRow".  SQLSTATE: 0A000
	at org.apache.spark.SparkUnsupportedOperationException$.apply(SparkException.scala:266) ~[spark-common-utils_2.13-4.1.0-SNAPSHOT.jar:4.1.0-SNAPSHOT]
```
### Does this PR introduce _any_ user-facing change?

No. The TIME type is not released yet.

### How was this patch tested?

Updated unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51395 from bersprockets/time_get_writer.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
…format options in TextBasedFileFormats

### What changes were proposed in this pull request?

Simplify interoperations between SQLConf and file-format options in TextBasedFileFormats

### Why are the changes needed?

- Reduce code duplication
- Restore type annotation for IDE

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#51398 from yaooqinn/SPARK-52704.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
…versions

### What changes were proposed in this pull request?

This PR proposes to remove preview postfix when looking up the JIRA versions

### Why are the changes needed?

Otherwise, preview builds fail.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51399 from HyukjinKwon/SPARK-52707.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… to 3.10

### What changes were proposed in this pull request?
Upgrade the minimum Python version for the pandas API on Spark to 3.10.

### Why are the changes needed?
Python 3.9 is reaching EOL, so we should upgrade the minimum Python version.

### Does this PR introduce _any_ user-facing change?
No, infra-only

### How was this patch tested?
PR builder with env

```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-ps-minimum", "PYTHON_TO_TEST": "python3.10"}'
```

https://github.com/zhengruifeng/spark/actions/runs/16133332146/job/45534172036

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#51397 from zhengruifeng/ps_py_310.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?

Move check for non-deterministic expressions in grouping expressions from `ExprUtils` to `CheckAnalysis`.
### Why are the changes needed?
This is necessary in order to be able to use the `PullOutNondeterministic` rule as a post-processing rewrite rule in the single-pass analyzer. Because `ExprUtils.assertValidAggregate` is called during the bottom-up traversal, we can't check for non-deterministic expressions there.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51391 from mihailotim-db/mihailotim-db/pull_out_nondeterministic.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…Data Source

### What changes were proposed in this pull request?
- Allow overwriting static Python Data Sources during registration
- Update documentation to clarify Python Data Source behavior and registration options

### Why are the changes needed?
Static registration is a bit obscure and doesn't always work as expected (e.g. when the module providing DefaultSource is installed after `lookup_data_sources` already ran).
So in practice users (or LLM agents) often want to explicitly register the data source even if it is provided as a DefaultSource.
Raising an error in this case interrupts the workflow, making LLM agents spend extra tokens regenerating the same code but without registration.

This change also makes the behavior consistent with user data source registration, which is already allowed to overwrite previous user registrations.
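
A minimal sketch of the new behavior, under the assumption that a statically registered `DefaultSource` already claims the short name `my_source` (the class names and short name below are illustrative, not from this PR):
```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader

class MyReader(DataSourceReader):
    def read(self, partition):
        # Emit a single row matching the declared schema.
        yield (0,)

class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my_source"  # hypothetical name that collides with a static registration

    def schema(self):
        return "id INT"

    def reader(self, schema):
        return MyReader()

spark = SparkSession.builder.getOrCreate()
# Previously this raised an error if "my_source" was statically registered;
# with this change it overwrites the static registration instead.
spark.dataSource.register(MyDataSource)
spark.read.format("my_source").load().show()
```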

### Does this PR introduce _any_ user-facing change?
Yes. Previously, registering a Python Data Source with the same name as a statically registered one would throw an error. With this change, it will overwrite the static registration.

### How was this patch tested?
Added a test in `PythonDataSourceSuite.scala` to verify that static sources can be overwritten correctly.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#50716 from wengh/pyds-overwrite-static.

Authored-by: Haoyu Weng <wenghy02@gmail.com>
Signed-off-by: Allison Wang <allison.wang@databricks.com>
### What changes were proposed in this pull request?

Small bug fix where the wrong variable names were used

### Why are the changes needed?

The function used `lval` and `rval` instead of its parameters `val1` and `val2`.
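
A minimal sketch of the bug pattern, with a hypothetical comparison helper standing in for the actual function touched by this PR:
```python
# Hypothetical helper illustrating the bug pattern, not the actual PR code.
lval, rval = 1, 2  # names captured from an enclosing scope

def almost_equal(val1, val2):
    # Before the fix: the comparison accidentally used lval/rval from the
    # enclosing scope. After the fix: compare the parameters actually passed in.
    return abs(val1 - val2) < 1e-6

print(almost_equal(3.0, 3.0))  # True, regardless of lval/rval
```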

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51253 from petern48/pandas_assert_bug.

Authored-by: Peter Nguyen <petern0408@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

In Spark Declarative Pipelines (SDP), users can define append flows in Python using the [append_flow](https://github.com/apache/spark/blob/e3321aa44ea255365222c491657b709ef41dc460/python/pyspark/pipelines/api.py#L34-L41) decorator. The `append_flow` decorator currently accepts a `comment` arg. However, there is currently no way for users to see flow comments, so the argument is unused and not referenced in the function body.

```py
def append_flow(
    *,
    target: str,
    name: Optional[str] = None,
    comment: Optional[str] = None,                 # <--- Removing
    spark_conf: Optional[Dict[str, str]] = None,
    once: bool = False,
) -> Callable[[QueryFunction], None]:
```

This PR removes the field.
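
A minimal sketch of an append flow after this change (the import path and table names are assumptions for illustration, and `spark` is the active session):
```python
from pyspark.pipelines import append_flow  # assumed import location of the decorator

@append_flow(target="events", name="events_backfill")
def events_backfill():
    # The decorated query function returns the DataFrame to append to the target.
    return spark.readStream.table("events_raw")
```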

### Why are the changes needed?

The `comment` arg is not used anywhere, and keeping it in the API may mislead users into thinking they can see flow comments somewhere.

### Does this PR introduce _any_ user-facing change?

Yes, the previously optional `comment` arg is removed from the `append_flow` API. However, SDP has not been released yet (pending release in v4.1), so no user should be impacted by this change.

### How was this patch tested?

Examined all test cases to make sure none of the current `append_flow` usages supply this argument.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51404 from JiaqiWang18/SPARK-52714-remove-append_flow-comment.

Authored-by: Jacky Wang <jacky.wang@databricks.com>
Signed-off-by: Sandy Ryza <sandy.ryza@databricks.com>
…ANSI

### What changes were proposed in this pull request?
Fix float32 type widening in `mod` with bool under ANSI.

### Why are the changes needed?
Ensure pandas on Spark works well with ANSI mode on.
Part of https://issues.apache.org/jira/browse/SPARK-52700.

### Does this PR introduce _any_ user-facing change?
Yes. `mod` under ANSI now behaves the same as in pandas.
```py
(dev3.11) spark (mod_dtype) % SPARK_ANSI_SQL_MODE=False  ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_mod NumModTests.test_mod"
...
Tests passed in 8 seconds

(dev3.11) spark (mod_dtype) % SPARK_ANSI_SQL_MODE=True  ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_mod NumModTests.test_mod"
...
Tests passed in 7 seconds

```
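
A minimal, illustrative sketch of the behavior this fix targets (the values below are made up; the real coverage lives in `test_num_mod`):
```python
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series([1.5, 2.5, -3.0], dtype="float32")
psser = ps.from_pandas(pser)

# float32 % bool should keep a float dtype and match pandas,
# with or without ANSI mode enabled.
print((pser % True).dtype)
print((psser % True).dtype)
```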

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51394 from xinrong-meng/mod_dtype.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Xinrong Meng <xinrong@apache.org>
…n parser

### What changes were proposed in this pull request?
This PR proposes a change in how our parser treats data types: we introduce separate rules for types with and without parameters and group them accordingly.

### Why are the changes needed?
Changes are needed for several reasons:
1. The context of `primitiveDataType` keeps getting bigger. This is not good practice, as we end up with many null fields that only take up memory.
2. We have inconsistencies in where each type is handled: TIMESTAMP_NTZ is parsed in a separate rule, but it is also mentioned in the primitive types.
3. Primitive types should stay related to primitive types; adding ARRAY, STRUCT, and MAP to the rule just because it is convenient is not good practice.
4. The current structure does not allow extending types with different features. For example, we introduced STRING collations, but what if we were to introduce CHAR/VARCHAR with collations? The current structure gives us no way to express a type like `CHAR(5) COLLATE UTF8_BINARY` (we can only do `CHAR COLLATE UTF8_BINARY (5)`).

### Does this PR introduce _any_ user-facing change?
No. This is internal refactoring.

### How was this patch tested?
All existing tests should pass; this is just code refactoring.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51335 from mihailom-db/restructure-primitive.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… files in release build

### What changes were proposed in this pull request?

This PR proposes to escape special characters when redacting the log files

### Why are the changes needed?

Currently, redaction fails when the values to redact contain regex special characters.
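
A minimal sketch of the technique (the release scripts' actual commands are not shown here; the secret value is made up):
```python
import re

secret = "p+ss$word(1)"
log_line = f"connecting with password {secret}"

# Escape the secret before building the regex so characters like '+', '$',
# and '(' are matched literally instead of breaking the pattern.
redacted = re.sub(re.escape(secret), "***", log_line)
print(redacted)  # connecting with password ***
```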

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51405 from HyukjinKwon/escape-patterns.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Extended `DescribeDatabaseCommand` to print collation.
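
A minimal sketch of the visible effect, assuming a build where a schema-level default collation can be set (the database name is illustrative and `spark` is the active session):
```python
spark.sql("CREATE DATABASE IF NOT EXISTS db1 DEFAULT COLLATION UTF8_LCASE")
# The DESCRIBE output now includes an additional row reporting the collation.
spark.sql("DESCRIBE DATABASE db1").show(truncate=False)
```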

### Why are the changes needed?
Part of new feature.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated `describe.sql`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51401 from ilicmarkodb/fix_describe_schema.

Authored-by: ilicmarkodb <marko.ilic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…plicates

### What changes were proposed in this pull request?

Union should be `resolved` only if there are no duplicates in any of the children and there are no conflicting attributes per branch.

### Why are the changes needed?

This is necessary in order to prevent rules like `ResolveReferences` or `WidenSetOperationTypes` from resolving upper nodes while the Union still has duplicate expr ids. Consider the following query pattern:
```
    sql("""CREATE TABLE t1 (col1 STRING, col2 STRING, col3 STRING)""".stripMargin)
    sql("""CREATE TABLE t2 (col1 STRING, col2 DOUBLE, col3 STRING)""".stripMargin)
    sql("""CREATE TABLE t3 (col1 STRING, col2 DOUBLE, a STRING, col3 STRING)""".stripMargin)
    sql("""SELECT
          |    *
          |FROM (
          |    SELECT col1, col2, NULL AS a, col1 FROM t1
          |    UNION
          |    SELECT col1, col2, NULL AS a, col3 FROM t2
          |    UNION
          |    SELECT * FROM t3
          |)""".stripMargin)
```
Because at the moment `Union` can be resolved even if there are duplicates in a branch, plan is transformed in a following way:
```
Union
+- Union
   :- Project col1#5, col2#6, null AS a#3, col1#5
   +- Project col1#8, col2#9, null AS a#4, col3#10
```
becomes
```
Union
+- Project col1#5, col2#16, cast(a#3 as string) AS a#17, col1#5
   +- Union
      :- Project col1#5, col2#6, null AS a#3, col1#5
      +- Project col1#8, col2#9, null AS a#4, col3#10
```
We end up with a duplicate `col1#5` in both the outer Project and the inner one. After `ResolveReferences` triggers, we deduplicate both the inner and outer Projects, resulting in an unnecessary extra Project.

Instead, by first deduplicating expr ids in the inner Project before continuing resolution, the Project we insert between the Unions will not contain duplicate expr ids, and we don't need to add another unnecessary one.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a test case.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51376 from mihailotim-db/mihailotim-db/fix_union_resolved.

Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to change the SQL representation of TIME and DAY-TIME INTERVAL subtraction by wrapping it in `DatetimeSub`.

### Why are the changes needed?
To improve user experience with Spark SQL. Before the changes, subtraction looks like:
```sql
TIME '12:30:00' + (- INTERVAL '12:29:59.000001' HOUR TO SECOND)
```
After the changes, the `+ (- ...)` form is replaced by just `-`:
```sql
TIME '12:30:00' - INTERVAL '12:29:59.000001' HOUR TO SECOND
```

### Does this PR introduce _any_ user-facing change?
No. The TIME data type hasn't been released yet.

### How was this patch tested?
By running the affected tests:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z time.sql"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51403 from MaxGekk/time-nice-subtract.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>