[SPARK-50522][SQL] Support for indeterminate collation #49103

stefankandic · 2024-12-07T04:45:11Z

What changes were proposed in this pull request?

This pull request changes how non-explicit collation mismatches are handled. Currently, Spark throws an error for any collation mismatch. With this change, expressions keep working even when the collation of their inputs cannot be determined.

However, if they try to fetch the collator or the comparison/hash functions of the unknown (indeterminate) string type, a runtime error is raised.

Since a runtime error is vague and unfriendly to end users, I also added a list of expressions that are guaranteed not to work with indeterminate collation (binary comparison, string search, etc.) so that they fail immediately during analysis and tell the user exactly where the problem is.

Finally, data with indeterminate collation should never be serialized, but it can still be shown back to the user, used to create views, etc.
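To make the intended behavior concrete, here is a rough, self-contained sketch in plain Scala. It is not the code in this PR: `Collation`, `Named`, `Indeterminate`, `Str`, `merge`, `concat`, `comparatorFor`, and `equal` are all hypothetical names, and lower-casing is only a crude stand-in for `UTF8_LCASE`. It assumes that two differing implicit collations merge into an indeterminate one, that `concat` never needs a collator, and that asking for comparison logic on an indeterminate collation is where the runtime error surfaces.

```scala
object IndeterminateCollationSketch {
  // Hypothetical stand-ins for Spark's collation handling; not the real classes.
  sealed trait Collation
  final case class Named(name: String) extends Collation
  case object Indeterminate extends Collation

  final case class Str(value: String, collation: Collation)

  // Assumption: implicit collations that disagree no longer raise an error;
  // the result is simply marked as indeterminate.
  def merge(a: Collation, b: Collation): Collation =
    if (a == b) a else Indeterminate

  // concat never consults a collator, so it works for any input collations.
  def concat(a: Str, b: Str): Str =
    Str(a.value + b.value, merge(a.collation, b.collation))

  // Fetching comparison logic is the point where indeterminate collation fails.
  def comparatorFor(c: Collation): (String, String) => Int = c match {
    // Simplification: lower-casing approximates case-insensitive comparison.
    case Named("UTF8_LCASE") => (x, y) => x.toLowerCase.compareTo(y.toLowerCase)
    case Named(_) => (x, y) => x.compareTo(y)
    case Indeterminate =>
      throw new IllegalStateException("indeterminate collation has no comparison function")
  }

  def equal(a: Str, b: Str): Boolean =
    comparatorFor(merge(a.collation, b.collation))(a.value, b.value) == 0

  def main(args: Array[String]): Unit = {
    val a = Str("Hello", Named("UTF8_BINARY"))
    val b = Str("hello", Named("UTF8_LCASE"))
    println(concat(a, b))                          // succeeds, result is indeterminate
    println(scala.util.Try(equal(a, b)).isFailure) // comparison cannot pick a collation
  }
}
```

In this model the failure only shows up when `equal` runs, which is the vague runtime error mentioned above; the analysis-time list exists to report the same problem earlier.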

Why are the changes needed?

Throwing errors for all collation mismatches can break queries unnecessarily, especially for functions that don’t rely on collation (like concat). These functions combine strings without needing ordering rules, making collation enforcement unnecessary.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

stefankandic · 2024-12-25T23:23:05Z

@dejankrak-db @stevomitric please take a look, thanks!

dejankrak-db

Left a few comments, please take a look, otherwise LGTM!

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala

dejankrak-db · 2024-12-28T21:18:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala

+   * Returns whether the given expression can contain indeterminate collation.
+   */
+  private def canContainIndeterminateCollation(expr: Expression): Boolean = expr match {
+    // This is not an exhaustive list, and it's fine to miss some expressions. The only difference


Perhaps a comment with a guideline for engineers adding further expressions in the future would be helpful: In case the new expression can contain indeterminate collation, it should be added to the list here, to the best of knowledge. Still, even if that is not the case, there is still runtime handling that will ensure that the expression will fail accordingly (though the first path is preferable as it saves burning some extra cycles).

stefankandic · 2024-12-30

The comment kind of explains that already, so I am not really sure how you propose to modify it?

dejankrak-db · 2024-12-30

Well, I wouldn't use the wording 'it's fine to miss some expressions', as I assume we want to encourage an engineer adding a new expression that cannot contain indeterminate collation to add it to the list below. An additional explanation can be provided that otherwise, if the expression cannot contain indeterminate collation but is not added to the list, it will still fail at runtime, as already pointed out.
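The two-tier idea discussed in this thread (a best-effort list checked during analysis, with the runtime error as a fallback for anything missed) can be sketched as follows. This is a toy model only: the `Expr` tree, `isIndeterminate`, and `check` are hypothetical, and only the name `canContainIndeterminateCollation` is taken from the diff; its body here is an assumption, not the PR's actual list.

```scala
object AnalysisCheckSketch {
  // Hypothetical mini expression tree; Spark's real Expression classes are far richer.
  sealed trait Expr extends Product { def children: Seq[Expr] }
  final case class Col(name: String, indeterminate: Boolean) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class Concat(children: Seq[Expr]) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  final case class Contains(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }

  // Whether the string result of an expression has indeterminate collation.
  def isIndeterminate(e: Expr): Boolean = e match {
    case Col(_, ind) => ind
    case Concat(cs) => cs.exists(isIndeterminate)
    case _ => false // comparisons and searches return non-string results
  }

  // Best-effort list: expressions known to need a concrete collation.
  // Anything missing from it is only caught later, by the runtime error.
  def canContainIndeterminateCollation(e: Expr): Boolean = e match {
    case _: EqualTo | _: Contains => false
    case _ => true
  }

  // Analysis-time check: report the first offending expression, if any.
  def check(e: Expr): Option[String] =
    if (!canContainIndeterminateCollation(e) && e.children.exists(isIndeterminate)) {
      Some(s"${e.productPrefix} cannot be evaluated on indeterminate collation")
    } else {
      e.children.iterator.map(check).collectFirst { case Some(msg) => msg }
    }

  def main(args: Array[String]): Unit = {
    val mixed = Concat(Seq(Col("a", indeterminate = true), Col("b", indeterminate = false)))
    println(check(mixed))                                        // no analysis error
    println(check(EqualTo(mixed, Col("c", indeterminate = false)))) // reported during analysis
  }
}
```

An expression that needs a collator but is absent from the list would pass `check` in this sketch and fail only when evaluated, which is why adding new expressions to the list is the preferable path.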

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java

sql/core/src/test/scala/org/apache/spark/sql/collation/IndeterminateCollationTestSuite.scala

stefankandic · 2024-12-30T13:16:04Z

@cloud-fan can you also take a look?

cloud-fan · 2025-01-02T05:14:11Z

common/utils/src/main/resources/error/error-conditions.json

+  },
+  "INDETERMINATE_COLLATION_NOT_SERIALIZABLE" : {
+    "message" : [
+      "Indeterminate collation is not serializable. Use COLLATE clause to set the collation explicitly."


what does serializable mean in this context?

stefankandic · 2025-01-06

To have an indeterminate column in a table

cloud-fan · 2025-01-07

Let's follow how we forbid calendar interval type in the table schema. Please check all the places we call TypeUtils.failWithIntervalType.

stefankandic · 2025-01-08

I removed this exception from the jsonValue. We don't have to check ADD/REPLACE COLUMN as we do for interval, since indeterminate collation can't be specified in a column definition. So we can just check at table create/write and when the collation metadata is created.
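A schema-level check of the kind described here might look roughly like the sketch below. It is a simplified model, not the change in this PR: the `DataType` hierarchy, `findIndeterminate`, and `checkCreateTable` are hypothetical, and `StringType(None)` is just this sketch's way of encoding an indeterminate collation. The walk returns the path of the first offending field so the error can name it.

```scala
object SchemaCheckSketch {
  // Hypothetical, simplified type model; not Spark's DataType hierarchy.
  sealed trait DataType
  case object IntType extends DataType
  // Assumption for this sketch: None encodes an indeterminate collation.
  final case class StringType(collation: Option[String]) extends DataType
  final case class ArrayType(element: DataType) extends DataType
  final case class StructType(fields: Seq[(String, DataType)]) extends DataType

  // Depth-first walk returning the path of the first indeterminate string, if any.
  def findIndeterminate(dt: DataType, path: Seq[String] = Nil): Option[Seq[String]] =
    dt match {
      case StringType(None) => Some(path)
      case ArrayType(el) => findIndeterminate(el, path :+ "element")
      case StructType(fields) =>
        fields.iterator
          .map { case (name, t) => findIndeterminate(t, path :+ name) }
          .collectFirst { case Some(p) => p }
      case _ => None
    }

  // Called at table create/write time in this sketch; rejects the whole schema.
  def checkCreateTable(schema: StructType): Unit =
    findIndeterminate(schema).foreach { p =>
      throw new IllegalArgumentException(
        s"Column ${p.mkString(".")} has indeterminate collation; " +
          "use the COLLATE clause to set the collation explicitly")
    }

  def main(args: Array[String]): Unit = {
    val ok = StructType(Seq("id" -> IntType, "name" -> StringType(Some("UTF8_BINARY"))))
    val bad = StructType(Seq("id" -> IntType, "tags" -> ArrayType(StringType(None))))
    checkCreateTable(ok)                                  // passes silently
    println(scala.util.Try(checkCreateTable(bad)).isFailure)
  }
}
```

Because an indeterminate collation can only arise from an expression and never from a column definition, this sketch checks only the create/write path and has no counterpart for ADD/REPLACE COLUMN.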

sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala

cloud-fan · 2025-01-09T05:28:10Z

common/utils/src/main/resources/error/error-conditions.json

+    ],
+    "sqlState" : "42P22"
+  },
+  "INDETERMINATE_COLLATION_NOT_SERIALIZABLE" : {


do we still need it?

cloud-fan · 2025-01-09T05:34:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala

+
+    def prepended(part: String): ColumnPath = ColumnPath(parts.prepended(part))
+
+    def appended(part: String): ColumnPath = ColumnPath(parts.appended(part))


where do we call it?

cloud-fan · 2025-01-09T05:35:43Z

sql/core/src/test/scala/org/apache/spark/sql/collation/IndeterminateCollationTestSuite.scala

+      "COLLATION_INVALID_NAME",
+      parameters = Map("proposals" -> "nl", "collationName" -> "NULL"))
+
+    intercept[SparkThrowable] {


nit: can we use checkError as well?

cloud-fan · 2025-01-09T05:37:13Z

sql/core/src/test/scala/org/apache/spark/sql/collation/IndeterminateCollationTestSuite.scala

+    }
+  }
+
+  test("insert works with indeterminate collation") {


is it expected? BTW I thought it will fail because of https://github.com/apache/spark/pull/49103/files#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR757

cloud-fan · 2025-01-09T05:37:26Z

sql/core/src/test/scala/org/apache/spark/sql/collation/IndeterminateCollationTestSuite.scala

+    }
+  }
+
+  test("can create a view with indeterminate collation") {


is it expected?

cc @srielau

initial

84e3346

github-actions bot added the SQL label Dec 7, 2024

stefankandic added 5 commits December 7, 2024 06:26

fix some failing tests

d791820

fix failing tests

6d3f828

fix scala style

57e7aa8

fix failing tests

4d6d1ed

fix failing tests

089dc06

stefankandic changed the title ~~[DRAFT][SQL] Support for indeterminate collation~~ [SPARK-SPARK-50522][SQL] Support for indeterminate collation Dec 9, 2024

stefankandic changed the title ~~[SPARK-SPARK-50522][SQL] Support for indeterminate collation~~ [SPARK-50522][SQL] Support for indeterminate collation Dec 9, 2024

stefankandic added 9 commits December 9, 2024 13:37

fix failing tests

1b746fa

add better err for serialization issue

4ad5de2

move check to the type coercion from checkanalysis

52cf4ff

merge with latest master

2eb1701

fix scala style with import

0d11866

merge with master

5a48ca0

improve test method names

85d46d8

add all named expressions

22b8022

fix failing tests

ceee6f3

stefankandic marked this pull request as ready for review December 25, 2024 23:23

stefankandic added 5 commits December 26, 2024 17:59

add runtime error logic

5c6ec6a

fix scalastyle

bfe713c

fix failing test

fad28e7

add golden files

19863aa

add in subquery to blacklisted expressions

483567a

dejankrak-db approved these changes Dec 28, 2024

address pr comments

6142538

cloud-fan reviewed Jan 2, 2025

sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala (outdated, resolved)

cloud-fan reviewed Jan 2, 2025

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala (resolved)

stefankandic requested a review from cloud-fan January 6, 2025 10:14

stefankandic added 4 commits January 8, 2025 11:32

remove runtime error from jsonValue method

21929cc

add indeterminate check in checkAnalysis

e8593ef

formatting

6964263

Merge branch 'master' into indeterminateColl

64a7e34

cloud-fan reviewed Jan 9, 2025



