-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50522][SQL] Support for indeterminate collation #49103
base: master
Are you sure you want to change the base?
Conversation
@dejankrak-db @stevomitric please take a look, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, please take a look, otherwise LGTM!
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
Outdated
Show resolved
Hide resolved
* Returns whether the given expression can contain indeterminate collation. | ||
*/ | ||
private def canContainIndeterminateCollation(expr: Expression): Boolean = expr match { | ||
// This is not an exhaustive list, and it's fine to miss some expressions. The only difference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a comment with a guideline for engineers adding further expressions in the future would be helpful: In case the new expression can contain indeterminate collation, it should be added to the list here, to the best of knowledge. Still, even if that is not the case, there is still runtime handling that will ensure that the expression will fail accordingly (though the first path is preferable as it saves burning some extra cycles).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment kind of explains that already, so I am not really sure how you propose to modify it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I wouldn't use the wording that 'it's fine to miss some expressions', as I assume that we want to encourage engineer adding a new expression that cannot contain indeterminate collation to add it to the list below. An additional explanation can be provided that otherwise, if the expression is not added to the list but cannot contain indeterminate collation, it will still fail at runtime, as already pointed out.
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/collation/IndeterminateCollationTestSuite.scala
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/collation/IndeterminateCollationTestSuite.scala
Show resolved
Hide resolved
@cloud-fan can you also take a look? |
}, | ||
"INDETERMINATE_COLLATION_NOT_SERIALIZABLE" : { | ||
"message" : [ | ||
"Indeterminate collation is not serializable. Use COLLATE clause to set the collation explicitly." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does serializable
mean in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To have an indeterminate column in a table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's follow how we forbid calendar interval type in the table schema. Please check all the places we call TypeUtils.failWithIntervalType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this exception from the jsonValue
. We don't have to check ADD/REPLACE COLUMN like for interval since we can't specify indeterminate collation in column definition. So we can just check at table create/write and at the creation of collation metadata.
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCoercion.scala
Show resolved
Hide resolved
], | ||
"sqlState" : "42P22" | ||
}, | ||
"INDETERMINATE_COLLATION_NOT_SERIALIZABLE" : { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need it?
|
||
def prepended(part: String): ColumnPath = ColumnPath(parts.prepended(part)) | ||
|
||
def appended(part: String): ColumnPath = ColumnPath(parts.appended(part)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we call it?
"COLLATION_INVALID_NAME", | ||
parameters = Map("proposals" -> "nl", "collationName" -> "NULL")) | ||
|
||
intercept[SparkThrowable] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we use checkError
as well?
} | ||
} | ||
|
||
test("insert works with indeterminate collation") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it expected? BTW I thought it will fail because of https://github.com/apache/spark/pull/49103/files#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR757
} | ||
} | ||
|
||
test("can create a view with indeterminate collation") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @srielau
What changes were proposed in this pull request?
This pull request updates how we handle non explicit collation mismatches. Currently, Spark throws an error for any collation mismatch. This change modifies that behavior by allowing expressions to work even if they don't know the collation of their inputs.
However, if they try to fetch the collator or comparison/hash functions of the uknown (indeterminate) string type a runtime error will be raised.
Since a runtime error is vague and not nice for the end users, I also added a list of expressions which are guaranteed to not work with indeterminate collation (binary comparison, string search etc) so that they can fail immediately in analysis and let the user know exactly where the problem is.
Finally, last change is that we should never serialize any data with indeterminate collation, but we can show it back to the user, create views on top of it etc.
Why are the changes needed?
Throwing errors for all collation mismatches can break queries unnecessarily, especially for functions that don’t rely on collation (like concat). These functions combine strings without needing ordering rules, making collation enforcement unnecessary.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.