Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50522][SQL] Support for indeterminate collation #49103

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

stefankandic
Copy link
Contributor

@stefankandic stefankandic commented Dec 7, 2024

What changes were proposed in this pull request?

This pull request updates how we handle non explicit collation mismatches. Currently, Spark throws an error for any collation mismatch. This change modifies that behavior by allowing expressions to work even if they don't know the collation of their inputs.

However, if they try to fetch the collator or comparison/hash functions of the uknown (indeterminate) string type a runtime error will be raised.

Since a runtime error is vague and not nice for the end users, I also added a list of expressions which are guaranteed to not work with indeterminate collation (binary comparison, string search etc) so that they can fail immediately in analysis and let the user know exactly where the problem is.

Finally, last change is that we should never serialize any data with indeterminate collation, but we can show it back to the user, create views on top of it etc.

Why are the changes needed?

Throwing errors for all collation mismatches can break queries unnecessarily, especially for functions that don’t rely on collation (like concat). These functions combine strings without needing ordering rules, making collation enforcement unnecessary.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Dec 7, 2024
@stefankandic stefankandic changed the title [DRAFT][SQL] Support for indeterminate collation [SPARK-SPARK-50522][SQL] Support for indeterminate collation Dec 9, 2024
@stefankandic stefankandic changed the title [SPARK-SPARK-50522][SQL] Support for indeterminate collation [SPARK-50522][SQL] Support for indeterminate collation Dec 9, 2024
@stefankandic
Copy link
Contributor Author

@dejankrak-db @stevomitric please take a look, thanks!

@stefankandic stefankandic marked this pull request as ready for review December 25, 2024 23:23
Copy link
Contributor

@dejankrak-db dejankrak-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, please take a look, otherwise LGTM!

* Returns whether the given expression can contain indeterminate collation.
*/
private def canContainIndeterminateCollation(expr: Expression): Boolean = expr match {
// This is not an exhaustive list, and it's fine to miss some expressions. The only difference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a comment with a guideline for engineers adding further expressions in the future would be helpful: In case the new expression can contain indeterminate collation, it should be added to the list here, to the best of knowledge. Still, even if that is not the case, there is still runtime handling that will ensure that the expression will fail accordingly (though the first path is preferable as it saves burning some extra cycles).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment kind of explains that already, so I am not really sure how you propose to modify it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I wouldn't use the wording that 'it's fine to miss some expressions', as I assume that we want to encourage engineer adding a new expression that cannot contain indeterminate collation to add it to the list below. An additional explanation can be provided that otherwise, if the expression is not added to the list but cannot contain indeterminate collation, it will still fail at runtime, as already pointed out.

@stefankandic
Copy link
Contributor Author

@cloud-fan can you also take a look?

},
"INDETERMINATE_COLLATION_NOT_SERIALIZABLE" : {
"message" : [
"Indeterminate collation is not serializable. Use COLLATE clause to set the collation explicitly."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does serializable mean in this context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have an indeterminate column in a table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's follow how we forbid calendar interval type in the table schema. Please check all the places we call TypeUtils.failWithIntervalType.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this exception from the jsonValue. We don't have to check ADD/REPLACE COLUMN like for interval since we can't specify indeterminate collation in column definition. So we can just check at table create/write and at the creation of collation metadata.

@stefankandic stefankandic requested a review from cloud-fan January 6, 2025 10:14
],
"sqlState" : "42P22"
},
"INDETERMINATE_COLLATION_NOT_SERIALIZABLE" : {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need it?


def prepended(part: String): ColumnPath = ColumnPath(parts.prepended(part))

def appended(part: String): ColumnPath = ColumnPath(parts.appended(part))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we call it?

"COLLATION_INVALID_NAME",
parameters = Map("proposals" -> "nl", "collationName" -> "NULL"))

intercept[SparkThrowable] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we use checkError as well?

}
}

test("insert works with indeterminate collation") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

test("can create a view with indeterminate collation") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants