Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large Types breaks merge predicate pruning #2632

Closed
echai58 opened this issue Jun 28, 2024 · 1 comment
Closed

Large Types breaks merge predicate pruning #2632

echai58 opened this issue Jun 28, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@echai58
Copy link

echai58 commented Jun 28, 2024

Environment

Delta-rs version: 0.18.1

Binding: python


Bug

What happened:
The default behavior of large_dtypes=True on merge() causes partition pruning to not work correctly for strings. This is my understanding of why this happens:

large_dtypes=True causes the source table strings to be casted to LargeUTF8. Thus, when datafusion optimizer runs type_coercion step on the physical plan, it goes from:

TableScan: t, partial_filters=[LargeUtf8("a") = p]

to

TableScan: t, partial_filters=[LargeUtf8("a") = CAST(p AS LargeUtf8)]

Then, datafusions pruning does not support casts of non numeric types:
https://github.com/apache/datafusion/blob/35.0.0/datafusion/core/src/physical_optimizer/pruning.rs#L813-L832

This means that the pruning predicate is not actually pruning any files. This is evident via the ParquetExec list of files, which shows all files, regardless of if they match the partition or not.

If you set large_dtypes=False, you see the following after the type_coercion optimization step:

TableScan: t, partial_filters=[CAST(Utf8("a") AS Dictionary(UInt16, Utf8)) = p]

Because there is no cast on the right-hand-side, the partition pruning works, and the ParquetExec only has the files from the a partition.

This also seems to be inadvertent an side-effect of the following change in #2326 cc @Blajda
https://github.com/delta-io/delta-rs/pull/2326/files#diff-12f59fe3c4440b7ae4ee1a5ac810b42c1d7357c246aae7b5770e840e52d3ec52L1036-R1039

Before, the extra Filter means that even though the ParquetExec had all files, we could still filter out row groups based on the metadata. However, without this, we have to actually load in all the data, which is the symptom I experienced - this renders #1958 ineffective.

The correct resolution seems to be to add support in DataFusion's pruning for comparing between strings and large strings.

For my use case, setting large_dtypes=False seems to be a workaround.

@echai58 echai58 added the bug Something isn't working label Jun 28, 2024
@rtyler
Copy link
Member

rtyler commented Jul 5, 2024

@echai58 I am going to close this because I ran into the same issue recently (#2844) and I believe that the default larger_dtypes setting was incorrect/problematic.

Please let me know if you believe a different issue persists here

@rtyler rtyler closed this as completed Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants