Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet #20248

Merged
merged 16 commits into from
Dec 15, 2024

Conversation

coastalwhite
Copy link
Collaborator

@coastalwhite coastalwhite commented Dec 10, 2024

This removes all instances of the DictionaryEncoder except one which is used for Polars enums and categoricals. This essentially makes it so that the dictionary arrow type is regarded as any other arrow type.

Fixes #20242.
Fixes #17945.
Fixes #20270.
Fixes #20288.
Fixes #20271.

This should massively speed up the decoding of enums and categoricals, although that is very much not the goal. This PR unifies the decoder kernels and removes a lot of the useless monomorphizations.

@coastalwhite coastalwhite force-pushed the fix/dictionary-encoding branch from 27bd711 to b5ff749 Compare December 10, 2024 11:02
@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels Dec 10, 2024
@coastalwhite coastalwhite changed the title refactor: Purge Parquet Dictionary encoding refactor: Purge Parquet ArrowDataType::Dictionary decoding Dec 10, 2024
@coastalwhite
Copy link
Collaborator Author

Blocked on #20250.

@coastalwhite coastalwhite added the blocked Cannot be worked on due to external dependencies, or significant new internal features needed first label Dec 10, 2024
@coastalwhite
Copy link
Collaborator Author

After some consideration, from_arrow now also converts dictionaries to flat arrays if they cannot put into a Categorical.

@coastalwhite coastalwhite removed the blocked Cannot be worked on due to external dependencies, or significant new internal features needed first label Dec 12, 2024
@coastalwhite coastalwhite changed the title refactor: Purge Parquet ArrowDataType::Dictionary decoding feat: Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet Dec 12, 2024
@github-actions github-actions bot added the enhancement New feature or an improvement of an existing feature label Dec 12, 2024
@coastalwhite coastalwhite force-pushed the fix/dictionary-encoding branch from fad3670 to 5eab0bf Compare December 13, 2024 16:51
This removes all instances of the DictionaryEncoder except one which is used
for Polars enums and categoricals. This essentially makes it so that the
dictionary arrow type is regarded as any other arrow type.

Fixes pola-rs#20242.
Fixes pola-rs#17945.
@coastalwhite coastalwhite force-pushed the fix/dictionary-encoding branch from 946fc70 to e1f46a4 Compare December 13, 2024 22:17
@coastalwhite
Copy link
Collaborator Author

coastalwhite commented Dec 13, 2024

Okay, soooooo. This turned into a way larger change than I originally had in mind, but I still think this can be a relatively cohesive patch-set. In the end, I really wanted two things:

  1. Make sure that the ArrowDataType::Dictionary conversion is consistent. This is why both Parquet and Arrow code had to be hit.
  2. Allow to keep all Polars DataType information when converting to and from Parquet and Arrow.

The rules for converting from an ArrowDataType::Dictionary are now as follows.

  1. Check the Metadata for _PL_ENUM_VALUES and _PL_CATEGORICAL? If either is available, load the enum values or the categorical ordering, respectively. For Parquet, this also informs the reader the column can read using by just looking at the dictionary encoded indexes. This saves a lot of memory consumption and saves a big group-by. Categoricals properly follow the StringCache if it is enabled.
  2. If there is no metadata and the value datatype is a String, cast to a Categorical with physical ordering. This now also properly utilizes the StringCache i.f.f. it is enabled. We also now properly propagate nulls from the dictionary values onto the dictionary keys.
  3. If the value datatype is a non-string datatype, convert the dictionary to plain values.

Similarly, when we convert a Categorical or Enum to arrow, we now set _PL_ENUM_VALUES with the enum values (was already done) and _PL_CATEGORICAL with the ordering (was not yet done). To properly roundtrip through arrow this metadata had to be added to the RecordBatch. I did this with a ArrowSchemaRef. This way, to_arrow() and from_arrow can now properly roundtrip categoricals and enums. I think the same from IPC and Avro, although I did not check this yet.

@coastalwhite coastalwhite force-pushed the fix/dictionary-encoding branch from ca19a13 to 5407a7a Compare December 14, 2024 15:29
Copy link

codecov bot commented Dec 14, 2024

Codecov Report

Attention: Patch coverage is 83.96825% with 101 lines in your changes missing coverage. Please review.

Project coverage is 79.60%. Comparing base (84d317a) to head (5407a7a).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...olars-parquet/src/arrow/read/deserialize/simple.rs 66.08% 39 Missing ⚠️
crates/polars-core/src/series/from.rs 55.00% 27 Missing ⚠️
crates/polars-compute/src/propagate_dictionary.rs 80.28% 14 Missing ⚠️
crates/polars-core/src/frame/mod.rs 76.92% 6 Missing ⚠️
crates/polars-arrow/src/datatypes/field.rs 0.00% 5 Missing ⚠️
...-parquet/src/arrow/read/deserialize/categorical.rs 97.14% 2 Missing ⚠️
...ialize/dictionary_encoded/optional_masked_dense.rs 50.00% 2 Missing ⚠️
...ialize/dictionary_encoded/required_masked_dense.rs 50.00% 2 Missing ⚠️
crates/polars-arrow/src/record_batch.rs 94.11% 1 Missing ⚠️
crates/polars-core/src/datatypes/field.rs 83.33% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20248      +/-   ##
==========================================
- Coverage   79.61%   79.60%   -0.01%     
==========================================
  Files        1565     1566       +1     
  Lines      218328   218220     -108     
  Branches     2478     2465      -13     
==========================================
- Hits       173820   173714     -106     
+ Misses      43941    43937       -4     
- Partials      567      569       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


/// Propagate the nulls from the dictionary values into the keys and remove those nulls from the
/// values.
pub fn propagate_dictionary_value_nulls(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good one.

@@ -3255,6 +3271,7 @@ impl DataFrame {

pub struct RecordBatchIter<'a> {
columns: &'a Vec<Column>,
schema: ArrowSchemaRef,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want the schema on RecordBatch? Do we want O(1) access somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need it to be able to properly round trip types. We need the ArrowFields, the ArrowDataType isn't enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. Convinces me more that we should shift to Fields on Series/Columns.

ritchie46
ritchie46 approved these changes Dec 15, 2024
@ritchie46 ritchie46 merged commit 6d0f5df into pola-rs:main Dec 15, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars
Projects
None yet
2 participants