perf: Reduce the size of row encoding UTF-8 #19911

Merged: 8 commits, Nov 25, 2024

@coastalwhite coastalwhite commented Nov 21, 2024

Previously, strings were row-encoded and decoded with the variable row encoding. Now, we use the fact that 0xFF is never a valid byte in UTF-8. To encode, the string with bytes b1, ..., bn becomes b1 + 2, ..., bn + 2, 0x01. This way, we can just scan for the 0x01 to find where the string ends. Nulls are encoded as 0x00. Everything is bitwise inverted for descending order.

This is always a size improvement and in particular saves massively for small strings. For example, encoding "a" went from 33 bytes to 2 bytes.
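
The scheme described above can be sketched in a few lines of Rust. This is only an illustration of the byte layout from the description, not the code in `polars-row`: the function names `encode_str` and `decode_str` and the two constants are hypothetical, and options such as null placement are not modelled.

```rust
/// Sentinel byte for a null value (before any inversion). Hypothetical name.
const NULL_SENTINEL: u8 = 0x00;
/// Byte that marks the end of a non-null string (before any inversion). Hypothetical name.
const TERMINATOR: u8 = 0x01;

/// Appends the row encoding of `value` to `out`.
fn encode_str(value: Option<&str>, descending: bool, out: &mut Vec<u8>) {
    // XOR with 0xFF inverts every bit; XOR with 0x00 leaves the byte unchanged.
    let mask: u8 = if descending { 0xFF } else { 0x00 };
    match value {
        None => out.push(NULL_SENTINEL ^ mask),
        Some(s) => {
            // Bytes of valid UTF-8 never exceed 0xF4, so `b + 2` cannot overflow
            // and the shifted bytes never collide with 0x00 or 0x01.
            out.extend(s.bytes().map(|b| (b + 2) ^ mask));
            out.push(TERMINATOR ^ mask);
        }
    }
}

/// Decodes one value from the front of `row`.
/// Returns the value and the number of bytes consumed.
fn decode_str(row: &[u8], descending: bool) -> (Option<String>, usize) {
    let mask: u8 = if descending { 0xFF } else { 0x00 };
    if (row[0] ^ mask) == NULL_SENTINEL {
        return (None, 1);
    }
    // Scan for the terminator to find where the string ends.
    let len = row
        .iter()
        .position(|&b| (b ^ mask) == TERMINATOR)
        .expect("missing terminator");
    let bytes: Vec<u8> = row[..len].iter().map(|&b| (b ^ mask) - 2).collect();
    let s = String::from_utf8(bytes).expect("invalid UTF-8");
    (Some(s), len + 1)
}
```

With this layout, "a" takes two bytes (the shifted byte plus the terminator), an empty string takes one byte (just the terminator), and comparing the encoded bytes lexicographically gives null < "" < "a" < "ab" < "b" in ascending mode and the reverse in descending mode.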

This is a continuation of #19874.

@github-actions github-actions bot added the performance, python, and rust labels Nov 21, 2024
@coastalwhite coastalwhite force-pushed the perf/re-small-utf8 branch 2 times, most recently from f22127a to da107d2 Compare November 22, 2024 16:03
Before, row encoding and decoding would use the variable row encoding. Now, we
use the fact that `0xFF` is never a valid byte in UTF-8. To encode, the
string with bytes `b1, ..., bn` becomes `0x02, b1 + 1, ..., bn + 1, 0x00`. This
way, we can just scan for the `0x00` when we want to know where to end. Empty
strings are encoded as `0x01` and nulls as `0x00`. Everything is bitwise
inverted for descending.

This is always a size improvement and in particular saves massively for small
strings. For example, encoding "a" went from 33 bytes to 3 bytes.
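
The layout in this commit message differs from the one in the pull request description: it uses a leading `0x02` marker, shifts bytes by 1 rather than 2, and terminates with `0x00`. A minimal sketch of that variant, assuming the message is read literally (the function name `encode_str_marker` is hypothetical and this is not the `polars-row` code):

```rust
/// Encodes `value` using the layout from the commit message:
/// null -> 0x00, empty string -> 0x01, otherwise 0x02, b1 + 1, ..., bn + 1, 0x00.
/// All bytes are bitwise inverted when `descending` is set.
fn encode_str_marker(value: Option<&str>, descending: bool) -> Vec<u8> {
    // XOR with 0xFF inverts every bit; XOR with 0x00 leaves the byte unchanged.
    let mask: u8 = if descending { 0xFF } else { 0x00 };
    let mut out = Vec::new();
    match value {
        None => out.push(0x00 ^ mask),
        Some("") => out.push(0x01 ^ mask),
        Some(s) => {
            out.push(0x02 ^ mask);
            // Bytes of valid UTF-8 never exceed 0xF4, so `b + 1` cannot overflow,
            // and the shifted bytes are never 0x00, which keeps the terminator unique.
            out.extend(s.bytes().map(|b| (b + 1) ^ mask));
            out.push(0x00 ^ mask);
        }
    }
    out
}
```

Under this layout "a" becomes three bytes (marker, shifted byte, terminator), which matches the figure quoted in the message.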
@ritchie46 ritchie46 merged commit fdf9751 into pola-rs:main Nov 25, 2024
25 checks passed
@coastalwhite coastalwhite deleted the perf/re-small-utf8 branch November 25, 2024 10:25

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 66.37168% with 76 lines in your changes missing coverage. Please review.

Project coverage is 79.49%. Comparing base (05f2abb) to head (9096d6f).
Report is 58 commits behind head on main.

Files with missing lines             Patch %   Lines
crates/polars-row/src/encode.rs      41.66%    63 Missing ⚠️
crates/polars-row/src/row.rs         0.00%     10 Missing ⚠️
crates/polars-row/src/decode.rs      75.00%    2 Missing ⚠️
crates/polars-row/src/variable.rs    98.97%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19911      +/-   ##
==========================================
+ Coverage   79.44%   79.49%   +0.04%     
==========================================
  Files        1555     1555              
  Lines      216140   216377     +237     
  Branches     2456     2456              
==========================================
+ Hits       171716   172002     +286     
+ Misses      43866    43817      -49     
  Partials      558      558              

@c-peters c-peters added the accepted label Dec 2, 2024