
delta_rs doesn't seem to respect the row group size #2309

Closed
djouallah opened this issue Mar 20, 2024 · 7 comments · Fixed by #2314
Labels
bug Something isn't working

Comments

@djouallah

djouallah commented Mar 20, 2024

I would like delta_rs to write parquet files with bigger row groups, ideally 8 million rows per group, but so far it does not seem to work. What am I missing?

write_deltalake("/lakehouse/default/Tables/scada_duckdb",\
   df,mode="append",
    engine='rust',
    partition_by=['YEAR'],\
    max_rows_per_file=4*chunk_size,\
    max_rows_per_group= chunk_size,\
    min_rows_per_group=chunk_size,\
   storage_options={"allow_unsafe_rename":"true"})
    del df
djouallah added the bug label Mar 20, 2024
@ion-elgreco
Collaborator

Two things. First, I see that max_rows_per_group is incorrectly passed to write_batch_size from Python to Rust; I'll make a fix to remove that.

Second, you should use the WriterProperties class and pass it to write_deltalake; it contains max_row_group_size.
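For example, a minimal sketch (reusing the table path, DataFrame, and storage options from the snippet above, which are assumptions here):

from deltalake import WriterProperties, write_deltalake

# Ask the Rust parquet writer for row groups of up to 8 million rows
wp = WriterProperties(max_row_group_size=8_000_000)

write_deltalake(
    "/lakehouse/default/Tables/scada_duckdb",  # path taken from the snippet above
    df,
    mode="append",
    engine="rust",
    partition_by=["YEAR"],
    writer_properties=wp,
    storage_options={"allow_unsafe_rename": "true"},
)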

@djouallah
Author

djouallah commented Mar 21, 2024

@ion-elgreco thanks, but I am more interested in min_row_group_size; I want the minimum to be 8 million rows.

@ion-elgreco
Collaborator

@djouallah I don't see a way to set a minimum in the parquet crate

@djouallah
Author

I am using this and it does not seem to be working:

writer_properties=("max_row_group_size"==8000000),

@ion-elgreco
Collaborator

@djouallah works for me:

from deltalake import WriterProperties
import polars as pl
import pyarrow.parquet as pq
import os

df = pl.DataFrame({"foo": list(range(10_000_000))})

wp = WriterProperties(max_row_group_size=8_000_000)
df.write_delta(
    "test_table",
    mode="append",
    delta_write_options={"writer_properties": wp, "engine": "rust"},
)

# Inspect the row groups of the parquet file that was written
file = list(filter(lambda x: ".parquet" in x, os.listdir("test_table")))[0]
metadata = pq.read_metadata(os.path.join("test_table", file))
for i in range(metadata.num_row_groups):
    print(metadata.row_group(i))

result (the 10,000,000 rows are split into one 8,000,000-row group plus a 2,000,000-row remainder):

<pyarrow._parquet.RowGroupMetaData object at 0x7f3bb4dbcea0>
  num_columns: 1
  num_rows: 8000000
  total_byte_size: 64282340
  sorting_columns: ()
<pyarrow._parquet.RowGroupMetaData object at 0x7f3bb4dbcea0>
  num_columns: 1
  num_rows: 2000000
  total_byte_size: 16297257
  sorting_columns: ()

@djouallah
Author

Thanks, it would be nice if this were documented :)

@ion-elgreco
Collaborator

@djouallah it is mentioned in the parameter docstring: "Optional[WriterProperties] writer properties to the Rust parquet writer."

But we can also add this to the usage docs; do you want to open a PR for that? :)

ion-elgreco added a commit that referenced this issue Mar 26, 2024
# Description
Was passing the wrong param

- closes #2309