
delta_rs doesn't seem to respect the row group size #2309

Closed
djouallah opened this issue Mar 20, 2024 · 7 comments · Fixed by #2314
Labels
bug Something isn't working

Comments

@djouallah

djouallah commented Mar 20, 2024

I would like delta_rs to write parquet files with bigger row groups, ideally 8 million rows per group, but so far it does not seem to work. What am I missing?

write_deltalake("/lakehouse/default/Tables/scada_duckdb",\
   df,mode="append",
    engine='rust',
    partition_by=['YEAR'],\
    max_rows_per_file=4*chunk_size,\
    max_rows_per_group= chunk_size,\
    min_rows_per_group=chunk_size,\
   storage_options={"allow_unsafe_rename":"true"})
    del df
djouallah added the bug label Mar 20, 2024
@ion-elgreco
Collaborator

Two things. First, I see that max_rows_per_group is incorrectly passed to write_batch_size from Python to Rust; I'll make a fix to remove that.

Second, you should use the WriterProperties class and pass it to write_deltalake; it contains max_row_group_size.
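For example, a minimal sketch (reusing the table path, DataFrame, and storage options from the snippet above, which are assumptions here):

from deltalake import WriterProperties, write_deltalake

# Ask the Rust parquet writer for row groups of up to 8 million rows
wp = WriterProperties(max_row_group_size=8_000_000)

write_deltalake(
    "/lakehouse/default/Tables/scada_duckdb",  # path taken from the snippet above
    df,
    mode="append",
    engine="rust",
    partition_by=["YEAR"],
    writer_properties=wp,
    storage_options={"allow_unsafe_rename": "true"},
)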

@djouallah
Author

djouallah commented Mar 21, 2024

@ion-elgreco thanks, but I am more interested in min_row_group_size; I want the minimum to be 8 million rows.

@ion-elgreco
Collaborator

@djouallah I don't see a way to set a minimum in the parquet crate

@djouallah
Author

I am using this and it does not seem to be working:

writer_properties=("max_row_group_size"==8000000),

@ion-elgreco
Collaborator

@djouallah works for me:

from deltalake import WriterProperties
import polars as pl
import pyarrow.parquet as pq
import os

df = pl.DataFrame({"foo": list(range(10_000_000))})

wp = WriterProperties(max_row_group_size=8_000_000)
df.write_delta(
    "test_table",
    mode="append",
    delta_write_options={"writer_properties": wp, "engine": "rust"},
)

# Inspect the row groups of the parquet file that was written
file = list(filter(lambda x: ".parquet" in x, os.listdir("test_table")))[0]
metadata = pq.read_metadata(os.path.join("test_table", file))
for i in range(metadata.num_row_groups):
    print(metadata.row_group(i))

result (the 10,000,000 rows are split into one 8,000,000-row group plus a 2,000,000-row remainder):

<pyarrow._parquet.RowGroupMetaData object at 0x7f3bb4dbcea0>
  num_columns: 1
  num_rows: 8000000
  total_byte_size: 64282340
  sorting_columns: ()
<pyarrow._parquet.RowGroupMetaData object at 0x7f3bb4dbcea0>
  num_columns: 1
  num_rows: 2000000
  total_byte_size: 16297257
  sorting_columns: ()

@djouallah
Author

Thanks, it would be nice if this were documented :)

@ion-elgreco
Collaborator

@djouallah it is mentioned in the parameter docstring: "Optional[WriterProperties] writer properties to the Rust parquet writer."

But we can also add this to the usage docs; do you want to open a PR for that? :)

ion-elgreco added a commit that referenced this issue Mar 26, 2024
# Description
Was passing the wrong param

- closes #2309