
lazy_frame join_asof and then sink_parquet not supported #20633

Open
guanqun opened this issue Jan 9, 2025 · 1 comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

Contributor

guanqun commented Jan 9, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

a_df = pl.LazyFrame({"a": [1, 2, 3]})
b_df = pl.LazyFrame({"a": [2, 5, 6], "b": [1, 2, 3]})

a_df.join_asof(b_df, on='a', strategy='backward').sink_parquet("out.parquet")

Log output

   2440 else:
   2441     # Handle empty dict input
   2442     storage_options = None
-> 2444 return lf.sink_parquet(
   2445     path=normalize_filepath(path),
   2446     compression=compression,
   2447     compression_level=compression_level,
   2448     statistics=statistics,
   2449     row_group_size=row_group_size,
   2450     data_page_size=data_page_size,
   2451     maintain_order=maintain_order,
   2452     cloud_options=storage_options,
   2453     credential_provider=credential_provider,
   2454     retries=retries,
   2455 )

InvalidOperationError: sink_Parquet(ParquetWriteOptions { compression: Zstd(None), statistics: StatisticsOptions { min_value: true, max_value: true, distinct_count: false, null_count: true }, row_group_size: None, data_page_size: None, maintain_order: true }) not yet supported in standard engine. Use 'collect().write_Parquet(ParquetWriteOptions { compression: Zstd(None), statistics: StatisticsOptions { min_value: true, max_value: true, distinct_count: false, null_count: true }, row_group_size: None, data_page_size: None, maintain_order: true })()'


### Issue description

It seems that this operation is not supported: calling `sink_parquet` on a lazy plan that contains a `join_asof` raises the `InvalidOperationError` shown above.

### Expected behavior

We should at least allow this case with `join_asof()`: since it assumes the join key is sorted, the join can be processed incrementally and the output streamed to disk.

### Installed versions

<details>

--------Version info---------
Polars: 1.19.0
Index type: UInt32
Platform: Linux-5.15.0-82-generic-x86_64-with-glibc2.35
Python: 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:16:49) [GCC 13.3.0]
LTS CPU: False

----Optional dependencies----
adbc_driver_manager
altair 5.4.0
azure.identity
boto3
cloudpickle 3.1.0
connectorx
deltalake
fastexcel
fsspec
gevent
google.auth
great_tables
matplotlib 3.9.2
nest_asyncio 1.6.0
numpy 2.2.0
openpyxl
pandas 2.2.3
pyarrow 17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter


</details>
@guanqun guanqun added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 9, 2025
Contributor Author

guanqun commented Jan 9, 2025

Feel free to recategorize this ticket as a feature request.

If someone is willing to give some high-level guidance on how we should support this, I'm happy to dive into it and make it work.
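To illustrate why streaming is feasible here (this is a hypothetical sketch of the idea, not Polars internals): because both sides are sorted on the join key, a backward asof join reduces to a single merge-style pass, so output rows can be emitted chunk by chunk instead of materializing the whole result first.

```python
def asof_backward(left_keys, right_rows):
    """Streaming backward asof join over sorted inputs.

    left_keys: iterable of keys, sorted ascending.
    right_rows: iterable of (key, value) pairs, sorted ascending by key.
    Yields (left_key, matched_value), where the match is the value of the
    last right row whose key is <= the left key (None if there is none).
    """
    it = iter(right_rows)
    nxt = next(it, None)
    last_val = None
    for k in left_keys:
        # Advance the right side while its key still trails the left key;
        # each right row is consumed exactly once across the whole pass.
        while nxt is not None and nxt[0] <= k:
            last_val = nxt[1]
            nxt = next(it, None)
        yield (k, last_val)

# Same data as the repro above:
print(list(asof_backward([1, 2, 3], [(2, 1), (5, 2), (6, 3)])))
# [(1, None), (2, 1), (3, 1)]
```

Since each input row is visited once and only the most recent right-hand value is retained, memory stays O(1) regardless of input size, which is exactly the property a `sink_parquet` implementation would want to exploit.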
