
lazy_frame join_asof and then sink_parquet not supported #20633

Open
guanqun opened this issue Jan 9, 2025 · 1 comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

Contributor

guanqun commented Jan 9, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

a_df = pl.LazyFrame({"a": [1, 2, 3]})
b_df = pl.LazyFrame({"a": [2, 5, 6], "b": [1, 2, 3]})

a_df.join_asof(b_df, on='a', strategy='backward').sink_parquet("out.parquet")

Log output

   2440 else:
   2441     # Handle empty dict input
   2442     storage_options = None
-> 2444 return lf.sink_parquet(
   2445     path=normalize_filepath(path),
   2446     compression=compression,
   2447     compression_level=compression_level,
   2448     statistics=statistics,
   2449     row_group_size=row_group_size,
   2450     data_page_size=data_page_size,
   2451     maintain_order=maintain_order,
   2452     cloud_options=storage_options,
   2453     credential_provider=credential_provider,
   2454     retries=retries,
   2455 )

InvalidOperationError: sink_Parquet(ParquetWriteOptions { compression: Zstd(None), statistics: StatisticsOptions { min_value: true, max_value: true, distinct_count: false, null_count: true }, row_group_size: None, data_page_size: None, maintain_order: true }) not yet supported in standard engine. Use 'collect().write_Parquet(ParquetWriteOptions { compression: Zstd(None), statistics: StatisticsOptions { min_value: true, max_value: true, distinct_count: false, null_count: true }, row_group_size: None, data_page_size: None, maintain_order: true })()'


### Issue description

It seems that this operation is not supported: calling `sink_parquet` on a lazy plan that contains a `join_asof` raises the `InvalidOperationError` shown above.

### Expected behavior

We should at least allow this case with `join_asof()`: since it assumes the join key is sorted, the join can be processed incrementally and the output streamed to disk.

### Installed versions

<details>

--------Version info---------
Polars: 1.19.0
Index type: UInt32
Platform: Linux-5.15.0-82-generic-x86_64-with-glibc2.35
Python: 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:16:49) [GCC 13.3.0]
LTS CPU: False

----Optional dependencies----
adbc_driver_manager
altair 5.4.0
azure.identity
boto3
cloudpickle 3.1.0
connectorx
deltalake
fastexcel
fsspec
gevent
google.auth
great_tables
matplotlib 3.9.2
nest_asyncio 1.6.0
numpy 2.2.0
openpyxl
pandas 2.2.3
pyarrow 17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter


</details>
@guanqun guanqun added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 9, 2025
Contributor Author

guanqun commented Jan 9, 2025

Feel free to recategorize this ticket as a feature request.

If someone is willing to give some high-level guidance on how we should support this, I'm happy to dive into it and make it work.
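To illustrate why streaming is feasible here (this is a hypothetical sketch of the idea, not Polars internals): because both sides are sorted on the join key, a backward asof join reduces to a single merge-style pass, so output rows can be emitted chunk by chunk instead of materializing the whole result first.

```python
def asof_backward(left_keys, right_rows):
    """Streaming backward asof join over sorted inputs.

    left_keys: iterable of keys, sorted ascending.
    right_rows: iterable of (key, value) pairs, sorted ascending by key.
    Yields (left_key, matched_value), where the match is the value of the
    last right row whose key is <= the left key (None if there is none).
    """
    it = iter(right_rows)
    nxt = next(it, None)
    last_val = None
    for k in left_keys:
        # Advance the right side while its key still trails the left key;
        # each right row is consumed exactly once across the whole pass.
        while nxt is not None and nxt[0] <= k:
            last_val = nxt[1]
            nxt = next(it, None)
        yield (k, last_val)

# Same data as the repro above:
print(list(asof_backward([1, 2, 3], [(2, 1), (5, 2), (6, 3)])))
# [(1, None), (2, 1), (3, 1)]
```

Since each input row is visited once and only the most recent right-hand value is retained, memory stays O(1) regardless of input size, which is exactly the property a `sink_parquet` implementation would want to exploit.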
