polars.scan_delta(..).filter(..).collect() fails for some datasets #20361

Open
2 tasks done
codesorcery opened this issue Dec 19, 2024 · 2 comments · May be fixed by #20362
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@codesorcery

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Unfortunately, we were not able to produce a reproducible example.

Log output

thread '<unnamed>' panicked at crates/polars-io/src/parquet/read/predicates.rs:28:64:
called `Option::unwrap()` on a `None` value

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[47], line 6
      1 (
      2     pl.scan_delta(
      3         source="s3://some-s3-bucket/some-path", storage_options=storage_options
      4     )
      5     .filter(pl.col("FILTER_COLUMN") == 1)
----> 6     .collect()
      7 )

File /opt/conda/envs/tgp/lib/python3.11/site-packages/polars/lazyframe/frame.py:2031, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2029 # Only for testing purposes
   2030 callback = _kwargs.get("post_opt_callback", callback)
-> 2031 return wrap_df(ldf.collect(callback))

PanicException: called `Option::unwrap()` on a `None` value

Issue description

On one of our Delta tables, calling pl.scan_delta(..).filter(..).collect() fails with the PanicException shown above.
Looking at crates/polars-io/src/parquet/read/predicates.rs:28:64, the location named in the panic message,
the case where md.columns_under_root_iter(&field.name) returns None is not handled: the Option is unwrapped, which panics.
Handling None the same way as the iter.len() == 0 case fixed the problem for us.

Expected behavior

DataFrame is computed without exception.

Installed versions

--------Version info---------
Polars:              1.17.1
Index type:          UInt32
Platform:            Linux-6.6.60-flatcar-x86_64-with-glibc2.36
Python:              3.11.11 | packaged by conda-forge | (main, Dec  5 2024, 14:17:24) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.5.0
boto3                1.35.36
cloudpickle          3.1.0
connectorx           <not installed>
deltalake            0.22.3
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          2.36.0
great_tables         <not installed>
matplotlib           3.9.3
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.3
pyarrow              18.1.0
pydantic             2.10.3
pyiceberg            0.8.1
sqlalchemy           2.0.36
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@codesorcery codesorcery added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Dec 19, 2024
@coastalwhite
Collaborator

MRE:

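import io
import polars as pl

f1 = io.BytesIO()
f2 = io.BytesIO()

# The first file has columns 'a' and 'b'; the second file is missing 'b'.
pl.DataFrame({ 'a': [1], 'b': [1] }).write_parquet(f1)
pl.DataFrame({ 'a': [1] }).write_parquet(f2)

# Rewind both buffers so they can be read from the start.
f1.seek(0)
f2.seek(0)

# Filtering on 'b', which is absent from the second file, triggers the panic.
pl.scan_parquet([f1, f2], allow_missing_columns=True).filter(pl.col.a == pl.col.b).collect()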

@mdavis-xyz
Contributor

Note that a workaround is to manually list the files, scan each one, and then concat.

pl.concat([pl.scan_parquet(f) for f in [f1, f2]], how='diagonal').filter(pl.col.a == pl.col.b).collect()


3 participants