Performance regression for dumping/loading DataFrames #20605
Labels
A-serde
Area: serialization and deserialization
accepted
Ready for implementation
bug
Something isn't working
performance
Performance issues or improvements
python
Related to Python Polars
regression
Issue introduced by a new release
Checks
Reproducible example
Log output
No response
Issue description
Since 1.18.0, the serialization code uses IPC instead of Serde (#20266). With that change, dump and load performance for DataFrame and LazyFrame is now on par with each other, but significantly worse than in 1.17.1 for the given data (4M rows, 16 columns, random integers).
Possible workaround: convert the DataFrame to a PyArrow table (pyarrow.lib.Table) before dumping, and convert it back to a DataFrame after loading.
Expected behavior
Fast pickle load times are important because many Python caching frameworks pickle their data. It would be desirable to have performance comparable to PyArrow.
Installed versions