Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Improve cloud scan performance #19728

Merged
merged 3 commits into from
Nov 11, 2024

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Nov 11, 2024

ref #18443

We were previously slow either due to making a single very large request, or making thousands of tiny requests. This PR splits/combines the range requests to make them evenly distributed with a reasonable chunk size.

Benchmarks - the source file is 12,000 columns x 24,000 rows on S3 (see linked issue for generator). Tested on EC2.

query time (1.12.0, seconds) time (this PR) speedup
pl.read_parquet(path) 29.020907 2.771221 10.5x
pl.scan_parquet(path).select(cs.by_index(range(6_000))).collect() 14.717146 1.75916 8.4x
pl.scan_parquet(path).select(cs.by_index(range(0, 12_000, 5))).collect() 28.716227 2.414139 11.9x
pl.scan_parquet(path).select(cs.by_index(range(0, 12_000, 3))).collect() 28.815121 2.515088 11.5x
pl.scan_parquet(path).collect(new_streaming=True) 28.939848 3.001616 9.6x

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Nov 11, 2024

// Dropping is delayed for tokio async files so we need to explicitly
// flush here (https://github.com/tokio-rs/tokio/issues/2307#issuecomment-596336451).
file.sync_all().await.map_err(PolarsError::from)?;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by - moved here from the callsite at file_cache

// We have a dedicated code-path for a full projection that performs a
// single range request for the entire row group. During testing this
// provided much higher throughput from cloud than making multiple range
// request with `get_ranges()`.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had this previously, which was fast if the file contained many nicely sized row groups, but not for the 12,000 column single row group file.

We now just use get_ranges(), which handles the download optimization for us

@ritchie46
Copy link
Member

Great speedups! 🙌

@nameexhaustion nameexhaustion marked this pull request as ready for review November 11, 2024 09:35
@ritchie46 ritchie46 merged commit c4f0cc2 into pola-rs:main Nov 11, 2024
21 of 22 checks passed
@nameexhaustion nameexhaustion deleted the get-ranges-split branch November 18, 2024 08:21
@c-peters c-peters added the accepted Ready for implementation label Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants