perf: Improve cloud scan performance #19728
// Dropping is delayed for tokio async files so we need to explicitly
// flush here (https://github.com/tokio-rs/tokio/issues/2307#issuecomment-596336451).
file.sync_all().await.map_err(PolarsError::from)?;
drive-by - moved here from the callsite at file_cache
// We have a dedicated code-path for a full projection that performs a
// single range request for the entire row group. During testing this
// provided much higher throughput from cloud than making multiple range
// request with `get_ranges()`.
We had this previously, which was fast if the file contained many nicely sized row groups, but not for the 12,000 column single row group file.
We now just use `get_ranges()`, which handles the download optimization for us.
Great speedups! 🙌
ref #18443
We were previously slow either due to making a single very large request, or making thousands of tiny requests. This PR splits/combines the range requests to make them evenly distributed with a reasonable chunk size.
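To illustrate the split/combine idea, here is a minimal sketch. It is not the code from this PR: the function name `plan_requests` and the `max_gap` and `chunk_size` parameters are hypothetical, and the real thresholds and merging rules may differ. It combines byte ranges that sit close together, then splits anything larger than the chunk size into near-equal parts.

```rust
use std::ops::Range;

/// Hypothetical helper (not the Polars implementation): turn the byte ranges
/// we want into a list of requests with a bounded, roughly even size.
fn plan_requests(
    ranges: &[Range<usize>],
    max_gap: usize,    // assumed knob: merge ranges at most this many bytes apart
    chunk_size: usize, // assumed knob: upper bound on the size of one request
) -> Vec<Range<usize>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");

    // Drop empty ranges and sort by start offset so that neighbours in the
    // file are also neighbours in the list.
    let mut sorted: Vec<Range<usize>> =
        ranges.iter().filter(|r| r.start < r.end).cloned().collect();
    sorted.sort_by_key(|r| r.start);

    // Step 1 (combine): fold each range into the previous one when the gap
    // between them is small. The gap bytes are downloaded and then discarded.
    let mut merged: Vec<Range<usize>> = Vec::new();
    for r in sorted {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }

    // Step 2 (split): cut every merged range into ceil(len / chunk_size)
    // parts whose lengths differ by at most one byte.
    let mut out = Vec::new();
    for r in merged {
        let len = r.end - r.start;
        let n_parts = (len + chunk_size - 1) / chunk_size;
        let base = len / n_parts;
        let rem = len % n_parts;
        let mut start = r.start;
        for i in 0..n_parts {
            // The first `rem` parts take one extra byte each.
            let part_len = base + usize::from(i < rem);
            out.push(start..start + part_len);
            start += part_len;
        }
    }
    out
}
```

With `max_gap = 4` and `chunk_size = 100`, the input `[0..10, 12..20, 100..400]` becomes `[0..20, 100..200, 200..300, 300..400]`: the two tiny neighbouring ranges are combined into one request, and the single large range is split into three equal ones.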
Benchmarks - the source file is 12,000 columns x 24,000 rows on S3 (see linked issue for generator). Tested on EC2.