-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ParquetMetaDataReader to load page indexes in SerializedFileReader::new_with_options
#6506
Conversation
@@ -1198,50 +1195,62 @@ mod tests { | |||
|
|||
#[test] | |||
fn test_file_reader_filter_row_groups_and_range() -> Result<()> { | |||
let test_file = get_test_file("alltypes_plain.parquet"); | |||
let test_file = get_test_file("alltypes_tiny_pages.parquet"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switch to a test file that contains page indexes.
@@ -210,24 +209,19 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> { | |||
} | |||
} | |||
|
|||
let mut metadata = metadata_builder.build(); | |||
|
|||
// If page indexes are desired, build them with the filtered set of row groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key here is that we're only pulling page indexes for row groups that survived the predicate filtering above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- this looks great to me
metadata.row_group(0).column(0).data_page_offset() | ||
); | ||
|
||
// read non-contiguous row groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great -- than you for adding these tests for predicate
metadata_builder = metadata_builder | ||
.set_column_index(Some(columns_indexes)) | ||
.set_offset_index(Some(offset_indexes)); | ||
let mut reader = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes sense to switch the metadata.
Thanks again @etseidl |
Which issue does this PR close?
Closes #6491.
Rationale for this change
Reduce the file I/O needed to read in the page indexes.
What changes are included in this PR?
If page indexes are needed, create a new
ParquetMetaDataReader
using the filtered set of row groups. This will do a single read to get the bytes necessary to instantiate the needed page index structures, rather than two reads per row group. An alternative approach would be to read all the needed structures up front and then prune them, but that would mean reading page indexes that will just be thrown away.Are there any user-facing changes?
No