feat: cdf reader for delta tables #2048

Merged: 35 commits into delta-io:main, May 1, 2024

Conversation

@hntd187 (Collaborator) commented Jan 7, 2024

Description

This PR is the initial work on Change Data Feed (CDF) readers for delta tables. It looks much larger than it really is because a physical test table is checked in with it; that table will be removed once the loop is closed on CDF reading/writing.

Related Issue(s)

Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-data-files
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file

@github-actions bot added the binding/rust and crate/core labels Jan 7, 2024
Self::get_add_action_type(),
)?;

// Create the parquet scans for each associated type of file. I am not sure when we would use removes yet, but
Collaborator:

Aren't remove actions then read and appended to the result with _change_type: delete?


let results = scan.scan().await?;
let data: Vec<RecordBatch> = collect_sendable_stream(results).await?;
print_batches(&data)?;
Collaborator:

Can you add an assert of the output of data?

Author (@hntd187):

I'll flesh out the tests to be more complete once the initial implementation is further along.
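
For illustration, a sketch of what such an assertion might eventually look like, using DataFusion's assert_batches_sorted_eq! macro; the rows and schema here are invented, with _change_type and _commit_version being the CDF metadata columns from the Delta protocol:

use datafusion::assert_batches_sorted_eq;

// Hypothetical expected output for a small CDF read.
assert_batches_sorted_eq!(
    [
        "+----+--------------+-----------------+",
        "| id | _change_type | _commit_version |",
        "+----+--------------+-----------------+",
        "| 1  | insert       | 1               |",
        "| 1  | delete       | 2               |",
        "+----+--------------+-----------------+",
    ],
    &data
);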


// Create the parquet scans for each associated type of file. I am not sure when we would use removes yet, but
// they would be here if / when they are necessary
let cdc_scan = ParquetFormat::new()
Collaborator (@ion-elgreco, Jan 7, 2024):

Have you considered ctx.read_parquet for both and then just df_cdc.union(df_add)?.collect()?

Author (@hntd187):

You lose some control with that facility, and I'm not sure it would work here: it looks like it expects a directory rather than explicit file paths. It might work; I haven't explored it extensively, to be honest, but at a glance it doesn't seem to provide the control necessary.

Collaborator (@ion-elgreco, Jan 10, 2024):

It does accept a list of file paths. Comparing the two, these params look equivalent (I think):

ParquetReadOptions.schema == FileScanConfig.file_schema
table_paths == FileScanConfig.file_groups
ParquetReadOptions.table_partition_cols == FileScanConfig.table_partition_cols

Projection and limit I guess you apply after read_parquet on the DataFrame (I assume it's lazy).

The only gap is the statistics part, but you don't seem to use it. It could be interesting to check whether there is any performance difference between the two, since https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.execute_stream will create the physical plan for you.
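
For reference, a minimal sketch of the suggested alternative, assuming cdc_paths and add_paths are Vec<String> of explicit file paths; note that union requires matching schemas, so the add-file side would first need a synthesized _change_type column:

use datafusion::prelude::{ParquetReadOptions, SessionContext};

let ctx = SessionContext::new();
let df_cdc = ctx
    .read_parquet(cdc_paths, ParquetReadOptions::default())
    .await?;
let df_add = ctx
    .read_parquet(add_paths, ParquetReadOptions::default())
    .await?;
// DataFrames are lazy; nothing executes until collect().
let batches = df_cdc.union(df_add)?.collect().await?;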

partition_values: Vec<String>,
}

impl DeltaCdfScan {
Collaborator:

I did not have time for a proper review; just a quick comment: we should implement ExecutionPlan for this scan so it can be used in DataFusion as an operator.

hntd187 and others added 3 commits January 26, 2024 09:07
# Conflicts:
#	crates/core/src/operations/load_cdf.rs
#	crates/core/src/table/state.rs
@hntd187 mentioned this pull request Feb 19, 2024
Blajda pushed a commit that referenced this pull request Mar 24, 2024
# Description
Some of my first workings on David's proposal in #2006; this is also
meant to push #2048 and CDF in general forward by making the logical
operations of delta tables more composable than they are today.

# Related Issue(s)
#2006 
#2048 

I think, and @Blajda correct me here, we can build on this and
eventually move toward a `DeltaPlanner`-esque enum for operations and
their associated logical plan building.

# Still to do

- [ ] Implement different path for partition columns that don't require
scanning the file
- [ ] Plumbing into `DeltaScan` so delta scan can make use of this
logical node
- [ ] General polish and cleanup; there are lots of unnecessary fields
and awkward ways things are built
- [ ] More tests; there is currently one large integration-style
end-to-end test, but it can / should be broken down
rtyler added a commit that referenced this pull request Mar 25, 2024
This brings in the work already started by @hntd187

Closes #2048
@hntd187 requested a review from fvaleye as a code owner April 21, 2024 21:52
@github-actions bot added the binding/python label Apr 21, 2024
.collect();

let final_batch = concat_batches(&batches[0].schema(), &batches).unwrap();
Ok(PyArrowType(final_batch))
Collaborator:

I think we can also convert it back into a reader.

Author (@hntd187):

I'm sorry, what do you mean?

Collaborator:

impl IntoPyArrow for ArrowArrayStreamReader
Converts an ArrowArrayStreamReader into a pyarrow.RecordBatchReader.

https://docs.rs/arrow/latest/arrow/ffi_stream/struct.ArrowArrayStreamReader.html

I mean the above. You can take an ArrowArrayStreamReader and move it into Python as a pyarrow.RecordBatchReader.

Collaborator:

Create a RecordBatchReader first from the batches, then the stream:

let stream = FFI_ArrowArrayStream::new(reader);
Ok(PyArrowType(stream))

with PyResult<PyArrowType<FFI_ArrowArrayStream>> as the return type.
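
Putting that together, a sketch of the conversion, assuming batches: Vec<RecordBatch> as in the code above (and at least one batch to take the schema from):

use arrow::ffi_stream::FFI_ArrowArrayStream;
use arrow::pyarrow::PyArrowType;
use arrow::record_batch::RecordBatchIterator;

// Wrap the collected batches in a RecordBatchReader, then expose it over
// the C stream interface; IntoPyArrow surfaces it in Python as a
// pyarrow.RecordBatchReader.
let schema = batches[0].schema();
let reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema);
let stream = FFI_ArrowArrayStream::new(Box::new(reader));
Ok(PyArrowType(stream))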

Collaborator (@ion-elgreco) left a review:

I sent you some comments on Slack about the Python signature.

Regarding starting_version: it always seems to be required, but in Spark SQL you can also supply just a timestamp. I think it would make sense to make the starting version optional as well, and raise only if neither a starting version nor a starting timestamp is provided.

Author (@hntd187) commented Apr 22, 2024:

Well, the Rust side defaults the starting version to 0 if you don't provide one, so in theory you can give nothing and get the full CDF feed for the table since the beginning. I don't know why I required Python to provide a starting version here, but you are right, we can mark it as optional.
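
For reference, a minimal sketch of the resolution this implies; the function and the timestamp lookup are illustrative names, not the PR's actual API:

use chrono::{DateTime, Utc};

// Hypothetical signature: both bounds optional, defaulting to version 0
// when neither is given, matching the Rust-side behavior described above.
fn resolve_starting_version(
    starting_version: Option<i64>,
    starting_timestamp: Option<DateTime<Utc>>,
    version_for_timestamp: impl Fn(DateTime<Utc>) -> i64, // hypothetical lookup
) -> i64 {
    starting_version
        .or_else(|| starting_timestamp.map(version_for_timestamp))
        .unwrap_or(0)
}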

@nicklan left a review:

couple more comments


impl FileAction for Add {
fn partition_values(&self) -> HashMap<String, Option<String>> {
self.partition_values.clone()
@nicklan:

It seems like you only use this to get values out of the map, so you could probably return &HashMap<String, Option<String>> and avoid the clone. Or just provide a get_partition_value that does the probe internally, which also makes it easier to avoid the clone for Remove.
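
A sketch of the two alternatives being suggested; the trait shape is illustrative, not the PR's actual definition:

use std::collections::HashMap;

trait FileAction {
    // Alternative 1: hand out a reference instead of cloning the map.
    fn partition_values(&self) -> &HashMap<String, Option<String>>;

    // Alternative 2: probe internally; easier to implement without a
    // clone for Remove, whose partition values are optional in the log.
    fn get_partition_value(&self, key: &str) -> Option<&Option<String>>;
}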

}

fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
vec![self.plan.clone()]
@nicklan:

This is probably my own ignorance regarding all these APIs, but all the other methods (like output_ordering etc.) seem to delegate to self.plan, while this one says that plan is the child. In particular, if this returns plan, I would have expected with_new_children to replace plan, but it instead makes the new children children of plan.

Anyway, just something to think about :)

Author (@hntd187):

You are right here: since this is a scan, it should be an empty vec.
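
A sketch of the agreed fix, assuming the ExecutionPlan method signatures DataFusion used at the time; these live inside the impl ExecutionPlan block for the scan:

// inside `impl ExecutionPlan for DeltaCdfScan { ... }`
fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
    // A leaf scan: `self.plan` is a wrapped implementation detail,
    // not a child node in the plan tree.
    vec![]
}

fn with_new_children(
    self: Arc<Self>,
    _children: Vec<Arc<dyn ExecutionPlan>>,
) -> datafusion::error::Result<Arc<dyn ExecutionPlan>> {
    // Nothing to rewire for a leaf node.
    Ok(self)
}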

.unwrap_or(ScalarValue::Null)
}

pub fn create_spec_partition_values<F: FileAction>(
@nicklan:

Should have some docs for a pub fn.

Author (@hntd187):

These are pub in their module but are only exported as pub(crate) outside the module. There actually is a deny lint set for publicly reachable items that don't have docs.
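
For context, a sketch of the pattern described, with an illustrative module name: the missing_docs lint only fires on items that are publicly reachable from the crate root, so pub items behind a pub(crate) module are exempt:

// lib.rs
#![deny(missing_docs)]

// The module itself is crate-private, so its `pub` items never become
// part of the public API and the missing_docs lint does not apply.
pub(crate) mod cdf_helpers;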

use crate::delta_datafusion::{get_null_of_arrow_type, to_correct_scalar_value};
use crate::DeltaResult;

pub fn map_action_to_scalar<F: FileAction>(
@nicklan:

docs, as it's pub

@ion-elgreco previously approved these changes Apr 28, 2024
@hntd187 disabled auto-merge April 28, 2024 15:05
@nicklan left a review:

nice, lgtm!


fn with_new_children(
self: Arc<Self>,
_children: Vec<Arc<dyn ExecutionPlan>>,
@nicklan:

nit: you do use this, so why _children and not children?

Author (@hntd187):

A side effect of me stubbing this out as todo!() to start and then not fixing it up properly.

@rtyler enabled auto-merge (rebase) May 1, 2024 00:53
auto-merge was automatically disabled May 1, 2024 01:15

Rebase failed

@rtyler merged commit 4dce000 into delta-io:main May 1, 2024
23 checks passed
@rtyler added this to the Change Data Capture Support milestone May 29, 2024
Labels: binding/python, binding/rust