feat: cdf reader for delta tables #2048

Merged: 35 commits into delta-io:main, May 1, 2024

Conversation

@hntd187 (Collaborator) commented Jan 7, 2024

Description

This PR is the initial work on Change Data Feed (CDF) readers for delta tables. It looks much larger than it really is because a physical test table is checked in with it; that table will be removed once the loop is closed on CDF reading/writing.

Related Issue(s)

Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#change-data-files
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file

@github-actions bot added the binding/rust and crate/core labels Jan 7, 2024
Self::get_add_action_type(),
)?;

// Create the parquet scans for each associated type of file. I am not sure when we would use removes yet, but
Collaborator:

Aren't remove actions then read and appended to the result with _change_type: delete?


let results = scan.scan().await?;
let data: Vec<RecordBatch> = collect_sendable_stream(results).await?;
print_batches(&data)?;
Collaborator:

Can you add an assert of the output of data?

Author (@hntd187):

I'll flesh out the tests to be more complete once the initial implementation is further along.
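
For illustration, a sketch of what such an assertion might eventually look like, using DataFusion's assert_batches_sorted_eq! macro; the rows and schema here are invented, with _change_type and _commit_version being the CDF metadata columns from the Delta protocol:

use datafusion::assert_batches_sorted_eq;

// Hypothetical expected output for a small CDF read.
assert_batches_sorted_eq!(
    [
        "+----+--------------+-----------------+",
        "| id | _change_type | _commit_version |",
        "+----+--------------+-----------------+",
        "| 1  | insert       | 1               |",
        "| 1  | delete       | 2               |",
        "+----+--------------+-----------------+",
    ],
    &data
);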


// Create the parquet scans for each associated type of file. I am not sure when we would use removes yet, but
// they would be here if / when they are necessary
let cdc_scan = ParquetFormat::new()
Collaborator (@ion-elgreco, Jan 7, 2024):

Have you considered ctx.read_parquet for both and then just df_cdc.union(df_add)?.collect()?

Author (@hntd187):

You lose some control with that facility, and I'm not sure it would work here: it looks like it expects a directory rather than explicit file paths. It might work; I haven't explored it extensively, to be honest, but at a glance it doesn't seem to provide the control necessary.

Collaborator (@ion-elgreco, Jan 10, 2024):

It does accept a list of file paths. Comparing the two, these params look equivalent (I think):

ParquetReadOptions.schema == FileScanConfig.file_schema
table_paths == FileScanConfig.file_groups
ParquetReadOptions.table_partition_cols == FileScanConfig.table_partition_cols

Projection and limit I guess you apply after read_parquet on the DataFrame (I assume it's lazy).

The only gap is the statistics part, but you don't seem to use it. It could be interesting to check whether there is any performance difference between the two, since https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.execute_stream will create the physical plan for you.
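
For reference, a minimal sketch of the suggested alternative, assuming cdc_paths and add_paths are Vec<String> of explicit file paths; note that union requires matching schemas, so the add-file side would first need a synthesized _change_type column:

use datafusion::prelude::{ParquetReadOptions, SessionContext};

let ctx = SessionContext::new();
let df_cdc = ctx
    .read_parquet(cdc_paths, ParquetReadOptions::default())
    .await?;
let df_add = ctx
    .read_parquet(add_paths, ParquetReadOptions::default())
    .await?;
// DataFrames are lazy; nothing executes until collect().
let batches = df_cdc.union(df_add)?.collect().await?;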

partition_values: Vec<String>,
}

impl DeltaCdfScan {
Collaborator:

I did not have time for a proper review; just a quick comment: we should implement ExecutionPlan for this scan so it can be used in DataFusion as an operator.

hntd187 and others added 3 commits January 26, 2024 09:07
# Conflicts:
#	crates/core/src/operations/load_cdf.rs
#	crates/core/src/table/state.rs
@hntd187 mentioned this pull request Feb 19, 2024
Blajda pushed a commit that referenced this pull request Mar 24, 2024
# Description
Some of my first workings on David's proposal in #2006; this is also
meant to push #2048 and CDF in general forward by making the logical
operations of delta tables more composable than they are today.

# Related Issue(s)
#2006 
#2048 

I think, and @Blajda correct me here, we can build on this and
eventually move toward a `DeltaPlanner`-esque enum for operations and
their associated logical plan building.

# Still to do

- [ ] Implement different path for partition columns that don't require
scanning the file
- [ ] Plumbing into `DeltaScan` so delta scan can make use of this
logical node
- [ ] General polish and cleanup; there are lots of unnecessary fields
and awkward ways things are built
- [ ] More tests; there is currently one large integration-style
end-to-end test, but it can / should be broken down
rtyler added a commit that referenced this pull request Mar 25, 2024
This brings in the work already started by @hntd187

Closes #2048
@hntd187 requested a review from fvaleye as a code owner April 21, 2024 21:52
@github-actions bot added the binding/python label Apr 21, 2024
.collect();

let final_batch = concat_batches(&batches[0].schema(), &batches).unwrap();
Ok(PyArrowType(final_batch))
Collaborator:

I think we can also convert it back into a reader.

Author (@hntd187):

I'm sorry, what do you mean?

Collaborator:

impl IntoPyArrow for ArrowArrayStreamReader
Converts an ArrowArrayStreamReader into a pyarrow.RecordBatchReader.

https://docs.rs/arrow/latest/arrow/ffi_stream/struct.ArrowArrayStreamReader.html

I mean the above. You can take an ArrowArrayStreamReader and move it into Python as a pyarrow.RecordBatchReader.

Collaborator:

Create a RecordBatchReader first from the batches, then the stream:

let stream = FFI_ArrowArrayStream::new(reader);
Ok(PyArrowType(stream))

with PyResult<PyArrowType<FFI_ArrowArrayStream>> as the return type.
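
Putting that together, a sketch of the conversion, assuming batches: Vec<RecordBatch> as in the code above (and at least one batch to take the schema from):

use arrow::ffi_stream::FFI_ArrowArrayStream;
use arrow::pyarrow::PyArrowType;
use arrow::record_batch::RecordBatchIterator;

// Wrap the collected batches in a RecordBatchReader, then expose it over
// the C stream interface; IntoPyArrow surfaces it in Python as a
// pyarrow.RecordBatchReader.
let schema = batches[0].schema();
let reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema);
let stream = FFI_ArrowArrayStream::new(Box::new(reader));
Ok(PyArrowType(stream))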

Collaborator (@ion-elgreco) left a review:

I sent you some comments on Slack about the Python signature.

Regarding starting_version: it always seems to be required, but in Spark SQL you can also supply just a timestamp. I think it would make sense to make the starting version optional as well, and raise only if neither a starting version nor a starting timestamp is provided.

Author (@hntd187) commented Apr 22, 2024:

Well, the Rust side defaults the starting version to 0 if you don't provide one, so in theory you can give nothing and get the full CDF feed for the table since the beginning. I don't know why I required Python to provide a starting version here, but you are right, we can mark it as optional.
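
For reference, a minimal sketch of the resolution this implies; the function and the timestamp lookup are illustrative names, not the PR's actual API:

use chrono::{DateTime, Utc};

// Hypothetical signature: both bounds optional, defaulting to version 0
// when neither is given, matching the Rust-side behavior described above.
fn resolve_starting_version(
    starting_version: Option<i64>,
    starting_timestamp: Option<DateTime<Utc>>,
    version_for_timestamp: impl Fn(DateTime<Utc>) -> i64, // hypothetical lookup
) -> i64 {
    starting_version
        .or_else(|| starting_timestamp.map(version_for_timestamp))
        .unwrap_or(0)
}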

@nicklan left a review:

couple more comments


impl FileAction for Add {
fn partition_values(&self) -> HashMap<String, Option<String>> {
self.partition_values.clone()
@nicklan:

It seems like you only use this to get values out of the map, so you could probably return &HashMap<String, Option<String>> and avoid the clone. Or just provide a get_partition_value that does the probe internally, which also makes it easier to avoid the clone for Remove.
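
A sketch of the two alternatives being suggested; the trait shape is illustrative, not the PR's actual definition:

use std::collections::HashMap;

trait FileAction {
    // Alternative 1: hand out a reference instead of cloning the map.
    fn partition_values(&self) -> &HashMap<String, Option<String>>;

    // Alternative 2: probe internally; easier to implement without a
    // clone for Remove, whose partition values are optional in the log.
    fn get_partition_value(&self, key: &str) -> Option<&Option<String>>;
}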

}

fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
vec![self.plan.clone()]
@nicklan:

This is probably my own ignorance regarding all these APIs, but all the other methods (like output_ordering etc.) seem to delegate to self.plan, while this one says that plan is the child. In particular, if this returns plan, I would have expected with_new_children to replace plan, but it instead makes the new children children of plan.

Anyway, just something to think about :)

Author (@hntd187):

You are right here: since this is a scan, it should be an empty vec.
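
A sketch of the agreed fix, assuming the ExecutionPlan method signatures DataFusion used at the time; these live inside the impl ExecutionPlan block for the scan:

// inside `impl ExecutionPlan for DeltaCdfScan { ... }`
fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
    // A leaf scan: `self.plan` is a wrapped implementation detail,
    // not a child node in the plan tree.
    vec![]
}

fn with_new_children(
    self: Arc<Self>,
    _children: Vec<Arc<dyn ExecutionPlan>>,
) -> datafusion::error::Result<Arc<dyn ExecutionPlan>> {
    // Nothing to rewire for a leaf node.
    Ok(self)
}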

.unwrap_or(ScalarValue::Null)
}

pub fn create_spec_partition_values<F: FileAction>(
@nicklan:

Should have some docs for a pub fn.

Author (@hntd187):

These are pub in their module but are only exported as pub(crate) outside the module. There actually is a deny lint set for publicly reachable items that don't have docs.
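
For context, a sketch of the pattern described, with an illustrative module name: the missing_docs lint only fires on items that are publicly reachable from the crate root, so pub items behind a pub(crate) module are exempt:

// lib.rs
#![deny(missing_docs)]

// The module itself is crate-private, so its `pub` items never become
// part of the public API and the missing_docs lint does not apply.
pub(crate) mod cdf_helpers;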

use crate::delta_datafusion::{get_null_of_arrow_type, to_correct_scalar_value};
use crate::DeltaResult;

pub fn map_action_to_scalar<F: FileAction>(
@nicklan:

docs, as it's pub

@ion-elgreco previously approved these changes Apr 28, 2024
@hntd187 disabled auto-merge April 28, 2024 15:05
@nicklan left a review:

nice, lgtm!


fn with_new_children(
self: Arc<Self>,
_children: Vec<Arc<dyn ExecutionPlan>>,
@nicklan:

nit: you do use this, so why _children and not children?

Author (@hntd187):

A side effect of me stubbing this out as todo!() to start and then not fixing it up properly.

@rtyler enabled auto-merge (rebase) May 1, 2024 00:53
auto-merge was automatically disabled May 1, 2024 01:15

Rebase failed

@rtyler merged commit 4dce000 into delta-io:main May 1, 2024
23 checks passed
@rtyler added this to the Change Data Capture Support milestone May 29, 2024
Labels: binding/python, binding/rust