Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic in deltalake_core::kernel::snapshot::log_segment::list_log_files_with_checkpoint::{{closure}} #2290

Closed
cmackenzie1 opened this issue Mar 15, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@cmackenzie1
Copy link
Contributor

cmackenzie1 commented Mar 15, 2024

Environment

Delta-rs version: 0.17.0

Binding: Rust

Environment:

  • Cloud provider:
  • OS:
  • Other:

Bug

What happened:

Panic occurred in deltalake_core::kernel::snapshot::log_segment::list_log_files_with_checkpoint::{{closure}}

assert_eq!(checkpoint_files.len(), cp.parts.unwrap_or(1) as usize);

What you expected to happen:

Errors over panics. I am not sure what this assert is trying to validate.

How to reproduce it:

Unsure what happened exactly.

More details:

@cmackenzie1 cmackenzie1 added the bug Something isn't working label Mar 15, 2024
@ion-elgreco
Copy link
Collaborator

@cmackenzie1 can you check against the latest version of main, this should already have been fixed by: #2270

@cmackenzie1
Copy link
Contributor Author

@ion-elgreco - does that address the underlying problem that the assert in this function only allows 1 checkpoint to exist in the log after the current checkpoint version? Can a table not have multiple checkpoints created after the current version in a multi-writer scenario? Maybe I don't understand what this function is really trying to do here.

    let checkpoint_files = files
        .iter()
        .filter_map(|f| {
            if f.location.is_checkpoint_file() && f.location.commit_version() == Some(cp.version) {
                Some(f.clone())
            } else {
                None
            }
        })
        .collect_vec();
    // TODO raise a proper error
    assert_eq!(checkpoint_files.len(), cp.parts.unwrap_or(1) as usize);

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Mar 15, 2024

It does because now it ignores all checkpoint files that are higher than the version that exists in _last.

Cp.parts will now be equal to the amount of files also when it's a multipart checkpoint.

If for some reason the last checkpoint version did not get updated but you did write a checkpoint parquet then that will now be ignore and not counted as a checkpoint during the validation

@cmackenzie1
Copy link
Contributor Author

Thanks for the quick response @ion-elgreco! Bear with me while I try to understand this.

The functions docstring is List all log files after a given checkpoint. and the function name is list_log_files_with_checkpoint, with the return type being DeltaResult<(Vec<ObjectMeta>, Vec<ObjectMeta>)>, a list of commit files and a list of checkpoint files.

/// List all log files after a given checkpoint.
async fn list_log_files_with_checkpoint(
cp: &CheckpointMetadata,
fs_client: &dyn ObjectStore,
log_root: &Path,
) -> DeltaResult<(Vec<ObjectMeta>, Vec<ObjectMeta>)> {

It does because now it ignores all checkpoint files that are higher than the version that exists in _last.

This is what I am struggling to understand. Why are we doing this instead of if f.location.is_checkpoint_file() && f.location.commit_version() > Some(cp.version) (after the checkpoint)? Why does this fn have the following assert assert_eq!(checkpoint_files.len(), cp.parts.unwrap_or(1) as usize);, panicing if there is ever more than one checkpoint file? This does not seem to align with what intention of the function is. Why is it only returning one checkpoint ObjectMeta instead of them all? I don't understand why the assert_eq is there.

@roeap could you help me understand the intention (and the TODO) there?

@esarili
Copy link
Contributor

esarili commented Mar 21, 2024

I can confirm that this assertion still panics after fix for the corrupted tables where _last_checkpoint points to a deleted checkpoint, which might be result of #2016:

_delta_log:
    |---- 00000000000000000302.json
    |---- 00000000000000000303.json
    |---- 00000000000000000304.json
    |---- 00000000000000000305.json
    |---- _last_checkpoint

_last_checkpoint:

{
    "num_of_add_files":null,
    "parts":null,
    "size":342,
    "size_in_bytes":142079,
    "version":300
}

Error:


2024-03-21T15:41:47.180Z | loghouse-2 | right: 1 | -
-- | -- | -- | --
thread 'tokio-runtime-worker' panicked at /root/.cargo/git/checkouts/delta-rs-fb509102fe888c8e/25962a0/crates/core/src/kernel/snapshot/log_segment.rs:461:5:
assertion `left == right` failed
left: 0
right: 1

I think it may still make sense to return a proper error rather than assertion.

@ion-elgreco
Copy link
Collaborator

@esarili do you want to open a PR perhaps to throw a meaningful error? The assertion indeed isn't clear enough if you haven't dived in the code

@esarili
Copy link
Contributor

esarili commented Mar 26, 2024

@ion-elgreco it is not clear to me what would be a proper error in this case. I can create a PR if someone guides me.

esarili added a commit to esarili/delta-rs that referenced this issue Apr 10, 2024
… sync

# Description
When a table is corrupted and `_last_checkpoint` file points to a
version that has been deleted, `list_log_files_with_checkpoint`
function panics. With this change `list_log_files_with_checkpoint`
function returns an error allowing callers react to such issues.

# Related Issue(s)
- delta-io#2290
esarili added a commit to esarili/delta-rs that referenced this issue Apr 10, 2024
When a table is corrupted and `_last_checkpoint` file points to a
version that has been deleted, `list_log_files_with_checkpoint`
function panics. With this change `list_log_files_with_checkpoint`
function returns an error allowing callers react to such issues.

- delta-io#2290
esarili added a commit to esarili/delta-rs that referenced this issue Apr 10, 2024
When a table is corrupted and `_last_checkpoint` file points to a
version that has been deleted, `list_log_files_with_checkpoint`
function panics. With this change `list_log_files_with_checkpoint`
function returns an error allowing callers react to such issues.

- delta-io#2290
ion-elgreco pushed a commit that referenced this issue Apr 11, 2024
# Description
When a table is corrupted and `_last_checkpoint` file points to a 
version that has been deleted, `list_log_files_with_checkpoint` 
function panics. With this change `list_log_files_with_checkpoint` 
function returns an error allowing callers react to such issues.

# Related Issue(s)
- #2290
@cmackenzie1
Copy link
Contributor Author

Resolved by #2406

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants