-
Notifications
You must be signed in to change notification settings - Fork 421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: harmonize and simplify storage configuration #1052
Conversation
@@ -31,7 +31,8 @@ def test_read_files(s3_localstack): | |||
@pytest.mark.s3 | |||
@pytest.mark.integration | |||
@pytest.mark.timeout(timeout=15, method="thread") | |||
def test_s3_authenticated_read_write(s3_localstack_creds): | |||
def test_s3_authenticated_read_write(s3_localstack_creds, monkeypatch): | |||
monkeypatch.setenv("AWS_DEFAULT_REGION", "us-east-1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we were using the rusoto logic from our S3 storage options, which look in the envornment, then a profile file, and finally just defaults to us-east-1 as a region.
While slightly convenient, i felt this is a bit too much logic for us, and hoping, that the dynamo_lock crate will eventually also move to the official aws sdk, I also did not want to use the rusoto helper.
If this goes too much against the expectations of users, we can of course add some login for this.
The setting could of course also be passed with the storage options :)
# Fails without any creds | ||
with pytest.raises(PyDeltaTableError): | ||
anon_storage_options = { | ||
key: value for key, value in azurite_creds.items() if "ACCOUNT" not in key | ||
} | ||
write_deltalake(table_path, sample_data, storage_options=anon_storage_options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely sure what we did before, but now AZURE_STORAGE_USE_EMULATOR
will pick up the default account / key if none are specified - so its hard to create a failing state with the emulator now.
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Will Jones <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good cleanup, but have some questions on the purpose of StorageLocation
.
} | ||
|
||
/// A parsed URL identifying a storage location | ||
/// for more information on the supported expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this meant to be here?
} | ||
} | ||
|
||
/// A parsed URL identifying a storage location |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we describe the difference between StorageLocation
, object_store::path::Path
, and url::Url
? I admit I am a bit confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all we are getting are some extra methods, what if we defined a trait that was implemented for Url
and Path
? Would that be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StorageLocation is more or less just a intermediate solution. We had StorageUrl
and DeltaStorageConfig
. My plan was (and is :)) to get rid of both. Especially since we moved the pre-fixing logic upstream into the PrefixObjectStore
, we can get rid of that logic here as well.
The main reason I did not get rid of it in this PR yet is mainly that changes we getting too large and I had some concerns about breaking serialization. SO for now I just got rid of one (and for some reason renamed the other).
More generally speaking though I am thinking about how to most efficiently support absolute file paths in the delta log. In the most general case the log and files could reside in completely different locations (file systems) and I guess there would not even be a guarantee that all files use the same store. Not sure though if this is more fo an esoteric case, or if it is worth the effort to support scenarios like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I see. If we are planning on removing later, that sounds good to me. 👍 Thanks for the explanation.
Co-authored-by: Will Jones <[email protected]>
# Description Recently we moved some of our storage configuration via a property bag upstream to the object_store crate. This allows us to simplify our configuration handling here and make S3 configuration consistent with azure and gcp. I think as a follow up it would be great to migrate dynamodb_lock to using the official SDKs as well, and then see what we still need form the s3 storage options. # Related Issue(s) closes delta-io#999 # Documentation <!--- Share links to useful documentation ---> Co-authored-by: Will Jones <[email protected]>
Description
Recently we moved some of our storage configuration via a property bag upstream to the object_store crate. This allows us to simplify our configuration handling here and make S3 configuration consistent with azure and gcp.
I think as a follow up it would be great to migrate dynamodb_lock to using the official SDKs as well, and then see what we still need form the s3 storage options.
Related Issue(s)
closes #999
Documentation