
[Spark] Resolves #1679 issue glue catalog #2310

Open
wants to merge 2 commits into branch-2.3

Conversation

@calixtofelipe commented Nov 18, 2023

Which Delta project/connector is this regarding?

  • Spark

Description

This PR resolves Issue 1 described in #1679: with the new configuration enabled, Delta tables are registered in the AWS Glue catalog with their actual schema instead of an empty one.

How was this patch tested?

I created two tests in DeltaTableBuilderSuite:
"Test schema external table delta glue catalog conf activated"
"Test schema delta glue catalog conf activated"
These tests check that managed and external tables get the correct schema when the parameter is activated.
The solution was also tested against the AWS Glue catalog: the tables were created, the schema was verified in the Glue catalog, and Athena was confirmed to be able to read the tables.

These are the two ways we can create tables with this fix:
Managed table:
The database location needs to be set in the database's catalog configuration (see the sketch after the example below).

# Enable the schema fix introduced by this PR
spark.conf.set("spark.databricks.delta.fixSchema.GlueCatalog", "true")

df_products.coalesce(1).write \
    .format("delta") \
    .option("mergeSchema", True) \
    .mode("overwrite") \
    .saveAsTable("database_name.table_name")
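
For the managed table case, the database itself must already exist in the Glue catalog with a location set. A minimal sketch of that prerequisite, assuming the database is created from Spark (the database name and S3 path are placeholders):

spark.sql("""
    CREATE DATABASE IF NOT EXISTS database_name
    LOCATION 's3://bucket_name/database_folder_name/'
""")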

External table:

# Enable the schema fix introduced by this PR
spark.conf.set("spark.databricks.delta.fixSchema.GlueCatalog", "true")

df_products.coalesce(1).write \
    .format("delta") \
    .option("mergeSchema", True) \
    .option("path", "s3://bucket_name/table_folder_name") \
    .mode("overwrite") \
    .save()

# Register the path as an external table in the Glue catalog
spark.catalog.createExternalTable(tableName="database_name.table_name", path="s3://bucket_name/table_folder_name")
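
As a quick sanity check from Spark (the authoritative verification was done in the Glue console and with Athena, as described above), the registered schema can be inspected; the table name here is a placeholder:

spark.sql("DESCRIBE TABLE database_name.table_name").show(truncate=False)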

Does this PR introduce any user-facing changes?

No.

Signed-off-by: Felipe Calixto Filho <felipe.calixto>
Signed-off-by: Felipe Calixto Filho <felipe.calixto>
@moomindani

@calixtofelipe Thanks! Your PR looks very similar to my original PR (#1579), but at that time it introduced another issue. I will shortly share something we observed back then.

@calixtofelipe (Author)


Hey @moomindani, in the original PR you changed provider='delta' to 'parquet', and that generated the other issue because many other places check the provider (e.g., it impacts the time-travel capability).
This new PR still changes the schema, but I didn't change the provider as you did in the original PR. I added a command to alter the metadata, and this command updates the Hive metastore successfully without overwriting the schema to empty, as the apache/spark project does when we execute the createTable function. So, as we are keeping provider='delta', all the Delta capabilities are preserved. Thanks for replying and helping.

@moomindani

Thanks for clarifying it. I confirmed that your PR won't cause the same issue that I experienced.
BTW, if I understand correctly, your PR solves only Issue 1 described in #1679, not Issue 2. Is that correct?

@calixtofelipe (Author)


Yes, only Issue 1 has been resolved.
Regarding Issue 2, I understand that it creates a poor user experience, as you described in the original PR. However, since we can create the database with the location, it is not as critical as Issue 1. If I have some time, I will try to submit a PR to the apache/spark project to fix Issue 1 there as well, as you described in the original PR, since the error is generated there.
Thanks again for replying and helping to analyze these issues.

@moomindani commented Nov 22, 2023

I agree, Issue 1 is more critical than Issue 2. Thanks for clarifying it.
It may be better to explicitly mention that this PR solves Issue 1 of #1679 in the overview.

@calixtofelipe (Author)


I totally agree. I added the comment in the PR description and edited my comment in the issue. Thanks again for helping.

@juliangordon

Any timeline as to when this will be merged?

@lucabem commented Feb 23, 2024

Hi @calixtofelipe, which conf are you using to run it on AWS Glue?

I mean not only the Spark conf spark.databricks.delta.fixSchema.GlueCatalog, but also additional arguments such as --extra-py-files and --extra-jars.

@calixtofelipe (Author)


Yes @lucabem, after you build a Delta package from this branch, you should pass it to the job via --extra-jars. Make sure that your Spark session is using that Delta package (roughly as in the sketch below).
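
For reference, a rough sketch of how a Glue job could pick up the custom build; the jar name and S3 path are placeholders, and every setting except spark.databricks.delta.fixSchema.GlueCatalog is just the standard Delta-on-Glue configuration, not something introduced by this PR:

# Glue job parameter (placeholder path to the jar built from this branch):
#   --extra-jars  s3://my-bucket/jars/delta-core_2.12-glue-fix.jar

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Standard Delta Lake integration
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Use the AWS Glue Data Catalog as the metastore
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    # The schema fix flag added in this PR
    .config("spark.databricks.delta.fixSchema.GlueCatalog", "true")
    .getOrCreate()
)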

@jeffsteinmetz

+1
This would be helpful to add to the main release.
