
[Spark] Resolves #1679 issue glue catalog #2310

Open
wants to merge 2 commits into branch-2.3

Conversation

@calixtofelipe commented Nov 18, 2023

Which Delta project/connector is this regarding?

  • Spark

Description

This PR resolves Issue 1 described in #1679: with the new configuration enabled, Delta tables are registered in the AWS Glue catalog with their actual schema instead of an empty one.

How was this patch tested?

I created two tests in DeltaTableBuilderSuite:
"Test schema external table delta glue catalog conf activated"
"Test schema delta glue catalog conf activated"
These tests check that managed and external tables get the correct schema when the parameter is activated.
The solution was also tested against the AWS Glue catalog: the tables were created, the schema was verified in the Glue catalog, and Athena was confirmed to be able to read the tables.

These are the two ways we can create tables with this fix:
Managed table:
The database location needs to be set in the database's catalog configuration (see the sketch after the example below).

# Enable the schema fix introduced by this PR
spark.conf.set("spark.databricks.delta.fixSchema.GlueCatalog", "true")

df_products.coalesce(1).write \
    .format("delta") \
    .option("mergeSchema", True) \
    .mode("overwrite") \
    .saveAsTable("database_name.table_name")
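
For the managed table case, the database itself must already exist in the Glue catalog with a location set. A minimal sketch of that prerequisite, assuming the database is created from Spark (the database name and S3 path are placeholders):

spark.sql("""
    CREATE DATABASE IF NOT EXISTS database_name
    LOCATION 's3://bucket_name/database_folder_name/'
""")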

External table:

# Enable the schema fix introduced by this PR
spark.conf.set("spark.databricks.delta.fixSchema.GlueCatalog", "true")

df_products.coalesce(1).write \
    .format("delta") \
    .option("mergeSchema", True) \
    .option("path", "s3://bucket_name/table_folder_name") \
    .mode("overwrite") \
    .save()

# Register the path as an external table in the Glue catalog
spark.catalog.createExternalTable(tableName="database_name.table_name", path="s3://bucket_name/table_folder_name")
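
As a quick sanity check from Spark (the authoritative verification was done in the Glue console and with Athena, as described above), the registered schema can be inspected; the table name here is a placeholder:

spark.sql("DESCRIBE TABLE database_name.table_name").show(truncate=False)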

Does this PR introduce any user-facing changes?

No.

Signed-off-by: Felipe Calixto Filho <felipe.calixto>
Signed-off-by: Felipe Calixto Filho <felipe.calixto>
@moomindani

@calixtofelipe Thanks! Your PR looks very similar to my original PR (#1579), but at that time it introduced another issue. I will shortly share something we observed back then.

@calixtofelipe (Author)


Hey @moomindani, in the original PR you changed provider='delta' to 'parquet', and that generated the other issue because many other places check the provider (e.g., it impacts the time-travel capability).
This new PR still changes the schema, but I didn't change the provider as you did in the original PR. I added a command to alter the metadata, and this command updates the Hive metastore successfully without overwriting the schema to empty, as the apache/spark project does when we execute the createTable function. So, as we are keeping provider='delta', all the Delta capabilities are preserved. Thanks for replying and helping.

@moomindani

Thanks for clarifying it. I confirmed that your PR won't cause the same issue that I experienced.
BTW, if I understand correctly, your PR solves only Issue 1 described in #1679, not Issue 2. Is that correct?

@calixtofelipe (Author)


Yes, only Issue 1 has been resolved.
Regarding Issue 2, I understand that it creates a poor user experience, as you described in the original PR. However, since we can create the database with the location, it is not as critical as Issue 1. If I have some time, I will try to submit a PR to the apache/spark project to fix Issue 1 there as well, as you described in the original PR, since the error is generated there.
Thanks again for replying and helping to analyze these issues.

@moomindani commented Nov 22, 2023

I agree, Issue 1 is more critical than Issue 2. Thanks for clarifying it.
It may be better to explicitly mention that this PR solves Issue 1 of #1679 in the overview.

@calixtofelipe (Author)


I totally agree. I added the comment in the PR description and edited my comment in the issue. Thanks again for helping.

@juliangordon

Any timeline as to when this will be merged?

@lucabem commented Feb 23, 2024

Hi @calixtofelipe, which conf are you using to run it on AWS Glue?

I mean not only the Spark conf spark.databricks.delta.fixSchema.GlueCatalog, but also additional arguments such as --extra-py-files and --extra-jars.

@calixtofelipe (Author)


Yes @lucabem, after you build a Delta package from this branch, you should pass it to the job via --extra-jars. Make sure that your Spark session is using that Delta package (roughly as in the sketch below).
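
For reference, a rough sketch of how a Glue job could pick up the custom build; the jar name and S3 path are placeholders, and every setting except spark.databricks.delta.fixSchema.GlueCatalog is just the standard Delta-on-Glue configuration, not something introduced by this PR:

# Glue job parameter (placeholder path to the jar built from this branch):
#   --extra-jars  s3://my-bucket/jars/delta-core_2.12-glue-fix.jar

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Standard Delta Lake integration
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Use the AWS Glue Data Catalog as the metastore
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    # The schema fix flag added in this PR
    .config("spark.databricks.delta.fixSchema.GlueCatalog", "true")
    .getOrCreate()
)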

@jeffsteinmetz

+1
This would be helpful to add to the main release.
