# [SPARK-50541] Describe Table As JSON

### What changes were proposed in this pull request?

Support `DESCRIBE TABLE ... [AS JSON]` to optionally display table metadata in JSON format.

**SQL Ref Spec:**

{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] table_name { [ PARTITION clause ] | [ column_name ] } **[ AS JSON ]**

Output:
json_metadata: String

### Why are the changes needed?

The Spark SQL command `DESCRIBE TABLE` displays table metadata in a DataFrame format geared toward human consumption. That format is hard to parse reliably, e.g. when fields contain special characters or when the layout changes as new features are added.
The new `AS JSON` option returns the table metadata as a JSON string that can be parsed programmatically and is extensible with minimal risk of breaking changes. It is not intended to be human-readable.
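
For example, a tool can pull individual fields out of the JSON with ordinary JSON functions instead of scraping the row-based `DESCRIBE` output. A minimal sketch in Spark Scala (e.g. in `spark-shell`); the table name `sales` is hypothetical, while the single `json_metadata` output column is as specified above:

```scala
import org.apache.spark.sql.functions.{col, get_json_object}

// DESCRIBE ... AS JSON returns one row with one string column, `json_metadata`.
val desc = spark.sql("DESCRIBE EXTENDED sales AS JSON")

// Extract specific fields with JSON-path expressions rather than parsing the
// human-oriented DESCRIBE rows.
val location = desc.select(get_json_object(col("json_metadata"), "$.location")).head().getString(0)
val provider = desc.select(get_json_object(col("json_metadata"), "$.provider")).head().getString(0)
```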

### Does this PR introduce _any_ user-facing change?

Yes, this provides a new option to display `DESCRIBE TABLE` metadata in JSON format. See below (and the updated golden files) for the JSON output schema:

```
{
      "table_name": "<table_name>",
      "catalog_name": "<catalog_name>",
      "schema_name": "<innermost_schema_name>",
      "namespace": ["<innermost_schema_name>"],
      "type": "<table_type>",
      "provider": "<provider>",
      "columns": [
        {
          "name": "<name>",
          "type": <type_json>,
          "comment": "<comment>",
          "nullable": <boolean>,
          "default": "<default_val>"
        }
      ],
      "partition_values": {
        "<col_name>": "<val>"
      },
      "location": "<path>",
      "view_text": "<view_text>",
      "view_original_text": "<view_original_text>",
      "view_schema_mode": "<view_schema_mode>",
      "view_catalog_and_namespace": "<view_catalog_and_namespace>",
      "view_query_output_columns": ["col1", "col2"],
      "owner": "<owner>",
      "comment": "<comment>",
      "table_properties": {
        "property1": "<property1>",
        "property2": "<property2>"
      },
      "storage_properties": {
        "property1": "<property1>",
        "property2": "<property2>"
      },
      "serde_library": "<serde_library>",
      "input_format": "<input_format>",
      "output_format": "<output_format>",
      "num_buckets": <num_buckets>,
      "bucket_columns": ["<col_name>"],
      "sort_columns": ["<col_name>"],
      "created_time": "<timestamp_ISO-8601>",
      "last_access": "<timestamp_ISO-8601>",
      "partition_provider": "<partition_provider>"
}
```
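
Because this schema is stable, nested fields can be parsed into typed structures. A hedged sketch in Spark Scala of turning the `columns` array above into a DataFrame; the table name `sales` is hypothetical, and the nested `type` object is left out of the parse schema for brevity:

```scala
import org.apache.spark.sql.functions.{col, explode, from_json, get_json_object}
import org.apache.spark.sql.types._

// Shape of each entry in the "columns" array (the nested "type" object is skipped here).
val columnsSchema = ArrayType(new StructType()
  .add("name", StringType)
  .add("comment", StringType)
  .add("nullable", BooleanType)
  .add("default", StringType))

val columns = spark.sql("DESCRIBE EXTENDED sales AS JSON")
  .select(from_json(get_json_object(col("json_metadata"), "$.columns"), columnsSchema).as("cols"))
  .select(explode(col("cols")).as("c"))
  .select("c.name", "c.nullable", "c.comment", "c.default")
```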

### How was this patch tested?

- Updated golden files for `describe.sql`
- Added tests in `DescribeTableParserSuite.scala`, `DescribeTableSuite.scala`, `PlanResolutionSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?

Closes #49139 from asl3/asl3/describetableasjson.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
asl3 authored and cloud-fan committed Jan 7, 2025
1 parent 22cbb96 commit 36d23ef
Showing 26 changed files with 1,313 additions and 101 deletions.
12 changes: 12 additions & 0 deletions common/utils/src/main/resources/error/error-conditions.json
@@ -1155,6 +1155,13 @@
],
"sqlState" : "42623"
},
"DESCRIBE_JSON_NOT_EXTENDED" : {
"message" : [
"DESCRIBE TABLE ... AS JSON only supported when [EXTENDED|FORMATTED] is specified.",
"For example: DESCRIBE EXTENDED <tableName> AS JSON is supported but DESCRIBE <tableName> AS JSON is not."
],
"sqlState" : "0A000"
},
"DISTINCT_WINDOW_FUNCTION_UNSUPPORTED" : {
"message" : [
"Distinct window functions are not supported: <windowExpr>."
@@ -5283,6 +5290,11 @@
"Attach a comment to the namespace <namespace>."
]
},
"DESC_TABLE_COLUMN_JSON" : {
"message" : [
"DESC TABLE COLUMN AS JSON not supported for individual columns."
]
},
"DESC_TABLE_COLUMN_PARTITION" : {
"message" : [
"DESC TABLE COLUMN for a specific partition."
1 change: 1 addition & 0 deletions docs/sql-ref-ansi-compliance.md
@@ -568,6 +568,7 @@ Below is a list of all the keywords in Spark SQL.
|ITEMS|non-reserved|non-reserved|non-reserved|
|ITERATE|non-reserved|non-reserved|non-reserved|
|JOIN|reserved|strict-non-reserved|reserved|
|JSON|non-reserved|non-reserved|non-reserved|
|KEYS|non-reserved|non-reserved|non-reserved|
|LANGUAGE|non-reserved|non-reserved|reserved|
|LAST|non-reserved|non-reserved|non-reserved|
99 changes: 96 additions & 3 deletions docs/sql-ref-syntax-aux-describe-table.md
@@ -29,16 +29,17 @@ to return the metadata pertaining to a partition or column respectively.
### Syntax

```sql
{ DESC | DESCRIBE } [ TABLE ] [ format ] table_identifier [ partition_spec ] [ col_name ]
{ DESC | DESCRIBE } [ TABLE ] [ format ] table_identifier [ partition_spec ] [ col_name ] [ AS JSON ]
```

### Parameters

* **format**

Specifies the optional format of describe output. If `EXTENDED` is specified
Specifies the optional format of describe output. If `EXTENDED` or `FORMATTED` is specified
then additional metadata information (such as parent database, owner, and access time)
is returned.
is returned. Also if `EXTENDED` or `FORMATTED` is specified, then the metadata can be returned
in JSON format by specifying `AS JSON` at the end of the statement.

* **table_identifier**

@@ -60,8 +61,96 @@
and `col_name` are mutually exclusive and can not be specified together. Currently
nested columns are not allowed to be specified.

JSON format is not currently supported for individual columns.

**Syntax:** `[ database_name. ] [ table_name. ] column_name`

* **AS JSON**

An optional parameter to return the table metadata in JSON format. Only supported when `EXTENDED`
or `FORMATTED` format is specified (both produce equivalent JSON).

**Syntax:** `[ AS JSON ]`

**Schema:**

Below is the full JSON schema.
In actual output, null fields are omitted and the JSON is not pretty-printed (see Examples).

```sql
{
"table_name": "<table_name>",
"catalog_name": "<catalog_name>",
"schema_name": "<innermost_namespace_name>",
"namespace": ["<namespace_names>"],
"type": "<table_type>",
"provider": "<provider>",
"columns": [
{
"name": "<name>",
"type": <type_json>,
"comment": "<comment>",
"nullable": <boolean>,
"default": "<default_val>"
}
],
"partition_values": {
"<col_name>": "<val>"
},
"location": "<path>",
"view_text": "<view_text>",
"view_original_text": "<view_original_text>",
"view_schema_mode": "<view_schema_mode>",
"view_catalog_and_namespace": "<view_catalog_and_namespace>",
"view_query_output_columns": ["col1", "col2"],
"comment": "<comment>",
"table_properties": {
"property1": "<property1>",
"property2": "<property2>"
},
"storage_properties": {
"property1": "<property1>",
"property2": "<property2>"
},
"serde_library": "<serde_library>",
"input_format": "<input_format>",
"output_format": "<output_format>",
"num_buckets": <num_buckets>,
"bucket_columns": ["<col_name>"],
"sort_columns": ["<col_name>"],
"created_time": "<timestamp_ISO-8601>",
"created_by": "<created_by>",
"last_access": "<timestamp_ISO-8601>",
"partition_provider": "<partition_provider>"
}
```

Below are the schema definitions for `<type_json>`:

| Spark SQL Data Types | JSON Representation |
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ByteType | `{ "name" : "tinyint" }` |
| ShortType | `{ "name" : "smallint" }` |
| IntegerType | `{ "name" : "int" }` |
| LongType | `{ "name" : "bigint" }` |
| FloatType | `{ "name" : "float" }` |
| DoubleType | `{ "name" : "double" }` |
| DecimalType | `{ "name" : "decimal", "precision": p, "scale": s }` |
| StringType | `{ "name" : "string" }` |
| VarCharType | `{ "name" : "varchar", "length": n }` |
| CharType | `{ "name" : "char", "length": n }` |
| BinaryType | `{ "name" : "binary" }` |
| BooleanType | `{ "name" : "boolean" }` |
| DateType | `{ "name" : "date" }` |
| VariantType | `{ "name" : "variant" }` |
| TimestampType | `{ "name" : "timestamp_ltz" }` |
| TimestampNTZType | `{ "name" : "timestamp_ntz" }` |
| YearMonthIntervalType | `{ "name" : "interval", "start_unit": "<start_unit>", "end_unit": "<end_unit>" }` |
| DayTimeIntervalType | `{ "name" : "interval", "start_unit": "<start_unit>", "end_unit": "<end_unit>" }` |
| ArrayType | `{ "name" : "array", "element_type": <type_json>, "element_nullable": <boolean> }` |
| MapType | `{ "name" : "map", "key_type": <type_json>, "value_type": <type_json>, "value_nullable": <boolean> }` |
| StructType            | `{ "name" : "struct", "fields": [ {"name" : "field1", "type" : <type_json>, "nullable": <boolean>, "comment": "<comment>", "default": "<default_val>"}, ... ] }` |
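
For illustration only (a constructed example combining the rows above, not copied from actual output), a column of type `ARRAY<STRUCT<id: INT, name: STRING>>` would be represented roughly as:

```
{
  "name": "array",
  "element_type": {
    "name": "struct",
    "fields": [
      { "name": "id", "type": { "name": "int" }, "nullable": true },
      { "name": "name", "type": { "name": "string" }, "nullable": true }
    ]
  },
  "element_nullable": true
}
```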

### Examples

```sql
@@ -173,6 +262,10 @@ DESCRIBE customer salesdb.customer.name;
|data_type| string|
| comment|Short name|
+---------+----------+

-- Returns the table metadata in JSON format.
DESC FORMATTED customer AS JSON;
{"table_name":"customer","catalog_name":"spark_catalog","schema_name":"default","namespace":["default"],"columns":[{"name":"cust_id","type":{"name":"integer"},"nullable":true},{"name":"name","type":{"name":"string"},"comment":"Short name","nullable":true},{"name":"state","type":{"name":"varchar","length":20},"nullable":true}],"location": "file:/tmp/salesdb.db/custom...","created_time":"2020-04-07T14:05:43Z","last_access":"UNKNOWN","created_by":"None","type":"MANAGED","provider":"parquet","partition_provider":"Catalog","partition_columns":["state"]}
```

### Related Statements
@@ -283,6 +283,7 @@ IS: 'IS';
ITEMS: 'ITEMS';
ITERATE: 'ITERATE';
JOIN: 'JOIN';
JSON: 'JSON';
KEYS: 'KEYS';
LANGUAGE: 'LANGUAGE';
LAST: 'LAST';
@@ -287,7 +287,7 @@ statement
| (DESC | DESCRIBE) namespace EXTENDED?
identifierReference #describeNamespace
| (DESC | DESCRIBE) TABLE? option=(EXTENDED | FORMATTED)?
identifierReference partitionSpec? describeColName? #describeRelation
identifierReference partitionSpec? describeColName? (AS JSON)? #describeRelation
| (DESC | DESCRIBE) QUERY? query #describeQuery
| COMMENT ON namespace identifierReference IS
comment #commentNamespace
@@ -1680,6 +1680,7 @@ ansiNonReserved
| INVOKER
| ITEMS
| ITERATE
| JSON
| KEYS
| LANGUAGE
| LAST
@@ -2039,6 +2040,7 @@ nonReserved
| IS
| ITEMS
| ITERATE
| JSON
| KEYS
| LANGUAGE
| LAST
@@ -41,6 +41,18 @@ private[sql] trait CompilationErrors extends DataTypeErrorsBase {
cause = Option(cause))
}

def describeJsonNotExtendedError(tableName: String): AnalysisException = {
new AnalysisException(
errorClass = "DESCRIBE_JSON_NOT_EXTENDED",
messageParameters = Map("tableName" -> tableName))
}

def describeColJsonUnsupportedError(): AnalysisException = {
new AnalysisException(
errorClass = "UNSUPPORTED_FEATURE.DESC_TABLE_COLUMN_JSON",
messageParameters = Map.empty)
}

def cannotFindDescriptorFileError(filePath: String, cause: Throwable): AnalysisException = {
new AnalysisException(
errorClass = "PROTOBUF_DESCRIPTOR_FILE_NOT_FOUND",
