# [SPARK-50541] Describe Table As JSON

### What changes were proposed in this pull request?

Support `DESCRIBE TABLE ... [AS JSON]` to optionally display table metadata in JSON format.

**SQL Ref Spec:**

{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] table_name { [ PARTITION clause ] | [ column_name ] } **[ AS JSON ]**

Output:
json_metadata: String

### Why are the changes needed?

The Spark SQL command `DESCRIBE TABLE` displays table metadata in a DataFrame format geared toward human consumption. That format is hard to parse reliably, e.g. when fields contain special characters or when the layout changes as new features are added.
The new `AS JSON` option returns the table metadata as a JSON string that can be parsed programmatically and is extensible with minimal risk of breaking changes. It is not intended to be human-readable.
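
For example, a tool can pull individual fields out of the JSON with ordinary JSON functions instead of scraping the row-based `DESCRIBE` output. A minimal sketch in Spark Scala (e.g. in `spark-shell`); the table name `sales` is hypothetical, while the single `json_metadata` output column is as specified above:

```scala
import org.apache.spark.sql.functions.{col, get_json_object}

// DESCRIBE ... AS JSON returns one row with one string column, `json_metadata`.
val desc = spark.sql("DESCRIBE EXTENDED sales AS JSON")

// Extract specific fields with JSON-path expressions rather than parsing the
// human-oriented DESCRIBE rows.
val location = desc.select(get_json_object(col("json_metadata"), "$.location")).head().getString(0)
val provider = desc.select(get_json_object(col("json_metadata"), "$.provider")).head().getString(0)
```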

### Does this PR introduce _any_ user-facing change?

Yes, this provides a new option to display `DESCRIBE TABLE` metadata in JSON format. See below (and the updated golden files) for the JSON output schema:

```
{
      "table_name": "<table_name>",
      "catalog_name": "<catalog_name>",
      "schema_name": "<innermost_schema_name>",
      "namespace": ["<innermost_schema_name>"],
      "type": "<table_type>",
      "provider": "<provider>",
      "columns": [
        {
          "name": "<name>",
          "type": <type_json>,
          "comment": "<comment>",
          "nullable": <boolean>,
          "default": "<default_val>"
        }
      ],
      "partition_values": {
        "<col_name>": "<val>"
      },
      "location": "<path>",
      "view_text": "<view_text>",
      "view_original_text": "<view_original_text>",
      "view_schema_mode": "<view_schema_mode>",
      "view_catalog_and_namespace": "<view_catalog_and_namespace>",
      "view_query_output_columns": ["col1", "col2"],
      "owner": "<owner>",
      "comment": "<comment>",
      "table_properties": {
        "property1": "<property1>",
        "property2": "<property2>"
      },
      "storage_properties": {
        "property1": "<property1>",
        "property2": "<property2>"
      },
      "serde_library": "<serde_library>",
      "input_format": "<input_format>",
      "output_format": "<output_format>",
      "num_buckets": <num_buckets>,
      "bucket_columns": ["<col_name>"],
      "sort_columns": ["<col_name>"],
      "created_time": "<timestamp_ISO-8601>",
      "last_access": "<timestamp_ISO-8601>",
      "partition_provider": "<partition_provider>"
}
```
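
Because this schema is stable, nested fields can be parsed into typed structures. A hedged sketch in Spark Scala of turning the `columns` array above into a DataFrame; the table name `sales` is hypothetical, and the nested `type` object is left out of the parse schema for brevity:

```scala
import org.apache.spark.sql.functions.{col, explode, from_json, get_json_object}
import org.apache.spark.sql.types._

// Shape of each entry in the "columns" array (the nested "type" object is skipped here).
val columnsSchema = ArrayType(new StructType()
  .add("name", StringType)
  .add("comment", StringType)
  .add("nullable", BooleanType)
  .add("default", StringType))

val columns = spark.sql("DESCRIBE EXTENDED sales AS JSON")
  .select(from_json(get_json_object(col("json_metadata"), "$.columns"), columnsSchema).as("cols"))
  .select(explode(col("cols")).as("c"))
  .select("c.name", "c.nullable", "c.comment", "c.default")
```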

### How was this patch tested?

- Updated golden files for `describe.sql`
- Added tests in `DescribeTableParserSuite.scala`, `DescribeTableSuite.scala`, `PlanResolutionSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?

Closes #49139 from asl3/asl3/describetableasjson.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
asl3 authored and cloud-fan committed Jan 7, 2025
1 parent 22cbb96 commit 36d23ef
Showing 26 changed files with 1,313 additions and 101 deletions.
12 changes: 12 additions & 0 deletions common/utils/src/main/resources/error/error-conditions.json
@@ -1155,6 +1155,13 @@
],
"sqlState" : "42623"
},
"DESCRIBE_JSON_NOT_EXTENDED" : {
"message" : [
"DESCRIBE TABLE ... AS JSON only supported when [EXTENDED|FORMATTED] is specified.",
"For example: DESCRIBE EXTENDED <tableName> AS JSON is supported but DESCRIBE <tableName> AS JSON is not."
],
"sqlState" : "0A000"
},
"DISTINCT_WINDOW_FUNCTION_UNSUPPORTED" : {
"message" : [
"Distinct window functions are not supported: <windowExpr>."
@@ -5283,6 +5290,11 @@
"Attach a comment to the namespace <namespace>."
]
},
"DESC_TABLE_COLUMN_JSON" : {
"message" : [
"DESC TABLE COLUMN AS JSON not supported for individual columns."
]
},
"DESC_TABLE_COLUMN_PARTITION" : {
"message" : [
"DESC TABLE COLUMN for a specific partition."
1 change: 1 addition & 0 deletions docs/sql-ref-ansi-compliance.md
@@ -568,6 +568,7 @@ Below is a list of all the keywords in Spark SQL.
|ITEMS|non-reserved|non-reserved|non-reserved|
|ITERATE|non-reserved|non-reserved|non-reserved|
|JOIN|reserved|strict-non-reserved|reserved|
|JSON|non-reserved|non-reserved|non-reserved|
|KEYS|non-reserved|non-reserved|non-reserved|
|LANGUAGE|non-reserved|non-reserved|reserved|
|LAST|non-reserved|non-reserved|non-reserved|
99 changes: 96 additions & 3 deletions docs/sql-ref-syntax-aux-describe-table.md
@@ -29,16 +29,17 @@ to return the metadata pertaining to a partition or column respectively.
### Syntax

```sql
{ DESC | DESCRIBE } [ TABLE ] [ format ] table_identifier [ partition_spec ] [ col_name ]
{ DESC | DESCRIBE } [ TABLE ] [ format ] table_identifier [ partition_spec ] [ col_name ] [ AS JSON ]
```

### Parameters

* **format**

Specifies the optional format of describe output. If `EXTENDED` is specified
Specifies the optional format of describe output. If `EXTENDED` or `FORMATTED` is specified
then additional metadata information (such as parent database, owner, and access time)
is returned.
is returned. Also if `EXTENDED` or `FORMATTED` is specified, then the metadata can be returned
in JSON format by specifying `AS JSON` at the end of the statement.

* **table_identifier**

@@ -60,8 +61,96 @@
and `col_name` are mutually exclusive and can not be specified together. Currently
nested columns are not allowed to be specified.

JSON format is not currently supported for individual columns.

**Syntax:** `[ database_name. ] [ table_name. ] column_name`

* **AS JSON**

An optional parameter to return the table metadata in JSON format. Only supported when `EXTENDED`
or `FORMATTED` format is specified (both produce equivalent JSON).

**Syntax:** `[ AS JSON ]`

**Schema:**

Below is the full JSON schema.
In actual output, null fields are omitted and the JSON is not pretty-printed (see Examples).

```sql
{
"table_name": "<table_name>",
"catalog_name": "<catalog_name>",
"schema_name": "<innermost_namespace_name>",
"namespace": ["<namespace_names>"],
"type": "<table_type>",
"provider": "<provider>",
"columns": [
{
"name": "<name>",
"type": <type_json>,
"comment": "<comment>",
"nullable": <boolean>,
"default": "<default_val>"
}
],
"partition_values": {
"<col_name>": "<val>"
},
"location": "<path>",
"view_text": "<view_text>",
"view_original_text": "<view_original_text>",
"view_schema_mode": "<view_schema_mode>",
"view_catalog_and_namespace": "<view_catalog_and_namespace>",
"view_query_output_columns": ["col1", "col2"],
"comment": "<comment>",
"table_properties": {
"property1": "<property1>",
"property2": "<property2>"
},
"storage_properties": {
"property1": "<property1>",
"property2": "<property2>"
},
"serde_library": "<serde_library>",
"input_format": "<input_format>",
"output_format": "<output_format>",
"num_buckets": <num_buckets>,
"bucket_columns": ["<col_name>"],
"sort_columns": ["<col_name>"],
"created_time": "<timestamp_ISO-8601>",
"created_by": "<created_by>",
"last_access": "<timestamp_ISO-8601>",
"partition_provider": "<partition_provider>"
}
```

Below are the schema definitions for `<type_json>`:

| Spark SQL Data Types | JSON Representation |
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ByteType | `{ "name" : "tinyint" }` |
| ShortType | `{ "name" : "smallint" }` |
| IntegerType | `{ "name" : "int" }` |
| LongType | `{ "name" : "bigint" }` |
| FloatType | `{ "name" : "float" }` |
| DoubleType | `{ "name" : "double" }` |
| DecimalType | `{ "name" : "decimal", "precision": p, "scale": s }` |
| StringType | `{ "name" : "string" }` |
| VarCharType | `{ "name" : "varchar", "length": n }` |
| CharType | `{ "name" : "char", "length": n }` |
| BinaryType | `{ "name" : "binary" }` |
| BooleanType | `{ "name" : "boolean" }` |
| DateType | `{ "name" : "date" }` |
| VariantType | `{ "name" : "variant" }` |
| TimestampType | `{ "name" : "timestamp_ltz" }` |
| TimestampNTZType | `{ "name" : "timestamp_ntz" }` |
| YearMonthIntervalType | `{ "name" : "interval", "start_unit": "<start_unit>", "end_unit": "<end_unit>" }` |
| DayTimeIntervalType | `{ "name" : "interval", "start_unit": "<start_unit>", "end_unit": "<end_unit>" }` |
| ArrayType | `{ "name" : "array", "element_type": <type_json>, "element_nullable": <boolean> }` |
| MapType | `{ "name" : "map", "key_type": <type_json>, "value_type": <type_json>, "value_nullable": <boolean> }` |
| StructType            | `{ "name" : "struct", "fields": [ {"name" : "field1", "type" : <type_json>, "nullable": <boolean>, "comment": "<comment>", "default": "<default_val>"}, ... ] }` |
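
For illustration only (a constructed example combining the rows above, not copied from actual output), a column of type `ARRAY<STRUCT<id: INT, name: STRING>>` would be represented roughly as:

```
{
  "name": "array",
  "element_type": {
    "name": "struct",
    "fields": [
      { "name": "id", "type": { "name": "int" }, "nullable": true },
      { "name": "name", "type": { "name": "string" }, "nullable": true }
    ]
  },
  "element_nullable": true
}
```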

### Examples

```sql
@@ -173,6 +262,10 @@ DESCRIBE customer salesdb.customer.name;
|data_type| string|
| comment|Short name|
+---------+----------+

-- Returns the table metadata in JSON format.
DESC FORMATTED customer AS JSON;
{"table_name":"customer","catalog_name":"spark_catalog","schema_name":"default","namespace":["default"],"columns":[{"name":"cust_id","type":{"name":"integer"},"nullable":true},{"name":"name","type":{"name":"string"},"comment":"Short name","nullable":true},{"name":"state","type":{"name":"varchar","length":20},"nullable":true}],"location": "file:/tmp/salesdb.db/custom...","created_time":"2020-04-07T14:05:43Z","last_access":"UNKNOWN","created_by":"None","type":"MANAGED","provider":"parquet","partition_provider":"Catalog","partition_columns":["state"]}
```

### Related Statements
@@ -283,6 +283,7 @@ IS: 'IS';
ITEMS: 'ITEMS';
ITERATE: 'ITERATE';
JOIN: 'JOIN';
JSON: 'JSON';
KEYS: 'KEYS';
LANGUAGE: 'LANGUAGE';
LAST: 'LAST';
@@ -287,7 +287,7 @@ statement
| (DESC | DESCRIBE) namespace EXTENDED?
identifierReference #describeNamespace
| (DESC | DESCRIBE) TABLE? option=(EXTENDED | FORMATTED)?
identifierReference partitionSpec? describeColName? #describeRelation
identifierReference partitionSpec? describeColName? (AS JSON)? #describeRelation
| (DESC | DESCRIBE) QUERY? query #describeQuery
| COMMENT ON namespace identifierReference IS
comment #commentNamespace
@@ -1680,6 +1680,7 @@ ansiNonReserved
| INVOKER
| ITEMS
| ITERATE
| JSON
| KEYS
| LANGUAGE
| LAST
@@ -2039,6 +2040,7 @@ nonReserved
| IS
| ITEMS
| ITERATE
| JSON
| KEYS
| LANGUAGE
| LAST
@@ -41,6 +41,18 @@ private[sql] trait CompilationErrors extends DataTypeErrorsBase {
cause = Option(cause))
}

def describeJsonNotExtendedError(tableName: String): AnalysisException = {
new AnalysisException(
errorClass = "DESCRIBE_JSON_NOT_EXTENDED",
messageParameters = Map("tableName" -> tableName))
}

def describeColJsonUnsupportedError(): AnalysisException = {
new AnalysisException(
errorClass = "UNSUPPORTED_FEATURE.DESC_TABLE_COLUMN_JSON",
messageParameters = Map.empty)
}

def cannotFindDescriptorFileError(filePath: String, cause: Throwable): AnalysisException = {
new AnalysisException(
errorClass = "PROTOBUF_DESCRIPTOR_FILE_NOT_FOUND",
