feat(query): virtual column allow cast to other type #16903

Merged
merged 6 commits into from
Nov 28, 2024

Conversation

b41sh
Member

@b41sh b41sh commented Nov 21, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Allow virtual columns to be cast to other types, which avoids JSONB value deserialization and improves performance.

  • fix meta proto-conv v111 add glue meta test
  • add a data_types field to the virtual column proto
  • allow virtual columns to be cast to other types; this avoids JSONB value deserialization and improves performance

In the following example, the execution time drops from 291.674 sec to 235.888 sec (about 19% less), and the amount of data read drops from 11.01 GiB to 9.16 GiB (about 17% less).

root@0.0.0.0:48000/default> CREATE OR REPLACE TABLE user_activity_logs AS
SELECT 
    number % 100000000 AS id,
    JSON_OBJECT(
        'id', CAST(number % 100000000 AS STRING),
        'email', CONCAT('user', CAST(number % 100000000 AS STRING), '@example.com'),
        'nickname', CONCAT('user', CAST(number % 100000000 AS STRING)),
        'phone', CONCAT('******', CAST(number % 10000 + 1000 AS STRING))
    ) AS data
FROM numbers(100000000);

-- create virtual columns
root@0.0.0.0:48000/default> CREATE or replace VIRTUAL COLUMN (
    data['id'],
    data['email'],
    data['nickname'],
    data['phone']
) FOR user_activity_logs;

-- Refresh the virtual columns to activate them
root@0.0.0.0:48000/default> REFRESH VIRTUAL COLUMN FOR user_activity_logs;

root@0.0.0.0:48000/default> select data['id'], data['email'], data['nickname'], data['phone'] from user_activity_logs;
┌────────────────────────────────────────────────────────────────────────────────────────┐
│     data['id']    │        data['email']       │  data['nickname'] │   data['phone']   │
│ Nullable(Variant) │      Nullable(Variant)     │ Nullable(Variant) │ Nullable(Variant) │
├───────────────────┼────────────────────────────┼───────────────────┼───────────────────┤
│ "91732188"        │ "user91732188@example.com" │ "user91732188"    │ "******3188"      │
│ "91732189"        │ "user91732189@example.com" │ "user91732189"    │ "******3189"      │
│ ·                 │ ·                          │ ·                 │ ·                 │
│ ·                 │ ·                          │ ·                 │ ·                 │
│ ·                 │ ·                          │ ·                 │ ·                 │
│ "37499993"        │ "user37499993@example.com" │ "user37499993"    │ "******10993"     │
│ 100000000 rows    │                            │                   │                   │
│ (1000 shown)      │                            │                   │                   │
└────────────────────────────────────────────────────────────────────────────────────────┘
100000000 rows read in 291.674 sec. Processed 100 million rows, 11.01 GiB (342.85 thousand rows/s, 38.67 MiB/s)


root@0.0.0.0:48000/default> CREATE OR REPLACE TABLE user_activity_logs2 AS
SELECT 
    number % 100000000 AS id,
    JSON_OBJECT(
        'id', CAST(number % 100000000 AS STRING),
        'email', CONCAT('user', CAST(number % 100000000 AS STRING), '@example.com'),
        'nickname', CONCAT('user', CAST(number % 100000000 AS STRING)),
        'phone', CONCAT('******', CAST(number % 10000 + 1000 AS STRING))
    ) AS data
FROM numbers(100000000);

-- create virtual columns
root@0.0.0.0:48000/default> CREATE or replace VIRTUAL COLUMN (
    data['id']::int,
    data['email']::string,
    data['nickname']::string,
    data['phone']::string
) FOR user_activity_logs2;

-- Refresh the virtual columns to activate them
root@0.0.0.0:48000/default> REFRESH VIRTUAL COLUMN FOR user_activity_logs2;

root@0.0.0.0:48000/default> select data['id'], data['email'], data['nickname'], data['phone'] from user_activity_logs2;
┌──────────────────────────────────────────────────────────────────────────────────┐
│    data['id']   │       data['email']      │ data['nickname'] │   data['phone']  │
│ Nullable(Int32) │     Nullable(String)     │ Nullable(String) │ Nullable(String) │
├─────────────────┼──────────────────────────┼──────────────────┼──────────────────┤
│        66666656 │ user66666656@example.com │ user66666656     │ ******7656       │
│        66666657 │ user66666657@example.com │ user66666657     │ ******7657       │
│               · │ ·                        │ ·                │ ·                │
│               · │ ·                        │ ·                │ ·                │
│               · │ ·                        │ ·                │ ·                │
│         4166665 │ user4166665@example.com  │ user4166665      │ ******7665       │
│  100000000 rows │                          │                  │                  │
│    (1000 shown) │                          │                  │                  │
└──────────────────────────────────────────────────────────────────────────────────┘
100000000 rows read in 235.888 sec. Processed 100 million rows, 9.16 GiB (423.93 thousand rows/s, 39.77 MiB/s)
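
The relative savings implied by the two runs above can be checked with a little arithmetic. This is only a sketch: `relative_reduction` is a hypothetical helper written for this note, not part of Databend, and the inputs are the timings and data volumes reported in the two query summaries.

```rust
/// Fraction by which `after` is smaller than `before`
/// (for example, 0.25 means a 25% reduction).
/// Hypothetical helper for illustration; not part of the Databend codebase.
fn relative_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before
}

/// Formats the reduction as a percentage with one decimal place.
fn reduction_percent(before: f64, after: f64) -> String {
    format!("{:.1}%", relative_reduction(before, after) * 100.0)
}
```

Calling `reduction_percent(291.674, 235.888)` gives the drop in execution time, and `reduction_percent(11.01, 9.16)` gives the drop in data read. These are single-run measurements, so the percentages are indicative rather than a rigorous benchmark.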
  • fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Nov 21, 2024
@b41sh b41sh force-pushed the feat-virtual-column-cast branch from f485a9f to 7edbe1f Compare November 27, 2024 09:54
@b41sh b41sh requested a review from sundy-li November 27, 2024 10:13
@b41sh b41sh marked this pull request as ready for review November 27, 2024 10:15
@b41sh b41sh requested a review from drmingdrmer as a code owner November 27, 2024 10:15
Member

@drmingdrmer drmingdrmer left a comment

Reviewed 10 of 29 files at r1, all commit messages.
Reviewable status: 10 of 29 files reviewed, 1 unresolved discussion (waiting on @b41sh and @sundy-li)


src/meta/proto-conv/src/virtual_column_from_to_protobuf_impl.rs line 54 at r1 (raw file):

            for (v, ty) in p.virtual_columns.iter().zip(p.data_types.iter()) {
                virtual_columns.push((v.clone(), TableDataType::from_pb(ty.clone())?));
            }

just a nit:
The lengths of virtual_columns and data_types should be identical, but it would still be better to assert that here and return an Incompatible error if they do not match.
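
A minimal sketch of the suggested check, for illustration only. The types below (`DataType`, `Incompatible`) and the function `pair_columns` are simplified, hypothetical stand-ins and not the actual definitions in the proto-conv crate; the point is only that `zip` silently stops at the shorter input, so the lengths are compared first.

```rust
// Hypothetical stand-in for the real table data type; an assumption for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    String,
    Variant,
}

// Hypothetical stand-in for the real incompatibility error type.
#[derive(Debug, PartialEq)]
pub struct Incompatible {
    pub reason: String,
}

/// Pairs each virtual column expression with its data type.
/// `zip` would silently truncate to the shorter slice, so a length
/// mismatch is reported as an error before zipping.
pub fn pair_columns(
    virtual_columns: &[String],
    data_types: &[DataType],
) -> Result<Vec<(String, DataType)>, Incompatible> {
    if virtual_columns.len() != data_types.len() {
        return Err(Incompatible {
            reason: format!(
                "virtual_columns has {} entries but data_types has {}",
                virtual_columns.len(),
                data_types.len()
            ),
        });
    }
    Ok(virtual_columns
        .iter()
        .cloned()
        .zip(data_types.iter().cloned())
        .collect())
}
```

With matching lengths the function returns the zipped pairs; with mismatched lengths it returns an error instead of dropping the trailing entries.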

@b41sh b41sh enabled auto-merge November 28, 2024 06:59
@b41sh b41sh added this pull request to the merge queue Nov 28, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 28, 2024
@b41sh b41sh added this pull request to the merge queue Nov 28, 2024
Merged via the queue into databendlabs:main with commit 7cff135 Nov 28, 2024
72 of 73 checks passed
@b41sh b41sh deleted the feat-virtual-column-cast branch November 28, 2024 11:48