feat(query): Inverted index search function support options #16256

Merged
merged 7 commits into from
Aug 20, 2024


@b41sh b41sh commented Aug 15, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Add an optional argument to the match and query functions for specifying additional configuration options.

Currently, we support the following three options:

  1. fuzziness: if this option is specified, terms within the given Levenshtein distance of a query term are also matched. fuzziness can be set to 1 or 2. For example, if fuzziness is 1 and the query text is box, then both box and fox are matched, because fox is at distance 1 from box.
  2. operator: this option can be set to OR or AND; the default value is OR. For example, the query text happy tax payer is equivalent to happy OR tax OR payer by default, but if operator is AND, it is equivalent to happy AND tax AND payer.
  3. lenient: this option can be set to true or false; the default value is false. If set to true, no error is reported when the query text is not valid.
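
To make the fuzziness rule concrete, here is a minimal Python sketch of plain Levenshtein distance and a distance check. It is an illustration only, not the code path Databend or Tantivy uses; the names levenshtein and fuzzy_match are hypothetical, and the real engine may count edits differently (for example, by treating a transposition as a single edit).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(query_term: str, indexed_term: str, fuzziness: int) -> bool:
    """Hypothetical helper: a term matches if it is within `fuzziness` edits."""
    return levenshtein(query_term, indexed_term) <= fuzziness


print(levenshtein("box", "fox"))     # one substitution: b -> f
print(fuzzy_match("box", "fox", 1))
print(fuzzy_match("box", "dog", 1))
```

Under this definition, box is at distance 1 from fox (a single substitution) and at distance 2 from dog, so only fox qualifies when fuzziness is 1.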

Multiple options are separated by a semicolon (;).
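
As a rough sketch of how a semicolon-separated option string such as fuzziness=1;operator=AND could be interpreted, the Python function below splits the string and validates each key against the three options described above. The function name parse_options is hypothetical, and the tolerance for surrounding whitespace and mixed-case values is an assumption of this sketch, not something this PR states.

```python
def parse_options(text: str) -> dict:
    """Hypothetical parser for an option string like 'fuzziness=1;operator=AND'."""
    # Defaults follow the PR description: operator is OR, lenient is false,
    # and fuzziness is unset unless given.
    options = {"fuzziness": None, "operator": "OR", "lenient": False}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue  # assumption: ignore empty segments such as a trailing ';'
        key, sep, value = part.partition("=")
        key, value = key.strip().lower(), value.strip()
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        if key == "fuzziness":
            if value not in ("1", "2"):
                raise ValueError("fuzziness must be 1 or 2")
            options["fuzziness"] = int(value)
        elif key == "operator":
            if value.upper() not in ("OR", "AND"):
                raise ValueError("operator must be OR or AND")
            options["operator"] = value.upper()
        elif key == "lenient":
            if value.lower() not in ("true", "false"):
                raise ValueError("lenient must be true or false")
            options["lenient"] = value.lower() == "true"
        else:
            raise ValueError(f"unknown option {key!r}")
    return options


print(parse_options("fuzziness=1;operator=AND"))
```

Rejecting unknown keys and out-of-range values up front is a design choice in this sketch; the actual function may handle invalid options differently.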

For example:

MySQL [(none)]> CREATE TABLE t (id int, content string, INVERTED INDEX idx1 (content) tokenizer = 'english' filters = 'english_stop,english_stemmer');
Query OK, 0 rows affected (0.107 sec)

MySQL [(none)]> INSERT INTO t VALUES
    -> (1, 'The quick brown fox jumps over the lazy dog'),
    -> (2, 'A picture is worth a thousand words'),
    -> (3, 'The early bird catches the worm'),
    -> (4, 'Actions speak louder than words');
Query OK, 4 rows affected (0.226 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE match(content, 'box');
Empty set (1.030 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE match(content, 'box', 'fuzziness=1');
+------+---------+---------------------------------------------+
| id   | score() | content                                     |
+------+---------+---------------------------------------------+
|    1 |     1.0 | The quick brown fox jumps over the lazy dog |
+------+---------+---------------------------------------------+
1 row in set (0.121 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE query('content:box', 'fuzziness=1');
+------+---------+---------------------------------------------+
| id   | score() | content                                     |
+------+---------+---------------------------------------------+
|    1 |     1.0 | The quick brown fox jumps over the lazy dog |
+------+---------+---------------------------------------------+
1 row in set (4.656 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE match(content, 'action works', 'fuzziness=1');
+------+---------+-------------------------------------+
| id   | score() | content                             |
+------+---------+-------------------------------------+
|    2 |     1.0 | A picture is worth a thousand words |
|    3 |     1.0 | The early bird catches the worm     |
|    4 |     2.0 | Actions speak louder than words     |
+------+---------+-------------------------------------+
3 rows in set (0.072 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE query('content:action works', 'fuzziness=1');
+------+---------+-------------------------------------+
| id   | score() | content                             |
+------+---------+-------------------------------------+
|    2 |     1.0 | A picture is worth a thousand words |
|    3 |     1.0 | The early bird catches the worm     |
|    4 |     2.0 | Actions speak louder than words     |
+------+---------+-------------------------------------+
3 rows in set (0.089 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE match(content, 'action works', 'fuzziness=1;operator=AND');
+------+---------+---------------------------------+
| id   | score() | content                         |
+------+---------+---------------------------------+
|    4 |     2.0 | Actions speak louder than words |
+------+---------+---------------------------------+
1 row in set (0.147 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE query('content:action works', 'fuzziness=1;operator=AND');
+------+---------+---------------------------------+
| id   | score() | content                         |
+------+---------+---------------------------------+
|    4 |     2.0 | Actions speak louder than words |
+------+---------+---------------------------------+
1 row in set (0.056 sec)

MySQL [(none)]> SELECT id, score(), content FROM t WHERE match(content, '()');
ERROR 1105 (HY000): TantivyQueryParserError. Code: 1903, Text = Syntax Error: ().

MySQL [(none)]> SELECT id, score(), content FROM t WHERE match(content, '()', 'lenient=true');
Empty set (0.063 sec)
  • fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Aug 15, 2024
@b41sh b41sh marked this pull request as ready for review August 19, 2024 00:46
@b41sh b41sh requested a review from sundy-li August 19, 2024 00:47
@BohuTANG BohuTANG merged commit d14f7a5 into databendlabs:main Aug 20, 2024
72 checks passed
@BohuTANG

I think the documentation needs to be updated. cc @soyeric128
