-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1] DatabricksRM: Use databricks-sdk to fetch token / workspace URL + several small improvements #1564
Conversation
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
dspy.settings.configure(lm=llm, rm=retriever_model) | ||
``` | ||
|
||
Below is a code snippet that shows how to query the Databricks Direct Vector Access Index using the forward() function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The forward()
function is not what users call to retrieve documents with DatabricksRM
Signed-off-by: dbczumar <[email protected]>
```python | ||
self.retrieve = DatabricksRM(query=[1, 2, 3], query_type = 'vector') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The preexisting implementation of DatabricksRM
used an inconsistent / incorrect definition of query_type
. In Databricks Vector Search, query_type
defines the search algorithm (ANN or hybrid), not the type of input (text or vector). The type of input can be inferred from the Python value type. This PR updates the query_type
argument to have consistent semantics with Databricks Vector Search while maintaining backwards compatible for the old argument values.
self.databricks_index_name = databricks_index_name | ||
self.columns = columns | ||
self.columns = list({docs_id_column_name, text_column_name, *(columns or [])}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Automatically include the user-specified docs_id_column
and text_column
in the list of columns to retrieve from the vector search index (self.columns
). Previously, users had to specify the docs ID and text columns in 2 places (the corresponding arg and the columns
arg). If the user forgot to specify the docs ID / text columns in the columns
arg, they would face a confusing error (ID / text column not found). This change fixes that problem. Users should not have to reason about this.
for k, v in item.items() | ||
if k not in [self.docs_id_column_name, self.text_column_name] | ||
} | ||
extra_columns = {k: v for k, v in item.items() if k not in [self.docs_id_column_name, self.text_column_name]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a formatting change from the precommit linter
} | ||
return extra_columns | ||
|
||
def forward( | ||
self, | ||
query: Union[str, List[float]], | ||
query_type: str = "text", | ||
query_type: str = "ANN", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._extract_doc_ids(doc) | ||
for doc in sorted_docs | ||
], | ||
doc_ids=[self._extract_doc_ids(doc) for doc in sorted_docs], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a formatting change from the linter
@@ -217,14 +228,117 @@ def forward( | |||
items += [item] | |||
|
|||
# Sorting results by score in descending order | |||
sorted_docs = sorted(items, key=lambda x: x["score"], reverse=True)[:self.k] | |||
sorted_docs = sorted(items, key=lambda x: x["score"], reverse=True)[: self.k] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a formatting change from the linter
raise Exception( | ||
f"text_column_name: '{self.text_column_name}' is not in the index columns: \n {col_names}" | ||
) | ||
raise Exception(f"text_column_name: '{self.text_column_name}' is not in the index columns: \n {col_names}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a formatting change from the linter
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
|
||
# Extracting the results | ||
items = [] | ||
for idx, data_row in enumerate(results["result"]["data_array"]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idx
was unused
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
@@ -1,5 +1,5 @@ | |||
--- | |||
sidebar_position: 2 | |||
sidebar_position: 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Place DatabricksRM second in the list of retrievers documented at https://dspy-docs.vercel.app/api/category/retrieval-model-clients
@@ -18,6 +18,7 @@ ChromadbRM( | |||
``` | |||
|
|||
**Parameters:** | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These newlines were inserted by the pre-commit linter
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
@krypticmouse Any idea if the docs here are easy to fix before we merge? |
@okhat @dbczumar The issue is in docs/dspy-usecases.md file line 112, the link markdown is wrong: Change line 112 from this:
To this:
|
@arnavsinghvi11 is just checking if this will break any existing documentation on the DB site etc since the changes affect the interface. |
If the build fails vercel won’t merge it so there would be no changes reflected in the existing website. |
Thanks @dbczumar for the updates! All looks good. With the changes regarding https://github.com/stanfordnlp/dspy/pull/1564/files#r1780356012, can we update the DSPy on Databricks blog post to reflect this? Specifically, how the retriever is configured and the retreived_results is called through DatabricksRM
|
Signed-off-by: dbczumar <[email protected]>
afcb597
to
1778fdc
Compare
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Thanks @arnavsinghvi11, absolutely! I'll get that blog updated. I'll also file a follow-up PR with some test coverage, but I've QAed this PR manually |
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Structurally LGTM!
DatabricksRM: Use databricks-sdk to fetch token / workspace URL + several small improvements