Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1] DatabricksRM: Use databricks-sdk to fetch token / workspace URL + several small improvements #1564

Merged
merged 25 commits into from
Sep 30, 2024

Conversation

dbczumar
Copy link
Collaborator

DatabricksRM: Use databricks-sdk to fetch token / workspace URL + several small improvements

Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
dspy.settings.configure(lm=llm, rm=retriever_model)
```

Below is a code snippet that shows how to query the Databricks Direct Vector Access Index using the forward() function.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The forward() function is not what users call to retrieve documents with DatabricksRM

Signed-off-by: dbczumar <[email protected]>
```python
self.retrieve = DatabricksRM(query=[1, 2, 3], query_type = 'vector')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preexisting implementation of DatabricksRM used an inconsistent / incorrect definition of query_type. In Databricks Vector Search, query_type defines the search algorithm (ANN or hybrid), not the type of input (text or vector). The type of input can be inferred from the Python value type. This PR updates the query_type argument to have consistent semantics with Databricks Vector Search while maintaining backwards compatible for the old argument values.

self.databricks_index_name = databricks_index_name
self.columns = columns
self.columns = list({docs_id_column_name, text_column_name, *(columns or [])})
Copy link
Collaborator Author

@dbczumar dbczumar Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automatically include the user-specified docs_id_column and text_column in the list of columns to retrieve from the vector search index (self.columns). Previously, users had to specify the docs ID and text columns in 2 places (the corresponding arg and the columns arg). If the user forgot to specify the docs ID / text columns in the columns arg, they would face a confusing error (ID / text column not found). This change fixes that problem. Users should not have to reason about this.

for k, v in item.items()
if k not in [self.docs_id_column_name, self.text_column_name]
}
extra_columns = {k: v for k, v in item.items() if k not in [self.docs_id_column_name, self.text_column_name]}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a formatting change from the precommit linter

}
return extra_columns

def forward(
self,
query: Union[str, List[float]],
query_type: str = "text",
query_type: str = "ANN",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._extract_doc_ids(doc)
for doc in sorted_docs
],
doc_ids=[self._extract_doc_ids(doc) for doc in sorted_docs],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a formatting change from the linter

@@ -217,14 +228,117 @@ def forward(
items += [item]

# Sorting results by score in descending order
sorted_docs = sorted(items, key=lambda x: x["score"], reverse=True)[:self.k]
sorted_docs = sorted(items, key=lambda x: x["score"], reverse=True)[: self.k]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a formatting change from the linter

raise Exception(
f"text_column_name: '{self.text_column_name}' is not in the index columns: \n {col_names}"
)
raise Exception(f"text_column_name: '{self.text_column_name}' is not in the index columns: \n {col_names}")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a formatting change from the linter

Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>

# Extracting the results
items = []
for idx, data_row in enumerate(results["result"]["data_array"]):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idx was unused

Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
@@ -1,5 +1,5 @@
---
sidebar_position: 2
sidebar_position: 3
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Place DatabricksRM second in the list of retrievers documented at https://dspy-docs.vercel.app/api/category/retrieval-model-clients

@@ -18,6 +18,7 @@ ChromadbRM(
```

**Parameters:**

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These newlines were inserted by the pre-commit linter

Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
@okhat
Copy link
Collaborator

okhat commented Sep 30, 2024

@krypticmouse Any idea if the docs here are easy to fix before we merge?

@okhat okhat requested a review from krypticmouse September 30, 2024 14:59
@krypticmouse
Copy link
Collaborator

@okhat @dbczumar The issue is in docs/dspy-usecases.md file line 112, the link markdown is wrong:

https://github.com/stanfordnlp/dspy/blob/cdbea6eb9accf1f1d2cc3f8ce333502bddf3b8fb/docs/docs/dspy-usecases.md?plain=1#L112C1-L112C143

Change line 112 from this:

| **Langfuse** | [Link]([https://docs.langtrace.ai/supported-integrations/llm-frameworks/dspy](https://langfuse.com/docs/integrations/dspy)) |

To this:

| **Langfuse** | [Link](https://langfuse.com/docs/integrations/dspy) |

@okhat
Copy link
Collaborator

okhat commented Sep 30, 2024

@arnavsinghvi11 is just checking if this will break any existing documentation on the DB site etc since the changes affect the interface.

@krypticmouse
Copy link
Collaborator

If the build fails vercel won’t merge it so there would be no changes reflected in the existing website.

@arnavsinghvi11
Copy link
Collaborator

Thanks @dbczumar for the updates! All looks good.

With the changes regarding https://github.com/stanfordnlp/dspy/pull/1564/files#r1780356012, can we update the DSPy on Databricks blog post to reflect this?

Specifically, how the retriever is configured and the retreived_results is called through DatabricksRM

retriever = DatabricksRM(
    databricks_index_name = "your_index_name",
    docs_id_column_name="id",
    text_column_name="field2",
    k=3
)
retrieved_results = DatabricksRM(query="Example query text", query_type="hybrid"))

Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
@dbczumar dbczumar changed the title DatabricksRM: Use databricks-sdk to fetch token / workspace URL + several small improvements [1] DatabricksRM: Use databricks-sdk to fetch token / workspace URL + several small improvements Sep 30, 2024
Signed-off-by: dbczumar <[email protected]>
@dbczumar
Copy link
Collaborator Author

Thanks @arnavsinghvi11, absolutely! I'll get that blog updated. I'll also file a follow-up PR with some test coverage, but I've QAed this PR manually

Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Signed-off-by: dbczumar <[email protected]>
Copy link
Collaborator

@krypticmouse krypticmouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structurally LGTM!

@okhat okhat merged commit 59ae987 into stanfordnlp:main Sep 30, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants