
Add dspy.Embedding #1735

Merged: 3 commits from dspy-embedding into stanfordnlp:main on Nov 3, 2024
Conversation

@chenmoneygithub (Collaborator) commented Nov 1, 2024

This PR adds a very simple dspy.Embedding that supports:

  • Hosted embedding models: we use litellm to send the request and keep only the embedding fields from the response.
  • Custom embedding models: users can pass a custom callable to dspy.Embedding, and the output is simply the callable's output.

Added unit tests covering both scenarios.
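
For concreteness, a minimal usage sketch of the two paths above (the custom callable and its output shape are illustrative only and not part of this PR):

import numpy as np
import dspy

# Hosted model: the request goes through litellm and only the embedding
# fields of the response are kept.
embedder = dspy.Embedding("openai/text-embedding-ada-002")
vectors = embedder(["hello world", "goodbye world"])

# Custom model: pass any callable; the output is whatever the callable returns.
# toy_embedder is a hypothetical stand-in.
def toy_embedder(texts):
    return np.random.rand(len(texts), 8)

custom = dspy.Embedding(toy_embedder)
custom_vectors = custom(["hello world", "goodbye world"])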

Confirmed that the cache works:

(dspy) *[dspy-embedding][~/Documents/mlflow_team/dspy]$ python3
Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 10:07:17) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dspy;import time;embedder = dspy.Embedding("openai/text-embedding-ada-002")
>>> start=time.time();geez = embedder(["oh my god", "holy geese and elephant", "ohhhh hoover dam"]);print("TOTAL TIME: ", time.time()-start)
TOTAL TIME:  0.963151216506958
>>> start=time.time();geez = embedder(["oh my god", "holy geese and elephant", "ohhhh hoover dam"]);print("TOTAL TIME: ", time.time()-start)
TOTAL TIME:  0.009987115859985352
>>> start=time.time();geez = embedder(["oh my god", "holy geese and elephant", "ohhhh hoover dam"]);print("TOTAL TIME: ", time.time()-start)
TOTAL TIME:  0.010275125503540039

@chenmoneygithub force-pushed the dspy-embedding branch 3 times, most recently from d33a487 to 7c51351, on November 1, 2024 04:09
@chenmoneygithub requested a review from @okhat on November 1, 2024 04:09
litellm.telemetry = False

if "LITELLM_LOCAL_MODEL_COST_MAP" not in os.environ:

Review comment from a collaborator on the lines above:
I think this needs to be done before LiteLLM is imported anywhere in DSPy, for it to have an effect?

Reply from @chenmoneygithub (Collaborator, author):

I searched their code, and this env var is read at runtime: https://github.com/BerriAI/litellm/blob/5652c375b3e22bab6704e93058c868620c72d6ee/litellm/__init__.py#L309, so our current order should be okay.
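
A small illustration of that point (hedged; it assumes litellm keeps reading the variable at call time, per the link above):

import litellm  # imported before the variable is set
import os

# Because litellm reads LITELLM_LOCAL_MODEL_COST_MAP at runtime rather than at
# import time, setting it after the import still takes effect.
if "LITELLM_LOCAL_MODEL_COST_MAP" not in os.environ:
    os.environ["LITELLM_LOCAL_MODEL_COST_MAP"] = "True"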

kwargs: Additional keyword arguments to pass to the embedding model.

Returns:
A list of embeddings, one for each input, in the same order as the inputs. Or the output of the custom

@okhat (Collaborator) commented on the lines above, Nov 1, 2024:
Can we ensure the output of this is a numpy tensor or something? Both for litellm and for callables.

Reply from @chenmoneygithub (Collaborator, author):
sounds good!
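
A rough sketch of the normalization being agreed on here (illustrative only, not the merged implementation): both the litellm path and the custom-callable path get coerced to a numpy array.

import numpy as np
import litellm

class Embedding:
    def __init__(self, model):
        self.model = model

    def __call__(self, inputs, **kwargs):
        if isinstance(self.model, str):
            # Hosted model: send the request via litellm, keep only the embeddings.
            response = litellm.embedding(model=self.model, input=inputs, **kwargs)
            embeddings = [data["embedding"] for data in response.data]
        elif callable(self.model):
            # Custom model: take whatever the callable returns.
            embeddings = self.model(inputs, **kwargs)
        else:
            raise ValueError("model must be a litellm model name or a callable")
        # Coerce to a numpy array in both cases, as suggested above.
        return np.array(embeddings, dtype=np.float32)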

@okhat (Collaborator) commented Nov 1, 2024

An ideal version of this PR would involve improving the docs at this page:

https://dspy-docs.vercel.app/quick-start/getting-started-02/ (see the second cell, or see below)

import torch
import functools
import ujson  # needed for ujson.loads below; missing from the original snippet
from litellm import embedding as Embed

with open("test_collection.jsonl") as f:
    corpus = [ujson.loads(line) for line in f]

index = torch.load('index.pt', weights_only=True)
max_characters = 4000 # >98th percentile of document lengths

@functools.lru_cache(maxsize=None)
def search(query, k=5):
    query_embedding = torch.tensor(Embed(input=query, model="text-embedding-3-small").data[0]['embedding'])
    topk_scores, topk_indices = torch.matmul(index, query_embedding).topk(k)
    topK = [dict(score=score.item(), **corpus[idx]) for idx, score in zip(topk_indices, topk_scores)]
    return [doc['text'][:max_characters] for doc in topK]

I'd love to get the same functionality but without that complexity...
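
For comparison, a hedged sketch of what that cell could look like with dspy.Embedding handling the embedding call and caching (hypothetical, not part of this PR or the current docs):

import ujson
import torch
import dspy

with open("test_collection.jsonl") as f:
    corpus = [ujson.loads(line) for line in f]

index = torch.load('index.pt', weights_only=True)
max_characters = 4000  # >98th percentile of document lengths

embedder = dspy.Embedding("openai/text-embedding-3-small")

def search(query, k=5):
    # dspy.Embedding takes a list of inputs and caches repeated requests,
    # so functools.lru_cache and the manual litellm call are no longer needed.
    query_embedding = torch.tensor(embedder([query])[0])
    topk_scores, topk_indices = torch.matmul(index, query_embedding).topk(k)
    topK = [dict(score=score.item(), **corpus[idx]) for idx, score in zip(topk_indices, topk_scores)]
    return [doc['text'][:max_characters] for doc in topK]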

@okhat merged commit 7e78199 into stanfordnlp:main on Nov 3, 2024
4 checks passed
@chenmoneygithub deleted the dspy-embedding branch on December 27, 2024 22:03