Allow custom model names for Dense, Sparse embeddings, fix Sparse Embeddings #146
tomaarsen wants to merge 2 commits into alibaba:main from
Conversation
Cuiyus left a comment:
I think the `_get_model_class` function is a good design. It could be renamed to `_loading_model` to make its meaning clearer.
Furthermore, I believe this design can address the issues you mentioned: fixing SparseEncoder model loading and removing `_manual_sparse_encode`.
Thank you again for your submission! If you have any questions, we can continue to discuss them. Additionally, if you have time, please continue to be responsible for the current PR and add the corresponding tests. (Of course, I can also take over and make adjustments and fixes for the issues raised in this PR.)
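For context, a minimal sketch of the shared loading design being discussed: each class exposes a `_get_model_class` property, and a common `_get_model` handles lazy instantiation. This is an illustration only, not the PR's actual code; `_StubSparseEncoder` is a placeholder for `sentence_transformers.SparseEncoder`.

```python
# Sketch of the shared loading design (assumed structure, not the PR's code).
# _StubSparseEncoder stands in for sentence_transformers.SparseEncoder.

class _BaseLocalModel:
    def __init__(self, model_name: str, device: str = "cpu"):
        self._model_name = model_name
        self._device = device
        self._model = None

    @property
    def _get_model_class(self):
        # Subclasses return the concrete model class to instantiate.
        raise NotImplementedError

    def _get_model(self):
        # Load the model lazily, only on first use.
        if self._model is None:
            self._model = self._get_model_class(
                self._model_name, device=self._device
            )
        return self._model


class _StubSparseEncoder:
    def __init__(self, model_name, device="cpu"):
        self.model_name = model_name
        self.device = device


class DefaultLocalSparseEmbedding(_BaseLocalModel):
    @property
    def _get_model_class(self):
        return _StubSparseEncoder
```

The property indirection means the dense, sparse, and reranker subclasses only differ in which class they return, while the loading logic lives in one place.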
python/zvec/extension/sentence_transformer_embedding_function.py
Thank you for the review! I'd be glad to help move this PR forward together. I fixed one of your concerns, and I had a question for the other one. You're also free to take over the PR if you prefer (I know I sometimes find it easier 😄).
@greptile
Greptile Summary

This PR refactors model loading across the Dense, Sparse, and ReRanker classes by introducing a shared `_get_model_class` property.

Key changes:
Issues found:
Confidence Score: 3/5
Important Files Changed
Last reviewed commit: 9abce9f
The commented code:

    self._model = self._get_model_class(

        self._model_name, device=self._device, trust_remote_code=True
    )

Extra blank line after the opening parenthesis. Suggested change:

    self._model = self._get_model_class(
        self._model_name, device=self._device, trust_remote_code=True
    )
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
The commented code:

    decoded = model.decode(sparse_matrix)[0]
    token_strings, scores = zip(*decoded, strict=True)
    token_ids = model.tokenizer.convert_tokens_to_ids(token_strings)
    sparse_dict = dict(zip(token_ids, scores, strict=True))

If `decoded` is empty (no non-zero dimensions), unpacking will raise `ValueError: not enough values to unpack`. Suggested change:

    decoded = model.decode(sparse_matrix)[0]
    if not decoded:
        return {}
    token_strings, scores = zip(*decoded, strict=True)
    token_ids = model.tokenizer.convert_tokens_to_ids(token_strings)
    sparse_dict = dict(zip(token_ids, scores, strict=True))
Let's work together to merge this PR. ❤️
Hello!
Pull Request overview
- Allow custom model names for `DefaultLocalSparseEmbedding` and `DefaultLocalDenseEmbedding`.
- Shared `_get_model` across DenseEmbedding, SparseEmbedding, and ReRanker via a `_get_model_class` property.
- `encode_query` & `encode_document` exist for both SparseEncoder and SentenceTransformer.
- Removed `_manual_sparse_encode` for older Sentence Transformers: I would recommend requiring at least Sentence Transformers v5.0 so the SparseEncoder can be used.
- Replaced `SentenceTransformerReRanker` with `DefaultLocalReRanker` in the docs, as the former doesn't exist anymore.

Details:
This PR extends the support for local dense, sparse, and reranker models a good bit. I've had some installation issues, so I'm unable to properly run the tests at the moment, but you should get much clearer results for the SparseEmbedding class. I also simplified the sparse embedding post-processing using `model.decode`, which also removes some issues with SparseCUDA and SparseCPU not implementing all operations.

Please do let me know if you have any questions.