
Allow custom model names for Dense, Sparse embeddings, fix Sparse Embeddings#146

Open
tomaarsen wants to merge 2 commits into alibaba:main from tomaarsen:feat/expand_st_model_support

Conversation

@tomaarsen

Hello!

Pull Request overview

  • Allow a custom model_name in DefaultLocalSparseEmbedding and DefaultLocalDenseEmbedding
  • Simplify model loading: a single _get_model shared across DenseEmbedding, SparseEmbedding, and ReRanker via a _get_model_class property
  • Fix SparseEncoder model loading: it previously used SentenceTransformer. Also, encode_query & encode_document exist for both SparseEncoder and SentenceTransformer.
  • Removed _manual_sparse_encode for older Sentence Transformers versions: I would recommend requiring at least Sentence Transformers v5.0 so the SparseEncoder can be used.
  • Replaced SentenceTransformerReRanker with DefaultLocalReRanker in the docs, as the former no longer exists.
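The shared-loading idea above can be sketched as follows. This is an illustrative, self-contained sketch, not the actual zvec implementation: DenseModel and SparseModel are hypothetical stand-ins for SentenceTransformer and SparseEncoder so that no model download is needed, and only _get_model_class and _get_model mirror names mentioned in this PR.

```python
# Hypothetical stand-ins for sentence_transformers.SentenceTransformer / SparseEncoder.
class DenseModel:
    def __init__(self, name, device="cpu"):
        self.name, self.device = name, device

class SparseModel:
    def __init__(self, name, device="cpu"):
        self.name, self.device = name, device

class _BaseLocalFunction:
    """Sketch of a shared base: one _get_model, subclasses pick the class."""

    def __init__(self, model_name, device="cpu"):
        self._model_name = model_name
        self._device = device
        self._model = None

    @property
    def _get_model_class(self):
        raise NotImplementedError

    def _get_model(self):
        # Shared loading logic: subclasses only declare *which* class to load.
        if self._model is None:
            self._model = self._get_model_class(self._model_name, device=self._device)
        return self._model

class DefaultLocalDenseEmbedding(_BaseLocalFunction):
    @property
    def _get_model_class(self):
        return DenseModel   # would be SentenceTransformer in practice

class DefaultLocalSparseEmbedding(_BaseLocalFunction):
    @property
    def _get_model_class(self):
        return SparseModel  # would be SparseEncoder in practice
```

The point of the property is that the load call (device, trust_remote_code, caching) lives in exactly one place, so fixing the SparseEncoder bug is just a one-line change in the subclass.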

Details:

This PR extends the support for local dense, sparse, and reranker models a good bit. I've had some installation issues, so I'm unable to properly run the tests at the moment, but you should get much clearer results from the SparseEmbedding class. I also simplified the sparse embedding post-processing using model.decode, which also sidesteps some issues with SparseCUDA and SparseCPU not implementing all operations.

Please do let me know if you have any questions.

  • Tom Aarsen

@CLAassistant

CLAassistant commented Feb 17, 2026

CLA assistant check
All committers have signed the CLA.

Collaborator

@Cuiyus Cuiyus left a comment


I think the _get_model_class function is a good design. It could be renamed to _loading_model to make the meaning clearer.

Furthermore, I believe this design addresses the issues you mentioned: fixing SparseEncoder model loading and removing _manual_sparse_encode.

Thank you again for your submission! If you have any questions, feel free to keep the discussion going. Additionally, if you have time, please continue to drive this PR and add the corresponding tests. (Of course, I can also take over and fix the issues raised in this PR.)

@tomaarsen
Author

Thank you for the review! I'd be glad to help move this PR forward together. I fixed one of your concerns, and I had a question for the other one. You're also free to take over the PR if you prefer (I know I sometimes find it easier 😄).

  • Tom Aarsen

@Cuiyus
Collaborator

Cuiyus commented Feb 27, 2026

@greptile

@greptile-apps

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR refactors model loading across Dense, Sparse, and ReRanker classes by introducing a _get_model_class property pattern. It simplifies the sparse embedding implementation by removing fallback logic for older Sentence Transformers versions and leveraging model.decode() for cleaner post-processing.

Key changes:

  • Added support for custom model names in DefaultLocalDenseEmbedding and DefaultLocalSparseEmbedding
  • Unified model loading with _get_model_class property across all three classes
  • Fixed SparseEncoder model loading (was incorrectly using SentenceTransformer)
  • Removed _manual_sparse_encode fallback method
  • Updated documentation references from SentenceTransformerReRanker to DefaultLocalReRanker
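Why the SparseEncoder fix matters can be illustrated with a small dispatch sketch: both model classes expose encode_query and encode_document, so the embedding wrapper can call them uniformly. The two stub classes below are hypothetical stand-ins (not the real model classes), kept trivial so the example is self-contained.

```python
class DenseStub:
    """Stand-in for a SentenceTransformer-like model returning dense vectors."""
    def encode_query(self, text):
        return [float(len(text)), 0.0]
    def encode_document(self, text):
        return [0.0, float(len(text))]

class SparseStub:
    """Stand-in for a SparseEncoder-like model returning {token_id: weight}."""
    def encode_query(self, text):
        return {0: float(len(text))}
    def encode_document(self, text):
        return {1: float(len(text))}

def embed(model, text, *, is_query):
    # No isinstance checks needed: the shared method names do the work.
    return model.encode_query(text) if is_query else model.encode_document(text)
```

Loading the wrong class (SentenceTransformer where a SparseEncoder is expected) breaks this duck typing silently, which is why the loading fix and the unified _get_model_class go together.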

Issues found:

  • Potential runtime error in the sparse embedding if decoded is empty (lines 754-757)
  • Minor formatting issue with an extra blank line (line 134)
  • Author noted installation issues prevented running tests

Confidence Score: 3/5

  • This PR has one critical bug that needs fixing before merge, plus the author couldn't run tests
  • The refactoring is well-structured and simplifies the codebase, but the potential runtime error in sparse embedding (empty decoded results) could cause production failures. Additionally, the author's inability to run tests is concerning for changes of this magnitude.
  • Pay close attention to sentence_transformer_embedding_function.py - the sparse embedding logic needs the empty decoded case handled

Important Files Changed

  • python/zvec/extension/sentence_transformer_function.py: Simplifies model loading with the _get_model_class property pattern; minor formatting issue
  • python/zvec/extension/sentence_transformer_embedding_function.py: Adds custom model name support and simplifies sparse encoding, but has a potential bug with empty decoded results
  • python/zvec/extension/sentence_transformer_rerank_function.py: Updates documentation references from SentenceTransformerReRanker to DefaultLocalReRanker; adopts the _get_model_class property pattern

Last reviewed commit: 9abce9f


@greptile-apps greptile-apps bot left a comment


3 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +133 to 136
self._model = self._get_model_class(

self._model_name, device=self._device, trust_remote_code=True
)


Extra blank line after opening parenthesis

Suggested change
- self._model = self._get_model_class(
-
-     self._model_name, device=self._device, trust_remote_code=True
- )
+ self._model = self._get_model_class(
+     self._model_name, device=self._device, trust_remote_code=True
+ )

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +754 to +757
decoded = model.decode(sparse_matrix)[0]
token_strings, scores = zip(*decoded, strict=True)
token_ids = model.tokenizer.convert_tokens_to_ids(token_strings)
sparse_dict = dict(zip(token_ids, scores, strict=True))


If decoded is empty (no non-zero dimensions), unpacking will raise ValueError: not enough values to unpack

decoded = model.decode(sparse_matrix)[0]
if not decoded:
    return {}
token_strings, scores = zip(*decoded, strict=True)
token_ids = model.tokenizer.convert_tokens_to_ids(token_strings)
sparse_dict = dict(zip(token_ids, scores, strict=True))

@Cuiyus
Collaborator

Cuiyus commented Feb 27, 2026

> Thank you for the review! I'd be glad to help move this PR forward together. I fixed one of your concerns, and I had a question for the other one. You're also free to take over the PR if you prefer (I know I sometimes find it easier 😄).
>
>   • Tom Aarsen

Let's work together to get this PR merged. ❤️
I noticed that the ruff check in CI failed. Please fix it when you have time.
