Add multimodal embedding & rerank support #66

Draft

roj234 wants to merge 1 commit into JamePeng:main from roj234:vl-embedding

Conversation

roj234 commented Feb 21, 2026

It works, but it duplicates functionality, since llama_chat_format already implements multimodal support. However, that path does not support embedding models like Qwen-VL-Embedding.
This code heavily references llama-server's C++ implementation (ServerTokens).

JamePeng (Owner) commented Feb 21, 2026

It's best to create a multimodal Embedding class in llama_embedding.py, or enhance the existing Embedding class, so that it manages mctx. There's no need to add unnecessary memory usage to llama. Remember to release the memory after the new mctx is used.
If possible, please provide example and test code to illustrate its usage.
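To illustrate the release-after-use requirement: a minimal sketch (all names here are hypothetical, not the actual llama-cpp-python API) of an Embedding class that owns an mtmd context and guarantees it is freed, mirroring the existing context_stack / __del__ pattern. The freeing function is injected so the lifecycle logic can be shown without loading a real model:

```python
import contextlib

class MultimodalEmbedding:
    """Hypothetical sketch: owns an mctx and releases it exactly once."""

    def __init__(self, free_mctx):
        # Like llama_cpp's context_stack: cleanup callbacks are collected
        # in an ExitStack and run (once) when the stack is closed.
        self._exit_stack = contextlib.ExitStack()
        self.mctx = object()  # stand-in for the real mtmd context handle
        self._exit_stack.callback(free_mctx, self.mctx)

    def close(self):
        # Runs every registered callback; idempotent on repeat calls.
        self._exit_stack.close()

    def __del__(self):
        # Safety net: release mctx even if close() was never called.
        self.close()

freed = []
emb = MultimodalEmbedding(free_mctx=lambda ctx: freed.append(ctx))
emb.close()
emb.close()  # second close is a no-op: ExitStack runs callbacks only once
print(len(freed))  # → 1
```

The ExitStack approach keeps the release code in one place even if more resources (e.g. bitmap buffers) are later registered alongside mctx.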

roj234 (Author) commented Feb 21, 2026

Actually, I am enhancing the existing Embedding class; however, I can move the mctx management into llama_embedding.py.
Regarding memory, I followed your context_stack and __del__ pattern to free it.
I also found that llama_chat_format contains the logic for multimodal processing, but it is tightly coupled with inference execution and doesn't expose a way to get the processed tokens.

By the way, here is my usage:

```python
from typing import Any, Dict, List

class RAGModel:
    def __init__(self):
        self._model = LlamaEmbedding(
            # ...
            mmproj_path=...,
            image_min_tokens=...,
            image_max_tokens=...,
        )

    def _tmpl(self, contents: List[Dict[str, Any]], instruct: str):
        files = []
        image_id = 0
        # Should not manually concat the chat template here...
        tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
        for item in contents:
            item_type = item["type"]
            if item_type == "text":
                tmpl += item["text"]
            elif item_type == "image":
                image_id += 1
                files.append(item["image"])
                tmpl += f"Picture {image_id}: <__media__>"  # <__media__> is the placeholder used by mtmd
        return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

    def embed_document(self, contents: List[Dict[str, Any]], instruction: str = "Represent the user's input.", return_count: bool = False) -> List[float]:
        text, files = self._tmpl(contents, instruction)
        return self._model.embed_multimodal(text, files, return_count=return_count)
```

An input document looks like:

```python
doc = [{"type": "text", "text": f"Name: {filepath.name}"},
       {"type": "image", "image": image_data}]
```
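The template construction can be sanity-checked without loading any model. Below is a standalone sketch of the same prompt-building logic (build_prompt is a hypothetical free function, not part of any existing API), showing that each image contributes one mtmd `<__media__>` placeholder and one collected payload:

```python
from typing import Any, Dict, List, Tuple

def build_prompt(contents: List[Dict[str, Any]], instruct: str) -> Tuple[str, list]:
    """Build a Qwen-style chat prompt; images become <__media__> markers."""
    files = []
    image_id = 0
    tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
    for item in contents:
        if item["type"] == "text":
            tmpl += item["text"]
        elif item["type"] == "image":
            image_id += 1
            files.append(item["image"])
            tmpl += f"Picture {image_id}: <__media__>"  # mtmd media placeholder
    return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

text, files = build_prompt(
    [{"type": "text", "text": "Name: cat.png"},
     {"type": "image", "image": b"<raw image bytes>"}],
    "Represent the user's input.",
)
print(text.count("<__media__>"), len(files))  # → 1 1
```

Because the placeholder count must match the number of media items handed to mtmd, a check like `text.count("<__media__>") == len(files)` is a cheap invariant to assert before tokenizing.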

JamePeng (Owner) commented

Currently, there is indeed a lack of a multimodal class, analogous to llama or the sampler, to abstract the mtmd_cpp API. The heavyweight and complex llama_chat_format implementations based on llava-1.5 are indeed difficult to manage.
