Add multimodal embedding & rerank support #66

Draft

roj234 wants to merge 1 commit into JamePeng:main from roj234:vl-embedding

Conversation

roj234 commented Feb 21, 2026

It works, but it duplicates functionality, since llama_chat_format already implements multimodal support. However, that path does not support embedding models like Qwen-VL-Embedding.
This code heavily references llama-server's C++ implementation (ServerTokens).

JamePeng (Owner) commented Feb 21, 2026

It's best to create a multimodal Embedding class in llama_embedding.py, or enhance the existing Embedding class, so that it manages mctx. There's no need to add unnecessary memory usage to llama. Remember to release the memory after the new mctx is used.
If possible, please provide example and test code to illustrate its usage.
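To illustrate the release-after-use requirement: a minimal sketch (all names here are hypothetical, not the actual llama-cpp-python API) of an Embedding class that owns an mtmd context and guarantees it is freed, mirroring the existing context_stack / __del__ pattern. The freeing function is injected so the lifecycle logic can be shown without loading a real model:

```python
import contextlib

class MultimodalEmbedding:
    """Hypothetical sketch: owns an mctx and releases it exactly once."""

    def __init__(self, free_mctx):
        # Like llama_cpp's context_stack: cleanup callbacks are collected
        # in an ExitStack and run (once) when the stack is closed.
        self._exit_stack = contextlib.ExitStack()
        self.mctx = object()  # stand-in for the real mtmd context handle
        self._exit_stack.callback(free_mctx, self.mctx)

    def close(self):
        # Runs every registered callback; idempotent on repeat calls.
        self._exit_stack.close()

    def __del__(self):
        # Safety net: release mctx even if close() was never called.
        self.close()

freed = []
emb = MultimodalEmbedding(free_mctx=lambda ctx: freed.append(ctx))
emb.close()
emb.close()  # second close is a no-op: ExitStack runs callbacks only once
print(len(freed))  # → 1
```

The ExitStack approach keeps the release code in one place even if more resources (e.g. bitmap buffers) are later registered alongside mctx.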

roj234 (Author) commented Feb 21, 2026

Actually, I am enhancing the existing Embedding class; however, I can move the mctx management into llama_embedding.py.
Regarding memory, I followed your context_stack and __del__ pattern to free it.
I also found that llama_chat_format contains the logic for multimodal processing, but it is tightly coupled with inference execution and doesn't expose a way to get the processed tokens.

By the way, here is my usage:

```python
from typing import Any, Dict, List

class RAGModel:
    def __init__(self):
        self._model = LlamaEmbedding(
            # ...
            mmproj_path=...,
            image_min_tokens=...,
            image_max_tokens=...,
        )

    def _tmpl(self, contents: List[Dict[str, Any]], instruct: str):
        files = []
        image_id = 0
        # Should not manually concat the chat template here...
        tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
        for item in contents:
            item_type = item["type"]
            if item_type == "text":
                tmpl += item["text"]
            elif item_type == "image":
                image_id += 1
                files.append(item["image"])
                tmpl += f"Picture {image_id}: <__media__>"  # <__media__> is the placeholder used by mtmd
        return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

    def embed_document(self, contents: List[Dict[str, Any]], instruction: str = "Represent the user's input.", return_count: bool = False) -> List[float]:
        text, files = self._tmpl(contents, instruction)
        return self._model.embed_multimodal(text, files, return_count=return_count)
```

An input document looks like:

```python
doc = [{"type": "text", "text": f"Name: {filepath.name}"},
       {"type": "image", "image": image_data}]
```
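The template construction can be sanity-checked without loading any model. Below is a standalone sketch of the same prompt-building logic (build_prompt is a hypothetical free function, not part of any existing API), showing that each image contributes one mtmd `<__media__>` placeholder and one collected payload:

```python
from typing import Any, Dict, List, Tuple

def build_prompt(contents: List[Dict[str, Any]], instruct: str) -> Tuple[str, list]:
    """Build a Qwen-style chat prompt; images become <__media__> markers."""
    files = []
    image_id = 0
    tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
    for item in contents:
        if item["type"] == "text":
            tmpl += item["text"]
        elif item["type"] == "image":
            image_id += 1
            files.append(item["image"])
            tmpl += f"Picture {image_id}: <__media__>"  # mtmd media placeholder
    return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

text, files = build_prompt(
    [{"type": "text", "text": "Name: cat.png"},
     {"type": "image", "image": b"<raw image bytes>"}],
    "Represent the user's input.",
)
print(text.count("<__media__>"), len(files))  # → 1 1
```

Because the placeholder count must match the number of media items handed to mtmd, a check like `text.count("<__media__>") == len(files)` is a cheap invariant to assert before tokenizing.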

JamePeng (Owner) commented

Currently, there is indeed a lack of a multimodal class, analogous to llama or the sampler, to abstract the mtmd_cpp API. The heavyweight and complex llama_chat_format implementations based on llava-1.5 are indeed difficult to manage.
