Add multimodal embedding & rerank support #66
It's best to create a multimodal Embedding class in llama_embedding.py, or enhance the existing Embedding class, to manage mctx. There's no need to add unnecessary memory usage to llama. Remember to release memory after using the new mctx.
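One way to read that suggestion: the multimodal context (mctx) lives inside the embedding class and is released deterministically after each use. A minimal Python sketch of that lifecycle, where `_mtmd_init`, `_mtmd_free`, and `MultimodalEmbedding` are hypothetical stand-ins, not the real mtmd bindings:

```python
# Sketch only: the binding names below are hypothetical stand-ins.
from contextlib import contextmanager

def _mtmd_init(mmproj_path: str) -> dict:
    # Stand-in for the real mtmd context constructor.
    return {"mmproj": mmproj_path, "freed": False}

def _mtmd_free(mctx: dict) -> None:
    # Stand-in for the real mtmd context destructor.
    mctx["freed"] = True

@contextmanager
def multimodal_context(mmproj_path: str):
    # Guarantees the mctx is freed even if embedding raises.
    mctx = _mtmd_init(mmproj_path)
    try:
        yield mctx
    finally:
        _mtmd_free(mctx)  # release memory after using the new mctx

class MultimodalEmbedding:
    def __init__(self, mmproj_path: str):
        self.mmproj_path = mmproj_path

    def embed_multimodal(self, text: str, files: list) -> list:
        with multimodal_context(self.mmproj_path) as mctx:
            # ... tokenize text + files against mctx and run the model ...
            return []  # placeholder for the real embedding vector
```

The point of the context manager is that no mctx outlives the call that needed it, which keeps the extra memory out of the long-lived llama object.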
Actually I am, btw. Here is my usage:

```python
from typing import Any, Dict, List

doc = [
    {"type": "text", "text": f"Name: {filepath.name}"},
    {"type": "image", "image": image_data},
]

class RAGModel:
    def __init__(self):
        self._model = LlamaEmbedding(
            # ...
            mmproj_path=...,
            image_min_tokens=...,
            image_max_tokens=...,
        )

    def _tmpl(self, contents: List[Dict[str, Any]], instruct: str):
        files = []
        image_id = 0
        # Should not manually concat the chat template here...
        tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
        for item in contents:
            if item["type"] == "text":
                tmpl += item["text"]
            elif item["type"] == "image":
                image_id += 1
                files.append(item["image"])
                tmpl += f"Picture {image_id}: <__media__>"  # <__media__> is the media placeholder in mtmd
        return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

    def embed_document(
        self,
        contents: List[Dict[str, Any]],
        instruction: str = "Represent the user's input.",
        return_count: bool = False,
    ) -> List[float]:
        text, files = self._tmpl(contents, instruction)
        return self._model.embed_multimodal(text, files, return_count=return_count)
```
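For context on the `<__media__>` marker used above: mtmd splits the prompt on that placeholder and interleaves image chunks between the text segments. A rough plain-Python illustration of that splitting (not the actual mtmd implementation; `split_on_media` is a made-up helper):

```python
# Illustration only: mimics how a media-aware tokenizer interleaves
# text and image chunks around the <__media__> placeholder.
MEDIA_MARKER = "<__media__>"

def split_on_media(prompt: str, files: list) -> list:
    segments = prompt.split(MEDIA_MARKER)
    # One marker per image: N markers produce N + 1 text segments.
    assert len(segments) == len(files) + 1, "expected one marker per image"
    chunks = []
    for i, seg in enumerate(segments):
        if seg:  # skip empty text segments between adjacent markers
            chunks.append(("text", seg))
        if i < len(files):
            chunks.append(("image", files[i]))
    return chunks
```

So `split_on_media("Picture 1: <__media__> done", [img])` yields a text chunk, the image chunk, then the trailing text chunk, in prompt order.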
Currently, there is indeed a lack of a multimodal class, similar to llama or the sampler, to abstract the mtmd_cpp API. The heavyweight and complex llama_chat_format implementations based on LLaVA 1.5 are indeed difficult to manage.
It works, but it duplicates existing code: llama_chat_format already implements multimodal support, but that path does not support embedding models like Qwen-VL-Embedding.
This code heavily borrows from llama-server's C++ code (ServerTokens).