
[BUG] Qwen VL models error when a text prompt is longer than max_seq_len 4096 #401

@iguy0

Description


OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Describe the bug

When prompting Qwen VL models with a long enough prompt (longer than the max_seq_len of 4096), the call fails with the following error:

Dec 06 23:53:16 ailab llama-swap[3174452]: models-local/qwen3-vl-32b-instruct-exl3. Skipping inline model load.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.673 INFO:     Received chat completion request 96b5accf70144d28907c306816d5513e
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:    Traceback (most recent call last):
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File "/home/user1/projects/tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 437, in generate_chat_completion
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        generations = await asyncio.gather(*gen_tasks)
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 692, in generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        async for generation in self.stream_generate(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 779, in stream_generate
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        async for generation_chunk in self.generate_gen(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:      File "/home/user1/projects/tabbyAPI/backends/exllamav3/model.py", line 968, in generate_gen
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:        raise ValueError(
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.712 ERROR:    ValueError: Prompt length 10083 is greater than max_seq_len 4096
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.715 ERROR:    Sent to request: Chat completion 96b5accf70144d28907c306816d5513e aborted. Maybe the model was unloaded? Please check the server console.
Dec 06 23:53:16 ailab llama-swap[3174452]: 2025-12-06 23:53:16.716 INFO:     192.168.10.45:0 - "POST /v1/chat/completions HTTP/1.1" 503
Dec 06 23:53:16 ailab llama-swap[3174452]: [WARN] metrics skipped, HTTP status=503, path=/v1/chat/completions

I believe max_seq_len may not be assigned correctly from the model configuration, so generation fails at this check:

if context_len > self.max_seq_len:
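For illustration only, here is a minimal sketch of the failure mode I suspect. The function name and fallback constant are hypothetical, not tabbyAPI code: if the loader only looks for max_position_embeddings at the top level of config.json and falls back to a hard-coded default, a Qwen3-VL checkpoint (whose text settings appear to be nested under "text_config") would end up capped at 4096.

import json
from pathlib import Path

DEFAULT_MAX_SEQ_LEN = 4096  # hypothetical fallback, mirrors the observed cap

def resolve_max_seq_len(model_dir: str) -> int:
    # Hypothetical resolver, not tabbyAPI's actual code.
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Qwen3-VL nests the text model's settings under "text_config";
    # a top-level-only lookup misses them and hits the fallback.
    text_config = config.get("text_config", config)
    return text_config.get("max_position_embeddings", DEFAULT_MAX_SEQ_LEN)

# A top-level-only lookup, i.e.
#     config.get("max_position_embeddings", DEFAULT_MAX_SEQ_LEN)
# would silently return 4096 for Qwen3-VL and trip the
# "Prompt length ... is greater than max_seq_len" check above.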

Please let me know if you need more information.

Reproduction steps

Download a version of turboderp/Qwen3-VL-32B-Instruct-exl3 and send a request to the endpoint with an image and a text prompt longer than the max_seq_len of 4096.
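A hedged repro sketch against the OpenAI-compatible /v1/chat/completions endpoint; the base URL, API key, and image path are placeholders, and the model name is taken from the log above:

import base64

from openai import OpenAI

# Placeholders: adjust base_url, api_key, and the image path for your setup.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="placeholder")

with open("sample.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Roughly 10k+ tokens of text, comfortably past the 4096 cap.
long_text = "Describe this image in exhaustive detail. " * 1500

response = client.chat.completions.create(
    model="qwen3-vl-32b-instruct-exl3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": long_text},
        ],
    }],
)
print(response.choices[0].message.content)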

Expected behavior

The API call should respect the model configuration from config.json.
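To compare against the 4096 the server enforces, this snippet prints what the downloaded checkpoint's config.json actually declares (the path is taken from the log above; the "text_config" nesting for Qwen3-VL is an assumption worth verifying against the actual file):

import json

with open("models-local/qwen3-vl-32b-instruct-exl3/config.json") as f:
    cfg = json.load(f)

text_cfg = cfg.get("text_config", cfg)
print("declared max_position_embeddings:", text_cfg.get("max_position_embeddings"))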

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.

Metadata

Assignees

No one assigned

Labels

bug: Something isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
