
Conversation

@Ali-Elganzory commented Jan 3, 2026

Summary

Fixes #152 by dynamically capping max_gen_toks so that it fits within the model's context window when using the vLLM backend.

Problem Solved

  • The vLLM backend crashes with ValueError: please provide at least one prompt when max_gen_toks exceeds max_model_len - prompt_length
  • This happens because vLLM truncates the prompt to make room for the generation tokens, which can leave the prompt empty (see the illustration below)
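
For illustration, here is the overflow condition with made-up numbers (hypothetical values, not taken from an actual run):

# Hypothetical numbers illustrating the overflow condition
max_model_len = 4096   # model's context window
prompt_length = 3900   # tokens in the templated prompt
max_gen_toks = 512     # requested generation budget

# 512 > 4096 - 3900, so the generation budget does not fit alongside the
# prompt; vLLM truncates the prompt to make room, and a request can end up
# with no prompt tokens at all, which triggers the ValueError above.
assert max_gen_toks > max_model_len - prompt_length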

Changes

File Modified: eval/task.py

Key Improvements

  1. Graceful handling: Caps max_gen_toks instead of crashing
  2. Safety buffer: Uses 16-token buffer to account for special tokens or tokenization edge cases
  3. Warning logs: Logs when capping occurs for debugging visibility

Code Changes

# Before
elif isinstance(model, lm_eval_models.vllm_causallms.VLLM):
    instance.args[1]["max_gen_toks"] = max_new_tokens

# After
elif isinstance(model, lm_eval_models.vllm_causallms.VLLM):
    # instance.args[0] holds the templated prompt string
    prompt = instance.args[0]
    prompt_length = len(model.tokenizer.encode(prompt))
    max_model_len = model.model.llm_engine.model_config.max_model_len
    
    # Calculate max allowed generation tokens (16 token safety buffer)
    max_allowed = max_model_len - prompt_length - 16
    capped_max_new_tokens = min(max_new_tokens, max(1, max_allowed))
    
    if capped_max_new_tokens < max_new_tokens:
        self.logger.warning(
            f"max_new_tokens ({max_new_tokens}) capped to {capped_max_new_tokens} "
            f"(prompt: {prompt_length} tokens, model max: {max_model_len})"
        )
    
    instance.args[1]["max_gen_toks"] = capped_max_new_tokens

Testing

# Test with vLLM - should now work instead of crashing
python -m eval.eval --model vllm --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"

# Verify same model works with hf backend (baseline)
python -m eval.eval --model hf --tasks MATH500 \
  --model_args "trust_remote_code=True,pretrained=ali-elganzory/1.7b-MixtureVitae-300BT-v1-DPO-Tulu3"

Impact

  • Prevents runtime crashes from context window overflow
  • Better error handling with informative warnings
  • No functional change for cases that already fit (the cap only triggers when needed)
  • Works across all benchmarks without per-benchmark fixes

@Ali-Elganzory (Author) commented

Hi @neginraoof, could you please review this PR when you have a chance?
I would appreciate your feedback. Thanks!

eval/task.py Outdated
  elif isinstance(model, lm_eval_models.vllm_causallms.VLLM):
-     instance.args[1]["max_gen_toks"] = max_new_tokens
+     # Get prompt from instance.args[0] (the templated string)
+     prompt = instance.args[0]
Collaborator

Thanks, can you wrap lines 57 to 64 in a try/except?
Also, maybe check whether prompt_length is extremely long (> max_model_len).

Author

Good point, I'll wrap it in a try/except.

If prompt_length > max_model_len, we could log a warning.
In that case, capped_max_new_tokens will be set to 1 and the prompt will be truncated to fit into the context window. Are we fine with that logic?
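
A minimal sketch of what the try/except version could look like, assuming the fallback on any failure is to keep the original max_new_tokens; attribute paths and names follow the code already shown in this PR, and the bare except Exception is an assumption rather than the final implementation:

elif isinstance(model, lm_eval_models.vllm_causallms.VLLM):
    try:
        prompt = instance.args[0]
        prompt_length = len(model.tokenizer.encode(prompt))
        max_model_len = model.model.llm_engine.model_config.max_model_len

        if prompt_length > max_model_len:
            self.logger.warning(
                f"Prompt ({prompt_length} tokens) already exceeds the model's "
                f"context window ({max_model_len}); vLLM will truncate it."
            )

        # 16-token safety buffer for special tokens / tokenization edge cases
        max_allowed = max_model_len - prompt_length - 16
        capped_max_new_tokens = min(max_new_tokens, max(1, max_allowed))

        if capped_max_new_tokens < max_new_tokens:
            self.logger.warning(
                f"max_new_tokens ({max_new_tokens}) capped to {capped_max_new_tokens} "
                f"(prompt: {prompt_length} tokens, model max: {max_model_len})"
            )
    except Exception:
        # If inspecting the tokenizer/engine fails, fall back to the
        # uncapped value rather than failing the whole evaluation run.
        capped_max_new_tokens = max_new_tokens

    instance.args[1]["max_gen_toks"] = capped_max_new_tokens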

eval/task.py Outdated
+     max_model_len = model.model.llm_engine.model_config.max_model_len
+
+     # Calculate max allowed generation tokens (16 token safety buffer)
+     max_allowed = max_model_len - prompt_length - 16
Collaborator

Can you create a named constant instead of using the literal 16 here?
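
A possible sketch of that suggestion; the constant name GEN_TOKEN_SAFETY_BUFFER is illustrative and not from the PR:

# Module-level constant replacing the literal 16
GEN_TOKEN_SAFETY_BUFFER = 16  # slack for special tokens / tokenization edge cases

# ...then, inside the vLLM branch:
max_allowed = max_model_len - prompt_length - GEN_TOKEN_SAFETY_BUFFER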

@neginraoof (Collaborator) commented

Thanks for creating the PR!
