
Conversation

hjenryin commented Jan 4, 2026

The current code base targets lm-eval<=0.4.7. lm-eval adopted better logging practices in 0.4.8 (EleutherAI/lm-evaluation-harness#2203), which makes newer releases incompatible with the current version of evalchemy. Only minimal changes are needed to support newer versions (and to adopt the better logging practices ourselves). With this patch I had no problem running lm-eval==0.4.9.2 and vllm==0.13.0, and the changes are backward-compatible.

lm-eval 0.4.8 is particularly helpful: it adds support for vllm 0.7+ (EleutherAI/lm-evaluation-harness#2706), with an easier local data-parallel setup. I've also included a section on how to run data-parallel evaluation with vllm for faster evaluation.
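For reference, here is a minimal sketch (not the exact evalchemy invocation) of how lm-eval >= 0.4.8 exposes vLLM data parallelism through `model_args`; the task, batch size, and parallelism values below are illustrative placeholders, and exact data-parallel behavior depends on the installed lm-eval/vllm versions:

```python
# Hedged sketch: data-parallel vLLM evaluation via lm-eval's Python API (>= 0.4.8).
# The task and parallelism settings below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=open-thoughts/OpenThinker-7B,"
        "data_parallel_size=8,"        # one full model replica per GPU; requests are split across them
        "gpu_memory_utilization=0.85,"
        "max_model_len=32768"
    ),
    tasks=["gsm8k"],                   # placeholder task; evalchemy registers its own suites
    batch_size="auto",
)
print(results["results"])
```

Compared with tensor parallelism, data parallelism keeps a full copy of the model on each GPU and splits the requests, which is usually faster for 7B-scale models that fit on a single device.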

This is similar to #124, but I think it's better to decouple the logging so that people can tell where each message is coming from.
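Concretely, the pattern lm-eval moved to in 0.4.8 is per-module loggers instead of a shared `eval_logger`; below is a minimal sketch of what the decoupled logging looks like (the function and its body are made up purely for illustration):

```python
# Hedged sketch of per-module logging, the pattern lm-eval adopted in 0.4.8.
# Only the logger setup matters; the function is illustrative.
import logging

logger = logging.getLogger(__name__)  # logger name identifies the originating module

def run_benchmark(task_name: str) -> None:
    logger.info("Starting evaluation for %s", task_name)
    # ... evaluation logic ...
    logger.info("Finished evaluation for %s", task_name)
```

With this, log records carry the module path, so messages from evalchemy and lm-eval are easy to tell apart.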

I also encourage the authors to set up versioning and publish this on PyPI; it would make this wonderful tool even easier for everyone to use!

neginraoof (Collaborator) left a comment

Thanks a lot, this is very helpful. Have you performed any parity experiments with the older and newer lm-eval and vllm versions? If so, can you share the results with us?

I'm also very interested to know if we can try data-parallel mode for parity experiments.

hjenryin (Author) commented Jan 4, 2026

> Thanks a lot, this is very helpful. Have you performed any parity experiments with the older and newer lm-eval and vllm versions? If so, can you share the results with us?
>
> I'm also very interested to know if we can try data-parallel mode for parity experiments.

I don't think I have the resources to run everything, but the table below should demonstrate that the performance is consistent. The official numbers come from https://huggingface.co/open-thoughts/OpenThinker-7B, where the authors report using evalchemy. That README was last updated 7 months ago, so I think it's safe to assume the numbers come from a previous version.

|  | AIME24 | MATH500 | GPQADiamond | LCBv2 | LCBv2 - easy | LCBv2 - medium | LCBv2 - hard |
|---|---|---|---|---|---|---|---|
| OpenThinker-7B - official | 31.3% | 83.0% | 42.4% | 39.9% | 75.3% | 28.6% | 6.5% |
| OpenThinker-7B - dp 8 | 30.00% | 85.00% | 42.42% | 40.31% | 78.57% | 27.67% | 4.88% |
| OpenThinker-7B - tp 4 (the model doesn't support tp 8) | 33.67% | 85.80% | 42.42% | 41.49% | 76.92% | 33.50% | 2.44% |
