
Conversation

hjenryin commented Jan 4, 2026

The current code base targets lm-eval<=0.4.7. lm-eval adopted better logging practices in 0.4.8 (EleutherAI/lm-evaluation-harness#2203), which makes newer releases incompatible with the current version of evalchemy. Only minimal changes are needed to support newer versions (and to adopt the better logging practices ourselves). With this patch I had no problem running lm-eval==0.4.9.2 and vllm==0.13.0, and the changes are backward-compatible.

lm-eval 0.4.8 is particularly helpful: it adds support for vllm 0.7+ (EleutherAI/lm-evaluation-harness#2706), with an easier local data-parallel setup. I've also included a section on how to run data-parallel evaluation with vllm for faster evaluation.
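For reference, here is a minimal sketch (not the exact evalchemy invocation) of how lm-eval >= 0.4.8 exposes vLLM data parallelism through `model_args`; the task, batch size, and parallelism values below are illustrative placeholders, and exact data-parallel behavior depends on the installed lm-eval/vllm versions:

```python
# Hedged sketch: data-parallel vLLM evaluation via lm-eval's Python API (>= 0.4.8).
# The task and parallelism settings below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=open-thoughts/OpenThinker-7B,"
        "data_parallel_size=8,"        # one full model replica per GPU; requests are split across them
        "gpu_memory_utilization=0.85,"
        "max_model_len=32768"
    ),
    tasks=["gsm8k"],                   # placeholder task; evalchemy registers its own suites
    batch_size="auto",
)
print(results["results"])
```

Compared with tensor parallelism, data parallelism keeps a full copy of the model on each GPU and splits the requests, which is usually faster for 7B-scale models that fit on a single device.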

This is similar to #124, but I think it's better to decouple the logging so that people can tell where each message is coming from.
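Concretely, the pattern lm-eval moved to in 0.4.8 is per-module loggers instead of a shared `eval_logger`; below is a minimal sketch of what the decoupled logging looks like (the function and its body are made up purely for illustration):

```python
# Hedged sketch of per-module logging, the pattern lm-eval adopted in 0.4.8.
# Only the logger setup matters; the function is illustrative.
import logging

logger = logging.getLogger(__name__)  # logger name identifies the originating module

def run_benchmark(task_name: str) -> None:
    logger.info("Starting evaluation for %s", task_name)
    # ... evaluation logic ...
    logger.info("Finished evaluation for %s", task_name)
```

With this, log records carry the module path, so messages from evalchemy and lm-eval are easy to tell apart.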

I also encourage the authors to set up versioning and publish this on PyPI; it would make this wonderful tool even easier for everyone to use!

neginraoof (Collaborator) left a comment

Thanks a lot, this is very helpful. Have you performed any parity experiments with the older and newer lm-eval and vllm versions? If so, can you share the results with us?

I'm also very interested to know if we can try data-parallel mode for parity experiments.

hjenryin (Author) commented Jan 4, 2026

> Thanks a lot, this is very helpful. Have you performed any parity experiments with the older and newer lm-eval and vllm versions? If so, can you share the results with us?
>
> I'm also very interested to know if we can try data-parallel mode for parity experiments.

I don't think I have the resources to run everything, but the table below should demonstrate that the performance is consistent. The official numbers come from https://huggingface.co/open-thoughts/OpenThinker-7B, where the authors report using evalchemy. That README was last updated 7 months ago, so I think it's safe to assume the numbers come from a previous version.

|  | AIME24 | MATH500 | GPQADiamond | LCBv2 | LCBv2 - easy | LCBv2 - medium | LCBv2 - hard |
|---|---|---|---|---|---|---|---|
| OpenThinker-7B - official | 31.3% | 83.0% | 42.4% | 39.9% | 75.3% | 28.6% | 6.5% |
| OpenThinker-7B - dp 8 | 30.00% | 85.00% | 42.42% | 40.31% | 78.57% | 27.67% | 4.88% |
| OpenThinker-7B - tp 4 (the model doesn't support tp 8) | 33.67% | 85.80% | 42.42% | 41.49% | 76.92% | 33.50% | 2.44% |
