Benchmark Data | arXiv | Evaluation Framework
We provide two evaluation pathways depending on whether you prefer lightweight local testing or full benchmark evaluation via our unified framework.
If you would like to test the benchmark using local code:
- Please refer to the `StartKit/` directory.
- We provide two end-to-end pipelines (see the inference sketch after this list):
  - `pipeline-hf.ipynb` for HuggingFace-based inference
  - `pipeline-vllm.ipynb` for vLLM-based inference
- After inference, use the three task-specific evaluation scripts in the same directory to evaluate performance on FinSM, FinRE, and FinMR, respectively.
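For orientation, here is a minimal sketch of what the HuggingFace-based pathway looks like. The model name, prompt, and generation settings below are placeholders, not the benchmark's official configuration, which lives in `pipeline-hf.ipynb`.

```python
# Minimal sketch of HuggingFace-based inference (placeholder model and prompt;
# the official pipeline is pipeline-hf.ipynb in StartKit/).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Placeholder prompt; the real prompts are constructed from the benchmark data.
prompt = "Given the following XBRL facts, list any mismatched US-GAAP tags:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```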
If you would like to evaluate models using our unified evaluation framework:
- Please run the three FinAuditing tasks under `FinBen/tasks/FinAuditing/`.
- The corresponding execution script is `run_finaudit.sh`.
- Once you have the results, you can use the following evaluation scripts to score them (see the scoring sketch below):
  - `evaluate-FinSM-example.ipynb`
  - `evaluate-FinRE-example.ipynb`
  - `evaluate-FinMR-example.ipynb`
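The official metrics are defined in those notebooks. As a rough illustration only, the sketch below scores FinSM-style outputs (predicted vs. gold sets of mismatched US-GAAP tags) with set-level precision, recall, and F1; the tag values are invented for the example.

```python
# Illustrative set-level scoring for FinSM-style outputs. This is not the
# official metric implementation (see evaluate-FinSM-example.ipynb).
def set_prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    tp = len(predicted & gold)                      # correctly retrieved tags
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with made-up tags.
p, r, f1 = set_prf({"us-gaap:Revenues", "us-gaap:Assets"}, {"us-gaap:Revenues"})
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```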
| Dataset | Description |
|---|---|
| FinSM | Evaluation set for the FinSM subtask within the FinAuditing benchmark. This task follows the information retrieval paradigm: given a query describing a financial term (either currency or concentration of credit risk), an XBRL filing, and a US-GAAP taxonomy, the output is the set of mismatched US-GAAP tags after retrieval. |
| FinRE | Evaluation set for the FinRE subtask within the FinAuditing benchmark. This is a relation extraction task: given two specific elements, the output is the relation between them. |
| FinMR | Evaluation set for the FinMR subtask within the FinAuditing benchmark. This is a mathematical reasoning task over two given questions. |
| FinSM_Sub | FinSM subset for ICAIF 2026. |
| FinRE_Sub | FinRE subset for ICAIF 2026. |
| FinMR_Sub | FinMR subset for ICAIF 2026. |
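The evaluation sets are available via the Benchmark Data link above and can be loaded with the Hugging Face `datasets` library. In the sketch below, the repository id and split name are assumptions; substitute the identifiers published on the dataset page.

```python
# Illustrative loading of one FinAuditing evaluation set with `datasets`.
# The repository id and split below are hypothetical.
from datasets import load_dataset

finsm = load_dataset("TheFinAI/FinAuditing-FinSM", split="test")  # hypothetical repo id
print(finsm)      # dataset schema and size
print(finsm[0])   # first example
```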
If you find our benchmark useful, please cite:
@misc{wang2025finauditingfinancialtaxonomystructuredmultidocument,
title={FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs},
author={Yan Wang and Keyi Wang and Shanshan Yang and Jaisal Patel and Jeff Zhao and Fengran Mo and Xueqing Peng and Lingfei Qian and Jimin Huang and Guojun Xiong and Xiao-Yang Liu and Jian-Yun Nie},
year={2025},
eprint={2510.08886},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.08886},
}