✨ FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs ✨

📁 Benchmark Data | 📖 Arxiv | 🛠️ Evaluation Framework

📌 How to Use This Benchmark

We provide two evaluation pathways depending on whether you prefer lightweight local testing or full benchmark evaluation via our unified framework.

1. Local Testing via Starter Kit (Recommended for Quick Experiments)

If you would like to test the benchmark using local code:

Please refer to the StartKit/ directory.
We provide two end-to-end pipelines:
- pipeline-hf.ipynb for HuggingFace-based inference
- pipeline-vllm.ipynb for vLLM-based inference
After inference, use the three task-specific evaluation scripts in the same directory to evaluate performance on FinSM, FinRE, and FinMR, respectively.
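As a rough illustration of what scoring a retrieval-style output can look like, the sketch below compares a predicted set of tags against a gold set and reports precision, recall, and F1. This is only an assumption about one reasonable metric for FinSM-style outputs. The evaluation scripts in StartKit/ may compute something different, and the function name `set_prf` and the example tags are hypothetical.

```python
from typing import Iterable


def set_prf(predicted: Iterable[str], gold: Iterable[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) between two collections of tags.

    Hypothetical helper for illustration; not taken from the repository.
    Duplicates are ignored because both inputs are treated as sets.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative tags only; real instances come from the benchmark data.
predicted_tags = ["us-gaap:Cash", "us-gaap:Revenues"]
gold_tags = ["us-gaap:Revenues", "us-gaap:Assets"]

precision, recall, f1 = set_prf(predicted_tags, gold_tags)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Treating both sides as sets keeps the score independent of the order in which a model lists its tags.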

2. Full Evaluation via FinBen Framework (Recommended for Benchmarking)

If you would like to evaluate models using our unified evaluation framework:

Please run the three FinAuditing tasks under
FinBen/tasks/FinAuditing/
The corresponding execution script is:
- run_finaudit.sh
Once you have the results, use the following evaluation notebooks to score them:
- evaluate-FinSM-example.ipynb
- evaluate-FinRE-example.ipynb
- evaluate-FinMR-example.ipynb

📚 Datasets Released

📂 Dataset	📝 Description
FinSM	Evaluation set for the FinSM subtask of the FinAuditing benchmark. The task follows the information retrieval paradigm: given a query describing a financial term that represents either currency or concentration of credit risk, an XBRL filing, and a US-GAAP taxonomy, the output is the set of mismatched US-GAAP tags after retrieval.
FinRE	Evaluation set for the FinRE subtask of the FinAuditing benchmark. This is a relation extraction task: given two specific elements $e_1$ and $e_2$, an XBRL filing, and a US-GAAP taxonomy, the goal is to classify the relation error between the two elements into one of three error types.
FinMR	Evaluation set for the FinMR subtask of the FinAuditing benchmark. This is a mathematical reasoning task. The input consists of two questions $q_1$ and $q_2$, an XBRL filing, and a US-GAAP taxonomy, where $q_1$ concerns the extraction of a reported value and $q_2$ the calculation of the corresponding real value. The task is to extract the reported value for a given instance in the XBRL filing and to compute the numeric value for that instance, which is then used to verify whether the reported value is correct.
FinSM_Sub	FinSM subset for ICAIF 2026.
FinRE_Sub	FinRE subset for ICAIF 2026.
FinMR_Sub	FinMR subset for ICAIF 2026.
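The verification step described for FinMR (extract a reported value, compute the corresponding value, then compare the two) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's own implementation: the function name `verify_reported_value`, the tolerance, and the example numbers are hypothetical, and the computed value is assumed to be a weighted sum of component values, in the style of XBRL calculation relationships.

```python
import math


def verify_reported_value(
    reported: float,
    components: list[tuple[float, float]],
    rel_tol: float = 1e-9,
) -> tuple[float, bool]:
    """Recompute a total from (value, weight) pairs and compare it to the reported value.

    Hypothetical helper for illustration. Each weight is typically +1.0 or -1.0,
    depending on whether the component adds to or subtracts from the total.
    """
    computed = sum(value * weight for value, weight in components)
    return computed, math.isclose(reported, computed, rel_tol=rel_tol)


# Made-up numbers: a reported total and two components that should add up to it.
reported_total = 2100.0
components = [(1200.0, 1.0), (800.0, 1.0)]

computed, is_consistent = verify_reported_value(reported_total, components)
print(f"reported={reported_total} computed={computed} consistent={is_consistent}")
```

In this made-up case the components sum to 2000.0 while the filing reports 2100.0, so the reported value would be flagged as inconsistent.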

If you find our benchmark useful, please cite:

    @misc{wang2025finauditingfinancialtaxonomystructuredmultidocument,
          title={FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs}, 
          author={Yan Wang and Keyi Wang and Shanshan Yang and Jaisal Patel and Jeff Zhao and Fengran Mo and Xueqing Peng and Lingfei Qian and Jimin Huang and Guojun Xiong and Xiao-Yang Liu and Jian-Yun Nie},
          year={2025},
          eprint={2510.08886},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2510.08886}, 
    }
