
The-FinAI/FinAuditing


✨ FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs ✨

πŸ“ Benchmark Data | πŸ“– Arxiv | πŸ› οΈ Evaluation Framework


📌 How to Use This Benchmark

We provide two evaluation pathways depending on whether you prefer lightweight local testing or full benchmark evaluation via our unified framework.

1. Local Testing via Starter Kit (Recommended for Quick Experiments)

If you would like to test the benchmark using local code:

  • Please refer to the StartKit/ directory.
  • We provide two end-to-end pipelines:
    • pipeline-hf.ipynb for HuggingFace-based inference
    • pipeline-vllm.ipynb for vLLM-based inference
  • After inference, use the three task-specific evaluation scripts in the same directory to evaluate performance on FinSM, FinRE, and FinMR, respectively.

2. Full Evaluation via FinBen Framework (Recommended for Benchmarking)

If you would like to evaluate models using our unified evaluation framework:

  • Please run the three FinAuditing tasks under
    FinBen/tasks/FinAuditing/
  • The corresponding execution script is:
    • run_finaudit.sh
  • Once you have the results, score them with the following evaluation notebooks:
    • evaluate-FinSM-example.ipynb
    • evaluate-FinRE-example.ipynb
    • evaluate-FinMR-example.ipynb

📚 Datasets Released

| 📂 Dataset | Description |
| --- | --- |
| FinSM | Evaluation set for the FinSM subtask of the FinAuditing benchmark. This task follows the information retrieval paradigm: given a query describing a financial term that represents either currency or concentration of credit risk, an XBRL filing, and a US-GAAP taxonomy, the output is the set of mismatched US-GAAP tags after retrieval. |
| FinRE | Evaluation set for the FinRE subtask of the FinAuditing benchmark. This is a relation extraction task: given two specific elements $e_1$ and $e_2$, an XBRL filing, and a US-GAAP taxonomy, the goal is to classify which of three relation error types applies. |
| FinMR | Evaluation set for the FinMR subtask of the FinAuditing benchmark. This is a mathematical reasoning task. The input is two questions $q_1$ and $q_2$, an XBRL filing, and a US-GAAP taxonomy, where $q_1$ concerns the extraction of a reported value and $q_2$ the calculation of the corresponding real value. The task is to extract the reported value for a given instance in the XBRL filing and to compute the numeric value for that instance, which is then used to verify whether the reported value is correct. |
| FinSM_Sub | FinSM subset for ICAIF 2026. |
| FinRE_Sub | FinRE subset for ICAIF 2026. |
| FinMR_Sub | FinMR subset for ICAIF 2026. |
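
To make the three output types above concrete, here is a minimal Python sketch of how each could be scored. It is not the official scoring code: the authoritative metrics are whatever the task-specific evaluation scripts compute. The function names, the example tags and labels, and the tolerance `tol` are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def set_retrieval_scores(predicted: Iterable[str], gold: Iterable[str]) -> dict:
    """FinSM-style: precision, recall, and F1 between two sets of tags."""
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    # hits > 0 guarantees precision + recall > 0, so the division is safe.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """FinRE-style: fraction of predicted labels that exactly match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def reported_value_is_correct(reported: float, computed: float, tol: float = 0.5) -> bool:
    """FinMR-style: does the reported value agree with the recomputed value?

    `tol` is an assumed absolute tolerance for rounding, not a benchmark setting.
    """
    return abs(reported - computed) <= tol


# Hypothetical tags and labels, used only to exercise the functions.
scores = set_retrieval_scores(
    predicted=["us-gaap:Cash", "us-gaap:Revenues"],
    gold=["us-gaap:Cash", "us-gaap:Assets"],
)
label_accuracy = accuracy(["A", "B", "C", "A"], ["A", "B", "A", "A"])
value_ok = reported_value_is_correct(reported=1200.0, computed=1200.0)
```

Set-based scores suit FinSM because its output is an unordered set of tags, whereas FinRE produces one label per example and FinMR reduces to comparing two numbers.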

If you find our benchmark useful, please cite:

    @misc{wang2025finauditingfinancialtaxonomystructuredmultidocument,
          title={FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs}, 
          author={Yan Wang and Keyi Wang and Shanshan Yang and Jaisal Patel and Jeff Zhao and Fengran Mo and Xueqing Peng and Lingfei Qian and Jimin Huang and Guojun Xiong and Xiao-Yang Liu and Jian-Yun Nie},
          year={2025},
          eprint={2510.08886},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2510.08886}, 
    }
