VideoTool


This repository is the official implementation of Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task (NeurIPS 2025 main track).

News and Todo 🗓️

  • Release tools and test scripts

  • Release toolchain algorithm (STAR)

  • Release evaluating scripts

  • Clean requirements.txt

  • Add more tools, e.g., audio-to-text models

Introduction

In this work, we equip the MLLM with a comprehensive and extensible Video Toolkit to enhance its spatiotemporal reasoning capabilities while balancing the quantity and diversity of tools. To better control the tool invocation order and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench.
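The scheduling idea described above can be sketched as a two-stage loop: temporal tools first narrow the relevant time segment, then spatial tools narrow the key region, and a generalist tool answers from the localized evidence. This is an illustrative sketch only; the function and tool names below are hypothetical and do not reflect the actual implementation in this repository.

```python
# Purely illustrative sketch of the temporal-then-spatial scheduling idea.
# All names here (star_answer, temporal_tools, etc.) are hypothetical.

def star_answer(video, question, temporal_tools, spatial_tools, generalist):
    """Schedule temporal tools before spatial ones, progressively
    localizing the key evidence before answering."""
    # 1. Temporal stage: narrow down the relevant time segment.
    segment = video
    for tool in temporal_tools:
        segment = tool(segment, question)
    # 2. Spatial stage: narrow down the key area within that segment.
    region = segment
    for tool in spatial_tools:
        region = tool(region, question)
    # 3. Generalist stage: answer from the localized evidence.
    return generalist(region, question)
```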

Setup and Configuration 🛠️

  1. Clone the repository 📦:

    git clone git@github.com:fansunqi/VideoTool.git
    cd VideoTool
  2. Create a virtual environment 🧹 and install the dependencies 🧑‍🍳:

    conda create -n videotool python=3.9
    conda activate videotool
    pip install -r requirements.txt
  3. Set up your API key 🗝️ in config/*.yaml:

    openai:
      GPT_API_KEY: "put your openai api key here"
      PROXY: "put your openai base url here"
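A quick way to check the config is well-formed is to load it and read the two fields back. This is a minimal sketch assuming PyYAML and the exact key layout shown above; the repository's own config loader may differ.

```python
# Minimal config check, assuming the key layout shown above (PyYAML required).
import yaml

def load_openai_config(path):
    """Return (api_key, base_url) from a config/*.yaml file."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    api_key = cfg["openai"]["GPT_API_KEY"]
    base_url = cfg["openai"]["PROXY"]
    return api_key, base_url
```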
  4. Build related projects 🧩:

    mkdir projects
    cd projects
    • Download Grounded-Video-LLM for temporal grounding and temporal QA

      git clone git@github.com:WHB139426/Grounded-Video-LLM.git

      Download the checkpoint and specify the path in config/*.yaml.

    • Build LLaVA for image QA

      git clone git@github.com:fansunqi/LLaVA.git
      cd LLaVA
      pip install -e .
      cd ..

Tools

Thanks to the authors of the following open-source projects for their excellent work.

Temporal Tools:

Spatial Tools:

Generalist Tools:

Tools Testing

See tools/test_tools.sh

Usage

Run with a single video:

python run_single_video.py \
    --config config/star_single_video.yaml \
    --video_path /path/to/video.mp4 \
    --question "What is happening in the video?"

Run with a single video and multiple-choice options:

python run_single_video.py \
    --config config/star_single_video.yaml \
    --video_path /path/to/video.mp4 \
    --question "What is the person doing?" \
    --options "A. Running" "B. Swimming" "C. Cooking"
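The two invocations above differ only in the optional `--options` flag, which takes a variable number of answer choices. The actual parser lives in run_single_video.py; a minimal argparse sketch consistent with the flags shown is:

```python
# Hypothetical sketch of a CLI parser matching the flags shown above;
# the real parser in run_single_video.py may differ.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Run STAR on a single video")
    p.add_argument("--config", required=True, help="Path to a YAML config file")
    p.add_argument("--video_path", required=True, help="Path to the input video")
    p.add_argument("--question", required=True, help="Question about the video")
    p.add_argument("--options", nargs="*", default=None,
                   help="Optional multiple-choice answer options")
    return p
```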

Run test cases (found in the testcases directory):

bash run_testcases.sh

Download Datasets

  • NExT-QA:
    git clone git@github.com:doc-doc/NExT-QA.git
    
    Then specify your data path in config/nextqa.yaml.

Evaluation

Acknowledgments

We thank the developers of OctoTools and all developers of the open-source projects we used.

Citation

If you find our repo useful, please kindly consider citing:

@inproceedings{fan2025toolaugmented,
    title={Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task},
    author={Sunqi Fan and Jiashuo Cui and Meng-Hao Guo and Shuojin Yang},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
    url={https://openreview.net/forum?id=OFz4VDn0SO}
}
