This repository is the official implementation of *Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task* (NeurIPS 2025 main track).
TODO:
- Release tools and test scripts
- Release toolchain algorithm (STAR)
- Release evaluation scripts
- Clean up requirements.txt
- Add more tools, e.g., audio-to-text models
In this work, we equip the MLLM with a comprehensive and extensible Video Toolkit to enhance its spatiotemporal reasoning capabilities while ensuring harmony between the quantity and diversity of tools. To better control the tool-invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o with lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench.
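At a glance, the scheduling idea looks like this. The sketch below uses hypothetical names (`should_invoke`, `summarize`, the `state` dict); the actual interfaces live in this repo's source:

```python
# Minimal sketch of the STAR scheduling idea (hypothetical names, not the
# exact interfaces in this repo): temporal tools run first to narrow the
# video to key segments, then spatial tools localize the key area within
# the selected frames, and a summarizer produces the final answer.

def star_answer(video, question, temporal_tools, spatial_tools, llm):
    state = {"video": video, "question": question, "evidence": []}

    # Stage 1: temporal reasoning -- localize the relevant time span.
    for tool in temporal_tools:      # e.g., frame selector, temporal grounding
        if llm.should_invoke(tool, state):
            state["evidence"].append(tool.run(state))

    # Stage 2: spatial reasoning -- localize the key area in selected frames.
    for tool in spatial_tools:       # e.g., detection/tracking, patch zooming
        if llm.should_invoke(tool, state):
            state["evidence"].append(tool.run(state))

    # Finally, summarize all collected evidence into an answer.
    return llm.summarize(state["question"], state["evidence"])
```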
- Clone the repository 📦:

  ```bash
  git clone git@github.com:fansunqi/VideoTool.git
  cd ToolChainVideo
  ```
- Create a virtual environment 🧹 and install the dependencies 🧑‍🍳:

  ```bash
  conda create -n videotool python=3.9
  conda activate videotool
  pip install -r requirements.txt
  ```
- Set up your API key 🗝️ in `config/*.yaml` (a minimal loading sketch follows these setup steps):

  ```yaml
  openai:
    GPT_API_KEY: "put your openai api key here"
    PROXY: "put your openai base url here"
  ```
- Build related projects 🧩:

  ```bash
  mkdir projects
  cd projects
  ```
- Download Grounded-Video-LLM for temporal grounding and temporal QA:

  ```bash
  git clone git@github.com:WHB139426/Grounded-Video-LLM.git
  ```

  Download the checkpoint and specify its path in `config/*.yaml`.
- Build LLaVA for image QA:

  ```bash
  git clone git@github.com:fansunqi/LLaVA.git
  cd LLaVA
  pip install -e .
  cd ..
  ```
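For reference, here is a minimal sketch of how the API key configured above could be loaded and used with the official `openai` Python client. The YAML keys match the snippet in the setup steps; the loader itself is illustrative, not the repo's actual config code:

```python
# Minimal sketch: load the API key from config/*.yaml and build an OpenAI
# client. Illustrative only; the repo's own config loading may differ.
import yaml
from openai import OpenAI

with open("config/star_single_video.yaml") as f:  # any config/*.yaml
    cfg = yaml.safe_load(f)

client = OpenAI(
    api_key=cfg["openai"]["GPT_API_KEY"],
    base_url=cfg["openai"]["PROXY"],  # custom base URL / proxy endpoint
)
```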
Thanks to the authors of these open-source projects for their excellent work.
Temporal Tools:
- Frame Selector
  - Select frames of interest based on the current information, driven by an LLM.
  - Select frames of interest via an image grid, driven by a VLM.
- Temporal Grounding
  - Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
- Temporal Referring
  - Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM
- Temporal QA
  - Grounded-Video-LLM-7B: https://github.com/WHB139426/Grounded-Video-LLM

Spatial Tools:
- Object Detection and Tracking
  - YOLO by Ultralytics: https://github.com/ultralytics/ultralytics
- Image Captioning
- Image QA
- Patch Zooming
  - Zoom in on the key area of the images, driven by a VLM.
- Image Grid QA
  - Image grid QA driven by GPT-4o, adapted from https://github.com/microsoft/VLM-Video-Action-Localization

Generalist Tools:
- Video QA
  - Qwen2.5-VL-7B: https://github.com/QwenLM/Qwen2.5-VL
  - InternVL3-8B: https://github.com/OpenGVLab/InternVL
- Summarizer
  - Summarize all current information, driven by an LLM.
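The toolkit is designed to be extensible (see the TODO item on audio-to-text models). As a rough illustration of what that can look like, here is a sketch of a uniform tool wrapper; the interface is hypothetical, not the repo's actual base class:

```python
# Rough sketch of a uniform tool interface (hypothetical; not the repo's
# actual base class). Each tool exposes a name, a description the LLM
# scheduler reads when deciding what to invoke, and a run() method.
class Tool:
    name: str
    description: str

    def run(self, state: dict) -> str:
        """Consume the current reasoning state, return new evidence."""
        raise NotImplementedError

class AudioToText(Tool):
    """Example of extending the toolkit, e.g., with an ASR model."""
    name = "audio_to_text"
    description = "Transcribe the video's audio track into text."

    def run(self, state: dict) -> str:
        # Call an ASR model on the audio of state["video"] here.
        ...
```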
Run with a single video:

```bash
python run_single_video.py \
    --config config/star_single_video.yaml \
    --video_path /path/to/video.mp4 \
    --question "What is happening in the video?"
```
Run with a single video and multiple-choice options:

```bash
python run_single_video.py \
    --config config/star_single_video.yaml \
    --video_path /path/to/video.mp4 \
    --question "What is the person doing?" \
    --options "A. Running" "B. Swimming" "C. Cooking"
```
Run the test cases (they can be found in the `testcases` directory):

```bash
bash run_testcases.sh
```
- NExT-QA:

  ```bash
  git clone git@github.com:doc-doc/NExT-QA.git
  ```

  Specify your data path in `config/nextqa.yaml`.
We thank the developers of OctoTools and of all the open-source projects we used.
If you find our repo useful, please kindly consider citing:
```bibtex
@inproceedings{fan2025toolaugmented,
  title={Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task},
  author={Sunqi Fan and Jiashuo Cui and Meng-Hao Guo and Shuojin Yang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=OFz4VDn0SO}
}
```
