📄 Arxiv Paper • 🤗 HF Paper • 📊 Dataset
Complex Function Calling Benchmark (ComplexFuncBench) is specially designed for evaluating complex function calling. The ComplexFuncBench dataset comprises 1,000 complex function calling samples covering five aspects: (1) function calling with multiple steps in a single turn; (2) function calling with user-provided constraints; (3) function calling that requires reasoning parameter values from implicit information; (4) function calling with long parameter values that exceed 500 tokens; and (5) function calling with 128k long-context length.
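To make these aspects concrete, here is a purely hypothetical sample in the spirit of the benchmark's travel domain; the query, function names, and fields below are illustrative assumptions, not taken from the released dataset.

```python
# Hypothetical illustration of a single-turn query that needs several
# dependent function calls; the real dataset's schema may differ.
sample = {
    "query": (
        "I land at JFK on 2024-11-02 around 18:00. Book me a taxi to a "
        "4-star hotel near Times Square and reserve a room for two nights."
    ),
    "expected_calls": [
        # Step 1: search hotels under a user-provided constraint (4-star).
        {"name": "Search_Hotels",
         "arguments": {"location": "Times Square, New York",
                       "checkin": "2024-11-02", "checkout": "2024-11-04",
                       "min_star_rating": 4}},
        # Step 2: the taxi destination must be reasoned from step 1's response.
        {"name": "Book_Taxi",
         "arguments": {"pickup": "JFK Airport",
                       "dropoff": "<hotel address from step 1>",
                       "pickup_time": "2024-11-02T18:30"}},
    ],
}
```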
The differences between ComplexFuncBench and other function calling benchmarks are shown in the following table.
| | Real API Response | Multi-Step | Constraints | Parameter Value Reasoning | Long Parameter Reasoning | Long-Context |
|---|---|---|---|---|---|---|
| API-Bench | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| ToolBench | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| T-Eval | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| BFCL | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ |
| Tool Sandbox | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| ComplexFuncBench | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Model | Overall Success Rate | Overall Call Acc. | Completeness | Correctness |
|---|---|---|---|---|
| Claude-3.5-Sonnet (20241022) | 61.00 | 79.27 | 1.84 | 1.85 |
| GPT-4o (2024-08-06) | 60.50 | 80.55 | 1.66 | 1.75 |
| GLM-4-Long | 57.10 | 76.35 | 1.72 | 1.74 |
| GPT-4-Turbo (2024-04-09) | 49.50 | 71.38 | 1.72 | 1.81 |
| Claude-3.5-Haiku (20241022) | 45.80 | 69.50 | 1.79 | 1.71 |
| Qwen2.5-72B | 40.10 | 58.32 | 1.80 | 1.75 |
| Mistral Large 2 | 20.10 | 48.78 | 0.94 | 1.00 |
| GLM-4-9B | 9.40 | 27.97 | 1.15 | 1.03 |
| Qwen2.5-7B | 5.00 | 18.19 | 1.50 | 1.47 |
| Llama-3.1-405B | 4.00 | 11.87 | 0.43 | 0.30 |
| Llama-3.1-70B | 2.70 | 8.17 | 0.67 | 0.36 |
| Llama-3.1-8B | 0.10 | 1.34 | 0.18 | 0.09 |
The construction of the ComplexFuncBench dataset consists of three stages: coarse generation, fine-grained annotation, and generalization. The resulting dataset contains 1,000 complex function-calling samples: 600 single-domain samples and 400 cross-domain samples.
The automated evaluation framework `ComplexEval` evaluates models' complex function calling ability and response generation ability simultaneously.
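For intuition only, the rule-based part of comparing a predicted function call against a golden one can be sketched as an exact match on the function name and argument values; this is an assumed, minimal sketch and not the actual `ComplexEval` implementation.

```python
import json

def calls_match(predicted: dict, golden: dict) -> bool:
    """Illustrative rule-based check: same function name and identical arguments."""
    if predicted.get("name") != golden.get("name"):
        return False
    pred_args = predicted.get("arguments", {})
    gold_args = golden.get("arguments", {})
    # Arguments may arrive as JSON strings; normalize before comparing.
    if isinstance(pred_args, str):
        pred_args = json.loads(pred_args)
    if isinstance(gold_args, str):
        gold_args = json.loads(gold_args)
    return pred_args == gold_args
```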
First, download the repository. You can also download the benchmark dataset through HuggingFace Datasets (a loading snippet follows the clone commands below).

```bash
git clone https://github.com/THUDM/ComplexFuncBench.git
cd ComplexFuncBench
```
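To load the dataset programmatically from the Hub, a minimal sketch (the dataset id `THUDM/ComplexFuncBench` and the `train` split are assumptions; verify against the dataset card linked above):

```python
from datasets import load_dataset

# Assumed dataset id and split name; check the dataset card on the Hub.
dataset = load_dataset("THUDM/ComplexFuncBench", split="train")
print(len(dataset))       # should report 1,000 samples
print(dataset[0].keys())  # inspect the sample fields
```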
Then, install the dependencies.

```bash
pip install -r requirements.txt
```
- For closed-source models, make sure the corresponding model API keys are included in your `.env` environment file. To enable response-based evaluation, you also need to subscribe to the Booking API on RapidAPI.

  ```
  OPENAI_API_KEY=sk-XXXXXX
  RAPID_API_KEY=
  ```
- For open-source models, you need to deploy your model with vLLM and serve it through its OpenAI-compatible API. Take `THUDM/glm-4-9b-chat` for example (a quick sanity check of the served endpoint is sketched right after this list):

  ```bash
  vllm serve THUDM/glm-4-9b-chat --api-key token-abc123 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max_model_len 131072 --trust-remote-code
  ```
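Once the server is running, you can check that the OpenAI-compatible endpoint responds before starting the evaluation. A minimal sketch, assuming the server listens on localhost port 8000 (vLLM's default) with the `token-abc123` key from the command above:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# List the served models; the output should include THUDM/glm-4-9b-chat.
for model in client.models.list():
    print(model.id)
```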
Then, run the evaluation script:

```bash
python evaluation.py --model_name {model_name} --proc_num {proc_num}
```
Take `gpt-4o-2024-08-06` and `THUDM/glm-4-9b-chat` for example:

```bash
python evaluation.py --model_name gpt-4o-2024-08-06 --proc_num 50
python evaluation.py --model_name THUDM/glm-4-9b-chat --proc_num 50 --vllm_url http://xx.xx.xx.xx:8000/v1
```
The evaluation results are saved in `result/{model_name}`. To print a summary of the results, run:

```bash
python print_results.py --result_dir {result_dir}
```
If you find our work helpful for your research, please consider citing it:
```bibtex
@misc{zhong2025complexfuncbench,
      title={ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario},
      author={Lucen Zhong and Zhengxiao Du and Xiaohan Zhang and Haiyi Hu and Jie Tang},
      year={2025},
      eprint={2501.10132},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.10132},
}
```