Complex Function Calling Benchmark (ComplexFuncBench)

📄 Arxiv Paper • 🤗 HF Paper • 📊 Dataset

Introduction

Complex Function Calling Benchmark (ComplexFuncBench) is specially designed for evaluating complex function calling. The ComplexFuncBench dataset comprises 1,000 complex function calling samples covering five aspects: (1) multi-step function calling within a single turn; (2) function calling with user-provided constraints; (3) function calling that requires reasoning parameter values from implicit information; (4) function calling with long parameter values that exceed 500 tokens; and (5) function calling with 128k long-context length.

Complex Example
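
To give a concrete sense of what such a sample involves, below is a minimal, hypothetical sketch of a multi-step query together with the kind of tool definitions that accompany it. The field names, tool names, and values are illustrative only and do not reflect the released data schema.

# Hypothetical sketch only: field and tool names are illustrative,
# not the official ComplexFuncBench data schema.
sample = {
    "conversation": [
        {
            "role": "user",
            "content": (
                "Find a hotel in Paris for 2 adults from 2024-12-01 to 2024-12-03, "
                "then check the taxi fare from the airport to that hotel."
            ),
        },
        {
            # A multi-step turn: the assistant must resolve the destination first,
            # then search hotels, then request a taxi quote based on earlier results.
            "role": "assistant",
            "function_call": [
                {"name": "Search_Hotel_Destination", "arguments": {"query": "Paris"}}
            ],
        },
        # ... further function_call / observation turns omitted ...
    ],
    "functions": [
        # JSON-schema style tool definitions passed to the model.
        {"name": "Search_Hotel_Destination", "parameters": {"query": "string"}},
    ],
}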

ComplexFuncBench differs from existing function calling benchmarks (API-Bench, ToolBench, T-Eval, BFCL, and Tool Sandbox) in the dimensions it covers: real API responses, multi-step calls, user-provided constraints, parameter value reasoning, long parameter reasoning, and long context. ComplexFuncBench is the only benchmark that covers all six.

Leaderboard

Model                          Overall Success Rate   Overall Call Acc.   Completeness   Correctness
Claude-3.5-Sonnet (20241022)   61.00                  79.27               1.84           1.85
GPT-4o (2024-08-06)            60.50                  80.55               1.66           1.75
GLM-4-Long                     57.10                  76.35               1.72           1.74
GPT-4-Turbo (2024-04-09)       49.50                  71.38               1.72           1.81
Claude-3.5-Haiku (20241022)    45.80                  69.50               1.79           1.71
Qwen2.5-72B                    40.10                  58.32               1.80           1.75
Mistral Large 2                20.10                  48.78               0.94           1.00
GLM-4-9B                        9.40                  27.97               1.15           1.03
Qwen2.5-7B                      5.00                  18.19               1.50           1.47
Llama-3.1-405B                  4.00                  11.87               0.43           0.30
Llama-3.1-70B                   2.70                   8.17               0.67           0.36
Llama-3.1-8B                    0.10                   1.34               0.18           0.09

Method

Data Collection

The collection of the ComplexFuncBench dataset consists of three stages: coarse generation, fine-grained annotation, and generalization. The dataset contains 1,000 complex function-calling samples, which comprise 600 single-domain samples and 400 cross-domain samples.

Automated Evaluation

The automated evaluation framework ComplexEval evaluates models' complex function calling ability and response generation ability simultaneously.

Evaluation Pipeline
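
As a rough illustration of one building block of such an evaluator, the sketch below compares a predicted function call against a golden call by exact match on the function name and arguments. This is a simplified, hypothetical example, not the ComplexEval implementation, which also relies on response-based matching against real API responses.

# Simplified, hypothetical check; NOT the actual ComplexEval implementation.
def calls_match(predicted: dict, golden: dict) -> bool:
    """Return True if the predicted call has the same name and arguments as the golden call."""
    if predicted.get("name") != golden.get("name"):
        return False
    # Exact comparison of argument values; a full evaluator additionally needs to
    # tolerate semantically equivalent values (e.g. via response- or model-based matching).
    return predicted.get("arguments", {}) == golden.get("arguments", {})

golden = {"name": "Search_Hotel_Destination", "arguments": {"query": "Paris"}}
predicted = {"name": "Search_Hotel_Destination", "arguments": {"query": "Paris, France"}}
print(calls_match(predicted, golden))  # False under exact matching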

How to evaluate on ComplexFuncBench

Preparation

First, download the repository and dataset. You can download the benchmark dataset through Hugging Face Datasets.

git clone https://github.com/THUDM/ComplexFuncBench.git
cd ComplexFuncBench
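
Alternatively, the dataset can be loaded directly with the Hugging Face datasets library. A minimal sketch, assuming the dataset is published on the Hub as THUDM/ComplexFuncBench (check the dataset card for the exact split names):

# Minimal sketch: assumes the Hub id "THUDM/ComplexFuncBench"; see the dataset card for splits.
from datasets import load_dataset

dataset = load_dataset("THUDM/ComplexFuncBench")
print(dataset)  # should show the available splits and the 1,000 samples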

Then, install the dependencies.

pip install -r requirements.txt

Serve Model

  • For closed-source models, make sure the corresponding model API keys are included in your environment file .env. To enable response-based evaluation, you also need to subscribe to the Booking API on RapidAPI.

    OPENAI_API_KEY=sk-XXXXXX
    
    RAPID_API_KEY=
  • For open-source models, you need to deploy your model via vLLM. Run the following command to serve the model, taking THUDM/glm-4-9b-chat as an example (a quick sanity check of the served endpoint is sketched after this list):

    vllm serve THUDM/glm-4-9b-chat --api-key token-abc123 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max_model_len 131072 --trust-remote-code
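
Once the server is up, you can sanity-check the OpenAI-compatible endpoint before launching the evaluation. A minimal sketch, assuming the server runs locally on vLLM's default port 8000 with the API key passed to vllm serve above:

# Minimal sketch: assumes vLLM is serving on localhost:8000 with --api-key token-abc123.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
response = client.chat.completions.create(
    model="THUDM/glm-4-9b-chat",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=16,
)
print(response.choices[0].message.content)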

Run Model Inference

python evaluation.py --model_name {model_name} --proc_num {proc_num}

Taking gpt-4o-2024-08-06 and THUDM/glm-4-9b-chat as examples:

python evaluation.py --model_name gpt-4o-2024-08-06 --proc_num 50
python evaluation.py --model_name THUDM/glm-4-9b-chat --proc_num 50 --vllm_url http://xx.xx.xx.xx:8000/v1

The evaluation results are saved in result/{model_name}.

Export Results

python print_results.py --result_dir {result_dir}

Citation

If you find our work helpful for your research, please consider citing it:

@misc{zhong2025complexfuncbench,
      title={ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario}, 
      author={Lucen Zhong and Zhengxiao Du and Xiaohan Zhang and Haiyi Hu and Jie Tang},
      year={2025},
      eprint={2501.10132},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.10132}, 
}
