
FLEX: False-Less EXecution for Reliable Text-to-SQL Evaluation (NAACL 2025 Main)

Paper: https://arxiv.org/pdf/2409.19014

Introduction

FLEX (False-Less EXecution) is a novel approach to evaluating text-to-SQL systems, designed to overcome the limitations of existing metrics such as Execution Accuracy (EX). By leveraging Large Language Models (LLMs), FLEX emulates expert-level evaluation of SQL queries, providing a more accurate and nuanced assessment of text-to-SQL model performance.

Key features of FLEX include:

  1. Comprehensive context analysis, considering natural language questions, database schemas, and external knowledge.
  2. Sophisticated evaluation criteria with detailed guidelines for assessing query correctness.
  3. Robust handling of noisy ground truth, correctly evaluating queries even when the provided ground truth is inaccurate or ambiguous.
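
Since the judging is done by a general-purpose LLM, the core mechanism is an LLM-as-judge prompt that sees the question, the database schema, any external knowledge, and both the ground-truth and predicted SQL, and returns a correctness verdict. The snippet below is a minimal illustrative sketch of that pattern using the openai client; the prompt wording and the judge_sql helper are assumptions for illustration, not the actual prompt shipped in utils/llm_judge.py.

# Minimal sketch of the LLM-as-judge pattern behind FLEX (illustrative only).
# The prompt text and helper name are assumed; see utils/llm_judge.py for the real prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_sql(question, schema, evidence, gold_sql, pred_sql,
              model="gpt-4o-2024-08-06"):
    """Ask the LLM whether pred_sql answers the question, even if gold_sql is noisy."""
    prompt = (
        "You are an expert SQL reviewer.\n"
        f"Question: {question}\n"
        f"Database schema:\n{schema}\n"
        f"External knowledge: {evidence}\n"
        f"Ground-truth SQL (may be imperfect):\n{gold_sql}\n"
        f"Predicted SQL:\n{pred_sql}\n"
        "Reply with 'correct' or 'incorrect' on the first line, then a brief explanation."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.lower().startswith("correct")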

Our evaluation shows that FLEX achieves significantly higher agreement with human expert judgments (Cohen's kappa of 87.04) compared to the existing EX metric (62.00) and outperforms previous LLM-based methods.

Using FLEX, we re-evaluated 50 publicly available text-to-SQL models on the Spider and BIRD benchmarks, revealing:

  • Significant shifts in model rankings, particularly for the BIRD benchmark.
  • Generally higher scores compared to EX, suggesting that FLEX captures previously underestimated aspects of model capability.
  • Instances of overestimation in BIRD's challenging questions, highlighting areas for future research focus.

This repository provides the implementation of FLEX and tools for applying it to your own text-to-SQL evaluations.

Setup

1. Install Required Packages

# Install PyTorch and Transformers for your environment
pip install torch transformers

# Install other required packages
pip install -r requirements.txt

# Optional: Install additional frameworks if needed
pip install cohere together vllm

2. Download Datasets and Models

Spider Dataset

  1. Download spider_data.zip from Google Drive
  2. Unzip and move contents to the spider directory

BIRD Dataset

mkdir bird-download && cd bird-download
wget https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip
unzip dev.zip # dev_20240627 directory will be created

# Move the contents of dev_20240627 into bird_dev/llm/data/
mv dev_20240627/* ../bird_dev/llm/data/

Directory Structure

Ensure your directory structure looks like this:

.
├── bird_dev/
│   └── llm/
│       └── data/
│           ├── dev_databases/
│           │   └── california_schools/
│           │       └── california_schools.sqlite
│           ├── dev.json
│           ├── dev_gold.sql
│           └── dev_tables.json
└── spider/
    ├── database/
    │   └── academic/
    │       ├── academic.sqlite
    │       └── academic.sql
    ├── dev_gold.sql
    ├── dev.json
    └── dev_tables.json
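
Before running any evaluation, it can help to confirm the SQLite files sit where the scripts expect them. Below is a small optional sanity check using Python's built-in sqlite3 module; the two paths are taken from the tree above.

# Optional sanity check: confirm the downloaded databases open and contain tables.
import sqlite3

for path in [
    "spider/database/academic/academic.sqlite",
    "bird_dev/llm/data/dev_databases/california_schools/california_schools.sqlite",
]:
    with sqlite3.connect(path) as conn:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
        print(f"{path}: {len(tables)} tables")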

Evaluation

Spider FLEX Evaluation

# 1. Execute prediction file to get execution result
python -m spider_eval.evaluation --pred results/spider/SuperSQL.sql

# 2. Evaluate using FLEX
python llm_judge_main.py results/spider/SuperSQL_ex.json spider "gpt-4o-2024-08-06" --batch_size 32 --num_threads 8

# 3. Categorize error cases
python llm_judge_categorize.py "text2sql-baselines/spider/SuperSQL_ex_judge.jsonl" spider gpt-4o-2024-08-06 --batch_size 32 --num_threads 8

Note: GPT-4o-2024-08-06 costs approximately $4 to judge the 1,034 Spider dev instances.
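
Conceptually, the FLEX score is simply the fraction of instances the LLM judge marks as correct. If you want to aggregate the judge output yourself, a sketch along the following lines works; note that the "correct" field name below is an assumption for illustration, and the actual keys in SuperSQL_ex_judge.jsonl may differ (check utils/llm_judge.py for the real output schema).

# Hypothetical aggregation of per-instance judge verdicts into a single score.
# The "correct" key is an assumed field name; adjust it to the real judge output schema.
import json

def flex_score(judge_path: str) -> float:
    verdicts = []
    with open(judge_path) as f:
        for line in f:
            record = json.loads(line)
            verdicts.append(bool(record["correct"]))  # assumed field name
    return sum(verdicts) / len(verdicts)

print(flex_score("results/spider/SuperSQL_ex_judge.jsonl"))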

BIRD FLEX Evaluation

# 1. Execute prediction file
bash bird_dev/eval.sh results/bird/SuperSQL.sql

# 2. Evaluate using FLEX
python llm_judge_main.py results/bird/SuperSQL_ex.json "bird-dev" "gpt-4o-2024-08-06" --batch_size 32 --num_threads 8

# 3. Categorize error cases
python llm_judge_categorize.py "results/bird/SuperSQL_ex_judge.jsonl" bird-dev gpt-4o-2024-08-06 --batch_size 32 --num_threads 8

Note: GPT-4o-2024-08-06 costs approximately $6 to judge the 1,534 BIRD dev instances.

Human Agreement Evaluation

# Evaluate false negatives
python llm_human_agreement.py results/agreement/fn.jsonl gpt-4o-2024-08-06

# Evaluate false positives
python llm_human_agreement.py results/agreement/fp.jsonl gpt-4o-2024-08-06

# Calculate Cohen's Kappa score
python cohen_kappa_score.py results/agreement/fn_gpt-4o-2024-08-06.jsonl results/agreement/fp_gpt-4o-2024-08-06.jsonl

For more options, check the example scripts; see utils/llm_judge.py for the detailed implementation.
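
Cohen's kappa measures agreement between two raters beyond what chance alone would produce; cohen_kappa_score.py computes it between the LLM judge and the human annotations. For reference, the same quantity can be computed with scikit-learn as in the sketch below (the label lists are placeholders, not real annotations).

# Illustrative Cohen's kappa computation with scikit-learn (not the repo's script).
from sklearn.metrics import cohen_kappa_score

# Placeholder labels: 1 = judged correct, 0 = judged incorrect.
human_labels = [1, 0, 1, 1, 0, 1]
llm_labels = [1, 0, 1, 0, 0, 1]

# 1.0 means perfect agreement; 0 means agreement no better than chance.
print(cohen_kappa_score(human_labels, llm_labels))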
