Welcome to the replication package for the FSE 2025 paper titled: Large Language Models for In-File Vulnerability Localization Can Be "Lost in the End".
Paper Link: https://doi.org/10.1145/3715758
- Abstract
- How to Setup this Repository
- About this Repository
- Cached Files for RQ3
- Supplementary Information
- Citations
- Support
Traditionally, software vulnerability detection research has focused on individual small functions due to earlier language processing technologies’ limitations in handling larger inputs. However, this function-level approach may miss bugs that span multiple functions and code blocks. Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power (β ≥ .8) to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files—a pattern we call the ‘lost-in-the-end’ effect.
Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Eventually, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models.
This repository is tested and recommended on:
- OS: Windows 11 (version 23H2, build 22631.3593), Linux (Debian 5.10.179 or newer) and macOS (13.2.1 Ventura or newer)
- Python version: 3.11 or newer
To use this package, you must set up two environment variables: `GITHUB_TOKEN` and `OPENAI_API_KEY`. These variables represent your personal access credentials for GitHub and OpenAI. By setting them, you ensure that your development environment can securely interact with these services without hardcoding sensitive information into your codebase. This approach enhances security and simplifies configuration management, making it easier to update credentials or share projects without exposing private keys.
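As an illustration of why this matters in practice, a script can read these variables at runtime and fail fast with a clear message when one is missing. The sketch below is ours, not code from this repository (the helper name `require_env` is an assumption):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, failing fast if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Example usage (uncomment once the variables are set):
# github_token = require_env("GITHUB_TOKEN")
# openai_api_key = require_env("OPENAI_API_KEY")
```

Failing at startup with an explicit message is easier to debug than an authentication error deep inside an API call.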
To use open-source models, you can use any API compatible with the OpenAI package. However, to minimize changes to your current setup, we recommend using Ollama. To install Ollama, follow the installation steps on their official website. If you prefer to use a different provider, you'll need to adjust environment variables—such as model aliases, endpoint configurations, and the API key—to match your chosen service.
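For illustration, providers with OpenAI-compatible APIs accept the same chat-completions request shape, so switching providers mostly means changing the base URL and model alias. The sketch below only builds the request body, without sending it; the endpoint URL and model alias are placeholder assumptions, not values from this repository (Ollama's OpenAI-compatible API is typically served under `/v1` on port 11434):

```python
# Placeholder values for illustration: adjust them to your provider
# (e.g., a local Ollama server exposing an OpenAI-compatible API).
BASE_URL = "http://localhost:11434/v1"   # assumed local Ollama endpoint
MODEL_ALIAS = "llama3:70b"               # assumed provider-specific model name

def build_chat_request(prompt: str) -> dict:
    """Build the JSON body for a POST to f"{BASE_URL}/chat/completions"."""
    return {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("Locate the vulnerability in this file: ...")
```

Because the request shape stays fixed, only `BASE_URL`, `MODEL_ALIAS`, and the API key need to change when swapping providers.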
- Copy `.env.example` to `.env`: `cp .env.example .env`
- Open the `.env` file in your favorite text editor and replace the placeholder values with your actual credentials: `GITHUB_TOKEN=your_github_token` and `OPENAI_API_KEY=your_openai_api_key`
Alternatively, you could set up the environment by directly setting environment variables. However, this only works with a UNIX-like OS.
On UNIX-like Operating Systems (Linux, MacOS):
- Open your terminal.
- To set the `OPENAI_API_KEY` variable, run: `export OPENAI_API_KEY='your_api_key'`
- To set the `GITHUB_TOKEN` variable, run: `export GITHUB_TOKEN='your_github_key'`
- These commands set the environment variables for your current session. To make them permanent, add the above lines to your shell profile (`~/.bashrc`, `~/.bash_profile`, `~/.zshrc`, etc.)
To ensure you've set up the environment variables correctly:
- In your terminal or command prompt, run:
echo $OPENAI_API_KEY
This should display your OpenAI API key.
If you don't have an account with any of these providers, create one and follow the instructions on their respective websites to obtain your API token: GitHub, OpenAI
- Create a new conda environment using the provided `environment.yml` file:
conda env create -f environment.yml
- Activate the environment:
conda activate infile_vulnerability_localization
Many scripts in this repository are Jupyter Notebooks (`.ipynb` files). To install and set up Jupyter Notebook:
- Install Jupyter Notebook within the conda environment:
conda install -c conda-forge jupyterlab
- Launch Jupyter Notebook:
jupyter notebook
- A new browser window should open with the Jupyter Notebook interface, allowing you to run and edit the `.ipynb` files.
To ensure Jupyter is correctly installed, you can check the version:
jupyter --version
For complete control over dependencies and research files, we provide a fully configured Docker image that includes all necessary dependencies and fixed cache files. This allows you to reproduce our research without needing to re-execute all LLM calls. The Docker container mirrors the structure of the cloned repository.
If you prefer to rerun all scripts with fresh LLM analysis while using Docker, you can do so by adding `-v ${PWD}:/app` to the `docker run` command. In this case, some caches will be ignored, and you will need to create a `.env` file, which will be copied into the container.
Our repository includes both Jupyter notebooks and standard Python scripts, requiring different execution approaches. Follow the instructions below:
- Launch the Jupyter endpoint:
docker run -p 8888:8888 baueradam/infile_vulnerability_localization
- Connect to the Jupyter environment at http://localhost:8888.
Two methods for executing Python scripts:
Method 1: One-Time Execution
- Launch a container and run a specific script:
docker run --rm -it -w /app baueradam/infile_vulnerability_localization bash -c "cd 1_all_files_analysis && python count_functions.py 79"
Method 2: Interactive Session
- Create a container for interactive use:
docker run --rm -it -w /app baueradam/infile_vulnerability_localization bash
- Then, execute scripts as needed:
cd 1_all_files_analysis
python count_functions.py 79
Important:
To ensure file paths resolve correctly, run each script from its own directory within the container.
All the following scripts are designed to run experiments for these LLMs:
- gpt-3.5-turbo
- gpt-4-turbo
- gpt-4o
- llama3-70b-8192
- mixtral-8x7b-32768
- mixtral-8x22b-65536
This folder contains the code to extract the dataset for the task of bug detection in code.
- Notebooks:
  - `01gather_data.ipynb`: Jupyter Notebook containing the code to gather data from the CVE catalog (NIST database).
  - `02create_files.ipynb`: Contains the code to create the files for the dataset via GitHub scraping.
- Output: These two scripts produce three CSV files containing single commit files of CWE-79, CWE-89, and CWE-22. These files are saved in `1_all_files_analysis/cve_data`.
This folder contains the scripts, data files, and analysis results for the first experiment of RQ2 (RQ2.1) and the RQ1 experiments.
- Data Folder: `cve_data`
  - `files_CWE-22.csv`, `files_CWE-79.csv`, `files_CWE-89.csv`: Contain the vulnerabilities (and their patches) we extracted for each CWE from the CVE catalog.
- Notebooks and Other Python Scripts:
  - `run_prompts.py`: Sends the bug localization prompts for a given LLM and CWE number (provided as parameters), saving the results in the `model_outputs` subfolder. The `cache` subfolder also contains other intermediate results obtained by prompting the GPT models.
  - `analyse_results.ipynb`: For each LLM, it computes the accuracy, precision, recall, and other statistics (i.e., logistic regressions) to answer RQ1 and RQ2.1 and to understand the impact of bug position and file size on in-file vulnerability localisation. The visualisations are saved in the `results` subfolder.
  - `count_functions.py`: Computes function statistics (number per file and average size) on the data provided in the `cve_data` subfolder.
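As an illustration of the metrics the analysis notebooks report, accuracy, precision, recall, and F1 score can all be derived from confusion-matrix counts. This is a generic sketch of the standard formulas, not code taken from `analyse_results.ipynb`:

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }

# Illustrative counts (not results from the paper):
m = classification_metrics(tp=40, fp=10, tn=35, fn=15)
```

In the vulnerability-localization setting, a true positive means the LLM flagged the actual buggy location, so recall directly measures how many planted vulnerabilities are found.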
This folder contains the scripts, data files, and analysis results for the second (code-in-the-haystack) experiment of RQ2 (RQ2.2).
- Subfolder: `source_files`
  - Contains 15 source files organised by programming language.
  - Each file has an assigned ID and contains multiple versions:
    - originalFile: The original, unmodified file.
    - originalBuggy: The smallest buggy snippet identified (at the function level).
    - smallestBuggy: The smallest buggy file, refactored.
    - modifiedFile: The modified file without the bug. If the buggy function was extracted, it includes a refactored main function and added comments for clear separation.
    - additional_padding: A collection of files from the same repository to add as padding, separated by function with clear separation comments.
- Notebooks:
  - `create_files_with_padding.ipynb`: Using the source files in the `source_files` subfolder, this script creates the code-in-the-haystack files used for RQ2.2. The resulting files are saved in `data_to_process/files.csv`.
  - `run_inference.ipynb`: Runs RQ2.2 on the code-in-the-haystack files extracted by the previous script. The results of the bug localisation tasks are saved in the `runs` subfolder.
  - `analyse_data.ipynb`: For each LLM, it computes the accuracy, precision, recall, and other statistics (i.e., logistic regressions) to answer RQ2.2 and understand the impact of bug position and file size on in-file vulnerability localisation. The visualisations are saved in the `results` subfolder.
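Conceptually, the code-in-the-haystack construction places a known buggy snippet at a controlled position among padding functions, so bug position and file size can be varied independently. The sketch below illustrates the idea only; the function names and the toy snippets are ours, not the notebook's implementation:

```python
def build_haystack(padding_funcs: list[str], buggy_snippet: str,
                   insert_index: int) -> str:
    """Insert the buggy snippet among padding functions at a chosen index,
    so its position in the file can be controlled experimentally."""
    parts = (padding_funcs[:insert_index]
             + [buggy_snippet]
             + padding_funcs[insert_index:])
    return "\n\n".join(parts)

# Toy padding functions and a toy planted bug (illustrative only):
pad = [f"def pad_{i}():\n    return {i}" for i in range(4)]
bug = "def vulnerable():\n    return eval(user_input)  # the planted bug"
haystack = build_haystack(pad, bug, insert_index=2)
```

Sliding `insert_index` from 0 to `len(padding_funcs)` moves the bug from the start of the file to its end while keeping total size constant, which is exactly what is needed to isolate the position effect.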
This folder contains the scripts, data files, and analysis results for the RQ3 experiments.
- Python Scripts:
  - `run_prompts.py`: Chunks the files used for RQ1 using different chunk sizes, then sends the bug localization prompts for all the CWE types once the LLM name is provided as a parameter. The results are stored in the `results` subfolder, which contains one CSV table per LLM reporting the RQ3 results in terms of accuracy, precision, recall, and F1 score.
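The chunking step can be pictured as splitting a file's lines into fixed-size windows that are analysed independently. This is a simplified sketch of the general technique, under our own assumptions (including the optional `overlap` parameter), not the script's actual implementation:

```python
def chunk_lines(source: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split source code into chunks of at most chunk_size lines.
    An optional overlap keeps bugs near chunk borders visible in two chunks."""
    lines = source.splitlines()
    step = max(chunk_size - overlap, 1)
    return ["\n".join(lines[i:i + chunk_size])
            for i in range(0, len(lines), step)]

# Example: a 10-line file split into 4-line chunks with 1 line of overlap.
code = "\n".join(f"line {i}" for i in range(10))
chunks = chunk_lines(code, chunk_size=4, overlap=1)
```

Smaller chunks keep each vulnerability closer to the start of its own input, which is the intuition behind using chunk size to counter the lost-in-the-end effect.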
The pre-generated cache files required for the experiments in the `./3_optimal_position` folder are not included in this repository. To facilitate reproducibility and save time, these cache files have been made available separately.
Action Required:
Please download the cache files from Zenodo using the following DOI: 10.5281/zenodo.14840311.
Instructions:
- Click on the DOI link above or visit it directly in your browser.
- Download the provided archive containing the cache files.
- Extract the contents of the archive.
- Place the extracted files into the `./3_optimal_position/cache` directory, ensuring that the folder structure remains unchanged so that the scripts can locate the cache files correctly.
By following these steps, you will have all the necessary cache data to reproduce the RQ3 experiments without needing to re-run the computationally expensive LLM calls.
Below, we provide the appendix of the paper, which comprises supplementary information that could not be included in the main paper due to page constraints.
We hereby follow APA guidelines to report the regression coefficients, 95% confidence intervals, effect sizes (i.e., odds ratios), and p-values for each model term (intercept and predictors). Additionally, we conclude with a paragraph interpreting the statistics in relation to the paper's claims, as recommended by APA guidelines.
Below are the tables with the regression results of Fig. 2 (see paper):
Model: mixtral-8x7b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.49 | [-0.21, 1.18] | 0.173 | 1.632 |
"" | file_len | -0.13 | [-0.20, -0.05] | 0.001 | 0.878 |
CWE-22 | intercept | 0.45 | [-0.16, 1.06] | 0.144 | 1.568 |
"" | bug_pos | -0.52 | [-0.82, -0.21] | 0.001 | 0.595 |
CWE-89 | intercept | 0.48 | [-0.10, 1.07] | 0.106 | 1.616 |
"" | file_len | -0.14 | [-0.22, -0.07] | 0.000 | 0.869 |
CWE-89 | intercept | 0.06 | [-0.43, 0.55] | 0.809 | 1.062 |
"" | bug_pos | -0.28 | [-0.45, -0.10] | 0.002 | 0.756 |
CWE-79 | intercept | -0.90 | [-1.22, -0.58] | 0.000 | 0.407 |
"" | file_len | -0.06 | [-0.09, -0.03] | 0.000 | 0.942 |
CWE-79 | intercept | -1.03 | [-1.31, -0.75] | 0.000 | 0.357 |
"" | bug_pos | -0.14 | [-0.21, -0.07] | 0.000 | 0.869 |
Model: mixtral-8x22b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.65 | [0.03, 1.28] | 0.041 | 1.915 |
"" | file_len | -0.07 | [-0.13, -0.02] | 0.005 | 0.933 |
CWE-22 | intercept | 0.51 | [-0.01, 1.03] | 0.056 | 1.665 |
"" | bug_pos | -0.20 | [-0.34, -0.07] | 0.003 | 0.818 |
CWE-89 | intercept | 0.48 | [-0.00, 0.96] | 0.052 | 1.617 |
"" | file_len | -0.07 | [-0.10, -0.03] | 0.000 | 0.933 |
CWE-89 | intercept | 0.25 | [-0.17, 0.67] | 0.239 | 1.284 |
"" | bug_pos | -0.13 | [-0.20, -0.05] | 0.001 | 0.878 |
CWE-79 | intercept | -0.22 | [-0.54, 0.10] | 0.176 | 0.802 |
"" | file_len | -0.10 | [-0.14, -0.07] | 0.000 | 0.905 |
CWE-79 | intercept | -0.55 | [-0.81, -0.28] | 0.000 | 0.577 |
"" | bug_pos | -0.20 | [-0.28, -0.12] | 0.000 | 0.818 |
Model: llama-3-70b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.57 | [-0.11, 1.24] | 0.103 | 1.768 |
"" | file_len | -0.10 | [-0.17, -0.03] | 0.004 | 0.905 |
CWE-22 | intercept | 0.09 | [-0.42, 0.60] | 0.736 | 1.094 |
"" | bug_pos | -0.13 | [-0.26, -0.01] | 0.039 | 0.878 |
CWE-89 | intercept | 0.52 | [0.00, 1.05] | 0.050 | 1.681 |
"" | file_len | -0.07 | [-0.12, -0.02] | 0.004 | 0.934 |
CWE-89 | intercept | 0.47 | [0.01, 0.92] | 0.046 | 1.600 |
"" | bug_pos | -0.19 | [-0.31, -0.06] | 0.003 | 0.827 |
CWE-79 | intercept | -0.45 | [-0.78, -0.11] | 0.010 | 0.638 |
"" | file_len | -0.09 | [-0.13, -0.05] | 0.000 | 0.914 |
CWE-79 | intercept | -0.79 | [-1.07, -0.52] | 0.000 | 0.454 |
"" | bug_pos | -0.12 | [-0.19, -0.06] | 0.000 | 0.888 |
Model: gpt-3.5-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.13 | [-0.49, 0.74] | 0.685 | 1.139 |
"" | file_len | -0.05 | [-0.10, -0.00] | 0.044 | 0.951 |
CWE-22 | intercept | 0.06 | [-0.45, 0.57] | 0.818 | 1.062 |
"" | bug_pos | -0.16 | [-0.29, -0.03] | 0.017 | 0.852 |
CWE-89 | intercept | 0.84 | [0.32, 1.35] | 0.001 | 2.320 |
"" | file_len | -0.09 | [-0.14, -0.05] | 0.000 | 0.914 |
CWE-89 | intercept | 0.32 | [-0.09, 0.73] | 0.131 | 1.378 |
"" | bug_pos | -0.11 | [-0.17, -0.04] | 0.002 | 0.896 |
CWE-79 | intercept | -0.73 | [-1.06, -0.40] | 0.000 | 0.482 |
"" | file_len | -0.08 | [-0.11, -0.04] | 0.000 | 0.923 |
CWE-79 | intercept | -1.00 | [-1.28, -0.72] | 0.000 | 0.368 |
"" | bug_pos | -0.13 | [-0.20, -0.06] | 0.000 | 0.878 |
Model: gpt-4-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.95 | [0.34, 1.56] | 0.002 | 2.585 |
"" | file_len | -0.05 | [-0.09, -0.01] | 0.021 | 0.951 |
CWE-22 | intercept | 1.01 | [0.46, 1.55] | 0.000 | 2.749 |
"" | bug_pos | -0.18 | [-0.29, -0.06] | 0.002 | 0.835 |
CWE-89 | intercept | 0.77 | [0.27, 1.27] | 0.002 | 2.160 |
"" | file_len | -0.08 | [-0.12, -0.04] | 0.000 | 0.923 |
CWE-89 | intercept | 0.56 | [0.13, 1.00] | 0.011 | 1.752 |
"" | bug_pos | -0.18 | [-0.28, -0.08] | 0.000 | 0.835 |
CWE-79 | intercept | -0.54 | [-0.83, -0.25] | 0.000 | 0.583 |
"" | file_len | -0.06 | [-0.08, -0.03] | 0.000 | 0.941 |
CWE-79 | intercept | -0.52 | [-0.79, -0.26] | 0.000 | 0.594 |
"" | bug_pos | -0.19 | [-0.27, -0.12] | 0.000 | 0.827 |
Model: gpt-4o
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.83 | [0.23, 1.43] | 0.007 | 2.294 |
"" | file_len | -0.05 | [-0.09, -0.01] | 0.022 | 0.951 |
CWE-22 | intercept | 0.87 | [0.33, 1.40] | 0.001 | 2.389 |
"" | bug_pos | -0.17 | [-0.29, -0.06] | 0.003 | 0.844 |
CWE-89 | intercept | 0.92 | [0.42, 1.42] | 0.000 | 2.510 |
"" | file_len | -0.07 | [-0.11, -0.04] | 0.000 | 0.933 |
CWE-89 | intercept | 0.50 | [0.09, 0.91] | 0.017 | 1.649 |
"" | bug_pos | -0.09 | [-0.15, -0.04] | 0.001 | 0.914 |
CWE-79 | intercept | -0.22 | [-0.49, 0.06] | 0.121 | 0.802 |
"" | file_len | -0.05 | [-0.07, -0.03] | 0.000 | 0.951 |
CWE-79 | intercept | -0.20 | [-0.44, 0.05] | 0.112 | 0.819 |
"" | bug_pos | -0.18 | [-0.24, -0.12] | 0.000 | 0.835 |
Below are the tables with the regression results of Fig. 4:
Model: mixtral-8x7b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.04 | [-0.17, 0.25] | 0.706 | 1.041 |
"" | file_len | -0.05 | [-0.06, -0.03] | 0.000 | 0.951 |
CWE-22 | intercept | -0.19 | [-0.32, -0.05] | 0.006 | 0.827 |
"" | bug_pos | -0.07 | [-0.08, -0.06] | 0.000 | 0.933 |
CWE-89 | intercept | -0.66 | [-0.89, -0.43] | 0.000 | 0.517 |
"" | file_len | -0.04 | [-0.05, -0.03] | 0.000 | 0.961 |
CWE-89 | intercept | -0.80 | [-0.95, -0.65] | 0.000 | 0.449 |
"" | bug_pos | -0.07 | [-0.08, -0.05] | 0.000 | 0.933 |
CWE-79 | intercept | -0.34 | [-0.56, -0.13] | 0.002 | 0.712 |
"" | file_len | -0.04 | [-0.05, -0.03] | 0.000 | 0.961 |
CWE-79 | intercept | -0.87 | [-1.01, -0.73] | 0.000 | 0.420 |
"" | bug_pos | -0.02 | [-0.03, -0.00] | 0.015 | 0.980 |
Model: mixtral-8x22b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.31 | [0.11, 0.52] | 0.003 | 1.363 |
"" | file_len | -0.06 | [-0.07, -0.05] | 0.000 | 0.942 |
CWE-89 | intercept | 0.01 | [-0.21, 0.23] | 0.941 | 1.010 |
"" | file_len | -0.08 | [-0.09, -0.07] | 0.000 | 0.923 |
CWE-79 | intercept | 0.14 | [-0.06, 0.35] | 0.164 | 1.150 |
"" | file_len | -0.02 | [-0.03, -0.01] | 0.001 | 0.980 |
CWE-22 | intercept | -0.28 | [-0.41, -0.15] | 0.000 | 0.756 |
"" | bug_pos | -0.05 | [-0.07, -0.04] | 0.000 | 0.951 |
CWE-89 | intercept | -0.12 | [-0.27, 0.03] | 0.113 | 0.887 |
"" | bug_pos | -0.18 | [-0.20, -0.16] | 0.000 | 0.835 |
CWE-79 | intercept | 0.13 | [0.01, 0.26] | 0.042 | 1.139 |
"" | bug_pos | -0.04 | [-0.05, -0.02] | 0.000 | 0.961 |
Model: llama-3-70b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.78 | [0.57, 0.99] | 0.000 | 2.181 |
"" | file_len | -0.04 | [-0.05, -0.03] | 0.000 | 0.961 |
CWE-89 | intercept | 0.20 | [-0.02, 0.43] | 0.072 | 1.222 |
"" | file_len | -0.09 | [-0.11, -0.08] | 0.000 | 0.914 |
CWE-79 | intercept | -0.16 | [-0.37, 0.05] | 0.135 | 0.852 |
"" | file_len | -0.03 | [-0.04, -0.02] | 0.000 | 0.971 |
CWE-22 | intercept | 0.29 | [0.16, 0.42] | 0.000 | 1.336 |
"" | bug_pos | -0.03 | [-0.04, -0.02] | 0.000 | 0.971 |
CWE-89 | intercept | -0.06 | [-0.21, 0.09] | 0.440 | 0.941 |
"" | bug_pos | -0.20 | [-0.22, -0.17] | 0.000 | 0.819 |
CWE-79 | intercept | -0.49 | [-0.62, -0.36] | 0.000 | 0.612 |
"" | bug_pos | -0.02 | [-0.04, -0.01] | 0.000 | 0.980 |
Model: gpt-3.5-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.51 | [0.31, 0.72] | 0.000 | 1.665 |
"" | file_len | -0.03 | [-0.04, -0.02] | 0.000 | 0.970 |
CWE-89 | intercept | -0.63 | [-0.88, -0.38] | 0.000 | 0.533 |
"" | file_len | -0.07 | [-0.08, -0.05] | 0.000 | 0.932 |
CWE-79 | intercept | -1.14 | [-1.39, -0.89] | 0.000 | 0.320 |
"" | file_len | -0.02 | [-0.03, -0.01] | 0.002 | 0.980 |
CWE-22 | intercept | -0.00 | [-0.13, 0.12] | 0.963 | 1.000 |
"" | bug_pos | -0.01 | [-0.02, -0.00] | 0.033 | 0.990 |
CWE-89 | intercept | -0.71 | [-0.88, -0.54] | 0.000 | 0.492 |
"" | bug_pos | -0.16 | [-0.19, -0.14] | 0.000 | 0.852 |
CWE-79 | intercept | -1.50 | [-1.66, -1.33] | 0.000 | 0.223 |
"" | bug_pos | -0.00 | [-0.02, 0.01] | 0.897 | 1.000 |
Model: gpt-4-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 1.34 | [1.11, 1.56] | 0.000 | 3.822 |
"" | file_len | -0.04 | [-0.06, -0.03] | 0.000 | 0.961 |
CWE-89 | intercept | 0.42 | [0.21, 0.62] | 0.000 | 1.522 |
"" | file_len | -0.05 | [-0.06, -0.04] | 0.000 | 0.951 |
CWE-79 | intercept | 0.21 | [0.01, 0.42] | 0.040 | 1.233 |
"" | file_len | -0.01 | [-0.02, 0.00] | 0.086 | 0.990 |
CWE-22 | intercept | 1.14 | [1.01, 1.28] | 0.000 | 3.128 |
"" | bug_pos | -0.07 | [-0.08, -0.06] | 0.000 | 0.933 |
CWE-89 | intercept | 0.04 | [-0.09, 0.16] | 0.595 | 1.041 |
"" | bug_pos | -0.07 | [-0.08, -0.05] | 0.000 | 0.933 |
CWE-79 | intercept | 0.13 | [0.00, 0.25] | 0.048 | 1.139 |
"" | bug_pos | -0.01 | [-0.02, 0.00] | 0.123 | 0.990 |
Model: gpt-4o
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.53 | [0.32, 0.75] | 0.000 | 1.699 |
"" | file_len | -0.09 | [-0.11, -0.08] | 0.000 | 0.914 |
CWE-89 | intercept | 0.94 | [0.73, 1.16] | 0.000 | 2.563 |
"" | file_len | -0.11 | [-0.12, -0.10] | 0.000 | 0.896 |
CWE-79 | intercept | 0.75 | [0.54, 0.96] | 0.000 | 2.117 |
"" | file_len | -0.02 | [-0.03, -0.01] | 0.000 | 0.980 |
CWE-22 | intercept | 0.42 | [0.27, 0.57] | 0.000 | 1.521 |
"" | bug_pos | -0.23 | [-0.25, -0.20] | 0.000 | 0.793 |
CWE-89 | intercept | 0.76 | [0.60, 0.91] | 0.000 | 2.137 |
"" | bug_pos | -0.26 | [-0.28, -0.24] | 0.000 | 0.771 |
CWE-79 | intercept | 0.61 | [0.48, 0.74] | 0.000 | 1.840 |
"" | bug_pos | -0.03 | [-0.04, -0.01] | 0.000 | 0.970 |
The results indicate a negative association between both bug position and file size with the probability of bug detection. For instance:
- The significant negative coefficients for bug_pos (e.g., -0.52 for CWE-22 in mixtral-8x7b) suggest that as the bug's position moves further within a file, the likelihood of detection decreases.
- Similarly, negative coefficients for file_len (e.g., -0.13 for CWE-22 in mixtral-8x7b) indicate that larger files are less likely to have their bugs detected.
- Bug position generally shows larger coefficients (in absolute terms) than file length. This suggests that bug position has a stronger effect on bug detection probability than file length.
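The odds ratios in the tables above are simply the exponentiated regression coefficients, OR = e^B. As a check, the following reproduces two values from the mixtral-8x7b CWE-22 rows:

```python
import math

def odds_ratio(coefficient: float) -> float:
    """Convert a logistic-regression coefficient B into an odds ratio e^B."""
    return math.exp(coefficient)

# From the mixtral-8x7b CWE-22 rows above:
or_bug_pos = round(odds_ratio(-0.52), 3)   # bug_pos coefficient B = -0.52
or_file_len = round(odds_ratio(-0.13), 3)  # file_len coefficient B = -0.13
```

An odds ratio below 1 means each unit increase in the predictor multiplies the odds of detection by that factor, which is why the negative coefficients translate into declining detection odds for later bug positions and longer files.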
Additionally, we include the results of a multiple logistic regression (i.e., combining both predictors), which remain consistent with those obtained from the simple logistic regressions above.
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.05, 95% CI [0.85, 1.29], p = 0.678
- Regression term: target_length, Odds Ratio = 0.98, 95% CI [0.97, 0.99], p = 0.005
- Regression term: target_bug_position, Odds Ratio = 0.94, 95% CI [0.93, 0.96], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.52, 95% CI [0.41, 0.66], p = 0.000
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.97, 1.00], p = 0.098
- Regression term: target_bug_position, Odds Ratio = 0.94, 95% CI [0.92, 0.96], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.71, 95% CI [0.57, 0.88], p = 0.002
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.94, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 1.01, 95% CI [1.00, 1.03], p = 0.174
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.36, 95% CI [1.11, 1.68], p = 0.003
- Regression term: target_length, Odds Ratio = 0.95, 95% CI [0.94, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.98, 95% CI [0.96, 0.99], p = 0.003
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.16, 95% CI [0.92, 1.45], p = 0.219
- Regression term: target_length, Odds Ratio = 0.98, 95% CI [0.96, 0.99], p = 0.003
- Regression term: target_bug_position, Odds Ratio = 0.85, 95% CI [0.83, 0.87], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.15, 95% CI [0.94, 1.41], p = 0.184
- Regression term: target_length, Odds Ratio = 1.00, 95% CI [0.99, 1.01], p = 0.937
- Regression term: target_bug_position, Odds Ratio = 0.97, 95% CI [0.95, 0.98], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 2.17, 95% CI [1.76, 2.67], p = 0.000
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.95, 0.98], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.99, 95% CI [0.98, 1.00], p = 0.137
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.44, 95% CI [1.14, 1.81], p = 0.002
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.95, 0.98], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.84, 95% CI [0.82, 0.86], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.85, 95% CI [0.69, 1.05], p = 0.131
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.99], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.99, 95% CI [0.98, 1.01], p = 0.373
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.67, 95% CI [1.36, 2.05], p = 0.000
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.95, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 1.01, 95% CI [1.00, 1.03], p = 0.092
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.59, 95% CI [0.46, 0.77], p = 0.000
- Regression term: target_length, Odds Ratio = 0.98, 95% CI [0.97, 1.00], p = 0.058
- Regression term: target_bug_position, Odds Ratio = 0.86, 95% CI [0.83, 0.88], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.32, 95% CI [0.25, 0.41], p = 0.000
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.99], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 1.02, 95% CI [1.00, 1.04], p = 0.059
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 3.76, 95% CI [2.99, 4.71], p = 0.000
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.97, 1.00], p = 0.047
- Regression term: target_bug_position, Odds Ratio = 0.94, 95% CI [0.93, 0.95], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.52, 95% CI [1.23, 1.86], p = 0.000
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.98], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.95, 95% CI [0.94, 0.97], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.24, 95% CI [1.01, 1.51], p = 0.041
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.98, 1.01], p = 0.297
- Regression term: target_bug_position, Odds Ratio = 1.00, 95% CI [0.98, 1.01], p = 0.469
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 2.14, 95% CI [1.70, 2.69], p = 0.000
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.99], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.81, 95% CI [0.79, 0.83], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 3.53, 95% CI [2.79, 4.47], p = 0.000
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.95, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.79, 95% CI [0.77, 0.81], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 2.10, 95% CI [1.70, 2.59], p = 0.000
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.98, 1.00], p = 0.127
- Regression term: target_bug_position, Odds Ratio = 0.98, 95% CI [0.97, 0.99], p = 0.004
This code is free to use. If you use it anywhere, please cite us:
@inproceedings{sovrano2025llms,
title={Large Language Models for In-File Vulnerability Localization Can Be “Lost in the End”},
author={Sovrano, Francesco and Bauer, Adam and Bacchelli, Alberto},
booktitle={Proceedings of ACM International Conference on the Foundations of Software Engineering 2025 (FSE’25)},
year={2025},
doi={10.1145/3715758},
organization={ACM}
}
Thank you!
For any problem or question, please contact me at [email protected]