Welcome to the replication package for the FSE 2025 paper titled: Large Language Models for In-File Vulnerability Localization Can Be "Lost in the End".
Paper Link: https://doi.org/10.1145/3715758
- Abstract
- How to Setup this Repository
- About this Repository
- Cached Files for RQ3
- Supplementary Information
- Citations
- Support
Traditionally, software vulnerability detection research has focused on individual small functions due to earlier language processing technologies’ limitations in handling larger inputs. However, this function-level approach may miss bugs that span multiple functions and code blocks. Recent advancements in artificial intelligence have enabled processing of larger inputs, leading everyday software developers to increasingly rely on chat-based large language models (LLMs) like GPT-3.5 and GPT-4 to detect vulnerabilities across entire files, not just within functions. This new development practice requires researchers to urgently investigate whether commonly used LLMs can effectively analyze large file-sized inputs, in order to provide timely insights for software developers and engineers about the pros and cons of this emerging technological trend. Hence, the goal of this paper is to evaluate the effectiveness of several state-of-the-art chat-based LLMs, including the GPT models, in detecting in-file vulnerabilities. We conducted a costly investigation into how the performance of LLMs varies based on vulnerability type, input size, and vulnerability location within the file. To give enough statistical power (β ≥ .8) to our study, we could only focus on the three most common (as well as dangerous) vulnerabilities: XSS, SQL injection, and path traversal. Our findings indicate that the effectiveness of LLMs in detecting these vulnerabilities is strongly influenced by both the location of the vulnerability and the overall size of the input. Specifically, regardless of the vulnerability type, LLMs tend to significantly (p < .05) underperform when detecting vulnerabilities located toward the end of larger files—a pattern we call the ‘lost-in-the-end’ effect.
Finally, to further support software developers and practitioners, we also explored the optimal input size for these LLMs and presented a simple strategy for identifying it, which can be applied to other models and vulnerability types. Eventually, we show how adjusting the input size can lead to significant improvements in LLM-based vulnerability detection, with an average recall increase of over 37% across all models.
This repository is tested and recommended on:
- OS: Windows 11 (version 23H2, build 22631.3593), Linux (Debian 5.10.179 or newer) and macOS (13.2.1 Ventura or newer)
- Python version: 3.11 or newer
To use this package, you must set up two environment variables: `GITHUB_TOKEN` and `OPENAI_API_KEY`. These variables represent your personal access credentials for GitHub and OpenAI. By setting them, you ensure that your development environment can securely interact with these services without hardcoding sensitive information into your codebase. This approach enhances security and simplifies configuration management, making it easier to update credentials or share projects without exposing private keys.
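As an illustration of why this matters in practice, a script can read these variables at runtime and fail fast with a clear message when one is missing. The sketch below is ours, not code from this repository (the helper name `require_env` is an assumption):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, failing fast if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Example usage (uncomment once the variables are set):
# github_token = require_env("GITHUB_TOKEN")
# openai_api_key = require_env("OPENAI_API_KEY")
```

Failing at startup with an explicit message is easier to debug than an authentication error deep inside an API call.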
To use open-source models, you can use any API compatible with the OpenAI package. However, to minimize changes to your current setup, we recommend using Ollama. To install Ollama, follow the installation steps on their official website. If you prefer to use a different provider, you'll need to adjust environment variables—such as model aliases, endpoint configurations, and the API key—to match your chosen service.
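For illustration, providers with OpenAI-compatible APIs accept the same chat-completions request shape, so switching providers mostly means changing the base URL and model alias. The sketch below only builds the request body, without sending it; the endpoint URL and model alias are placeholder assumptions, not values from this repository (Ollama's OpenAI-compatible API is typically served under `/v1` on port 11434):

```python
# Placeholder values for illustration: adjust them to your provider
# (e.g., a local Ollama server exposing an OpenAI-compatible API).
BASE_URL = "http://localhost:11434/v1"   # assumed local Ollama endpoint
MODEL_ALIAS = "llama3:70b"               # assumed provider-specific model name

def build_chat_request(prompt: str) -> dict:
    """Build the JSON body for a POST to f"{BASE_URL}/chat/completions"."""
    return {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("Locate the vulnerability in this file: ...")
```

Because the request shape stays fixed, only `BASE_URL`, `MODEL_ALIAS`, and the API key need to change when swapping providers.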
- Copy `.env.example` to `.env`: `cp .env.example .env`
- Open the `.env` file in your favorite text editor and replace the placeholder values with your actual credentials: `GITHUB_TOKEN=your_github_token` and `OPENAI_API_KEY=your_openai_api_key`
Alternatively, you could set up the environment by directly setting environment variables. However, this only works with a UNIX-like OS.
On UNIX-like Operating Systems (Linux, MacOS):
- Open your terminal.
- To set the `OPENAI_API_KEY` variable, run: `export OPENAI_API_KEY='your_api_key'`
- To set the `GITHUB_TOKEN` variable, run: `export GITHUB_TOKEN='your_github_key'`
- These commands set the environment variables for your current session. To make them permanent, add the above lines to your shell profile (`~/.bashrc`, `~/.bash_profile`, `~/.zshrc`, etc.)
To ensure you've set up the environment variables correctly:
- In your terminal or command prompt, run:
echo $OPENAI_API_KEY
This should display your OpenAI API key.
If you don't have an account with any of these providers, create one and follow the instructions on their respective websites to obtain your API token: GitHub, OpenAI
- Create a new conda environment using the provided `environment.yml` file:
conda env create -f environment.yml
- Activate the environment:
conda activate infile_vulnerability_localization
Many scripts in this repository are Jupyter Notebooks (`.ipynb` files). To install and set up Jupyter Notebook:
- Install Jupyter Notebook within the conda environment:
conda install -c conda-forge jupyterlab
- Launch Jupyter Notebook:
jupyter notebook
- A new browser window should open with the Jupyter Notebook interface, allowing you to run and edit the `.ipynb` files.
To ensure Jupyter is correctly installed, you can check the version:
jupyter --version
For complete control over dependencies and research files, we provide a fully configured Docker image that includes all necessary dependencies and fixed cache files. This allows you to reproduce our research without needing to re-execute all LLM calls. The Docker container mirrors the structure of the cloned repository.
If you prefer to rerun all scripts with fresh LLM analysis while using Docker, you can do so by adding `-v ${PWD}:/app` to the `docker run` command. In this case, some caches will be ignored, and you will need to create a `.env` file, which will be copied into the container.
Our repository includes both Jupyter notebooks and standard Python scripts, requiring different execution approaches. Follow the instructions below:
- Launch the Jupyter endpoint:
docker run -p 8888:8888 baueradam/infile_vulnerability_localization
- Connect to the Jupyter environment at http://localhost:8888.
Two methods for executing Python scripts:
Method 1: One-Time Execution
- Launch a container and run a specific script:
docker run --rm -it -w /app baueradam/infile_vulnerability_localization bash -c "cd 1_all_files_analysis && python count_functions.py 79"
Method 2: Interactive Session
- Create a container for interactive use:
docker run --rm -it -w /app baueradam/infile_vulnerability_localization bash
- Then, execute scripts as needed:
cd 1_all_files_analysis
python count_functions.py 79
Important:
To ensure file paths resolve correctly, run each script from its own directory within the container.
All the following scripts are designed to run experiments for these LLMs:
- gpt-3.5-turbo
- gpt-4-turbo
- gpt-4o
- llama3-70b-8192
- mixtral-8x7b-32768
- mixtral-8x22b-65536
This folder contains the code to extract the dataset for the task of bug detection in code.
- Notebooks:
  - `01gather_data.ipynb`: Jupyter Notebook containing the code to gather data from the CVE catalog (NIST database).
  - `02create_files.ipynb`: Contains the code to create the files for the dataset via GitHub scraping.
- Output: These two scripts produce three CSV files containing single commit files of CWE-79, CWE-89, and CWE-22. These files are saved in `1_all_files_analysis/cve_data`.
This folder contains the scripts, data files, and analysis results for the first experiment of RQ2 (RQ2.1) and the RQ1 experiments.
- Data Folder: `cve_data`
  - `files_CWE-22.csv`, `files_CWE-79.csv`, `files_CWE-89.csv`: Contain the vulnerabilities (and their patches) we extracted for each CWE from the CVE catalog.
- Notebooks and Other Python Scripts:
  - `run_prompts.py`: Sends the bug localization prompts for a given LLM and CWE number (provided as parameters), saving the results in the `model_outputs` subfolder. The `cache` subfolder also contains other intermediate results obtained by prompting the GPT models.
  - `analyse_results.ipynb`: For each LLM, it computes the accuracy, precision, recall, and other statistics (i.e., logistic regressions) to answer RQ1 and RQ2.1 and to understand the impact of bug position and file size on in-file vulnerability localisation. The visualisations are saved in the `results` subfolder.
  - `count_functions.py`: Computes function statistics (number per file and average size) on the data provided in the `cve_data` subfolder.
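As an illustration of the metrics the analysis notebooks report, accuracy, precision, recall, and F1 score can all be derived from confusion-matrix counts. This is a generic sketch of the standard formulas, not code taken from `analyse_results.ipynb`:

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }

# Illustrative counts (not results from the paper):
m = classification_metrics(tp=40, fp=10, tn=35, fn=15)
```

In the vulnerability-localization setting, a true positive means the LLM flagged the actual buggy location, so recall directly measures how many planted vulnerabilities are found.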
This folder contains the scripts, data files, and analysis results for the second (code-in-the-haystack) experiment of RQ2 (RQ2.2).
- Subfolder: `source_files`
  - Contains 15 source files organised by programming language.
  - Each file has an assigned ID and contains multiple versions:
    - originalFile: The original, unmodified file.
    - originalBuggy: The smallest buggy snippet identified (at the function level).
    - smallestBuggy: The smallest buggy file, refactored.
    - modifiedFile: The modified file without the bug. If the buggy function was extracted, it includes a refactored main function and added comments for clear separation.
    - additional_padding: A collection of files from the same repository to add as padding, separated by function with clear separation comments.
- Notebooks:
  - `create_files_with_padding.ipynb`: Using the source files in the `source_files` subfolder, this script creates the code-in-the-haystack files used for RQ2.2. The resulting files are saved in `data_to_process/files.csv`.
  - `run_inference.ipynb`: Runs RQ2.2 on the code-in-the-haystack files extracted by the previous script. The results of the bug localisation tasks are saved in the `runs` subfolder.
  - `analyse_data.ipynb`: For each LLM, it computes the accuracy, precision, recall, and other statistics (i.e., logistic regressions) to answer RQ2.2 and understand the impact of bug position and file size on in-file vulnerability localisation. The visualisations are saved in the `results` subfolder.
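Conceptually, the code-in-the-haystack construction places a known buggy snippet at a controlled position among padding functions, so bug position and file size can be varied independently. The sketch below illustrates the idea only; the function names and the toy snippets are ours, not the notebook's implementation:

```python
def build_haystack(padding_funcs: list[str], buggy_snippet: str,
                   insert_index: int) -> str:
    """Insert the buggy snippet among padding functions at a chosen index,
    so its position in the file can be controlled experimentally."""
    parts = (padding_funcs[:insert_index]
             + [buggy_snippet]
             + padding_funcs[insert_index:])
    return "\n\n".join(parts)

# Toy padding functions and a toy planted bug (illustrative only):
pad = [f"def pad_{i}():\n    return {i}" for i in range(4)]
bug = "def vulnerable():\n    return eval(user_input)  # the planted bug"
haystack = build_haystack(pad, bug, insert_index=2)
```

Sliding `insert_index` from 0 to `len(padding_funcs)` moves the bug from the start of the file to its end while keeping total size constant, which is exactly what is needed to isolate the position effect.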
This folder contains the scripts, data files, and analysis results for the RQ3 experiments.
- Python Scripts:
  - `run_prompts.py`: Chunks the files used for RQ1 using different chunk sizes, then sends the bug localization prompts for all the CWE types once the LLM name is provided as a parameter. The results are stored in the `results` subfolder, which contains one CSV table per LLM reporting the RQ3 results in terms of accuracy, precision, recall, and F1 score.
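The chunking step can be pictured as splitting a file's lines into fixed-size windows that are analysed independently. This is a simplified sketch of the general technique, under our own assumptions (including the optional `overlap` parameter), not the script's actual implementation:

```python
def chunk_lines(source: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split source code into chunks of at most chunk_size lines.
    An optional overlap keeps bugs near chunk borders visible in two chunks."""
    lines = source.splitlines()
    step = max(chunk_size - overlap, 1)
    return ["\n".join(lines[i:i + chunk_size])
            for i in range(0, len(lines), step)]

# Example: a 10-line file split into 4-line chunks with 1 line of overlap.
code = "\n".join(f"line {i}" for i in range(10))
chunks = chunk_lines(code, chunk_size=4, overlap=1)
```

Smaller chunks keep each vulnerability closer to the start of its own input, which is the intuition behind using chunk size to counter the lost-in-the-end effect.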
The pre-generated cache files required for the experiments in the `./3_optimal_position` folder are not included in this repository. To facilitate reproducibility and save time, these cache files have been made available separately.
Action Required:
Please download the cache files from Zenodo using the following DOI: 10.5281/zenodo.14840311.
Instructions:
- Click on the DOI link above or visit it directly in your browser.
- Download the provided archive containing the cache files.
- Extract the contents of the archive.
- Place the extracted files into the `./3_optimal_position/cache` directory, ensuring that the folder structure remains unchanged so that the scripts can locate the cache files correctly.
By following these steps, you will have all the necessary cache data to reproduce the RQ3 experiments without needing to re-run the computationally expensive LLM calls.
Below, we provide the appendix of the paper, which comprises supplementary information that could not be included in the main paper due to page constraints.
We hereby follow APA guidelines to report the regression coefficients, 95% confidence intervals, effect sizes (i.e., odds ratios), and p-values for each model term (intercept and predictors). Additionally, we conclude with a paragraph interpreting the statistics in relation to the paper's claims, as recommended by APA guidelines.
Below are the tables with the regression results of Fig. 2 (see paper):
Model: mixtral-8x7b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.49 | [-0.21, 1.18] | 0.173 | 1.632 |
"" | file_len | -0.13 | [-0.20, -0.05] | 0.001 | 0.878 |
CWE-22 | intercept | 0.45 | [-0.16, 1.06] | 0.144 | 1.568 |
"" | bug_pos | -0.52 | [-0.82, -0.21] | 0.001 | 0.595 |
CWE-89 | intercept | 0.48 | [-0.10, 1.07] | 0.106 | 1.616 |
"" | file_len | -0.14 | [-0.22, -0.07] | 0.000 | 0.869 |
CWE-89 | intercept | 0.06 | [-0.43, 0.55] | 0.809 | 1.062 |
"" | bug_pos | -0.28 | [-0.45, -0.10] | 0.002 | 0.756 |
CWE-79 | intercept | -0.90 | [-1.22, -0.58] | 0.000 | 0.407 |
"" | file_len | -0.06 | [-0.09, -0.03] | 0.000 | 0.942 |
CWE-79 | intercept | -1.03 | [-1.31, -0.75] | 0.000 | 0.357 |
"" | bug_pos | -0.14 | [-0.21, -0.07] | 0.000 | 0.869 |
Model: mixtral-8x22b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.65 | [0.03, 1.28] | 0.041 | 1.915 |
"" | file_len | -0.07 | [-0.13, -0.02] | 0.005 | 0.933 |
CWE-22 | intercept | 0.51 | [-0.01, 1.03] | 0.056 | 1.665 |
"" | bug_pos | -0.20 | [-0.34, -0.07] | 0.003 | 0.818 |
CWE-89 | intercept | 0.48 | [-0.00, 0.96] | 0.052 | 1.617 |
"" | file_len | -0.07 | [-0.10, -0.03] | 0.000 | 0.933 |
CWE-89 | intercept | 0.25 | [-0.17, 0.67] | 0.239 | 1.284 |
"" | bug_pos | -0.13 | [-0.20, -0.05] | 0.001 | 0.878 |
CWE-79 | intercept | -0.22 | [-0.54, 0.10] | 0.176 | 0.802 |
"" | file_len | -0.10 | [-0.14, -0.07] | 0.000 | 0.905 |
CWE-79 | intercept | -0.55 | [-0.81, -0.28] | 0.000 | 0.577 |
"" | bug_pos | -0.20 | [-0.28, -0.12] | 0.000 | 0.818 |
Model: llama-3-70b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.57 | [-0.11, 1.24] | 0.103 | 1.768 |
"" | file_len | -0.10 | [-0.17, -0.03] | 0.004 | 0.905 |
CWE-22 | intercept | 0.09 | [-0.42, 0.60] | 0.736 | 1.094 |
"" | bug_pos | -0.13 | [-0.26, -0.01] | 0.039 | 0.878 |
CWE-89 | intercept | 0.52 | [0.00, 1.05] | 0.050 | 1.681 |
"" | file_len | -0.07 | [-0.12, -0.02] | 0.004 | 0.934 |
CWE-89 | intercept | 0.47 | [0.01, 0.92] | 0.046 | 1.600 |
"" | bug_pos | -0.19 | [-0.31, -0.06] | 0.003 | 0.827 |
CWE-79 | intercept | -0.45 | [-0.78, -0.11] | 0.010 | 0.638 |
"" | file_len | -0.09 | [-0.13, -0.05] | 0.000 | 0.914 |
CWE-79 | intercept | -0.79 | [-1.07, -0.52] | 0.000 | 0.454 |
"" | bug_pos | -0.12 | [-0.19, -0.06] | 0.000 | 0.888 |
Model: gpt-3.5-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.13 | [-0.49, 0.74] | 0.685 | 1.139 |
"" | file_len | -0.05 | [-0.10, -0.00] | 0.044 | 0.951 |
CWE-22 | intercept | 0.06 | [-0.45, 0.57] | 0.818 | 1.062 |
"" | bug_pos | -0.16 | [-0.29, -0.03] | 0.017 | 0.852 |
CWE-89 | intercept | 0.84 | [0.32, 1.35] | 0.001 | 2.320 |
"" | file_len | -0.09 | [-0.14, -0.05] | 0.000 | 0.914 |
CWE-89 | intercept | 0.32 | [-0.09, 0.73] | 0.131 | 1.378 |
"" | bug_pos | -0.11 | [-0.17, -0.04] | 0.002 | 0.896 |
CWE-79 | intercept | -0.73 | [-1.06, -0.40] | 0.000 | 0.482 |
"" | file_len | -0.08 | [-0.11, -0.04] | 0.000 | 0.923 |
CWE-79 | intercept | -1.00 | [-1.28, -0.72] | 0.000 | 0.368 |
"" | bug_pos | -0.13 | [-0.20, -0.06] | 0.000 | 0.878 |
Model: gpt-4-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.95 | [0.34, 1.56] | 0.002 | 2.585 |
"" | file_len | -0.05 | [-0.09, -0.01] | 0.021 | 0.951 |
CWE-22 | intercept | 1.01 | [0.46, 1.55] | 0.000 | 2.749 |
"" | bug_pos | -0.18 | [-0.29, -0.06] | 0.002 | 0.835 |
CWE-89 | intercept | 0.77 | [0.27, 1.27] | 0.002 | 2.160 |
"" | file_len | -0.08 | [-0.12, -0.04] | 0.000 | 0.923 |
CWE-89 | intercept | 0.56 | [0.13, 1.00] | 0.011 | 1.752 |
"" | bug_pos | -0.18 | [-0.28, -0.08] | 0.000 | 0.835 |
CWE-79 | intercept | -0.54 | [-0.83, -0.25] | 0.000 | 0.583 |
"" | file_len | -0.06 | [-0.08, -0.03] | 0.000 | 0.941 |
CWE-79 | intercept | -0.52 | [-0.79, -0.26] | 0.000 | 0.594 |
"" | bug_pos | -0.19 | [-0.27, -0.12] | 0.000 | 0.827 |
Model: gpt-4o
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.83 | [0.23, 1.43] | 0.007 | 2.294 |
"" | file_len | -0.05 | [-0.09, -0.01] | 0.022 | 0.951 |
CWE-22 | intercept | 0.87 | [0.33, 1.40] | 0.001 | 2.389 |
"" | bug_pos | -0.17 | [-0.29, -0.06] | 0.003 | 0.844 |
CWE-89 | intercept | 0.92 | [0.42, 1.42] | 0.000 | 2.510 |
"" | file_len | -0.07 | [-0.11, -0.04] | 0.000 | 0.933 |
CWE-89 | intercept | 0.50 | [0.09, 0.91] | 0.017 | 1.649 |
"" | bug_pos | -0.09 | [-0.15, -0.04] | 0.001 | 0.914 |
CWE-79 | intercept | -0.22 | [-0.49, 0.06] | 0.121 | 0.802 |
"" | file_len | -0.05 | [-0.07, -0.03] | 0.000 | 0.951 |
CWE-79 | intercept | -0.20 | [-0.44, 0.05] | 0.112 | 0.819 |
"" | bug_pos | -0.18 | [-0.24, -0.12] | 0.000 | 0.835 |
Below are the tables with the regression results of Fig. 4:
Model: mixtral-8x7b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.04 | [-0.17, 0.25] | 0.706 | 1.041 |
"" | file_len | -0.05 | [-0.06, -0.03] | 0.000 | 0.951 |
CWE-22 | intercept | -0.19 | [-0.32, -0.05] | 0.006 | 0.827 |
"" | bug_pos | -0.07 | [-0.08, -0.06] | 0.000 | 0.933 |
CWE-89 | intercept | -0.66 | [-0.89, -0.43] | 0.000 | 0.517 |
"" | file_len | -0.04 | [-0.05, -0.03] | 0.000 | 0.961 |
CWE-89 | intercept | -0.80 | [-0.95, -0.65] | 0.000 | 0.449 |
"" | bug_pos | -0.07 | [-0.08, -0.05] | 0.000 | 0.933 |
CWE-79 | intercept | -0.34 | [-0.56, -0.13] | 0.002 | 0.712 |
"" | file_len | -0.04 | [-0.05, -0.03] | 0.000 | 0.961 |
CWE-79 | intercept | -0.87 | [-1.01, -0.73] | 0.000 | 0.420 |
"" | bug_pos | -0.02 | [-0.03, -0.00] | 0.015 | 0.980 |
Model: mixtral-8x22b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.31 | [0.11, 0.52] | 0.003 | 1.363 |
"" | file_len | -0.06 | [-0.07, -0.05] | 0.000 | 0.942 |
CWE-89 | intercept | 0.01 | [-0.21, 0.23] | 0.941 | 1.010 |
"" | file_len | -0.08 | [-0.09, -0.07] | 0.000 | 0.923 |
CWE-79 | intercept | 0.14 | [-0.06, 0.35] | 0.164 | 1.150 |
"" | file_len | -0.02 | [-0.03, -0.01] | 0.001 | 0.980 |
CWE-22 | intercept | -0.28 | [-0.41, -0.15] | 0.000 | 0.756 |
"" | bug_pos | -0.05 | [-0.07, -0.04] | 0.000 | 0.951 |
CWE-89 | intercept | -0.12 | [-0.27, 0.03] | 0.113 | 0.887 |
"" | bug_pos | -0.18 | [-0.20, -0.16] | 0.000 | 0.835 |
CWE-79 | intercept | 0.13 | [0.01, 0.26] | 0.042 | 1.139 |
"" | bug_pos | -0.04 | [-0.05, -0.02] | 0.000 | 0.961 |
Model: llama-3-70b
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.78 | [0.57, 0.99] | 0.000 | 2.181 |
"" | file_len | -0.04 | [-0.05, -0.03] | 0.000 | 0.961 |
CWE-89 | intercept | 0.20 | [-0.02, 0.43] | 0.072 | 1.222 |
"" | file_len | -0.09 | [-0.11, -0.08] | 0.000 | 0.914 |
CWE-79 | intercept | -0.16 | [-0.37, 0.05] | 0.135 | 0.852 |
"" | file_len | -0.03 | [-0.04, -0.02] | 0.000 | 0.971 |
CWE-22 | intercept | 0.29 | [0.16, 0.42] | 0.000 | 1.336 |
"" | bug_pos | -0.03 | [-0.04, -0.02] | 0.000 | 0.971 |
CWE-89 | intercept | -0.06 | [-0.21, 0.09] | 0.440 | 0.941 |
"" | bug_pos | -0.20 | [-0.22, -0.17] | 0.000 | 0.819 |
CWE-79 | intercept | -0.49 | [-0.62, -0.36] | 0.000 | 0.612 |
"" | bug_pos | -0.02 | [-0.04, -0.01] | 0.000 | 0.980 |
Model: gpt-3.5-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.51 | [0.31, 0.72] | 0.000 | 1.665 |
"" | file_len | -0.03 | [-0.04, -0.02] | 0.000 | 0.970 |
CWE-89 | intercept | -0.63 | [-0.88, -0.38] | 0.000 | 0.533 |
"" | file_len | -0.07 | [-0.08, -0.05] | 0.000 | 0.932 |
CWE-79 | intercept | -1.14 | [-1.39, -0.89] | 0.000 | 0.320 |
"" | file_len | -0.02 | [-0.03, -0.01] | 0.002 | 0.980 |
CWE-22 | intercept | -0.00 | [-0.13, 0.12] | 0.963 | 1.000 |
"" | bug_pos | -0.01 | [-0.02, -0.00] | 0.033 | 0.990 |
CWE-89 | intercept | -0.71 | [-0.88, -0.54] | 0.000 | 0.492 |
"" | bug_pos | -0.16 | [-0.19, -0.14] | 0.000 | 0.852 |
CWE-79 | intercept | -1.50 | [-1.66, -1.33] | 0.000 | 0.223 |
"" | bug_pos | -0.00 | [-0.02, 0.01] | 0.897 | 1.000 |
Model: gpt-4-turbo
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 1.34 | [1.11, 1.56] | 0.000 | 3.822 |
"" | file_len | -0.04 | [-0.06, -0.03] | 0.000 | 0.961 |
CWE-89 | intercept | 0.42 | [0.21, 0.62] | 0.000 | 1.522 |
"" | file_len | -0.05 | [-0.06, -0.04] | 0.000 | 0.951 |
CWE-79 | intercept | 0.21 | [0.01, 0.42] | 0.040 | 1.233 |
"" | file_len | -0.01 | [-0.02, 0.00] | 0.086 | 0.990 |
CWE-22 | intercept | 1.14 | [1.01, 1.28] | 0.000 | 3.128 |
"" | bug_pos | -0.07 | [-0.08, -0.06] | 0.000 | 0.933 |
CWE-89 | intercept | 0.04 | [-0.09, 0.16] | 0.595 | 1.041 |
"" | bug_pos | -0.07 | [-0.08, -0.05] | 0.000 | 0.933 |
CWE-79 | intercept | 0.13 | [0.00, 0.25] | 0.048 | 1.139 |
"" | bug_pos | -0.01 | [-0.02, 0.00] | 0.123 | 0.990 |
Model: gpt-4o
CWE | Term | B | 95% CI | p-value | Odds Ratio |
---|---|---|---|---|---|
CWE-22 | intercept | 0.53 | [0.32, 0.75] | 0.000 | 1.699 |
"" | file_len | -0.09 | [-0.11, -0.08] | 0.000 | 0.914 |
CWE-89 | intercept | 0.94 | [0.73, 1.16] | 0.000 | 2.563 |
"" | file_len | -0.11 | [-0.12, -0.10] | 0.000 | 0.896 |
CWE-79 | intercept | 0.75 | [0.54, 0.96] | 0.000 | 2.117 |
"" | file_len | -0.02 | [-0.03, -0.01] | 0.000 | 0.980 |
CWE-22 | intercept | 0.42 | [0.27, 0.57] | 0.000 | 1.521 |
"" | bug_pos | -0.23 | [-0.25, -0.20] | 0.000 | 0.793 |
CWE-89 | intercept | 0.76 | [0.60, 0.91] | 0.000 | 2.137 |
"" | bug_pos | -0.26 | [-0.28, -0.24] | 0.000 | 0.771 |
CWE-79 | intercept | 0.61 | [0.48, 0.74] | 0.000 | 1.840 |
"" | bug_pos | -0.03 | [-0.04, -0.01] | 0.000 | 0.970 |
The results indicate a negative association between both bug position and file size with the probability of bug detection. For instance:
- The significant negative coefficients for bug_pos (e.g., -0.52 for CWE-22 in mixtral-8x7b) suggest that as the bug's position moves further within a file, the likelihood of detection decreases.
- Similarly, negative coefficients for file_len (e.g., -0.13 for CWE-22 in mixtral-8x7b) indicate that larger files are less likely to have their bugs detected.
- Bug position generally shows larger coefficients (in absolute terms) than file length. This suggests that bug position has a stronger effect on bug detection probability than file length.
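The odds ratios in the tables above are simply the exponentiated regression coefficients, OR = e^B. As a check, the following reproduces two values from the mixtral-8x7b CWE-22 rows:

```python
import math

def odds_ratio(coefficient: float) -> float:
    """Convert a logistic-regression coefficient B into an odds ratio e^B."""
    return math.exp(coefficient)

# From the mixtral-8x7b CWE-22 rows above:
or_bug_pos = round(odds_ratio(-0.52), 3)   # bug_pos coefficient B = -0.52
or_file_len = round(odds_ratio(-0.13), 3)  # file_len coefficient B = -0.13
```

An odds ratio below 1 means each unit increase in the predictor multiplies the odds of detection by that factor, which is why the negative coefficients translate into declining detection odds for later bug positions and longer files.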
Additionally, we include the results of a multiple logistic regression (i.e., combining both predictors), which remain consistent with those obtained from the simple logistic regressions above.
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.05, 95% CI [0.85, 1.29], p = 0.678
- Regression term: target_length, Odds Ratio = 0.98, 95% CI [0.97, 0.99], p = 0.005
- Regression term: target_bug_position, Odds Ratio = 0.94, 95% CI [0.93, 0.96], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.52, 95% CI [0.41, 0.66], p = 0.000
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.97, 1.00], p = 0.098
- Regression term: target_bug_position, Odds Ratio = 0.94, 95% CI [0.92, 0.96], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.71, 95% CI [0.57, 0.88], p = 0.002
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.94, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 1.01, 95% CI [1.00, 1.03], p = 0.174
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.36, 95% CI [1.11, 1.68], p = 0.003
- Regression term: target_length, Odds Ratio = 0.95, 95% CI [0.94, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.98, 95% CI [0.96, 0.99], p = 0.003
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.16, 95% CI [0.92, 1.45], p = 0.219
- Regression term: target_length, Odds Ratio = 0.98, 95% CI [0.96, 0.99], p = 0.003
- Regression term: target_bug_position, Odds Ratio = 0.85, 95% CI [0.83, 0.87], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.15, 95% CI [0.94, 1.41], p = 0.184
- Regression term: target_length, Odds Ratio = 1.00, 95% CI [0.99, 1.01], p = 0.937
- Regression term: target_bug_position, Odds Ratio = 0.97, 95% CI [0.95, 0.98], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 2.17, 95% CI [1.76, 2.67], p = 0.000
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.95, 0.98], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.99, 95% CI [0.98, 1.00], p = 0.137
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.44, 95% CI [1.14, 1.81], p = 0.002
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.95, 0.98], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.84, 95% CI [0.82, 0.86], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.85, 95% CI [0.69, 1.05], p = 0.131
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.99], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.99, 95% CI [0.98, 1.01], p = 0.373
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.67, 95% CI [1.36, 2.05], p = 0.000
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.95, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 1.01, 95% CI [1.00, 1.03], p = 0.092
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.59, 95% CI [0.46, 0.77], p = 0.000
- Regression term: target_length, Odds Ratio = 0.98, 95% CI [0.97, 1.00], p = 0.058
- Regression term: target_bug_position, Odds Ratio = 0.86, 95% CI [0.83, 0.88], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 0.32, 95% CI [0.25, 0.41], p = 0.000
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.99], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 1.02, 95% CI [1.00, 1.04], p = 0.059
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 3.76, 95% CI [2.99, 4.71], p = 0.000
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.97, 1.00], p = 0.047
- Regression term: target_bug_position, Odds Ratio = 0.94, 95% CI [0.93, 0.95], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.52, 95% CI [1.23, 1.86], p = 0.000
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.98], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.95, 95% CI [0.94, 0.97], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 1.24, 95% CI [1.01, 1.51], p = 0.041
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.98, 1.01], p = 0.297
- Regression term: target_bug_position, Odds Ratio = 1.00, 95% CI [0.98, 1.01], p = 0.469
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 2.14, 95% CI [1.70, 2.69], p = 0.000
- Regression term: target_length, Odds Ratio = 0.97, 95% CI [0.96, 0.99], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.81, 95% CI [0.79, 0.83], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 3.53, 95% CI [2.79, 4.47], p = 0.000
- Regression term: target_length, Odds Ratio = 0.96, 95% CI [0.95, 0.97], p = 0.000
- Regression term: target_bug_position, Odds Ratio = 0.79, 95% CI [0.77, 0.81], p = 0.000
Predictors: `target_length`, `target_bug_position`
- Regression term: const, Odds Ratio = 2.10, 95% CI [1.70, 2.59], p = 0.000
- Regression term: target_length, Odds Ratio = 0.99, 95% CI [0.98, 1.00], p = 0.127
- Regression term: target_bug_position, Odds Ratio = 0.98, 95% CI [0.97, 0.99], p = 0.004
This code is free to use. If you use it anywhere, please cite us:
@inproceedings{sovrano2025llms,
title={Large Language Models for In-File Vulnerability Localization Can Be “Lost in the End”},
author={Sovrano, Francesco and Bauer, Adam and Bacchelli, Alberto},
booktitle={Proceedings of ACM International Conference on the Foundations of Software Engineering 2025 (FSE’25)},
year={2025},
doi={10.1145/3715758},
organization={ACM}
}
Thank you!
For any problem or question, please contact me at [email protected]