Automated Detection of Typed Links in Issue Trackers

This repository contains the data, experiments, and analysis for the RE2022 submission "Automated Detection of Typed Links in Issue Trackers". With this repository you should be able to replicate the experiments or use the model on your own datasets.

Author Information

Clara Marie Lüders, University of Hamburg
Tim Pietz, University of Hamburg
Walid Maalej, University of Hamburg

Description of Artifact

There are three subfolders:

data
tld
pics

data

Contains the data we used for our analysis and machine learning models. This folder contains the following subfolders: raw, processed, splits, results. The folder also contains the script data_extract.py and a few intermediate results from some analyses (.csv files), e.g. overviews of user numbers per repository and other properties are saved in repo_overview.csv and user_numbers.csv.

We used the JIRA data from here: .

The script data_extract.py accesses the MongoDB, goes through all the entries inside one collection, and saves all issues and all links into raw. It also contains a function to calculate the number of contributors (result already saved as .csv in the data folder). Please run the data_extract.py script from within the data folder, as the path is set accordingly. The extract functions for the issues and links use JiraRepos as the database name, so if your database has another name, change it accordingly.
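
As a rough illustration of the link extraction step, the following sketch collects link records from issue documents held in memory. It is not the code in data_extract.py: the document layout (key, fields.issuelinks, outwardIssue, type.name) is assumed to follow the Jira REST issue format, the function name extract_links is hypothetical, and following only outward links is an assumption made so that each link is recorded once.

```python
from typing import Any, Dict, Iterable, List


def extract_links(issues: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Collect one record per outward link found in the given issue documents."""
    links = []
    for issue in issues:
        source_key = issue.get("key", "")
        fields = issue.get("fields") or {}
        for link in fields.get("issuelinks") or []:
            # Assumption: only outward links are followed, so a link that is
            # stored on both of its issues is recorded a single time.
            target = link.get("outwardIssue")
            if target is None:
                continue
            links.append({
                "issue_id_1": source_key,
                "issue_id_2": target.get("key", ""),
                "linktype": (link.get("type") or {}).get("name", ""),
            })
    return links


# Hypothetical sample documents; real ones would come from the MongoDB collection.
sample = [
    {"key": "A-1", "fields": {"issuelinks": [
        {"type": {"name": "Duplicate"}, "outwardIssue": {"key": "A-2"}},
        {"type": {"name": "Relates"}, "inwardIssue": {"key": "A-3"}},
    ]}},
    {"key": "A-2", "fields": {}},
]
print(extract_links(sample))
```

With a running MongoDB server, the iterable passed to such a function could be a cursor over one collection instead of the in-memory list used here.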

tld

Contains the Python scripts to run all models. They save their data into the results folder.

pics

Contains the figures included in the paper and further figures that were not included in the paper.

REJ Additions:

The added files are marked with REJ.
You have to adapt the data preparation for the individual test cases (filter out Relates, map to link vs. non-link; see linktypes.py).
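
The two adaptations named above could look roughly like the following sketch. The record layout and the label strings "link" and "non-link" are assumptions for illustration, and both function names are hypothetical; the mappings actually used live in linktypes.py.

```python
from typing import Dict, List


def drop_link_type(links: List[Dict[str, str]], link_type: str = "Relates") -> List[Dict[str, str]]:
    """Remove every link whose type equals link_type."""
    return [link for link in links if link["linktype"] != link_type]


def to_link_vs_non_link(links: List[Dict[str, str]], non_link_label: str = "non-link") -> List[Dict[str, str]]:
    """Collapse all real link types into the single label "link"."""
    return [
        {**link, "linktype": non_link_label if link["linktype"] == non_link_label else "link"}
        for link in links
    ]


# Hypothetical example records.
links = [
    {"issue_id_1": "A-1", "issue_id_2": "A-2", "linktype": "Relates"},
    {"issue_id_1": "A-1", "issue_id_2": "A-3", "linktype": "Duplicate"},
    {"issue_id_1": "A-2", "issue_id_2": "A-4", "linktype": "non-link"},
]
print(drop_link_type(links))
print(to_link_vs_non_link(links))
```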

Jupyter Notebooks

BERT_results_correlations

Extracts the results on the test data from the results folder and calculates the precision, recall, and F1-score per repository. It also calculates the correlations of the macro F1-scores to properties of the repositories and link types.
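
For readers who want to check the metric definitions, here is a minimal sketch of per-class precision, recall, and F1 together with the macro F1-score. It is a plain-Python illustration under the usual definitions, not the notebook's code; the function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted label was wrong
            fn[truth] += 1  # the true label was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


y_true = ["Duplicate", "Duplicate", "Block", "Block"]
y_pred = ["Duplicate", "Block", "Block", "Block"]
scores = per_class_scores(y_true, y_pred)
print(scores, macro_f1(scores))
```

Because the macro average weights every link type equally, rare link types influence it as much as frequent ones.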

Create_Word_Models

Creates word2vec and fastText models and embedding vectors for the SCCNN and DCCNN experiments. These are saved under data.

DetailedTestdata_Top3Prediction

Connects the results on the test data to their input texts. Contains an analysis of the text length and the results of the optimization strategy "Top3 Prediction", which predicts the top 3 possible labels based on the logits; it can be adapted to top-k prediction.
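
The idea of top-k prediction from logits can be sketched as follows. This is an illustrative assumption about the mechanics rather than the notebook's implementation; the function names, the label names other than Relates, and the logit values are made up.

```python
from typing import List, Sequence


def top_k_labels(logits: Sequence[float], labels: Sequence[str], k: int = 3) -> List[str]:
    """Return the k labels with the highest logits, best first."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [labels[i] for i in ranked[:k]]


def top_k_hit(logits: Sequence[float], labels: Sequence[str], true_label: str, k: int = 3) -> bool:
    """A prediction counts as correct if the true label is among the top k."""
    return true_label in top_k_labels(logits, labels, k)


labels = ["Block", "Clone", "Duplicate", "Relates", "non-link"]
logits = [0.2, 2.1, 1.7, -0.5, 0.9]
print(top_k_labels(logits, labels))        # ['Clone', 'Duplicate', 'non-link']
print(top_k_hit(logits, labels, "Block"))  # False
```

Setting k=1 reduces this to the ordinary single-label prediction.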

Preprocessing

Preprocesses the raw data extracted with data_extract.py. Cleans issues and links: issues are removed when they have no title, and links are checked for duplicates etc. The notebook also adds 'non-links'.
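
A simplified sketch of these cleaning steps is shown below. It only approximates what the notebook does: the column names follow the issue.csv and link.csv layout described in this README, treating a link and its reverse as duplicates is an assumption, and all function names are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Set


def clean_issues(issues: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop issues that have no (or an empty) title."""
    return [issue for issue in issues if (issue.get("title") or "").strip()]


def deduplicate_links(links: Iterable[Dict[str, str]], valid_ids: Set[str]) -> List[Dict[str, str]]:
    """Keep the first link per issue pair; drop self-links and dangling links."""
    seen, result = set(), []
    for link in links:
        a, b = link["issue_id_1"], link["issue_id_2"]
        pair = frozenset((a, b))  # assumption: direction is ignored
        if a == b or a not in valid_ids or b not in valid_ids or pair in seen:
            continue
        seen.add(pair)
        result.append(link)
    return result


def sample_non_links(issue_ids: Set[str], links: List[Dict[str, str]], n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Draw up to n random issue pairs that are not linked."""
    rng = random.Random(seed)
    linked = {frozenset((l["issue_id_1"], l["issue_id_2"])) for l in links}
    ids = sorted(issue_ids)
    # Enumerating all pairs is quadratic; acceptable for a small illustration only.
    candidates = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
                  if frozenset((a, b)) not in linked]
    chosen = rng.sample(candidates, min(n, len(candidates)))
    return [{"issue_id_1": a, "issue_id_2": b, "linktype": "non-link"} for a, b in chosen]


issues = clean_issues([{"id": "1", "title": "a"}, {"id": "2", "title": ""},
                       {"id": "3", "title": "c"}, {"id": "4", "title": "d"}])
ids = {issue["id"] for issue in issues}
links = deduplicate_links([
    {"issue_id_1": "1", "issue_id_2": "3", "linktype": "Duplicate"},
    {"issue_id_1": "3", "issue_id_2": "1", "linktype": "Duplicate"},
    {"issue_id_1": "1", "issue_id_2": "2", "linktype": "Block"},
], ids)
print(links, sample_non_links(ids, links, n=5))
```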

Linktype_Properties

Calculates the cosine similarity of the issue texts of linked issues, as well as their lengths and the absolute difference. Saves the result as a .csv in data for further analysis.
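
To make the computed properties concrete, the sketch below derives the cosine similarity, both text lengths, and their absolute difference for one pair of issue texts. The whitespace tokenization and the raw term-count vectors are assumptions chosen for brevity; the notebook may use a different text representation, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Tuple


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts using raw term counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def length_features(text_a: str, text_b: str) -> Tuple[int, int, int]:
    """Return both lengths in tokens and their absolute difference."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    return len_a, len_b, abs(len_a - len_b)


first = "Login fails after password reset"
second = "Password reset breaks login"
print(cosine_similarity(first, second), length_features(first, second))
```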

Random_Majority_results

Calculates the F1-score and accuracy of the random and majority baseline.
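
The two baselines can be illustrated with the short sketch below. It assumes that the random baseline draws uniformly from the known classes, which the notebook may handle differently (for example by sampling according to the class distribution); the function names and labels are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def majority_baseline(train_labels: Sequence[str], n: int) -> List[str]:
    """Always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


def random_baseline(train_labels: Sequence[str], n: int, seed: int = 0) -> List[str]:
    """Predict a uniformly random known label (assumption: uniform sampling)."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n)]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train = ["Relates", "Relates", "Duplicate"]
test = ["Relates", "Duplicate"]
print(accuracy(test, majority_baseline(train, len(test))))  # 0.5
print(random_baseline(train, len(test)))
```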

Repository_Properties

Calculates the numbers of Table 1 and 2, then saves Table 1 as .csv for further analysis.

SCCNN_DCCNN_results

Calculates the F1-score of the SCCNN and DCCNN architectures.

System Requirements

Due to the size of the Jira dataset, we recommend a system with at least 24 GB of memory available. If you want to use the replication package:

with Docker, you will need an x86 architecture environment with Docker Compose installed.
without Docker, you will need a Linux system running on an x86 architecture with a working conda distribution like Miniconda and a MongoDB server.

Installation Instructions

With Docker

Clone the repository
Start the docker services with
```
docker compose up -d
```
Compose first builds a Docker image for the replication package, installing a Jupyter Notebook server alongside all the necessary Python dependencies. Additionally, it spins up a MongoDB instance that automatically initializes with the Jira dataset from Montgomery et al. ().
Observe the MongoDB container with
```
docker compose logs mongo -f
```
and wait for the import to finish. The MongoDB instance will continuously print log messages relating to the import at least every 5 seconds. After the import has finished, it will print a "Waiting for connections" message and the frequency of log messages will decrease significantly. Depending on your network connection, the import process might take 20-30 minutes.

View the Python image outputs with

docker compose logs lynx -f

and look for a message like

Jupyter Notebook 6.4.12 is running at:
http://79f633f1551b:8888/?token=[…]
 or http://127.0.0.1:8888/?token=[…]

Open the Jupyter Notebook instance using the link displayed in the logs.
Follow the steps in the "Steps to Reproduce" section below. In the steps where you need to run a Python script, you can use the terminal built into the Jupyter UI. The Jupyter Notebook Docker container volume-mounts your locally cloned repository, so all outputs are saved in the repository directory on your machine.

Without Docker

Download the Jira Dataset from Montgomery et al. (). Follow the instructions detailed in the README.md of step 3 on the site to import the data into your MongoDB server.
Set up the Python environment specified in the conda.yml and activate it with
```
conda env create -f conda.yml
conda activate tld
```
Follow the steps in the "Steps to Reproduce" section below.

Steps to Reproduce

Run the data_extract.py script from the data folder to extract issues and links into the data/raw directory. You can specify the MongoDB access details using CLI arguments
```
python data_extract.py --host [host] --port [port] --username [username] --password [password]
```
With the Docker Compose setup, use --host mongo and leave out the other arguments.
Preprocess the data with the Jupyter notebook Preprocessing.ipynb, which writes the processed data into data/processed
Run the experiments as detailed in the next README section "Running the experiments"
Run the jupyter notebook BERT_results_correlations.ipynb to see the results
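
For orientation, the connection arguments used in the first step could be turned into a MongoDB connection URI roughly as sketched below. This is not the argument handling of data_extract.py: only the four flag names come from this README, while the defaults (localhost and MongoDB's standard port 27017), the function names, and the URI construction are assumptions.

```python
import argparse
from urllib.parse import quote_plus


def build_parser() -> argparse.ArgumentParser:
    """Define the four connection flags named in the steps above."""
    parser = argparse.ArgumentParser(description="MongoDB connection options (illustrative)")
    parser.add_argument("--host", default="localhost")      # assumed default
    parser.add_argument("--port", type=int, default=27017)  # MongoDB's standard port
    parser.add_argument("--username", default=None)
    parser.add_argument("--password", default=None)
    return parser


def mongo_uri(args: argparse.Namespace) -> str:
    """Build a mongodb:// URI, percent-escaping credentials if present."""
    credentials = ""
    if args.username and args.password:
        credentials = f"{quote_plus(args.username)}:{quote_plus(args.password)}@"
    return f"mongodb://{credentials}{args.host}:{args.port}/"


# Equivalent to passing "--host mongo" in the Docker Compose setup.
args = build_parser().parse_args(["--host", "mongo"])
print(mongo_uri(args))  # mongodb://mongo:27017/
```

Escaping the credentials matters when a username or password contains characters such as @ or :, which would otherwise be misread as URI delimiters.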

Running the experiments

To train a BERT-based typed link detection model, run the tld.models.bert module. The module takes the training configuration as CLI parameters. For example, the following command replicates the paper results for the redhat repository.

python -m tld.models.bert \
  --model bert-base-uncased \
  --tracker redhat \
  --train-batch-size 48 \
  --eval-batch-size 128 \
  --n-epochs 30

To train one of the CNN-based models, run the tld.models.cnn module. Select a model architecture with the --model CLI argument, using either sccnn or dccnn. For example, the following command replicates our SCCNN results for the redhat repository.

python -m tld.models.cnn \
  --model sccnn \
  --tracker redhat

Using your own data

Create an issue.csv containing all issues and a link.csv containing all links in your dataset. The issue.csv should contain at least the columns id, title, description, and resolution (needed to create random non-links). The link.csv should contain the columns issue_id_1, issue_id_2, linktype, and name (we used issue_id_1+issue_id_2+linktype, but any unique identifier for a link works). These are the necessary columns to run the deep learning models. Some analyses, e.g. Repository_Properties.ipynb, examine the (sub-)projects an issue belongs to, so they will not run correctly if there is no projectid column. If your data contains link types that are not in the JIRA dataset, you might need to add a new entry to the dictionary found in tld/linktypes.py.
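
Before running the models on your own data, a quick column check such as the following sketch can catch missing columns early. It assumes the column names are spelled exactly as listed above; the helper missing_columns and the sample rows are hypothetical and not part of the repository.

```python
import csv
import io
from typing import Set, TextIO

ISSUE_COLUMNS = {"id", "title", "description", "resolution"}
LINK_COLUMNS = {"issue_id_1", "issue_id_2", "linktype", "name"}


def missing_columns(csv_file: TextIO, required: Set[str]) -> Set[str]:
    """Return the required column names that the CSV header lacks."""
    reader = csv.DictReader(csv_file)
    return required - set(reader.fieldnames or [])


# In-memory stand-ins; with real data, pass open("issue.csv", newline="") instead.
issue_csv = io.StringIO("id,title,description\n1,Crash on start,Stack trace attached\n")
link_csv = io.StringIO("issue_id_1,issue_id_2,linktype,name\n1,2,Duplicate,1+2+Duplicate\n")
print(missing_columns(issue_csv, ISSUE_COLUMNS))  # {'resolution'}
print(missing_columns(link_csv, LINK_COLUMNS))    # set()
```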

