Methods: Analysis
Kara Moraw edited this page Jul 10, 2024
·
3 revisions
We produced several datasets from the data we collected from ePrints repositories and GitHub, as well as visualisations.
We manually examined all publications with GitHub links on the first two pages to determine whether the linked repositories were created for the publication or simply used/cited.
The resulting dataset, `eprints_w_intent.csv`, contains a row for each repository-publication pair examined in this manner:

column name | type | description | comment |
---|---|---|---|
`github_repo_id` | `str` | GitHub repository ID | `<username>/<reponame>` |
`mention_created` | `bool` | whether the repository was mentioned in the publication as having been created for it | |
`pub_title` | `str` | title of the publication | |
`pub_author_for_reference` | `str` | one of the publication authors as listed on ePrints | not reflective of main authorship |
`pdf_url` | `str` | link to downloadable file for the publication | |
`page_no` | `int` | number of the page the GitHub link was found on | first page denoted by 0 |
`detected_github_url` | `str` | GitHub link as extracted from the publication during PDF parsing | |
`pattern_matched_github_url` | `str` | cleaned GitHub link from regular expression matching | |
`eprints_date` | `str` | date field in ePrints | YYYY-MM-DD |
`eprints_pub_year` | `int` | year from the date field | |
`eprints_repo` | `str` | identifier of the ePrints repository the publication was found in | |

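
The table distinguishes the link as detected during PDF parsing from the cleaned link obtained by regular expression matching. As a rough illustration of that kind of cleaning, the sketch below reduces a detected link to the `<username>/<reponame>` form. The pattern, the function name `clean_github_url` and the handling of trailing punctuation are assumptions made for this example, not the expression used to produce the dataset.

```python
import re

# Hypothetical pattern: matches github.com/<username>/<reponame> and stops at
# the first character that is not allowed by the two character classes.
GITHUB_REPO_PATTERN = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)", re.IGNORECASE
)


def clean_github_url(detected_url):
    """Return '<username>/<reponame>' for a detected link, or None if no match."""
    match = GITHUB_REPO_PATTERN.search(detected_url)
    if match is None:
        return None
    user, repo = match.group(1), match.group(2)
    # Drop sentence punctuation captured at the end, then a '.git' suffix.
    repo = repo.rstrip(".")
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    if not repo:
        return None
    return f"{user}/{repo}"
```

A link such as `https://github.com/octocat/Hello-World.` (with the full stop of the sentence attached) would be reduced to `octocat/Hello-World` under these assumptions.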
We merged all repository metadata into one file with one row per repository, `aggregated_overall.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID | |
`archived` | `bool` | whether the repository was archived | |
`created_at` | `datetime` | time of repository creation | |
`has_wiki` | `bool` | whether the repository has a wiki | |
`has_pages` | `bool` | whether the repository has GitHub Pages | |
`license` | `str` | license identifier | as provided by GitHub |
`readme_size` | `int` | README size in bytes | |
`readme_path` | `str` | path to README file | |
`readme_emojis` | `int` | number of emojis detected in the README file | |
`contributing_size` | `int` | size of CONTRIBUTING.md in bytes | 0 if non-existent |
`citation_added` | `datetime` | time of commit adding CITATION.cff | |
`contributing_added` | `datetime` | time of commit adding CONTRIBUTING.md | |
`repo_created_at` | `datetime` | same as `created_at` | leftover duplication from dataframe merge |
`week_since_repo_creation_citation_added` | `int` | number of weeks between repository creation and addition of CITATION.cff | |
`week_since_repo_creation_contributing_added` | `int` | number of weeks between repository creation and addition of CONTRIBUTING.md | |
`license_type` | `str` | one of permissive, unknown, none or non-permissive | based on https://en.wikipedia.org/wiki/Permissive_software_license |
`readme_size_class` | `str` | one of "none", "ultra-short", "short", "informative", "detailed" | empirically determined: between 1 and 300 bytes is ultra-short, 300-1500 is short, anything between 1500 and 10000 is informative, everything beyond is detailed |
`forks_count` | `int` | number of forks of the repository | |
`stars_count` | `int` | number of stars of the repository | |
`max_active_contributors` | `int` | maximum observed number of active contributors to the repository in any week | an active contributor has made at least one commit in the last 12 weeks |

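
The size thresholds given for `readme_size_class` can be read as a simple classification rule. The following is a minimal sketch of that rule. The function name is hypothetical, and two details are assumptions because the table does not state them: a size of 0 is mapped to "none", and a size exactly on a boundary (300, 1500 or 10000 bytes) is placed in the larger class.

```python
def readme_size_class(size_bytes):
    """Map a README size in bytes to one of the five size classes."""
    # Assumption: a missing README is recorded with size 0.
    if size_bytes <= 0:
        return "none"
    # Assumption: boundary values fall into the larger class.
    if size_bytes < 300:
        return "ultra-short"
    if size_bytes < 1500:
        return "short"
    if size_bytes < 10000:
        return "informative"
    return "detailed"
```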
Moreover, we reframed the data collected from GitHub as timelines, with one row for each week of a repository's life. The resulting data files are:

`aggregated_timeline.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID | |
`week_since_repo_creation` | `int` | week since repository creation | increasing with repo creation being week 0 |
`closed_count` | `int` | number of closed issues for this repository in that week | total number of closed issues up to and in that week, not the number of issues closed in that week |
`open_count` | `int` | number of open issues in that week | as for `closed_count` |
`active_contributors` | `int` | number of active contributors in that week | a contributor is considered active if they have made at least one commit to the main branch in the last 12 weeks |
`contributors` | `int` | number of users who have ever made a commit to the main branch between repository creation and that week | |
`forks_count` | `int` | number of forks up until and in that week | |
`stars_count` | `int` | number of stars up until and in that week | |
`ownership_added` | `bool` | whether an ownership keyword was added to the README file in that week | addition means that the line showed up in commit additions |
`usage_added` | `bool` | whether a usage keyword was added to the README file in that week | as for `ownership_added` |
`citation_added` | `bool` | whether a citation keyword was added to the README file in that week | as for `ownership_added` |
`citation_file_added` | `bool` | whether a CITATION.cff file was added in that week | |
`contributing_file_added` | `bool` | whether a CONTRIBUTING.md file was added | |
`paper_published` | `bool` | whether a publication citing this repository was published that week | based on ePrints date field with potential limitations as outlined elsewhere |

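
Because `closed_count`, `forks_count` and `stars_count` are running totals rather than per-week figures, a per-week view has to be derived by differencing consecutive weeks. The sketch below shows one way to do that for a single repository. The function name is hypothetical, and the input is assumed to be ordered by `week_since_repo_creation`, starting at week 0 with no missing weeks.

```python
def weekly_increments(cumulative_counts):
    """Turn running totals (one value per week) into per-week increments."""
    increments = []
    previous_total = 0  # nothing has been counted before week 0
    for total in cumulative_counts:
        increments.append(total - previous_total)
        previous_total = total
    return increments
```

For example, running totals of 0, 2, 2 and 5 closed issues over four weeks correspond to 0, 2, 0 and 3 issues closed in the respective weeks.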

`aggregated_commit_author_timeline.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID | |
`author` | `str` | username of a contributor to the repository | |
`week_since_repo_creation` | `int` | week since repository creation | increasing with repo creation being week 0 |
`commits` | `int` | number of commits made by author in `week_since_repo_creation` | |
`active_contributors` | `bool` | whether author is considered an active contributor in this week | a contributor is considered active if they have made at least one commit to the main branch in the last 12 weeks |

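
The 12-week activity rule can be sketched as a sliding-window check over an author's weekly commit counts. This is an illustration only. The function names and the input layout (a dictionary per author mapping week to commit count) are hypothetical, and the window is assumed to include the current week together with the weeks immediately before it.

```python
WINDOW_WEEKS = 12  # window length taken from the table comment


def is_active(commits_per_week, week, window=WINDOW_WEEKS):
    """Return True if the author committed within the window ending at `week`.

    commits_per_week maps week_since_repo_creation -> number of commits.
    Assumption: the window covers `week` itself and the window - 1 weeks before.
    """
    return any(
        count > 0 and week - window < w <= week
        for w, count in commits_per_week.items()
    )


def count_active_contributors(commits_by_author, week, window=WINDOW_WEEKS):
    """Count the authors that are active in the given week."""
    return sum(
        is_active(weeks, week, window) for weeks in commits_by_author.values()
    )


def max_active_contributors(commits_by_author, last_week, window=WINDOW_WEEKS):
    """Largest weekly number of active contributors from week 0 to last_week."""
    return max(
        count_active_contributors(commits_by_author, w, window)
        for w in range(last_week + 1)
    )
```

Under these assumptions, an author whose only commit was in week 10 counts as active from week 10 through week 21 and as inactive from week 22 onwards.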

`aggregated_issue_user_timeline.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID | |
`week_since_repo_creation` | `int` | week since repository creation | increasing with repo creation being week 0 |
`user` | `str` | username of user who interacts with the repository's issues | |
`created_count` | `int` | number of issues created by user in `week_since_repo_creation` in the repository | |
`closed_count` | `int` | number of issues closed by user in `week_since_repo_creation` in the repository | |
`user_status` | `str` | one of `['inactive', 'opening', 'both', 'closing']` | opening if at least one issue opened in last 12 weeks, closing if at least one issue closed in last 12 weeks, both if both apply, inactive otherwise |

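
The `user_status` rule combines two 12-week window checks, one for opened and one for closed issues. The following sketch spells out that combination. The function names and the inputs (collections of week numbers in which the user opened or closed at least one issue) are hypothetical, and the window is assumed to include the current week.

```python
WINDOW_WEEKS = 12  # window length taken from the table comment


def _any_in_window(event_weeks, week, window):
    """True if any event week lies in the `window` weeks ending at `week`."""
    return any(week - window < w <= week for w in event_weeks)


def user_status(opened_weeks, closed_weeks, week, window=WINDOW_WEEKS):
    """Classify a user's issue activity in a given week."""
    opening = _any_in_window(opened_weeks, week, window)
    closing = _any_in_window(closed_weeks, week, window)
    if opening and closing:
        return "both"
    if opening:
        return "opening"
    if closing:
        return "closing"
    return "inactive"
```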