
Methods: Analysis

Kara Moraw edited this page Jul 10, 2024 · 3 revisions

We produced several datasets from the data we collected from ePrints repositories and GitHub, as well as visualisations.

Repository mention intent

We manually examined all publications with GitHub links in the first two pages to determine whether the linked repositories were created for the publication or simply used or cited. The resulting dataset, eprints_w_intent.csv, contains one row for each repository-publication pair examined in this manner:

| column name | type | description | comment |
|---|---|---|---|
| github_repo_id | str | GitHub repository ID | `<username>/<reponame>` |
| mention_created | bool | whether the repository was mentioned in the publication as having been created for it | |
| pub_title | str | title of the publication | |
| pub_author_for_reference | str | one of the publication authors as listed on ePrints | not reflective of main authorship |
| pdf_url | str | link to downloadable file for the publication | |
| page_no | int | number of the page the GitHub link was found on | first page denoted by 0 |
| detected_github_url | str | GitHub link as extracted from the publication during PDF parsing | |
| pattern_matched_github_url | | cleaned GitHub link from regular expression matching | |
| eprints_date | str | date field in ePrints | YYYY-MM-DD |
| eprints_pub_year | int | year from the date field | |
| eprints_repo | str | identifier of the ePrints repository the publication was found in | |
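
To illustrate how a raw link in detected_github_url might be reduced to the `<username>/<reponame>` form used in github_repo_id, here is a minimal sketch. The pattern and the function name `extract_repo_id` are assumptions made for illustration; they are not the regular expression actually used to produce pattern_matched_github_url.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not the project's own regex):
# captures the <username> and <reponame> segments after "github.com/".
GITHUB_REPO_PATTERN = re.compile(
    r"github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)", re.IGNORECASE
)


def extract_repo_id(detected_url: str) -> Optional[str]:
    """Return '<username>/<reponame>' for a GitHub link, or None if no match."""
    match = GITHUB_REPO_PATTERN.search(detected_url)
    if match is None:
        return None
    user, repo = match.groups()
    # PDF text extraction often glues sentence-ending dots onto the link.
    repo = repo.rstrip(".")
    # Drop the clone suffix so that "repo.git" and "repo" map to the same ID.
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{user}/{repo}"


print(extract_repo_id("Code: https://github.com/octocat/Hello-World."))
```

Stripping trailing dots and the `.git` suffix keeps different spellings of the same link from producing different repository IDs.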

Aggregated timelines

We merged all repository metadata into one file with one row per repository:


| column name | type | description | comment |
|---|---|---|---|
| github_user_cleaned_url | str | GitHub repository ID | |
| archived | bool | whether the repository was archived | |
| created_at | datetime | time of repository creation | |
| has_wiki | bool | whether the repository has a wiki | |
| has_pages | bool | whether the repository has GitHub Pages | |
| license | str | license identifier as provided by GitHub | |
| readme_size | int | README size in bytes | |
| readme_path | str | path to README file | |
| readme_emojis | int | number of emojis detected in the README file | |
| contributing_size | int | size of the contributing file in bytes | 0 if non-existent |
| citation_added | datetime | time of commit adding CITATION.cff | |
| contributing_added | datetime | time of commit adding the contributing file | |
| repo_created_at | datetime | same as created_at | leftover duplication from dataframe merge |
| week_since_repo_creation_citation_added | int | number of weeks between repository creation and addition of CITATION.cff | |
| week_since_repo_creation_contributing_added | int | number of weeks between repository creation and addition of the contributing file | |
| license_type | str | one of permissive, unknown, none or non-permissive | based on |
| readme_size_class | str | one of "none", "ultra-short", "short", "informative", "detailed" | empirically determined: between 1 and 300 bytes is ultra-short, 300-1500 is short, anything between 1500 and 10000 is informative, everything beyond is detailed |
| forks_count | int | number of forks of the repository | |
| stars_count | int | number of stars of the repository | |
| max_active_contributors | int | maximum observed number of active contributors to the repository in any week | an active contributor has made at least one commit in the last 12 weeks |
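
The readme_size_class thresholds can be written as a small function. This is a sketch, not the project's own code: the function name is hypothetical, and the handling of the boundary values is an assumption, because the description leaves open which class exactly 300, 1500 and 10000 bytes fall into.

```python
def readme_size_class(size_bytes: int) -> str:
    """Map a README size in bytes to one of the five size classes.

    Assumptions: a size of 0 means there is no README; exactly 300 bytes
    counts as "short", exactly 1500 as "informative", and exactly 10000
    as "informative" (only sizes beyond 10000 are "detailed").
    """
    if size_bytes <= 0:
        return "none"
    if size_bytes < 300:
        return "ultra-short"
    if size_bytes < 1500:
        return "short"
    if size_bytes <= 10000:
        return "informative"
    return "detailed"


for size in (0, 120, 800, 4000, 25000):
    print(size, readme_size_class(size))
```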

Moreover, we reframed the data collected from GitHub as timelines, with one row for each week of repository life. The resulting data files are:


| column name | type | description | comment |
|---|---|---|---|
| github_user_cleaned_url | str | GitHub repository ID | |
| week_since_repo_creation | int | week since repository creation | increasing, with repo creation being week 0 |
| closed_count | int | number of closed issues for this repository in that week | total number of closed issues up to and in that week, not the number of issues closed in that week |
| open_count | int | number of open issues in that week | as for closed_count |
| active_contributors | int | number of active contributors in that week | a contributor is considered active if they have made at least one commit to the main branch in the last 12 weeks |
| contributors | int | number of users who have ever made a commit to the main branch between repository creation and that week | |
| forks_count | int | number of forks up until and in that week | |
| stars_count | int | number of stars up until and in that week | |
| ownership_added | bool | whether an ownership keyword was added to the README file in that week | addition means that the line showed up in commit additions |
| usage_added | bool | whether a usage keyword was added to the README file in that week | as for ownership_added |
| citation_added | bool | whether a citation keyword was added to the README file in that week | as for ownership_added |
| citation_file_added | bool | whether a CITATION.cff file was added in that week | |
| contributing_file_added | bool | whether a contributing file was added in that week | |
| paper_published | bool | whether a publication citing this repository was published that week | based on ePrints date field with potential limitations as outlined elsewhere |
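
Because closed_count, forks_count and stars_count are running totals rather than per-week figures, it can help to see how the two representations relate. The sketch below uses hypothetical helper names and made-up numbers; it is not taken from the analysis code.

```python
from itertools import accumulate
from typing import List


def cumulative_counts(per_week: List[int]) -> List[int]:
    """Turn per-week event counts into totals 'up to and in that week'."""
    return list(accumulate(per_week))


def weekly_from_cumulative(totals: List[int]) -> List[int]:
    """Recover per-week counts from running totals by differencing."""
    return [
        total - (totals[week - 1] if week > 0 else 0)
        for week, total in enumerate(totals)
    ]


# Illustrative values only: issues closed in weeks 0..3 of a repository.
closed_per_week = [0, 2, 0, 1]
closed_count = cumulative_counts(closed_per_week)
print(closed_count)  # [0, 2, 2, 3]
print(weekly_from_cumulative(closed_count))  # [0, 2, 0, 1]
```

Differencing is only needed when an analysis asks how many events happened in a given week; the timeline files already provide the running totals.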


| column name | type | description | comment |
|---|---|---|---|
| github_user_cleaned_url | str | GitHub repository ID | |
| author | str | username of a contributor to the repository | |
| week_since_repo_creation | int | week since repository creation | increasing, with repo creation being week 0 |
| commits | int | number of commits made by author in week_since_repo_creation | |
| active_contributors | bool | whether author is considered an active contributor in this week | a contributor is considered active if they have made at least one commit to the main branch in the last 12 weeks |
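
The 12-week activity rule can be sketched as a sliding window over each author's weekly commit counts. The function names are hypothetical, and the sketch assumes that "the last 12 weeks" means the current week plus the 11 weeks before it; the original analysis may draw the window differently.

```python
from typing import Dict, List

WINDOW_WEEKS = 12  # assumed to include the current week


def active_flags(weekly_commits: List[int], window: int = WINDOW_WEEKS) -> List[bool]:
    """For each week, report whether the author committed within the window."""
    flags = []
    for week in range(len(weekly_commits)):
        start = max(0, week - window + 1)
        flags.append(any(c > 0 for c in weekly_commits[start:week + 1]))
    return flags


def active_contributor_counts(commits_by_author: Dict[str, List[int]]) -> List[int]:
    """Count active authors per week; all commit lists must have equal length."""
    per_author = [active_flags(commits) for commits in commits_by_author.values()]
    return [sum(week_flags) for week_flags in zip(*per_author)]


# Illustrative data: "a" commits once in week 0, "b" first commits in week 3.
commits = {"a": [1, 0, 0, 0], "b": [0, 0, 0, 2]}
counts = active_contributor_counts(commits)
print(counts)       # [1, 1, 1, 2]
print(max(counts))  # 2
```

Under these assumptions, the per-week counts correspond to the active_contributors column of the repository timeline, and their maximum corresponds to max_active_contributors in the aggregated file.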


| column name | type | description | comment |
|---|---|---|---|
| github_user_cleaned_url | str | GitHub repository ID | |
| week_since_repo_creation | int | week since repository creation | increasing, with repo creation being week 0 |
| user | str | username of user who interacts with the repository's issues | |
| created_count | int | number of issues created by user in week_since_repo_creation in the repository | |
| closed_count | int | number of issues closed by user in week_since_repo_creation in the repository | |
| user_status | str | one of ['inactive', 'opening', 'both', 'closing'] | opening if at least one issue opened in last 12 weeks, closing if at least one issue closed in last 12 weeks, both if both apply, inactive otherwise |
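
The user_status rule can be expressed as a small classification over a 12-week window of a user's created and closed issue counts. This is a sketch with hypothetical function names; as an assumption, the window is taken to include the current week.

```python
from typing import List


def user_status(opened_recently: int, closed_recently: int) -> str:
    """Classify a user from their issue activity inside the window."""
    if opened_recently > 0 and closed_recently > 0:
        return "both"
    if opened_recently > 0:
        return "opening"
    if closed_recently > 0:
        return "closing"
    return "inactive"


def status_timeline(created: List[int], closed: List[int], window: int = 12) -> List[str]:
    """Apply user_status to every week using a trailing window (assumed inclusive)."""
    statuses = []
    for week in range(len(created)):
        start = max(0, week - window + 1)
        statuses.append(
            user_status(sum(created[start:week + 1]), sum(closed[start:week + 1]))
        )
    return statuses


# Illustrative data with a shortened 2-week window to keep the example small.
print(status_timeline([1, 0, 0, 0], [0, 1, 0, 0], window=2))
# ['opening', 'both', 'closing', 'inactive']
```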