Datasets

Kara Moraw edited this page Jul 10, 2024 · 2 revisions

CZI Software Mentions

Source:

biomedical papers
- 2.4 million papers from NIH PubMed Central commercial subset
- 1.4 million papers from NIH PubMed Central non-commercial subset
- 4 million papers from CZI Publishers' collection
  - biomedical, but also some other areas
  - includes 1.7 million PubMed Central papers

Method:

SciBERT extracts plain-text software mentions
disambiguation of mentions using DBSCAN
linking by exact-match query in PyPI, CRAN, Bioconductor, SciCrunch, GitHub
- GitHub links have high error / unclear rate in evaluation
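
The disambiguation and linking steps above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the CZI pipeline itself: the string distance (`name_distance`), the `eps` and `min_samples` values, the `PACKAGE_INDEX` dictionary, and the helper `link_exact` are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional


def name_distance(a: str, b: str) -> float:
    """Distance in [0, 1]; 0 for identical lowercased strings (assumed metric)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dbscan(items: List[str], eps: float, min_samples: int,
           dist: Callable[[str, str], float]) -> List[int]:
    """Plain DBSCAN; returns one cluster label per item, -1 meaning noise."""
    n = len(items)
    labels: List[Optional[int]] = [None] * n  # None = not yet visited

    def neighbours(i: int) -> List[int]:
        # Includes i itself, because dist(x, x) == 0 <= eps.
        return [j for j in range(n) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is a core point, keep expanding
    return [label if label is not None else -1 for label in labels]


# Hypothetical stand-in for a repository index; a real pipeline would query
# the repositories themselves.
PACKAGE_INDEX: Dict[str, str] = {"numpy": "https://pypi.org/project/numpy/"}


def link_exact(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return a link only when the lowercased name matches an index key exactly."""
    return index.get(name.strip().lower())


mentions = ["NumPy", "numpy", "Num Py", "GraphPad Prism", "Graphpad prism", "SPSS"]
labels = dbscan(mentions, eps=0.2, min_samples=2, dist=name_distance)
for mention, label in zip(mentions, labels):
    print(label, mention, link_exact(mention, PACKAGE_INDEX))
```

Exact matching fails on spelling variants such as "Num Py", which is one reason to cluster mentions first and link the disambiguated name rather than each raw mention.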

Content:

software mentions
context (2-3 lines)
disambiguated software name
link (if any)

PLOS Open Science Indicators

Source:

61,000 research articles published in PLOS (2019-2022)
6,500 articles in non-PLOS journals (for comparison)

Method:

analyse XML of published research articles
detect 3 open science practices
- sharing of research data (NLP, DataSeer)
- sharing of code (NLP, DataSeer)
- posting of preprints (CrossRef, DataCite)
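
As a rough illustration of what analysing article XML can look like, the sketch below pulls a data availability statement out of a JATS-style document and lists the URLs it mentions. It is not the DataSeer NLP method: the `custom-meta` / `meta-name` / `meta-value` layout, the sample article, and the helper names are assumptions, and the URL regex is a simple stand-in for real detection.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical, heavily shortened article; the element layout is an assumption.
SAMPLE_ARTICLE = """<article>
  <front><article-meta><custom-meta-group>
    <custom-meta>
      <meta-name>Data Availability</meta-name>
      <meta-value>Data are at https://example.org/dataset/1 and code at https://github.com/example/analysis.</meta-value>
    </custom-meta>
  </custom-meta-group></article-meta></front>
</article>"""

URL_RE = re.compile(r"https?://[^\s<>\"]+")


def data_availability(xml_text: str) -> Optional[str]:
    """Return the text of the data availability statement, if one is present."""
    root = ET.fromstring(xml_text)
    for meta in root.iter("custom-meta"):
        name = meta.findtext("meta-name", default="")
        if name.strip().lower() == "data availability":
            value = meta.find("meta-value")
            if value is not None:
                return "".join(value.itertext()).strip()
    return None


def shared_locations(statement: str) -> List[str]:
    """List URLs in the statement, trimming trailing sentence punctuation."""
    return [url.rstrip(".,;)") for url in URL_RE.findall(statement)]


statement = data_availability(SAMPLE_ARTICLE)
locations = shared_locations(statement) if statement else []
print(locations)
```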

Content:

data and code generation and sharing rate
location of shared data and code

Softcite dataset

Source:

5,000 open access research publications in life sciences and social sciences

Method:

manually annotated

Content:

XML publications with encoded annotations (used GROBID)
later used to train ML models
- implemented in GROBID module for software mention recognition
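
To give an idea of how annotations encoded in XML can be read back out, here is a small sketch using only the standard library. It assumes a TEI-style encoding in which each annotation is an `rs` element with a `type` attribute; that layout, the sample paragraph, and the helper `extract_annotations` are illustrative assumptions, so check them against the actual corpus files.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment; real files are full publications.
SAMPLE_TEI = f"""<TEI xmlns="{TEI_NS}"><text><body>
  <p>Analyses were run in <rs type="software">SPSS</rs>
     version <rs type="version">22</rs>.</p>
</body></text></TEI>"""


def extract_annotations(xml_text: str) -> List[Tuple[str, str]]:
    """Return (annotation type, annotated text) pairs in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for rs in root.iter(f"{{{TEI_NS}}}rs"):
        text = "".join(rs.itertext()).strip()
        pairs.append((rs.get("type", ""), text))
    return pairs


annotations = extract_annotations(SAMPLE_TEI)
print(annotations)
```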

SoftwareKG

Source:

SoftwareKG_Social: 51,000 articles from social sciences
SoftwareKG_PubMed: 3 million PubMed Central articles

Content:

knowledge graph
name of software mention
accessibility, URL, license, disambiguation

French Open Science Monitor

Link, Methodology Paper

select relevant papers using affiliation detector for CrossRef
- identifies papers with at least one French author
- uses controlled list of institutions for France
run GROBID, then software mention detection + dataset mention detection on the collection
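
The paper selection step can be pictured with a sketch like the one below: keep a paper if any affiliation of any author matches a controlled list of institutions. This is not the monitor's actual detector. The tiny `FRENCH_INSTITUTIONS` set, the shape of the `paper` dictionaries, and the function names are assumptions for illustration only.

```python
import re
import unicodedata
from typing import Dict, Iterable, List

# Tiny illustrative list; a real controlled list would be far larger.
FRENCH_INSTITUTIONS = {"cnrs", "inserm", "sorbonne universite"}


def fold(text: str) -> str:
    """Lowercase and strip accents so 'Université' matches 'universite'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def is_french_affiliation(affiliation: str,
                          institutions: Iterable[str] = FRENCH_INSTITUTIONS) -> bool:
    """True if any listed institution name appears as a whole phrase."""
    folded = fold(affiliation)
    return any(re.search(rf"\b{re.escape(name)}\b", folded) for name in institutions)


def has_french_author(paper: Dict[str, List[Dict[str, List[str]]]]) -> bool:
    """True if at least one author has at least one matching affiliation."""
    return any(
        is_french_affiliation(affiliation)
        for author in paper["authors"]
        for affiliation in author["affiliations"]
    )


# Hypothetical records standing in for CrossRef metadata.
papers = [
    {"authors": [{"affiliations": ["Sorbonne Université, Paris"]},
                 {"affiliations": ["University of Oxford"]}]},
    {"authors": [{"affiliations": ["University of Oxford"]}]},
]
selected = [paper for paper in papers if has_french_author(paper)]
print(len(selected))
```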

SoMeSci

Source:

1,367 PubMed Central papers

Content:

Knowledge graph
version, developer, URL, citations