Datasets
Kara Moraw edited this page Jul 10, 2024 · 2 revisions
Source:
- biomedical papers
- 2.4 million papers from NIH PubMed Central commercial subset
- 1.4 million papers from NIH PubMed Central non-commercial subset
- 4 million papers from CZI Publishers' collection
- biomedical, but also some other areas
- includes 1.7 million PubMed Central papers
Method:
- SciBERT extracts plain-text software mentions
- disambiguation of mentions using DBSCAN
- linking by exact-match query against PyPI, SciCrunch, and GitHub
- GitHub links show a high rate of erroneous or unclear matches in evaluation
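
The disambiguation step can be illustrated with a small, dependency-free DBSCAN over mention strings. This is only a sketch: the string distance (based on `difflib`), the `eps` and `min_pts` values, and the sample mentions are assumptions chosen for illustration, not the features or parameters used to build the dataset.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """Case-insensitive string distance in [0, 1]; 0 means identical."""
    # Sort the pair so the result does not depend on argument order.
    first, second = sorted((a.lower(), b.lower()))
    return 1.0 - SequenceMatcher(None, first, second).ratio()


def dbscan(items, eps, min_pts):
    """Minimal DBSCAN over strings; returns one label per item (-1 = noise)."""
    labels = [None] * len(items)

    def neighbours(i):
        # Every item within eps of item i, including i itself.
        return [j for j in range(len(items)) if distance(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


# Illustrative mentions only; not taken from the dataset.
mentions = ["ImageJ", "imagej", "Image J", "GraphPad Prism", "Graphpad prism", "BLAST"]
labels = dbscan(mentions, eps=0.2, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Mentions that share a label would be treated as the same software; the label `-1` marks a mention that has no close variants in this sample.
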
Content:
- software mentions
- context (2-3 lines)
- disambiguated software name
- link (if any)
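
One way to picture a record with these four fields, together with the exact-match linking step, is sketched below. The class name, field names, and the tiny in-memory `PACKAGE_INDEX` are hypothetical; a real run would query the registries themselves, and the case-insensitive comparison is an assumption about what "exact match" means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoftwareMention:
    mention: str                # surface form as it appears in the paper
    context: str                # the 2-3 lines surrounding the mention
    disambiguated_name: str     # canonical name after disambiguation
    link: Optional[str] = None  # package or repository URL, if any


# Hypothetical stand-in for a registry lookup (illustrative entries only).
PACKAGE_INDEX = {
    "numpy": "https://pypi.org/project/numpy/",
    "pandas": "https://pypi.org/project/pandas/",
}


def link_exact(record: SoftwareMention, index: dict) -> SoftwareMention:
    """Attach a link only if the disambiguated name equals an index key."""
    # Assumption: names are compared case-insensitively, with no fuzzy matching.
    record.link = index.get(record.disambiguated_name.lower())
    return record


linked = link_exact(SoftwareMention("NumPy", "We used NumPy for ...", "numpy"), PACKAGE_INDEX)
unlinked = link_exact(SoftwareMention("Num Py", "We used Num Py ...", "num py"), PACKAGE_INDEX)
```

Because only exact names match, `unlinked.link` stays `None`, which corresponds to the "link (if any)" field being empty.
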
Source:
- 61,000 research articles published in PLOS (2019-2022)
- 6,500 articles in non-PLOS journals (for comparison)
Method:
- analyse XML of published research articles
- detect 3 Open Science practices
- sharing of research data (NLP, DataSeer)
- sharing of code (NLP, DataSeer)
- posting of preprints (CrossRef, DataCite)
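
To make the idea of analysing article XML concrete, here is a rough sketch that parses one article and looks for sharing signals. It swaps in a simple keyword and URL heuristic for the NLP-based detection noted above, so it is a stand-in for illustration only; the patterns, the function name, and the sample XML are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical repository pattern; the real detection is model-based, not a regex.
REPOSITORY_PATTERN = re.compile(r"github\.com|zenodo\.org|figshare\.com|osf\.io", re.I)


def detect_sharing(xml_text: str) -> dict:
    """Report which sharing signals a keyword heuristic finds in one article."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    # Link targets live in namespaced attributes (e.g. xlink:href),
    # so match on the attribute's local name only.
    links = [
        value
        for element in root.iter()
        for key, value in element.attrib.items()
        if key.endswith("href")
    ]
    haystack = text + " " + " ".join(links)
    return {
        "mentions_repository": bool(REPOSITORY_PATTERN.search(haystack)),
        "mentions_code": bool(re.search(r"\b(source code|scripts?|code)\b", text, re.I)),
        "mentions_data": bool(re.search(r"\bdata(set)?s?\b", text, re.I)),
    }


# Illustrative article fragment, not taken from any real publication.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body><sec><p>All analysis scripts are available at
  <ext-link xlink:href="https://github.com/example/repo">our repository</ext-link>.</p></sec></body>
</article>"""

result = detect_sharing(SAMPLE)
```

A heuristic like this only flags candidate statements; deciding whether data or code were actually shared, and where, needs the more careful classification the method list refers to.
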
Content:
- data and code generation and sharing rate
- location of shared data and code
Source:
- 5,000 open access research publications in life sciences and social sciences
Method:
- manually annotated (to be confirmed)
Content:
- XML publications with encoded annotations (used GROBID)
- later used to train ML models
- implemented in GROBID module for software mention recognition
Source:
- SoftwareKG_Social: 51,000 articles from social sciences
- SoftwareKG_PubMed: 3 million PubMed Central articles
Content:
- knowledge graph
- name of software mention
- accessibility, URL, license, disambiguation
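
As a rough illustration of how such a knowledge graph can hold mentions and software metadata, the sketch below stores subject-predicate-object triples in plain Python. The identifiers, predicate names, and all values are hypothetical placeholders and do not reflect the actual schema or contents of the graph.

```python
triples = set()


def add(subject, predicate, obj):
    """Store one (subject, predicate, object) statement."""
    triples.add((subject, predicate, obj))


def query(subject=None, predicate=None, obj=None):
    """Return all matching triples, sorted; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Two surface forms that disambiguate to the same (made-up) software entity.
add("mention:1", "name", "ExampleTool")
add("mention:1", "refersTo", "software:ExampleTool")
add("mention:2", "name", "exampletool")
add("mention:2", "refersTo", "software:ExampleTool")

# Metadata attached to the software entity (placeholder values).
add("software:ExampleTool", "accessibility", "open source")
add("software:ExampleTool", "url", "https://example.org/exampletool")
add("software:ExampleTool", "license", "MIT")

mentions_of_tool = [t[0] for t in query(predicate="refersTo", obj="software:ExampleTool")]
license_info = query("software:ExampleTool", "license")
```

Keeping the mention and the software as separate nodes is what lets several spellings point at one entity that carries the URL and license.
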
Method:
- select relevant papers using affiliation detector for CrossRef
- identifies papers with at least one French author
- uses controlled list of institutions for France
- run GROBID, then software mention detection and dataset mention detection on the collection
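
The paper-selection step could look roughly like the sketch below: affiliation strings are normalised and compared against a controlled list of institutions. The four-entry list, the simplified paper dictionaries, and the placeholder DOIs are assumptions for illustration; they are not the real controlled list or the CrossRef metadata schema.

```python
import re
import unicodedata

# Hypothetical excerpt of a controlled list; a real list would be far longer.
FRENCH_INSTITUTIONS = {"cnrs", "inserm", "inria", "sorbonne universite"}


def normalise(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z0-9]+", " ", unaccented.lower()).split())


def has_french_author(paper: dict) -> bool:
    """True if any author affiliation contains an institution from the list."""
    for author in paper.get("authors", []):
        for affiliation in author.get("affiliations", []):
            padded = f" {normalise(affiliation)} "
            # Padding with spaces ensures whole-word matches only.
            if any(f" {name} " in padded for name in FRENCH_INSTITUTIONS):
                return True
    return False


# Simplified, made-up records; the DOIs are placeholders.
papers = [
    {"doi": "10.0000/example-1",
     "authors": [{"affiliations": ["Sorbonne Université, Paris"]}]},
    {"doi": "10.0000/example-2",
     "authors": [{"affiliations": ["University of Edinburgh"]}]},
    {"doi": "10.0000/example-3",
     "authors": [{"affiliations": ["University of Edinburgh"]},
                 {"affiliations": ["Inserm U0000, Lyon"]}]},
]

selected = [p["doi"] for p in papers if has_french_author(p)]
```

Only the selected papers would then go on to GROBID and the mention detection steps.
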
Source:
- 1,367 PubMed Central papers
Content:
- knowledge graph
- version, developer, URL, citations