Datasets

Kara Moraw edited this page Jul 10, 2024 · 2 revisions

CZI Software Mentions

Link

Source:

  • biomedical papers
    • 2.4 million papers from NIH PubMed Central commercial subset
    • 1.4 million papers from NIH PubMed Central non-commercial subset
    • 4 million papers from CZI Publishers' collection
      • biomedical, but also some other areas
      • includes 1.7 million PubMed Central papers

Method:

  • SciBERT extracts plain-text software mentions
  • disambiguation of mentions using DBSCAN
  • linking by exact-match query against PyPI, SciPy, GitHub
    • GitHub links showed a high rate of erroneous or unclear matches in the evaluation
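
The disambiguation and linking steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the CZI pipeline: the string distance (a case-insensitive `difflib` ratio), the `eps` and `min_samples` values, the helper names, and the `example.org` URLs are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

NOISE = -1  # label for mentions that end up in no cluster


def distance(a: str, b: str) -> float:
    """Assumed distance: 1 minus a case-insensitive similarity ratio."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dbscan(items: List[str], eps: float, min_samples: int) -> List[int]:
    """Plain DBSCAN over strings; returns one cluster label per item."""
    labels: List[Optional[int]] = [None] * len(items)

    def neighbours(i: int) -> List[int]:
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(items))
                if distance(items[i], items[j]) <= eps]

    cluster = 0
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_samples:  # core point: keep expanding
                queue.extend(nearby)
        cluster += 1
    return [NOISE if label is None else label for label in labels]


def link_exact(name: str, index: Dict[str, str]) -> Optional[str]:
    """Exact-match lookup in a (hypothetical) name -> URL index."""
    return index.get(name)


if __name__ == "__main__":
    mentions = ["ImageJ", "Image J", "imagej", "SPSS", "spss", "GraphPad Prism"]
    print(dbscan(mentions, eps=0.2, min_samples=2))
    # Placeholder index; a real one would be built from a package registry.
    print(link_exact("ImageJ", {"ImageJ": "https://example.org/imagej"}))
```

Exact matching keeps the linking step simple, but it returns nothing for any name variant that is not a key of the index, which is one reason to disambiguate mentions before linking.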

Content:

  • software mentions
  • context (2-3 lines)
  • disambiguated software name
  • link (if any)

PLOS Open Science Indicators

Link

Source:

  • 61,000 research articles published in PLOS (2019-2022)
  • 6,500 articles in non-PLOS journals (for comparison)

Method:

  • analyse XML of published research articles
  • detect 3 open science practices
    • sharing of research data (NLP, DataSeer)
    • sharing of code (NLP, DataSeer)
    • posting of preprints (CrossRef, DataCite)

Content:

  • data and code generation and sharing rate
  • location of shared data and code
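
As a rough illustration of how generation and sharing rates could be derived from per-article indicators, here is a small sketch. The record layout (`ArticleIndicators` and its field names) is hypothetical and not the dataset's actual schema, and the rate is assumed to be "articles that shared, out of articles that generated"; the DOIs are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArticleIndicators:
    """Hypothetical per-article record; field names are assumptions."""
    doi: str
    data_generated: bool
    data_shared: bool
    code_generated: bool
    code_shared: bool


def sharing_rate(rows: List[ArticleIndicators], kind: str) -> Optional[float]:
    """Share of articles that shared `kind` ("data" or "code"),
    among the articles that generated it. None if none generated it."""
    generated = [r for r in rows if getattr(r, f"{kind}_generated")]
    if not generated:
        return None
    shared = sum(1 for r in generated if getattr(r, f"{kind}_shared"))
    return shared / len(generated)


if __name__ == "__main__":
    rows = [
        ArticleIndicators("10.0000/example.1", True, True, False, False),
        ArticleIndicators("10.0000/example.2", True, False, False, False),
        ArticleIndicators("10.0000/example.3", False, False, False, False),
    ]
    print(sharing_rate(rows, "data"))
    print(sharing_rate(rows, "code"))
```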

Softcite dataset

Link

Source:

  • 5,000 open access research publications in life sciences and social sciences

Method:

  • manually annotated

Content:

  • XML publications with encoded annotations (XML produced with GROBID)
  • later used to train ML models
    • implemented in GROBID module for software mention recognition
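
To give an idea of what reading such annotated XML might look like, here is a short sketch using only the standard library. The sample document, the TEI namespace, and the `<rs type="...">` markup are assumptions for illustration; check the actual corpus files for the element and attribute names they use.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed namespace and markup; the sample below is made up for this example.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Analyses were run in <rs type="software">SPSS</rs>
       <rs type="version">25</rs>.</p>
  </body></text>
</TEI>"""


def extract_annotations(xml_text: str) -> List[Tuple[str, str]]:
    """Return (annotation type, annotated text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [
        (rs.get("type", ""), "".join(rs.itertext()).strip())
        for rs in root.iterfind(".//tei:rs", TEI_NS)
    ]


if __name__ == "__main__":
    for kind, text in extract_annotations(SAMPLE):
        print(kind, text)
```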

SoftwareKG

Link

Source:

  • SoftwareKG_Social: 51,000 articles from social sciences
  • SoftwareKG_PubMed: 3M PubMed Central articles

Content:

  • knowledge graph
  • name of software mention
  • accessibility, URL, license, disambiguation
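
A knowledge graph of this kind can be pictured as subject-predicate-object triples. The sketch below is only a toy model: the predicate names, identifiers, and values are invented for illustration and do not reflect SoftwareKG's actual vocabulary or data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_index(triples: Iterable[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group triples by subject, then by predicate."""
    index: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for subject, predicate, obj in triples:
        index[subject].setdefault(predicate, []).append(obj)
    return dict(index)


def describe(index: Dict[str, Dict[str, List[str]]],
             subject: str, predicate: str) -> List[str]:
    """All objects stored for a subject/predicate pair (empty if none)."""
    return index.get(subject, {}).get(predicate, [])


if __name__ == "__main__":
    # Hypothetical triples; names and values are placeholders.
    triples = [
        ("software:example-tool", "name", "ExampleTool"),
        ("software:example-tool", "url", "https://example.org/tool"),
        ("software:example-tool", "license", "MIT"),
        ("article:1", "mentions", "software:example-tool"),
    ]
    index = build_index(triples)
    print(describe(index, "software:example-tool", "license"))
```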

French Open Science Monitor

Link, Methodology Paper

  • select relevant papers using affiliation detector for CrossRef
    • identifies papers with at least one French author
    • uses controlled list of institutions for France
  • run GROBID, then software mention detection + dataset mention detection on the collection
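
The paper-selection step could look something like the sketch below: a paper is kept if any author affiliation matches an entry in a controlled list. This is an assumption-laden illustration, not the monitor's actual affiliation detector; the institution list, the accent-stripping normalisation, and the substring matching are all choices made for the example.

```python
import unicodedata
from typing import Iterable

# Illustrative controlled list only; a real list would be far longer.
FRENCH_INSTITUTIONS = {"cnrs", "inserm", "sorbonne universite"}


def normalise(text: str) -> str:
    """Lower-case and strip accents so 'Université' matches 'universite'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def has_french_affiliation(affiliations: Iterable[str]) -> bool:
    """True if at least one affiliation string mentions a listed institution."""
    for affiliation in affiliations:
        text = normalise(affiliation)
        if any(name in text for name in FRENCH_INSTITUTIONS):
            return True
    return False


if __name__ == "__main__":
    print(has_french_affiliation(["Sorbonne Université, Paris"]))
    print(has_french_affiliation(["MIT, Cambridge, USA"]))
```

Plain substring matching is crude (short acronyms can match inside unrelated words), so a real detector would likely need tokenisation or more careful matching.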

SoMeSci

Paper, data

Source:

  • 1,367 PubMed Central papers

Content:

  • Knowledge graph
  • version, developer, URL, citations