
complexity-science-hub/cwts_covid

 
 


cwts_covid

This repository contains the code CWTS uses to create internal databases to study scientific literature on COVID-19. This code is provided as is for anyone who would like to replicate or expand upon it.

The code in this repository supports the following steps:

  • Take published lists of scientific publications on COVID-19 and create a relational database with them.
  • Query the Dimensions and Altmetrics APIs to get more data on these publications (you will need to use your own API keys for this).
  • Do some basic plotting of this data.

This workflow can be illustrated as follows:

[Figure: workflow diagram]

Data sources

For the moment, we consider publications from the following sources:

  • CORD19 (last updated March 28, 2020)
  • Dimensions (last updated March 28, 2020)
  • WHO (last updated March 28, 2020)

You will need to download these datasets and place them in a local folder in order to process them. We assume that you have a local copy of the whole CORD19 dataset, and a CSV file with publication metadata for each of Dimensions and WHO. Previous releases of the Dimensions and WHO lists can be found in the datasets_input folder. Please also see the notebooks below for more details.

In the future, we might expand to more sources.

Steps

Create database

The relational schema we use to consolidate the data sources mentioned above is available as a SQL script (working at least on MySQL).

[Figure: SQL schema]

You can use the Notebook_1_SQL_database notebook to populate this database. This notebook allows you to insert data into a MySQL instance of your choice, where an empty database is assumed to exist with the above-mentioned schema. Alternatively, it allows you to export the relational data to Pandas tables.

An explanation of tables and identifiers

  • The pub table contains publications from all data sources. If you would like to work with publications coming exclusively from one data source, join it with the datasource table via the pub_datasource table.
  • The primary keys of all tables (pub_id, covid19_metadata_id, who_metadata_id, dimensions_metadata_id, datasource_id) are not stable and are only internally consistent: if you create different versions of the database, they will likely differ.
  • In order to work with Dimensions and Altmetrics data, publication identifiers should be used. Please give preference to DOIs, then to PMIDs, then to PMCIDs (listed in order of coverage).
  • We removed a few (<1000) publications that had no known identifier among these three options. These are usually pre-prints, which are likely to be assigned an identifier in future releases.
  • The metadata tables contain fields that are specific to one data source and that we considered potentially useful. They are only available for publications coming from that data source.
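
The join through pub_datasource described above can be sketched as follows. This is a minimal illustration, not the repository's schema: it uses an in-memory SQLite database instead of MySQL so that it runs without a server, and the column names (title, doi, source), the sample rows, and the helper functions pubs_from and pubs_exclusive_to are assumptions made for the example.

```python
import sqlite3

# Simplified, hypothetical versions of the three tables. The real schema
# has more columns, and its column names may differ from the ones used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pub (pub_id INTEGER PRIMARY KEY, title TEXT, doi TEXT);
CREATE TABLE datasource (datasource_id INTEGER PRIMARY KEY, source TEXT);
CREATE TABLE pub_datasource (pub_id INTEGER, datasource_id INTEGER);
INSERT INTO pub VALUES
    (1, 'Paper A', '10.1000/a'),
    (2, 'Paper B', '10.1000/b'),
    (3, 'Paper C', NULL);
INSERT INTO datasource VALUES (1, 'CORD19'), (2, 'Dimensions'), (3, 'WHO');
INSERT INTO pub_datasource VALUES (1, 1), (1, 2), (2, 2), (3, 3);
""")


def pubs_from(conn, source_name):
    """Publications listed in source_name, whether or not other sources list them too."""
    query = """
        SELECT p.pub_id, p.title
        FROM pub AS p
        JOIN pub_datasource AS pd ON pd.pub_id = p.pub_id
        JOIN datasource AS d ON d.datasource_id = pd.datasource_id
        WHERE d.source = ?
        ORDER BY p.pub_id
    """
    return conn.execute(query, (source_name,)).fetchall()


def pubs_exclusive_to(conn, source_name):
    """Publications listed in source_name and in no other source.

    Assumes pub_datasource holds at most one row per (pub_id, datasource_id).
    """
    query = """
        SELECT p.pub_id, p.title
        FROM pub AS p
        JOIN pub_datasource AS pd ON pd.pub_id = p.pub_id
        JOIN datasource AS d ON d.datasource_id = pd.datasource_id
        GROUP BY p.pub_id, p.title
        HAVING COUNT(*) = 1 AND MAX(d.source) = ?
        ORDER BY p.pub_id
    """
    return conn.execute(query, (source_name,)).fetchall()


print(pubs_from(conn, "Dimensions"))
print(pubs_exclusive_to(conn, "Dimensions"))
```

Because the primary keys are only internally consistent, results like these should be carried between database versions by DOI, PMID, or PMCID rather than by pub_id.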

Query Dimensions and Altmetrics

You can then query the Dimensions and Altmetrics APIs with your own keys, using the Notebook_2_API_queries notebook. You can request access as a researcher here: https://www.dimensions.ai/scientometric-research.
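
Before querying, each publication needs a single identifier. A minimal sketch of the preference order recommended above (DOI, then PMID, then PMCID) is shown below. The function name preferred_identifier and its arguments are hypothetical and not taken from the notebooks, and no API call is made here.

```python
from typing import Optional, Tuple


def _is_missing(value) -> bool:
    """True for None, NaN (as produced by pandas for empty cells), or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float not equal to itself
        return True
    return not str(value).strip()


def preferred_identifier(doi=None, pmid=None, pmcid=None) -> Optional[Tuple[str, str]]:
    """Return (kind, value) for the first available identifier, or None if there is none.

    The order follows the recommendation in this README: DOI, then PMID, then PMCID.
    """
    for kind, value in (("doi", doi), ("pmid", pmid), ("pmcid", pmcid)):
        if not _is_missing(value):
            return kind, str(value).strip()
    return None


print(preferred_identifier(doi="10.1000/a", pmid="123"))
print(preferred_identifier(doi=None, pmid=123, pmcid="PMC456"))
print(preferred_identifier())
```

Returning the identifier kind together with its value lets the calling code choose the matching query field for whichever API it is using.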

Data overview

Finally, using the Notebook_3_metadata_overview and Notebook_4_API_data_overview notebooks, you can get an overview of some of the resulting metadata and data.

How to give feedback

Please open an issue, or propose changes using a Pull Request.

How to cite

TBD

Acknowledgements

We would like to thank Digital Science (Dimensions, Altmetrics) for their support and for making all their data available to us.
