Analysis Code for "Determining Research Priorities Using Machine Learning" Paper

About

This repository contains the analysis code for the paper "Determining Research Priorities Using Machine Learning" (Thomas et al. (2024), DOI: TBD).

Installation
Software
Data
Use

Installation

First create the virtual environment needed to run the software. Once it is set up, obtain the data either with the provided utility or by downloading it manually from Zenodo.

Software

Requirements

Python 3.10

Build the conda virtual environment and activate it:

> conda env create -f env-strat-paper.yml
> conda activate strat-paper

Alternatively, the virtual environment can be built from the provided requirements.txt file.

Data

Data were produced using the topic emergence package and are available from Zenodo at the following URL: https://zenodo.org/records/13621625. Download and unpack all of the data into the data subdirectory.
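The unpacking step above could be scripted roughly as follows. This is a minimal sketch, not the repository's own utility: it assumes the Zenodo files have already been downloaded by hand into a local folder (the `downloads` folder name and the `unpack_into_data` helper are hypothetical), and it makes no assumption about the file names or archive formats in the record beyond what Python's standard `shutil` module can unpack.

```python
import shutil
from pathlib import Path


def unpack_into_data(download_dir, data_dir="data"):
    """Unpack every archive in download_dir into data_dir; copy other files as-is.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    download_dir = Path(download_dir)
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # File extensions that shutil.unpack_archive understands (.zip, .tar.gz, ...).
    archive_suffixes = {
        ext for _, exts, _ in shutil.get_unpack_formats() for ext in exts
    }

    handled = []
    for path in sorted(download_dir.iterdir()):
        if not path.is_file():
            continue
        if any(path.name.endswith(suffix) for suffix in archive_suffixes):
            # Archive: extract its contents directly into the data directory.
            shutil.unpack_archive(str(path), str(data_dir))
        else:
            # Plain file: copy it unchanged.
            shutil.copy2(path, data_dir / path.name)
        handled.append(path.name)
    return handled


if __name__ == "__main__":
    # "downloads" is an assumed location for the manually downloaded files.
    if Path("downloads").is_dir():
        print(unpack_into_data("downloads", "data"))
```

Run from the repository root, this would place the files in the data subdirectory that the notebooks expect.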

Use

Each notebook covers one part of the workflow, from processing the LDA modeling outputs (the data obtained above) to analyzing and plotting the results.

Descriptions of notebooks:

Name	Description
Process_Data	Perform most of the basic processing of the LDA model output data into the form used by the other notebooks, along with some light analysis of the processing. Run this notebook first to generate the files the other notebooks need. Switches within the notebook let you test different data generation and filtering choices.
Bootstrap_Estimation_1998-2010_RI	Estimate errors using bootstrap for Research Interest (RI) metric.
Bootstrap_Estimation_1998-2010_TCS	Estimate errors using bootstrap for Topic Contribution Score (TCS) metric.
Bootstrap_Estimation_1998-2010_TCS_CAGR	Estimate errors using bootstrap for TCS Compound Annual Growth Rate (TCS_CAGR) metric.
Bootstrap_Estimation_DS2010_TCS	Estimate errors using bootstrap for the Decadal Survey 2010 (DS2010) corpus TCS (TCS_{DS2010}) metric.
Bootstrap_Estimation_Decadal2010-Whitepapers_TCS	Estimate errors using bootstrap for the Decadal Survey 2010 submitted whitepapers TCS (TCS_{whitepaper}) metric.
Explore_Topics	Examine topic keywords for selected topics in selected runs.
Journal_Citation_Modeling	Modeling of citation rates for astronomy literature (1998 - 2019) with various functions.
MLCR_Analysis	Analysis of TCS-based metrics compared to the Mean Lifetime Citation Rate (MLCR).
Paper_plots	Generate most of the plots for the paper.
Sample timeseries Plots	Generate selected topic timeseries plots for investigation.
Stable_Topics	Generate stable topic files from LDA model output data. You don't need this notebook unless you want to investigate Lpt thresholds other than the one used in the paper.
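For readers unfamiliar with the two generic techniques named in the table, the sketch below illustrates a compound annual growth rate and a bootstrap error estimate. It is an illustration under stated assumptions, not the notebooks' code: the function names `cagr` and `bootstrap_std_error` are hypothetical, the resampling scheme shown (plain resampling with replacement) may differ from the one used in the paper, and the input values are made up.

```python
import random
import statistics


def cagr(first_value, last_value, n_years):
    """Compound annual growth rate between two values n_years apart."""
    if first_value <= 0 or n_years <= 0:
        raise ValueError("first_value and n_years must be positive")
    return (last_value / first_value) ** (1.0 / n_years) - 1.0


def bootstrap_std_error(values, statistic=statistics.mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    values = list(values)
    estimates = []
    for _ in range(n_resamples):
        # Draw a resample of the same size as the original sample.
        resample = rng.choices(values, k=len(values))
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates its standard error.
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Made-up numbers: a value growing from 100 to 121 over 2 years is about 10% per year.
    print(cagr(100.0, 121.0, 2))
    # Made-up sample of scores; prints the bootstrap standard error of their mean.
    print(bootstrap_std_error([0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2]))
```

Passing a different `statistic` (for example `statistics.median`) gives the error of that quantity instead; the notebooks themselves are the reference for how the RI, TCS and TCS_CAGR errors are actually computed.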
