This project accompanies the paper "Determining Research Priorities Using Machine Learning" (Thomas et al. (2024), DOI: TBD).
Create the virtual environment needed to run the software first. Once it is installed, obtain the data either with the provided utility or by downloading it manually from Zenodo.
Requirements
- Python 3.10
Build the conda virtual environment and activate it:
> conda env create -f env-strat-paper.yml
> conda activate strat-paper
Alternatively, a requirements.txt file is provided and may be used to build the virtual environment.
Data were produced using the topic emergence package and are available from Zenodo at the following URL: https://zenodo.org/records/13621625. Download and unpack all of the data into the data subdirectory.
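For a manual download, the following is a minimal sketch of fetching the record's files with only the Python standard library. It is not the utility provided with this project. It assumes that Zenodo's public REST API serves the record metadata at `https://zenodo.org/api/records/13621625` and that each file entry carries a `key` (file name) and a `links.self` download URL. The function names `list_file_links` and `download_record` are hypothetical.

```python
import json
import urllib.request
from pathlib import Path

RECORD_ID = "13621625"  # record ID taken from the Zenodo URL above
# Assumption: Zenodo's REST API exposes record metadata at this endpoint.
API_URL = f"https://zenodo.org/api/records/{RECORD_ID}"


def list_file_links(record: dict) -> list[tuple[str, str]]:
    """Return (file name, download URL) pairs from a record's JSON metadata.

    Assumes each entry in record["files"] has a "key" and a "links"
    mapping containing a "self" download URL.
    """
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]


def download_record(data_dir: str = "data") -> None:
    """Download every file of the record into data_dir (needs network access)."""
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(API_URL) as resp:
        record = json.load(resp)
    for name, url in list_file_links(record):
        # Keep only the base name so files always land inside data_dir.
        urllib.request.urlretrieve(url, out / Path(name).name)
```

Calling `download_record()` from the repository root would place the files in the data subdirectory. Any archives among them still need to be unpacked afterwards.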
Each notebook illustrates one part of the work, from processing the outputs of the LDA modeling (the data obtained above) through analysis and plotting of the results.
Descriptions of notebooks:
Name | Description |
---|---|
Process_Data | Do (most of) the basic processing of LDA model output data into the form used by the other notebooks. Some light analysis of the processing is also included. Run this notebook first to generate the files needed by the other notebooks. Switches within the notebook allow testing of different data generation/filtering choices. |
Bootstrap_Estimation_1998-2010_RI | Estimate errors using bootstrap for Research Interest (RI) metric. |
Bootstrap_Estimation_1998-2010_TCS | Estimate errors using bootstrap for Topic Contribution Score (TCS) metric. |
Bootstrap_Estimation_1998-2010_TCS_CAGR | Estimate errors using bootstrap for TCS Compound Annual Growth Rate (TCS_CAGR) metric. |
Bootstrap_Estimation_DS2010_TCS | Estimate errors using bootstrap for the Decadal Survey 2010 (DS2010) corpus TCS (TCS_{DS2010}) metric. |
Bootstrap_Estimation_Decadal2010-Whitepapers_TCS | Estimate errors using bootstrap for the Decadal Survey 2010 submitted whitepapers TCS (TCS_{whitepaper}) metric. |
Explore_Topics | Examine topic keywords for selected topics in selected runs. |
Journal_Citation_Modeling | Modeling of citation rates for astronomy literature (1998 - 2019) with various functions. |
MLCR_Analysis | Analysis of TCS-based metrics compared to the Mean Lifetime Citation Rate (MLCR). |
Paper_plots | Generate most of the plots for the paper. |
Sample timeseries Plots | Generate selected topic timeseries plots for investigation. |
Stable_Topics | Generate stable topic files from LDA model output data. You don't need to use this notebook unless you want to investigate Lpt thresholds other than the one used in the paper. |
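
Several of the notebooks above estimate errors with a bootstrap, and one of them works with a compound annual growth rate. As a rough illustration of those two ideas only, the sketch below resamples a list of values with replacement to estimate the standard error of a statistic, and computes a growth rate using the standard CAGR definition. It is not the code used in the notebooks. The function names, the default of 1000 resamples, and the choice of the mean as the statistic are all assumptions, and the paper's exact definition of TCS_CAGR may differ.

```python
import random
import statistics


def bootstrap_standard_error(values, statistic=statistics.mean,
                             n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by bootstrap resampling.

    Each resample draws len(values) items from `values` with replacement.
    The spread (sample standard deviation) of the statistic across the
    resamples is returned as the error estimate.
    """
    rng = random.Random(seed)  # fixed seed so results are reproducible
    n = len(values)
    estimates = [statistic(rng.choices(values, k=n))
                 for _ in range(n_resamples)]
    return statistics.stdev(estimates)


def cagr(start_value, end_value, n_years):
    """Standard compound annual growth rate over n_years.

    Assumption: the textbook definition (end / start) ** (1 / n_years) - 1.
    """
    return (end_value / start_value) ** (1.0 / n_years) - 1.0
```

For example, `bootstrap_standard_error(scores)` on a hypothetical list `scores` of per-run metric values gives a spread that can be quoted as an error bar on their mean.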