Article Relevance Prediction

kellywujy edited this page Jun 29, 2023 · 6 revisions

This page outlines the primary workflow for developing the Article Relevance Prediction model and points to the required references.

Data

The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download HERE. Download all files and extract the contents into MetaExtractor/data/article-relevance/raw/.

The prediction pipeline requires the trained model object, which is available HERE. Download the model file and place the .joblib file in MetaExtractor/models/article-relevance/.
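A quick way to confirm that both downloads landed in the right place is a small check script. The sketch below is not part of MetaExtractor; the function name `check_downloads` is hypothetical, and it assumes the script is run from the directory that contains the MetaExtractor/ folder. It only checks that the raw data folder is non-empty and that at least one .joblib file exists in the model folder.

```python
from pathlib import Path

# Directories named on this page, relative to the directory containing MetaExtractor/.
RAW_DATA_DIR = Path("MetaExtractor/data/article-relevance/raw")
MODEL_DIR = Path("MetaExtractor/models/article-relevance")


def check_downloads(base_dir="."):
    """Return a list of problems; an empty list means both downloads look in place.

    Hypothetical helper for illustration, not a MetaExtractor function.
    """
    base = Path(base_dir)
    problems = []

    # The training data should have been extracted into the raw data folder.
    raw_dir = base / RAW_DATA_DIR
    if not raw_dir.is_dir() or not any(raw_dir.iterdir()):
        problems.append(f"no training data files found in {raw_dir}")

    # The trained model is expected as a .joblib file in the model folder.
    model_dir = base / MODEL_DIR
    if not model_dir.is_dir() or not list(model_dir.glob("*.joblib")):
        problems.append(f"no .joblib model file found in {model_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_downloads():
        print("Missing:", problem)
```

If the script prints nothing, both folders contain what the pipeline expects to find.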

Model Training

To train the model and reproduce the results, see Model Training. The environment variable USE_REVIEWED_DATA controls which data is used: it defaults to true, in which case the pipeline trains on newly reviewed articles. Set it to False to reproduce the original model.

Model Retraining

To retrain the Article Relevance Prediction model, see Model Training and set the following environment variables:

  1. Set the environment variable USE_REVIEWED_DATA. It defaults to true, in which case newly reviewed articles are used to train the model. If set to False, the pipeline reproduces the original model.
  2. Set the environment variable REVIEWED_FOLDER_PATH to the folder where the result parquet file is stored. This allows the pipeline to use the results of the Data Review Tool to retrain a new model.
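To make the two settings concrete, here is a minimal sketch of how a script could read them. It is an illustration only: the function name `read_retraining_settings` is hypothetical, and the exact way the MetaExtractor pipeline parses these variables is not documented on this page. The sketch assumes that any value other than "false", "0", or "no" (case-insensitive) counts as true, matching the stated default of true.

```python
import os


def read_retraining_settings(environ=None):
    """Read the two retraining variables described above.

    Hypothetical helper for illustration; the real pipeline may parse
    these values differently.
    """
    environ = os.environ if environ is None else environ

    # USE_REVIEWED_DATA defaults to true when it is not set.
    raw_flag = environ.get("USE_REVIEWED_DATA", "true")
    use_reviewed_data = raw_flag.strip().lower() not in {"false", "0", "no"}

    # REVIEWED_FOLDER_PATH points at the folder holding the review result parquet file.
    reviewed_folder_path = environ.get("REVIEWED_FOLDER_PATH")

    return {
        "use_reviewed_data": use_reviewed_data,
        "reviewed_folder_path": reviewed_folder_path,
    }


if __name__ == "__main__":
    print(read_retraining_settings())
```

Passing a plain dictionary instead of relying on `os.environ` makes it easy to try out combinations, for example `read_retraining_settings({"USE_REVIEWED_DATA": "False"})` for the original-model case.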

Prediction Pipeline

To run the prediction pipeline, see Article Relevance Prediction.
