
Entity Extraction

Ty Andrews edited this page Jun 27, 2023 · 6 revisions

This page covers data labelling, model training, and model retraining for the entity extraction models.

Data Labelling

Instructions on how to set up LabelStudio using HuggingFace Spaces can be found here: LabelStudio README

To begin labelling data, the recommended process is:

  1. Run the labelling preprocessing script to split articles into bite-sized chunks
  2. Generate pre-labelled entities for the new text chunks to speed up data labelling
  3. Upload the pre-labelled JSON files to the cloud storage bucket attached to the LabelStudio instance

For detailed instructions on how to perform the data labelling, see the latter part of the LabelStudio README.
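
Steps 1 and 2 of the labelling process can be sketched roughly as follows. This is a minimal illustration and not the project's actual preprocessing script: the function names, the 50-word chunk size, the regex-based entity finder (a stand-in for a real NER model), and the "AGE" label are all assumptions made for the example. The task dictionary follows the general shape of LabelStudio's pre-annotation JSON, but the `from_name` and `to_name` values must match the names in your own labelling configuration.

```python
import json
import re


def split_into_chunks(text, max_words=50):
    """Split text into chunks of at most max_words words, breaking only at sentence ends.

    A single sentence longer than max_words is kept whole as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def find_entities(text, patterns):
    """Return character spans for every regex match (hypothetical stand-in for a model)."""
    spans = []
    for label, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            spans.append(
                {"start": match.start(), "end": match.end(),
                 "text": match.group(), "label": label}
            )
    return sorted(spans, key=lambda s: s["start"])


def to_labelstudio_task(chunk, spans, from_name="label", to_name="text",
                        model_version="baseline"):
    """Wrap a chunk and its predicted spans as one pre-annotated task.

    from_name and to_name are assumed values; use the names from your labelling config.
    """
    return {
        "data": {"text": chunk},
        "predictions": [{
            "model_version": model_version,
            "result": [
                {
                    "from_name": from_name,
                    "to_name": to_name,
                    "type": "labels",
                    "value": {"start": s["start"], "end": s["end"],
                              "text": s["text"], "labels": [s["label"]]},
                }
                for s in spans
            ],
        }],
    }


if __name__ == "__main__":
    article = "The core was dated to 1200 BP. Pollen counts were high."
    patterns = {"AGE": r"\d+ BP"}  # illustrative pattern only
    tasks = [to_labelstudio_task(c, find_entities(c, patterns))
             for c in split_into_chunks(article)]
    print(json.dumps(tasks, indent=2))
```

Writing each task to its own JSON file gives files that could then be uploaded in step 3. Character offsets are computed per chunk, so they stay valid only if the chunk text is uploaded unchanged.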

Model Training

To train the named entity recognition (NER) model, see the instructions for both the spaCy training process and the HuggingFace training setup.

Model Retraining

After articles with extracted entities have been reviewed, use the following steps to retrain the NER models and complete the follow-on deployment steps:

  1. Run the script XXXXXXXXX.py on the reviewed article Parquet file to extract the corrected text
  2. Add the corrected text examples to the existing training data
  3. Follow the training process above to train the model
  4. Upload the new model files to the HuggingFace Hub, following the HuggingFace upload instructions
  5. Rebuild the metaextractor-entity-extraction-pipeline Docker image to pull in the latest models, following the instructions here: Entity Extraction Pipeline Docker Instructions
  6. Notify the xDD team to pull the latest Docker image from Docker Hub for future entity extraction runs on their servers
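
Step 2 above (adding corrected examples to the existing training data) might look something like the sketch below. It is not the project's code. It assumes, purely for illustration, that both the existing training data and the corrected examples are stored as JSON Lines files in which each record has a `text` field and an `entities` field; the real file format and field names may differ. A corrected example replaces any existing example with the same text, so reviewer corrections take precedence over older labels.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def merge_examples(existing, corrected):
    """Merge corrected examples into the existing ones, keyed on the example text.

    A corrected example overrides an existing example with identical text;
    new texts are appended. Dict insertion order keeps the result deterministic.
    """
    merged = {example["text"]: example for example in existing}
    for example in corrected:
        merged[example["text"]] = example
    return list(merged.values())


def write_jsonl(examples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical in-memory data; real data would come from load_jsonl(...).
    existing = [{"text": "Pollen was found.", "entities": []}]
    corrected = [{"text": "Pollen was found.", "entities": [[0, 6, "TAXA"]]}]
    print(merge_examples(existing, corrected))
```

Keying on the exact text is the simplest choice; if reviewers can also edit the text itself, a stable example identifier would be a safer key.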