Entity Extraction
Ty Andrews edited this page Jun 28, 2023 · 6 revisions
This page describes how to label data for, train, and retrain the entity extraction models.
Instructions on how to set up LabelStudio using Hugging Face Spaces can be found here: LabelStudio README
To begin labelling data, the recommended process is:
- Run the labelling preprocessing script to split articles into bite-sized chunks
- Generate pre-labelled entities for the new text chunks to speed up data labelling
- Upload the pre-labelled JSON files to the cloud storage bucket attached to the LabelStudio instance
For detailed instructions on how to perform the data labelling see the latter part of the LabelStudio README.
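The first two steps above can be sketched roughly as follows. This is only an illustration, not the project's preprocessing or pre-labelling script: `split_into_chunks`, `to_prelabelled_task`, the 50-word limit, the regex stand-in for a model, and the `AGE` label are all hypothetical. The task dictionary follows the general LabelStudio pre-annotation layout (`data` plus `predictions`), and the `from_name`/`to_name` values are assumed to match the labelling configuration of your project.

```python
import re


def split_into_chunks(text, max_words=50):
    """Group whole sentences into chunks of roughly max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def to_prelabelled_task(chunk, pattern, label, model_version="regex-baseline"):
    """Build one task dict with pre-labelled spans for a text chunk.

    A regex stands in for a real NER model here (assumption); "label" and
    "text" are assumed names from the labelling configuration.
    """
    results = []
    for match in re.finditer(pattern, chunk):
        results.append(
            {
                "from_name": "label",
                "to_name": "text",
                "type": "labels",
                "value": {
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                    "labels": [label],
                },
            }
        )
    return {
        "data": {"text": chunk},
        "predictions": [{"model_version": model_version, "result": results}],
    }
```

Each task dictionary can then be written out with `json.dump` as one JSON file per chunk before uploading to the storage bucket.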
To train the named entity recognition (NER) model, follow these steps:
- Download the latest labelled files from LabelStudio by following the "Downloading Labelled Files Steps" in the LabelStudio README
- To train the HuggingFace model, follow the steps outlined in the Hugging Face Training README
- To train the spaCy model, follow the steps outlined in the spaCy Training README
After articles with extracted entities have been reviewed, follow these steps to retrain the NER models and complete the follow-on steps:
- Add all the JSON files (from LabelStudio) and the reviewed article parquet files to a single folder under data/entity-extraction/raw/, e.g. data/entity-extraction/raw/2023-06-28_ner-model-retrain/
- Follow the training process outlined above to train the model, which will pull in the new data from the parquet files
- Upload the retrained model from the model output folder (or from Google Drive if training on Colab) to the HuggingFace hub, following the HuggingFace upload instructions or the spaCy upload instructions, depending on the model trained
- Rebuild the metaextractor-entity-extraction-pipeline Docker image to pull in the latest models, following the instructions here: Entity Extraction Pipeline Docker Instructions
- Notify the xDD team to pull the latest Docker image from Docker Hub for future entity extraction runs on their servers
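The first retraining step, gathering the JSON and parquet files into one dated folder, could be scripted along these lines. This is a minimal sketch under stated assumptions: `stage_retrain_files` and its parameters are hypothetical helpers, not part of the MetaExtractor repository, and the folder name simply mirrors the 2023-06-28_ner-model-retrain example above.

```python
import shutil
from datetime import date
from pathlib import Path


def stage_retrain_files(
    source_dirs,
    raw_root="data/entity-extraction/raw",
    run_label="ner-model-retrain",
    run_date=None,
):
    """Copy .json and .parquet files from source_dirs into one dated folder.

    Returns the target folder and the list of copied file paths.
    Files sharing a name across source folders overwrite each other.
    """
    run_date = run_date or date.today()
    target = Path(raw_root) / f"{run_date.isoformat()}_{run_label}"
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in source_dirs:
        for path in sorted(Path(source).iterdir()):
            # Only the labelled JSON and reviewed parquet files are staged.
            if path.is_file() and path.suffix.lower() in {".json", ".parquet"}:
                copied.append(Path(shutil.copy2(path, target / path.name)))
    return target, copied
```

For example, `stage_retrain_files(["labelstudio-export", "reviewed-articles"])` (both folder names are placeholders) would create a folder named with today's date under data/entity-extraction/raw/ and copy the matching files into it.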
See a problem or have an idea to improve the project? Please submit an issue here: Submit New Issue to MetaExtractor