SKET

This repository contains the source code for the Semantic Knowledge Extractor Tool (SKET).
SKET is an unsupervised hybrid knowledge extraction system that combines a rule-based expert system with pre-trained machine learning models to extract cancer-related information from pathology reports.

Installation

CAVEAT: the package has been tested using Python 3.7 and 3.8 on unix-based systems and win64 systems. There are no guarantees that it works with different configurations.

Clone this repository

git clone https://github.com/ExaNLP/sket.git

Install all the requirements:

pip install -r requirements.txt

Then install any core model from scispacy v0.3.0 (default is en_core_sci_sm):

pip install </path/to/download>

The required scispacy models are available at: https://github.com/allenai/scispacy/tree/v0.3.0

Datasets

Users can go into the datasets folder and place their datasets within the corresponding use case folders. Use cases are: Colon Cancer (colon), Cervix Uterine Cancer (cervix), and Lung Cancer (lung).

Datasets can be provided in two formats:

XLS Format

Users can provide .xls or .xlsx files with the first row consisting of column headers (i.e., fields) and the rest of data inputs.

JSON Format

Users can provide .json files structured in two ways:

As a dict containing a reports field consisting of multiple key-value reports;

{'reports': [{k: v, ...}, ...]}

As a dict containing a single key-value report.

{k: v, ...}
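
Both layouts can be normalized to a list of reports before further processing. The snippet below is a minimal sketch of that idea, not SKET's own loader; the function name load_reports is hypothetical.

```python
import json


def load_reports(path):
    """Return a list of report dicts from a .json dataset in either layout."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Layout 1: {"reports": [{...}, ...]} holds multiple reports.
    if isinstance(data, dict) and isinstance(data.get("reports"), list):
        return data["reports"]
    # Layout 2: {k: v, ...} is a single report, wrapped in a list here.
    return [data]
```

Note that the files themselves must be valid JSON, so keys and string values need double quotes.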

SKET concatenates data from all the fields before translation. Users can alter this behavior by filling ./sket/rep_proc/rules/report_fields.txt with target fields, one per line. Users can also provide a custom file to SKET, as long as it contains one field per line (more on this below).
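
The field selection described above can be sketched as follows. This is an illustration under stated assumptions, not SKET's implementation: the helper names and the single-space separator are hypothetical.

```python
def read_target_fields(path):
    """Read target field names from a text file, one field per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def concatenate_fields(report, target_fields=None, sep=" "):
    """Join report values into one text, optionally restricted to target fields."""
    # Without target fields, every field of the report is used.
    keys = target_fields if target_fields else list(report)
    # Skip fields that are missing or empty in this report.
    return sep.join(str(report[k]) for k in keys if report.get(k) not in (None, ""))


# Hypothetical report with made-up field names.
report = {"diagnosis": "adenocarcinoma", "materials": "biopsy", "notes": ""}
print(concatenate_fields(report))
print(concatenate_fields(report, target_fields=["materials"]))
```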

Users can provide special headers that are treated differently from regular text by SKET. These fields are:
id: when specified, the id field is used to identify the corresponding report. Otherwise, uuid is used.
gender: when specified, the gender field is used to provide patient's information within RDF graphs. Otherwise, gender is set to None.
age: when specified, the age field is used to provide patient's information within RDF graphs. Otherwise, age is set to None.
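
The defaults for the special headers can be sketched as below. This is not SKET's code: the function name is hypothetical, and both the use of a random UUID (uuid4) and the exclusion of the special headers from the regular text fields are assumptions.

```python
import uuid


def split_special_headers(report):
    """Separate the id, gender, and age headers from the regular text fields."""
    report_id = report.get("id")
    special = {
        # Assumption: fall back to a random UUID when no id is provided.
        "id": report_id if report_id is not None else str(uuid.uuid4()),
        # gender and age default to None when absent.
        "gender": report.get("gender"),
        "age": report.get("age"),
    }
    # Everything else is treated as regular report text.
    text_fields = {k: v for k, v in report.items() if k not in special}
    return special, text_fields
```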

Dataset Statistics

Users can compute dataset statistics to understand the distribution of concepts extracted by SKET for each use case. For instance, if a user wants to compute statistics for Colon Cancer, they can run

python compute_stats.py --outputs ./outputs/concepts/refined/colon/*.json --use_case colon

Pretrain

SKET can be deployed with different pretrained models, i.e., fastText and BERT. In our experiments, we employed the BioWordVec fastText model and the Bio + Clinical BERT model.
BioWordVec can be downloaded from https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.bin
Bio + Clinical BERT model can be automatically downloaded at run time by setting the biobert SKET parameter equal to 'emilyalsentzer/Bio_ClinicalBERT'

Users can pass different pretrained models depending on their preferences.

Usage

Users can deploy SKET using run_med_sket.py. We release within ./examples three sample datasets that can be used as toy examples to play with SKET. SKET can be deployed with different configurations and using different combinations of matching models.

Furthermore, SKET exposes a threshold parameter that controls the strictness of the entity linking component. The higher the threshold, the more precise the model, at the expense of recall, and vice versa. Users can tune this parameter to obtain the desired trade-off between precision and recall. Note that the threshold must always be lower than or equal to the number of matching models in use; otherwise, the entity linking component does not return any concept.
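
One way to read the constraint on the threshold is sketched below. It assumes that each matching model yields a similarity in [0, 1] and that the per-model similarities are summed before being compared with the threshold. This scoring scheme is an assumption made for illustration, not a description of SKET's internals, and the scores are made up.

```python
def link_concepts(scores_by_candidate, threshold):
    """Keep candidates whose summed per-model similarity reaches the threshold."""
    linked = {}
    for candidate, scores in scores_by_candidate.items():
        total = sum(scores.values())
        if total >= threshold:
            linked[candidate] = total
    return linked


# Hypothetical similarities from two matching models.
scores = {
    "colon adenocarcinoma": {"biow2v": 0.95, "str_match": 0.90},
    "colon adenoma": {"biow2v": 0.80, "str_match": 0.70},
}
# With two models the summed score cannot exceed 2, so a threshold of 2.5
# can never be reached and nothing is linked.
print(link_concepts(scores, threshold=1.8))
print(link_concepts(scores, threshold=2.5))
```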

The available matching models, in form of SKET parameters, are:
biow2v: the ScispaCy pretrained word embeddings. Set this parameter to True to use them.
biofast: the fastText model. Set this parameter to /path/to/fastText/file to use fastText.
biobert: the BERT model. Set this parameter to bert-name to use BERT (see https://huggingface.co/transformers/pretrained_models.html for model IDs).
str_match: the Gestalt Pattern Matching (GPM) model. Set this parameter to True to use GPM.
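
As an illustration of string matching in the GPM style, Python's standard difflib.SequenceMatcher implements a closely related algorithm. Whether SKET relies on difflib is an assumption; the helper name below is hypothetical.

```python
from difflib import SequenceMatcher


def gpm_similarity(mention, concept_label):
    """Return a string similarity in [0, 1] between a mention and a concept label."""
    # Lowercasing makes the comparison case-insensitive.
    return SequenceMatcher(None, mention.lower(), concept_label.lower()).ratio()


print(gpm_similarity("Tubular adenoma", "adenoma"))
```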

When using BERT, users can also set the gpu parameter to the corresponding GPU number to speed up SKET execution.

For instance, a user can run the following script to obtain concepts, labels, and RDF graphs on the test.xlsx sample dataset:

python run_med_sket.py \
       --src_lang it \
       --use_case colon \
       --spacy_model en_core_sci_sm \
       --w2v_model \
       --string_model \
       --thr 2.0 \
       --store \
       --dataset ./examples/test.xlsx

or, if a user also wants to use BERT with GPU support, they can run the following script:

python run_med_sket.py \
       --src_lang it \
       --use_case colon \
       --spacy_model en_core_sci_sm \
       --w2v_model \
       --string_model \
       --bert_model emilyalsentzer/Bio_ClinicalBERT \
       --gpu 0 \
       --thr 2.5 \
       --store \
       --dataset ./examples/test.xlsx

In both cases, we set the src_lang to it as the source language of reports is Italian. Therefore, SKET needs to translate reports from Italian to English before performing information extraction.
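
When SKET is launched from another Python program, the command line can be assembled programmatically. The sketch below uses only the flags shown in the two examples above; the helper name and its client-side threshold check are hypothetical. The fastText flag is omitted because it does not appear in the examples.

```python
import shlex


def build_sket_command(dataset, use_case, src_lang, thr,
                       spacy_model="en_core_sci_sm", w2v=True,
                       string_match=True, bert_model=None, gpu=None,
                       store=True):
    """Build the argument list for run_med_sket.py from keyword options."""
    # The threshold must not exceed the number of enabled matching models.
    n_models = int(w2v) + int(string_match) + int(bert_model is not None)
    if thr > n_models:
        raise ValueError(f"thr={thr} exceeds the {n_models} enabled matching model(s)")
    cmd = ["python", "run_med_sket.py",
           "--src_lang", src_lang,
           "--use_case", use_case,
           "--spacy_model", spacy_model]
    if w2v:
        cmd.append("--w2v_model")
    if string_match:
        cmd.append("--string_model")
    if bert_model is not None:
        cmd += ["--bert_model", bert_model]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    cmd += ["--thr", str(thr)]
    if store:
        cmd.append("--store")
    cmd += ["--dataset", dataset]
    return cmd


cmd = build_sket_command("./examples/test.xlsx", "colon", "it", thr=2.0)
# Print a shell-ready version of the command.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root.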

Docker

SKET can also be deployed as a Docker container -- thus avoiding the need to install its dependencies directly on the host machine. Two Docker images can be built: sket_cpu and sket_gpu.
For sket_gpu, NVIDIA drivers have to be already installed within the host machine. Users can refer to NVIDIA user-guide for more information.

Instructions on how to build and run the sket images are reported below. If you already have Docker installed on your machine, you can skip the first step.

  1. Install Docker. In this regard, check out the correct installation procedure for your platform.

  2. Install docker-compose. In this regard, check the correct installation procedure for your platform.

  3. Check the Docker daemon (i.e., dockerd) is up and running.

  4. Download or clone the sket repository.

  5. The config.json file in sket_server/sket_rest_config allows you to configure the sket instance. Edit this file to set the following parameters: w2v_model, fasttext_model, bert_model, string_model, gpu, and thr, where thr is the similarity threshold (default: 0.9).

  6. Depending on the Docker image of interest, follow one of the two procedures below:
    6a) SKET CPU-only: from the sket project folder, type: docker-compose run --service-ports sket_cpu
    6b) SKET GPU-enabled: from the sket project folder, type: docker-compose run --service-ports sket_gpu

  7. When the image is ready, the sket server runs at http://0.0.0.0:8000 if you run sket_cpu, or at http://0.0.0.0:8001 if you run sket_gpu.

  8. The annotation of medical reports can be performed with two types of POST request:
    8a) If you want to store the annotations in the outputs directory, the URL to make the request to is: http://0.0.0.0:8000/annotate/<use_case>/<language> where use_case and language are the use case and the language (identified using ISO 639-1 Code) of your reports, respectively.

    Request example:

    curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it

    where path/to/examples is the path to the examples folder. With this type of request, labels and concepts are stored in .json files, while graphs are stored in .json, .n3, .ttl, and .trig files.
    To store graphs in only one of the .n3, .ttl, and .trig formats, append the format to the URL after the language: /trig for .trig, /turtle for .ttl, or /n3 for .n3.

    Request example:

    curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/turtle

    where path/to/examples is the path to the examples folder.

    8b) If you want to use the labels, the concepts, or the graphs returned by sket without saving them, the URL to make the request to is: http://0.0.0.0:8000/annotate/<use_case>/<language>/<output> where use_case and language are the use case and the language (identified using ISO 639-1 Code) of your reports, respectively, and output is labels, concepts, or graphs.

    Request example:

    curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/labels

    where path/to/examples is the path to the examples folder.
    If you want your request to return a graph, the request must also include the graph format. Hence, the URL becomes: http://0.0.0.0:8000/annotate/<use_case>/<language>/graphs/<rdf_format> where <rdf_format> is one of: turtle, n3, and trig.

    curl -H "Content-Type: multipart/form-data" -F "data=@path/to/examples/test.xlsx" http://0.0.0.0:8000/annotate/colon/it/graphs/turtle

    where path/to/examples is the path to the examples folder.

  9. If you want to embed your medical reports in the request, change the content type by setting -H "Content-Type: application/json". Then, instead of -F "data=@...", pass -d '{"reports":[{},...,{}]}' if you have multiple reports, or -d '{"k":"v",...}' if you have a single report.

  10. If you want to build the images again, type docker-compose down --rmi local from the project folder. Note that this command removes all the images created (both CPU and GPU). To remove only one of the two images, see the docker image documentation. Finally, repeat steps 5-8.
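
The request URLs described in steps 8 and 9 can also be composed from Python. The sketch below only builds the URL and a JSON request object and does not contact the server. The helper names are hypothetical, and rejecting an RDF format combined with labels or concepts is an assumption, since the steps above do not document that combination.

```python
import json
import urllib.request

BASE_URL = "http://0.0.0.0:8000"  # sket_cpu; sket_gpu is served on port 8001
OUTPUTS = ("labels", "concepts", "graphs")
RDF_FORMATS = ("turtle", "n3", "trig")


def annotate_url(use_case, language, output=None, rdf_format=None, base_url=BASE_URL):
    """Compose an /annotate URL for the store (8a) or return (8b) variants."""
    parts = [base_url.rstrip("/"), "annotate", use_case, language]
    if output is not None:
        if output not in OUTPUTS:
            raise ValueError(f"output must be one of {OUTPUTS}")
        if (output == "graphs") != (rdf_format is not None):
            raise ValueError("an RDF format is required for graphs and only for graphs")
        parts.append(output)
    if rdf_format is not None:
        if rdf_format not in RDF_FORMATS:
            raise ValueError(f"rdf_format must be one of {RDF_FORMATS}")
        parts.append(rdf_format)
    return "/".join(parts)


def json_request(url, reports):
    """Build a POST request that embeds multiple reports as JSON (step 9)."""
    body = json.dumps({"reports": reports}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


print(annotate_url("colon", "it"))
print(annotate_url("colon", "it", rdf_format="turtle"))
print(annotate_url("colon", "it", output="graphs", rdf_format="turtle"))
```

With the server running, a request built by json_request can be sent with urllib.request.urlopen.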

Regarding SKET GPU-enabled, the corresponding Dockerfile (located at sket_server/docker-sket_server-config/sket_gpu) references the nvidia/cuda:11.0-devel image. Users are encouraged to change the NVIDIA/CUDA image within the Dockerfile depending on the NVIDIA drivers installed on their host machine. NVIDIA images can be found here.

Cite

If you use or extend our work, please cite the following:

@article{jpi_sket-2022,
  title = "Empowering Digital Pathology Applications through Explainable Knowledge Extraction Tools",
  author = "S. Marchesin and F. Giachelle and N. Marini and M. Atzori and S. Boytcheva and G. Buttafuoco and F. Ciompi and G. M. Di Nunzio and F. Fraggetta and O. Irrera and H. Müller and T. Primov and S. Vatrano and G. Silvello",
  journal = "Journal of Pathology Informatics",
  year = "2022",
  url = "https://www.sciencedirect.com/science/article/pii/S2153353922007337",
  doi = "10.1016/j.jpi.2022.100139",
  pages = "100139"
}
@article{npj_dig_med-2022,
  title = "Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations",
  author = "N. Marini and S. Marchesin and S. Otálora and M. Wodzinski and A. Caputo and M. van Rijthoven and W. Aswolinskiy and J. M. Bokhorst and D. Podareanu and E. Petters and S. Boytcheva and G. Buttafuoco and S. Vatrano and F. Fraggetta and J. der Laak and M. Agosti and F. Ciompi and G. Silvello and H. Müller and M. Atzori",
  journal = "npj Digital Medicine",
  year = "2022",
  url = "http://dx.doi.org/10.1038/s41746-022-00635-4",
  doi = "10.1038/s41746-022-00635-4",
  volume = "5",
  number = "1",
  pages = "1--18"
}