This is the implementation as described in [1]. WARNING: it does not work for a small number of phrases; you need a document corpus with at least ~100 different phrases.
```bash
docker compose build
```
(the resulting image is approx. 5 GB)
```bash
docker compose up -d
```
If you start the container as described, the address for the `curl` commands below is `http://localhost:9007`.
All results (processed documents, embeddings, etc.) are stored in the Docker volume `results` (mounted to `/rest_api/tmp` in the container).
However, they are serialized as Python objects and need to be loaded with:
- processed documents: `src/data_functions/DataProcessingFactory.load(PATH/TO/DOCUMENT_OBJECT)`
- phrase embeddings: `src/embedding_functions/SentenceEmbeddingsFactory.load(PATH/TO/DOCUMENT_OBJECT, PATH/TO/EMBEDDING_OBJECT)`
- phrase clusters: `src/cluster_functions/PhraseClusterFactory.load(PATH/TO/CLUSTER_OBJECT)`
- concept graphs: these are a list of serialized networkx [2] graphs
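A minimal loading sketch in Python, assuming the factories above are importable from the project root; the exact import style and the pickle file names are assumptions (inspect the `results` volume for the actual names of your process):

```python
# Sketch only: file names below are assumptions -- check the mounted
# volume for the actual object names of your process.
from src.data_functions import DataProcessingFactory
from src.embedding_functions import SentenceEmbeddingsFactory
from src.cluster_functions import PhraseClusterFactory

# processed documents
data_obj = DataProcessingFactory.load("tmp/default_data.pickle")

# phrase embeddings (takes the document object path as well)
emb_obj = SentenceEmbeddingsFactory.load(
    "tmp/default_data.pickle",
    "tmp/default_embeddings.pickle",
)

# phrase clusters
cluster_obj = PhraseClusterFactory.load("tmp/default_clustering.pickle")
```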
Upload text data to be preprocessed (i.e. extraction of phrases):
```bash
curl -X POST -F data=@"PATH/TO/DATA.zip" -F config=@"PATH/TO/CONFIG.yaml" -F labels=@"PATH/TO/LABELS.yaml" http://SOME_IP:SOME_PORT/preprocessing
```
```
POST http://SOME_IP:SOME_PORT/preprocessing
Content-Type: multipart/form-data; boundary="boundary"

--boundary
Content-Disposition: form-data; name="data"; filename="DATA.zip"
Content-Type: application/zip

< PATH/TO/DATA.zip
--boundary
Content-Disposition: form-data; name="config"; filename="CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/CONFIG.yaml
--boundary
Content-Disposition: form-data; name="labels"; filename="LABELS.yaml"
Content-Type: application/x-yaml

< PATH/TO/LABELS.yaml
--boundary--
```
- `data` (mandatory): the text files provided as one zip file
- `config` (optional): configuration for the preprocessing step provided as a yaml file (if not provided, default values will be used)
- `labels` (optional): the label (if any) of each text file, provided as a yaml file with one entry per line (e.g. an entry looks like this: `file1: LABEL1`)
- `/preprocessing/statistics`: gets some basic statistics for the corpus
- `/preprocessing/noun_chunks`: gets all noun chunks that were found (along with the documents in which they appear)

Query parameter:
- `process`: name of the process (e.g. `corpus_name` in the config); if not provided, 'default' is used
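For example, assuming the Docker setup from above (values are illustrative):

```bash
curl -X GET "http://localhost:9007/preprocessing/statistics?process=default"
curl -X GET "http://localhost:9007/preprocessing/noun_chunks?process=default"
```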
Embed the extracted phrases into a vector space:
```bash
curl -X GET http://SOME_IP:SOME_PORT/embedding
```
or
```bash
curl -X POST -F config=@"PATH/TO/CONFIG.yaml" http://SOME_IP:SOME_PORT/embedding
```
```
GET http://SOME_IP:SOME_PORT/embedding
```
or
```
POST http://SOME_IP:SOME_PORT/embedding
Content-Type: multipart/form-data; boundary="boundary"

--boundary
Content-Disposition: form-data; name="config"; filename="CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/CONFIG.yaml
--boundary--
```
- `config` (optional): configuration for the embedding step provided as a yaml file (if not provided, default values will be used)

The first time this endpoint is called, the respective model (as given in the config, or the default one) will be downloaded.

- `/embedding/statistics`: gets some basic statistics for the embedding object

Query parameter:
- `process`: name of the process (e.g. `corpus_name` in the config); if not provided, 'default' is used
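For example, assuming the Docker setup from above:

```bash
curl -X GET "http://localhost:9007/embedding/statistics?process=default"
```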
Create the concept clusters from the embeddings, which serve as the base for the concept graphs in the next step:
```bash
curl -X GET http://SOME_IP:SOME_PORT/clustering
```
or
```bash
curl -X POST -F config=@"PATH/TO/CONFIG.yaml" http://SOME_IP:SOME_PORT/clustering
```
```
GET http://SOME_IP:SOME_PORT/clustering
```
or
```
POST http://SOME_IP:SOME_PORT/clustering
Content-Type: multipart/form-data; boundary="boundary"

--boundary
Content-Disposition: form-data; name="config"; filename="CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/CONFIG.yaml
--boundary--
```
- `config` (optional): configuration for the clustering step provided as a yaml file (if not provided, default values will be used)
- `/clustering/concepts`: shows the concepts that were found

Query parameters:
- `process`: name of the process (e.g. `corpus_name` in the config); if not provided, 'default' is used
- `top_k` (only for `../concepts`): how many top-k representatives to return for each cluster
- `distance` (only for `../concepts`): how far away (cosine distance) a representative may be
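For example, to fetch at most 10 representatives per concept within a cosine distance of 0.5 (illustrative values, assuming the Docker setup from above):

```bash
curl -X GET "http://localhost:9007/clustering/concepts?process=default&top_k=10&distance=0.5"
```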
Attention: the base endpoint has no function as such. To create the graphs you need to call the `../creation` path, or call `../statistics` to get a simple overview of the graphs created for a specific `process`.
Creates graph representations for each phrase cluster that was found during the 'clustering' step.
The response lists all graphs found with their respective edge/node counts. To get a specific graph's adjacency matrix, use its id as a path argument (the id is the number given in the aforementioned response).
```bash
curl -X GET http://SOME_IP:SOME_PORT/graph/creation
```
or
```bash
curl -X POST -F config=@"PATH/TO/CONFIG.yaml" http://SOME_IP:SOME_PORT/graph/creation
```
### POST with config
```
POST http://SOME_IP:SOME_PORT/graph/creation?exclusion_ids=[COMMA-SEPARATED LIST OF INTEGERS]
Content-Type: multipart/form-data; boundary="boundary"

--boundary
Content-Disposition: form-data; name="config"; filename="CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/CONFIG.yaml
--boundary--
```
### GET statistics
```
GET http://SOME_IP:SOME_PORT/graph/statistics
```
### GET specific graph
```
GET http://SOME_IP:SOME_PORT/graph/GRAPH_ID
```
- `config` (optional): configuration for the graph creation step provided as a yaml file (if not provided, default values will be used)

Endpoints:
- `/graph/creation`: creates the graphs according to the config
- `/graph/statistics`: shows some basic statistics
- `/graph/<GRAPH-ID>`: returns nodes and adjacency information for a specific graph

Query parameters:
- `process`: name of the process (e.g. `corpus_name` in the config); if not provided, 'default' is used
- `exclusion_ids` (only for creation): list of concept ids (as returned by e.g. `/clustering/concepts`) that shall be excluded from the final graphs, in the form `[ID1, ID2, etc.]`
- `draw` (only for querying a specific graph): `true` or `false` - whether the response is a rendered graph or json
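For example, assuming the Docker setup from above (the graph id is illustrative):

```bash
# create the graphs for the default process
curl -X GET "http://localhost:9007/graph/creation?process=default"
# fetch graph 0 as json
curl -X GET "http://localhost:9007/graph/0?process=default&draw=false"
```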
Starts a complete pipeline with all four sub-steps.
Besides the configs for each sub-step (see further below), a `document_server_config` may be provided to use documents from a server (e.g. an Elasticsearch instance) instead of uploading them.
If such a config is provided, it overrides any uploaded documents.
- `url`: base URL of the server (e.g. `http://localhost`)
- `port`: port (e.g. `9008`)
- `index`: index in ES (e.g. `documents`)
- `size`: size of the document batch that is requested from the server
- `replace_keys`: dictionary where 'key' is the server's response key for the actual text and 'value' is the concept-graphs-api internal name for the text content (as of now `content`), e.g. `{text: content}`
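Putting these fields together, an illustrative `DOCUMENT_SERVER_CONFIG.yaml` (all values are placeholders):

```yaml
# Illustrative values only -- adjust to your own document server
url: 'http://localhost'
port: '9008'
index: 'documents'
size: 100
replace_keys:
  text: content
```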
`/pipeline/configuration` query parameters:
- `process`: requests the configuration of a specific pipeline run
- `default`: defaults to `true`, so that the endpoint returns a default configuration; if set to `false`, it can be combined with `process` to request a specific configuration
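For example, assuming the Docker setup from above:

```bash
# default configuration
curl -X GET "http://localhost:9007/pipeline/configuration"
# stored configuration of a specific pipeline run
curl -X GET "http://localhost:9007/pipeline/configuration?process=default&default=false"
```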
ToDo: proper format for this entry
Query parameters:
- `process`: overrides the `corpus_name` given in the config
- `lang` (`de` or `en`): if declared here and no model is provided in the config, default language-specific models will be used for `preprocessing` and `embedding` (default: `en`)
- `skip_present` (`true` or `false`): whether to skip already saved steps for this particular process name (default: `true`)
- `skip_steps` (comma-separated list of one or more of `[data, embedding, clustering, graph]`): pipeline steps that should be omitted if the prerequisites are fulfilled; overrules `skip_present`
- `return_statistics` (`true` or `false`): whether to wait for the pipeline to finish and return statistics for the created graphs, or just silently start the pipeline (default: `false`)
`data`, `labels` and the configs need to be provided as in the respective base endpoints, except that the configs need to be named accordingly (if not provided, default values will be used):
- preprocessing: `config` -> `data_config`
- embedding: `config` -> `embedding_config`
- clustering: `config` -> `clustering_config`
- graph: `config` -> `graph_config`

(see example below)
ToDo: describe the json config possibilities when the header `Content-Type: application/json` is used
```
POST http://SOME_IP:SOME_PORT/pipeline?process=default&lang=en&skip_present=true
Content-Type: multipart/form-data; boundary="boundary"

--boundary
Content-Disposition: form-data; name="data"; filename="DATA.zip"
Content-Type: application/zip

< PATH/TO/DATA.zip
--boundary
Content-Disposition: form-data; name="data_config"; filename="DATA_CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/DATA/CONFIG.yaml
--boundary
Content-Disposition: form-data; name="embedding_config"; filename="EMBEDDING_CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/EMBEDDING/CONFIG.yaml
--boundary
Content-Disposition: form-data; name="clustering_config"; filename="CLUSTERING_CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/CLUSTERING/CONFIG.yaml
--boundary
Content-Disposition: form-data; name="graph_config"; filename="GRAPH_CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/GRAPH/CONFIG.yaml
--boundary--
```
or, without a data upload, using a document server instead:
```
POST http://SOME_IP:SOME_PORT/pipeline?process=default&lang=en&skip_present=true
Content-Type: multipart/form-data; boundary="boundary"

--boundary
Content-Disposition: form-data; name="document_server_config"; filename="DOCUMENT_SERVER_CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/DOCUMENT_SERVER/CONFIG.yaml
[...]
--boundary
Content-Disposition: form-data; name="graph_config"; filename="GRAPH_CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/GRAPH/CONFIG.yaml
--boundary--
```
Gets the names of all stored processes.
```
GET http://SOME_IP:SOME_PORT/processes
```
Gets the status of a specific process.

Query parameter:
- `process`: name of the process (e.g. `corpus_name` in the config); if not provided, 'default' is used

```
GET http://SOME_IP:SOME_PORT/status?process=PROCESS_NAME
```
Checks whether a document server (specified with a `document_server_config` file) is reachable.
```
POST http://SOME_IP:SOME_PORT/document-server
Content-Type: multipart/form-data; boundary="boundary"

--boundary
Content-Disposition: form-data; name="document_server_config"; filename="DOCUMENT_SERVER_CONFIG.yaml"
Content-Type: application/x-yaml

< PATH/TO/DOCUMENT_SERVER/CONFIG.yaml
--boundary--
```
### preprocessing config (`data_config`)
```yaml
# Name of the corpus; can be chosen freely (but acts as reference point between the different endpoint actions)
corpus_name: default
# Name of the spaCy model; pre-installed: de_dep_news_trf, de_core_news_sm (German) & en_core_web_trf (English) [default]
spacy_model: en_core_web_trf
# Number of processes that will be spawned (how many cores will be utilized)
n_process: 1
# Only files in the data zip that have this file extension will be processed
file_extension: txt
# Encoding of the text files
file_encoding: utf-8
# Whether to lemmatize the input data
use_lemma: False
# Whether to prepend the head of a phrase to its beginning
prepend_head: False
# Whether to use only the head of a phrase
head_only: False
# Whether to discriminate between upper- and lowercase tokens even if it's the same word
case_sensitive: False
# Whether a tf-idf pre-filter for phrases should be used; 1, 1.0, None takes phrases as is; see e.g. scikit-learn's tf-idf vectorizer for a description
# (int or float)
filter_min_df: 1
# (int or float)
filter_max_df: 1.0
# (None or name of POSTed stopwords file)
filter_stop: None
# Names of spaCy pipeline components to disable (None or array)
disable: None
# Configuration of negspacy (heuristic negation detection)
# - chunk_prefix: str (path to file) or array
# - neg_termset_file: str (path to file)
# - scope: int or bool --> if >= 1, restricts the search for negated candidates to this distance in the dependency parse
# - feat_of_interest: str (NE | NC | BOTH) --> on which feature of interest negation is checked: named entities (NE), noun chunks (NC), or both
# - language: str
negspacy:
  - enabled: true
  - configuration:
      chunk_prefix:
        - kein
        - keine
        - keinen
      neg_termset_file: ./conf/negex_files/negex_trigger_german_biotxtm_2016_extended.txt
      scope: 1
      language: de
      feat_of_interest: NC
```
### embedding config (`embedding_config`)
```yaml
# Name of the corpus; can be chosen freely (but acts as reference point between the different endpoint actions)
corpus_name: default
# Name of the embedding model (for German: Sahajtomar/German-semantic; English: sentence-transformers/paraphrase-albert-small-v2 [default])
model: sentence-transformers/paraphrase-albert-small-v2
# Number of processes that will be spawned (how many cores will be utilized)
n_process: 1
```
### clustering config (`clustering_config`)
```yaml
# Name of the corpus; can be chosen freely (but acts as reference point between the different endpoint actions)
corpus_name: default
# The clustering algorithm to be used; one of {kmeans (default), kmeans-mb, affinity-prop}
algorithm: kmeans
# The downscaling algorithm to be used (to effectively cluster the high-dimensional embeddings); only "umap" is supported
downscale: umap
# Whether the recommended settings for clustering, downscaling & cluster number deduction shall be used if nothing else is stated
missing_as_recommended: True
# UMAP arguments, prefixed with 'scaling_' (see umap-learn.readthedocs.io)
scaling_*
# Clustering arguments, prefixed with 'clustering_' (dependent on the algorithm used; see scikit-learn)
clustering_*
# KElbow method arguments for deducing the number of clusters, prefixed with 'kelbow_' (see scikit-yb.org)
kelbow_*
```
### graph config (`graph_config`)
```yaml
# Name of the corpus; can be chosen freely (but acts as reference point between the different endpoint actions)
corpus_name: default
# Phrases that are farther away (in cosine distance) from the cluster center than this won't be used for graph creation
cluster_distance: 0.6
# A cluster needs to have at least this many phrases, or else the whole cluster will be skipped
cluster_min_size: 1
# How important the cosine distance should be when connecting phrases
graph_cosine_weight: .5
# Weight threshold at which two phrases will be merged into one
graph_merge_threshold: .95
# Edges whose weight is smaller than this value are cut
graph_weight_cut_off: .5
# Whether the graph will be transformed into a tree-like structure (each node has at most one incoming and one outgoing edge)
graph_unroll: True
# The proportion of edges that shall be trimmed
graph_simplify: .5
# Whether the importance of an edge (when trimming) is measured by 'weight' or by 'significance' [3]
graph_simplify_alg: significance
# If true, sub-clusters will be formed within a cluster that might be used downstream
graph_sub_clustering: False
# Whether phrases adhere strictly to their assigned cluster even if they might be nearer to another cluster center
restrict_to_cluster: True
# (Deprecated?)
filter_min_df: 1
filter_max_df: 1.
filter_stop: None
```
[1] Matthies, F et al. Concept Graphs: A Novel Approach for Textual Analysis of Medical Documents. In: Röhrig, R et al., editors. Studies in Health Technology and Informatics. IOS Press; 2023. Available from: https://ebooks.iospress.nl/doi/10.3233/SHTI230710
[2] https://networkx.org/
[3] Dianati N. Unwinding the hairball graph: Pruning algorithms for weighted complex networks. Phys Rev E. 2016;93(1). Available from: https://link.aps.org/doi/10.1103/PhysRevE.93.012304