Widgets

Input and output

Load Corpus from CSV

The Load Corpus from CSV widget loads a file and extracts the document texts and labels into two lists.

You can specify the text and label columns (column numbering starts at 1). You can also choose to skip the first (header) row and set the delimiter. For a tab-separated file, use \t as the delimiter.
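
The widget's exact implementation is not shown here, but a minimal sketch of the same idea in plain Python (the function name load_corpus and the file name reviews.tsv are hypothetical) could look like this:

import csv

def load_corpus(path, text_col=1, label_col=2, skip_header=True, delimiter=","):
    # Columns are 1-based, matching the widget's convention.
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        if skip_header:
            next(reader, None)
        for row in reader:
            texts.append(row[text_col - 1])
            labels.append(row[label_col - 1])
    return texts, labels

texts, labels = load_corpus("reviews.tsv", delimiter="\t")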

Create Dataset

The Create Dataset widget combines embeddings (the output of an embedding model) and labels into a dataset. The output is an Orange Data Table.
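
A rough sketch of the underlying idea, combining an embeddings matrix and a label list into an Orange Data Table with the Orange library (an illustration with made-up variable names, not the widget's actual code):

import numpy as np
from Orange.data import ContinuousVariable, DiscreteVariable, Domain, Table

embeddings = np.random.rand(4, 300)          # 4 documents, 300-dimensional embeddings
labels = ["pos", "neg", "pos", "neg"]

features = [ContinuousVariable(f"emb_{i}") for i in range(embeddings.shape[1])]
class_values = sorted(set(labels))
class_var = DiscreteVariable("label", values=class_values)
domain = Domain(features, class_var)

# Orange stores discrete values as indices into class_values.
y = np.array([class_values.index(l) for l in labels], dtype=float).reshape(-1, 1)
table = Table.from_numpy(domain, embeddings, Y=y)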

Import Dataset

The Import Dataset widget takes two files as input: x.npy, a NumPy array with the embeddings, and y.npy, a NumPy array with the labels. The widget produces a dataset on its output, in the same format as the output of the Create Dataset widget.

Export Dataset

The Export Dataset widget takes a dataset on its input and opens a popup dialog with a link to a file archive. The archive contains two files, x.npy and y.npy, which hold the features and labels, respectively. You can transform the features (e.g., with the code snippet below) and re-upload the dataset with the Import Dataset widget.

import numpy as np

x = np.load('x.npy')   # features exported by the Export Dataset widget
x += 2                 # example feature transformation
np.save('x2.npy', x)   # save the transformed features for re-import

Word based text embeddings

Word2Vec

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space[1].
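
As an illustration of how such pre-trained word vectors can be loaded and queried outside the widget, here is a minimal sketch using the gensim library (the file name refers to the publicly distributed Google News vectors and is an assumption, not necessarily the file the widget downloads):

from gensim.models import KeyedVectors

# Load pre-trained 300-dimensional Google News vectors (binary word2vec format).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

print(wv["language"].shape)                  # (300,)
print(wv.most_similar("language", topn=3))   # nearest neighbours in the vector space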

Models

  • English model - Pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.
  • German model - Word2Vec Continuous Skipgram model trained on German CoNLL17 corpus. The model contains 100-dimensional vectors for 4946997 words and phrases.
  • Spanish model - Word2Vec model trained on Spanish Billion Word Corpus. The model contains 300-dimensional vectors for 1000653 words and phrases.
  • Russian model - Word2Vec Continuous Skipgram model trained on Russian CoNLL17 corpus. The model contains 100-dimensional vectors for 3338424 words and phrases.
  • Latvian model - Word2Vec Continuous Skipgram model trained on Latvian CoNLL17 corpus. The model contains 100-dimensional vectors for 560445 words and phrases.
  • Estonian model - Word2Vec Continuous Skipgram model trained on Estonian CoNLL17 corpus. The model contains 100-dimensional vectors for 926795 words and phrases.
  • Croatian model - Word2Vec Continuous Skipgram model trained on Croatian CoNLL17 corpus. The model contains 100-dimensional vectors for 928316 words and phrases.
  • Slovenian model - Word2Vec Continuous Skipgram model trained on Slovenian CoNLL17 corpus. The model contains 100-dimensional vectors for 706835 words and phrases.

fastText

fastText is a library for learning word embeddings and text classification, created by Facebook's AI Research (FAIR) lab[2]. It provides both unsupervised and supervised algorithms for obtaining vector representations of words.
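
A minimal sketch of querying a pre-trained fastText model with gensim (the .bin file name is a placeholder for whichever downloaded model you use; the widget handles this internally):

from gensim.models.fasttext import load_facebook_vectors

# Load a fastText model in Facebook's binary format; subword information is kept,
# so vectors can also be produced for out-of-vocabulary words.
wv = load_facebook_vectors("cc.sl.300.bin")
print(wv["ljubljana"].shape)     # (300,)
print(wv["ljubljanaaa"].shape)   # OOV word, still gets a vector built from subwords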

Models

  • English model - fastText Wikipedia supervised word embeddings, aligned in a single vector space. The model contains 300-dimensional vectors.
  • German model - fastText Wikipedia supervised word embeddings, aligned in a single vector space. The model contains 300-dimensional vectors.
  • Spanish model - fastText Wikipedia supervised word embeddings, aligned in a single vector space. The model contains 300-dimensional vectors.
  • Russian model - fastText Wikipedia supervised word embeddings, aligned in a single vector space. The model contains 300-dimensional vectors.
  • Lithuanian model - trained on Common Crawl and Wikipedia using fastText. The model contains 300-dimensional vectors.
  • Latvian model - trained on Common Crawl and Wikipedia using fastText. The model contains 300-dimensional vectors.
  • Estonian model - fastText Wikipedia supervised word embeddings, aligned in a single vector space. The model contains 300-dimensional vectors.
  • Croatian model - fastText Wikipedia supervised word embeddings, aligned in a single vector space. The model contains 300-dimensional vectors.
  • Slovenian model - fastText Wikipedia supervised word embeddings, aligned in a single vector space. The model contains 300-dimensional vectors.

fastText Croatian

  • Croatian CLARIN.SI-embed.hr contains word embeddings induced from a large collection of Croatian texts, composed of the Croatian web corpus hrWaC and a 400-million-token collection of newspaper texts.

fastText Embeddia

  • Embeddia fastText embeddings trained on the Slovenian Gigafida 2.0 corpus. A skipgram model was trained with default hyperparameters on 8 threads, except for two changes: the dim parameter was set to 300 and the minCount parameter to 20. That is, 300-dimensional word vectors were calculated for every word that appears at least 20 times in the corpus. Each line in the .vec file consists of the word followed by its 300-dimensional vector; all fields are space-separated. The first line (642655 300) indicates that there are 642655 word vectors of 300 dimensions.
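
Given that description of the .vec format, a minimal parser might look like this (a sketch; the function name and file name are made up):

import numpy as np

def read_vec(path):
    # First line: "<number_of_words> <dimension>"; remaining lines: "<word> <v1> ... <v300>".
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings, dim

embeddings, dim = read_vec("embeddia.sl.vec")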

fastText Slovenian

  • Slovenian CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g., GigaFida, Janes, KAS, slWaC, etc.

GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations for words[3]. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
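
The linear substructures mentioned above are easy to demonstrate with vector arithmetic. A small sketch using gensim's downloader (the model name glove-wiki-gigaword-300 comes from the gensim-data repository and is an assumption, not necessarily the exact files used by the widget):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")
# king - man + woman is expected to land near "queen".
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))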

Models

  • English model - Wikipedia 2014 + Gigaword 5. The model contains 300-dimensional vectors.
  • German model - German Wikipedia. The model contains 300-dimensional vectors.
  • Spanish model - GloVe model trained on Spanish Billion Word Corpus. The model contains 300-dimensional vectors for 855380 words and phrases.

ELMo

ELMo is a deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts[4]. These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.
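
Because ELMo embeddings are contextual, they are computed per sentence rather than looked up per word. A sketch with the allennlp 0.x package (options.json and weights.hdf5 are placeholders for a downloaded model; the widget's own inference code may differ):

from allennlp.commands.elmo import ElmoEmbedder  # allennlp <= 0.9

elmo = ElmoEmbedder(options_file="options.json", weight_file="weights.hdf5")
layers = elmo.embed_sentence(["Deep", "contextualized", "representations"])
print(layers.shape)  # (3 biLM layers, 3 tokens, 1024 dimensions)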

Models

  • English model - ELMo model trained on English CoNLL17 corpus. The model contains 1024-dimensional vectors.
  • German model - ELMo model trained on German CoNLL17 corpus. The model contains 1024-dimensional vectors.
  • Spanish model - ELMo model trained on Spanish CoNLL17 corpus. The model contains 1024-dimensional vectors.
  • Russian model - ELMo model trained on Russian CoNLL17 corpus. The model contains 1024-dimensional vectors.
  • Latvian model - ELMo model trained on Latvian CoNLL17 corpus. The model contains 1024-dimensional vectors.
  • Estonian model - ELMo model trained on Estonian CoNLL17 corpus. The model contains 1024-dimensional vectors.
  • Croatian model - ELMo model trained on Croatian CoNLL17 corpus. The model contains 1024-dimensional vectors.
  • Slovenian model - ELMo model trained on Slovenian CoNLL17 corpus. The model contains 1024-dimensional vectors.

ELMo Embeddia

An ELMo language model used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish. Each language's model was trained for approximately 10 epochs. Corpora sizes used in training range from over 270 M tokens for Latvian to almost 2 B tokens for Croatian. About 1 million of the most common tokens were provided as the vocabulary during training for each language model. The model can also infer out-of-vocabulary (OOV) words, since the neural network input is at the character level.

BERT

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks[5]. BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
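
A minimal sketch of obtaining BERT embeddings with the Hugging Face transformers library (mean pooling over token vectors is one common choice, shown for illustration; it is not necessarily what the widget does):

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

batch = tokenizer(["Text embeddings with BERT."], return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    output = model(**batch)
sentence_vec = output.last_hidden_state.mean(dim=1)  # shape (1, 768)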

Models

  • bert-base-multilingual-uncased - 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias.
  • bert-base-uncased - 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text.
  • distilbert-base-multilingual-cased - Distil* is a class of compressed models that started with DistilBERT, a distilled version of BERT. DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture.

Bert Embeddia

  • CroSloEngualBert - A trilingual BERT (Bidirectional Encoder Representations from Transformers) model trained on Croatian, Slovenian, and English data. It is a state-of-the-art tool that represents words/tokens as contextually dependent word embeddings and can be used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT is distributed as neural network weights and configuration files in PyTorch format (i.e., to be used with the PyTorch library).

LSI

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts[9].

Parameters

  • Number of topics - Number of requested factors (latent dimensions); 200 by default.
  • Decay - Weight of existing observations relative to new ones. Both parameters are illustrated in the sketch below.
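
The two parameters correspond to the num_topics and decay arguments of gensim's LsiModel, shown here as an illustration (the toy corpus is made up; this is not necessarily the widget's exact backend):

from gensim import corpora, models

documents = ["human machine interface", "graph of trees", "graph minors survey"]
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=200, decay=1.0)
print(lsi[bow_corpus[0]])  # document represented in the latent (topic) space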

Sentence based text embeddings

Universal Sentence Encoder

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks[6]. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks.
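
A minimal sketch with TensorFlow Hub (the module URL points to the public English Universal Sentence Encoder and is an assumption about which module the widget wraps):

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["Orange widgets are composable.", "Sentences become 512-dimensional vectors."])
print(vectors.shape)  # (2, 512)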

Models

  • English model - The input is variable-length English text and the output is a 512-dimensional vector.
  • German model - The input is variable-length English or German text and the output is a 512-dimensional vector.
  • Spanish model - The input is variable-length English or Spanish text and the output is a 512-dimensional vector.

Document based text embeddings

Doc2Vec

Doc2vec is an unsupervised algorithm that generates vectors for sentences/paragraphs/documents[7]. The algorithm is an adaptation of Word2Vec, which generates vectors for words.
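
A minimal sketch of training and using a Doc2Vec model with gensim (the toy corpus is made up; dm=0 selects the Distributed Bag of Words variant mentioned below):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the first document", "another short document", "yet another text"]
tagged = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

# dm=0 selects the Distributed Bag of Words (DBOW) training variant.
model = Doc2Vec(tagged, vector_size=300, dm=0, min_count=1, epochs=20)
vector = model.infer_vector("a new unseen document".split())
print(vector.shape)  # (300,)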

Models

  • English model - Doc2Vec trained on English Wikipedia with the Distributed Bag of Words approach. The model outputs a 300-dimensional vector for each document.

Tokenizers

Tokenizers split document text into words or sentences.

Regex Word Tokenizer

The regex word tokenizer is a simple tokenizer that splits a document into words with the '\w+' regex. It can also transform the document text to lowercase.
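
The behaviour is essentially the following (a sketch, not the widget's code):

import re

def regex_tokenize(text, lowercase=True):
    # Split on the \w+ pattern; optionally lowercase the document first.
    if lowercase:
        text = text.lower()
    return re.findall(r"\w+", text)

print(regex_tokenize("Widgets tokenize text!"))  # ['widgets', 'tokenize', 'text']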

Tok Tok Word Tokenizer

The tok-tok tokenizer is a simple, general tokenizer that expects one sentence per line of input; thus only the final period is tokenized.

Tok-tok has been tested on, and gives reasonably good results for English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others. The input should be in UTF-8 encoding.
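
The same tokenizer is available in NLTK, which can be used to preview its behaviour (a sketch; the widget may wrap a different implementation):

from nltk.tokenize import ToktokTokenizer

toktok = ToktokTokenizer()
print(toktok.tokenize("One sentence per line, please."))
# ['One', 'sentence', 'per', 'line', ',', 'please', '.']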

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
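
NLTK ships a pre-trained Punkt model for English, which illustrates the idea (a sketch; the widget may use its own trained models):

import nltk
nltk.download("punkt")  # pre-trained Punkt sentence tokenizer for English
from nltk.tokenize import sent_tokenize

print(sent_tokenize("Dr. Smith arrived late. The meeting had already started."))
# ['Dr. Smith arrived late.', 'The meeting had already started.']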

Other widgets

Token Filtering

The Token Filtering widget filters out uninformative tokens. The default filter removes: !, ", #, $, %, &, ', (, ), *, +, ,, -, ., /, 0-9, :, ;, <, =, >, ?, @, [, \, ], ^, _, {, |, }. You can specify custom tokens to filter, separated by new lines. Custom tokens override the default filter.
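
A sketch of the filtering idea in plain Python (the default set here is an approximation built from string.punctuation and digits, not necessarily the widget's exact list):

import string

DEFAULT_FILTER = set(string.punctuation) | set(string.digits)

def filter_tokens(tokens, custom=None):
    # Custom tokens, when given, override the default filter.
    stop = set(custom) if custom else DEFAULT_FILTER
    return [token for token in tokens if token not in stop]

print(filter_tokens(["good", "!", "5", "movie"]))  # ['good', 'movie']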

Language

The Language widget selects the language of the embeddings model. It overrides the language parameter of the text embeddings widget.

References

[1] @ARTICLE{2013arXiv1301.3781M, author = {{Mikolov}, Tomas and {Chen}, Kai and {Corrado}, Greg and {Dean}, Jeffrey}, title = "{Efficient Estimation of Word Representations in Vector Space}", journal = {arXiv e-prints}, keywords = {Computer Science - Computation and Language}, year = "2013", month = "Jan", eid = {arXiv:1301.3781}, pages = {arXiv:1301.3781}, archivePrefix = {arXiv}, eprint = {1301.3781}, primaryClass = {cs.CL} }

[2] @article{joulin2016bag, title={Bag of Tricks for Efficient Text Classification}, author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas}, journal={arXiv preprint arXiv:1607.01759}, year={2016} }

[3] @inproceedings{pennington2014glove, author = {Jeffrey Pennington and Richard Socher and Christopher D. Manning}, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, title = {GloVe: Global Vectors for Word Representation}, year = {2014}, pages = {1532--1543}, url = {http://www.aclweb.org/anthology/D14-1162}, }

[4] @inproceedings{Peters:2018, author={Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke}, title={Deep contextualized word representations}, booktitle={Proc. of NAACL}, year={2018} }

[5] @ARTICLE{2018arXiv181004805D, author = {{Devlin}, Jacob and {Chang}, Ming-Wei and {Lee}, Kenton and {Toutanova}, Kristina}, title = "{BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding}", journal = {arXiv e-prints}, keywords = {Computer Science - Computation and Language}, year = "2018", month = "Oct", eid = {arXiv:1810.04805}, pages = {arXiv:1810.04805}, archivePrefix = {arXiv}, eprint = {1810.04805}, primaryClass = {cs.CL} }

[6] @ARTICLE{2018arXiv180311175C, author = {{Cer}, Daniel and {Yang}, Yinfei and {Kong}, Sheng-yi and {Hua}, Nan and {Limtiaco}, Nicole and {St. John}, Rhomni and {Constant}, Noah and {Guajardo-Cespedes}, Mario and {Yuan}, Steve and {Tar}, Chris and {Sung}, Yun-Hsuan and {Strope}, Brian and {Kurzweil}, Ray}, title = "{Universal Sentence Encoder}", journal = {arXiv e-prints}, keywords = {Computer Science - Computation and Language}, year = "2018", month = "Mar", eid = {arXiv:1803.11175}, pages = {arXiv:1803.11175}, archivePrefix = {arXiv}, eprint = {1803.11175}, primaryClass = {cs.CL} }

[7] @ARTICLE{2014arXiv1405.4053L, author = {{Le}, Quoc V. and {Mikolov}, Tomas}, title = "{Distributed Representations of Sentences and Documents}", journal = {arXiv e-prints}, keywords = {Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning}, year = "2014", month = "May", eid = {arXiv:1405.4053}, pages = {arXiv:1405.4053}, archivePrefix = {arXiv}, eprint = {1405.4053}, primaryClass = {cs.CL}, }

[8] @inproceedings{inproceedings, author = {Zhao, Jiang and Lan, Man and Feng Tian, Jun}, year = {2015}, month = {01}, pages = {117-122}, title = {ECNU: Using Traditional Similarity Measurements and Word Embedding for Semantic Textual Similarity Estimation}, doi = {10.18653/v1/S15-2021} }

[9] @ARTICLE{2011arXiv1102.5597R, author = {{{\v{R}}eh{{\r{u}}}{\v{r}}ek}, Radim}, title = "{Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms}", journal = {arXiv e-prints}, keywords = {Computer Science - Numerical Analysis, Computer Science - Machine Learning}, year = "2011", month = "Feb", eid = {arXiv:1102.5597}, pages = {arXiv:1102.5597}, archivePrefix = {arXiv}, eprint = {1102.5597}, primaryClass = {cs.NA} }