dutchembeddings

Repository for the word embeddings described in Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource, presented at LREC 2016.

All embeddings are released under the CC-BY-SA-4.0 license.

The software is released under the GNU GPL 2.0.

These embeddings have been created with the support of Textgain®.

Embeddings

To download the embeddings, please click any of the links in the following table. In almost all cases, the 320-dimensional embeddings outperform the 160-dimensional embeddings.

| Corpus | 160 | 320 |
| --- | --- | --- |
| Roularta | link (mirror) | link (mirror) |
| Wikipedia | link (mirror) | link (mirror) |
| Sonar500 | link (mirror) | link (mirror) |
| Combined | link (mirror) | link (mirror) |
| COW | - | small (mirror), big (mirror) |

See below for a usage explanation.

Citing

If you use any of the resources from this paper, please cite our paper, as follows:

@InProceedings{tulkens2016evaluating,
  author = {Stephan Tulkens and Chris Emmery and Walter Daelemans},
  title = {Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
 }

Please also consider citing the corpora of the embeddings you use. Without the people who made the corpora, the embeddings could never have been created.

Usage

The embeddings are currently provided as .txt files containing vectors in the word2vec text format, which is structured as follows:

The first line contains the vocabulary size and the size of the vectors, in that order, separated by a space.

Ex: 50000 320

Each line thereafter contains the vector data for a single word, presented as a space-delimited string. The first item on each line is the word itself; the n items that follow are numbers, which together form the vector of length n. Because the items are stored as strings, they should be converted to floating point numbers when the file is read.

Ex: hond 0.2 -0.542 0.253 etc.
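
If you prefer not to depend on a library, the format described above can be read with a few lines of plain Python. The following is a minimal sketch, not the loader used in the paper: the function name `load_word2vec_text` is hypothetical, and the sample values are made up for illustration. It takes the vector length from the data lines and only checks that this length appears in the header, so it does not rely on the order of the two header numbers.

```python
import io


def load_word2vec_text(handle):
    """Parse word2vec text format from an open text handle.

    Returns a dict that maps each word to a list of floats.
    """
    # The header holds two integers: the vocabulary size and the vector size.
    header_numbers = {int(item) for item in handle.readline().split()}
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            # Skip blank lines.
            continue
        word, values = parts[0], parts[1:]
        if len(values) not in header_numbers:
            raise ValueError("unexpected vector length for word: " + word)
        # The items are strings in the file, so convert them to floats.
        vectors[word] = [float(value) for value in values]
    return vectors


# Tiny in-memory example with made-up numbers (not real embedding values).
sample = io.StringIO("2 3\nhond 0.2 -0.542 0.253\nkat 0.1 0.3 -0.25\n")
vectors = load_word2vec_text(sample)
print(vectors["hond"])
```

For a real file, pass `open('path/to/embedding-file', encoding='utf-8')` instead of the in-memory sample; the encoding is an assumption and may need to be adjusted.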

If you use Python, these files can be loaded with gensim or reach, as follows.

# Gensim
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/embedding-file')
katvec = model['kat']
model.most_similar('kat')

# Reach
from reach import Reach

r = Reach.load('path/to/embedding-file')
katvec = r['kat']
r.most_similar('kat')
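
In gensim, `most_similar` ranks words by the cosine similarity between their vectors. The sketch below illustrates that idea in plain Python on toy two-dimensional vectors. The helper names `cosine` and `nearest` and the toy values are hypothetical, and the code is not how either library implements the lookup internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, topn=2):
    """Return the topn (word, similarity) pairs closest to the given word."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, made up for illustration only.
toy = {
    "kat": [1.0, 0.0],
    "poes": [0.9, 0.1],
    "auto": [0.0, 1.0],
}
print(nearest("kat", toy))
```

On real embeddings the libraries do the same ranking over the whole vocabulary with vectorised operations, which is much faster than this loop.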

Relationship dataset

If you want to test the quality of your embeddings, you can use the relation.py script. This script takes a .txt file of predicates and creates a dataset that is used for evaluation.

This currently only works with gensim word2vec models, loaded as shown above, or with the SPPMI model.

Example:

from gensim.models import KeyedVectors
from relation import Relation

# Load the predicates.
rel = Relation('data/question-words.txt')

# Load a word2vec model.
model = KeyedVectors.load_word2vec_format('path/to/embedding-file')

# Test the model.
rel.test_model(model)