Correlated Occurrence Analogue to Lexical Semantics
This implements the Correlated Occurrence Analogue to Lexical Semantics (COALS; Rohde, Gonnerman, & Plaut, 2005). COALS represents word meaning by high-dimensional vectors derived from word co-occurrence patterns in large corpora. The construction of these vectors starts by compiling a co-occurrence table using a ramped (or, alternatively, flat) n-word window. Rohde et al. use a 4-word, ramped window:
1 2 3 4 [0] 4 3 2 1
Next, all but the m columns reflecting the most frequent words (14,000 in the case of Rohde et al.) are discarded. Co-occurrence counts are then converted to word-pair correlations; negative correlations are set to zero, and positive correlations are replaced by their square root in order to reduce the differences between them. The rows of the resulting table are the COALS vectors of their respective words. Optionally, Singular Value Decomposition (SVD) can be used to reduce the dimensionality of these vectors, and the reduced vectors can, in turn, be converted into binary vectors by setting negative components to zero and positive components to one.
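To make these steps concrete, here is a minimal Python sketch (not part of this package) of the construction pipeline: ramped-window counting, the correlation-and-square-root normalization, and the optional SVD reduction and binarization. It assumes a small, dense vocabulary and numpy; the names are illustrative only, and the actual tool restricts the columns to the m most frequent words and reads its counts from n-gram files.

import numpy as np

def ramped_cooccurrences(tokens, vocab, window=4):
    # Weighted co-occurrence counts using a ramped window: a neighbour at
    # distance d contributes weight (window - d + 1), i.e., 4 3 2 1 for a
    # 4-word window.
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        if word not in index:
            continue
        for d in range(1, window + 1):
            for j in (i - d, i + d):
                if 0 <= j < len(tokens) and tokens[j] in index:
                    counts[index[word], index[tokens[j]]] += window - d + 1
    return counts

def coals_normalize(counts):
    # Convert raw counts to word-pair correlations; negative correlations are
    # set to zero, and positive ones are replaced by their square root.
    counts = counts.astype(float)
    total = counts.sum()
    rows = counts.sum(axis=1, keepdims=True)
    cols = counts.sum(axis=0, keepdims=True)
    numer = total * counts - rows * cols
    denom = np.sqrt(rows * (total - rows) * cols * (total - cols))
    corr = np.divide(numer, denom, out=np.zeros_like(counts), where=denom > 0)
    return np.sqrt(np.clip(corr, 0.0, None))

def reduce_and_binarize(vectors, dims):
    # Optional SVD reduction; binary vectors set negative components to zero
    # and positive components to one.
    u, s, _ = np.linalg.svd(vectors, full_matrices=False)
    reduced = u[:, :dims] * s[:dims]
    return (reduced > 0.0).astype(int)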
The implementation also supports a recent extension to the COALS model as proposed by Chang et al. (2012).
- Chang, Y., Furber, S., & Welbourne, S. (2012). Generating Realistic Semantic Codes for Use in Neural Network Models. In Miyake, N., Peebles, D., & Cooper, R. P. (Eds.), Proceedings of the 34th Annual Meeting of the Cognitive Science Society (CogSci 2012).
- Rohde, D. L. T., Gonnerman, L. M., & Plaut, D. C. (2005). An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence.
I released the code "as is"; that is, in the state in which I used it for the modeling work in my PhD thesis. As such, it does what it has to do, but there is still a lot of room for improvement.
$ ./coals
COALS version 0.99-beta
Copyright (c) 2012-2014 Harm Brouwer <[email protected]>
Center for Language and Cognition, University of Groningen
Netherlands Organisation for Scientific Research (NWO)
usage: ./coals [options]
constructing COALS vectors:
--wsize <num> set co-occurrence window size to <num>
--wtype <type> use <type> window for co-occurrences (dflt: ramped)
--rows <num> set number of rows of the co-occurrence matrix to <num>
--cols <num> set number of cols of the co-occurrence matrix to <num>
--dims <num> reduce COALS vectors to <num> dimensions (using SVD)
--vtype <type> construct <type> (real/binary[_pn]) vectors (dflt: real)
--unigrams <file> read unigram counts (word frequencies) from <file>
--ngrams <file> read n-gram counts (co-occurrence freqs.) from <file>
--output <file> write COALS vectors to <file>
--enforce <file> enforce inclusion of words in <file>
--pos_fts <num> number of positive features (for binary_pn vectors)
--neg_fts <num> number of negative features (for binary_pn vectors)
extracting similar words:
--vectors <file> compute similarities on basis of vectors in <file>
--output <file> write top-k similar word sets to <file>
--topk <num> extract top-<num> similar words for each word
basic information for users:
--help shows this help message
--version shows version
Here is an example of how to create 100-bit binary COALS vectors for the 15,000 most frequent words in a corpus, using a 4-word ramped window and the 14,000 most frequent features:
coals --wsize 4 --wtype ramped --rows 15000 --cols 14000 \
      --unigrams data/1-grams --ngrams data/9-grams \
      --vtype binary --dims 100 --output coals-svdb-100.model
The assumed input format for the unigram counts is:
1|unigram_1|f
1|unigram_2|f
.|.........|.
1|unigram_n|f
where f is an integer representing the frequency of the unigram. The assumed input format for the n-gram counts, in turn, is:
n|ngram_1|f
n|ngram_2|f
.|.......|...
n|ngram_n|f
where n denotes the size of the n-gram, and f its frequency.
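As an illustration only, the following sketch shows one hypothetical way to produce such count files from a whitespace-tokenized corpus. The corpus path, the space separator within an n-gram, and the choice of n = 2 * window size + 1 (the 9-grams in the example above, for a 4-word window) are assumptions, not requirements imposed by this package.

from collections import Counter

def write_counts(corpus_path, n, out_path):
    # Count all contiguous n-grams in the corpus and write them in the
    # pipe-delimited "n|ngram|f" format described above.
    counts = Counter()
    with open(corpus_path) as corpus:
        for line in corpus:
            tokens = line.split()
            counts.update(" ".join(tokens[i:i + n])
                          for i in range(len(tokens) - n + 1))
    with open(out_path, "w") as out:
        for gram, freq in counts.items():
            out.write("%d|%s|%d\n" % (n, gram, freq))

write_counts("corpus.txt", 1, "data/1-grams")  # unigram counts
write_counts("corpus.txt", 9, "data/9-grams")  # 9-grams for a 4-word window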
Once you have constructed COALS vectors, you can use them to extract lists of top-k similar words for each word:
coals --vectors coals-svdb-100.model --output top-similar.txt --topk 25
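For reference, the sketch below shows the kind of computation this step performs, assuming cosine similarity over a {word: vector} mapping; the tool's actual similarity measure may differ, and it reads the vectors from the model file and writes the top-k sets itself.

import numpy as np

def top_k_similar(vectors, word, k=25):
    # Rank all other words by cosine similarity to the target word's vector.
    target = vectors[word]
    sims = {}
    for other, vec in vectors.items():
        if other == word:
            continue
        denom = np.linalg.norm(target) * np.linalg.norm(vec)
        sims[other] = float(target @ vec) / denom if denom else 0.0
    return sorted(sims, key=sims.get, reverse=True)[:k]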
COALS requires uthash and svdlibc.
COALS is available under the Apache License, Version 2.0.