This project implements a multihash table n-gram language model class in C++. An n-gram is a sequence of n words that occur together in natural language (e.g., "ice cream"). Applications of n-gram language models include predictive text, machine translation, syntax-dependent spell check, speech/handwriting recognition, spam filtering, and plagiarism detection (Jurafsky & Martin, 2008). This implementation parses a plain-text file of natural written language, stores the parsed data, and accesses it efficiently with average O(1) (constant) runtime complexity. The multihash table data structure in this project is implemented as a main hash table whose nodes each hold two subtables and additional metadata: one subtable is a hash table, and the other is a dynamic array sorted by n-gram frequency. See below for more detail on n-gram language models, multihash tables, and the data structures and functions of this implementation.
This project has the following directory tree:
.
├── CMakeLists.txt
├── CMakeLists.txt.in
├── README.md
├── app
│ └── main.cpp
├── build
│ ├── run_app
│ └── run_tests
├── code
│ ├── LM.cpp
│ └── LM.h
├── corpus
│ ├── GettysburgAddress.txt
│ ├── Macbeth.txt
│ ├── MyAntoniaChaps1-9.txt
│ ├── OneBigram.txt
│ └── TaleOfTwoCitiesChapter1.txt
├── results
│ ├── GettysburgAddress.csv
│ └── MyAntoniaChaps1-9.csv
└── tests
└── test_LM.cpp
File | Description |
---|---|
CMakeLists.txt | Compilation configuration file |
CMakeLists.txt.in | Compilation configuration file |
README.md | This ReadMe markdown file |
app\ | Application directory |
main.cpp | Demo application code |
build\ | Compilation file directory |
run_app | Executable: Demonstrates class |
run_tests | Executable: Performs unit tests |
code\ | Class code directory |
LM.cpp | Language model class source file |
LM.h | Language model class header file |
corpus\ | Sample corpus texts directory * |
GettysburgAddress.txt | Lincoln's 'Gettysburg Address' UTF-8 text file |
Macbeth.txt | Shakespeare's 'Macbeth' UTF-8 text file |
MyAntoniaChaps1-9.txt | Cather's 'My Antonia,' Chap. 1-9, UTF-8 text file |
OneBigram.txt | UTF-8 text test file with only two words: 'one bigram' |
TaleOfTwoCitiesChapter1.txt | Dickens' 'A Tale of Two Cities,' Chap. 1, UTF-8 text file |
results\ | Output directory for CSV files of n-grams & counts |
GettysburgAddress.csv | 'Gettysburg Address' CSV output file |
MyAntoniaChaps1-9.csv | 'My Antonia,' Chap. 1-9, CSV output file |
tests\ | Unit tests directory |
test_LM.cpp | Unit tests code |
* Note: All of the corpus texts are public domain works from Project Gutenberg.
To compile the code and create executables in a Linux operating system environment:
- Change to the build directory:
cd build
- Run cmake to create the compilation configuration files:
cmake ..
- Run make to compile the executables:
make
This will create two executables in the build directory:
run_app
: Demonstrates the language model class with sample texts

run_tests
: Runs unit tests on the class
To run these executables:
- Change to the build directory:
cd build
- Enter ./run_app or ./run_tests accordingly
Predictive text, machine translation, syntax-dependent spell check, speech/handwriting recognition, spam filters, and plagiarism detection applications require a reliable language model (Jurafsky & Martin, 2008). Although complex syntax parse trees/forests are utilized in some specific contexts, probabilistic n-gram language models which use the frequency of certain combinations of n words (i.e., collocations) occurring in natural language can be surprisingly effective in many such applications. As an example, given the word “ice,” a word likely to follow it might be “cream,” so “ice cream” would be a frequent 2-gram (or bigram) for the headword “ice.”
To create an n-gram language model (LM), it is necessary to process large corpora of texts, calculating the frequency of each n-gram. Then, to apply the model, it is necessary to quickly obtain those frequencies either to predict the next word or, in the case of speech/handwriting recognition, to decide which interpretation of the input is most likely. Not only must each unique word in the corpus be mapped to the number of times it occurs in the corpus, it is also necessary to track the frequency of the potentially large set of n-grams that begin with each headword, that is, each headword’s sublist of collocates.
What’s more, every time a particular n-gram is encountered in the corpus, the count must be incremented, so it is necessary to search through a headword’s sublist for the corresponding n-gram each time. If there are many headwords, each with a potentially long sublist of collocates, repeatedly searching those sublists linearly becomes expensive for a large corpus.
A third challenge is that after the model is built, the most frequent n-grams for any given headword must be accessed quickly when applying the model. Thus, even if the sublist were sorted alphabetically to optimize search and update during the model training phase, the sublist must then be sorted by frequency to facilitate quick access to the frequency data. Furthermore, if the user wishes to improve the model by adding additional n-grams from another corpus, the sublist must again be sorted alphabetically for training and then resorted by frequency once again for accessing the frequency data. As there are m sublists, sorting and resorting can be an expensive operation for a large corpus.
A fourth challenge is that the number of unique words and unique n-grams contained in the corpus is unknown before processing, so the top list and sublists must be able to expand dynamically as the model is trained. Although this expansion can be achieved with linked lists and binary search trees, as explained above, linked lists have linear search time, and binary search trees cannot be simultaneously sorted by two different criteria.
To address these challenges, I implemented a multihash table data structure with embedded hash subtables. A hash table is an abstract data structure that places and locates data based on a hash function which transforms a value from the data (e.g., a name or ID number) into a key or index. Hash tables are time-efficient with average constant time complexity for insert, search, and remove. Whereas a hash table allows only one value per key, a multihash allows multiple values per key (Gregoire, 2018, p. 132; “Multimap,” 2022), and for this implementation, those multiple values are contained in embedded hash subtables. Thus, while based on the principles of a hash table, a multihash structure with embedded hash subtables requires different structures and member functions than a hash table. In other words, it is not simply a matter of using a hash table class.
More specifically, for this implementation, the upper-level multihash table contains each unique headword in the corpus with the headword hashed as the key. Each headword maps to an embedded subtable of each of the unique sets of collocates following the headword, with the collocate hashed as the key. For example, for the bigram “ice cream,” “ice” would be hashed as a key and indexed in the headword table, and that node would contain a subtable containing the collocate “cream” hashed as the key mapped to a value of the number of occurrences of that particular bigram. Because hash table search, insert, and remove have average constant time (Leiserson, 2005; Sedgewick, 1998), using a multihash with an embedded hash table will optimize runtime in the model training phase, as compared to the linear time of inserting into linked lists and the logarithmic time of balanced trees.
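Conceptually, this headword-to-collocate nesting behaves like a map of maps. The sketch below illustrates only the idea using the standard library's `std::unordered_map`; the class itself implements its own open-addressing tables rather than wrapping the standard containers.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Analogy only: the LM class uses its own hash tables, but the nesting
    // of headword -> (collocate -> count) is the same idea.
    std::unordered_map<std::string, std::unordered_map<std::string, int>> counts;

    // Each time the bigram "ice cream" appears in the corpus, the count
    // stored under headword "ice" and collocate "cream" is incremented.
    counts["ice"]["cream"]++;
    counts["ice"]["cream"]++;
    counts["ice"]["skating"]++;

    std::cout << "ice cream: " << counts["ice"]["cream"] << std::endl;  // prints 2
    return 0;
}
```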
To address the third challenge of accessing the frequency data quickly while maintaining the ability to add more n-grams to the model, when a collocate is added to a subtable, a pointer to that collocate is also added to a dynamic array in each headword node. When model training is completed, the array of pointers is sorted by frequency. This allows quick access to the most frequent n-grams, permits the later expansion of the model, and avoids taking up additional storage with duplicated data.
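A minimal sketch of that frequency sort is shown below, assuming illustrative member names (`collocate`, `count`, `deleted`) that may not match the actual `collocate_node` definition in `code/LM.h`.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

// Hypothetical member names for illustration; see code/LM.h for the real ones.
struct collocate_node {
    std::string collocate;  // key: the word following the headword
    int count = 0;          // value: occurrences of this n-gram
    bool deleted = false;   // deleted-node flag
};

// Sort the headword's dynamic array of collocate-node pointers by descending
// count so the most frequent collocates sit at the front of the array.
void SortByFrequency(std::vector<std::shared_ptr<collocate_node>>& freq_array) {
    std::sort(freq_array.begin(), freq_array.end(),
              [](const std::shared_ptr<collocate_node>& a,
                 const std::shared_ptr<collocate_node>& b) {
                  return a->count > b->count;
              });
}
```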
To address the fourth challenge, the initial hash table sizes are estimated from the raw corpus file size and are automatically expanded when necessary. In addition, following Sedgewick's (1998) recommendations, prime number hash table sizes are used to reduce collisions (i.e., instances where the hash function returns the same index for two different elements), linear probing (i.e., trying the next available index in the table) is used to handle any collisions, and the tables are resized when the load exceeds or falls below specified thresholds. For the hash function, the Fowler-Noll-Vo 1a non-cryptographic hash function (FNV-1a; Fowler et al., 2019) was selected because, although not perfect, it results in relatively few collisions for alphanumeric keys, is relatively fast, and is relatively simple to implement compared to other common hash functions (Estébanez et al., 2014; Henke et al., 2008; Karasek et al., 2011; van den Bosch, 2015). Collectively, these measures help to ensure average constant runtime complexity.
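As a sketch of this hashing scheme, the 32-bit FNV-1a function and a linear-probe lookup might look roughly as follows; the storage here is a plain vector of keys standing in for the class's node arrays, so the names and details are illustrative rather than taken from `code/LM.cpp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 32-bit FNV-1a (Fowler et al., 2019): start from the offset basis, then
// XOR each byte into the hash and multiply by the FNV prime.
uint32_t Fnv1a(const std::string& key) {
    uint32_t hash = 2166136261u;   // FNV-1a 32-bit offset basis
    for (unsigned char c : key) {
        hash ^= c;
        hash *= 16777619u;         // FNV-1a 32-bit prime
    }
    return hash;
}

// Illustrative open-addressing lookup: hash modulo a (preferably prime)
// capacity gives the home bucket; on a collision, linear probing tries the
// next slot. An empty string marks a free bucket; load-factor resizing is
// assumed to keep the table from ever being completely full.
std::size_t ProbeForKey(const std::vector<std::string>& slots, const std::string& key) {
    std::size_t index = Fnv1a(key) % slots.size();
    while (!slots[index].empty() && slots[index] != key) {
        index = (index + 1) % slots.size();   // linear probing
    }
    return index;   // either the key's bucket or the first free bucket
}
```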
The multihash table n-gram language model (LM) class has the following data structures and member functions:
struct headword_node
: Multihash node containing headword as key; collocate subtable, collocate-node pointer subtable, and count as values; and deleted node flag

struct collocate_node
: Subtable node containing collocate as key, count as value, and deleted node flag

struct head_table
: Multihash table containing array of headword nodes, hash function setting, occupied count, and capacity

struct collocate_table
: Subtable containing array of collocate nodes, hash function setting, occupied count, and capacity
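To make these descriptions concrete, the declarations below show one plausible shape for the four structures; the member names and types are guesses based on the descriptions above, and the authoritative definitions live in `code/LM.h`.

```cpp
#include <memory>
#include <string>
#include <vector>

// Illustrative declarations reconstructed from the descriptions above.
struct collocate_node {
    std::string collocate;   // key
    int count = 0;           // value: occurrences of the n-gram
    bool deleted = false;    // deleted-node flag
};

struct collocate_table {
    std::vector<std::shared_ptr<collocate_node>> nodes;  // array of collocate nodes
    int hash_setting = 0;    // hash function setting
    int occupied = 0;        // occupied count
    int capacity = 0;        // table capacity
};

struct headword_node {
    std::string headword;                                      // key
    std::shared_ptr<collocate_table> collocates;               // collocate subtable
    std::vector<std::shared_ptr<collocate_node>> by_frequency; // collocate-node pointer subtable
    int count = 0;                                             // headword count
    bool deleted = false;                                      // deleted-node flag
};

struct head_table {
    std::vector<std::shared_ptr<headword_node>> nodes;  // array of headword nodes
    int hash_setting = 0;   // hash function setting
    int occupied = 0;       // occupied count
    int capacity = 0;       // table capacity
};
```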
LM()
: Constructs bigram language model object

LM(int n)
: Constructs language model object with specified n-gram size (e.g., 2-gram, 3-gram, etc.)

bool Train(string filename)
: Trains LM with an input plain-text file, parsing the text and inserting n-grams in multihash table and subtables; replaces any previously existing model

map<string, int> NGrams()
: Returns a map of all n-grams and their respective counts

vector<string> Collocates(string headword, unsigned int x)
: Returns a vector of the x most frequent collocates for a given headword

float Frequency(string ngram)
: Returns the frequency of a specified n-gram; if a unigram is entered for a bigram or higher model, headword frequency is returned

map<string, int> CollocateCounts(string headword)
: Returns a map of collocates and their counts for a given headword

int UniqueUnigramCount()
: Returns the total number of unique words (i.e., unigrams)

int UniqueNGramsCount()
: Returns the total number of unique n-grams

int NGramsTotal()
: Returns the total number of n-grams processed (including duplicates)

int TotalTokens()
: Returns total number of tokens (i.e., unigrams/headwords) processed

bool CSV(string filename)
: Generates a CSV file of all n-grams and counts; returns false if failed

bool Remove(string ngram)
: Removes a specified n-gram

bool Grow(string filename)
: Expands model with additional input text file
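As an illustration of how this public interface might be exercised (for example, from a demo program similar to `app/main.cpp`), consider the following sketch; the include path, corpus paths, and working directory are assumptions and may differ from the actual demo.

```cpp
#include <iostream>
#include <string>
#include <vector>
#include "../code/LM.h"  // assumed relative include path

int main() {
    LM model(2);  // bigram language model

    // Train on one of the sample corpus texts, then expand with another.
    if (!model.Train("../corpus/GettysburgAddress.txt")) {
        std::cout << "Training failed." << std::endl;
        return 1;
    }
    model.Grow("../corpus/TaleOfTwoCitiesChapter1.txt");

    std::cout << "Unique words: " << model.UniqueUnigramCount() << std::endl;
    std::cout << "Unique bigrams: " << model.UniqueNGramsCount() << std::endl;

    // The three most frequent collocates of a headword, and one bigram's frequency.
    std::vector<std::string> top = model.Collocates("the", 3);
    for (const std::string& word : top) {
        std::cout << "the " << word << std::endl;
    }
    std::cout << model.Frequency("of the") << std::endl;

    // Write all n-grams and counts to a CSV file.
    model.CSV("../results/GettysburgAddress.csv");
    return 0;
}
```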
int EstimateTokenCount(string filename)
: For the purpose of setting initial table size, estimates number of words (i.e., tokens) based on input file size

unsigned int PowerOfTwoExponent(unsigned int x)
: Returns exponent of smallest power of 2 greater than or equal to x

unsigned int PrimeSize(unsigned int x)
: Returns smallest prime greater than the smallest power of 2 greater than or equal to x for x > minimum table size - 1, or returns minimum table size otherwise (see Sedgewick, 1998)

shared_ptr<head_table> InitHeadTable(int capacity)
: Initializes headword multihash table

shared_ptr<collocate_table> InitCollocateTable(int capacity)
: Initializes collocate hash subtable

shared_ptr<headword_node> InitHeadword(string headword)
: Initializes headword node

shared_ptr<collocate_node> InitCollocate(string collocate)
: Initializes collocate node

int InsertNGram(string headword, string collocate)
: Inserts headword or increments count if already exists; returns bucket index

int InsertCollocate(shared_ptr<headword_node> headword, string collocate)
: Inserts collocate into headword’s subtable or increments count if already exists; returns bucket index

shared_ptr<headword_node> FindHeadword(string headword)
: Finds headword node

shared_ptr<collocate_node> FindCollocate(shared_ptr<headword_node> headword, string collocate)
: Finds collocate node

void Reset()
: Deletes existing model and resets counts

bool DeleteHeadword(shared_ptr<headword_node> headword)
: Deletes headword by setting deleted node flag to true and corrects counts

bool DeleteCollocate(shared_ptr<headword_node> headword, shared_ptr<collocate_node> collocate)
: Deletes collocate (and headword if appropriate) by setting deleted node flag(s) to true and corrects counts

float GetHeadTableLoad(shared_ptr<head_table> table)
: Returns head table load

float GetCollocateTableLoad(shared_ptr<collocate_table> table)
: Returns collocate table load

void ResizeHeadTable(shared_ptr<head_table> table)
: Resizes head table

void ResizeCollocateTable(shared_ptr<headword_node> headword, shared_ptr<collocate_table> table)
: Resizes collocate table

int InsertHeadNode(shared_ptr<head_table> table, shared_ptr<headword_node> node)
: Inserts existing headword node into new headword table

int InsertCollocateNode(shared_ptr<collocate_table> table, shared_ptr<collocate_node> node)
: Inserts existing collocate node into new subtable

void SortCounts()
: Sorts all frequency subtables after building or expanding LM

vector<string> SplitNGram(string ngram)
: Separates n-gram string into headword and collocate
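For instance, `PowerOfTwoExponent` and `PrimeSize` could be implemented roughly as below; the minimum table size constant here is a made-up placeholder, and the actual values and logic in `code/LM.cpp` may differ.

```cpp
// Hypothetical minimum table size; the class's actual constant may differ.
const unsigned int kMinTableSize = 11;

// Exponent of the smallest power of 2 greater than or equal to x.
unsigned int PowerOfTwoExponent(unsigned int x) {
    unsigned int exponent = 0;
    unsigned int power = 1;
    while (power < x) {
        power *= 2;
        ++exponent;
    }
    return exponent;
}

// Simple trial-division primality test, sufficient for table-size candidates.
bool IsPrime(unsigned int n) {
    if (n < 2) return false;
    for (unsigned int d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime greater than the smallest power of 2 >= x, or the minimum
// table size when x is small (see Sedgewick, 1998, on prime table sizes).
unsigned int PrimeSize(unsigned int x) {
    if (x <= kMinTableSize - 1) return kMinTableSize;
    unsigned int candidate = (1u << PowerOfTwoExponent(x)) + 1;
    while (!IsPrime(candidate)) ++candidate;
    return candidate;
}
```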
Unit tests using Google's C++ testing framework (GoogleTest) were performed to verify that the implemented class behaves as expected. The following tests were implemented and passed:
InitLM
: Both the default and parameterized constructors initialize data members as expected

InitCollocate
: Initializes collocate node's data members as expected

PrimeSize
: Returns expected prime numbers for several inputs

InitCollocateTable
: Initializes collocate hash subtable as expected

HashFunction
: Collocate subtable hash function assigns index as expected

InitHeadword
: Initializes headword node's data members as expected

InitHeadTable
: Initializes headword hash table node's data members as expected

InsertNGram
: Inserts several bigrams in head table and collocate and frequency subtables as expected

EstimateTokenCount
: Within 5% error margin, correctly estimates number of tokens based on text file's size

Train
: Training bigram model with empty, small, and large text files results in expected bigram and token counts (implicitly tests that table sizes are automatically increased before number of entries exceeds capacity)

UniqueUnigramCount
: Returns expected number of unique unigrams for a given text

UniqueNGramsCount
: Returns expected number of unique trigrams for a given text

NGramsTotal
: Returns expected number of trigrams for a given text

TotalTokens
: Returns expected number of tokens for a given text in a 4-gram language model

FindHeadword
: Locates expected node

FindCollocate
: Locates expected node

NGrams
: Several bigrams and their respective counts match expected values for a given text

Collocates
: Specified number of most frequent collocates for a particular headword are returned for a given text

Frequency
: Several bigrams and their respective frequencies match expected values for a given text

CollocateCounts
: Several collocates and their respective counts match expected values for a particular headword in a given text

CSV
: A CSV file is output with expected number of elements

Remove
: Unigram, bigram, and token counts are as expected after several bigrams are removed; improperly input strings do not affect counts; when last element in a table is removed, the entire table is reset; when the load for a table drops below specified threshold, the table size is reduced

Grow
: Expanding bigram language model with both the same text file and a different text file results in expected token and bigram counts and changes in frequency; bigrams unique to each text are present in model
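For reference, a test in the style of `tests/test_LM.cpp` might look like the sketch below; the relative corpus path and the exact expected counts are assumptions about this implementation rather than excerpts from the actual test file.

```cpp
#include <gtest/gtest.h>
#include "../code/LM.h"  // assumed relative include path

// Illustrative only: OneBigram.txt contains the two words "one bigram",
// so a bigram model trained on it should hold exactly one unique bigram.
TEST(LMTest, TrainOneBigramSketch) {
    LM model(2);  // bigram model
    ASSERT_TRUE(model.Train("../corpus/OneBigram.txt"));  // path is an assumption
    EXPECT_EQ(1, model.NGramsTotal());       // one bigram processed
    EXPECT_EQ(1, model.UniqueNGramsCount()); // one unique bigram
}
```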
Future improvements to the n-gram language model class implemented here could include refined parsing capabilities, such as handling double-width characters and treating punctuation marks as tokens. Adding start-of-sentence and end-of-sentence tokens would also make it possible to automatically generate text similar to the corpus a model is trained on.
Estébanez, C., Saez, Y., Recio, G., & Isasi, P. (2014). Performance of the most common non‐cryptographic hash functions. Software: Practice and Experience, 44(6), 681-698. https://doi.org/10.1002/spe.2179
Fowler, G., Noll, L. C., Vo, K.-P., Eastlake, D., & Hansen, T. (2019, May 29). The FNV non-cryptographic hash algorithm. Internet Engineering Task Force. https://datatracker.ietf.org/doc/html/draft-eastlake-fnv-17.html
Google. (n.d.). GoogleTest. https://google.github.io/googletest
Gregoire, M. (2018). Professional C++ (4th ed.). Wiley.
Henke, C., Schmoll, C., & Zseby, T. (2008). Empirical evaluation of hash functions for multipoint measurements. ACM SIGCOMM Computer Communication Review, 38(3), 39-50. https://doi.org/10.1145/1384609.1384614
Jurafsky, D., & Martin, J. H. (2008). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (2nd ed.). Prentice Hall.
Karasek, J., Burget, R., & Morský, O. (2011, August). Towards an automatic design of non-cryptographic hash function. In 2011 34th International Conference on Telecommunications and Signal Processing (TSP) (pp. 19-23). IEEE. https://doi.org/10.1109/TSP.2011.6043785
Leiserson, C. E. (2005). Introduction to algorithms: Lecture 7: Hashing, hash functions [Video lecture]. MIT Open Courseware. https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-7-hashing-hash-functions
Multimap. (2022, March 11). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Multimap&oldid=1076479596
Project Gutenberg. (n.d.). https://www.gutenberg.org/
Sedgewick, R. (1998). Algorithms in C++: Parts 1-4: Fundamentals, data structures, sorting, searching (3rd ed.). Addison-Wesley.
van den Bosch, N. (2015). A comparison of hashing algorithms [Bachelor's thesis]. Leiden University. https://theses.liacs.nl/pdf/2014-2015NickvandenBosch.pdf