-Software and libraries used:
Python 2.6 x64
up to date sklearn / numpy / gensim
word2phrase (https://github.com/travisbrady/word2phrase/blob/master/word2phrase.py)
elasticsearch 2.0.0 (or similar with support for LMJM / BM25 similarity algorithms)
Counter for Python 2.6 (http://code.activestate.com/recipes/576611-counter-class/) renamed as mycounter.py
-Data used:
CK12 books (http://www.ck12.org/user%3AanBkcmVjb3VydEBnbWFpbC5jb20./book/Concepts/?utm_medium=email&utm_source=share-content-share-this-flexbook%C2%AE-textbook)
Flashcards retrieved using the Quizlet API (several queries during Dec 2015 / Jan 2016). The search keywords mostly correspond to 8th grade exam topics (see train/topics.txt).
These files can be found in /data and have the following format: search keyword, flashcard question and flashcard answer, separated by tabs.
There are several files because of retries after API failures and changes to the set of keywords.
No information from the training or test set was used during the data retrieval process. Also, no further data or API calls are needed to generate new predictions.
The original data files were zipped and added for reproducibility; they need to be extracted in the same directory.
The training/validation sets provided by the organizers should reside in /data (they cannot be shared outside the competition, so you should download them from Kaggle).
I augmented the training set by including the aristo.csv file, which is a compilation
of 345 questions/answers manually merged from http://allenai.org/content/data/Aristo_Multi-state.zip and http://allenai.org/content/data/Regents.zip
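A minimal sketch of reading the tab-separated flashcard format described above (search keyword, question, answer). The function name and the sample rows are made up for illustration, and the code uses Python 3 syntax rather than the Python 2.6 listed under the software requirements; the real scripts may parse the files differently.

```python
import csv
import io

def read_flashcards(stream):
    """Yield (keyword, question, answer) tuples from a tab-separated stream."""
    # QUOTE_NONE: treat quote characters inside flashcards as plain text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip malformed rows rather than guessing at missing fields.
        if len(row) != 3:
            continue
        keyword, question, answer = (field.strip() for field in row)
        yield keyword, question, answer

# Made-up sample data: one well-formed row and one malformed row.
sample = io.StringIO(
    "photosynthesis\tWhat gas do plants absorb?\tcarbon dioxide\n"
    "broken line without tabs\n"
)
cards = list(read_flashcards(sample))
```

To read a real file, pass an open file object instead of the in-memory sample.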
-Scripts:
genlemmas.py
Dedupes and generates lemmatized files from the flashcards and books data.
The train folder should look like this after running genlemmas.py:
/train/
07/01/2016 06:17 392,643,191 bigquiz.txt (quizlet flashcards)
07/01/2016 22:43 81,069,226 bigquiz2.txt (quizlet flashcards)
08/01/2016 09:34 494,383,330 bigquiz3.txt (quizlet flashcards)
10/01/2016 17:11 120,850,240 bigquizlemma.txt (deduped bigquiz.txt after lemmatization)
10/01/2016 19:48 18,170,621 bigquizlemma2.txt (deduped bigquiz2.txt after lemmatization)
10/01/2016 20:02 180,042,639 bigquizlemma3.txt (deduped bigquiz3.txt after lemmatization)
08/01/2016 10:52 3,513,169 CK12clean.txt (cleaned CK12 ebook)
10/01/2016 17:11 2,411,097 CK12lemma.txt (cleaned CK12 ebook lemmatized)
16/12/2015 08:54 320,695,054 requiz.txt (quizlet flashcards)
16/12/2015 17:37 209,900,104 requiz2.txt (quizlet flashcards)
17/12/2015 08:09 71,508,317 requiz3.txt (quizlet flashcards)
23/12/2015 23:07 185,594,189 requizlemma.txt (joined and deduped requiz.txt, requiz2.txt, requiz3.txt after lemmatization)
05/01/2016 14:35 3,827 topics.txt (8th grade keywords used to query Quizlet API to download the flashcards)
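The "dedupe after lemmatization" step that produces the *lemma*.txt files can be sketched as follows (Python 3 syntax). This is an illustration only: toy_lemmatize is a hypothetical stand-in that lowercases and strips a trailing "s", not the lemmatizer used by genlemmas.py, which these instructions do not name.

```python
def toy_lemmatize(text):
    """Hypothetical placeholder for a real lemmatizer."""
    tokens = text.lower().split()
    # Crude plural/verb-ending stripping, only to make the example runnable.
    stripped = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return " ".join(stripped)

def dedupe_lemmatized(lines, lemmatize=toy_lemmatize):
    """Lemmatize each line, then keep only the first copy of each result."""
    seen = set()
    unique = []
    for line in lines:
        lemma = lemmatize(line)
        # Drop empty lines and lines whose lemmatized form was already seen.
        if lemma and lemma not in seen:
            seen.add(lemma)
            unique.append(lemma)
    return unique
```

Deduplicating after lemmatization, as the listing indicates, also merges lines that differ only in inflection, which is consistent with the lemma files being much smaller than the raw ones.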
genvocab.py
Extracts the most common word bigrams and trigrams from the lemmatized training data and generates the vocabulary file "quizletvocab3.pick".
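Counting frequent bigrams and trigrams can be sketched with the standard library alone (Python 3 syntax). The function names, the whitespace tokenization and the top_k cut-off are assumptions for illustration; genvocab.py may select and store its vocabulary differently.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def top_ngrams(lines, sizes=(2, 3), top_k=3):
    """Count word n-grams of the given sizes and return the most common."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in sizes:
            counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(top_k)
```

For example, top_ngrams(["cell membrane protein", "cell membrane lipid"], top_k=1) ranks "cell membrane" first because it is the only n-gram that occurs twice.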
trainw2v.py
Trains a word2vec model, saved as 'model/word2vec_myck12_quizlet3_stem_23gram.model', using the lemmatized data. Fuzzy matching for vocabulary expansion seems to improve the results.
This script uses the vocab file extracted in the previous step.
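These instructions do not say how the fuzzy matching is implemented. One possible approach, shown here purely as an assumption, maps a word that is missing from the vocabulary to its closest vocabulary entry using difflib from the standard library (Python 3 syntax). The function name and the 0.8 cut-off are hypothetical.

```python
import difflib

def expand_to_vocab(word, vocab, cutoff=0.8):
    """Return word if known, else its closest vocab entry, else None."""
    if word in vocab:
        return word
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With a vocabulary of ["photosynthesis", "mitochondria"], the misspelling "photosynthesys" resolves to "photosynthesis", so its vector could still be looked up in the trained model.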
indexES.py
Creates a knowledge base in Elasticsearch based on the CK12 and Quizlet data (both lemmatized and original).
The indexes are generated as follows:
quizlets -> quizlet flashcards
quizlets_lemma -> lemmatized quizlet flashcards
qa -> CK12 books
qa_lemma -> CK12 lemmatized books
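As a rough sketch of how documents could be loaded into one of the indexes listed above, the snippet below builds a request body in the newline-delimited format of the Elasticsearch bulk API (Python 3 syntax). The "text" field, the "doc" type name and the sequential ids are assumptions; indexES.py may use a client library and a different mapping.

```python
import json

def bulk_body(index_name, texts, doc_type="doc"):
    """Build a bulk-API body: one action line plus one source line per doc."""
    lines = []
    for doc_id, text in enumerate(texts):
        # Action line: which index, type and id the next line belongs to.
        action = {"index": {"_index": index_name, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        # Source line: the document itself ("text" is a hypothetical field).
        lines.append(json.dumps({"text": text}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent to the _bulk endpoint, once for each of the four indexes with its corresponding data.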
getfeatsESW2V.py
Generates IR and W2V features for the training and validation data
getfeatsESBM25.py
Generates IR features for the training and validation data (requires a config change in elasticsearch.yml: index.similarity.default.type: BM25)
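For readers unfamiliar with BM25, the ranking idea can be sketched in a few lines (Python 3 syntax). This is the textbook formula with the commonly used defaults k1=1.2 and b=0.75, not Elasticsearch's exact implementation, and it is not code from getfeatsESBM25.py; the function name and whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in docs against query with textbook BM25."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that shares no terms with the query scores exactly zero, while documents containing rare query terms rank highest.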
getfeatsESLMJM.py
Generates IR features for the training and validation data (requires a config change in elasticsearch.yml: index.similarity.default.type: LMJelinekMercer)
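The Jelinek-Mercer language model (LMJM) mixes a term's frequency in the document with its frequency in the whole collection. The sketch below uses the textbook query log-likelihood (Python 3 syntax); Elasticsearch computes a related but not identical score, and the function name, the smoothing weight lam=0.1 and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def lmjm_log_likelihood(query, doc, docs, lam=0.1):
    """Log-likelihood of query under a Jelinek-Mercer smoothed model of doc."""
    doc_tf = Counter(doc.split())
    doc_len = sum(doc_tf.values())
    # Collection statistics over all documents, used for smoothing.
    coll_tf = Counter()
    for other in docs:
        coll_tf.update(other.split())
    coll_len = sum(coll_tf.values())
    total = 0.0
    for term in query.split():
        if coll_tf[term] == 0:
            continue  # term unseen in the collection: skip it
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf[term] / coll_len
        # Linear interpolation between document and collection models.
        total += math.log((1 - lam) * p_doc + lam * p_coll)
    return total
```

Scores are negative log-probabilities, so values closer to zero indicate a better match between the query and the document.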
combine_all.py
Trains a Logistic Regression model, performs 5-fold cross-validation, and generates the final submission.
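To illustrate what 5-fold cross-validation means for the data, here is a dependency-free sketch of how sample indices can be partitioned into folds (Python 3 syntax). The function name is hypothetical and combine_all.py most likely relies on sklearn for this; the sketch only shows the partitioning idea, not the author's code.

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size
```

Each sample lands in the validation set exactly once, so averaging the five validation scores gives an estimate of how the model performs on unseen questions.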