Code to accompany the paper: Keith and O'Connor. "Uncertainty-aware generative models for inferring document class prevalence." EMNLP, 2018.
For more information, see the research project's website. If you would like to use our model in your own work, we suggest using our freq-e software package.
If you use this code, please cite the paper:

@inproceedings{keith18uncertainty,
  author = {Keith, Katherine A. and O'Connor, Brendan},
  title = {Uncertainty-aware generative models for inferring document class prevalence},
  booktitle = {{EMNLP}},
  year = {2018}
}
- First, download the Yelp academic dataset (Challenge Round 9): https://www.yelp.com/dataset_challenge
- Extract the .tar file
- Then run:
cd yelp_data
./go_preproc.sh
This does the following pre-processing:
- tokenizes using NLTK
- extracts unigrams
- lowercases
- separates reviews into classes (class=0 is stars <= 3, class=1 is stars > 3)
- prunes the vocabulary: any token that appears in fewer than 5 training documents is dropped; the same pruned vocabulary is used for the LSTM, with pruned tokens mapped to an OOV symbol
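The vocabulary pruning and OOV mapping above can be sketched as follows. This is a minimal illustration, not the repo's actual code: `build_vocab` and `preprocess` are hypothetical helper names, and `str.split` stands in for NLTK tokenization to keep the sketch dependency-free.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_df=5):
    # Keep only tokens that appear in at least min_df training documents
    # (document frequency, not total count).
    df = Counter()
    for toks in tokenized_docs:
        df.update(set(toks))
    return {w for w, c in df.items() if c >= min_df}

def preprocess(text, vocab, oov="<OOV>"):
    # Lowercase, tokenize (NLTK word_tokenize in the repo; split() here),
    # and map out-of-vocabulary tokens to a single OOV symbol.
    toks = [t.lower() for t in text.split()]
    return Counter(t if t in vocab else oov for t in toks)
```

Using the same pruned vocabulary for both the count-based models and the LSTM keeps their input representations comparable.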
Output is train/train_all.json with dictionary keys:
- "review_id" : review id from the original doc
- "class" : class=0 is stars <= 3, class=1 is stars > 3
- "date" : original date
- "toks" : dictionary of token counts
- First, ignore all businesses that have fewer than 200 reviews
- Choose 500 test and 500 dev businesses by weighted random sampling
- each business is weighted by its number of reviews
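The business selection above can be sketched as weighted sampling without replacement. This is an illustrative sketch, not the repo's code; `sample_businesses` is a hypothetical helper, and the 200-review cutoff follows the setup described above.

```python
import random

def sample_businesses(doc_counts, k, min_reviews=200, seed=0):
    # doc_counts: {business_id: number_of_reviews}.
    # Sample k distinct businesses, each drawn with probability
    # proportional to its review count, excluding small businesses.
    pool = {b: n for b, n in doc_counts.items() if n >= min_reviews}
    rng = random.Random(seed)
    chosen = []
    for _ in range(min(k, len(pool))):
        ids, weights = zip(*pool.items())
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement: remove once selected
    return chosen
```

Drawing test and dev sets this way makes review-rich businesses more likely to be selected, matching the weighting by document count.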
#(1) ASSEMBLE DATA
yelp_data/go_maketrain_trials.sh #natural setting
yelp_data/go_maketrain_trials_prop.sh #synthetic setting proportion of 0.1
#(2) TRAIN MODELS
code/train_all/go_train_master.sh
#(3) EVALUATE MODELS ON THE TEST SET
code/eval/baselines.py
code/eval/go_eval_master.sh
#(4) ANALYZE RESULTS
code/eval/analysis.py
graphs/final_mae_plots.ipynb
#(1) ASSEMBLE DATA
yelp_data/go_train_prop_all.sh
#(2) TRAIN MODELS
code/train_all/go_training_prop_all.sh
#(3) EVAL MODELS ON THE TEST SET
code/eval/go_eval_train_prop_all.sh
#(4) ANALYZE RESULTS
code/eval/analysis_train_prop.py
graphs/train_prop10trains.ipynb
#(1) ASSEMBLE DATA
yelp_data/go_trainsize_trials.sh
#(2) TRAIN
code/train_all/go_trainsize_trials.sh
#(3) EVAL
code/eval/go_trainsize_all_trials.sh
#(4) ANALYSIS
code/eval/analysis_train_size.py
graphs/trainsize.ipynb
- First, you will need to download the ReadMe R package
- Make sure to manually change the home path in
/readme_our_experiments/coderun_readme.R
- Then run:
code/readme_our_experiments/go_readme.sh