Code to accompany the paper: Keith and O'Connor. "Uncertainty-aware generative models for inferring document class prevalence." EMNLP, 2018.
For more information, see the research project's website. If you would like to use our model in your own work, we suggest using our freq-e software package.
If you use this code, please cite the paper:

@inproceedings{keith18uncertainty,
  author = {Keith, Katherine A. and O'Connor, Brendan},
  title = {Uncertainty-aware generative models for inferring document class prevalence},
  booktitle = {{EMNLP}},
  year = {2018}
}
- First, download the Yelp academic dataset (Challenge Round 9): https://www.yelp.com/dataset_challenge
- Extract the .tar file
- Then run:
cd yelp_data
./go_preproc.sh
This does the following pre-processing:
- tokenizes using NLTK
- extracts unigrams
- lowercases
- separates reviews into classes (class=0 is stars <= 3, class=1 is stars > 3)
- prunes the vocabulary: any token that appears in fewer than 5 training documents is dropped; the same pruned vocabulary is used for the LSTM, with pruned tokens mapped to an OOV symbol
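The vocabulary pruning and OOV mapping above can be sketched as follows. This is a minimal illustration, not the repo's actual code: `build_vocab` and `preprocess` are hypothetical helper names, and `str.split` stands in for NLTK tokenization to keep the sketch dependency-free.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_df=5):
    # Keep only tokens that appear in at least min_df training documents
    # (document frequency, not total count).
    df = Counter()
    for toks in tokenized_docs:
        df.update(set(toks))
    return {w for w, c in df.items() if c >= min_df}

def preprocess(text, vocab, oov="<OOV>"):
    # Lowercase, tokenize (NLTK word_tokenize in the repo; split() here),
    # and map out-of-vocabulary tokens to a single OOV symbol.
    toks = [t.lower() for t in text.split()]
    return Counter(t if t in vocab else oov for t in toks)
```

Using the same pruned vocabulary for both the count-based models and the LSTM keeps their input representations comparable.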
Output is train/train_all.json with dictionary keys:
- "review_id" : review id from the original doc
- "class" : class=0 is stars <= 3, class=1 is stars > 3
- "date" : original date
- "toks" : dictionary of token counts
- First, ignore all businesses that have fewer than 200 reviews
- Choose 500 test and 500 dev businesses by weighted random sampling
- each business is weighted by its number of reviews
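The business selection above can be sketched as weighted sampling without replacement. This is an illustrative sketch, not the repo's code; `sample_businesses` is a hypothetical helper, and the 200-review cutoff follows the setup described above.

```python
import random

def sample_businesses(doc_counts, k, min_reviews=200, seed=0):
    # doc_counts: {business_id: number_of_reviews}.
    # Sample k distinct businesses, each drawn with probability
    # proportional to its review count, excluding small businesses.
    pool = {b: n for b, n in doc_counts.items() if n >= min_reviews}
    rng = random.Random(seed)
    chosen = []
    for _ in range(min(k, len(pool))):
        ids, weights = zip(*pool.items())
        pick = rng.choices(ids, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # without replacement: remove once selected
    return chosen
```

Drawing test and dev sets this way makes review-rich businesses more likely to be selected, matching the weighting by document count.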
#(1) ASSEMBLE DATA
yelp_data/go_maketrain_trials.sh #natural setting
yelp_data/go_maketrain_trials_prop.sh #synthetic setting proportion of 0.1
#(2) TRAIN MODELS
code/train_all/go_train_master.sh
#(3) EVALUATE MODELS ON THE TEST SET
code/eval/baselines.py
code/eval/go_eval_master.sh
#(4) ANALYZE RESULTS
code/eval/analysis.py
graphs/final_mae_plots.ipynb
#(1) ASSEMBLE DATA
yelp_data/go_train_prop_all.sh
#(2) TRAIN MODELS
code/train_all/go_training_prop_all.sh
#(3) EVAL MODELS ON THE TEST SET
code/eval/go_eval_train_prop_all.sh
#(4) ANALYZE RESULTS
code/eval/analysis_train_prop.py
graphs/train_prop10trains.ipynb
#(1) ASSEMBLE DATA
yelp_data/go_trainsize_trials.sh
#(2) TRAIN
code/train_all/go_trainsize_trials.sh
#(3) EVAL
code/eval/go_trainsize_all_trials.sh
#(4) ANALYSIS
code/eval/analysis_train_size.py
graphs/trainsize.ipynb
- First, you will need to download the ReadMe R package
- Make sure to manually change the home path in
/readme_our_experiments/coderun_readme.R
- Then run:
code/readme_our_experiments/go_readme.sh