Replication software, data, and supplementary materials for the paper: Keith et al., EMNLP-2017, "Identifying civilians killed by police with distantly supervised entity-event extraction."
Contact: Katherine Keith ([email protected]), Brendan O'Connor ([email protected])
-
data/
- sentments/
- train.json, test.json: sentence-level train and test files after preproc
- docid = docid_fragid_sentid, i.e. the first number before the hyphen is the document id matching docs/ the second number is the fragment number within the document, and the third number is the sentence number within that fragment (see preproc/ for how these fragments and sentences were segmented)
- train.json, test.json: sentence-level train and test files after preproc
- gold/
- fatalencs/
- fe-raw.csv: raw Fatal Encounters (FE) file downloaded Feb. 27, 2017
- fe-all.json: our FE post-processing (with hapnis normalized names)
- guardian/
- guard-raw.csv: raw guardian 2016 data downloaded Jan. 1, 2017
- guard-all.json: our guardian post-processing (with hapnis normalized names)
- fatalencs/
- sentments/
-
code/
- eval/
- evaluation.py: prints out AUPRC and best F1 for a given model
- models/
- preds/
- .json files with predictions for the six models in the paper
- logreg/
- Logistic Regression model code. See
README.md
in this directory for further instructions and notes
- Logistic Regression model code. See
- cnn/
- CNN model code. See
README.md
in this directory for further instructions and notes
- CNN model code. See
- preds/
- preproc/
- scrape/
- Code which downloads articles found via Google News and adds them to a Postgres database
- dedupe/
- Code which removes duplicate sentences from the dataset.
- sentment/
- hap/ : HAPNIS name normalization code
- normnames.py : name normalization
- getsentment.py : matches extracted sentences against gold data
- scrape/
- eval/
-
requirements.txt : pip installed packages in requirements format
-
run.sh : runs the entire model pipleine (with data-pre-processed)
-
run_pretrained.sh : runs the model pipeline with the pre-trained model for the given test data
Model train/test split of documents:
- Training: Jan. - Aug. 2016
- Testing: Sept. - Dec. 2016
The evalutation script prints out the AUPRC and best F1 for the predictions of a given model.
Example usage:
cat code/models/preds/m1.json | python code/eval/evaluation.py
The evaluation code requries predictions in the following json format (see code/models/preds/m1.json
for example) with one dictionary per line and dicitionary keys:
- "id" : document-sentence id
- "weight" : prediction on that mention given by the model
- "name" : name of the potential victim associated with that mention
To run the current model (logistic regression with EM training) with train/test data after pre-processing:
./run.sh
To run the pre-trained model on test data
./run_pretrained.sh
Here's an outline for the pipeline for soft (EM-based logistic regression):
- Extract features
python code/models/logreg/extrfeats.py data/sentments/train.json --ngrams --deps
python code/models/logreg/extrfeats.py data/sentments/test.json --ngrams --deps
- Run through logistic regression
python code/models/logreg/runlr.py data/sentments/train.json data/sentments/test.json train_ng_dep.mtx test_ng_dep.mtx
- Evaluate
cat code/models/preds/em50.json | python code/eval/evaluation.py
The preprocessed data used for this paper is data/sentments/train.json
and data/sentments/test.json
.
For both files, each line corresponds to a single mention with dictionary keys:
- "docid": document id
- "name": HAPNIS normalized firstname, lastname pair of that is mapped to the 'TARGET' symbol
- "names_org": un-normalized names corresponding to 'name' that originally appeared in the text
- "sentnames": other names in the mention that will be mapped to 'PERSON' symbol
- "downloadtime": time the document was downloaded
- "sent_org": original mention text
- "sent_alter": mention text with names replaced by 'TARGET' and 'PERSON' symbols
- "plabel": 1 if 'name' matches a gold standard victim name in Fatal Encounters, 0 otherwise
Feature extraction requires
- Standford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)
- Standford CoreNLP pywrapper (https://github.com/brendano/stanford_corenlp_pywrapper)