- Developed confidence score tuning process (see `disambiguation_1880` folder)
- Developed interpolation process (see `interpolation` folder for code and `interpolation_notebooks` folder for Jupyter notebooks that document the process)
- For an overview of the process and next steps, see Google Drive: `HNYC_Project/Projects/spatial_linkage/Spatial Linkage & Interpolation: Summer 2020.ppt` and `HNYC_Project/Projects/spatial_linkage/Spatial Linkage and Interpolation Workflow.doc`
Accomplished:
- Updated confidence score to include census conflicts (see `disambiguation_2.ipynb`)
- Merged lat/lng data to matched data, producing `matched.csv`
- Experimented with methods to add spatial weights, including graph-based and cluster-based approaches on a subset of the data (see `spatial-disambiguation.ipynb`)
- Outlined overall workflow for disambiguation using a bipartite graph matching algorithm (`linkage-disambiguation.ipynb`); a minimal sketch of the idea follows this list
- Ran algorithms on the full dataset and obtained initial performance metrics
- Functionalized the process using the `disambiguation` module
- Tuned algorithms and compared metrics + recommendations
- Updated ES matching process to reflect metaphone matching

See Google Drive Documentation > disambiguation > Spatial Linkage: Spring 2020 for slides.
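
A minimal sketch of the bipartite-matching idea, using `networkx` as a stand-in (an assumption; the actual implementation lives in the `disambiguation` module and is outlined in `linkage-disambiguation.ipynb`). The IDs and confidence scores below are made up for illustration:

```python
# Maximum-weight bipartite matching over candidate CD-census pairs.
# IDs and confidence scores are illustrative, not real data.
import networkx as nx

candidates = [
    # (city directory ID, census ID, confidence score)
    ("cd_1", "cen_7", 0.82),
    ("cd_1", "cen_9", 0.55),
    ("cd_2", "cen_7", 0.40),
    ("cd_2", "cen_8", 0.91),
]

G = nx.Graph()
for cd_id, cen_id, score in candidates:
    G.add_node(cd_id, bipartite=0)
    G.add_node(cen_id, bipartite=1)
    G.add_edge(cd_id, cen_id, weight=score)

# Each CD record ends up matched to at most one census record (and vice versa),
# chosen so that the total confidence across all matches is maximised.
matching = nx.max_weight_matching(G)
print(sorted(tuple(sorted(pair)) for pair in matching))
```

Choosing matches jointly this way, rather than greedily per record, is what lets a lower-scoring pair win when it frees a higher-scoring pair elsewhere.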
Two sources:
- City Directory: name (first, last, initial), address, ID, occupation, ED, ward, block number (constructed)
- Census: address (hidden during match process), ID (different from CD), occupation, ED, ward, age, gender, dwelling, household characteristics
Steps:
0. Preprocessing: convert names to their phonetic index
1. Initial entity link (to generate possible matches): criteria = same ED + Jaro-Winkler (JW) dist < 2 on the indexed name
2. Disambiguation (to choose between non-unique matches):
   a. Generate a confidence score based on occupation (having an occupation), age > 12 (not implemented), JW dist of the name, and relative probability (number of non-unique matches)

For 1850 the setup is similar, but there is no ED or address data, only ward data in the census. A minimal sketch of steps 0-2a follows.
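
This sketch assumes the `jellyfish` library for metaphone and Jaro-Winkler (the repository's `preprocess` and `get_confidence_score` notebooks may use different implementations); the threshold and weights are illustrative, not the tuned values:

```python
# Sketch of: phonetic indexing, ED-blocked candidate generation, and a crude
# confidence score. Thresholds and weights are placeholders, not tuned values.
import jellyfish

def phonetic_index(name: str) -> str:
    """Step 0: collapse a name to its phonetic (metaphone) index."""
    return jellyfish.metaphone(name)

def name_similarity(name_a: str, name_b: str) -> float:
    """Jaro-Winkler similarity on the phonetically indexed names."""
    return jellyfish.jaro_winkler_similarity(phonetic_index(name_a), phonetic_index(name_b))

def is_candidate(cd: dict, cen: dict, threshold: float = 0.8) -> bool:
    """Step 1: keep a pair only if the ED matches and the indexed names are close."""
    return cd["ed"] == cen["ed"] and name_similarity(cd["name"], cen["name"]) > threshold

def confidence_score(cd: dict, cen: dict, n_candidates: int) -> float:
    """Step 2a: combine occupation presence, name similarity, and relative
    probability (1 / number of non-unique matches) into one score."""
    has_occupation = float(bool(cd.get("occupation")) and bool(cen.get("occupation")))
    rel_prob = 1.0 / max(n_candidates, 1)
    return 0.5 * name_similarity(cd["name"], cen["name"]) + 0.25 * has_occupation + 0.25 * rel_prob

cd_rec = {"name": "Jon Smith", "ed": 12, "occupation": "carpenter"}
cen_rec = {"name": "John Smyth", "ed": 12, "occupation": "carpenter"}
print(is_candidate(cd_rec, cen_rec), round(confidence_score(cd_rec, cen_rec, n_candidates=3), 3))
```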
`/ES_matching` contains all work related to Elasticsearch
- `readme.md`: gives guidelines on how to implement the process
- `fall_2019_analysis.md`: describes the work done up to fall 2019
- `/doc`: relevant documentation of the process
- `/src`: source code for running Elasticsearch on our data
- `explore_output.ipynb`: quick validation of the ES output (targeted at understanding whether metaphone matching was implemented in the ES process)
`/disambiguation_1850` contains all work related to disambiguation of 1850 data
- `disambiguation_1850_v1.ipynb`: runs the disambiguation process on 1850 ES output
`/disambiguation_1880` contains all work related to disambiguation of 1880 data
- `Confidence_Score_Tuning.ipynb`: documents the confidence score tuning process
- `Confidence_Score_Tuning_v02.ipynb`: confidence score tuning results used for the 1850 v02 disambiguation run (10/2020); uses the old version of the 1880 data because of data issues with the new 1880 data
- `Confidence_Score_Tuning_new_1880_data_draft.ipynb`: attempt to tune the confidence score with the new 1880 data; revealed that there was an issue with the data
- `/_archived`: archived scripts
- `/confidence_score`
  - `preprocess.ipynb`: preprocessing of data, including generation of metaphones
  - `get_confidence_score.ipynb`: outlines the process for generating the confidence score, including calculation of Jaro-Winkler distance
  - `disambiguation_analysis.ipynb`: EDA on confidence scores; generates `match_results_confidence_score.csv` and `fall_2019_disambiguation_report.md`
`/linkage_eda`
- `spatial-disambiguation.ipynb`: documentation of different spatial weight algorithms (a minimal haversine sketch follows this list)
- `linkage-disambiguation.ipynb`: outlines the record linkage approach (conceptually valid, but the code is outdated)
- `linkage_full_run_v1.ipynb`: applies the basic algorithm to the whole dataset + initial performance analysis
- `linkage_eda.ipynb`: applies various iterations of the algorithm to the whole dataset + conclusions
- `linkage_eda_v2.ipynb`: applies updated geocodes on the best 2 algorithms + improved benchmarking
- `run_link_records.ipynb`: implements record matching using PySpark
- `confidence_score_latlng.ipynb`: adds census conflicts to the confidence score and merges lat/lng data (contains the most up-to-date confidence score formula)
- `linkage_full_run_SPRING_LATEST.ipynb`: informed by the linkage EDA (see archive), generates the latest disambiguated output from ES matching (with the metaphone issue fixed)
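
As a rough illustration of the kind of spatial weighting explored in `spatial-disambiguation.ipynb`, a candidate pair can be penalised by the haversine distance between its CD and census geocodes. This is a generic sketch, not the formula used in the notebooks:

```python
# Haversine distance between two lat/lng points, and a simple distance-based
# down-weighting of a match's confidence. Illustrative only.
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometres."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def spatial_weight(confidence: float, lat1, lng1, lat2, lng2) -> float:
    """Example: shrink confidence as the two geocodes drift apart."""
    return confidence / (1.0 + haversine_km(lat1, lng1, lat2, lng2))

# Two points in lower Manhattan, roughly 2.5 km apart.
print(round(haversine_km(40.7128, -74.0060, 40.7306, -73.9866), 2))
```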
`/interpolation_notebooks`
- `Process_Documentation`
  - `Disambiguation_Analysis_v01.ipynb`: resolves dwelling conflicts, calculates statistics, explores distance-based sequences and interpolation between known dwellings
  - `Interpolation_v01.ipynb`: runs through the current version of predicting unknown records
  - `Disambiguation_Analysis_v02.ipynb`: same information as v01, for new data (10/2020)
  - `Interpolation_v02.ipynb`: same information as v01, for new data (10/2020)
- `Concepts_and_Development`
  - `Block and Centroid Prediction with Analysis.ipynb`: walks through approaches to predicting block numbers directly, and then clusters (tests different clustering algorithms)
  - `Block Centroids and What They Represent, 1850.ipynb`: creates block centroids and illustrates them with visualizations
  - `Dwelling Addresses Fill In and Conflict Resolution Development.ipynb`: development of conflict resolution within the dwelling process
  - `Developing Distance Based Sequences.ipynb`: process of developing distance-based sequences
  - `Model Comparison.ipynb`: tests a few different model options (no in-depth tuning)
  - `Sequences Exploration.ipynb`: tests different iterations of sequence identification
  - `Model Exploration.ipynb`: brief experimentation with neural networks; incomplete because of the preprocessing required
- `Archived`
  - `1880_1850_for_Interpolation.ipynb`: explores the 1880 and 1850 census datasets
  - `Feature_Exploration.ipynb`: explores some of the columns in the 1880 and 1850 datasets to determine what they represent and whether they can be used for modelling
  - `Interpolation Pilots.ipynb`: working notebook for starting explorations of options for interpolation (often moved into a separate notebook when they seem worth looking at in more depth)
  - `Linear_Model.ipynb`: creates and tests linear models for house number interpolation
  - `Modeling Comparison.ipynb`: tests different modeling approaches for house numbers (currently linear model and gradient boosting); includes haversine sequences and block numbers as features
  - `Block_Numbers Early Exploration.ipynb`: explores block number distributions/data analysis and tries using them as a feature to predict house number
  - `Street_Dictionaries.ipynb`: tried out looking at street dictionaries for dwellings in between
  - `Block Number Prediction.ipynb`: initial experiment with predicting block numbers
  - `Interpolation between known address development.ipynb`: process of looking at values between known dwellings (a minimal sketch of this idea follows the listing)
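
The "interpolation between known dwellings" idea reduces, in its simplest form, to filling house numbers linearly along a dwelling sequence. This is a toy sketch with made-up values; the notebooks above develop distance-based sequences and model-based approaches instead:

```python
# Toy example: fill unknown house numbers between known dwellings along one
# enumeration sequence. Values are made up; the real work uses distance-based
# sequences and fitted models rather than index-based interpolation.
import pandas as pd

seq = pd.DataFrame({
    "dwelling": [1, 2, 3, 4],
    "house_number": [10.0, None, None, 16.0],  # None = not known from the match
})

# Linear interpolation between the known dwellings fills in 12 and 14.
seq["house_number_interp"] = seq["house_number"].interpolate(method="linear")
print(seq)
```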
`/interpolation`
See the readme within this folder for details.

`/disambiguation` is a Python module containing wrapper functions needed in the disambiguation process
- `__init__.py`: contains a `Disambiguator` object that, when instantiated, can be used to run the entire disambiguation process, calling functions from the files below (see `linkage_eda.ipynb` for an example of usage; a hypothetical usage sketch also follows this list)
- `preprocess.py`: contains functions needed before applying the disambiguation algorithms, including confidence score generation
- `disambiguation.py`: contains functions needed for disambiguation
- `analysis.py`: wrapper functions to produce performance metrics
- `confidence_score_tuning.py`: contains functions needed for the confidence tuning process
- `benchmarking.py`: contains `Benchmark` objects, to run the benchmarking process in confidence tuning for 1880
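
A hypothetical usage sketch of the module. The constructor arguments and method names below are guesses for illustration only; `linkage_eda.ipynb` shows the real usage:

```python
# Hypothetical usage of the disambiguation module: method names and arguments
# are illustrative guesses, not the module's documented API.
import pandas as pd
from disambiguation import Disambiguator  # assumes the module is importable

matches = pd.read_csv("matches.csv")  # ES matches with confidence scores

d = Disambiguator(matches)   # constructor signature is an assumption
results = d.run()            # hypothetical: runs the full disambiguation process
# analysis.py-style wrappers would then compute performance metrics on `results`.
```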
`/matching_viz`
Visualization web app to understand the disambiguation output; see the readme in the folder for guidance on how to run it.
Data is available in the HNYC Spatial Linkage Google Drive: `HNYC_Project/Projects/spatial_linkage/Data`
- 1850 disambiguated output: `1850_disambiguated.csv`
- 1850 disambiguated output, 10/2020 (current): `1850_mn_match_v02.csv`
- 1850 ES matches: `es-1850-22-9-2020.csv`
- Matches with confidence score (raw input for 1880 disambiguation processes): `matches.csv` (based off Fall 2019's Spark matching output)
- Latest ES matches: `es-1880-21-5-2020.csv`
- Latest disambiguated output: `disambiguated-21-5-2020.csv`