tanel3203/blacklistNameMatcher

Blacklist Name Matcher

Description:

  • If using a file: imports a file containing a list of blacklisted names and cleans it against a noise file of irrelevant names and words
  • If using the EUROPA sanctions list: imports the XML from the EUROPA sanctions website and processes it into a list
  • Then runs a strict search for the query name
  • If no results are found, runs a modified (partial) search
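
The strict-then-partial flow described above could look roughly like the sketch below. It is not taken from terrorist_finder.py: the function names are hypothetical, and the certainty formula (share of query words found in a listed name) is an assumption inferred from the example output, where "Robert Mugabe" scores 100% against "Robert Gabriel Mugabe" and 50% against "Grace Mugabe".

```python
def tokenize(name):
    """Lowercase a name and split it into whitespace-separated words."""
    return name.lower().split()


def strict_matches(query, names):
    """Return names whose words equal the query's words exactly."""
    wanted = tokenize(query)
    return [n for n in names if tokenize(n) == wanted]


def partial_matches(query, names):
    """Return (certainty, name) pairs for names sharing a word with the query.

    Assumption: certainty is the percentage of query words found in the name.
    """
    wanted = tokenize(query)
    results = []
    for name in names:
        words = set(tokenize(name))
        hits = sum(1 for w in wanted if w in words)
        if hits:
            results.append((100.0 * hits / len(wanted), name))
    return results


def find(query, names):
    """Try a strict match first; fall back to partial matching otherwise."""
    exact = strict_matches(query, names)
    if exact:
        return [(100.0, n) for n in exact]
    return partial_matches(query, names)
```

For example, `find("Robert Mugabe", names)` finds no strict match in a list containing "Robert Gabriel Mugabe" and falls through to the partial search. Results come back in list order, not sorted by certainty.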

Contents

Components:

  • scraper.py (1 method) - scrape EUROPA XML data and process to list type
  • importer.py (1 method) - import file
  • processor.py (1 method) - process file to list
  • cleaner.py (1 method) - clean file against a noisefile with irrelevant words
  • terrorist_finder.py (3 methods) - find matches in file against query
  • main.py - command line program

Data:

  • blacklist.tsv, blacklist.txt, ...
  • noisefile.tsv, noisefile.txt, ...

Example command line process

tanel@tanel:~/Documents/pyScript$ python main.py
Please enter name to search in the terrorist list 
> Robert Mugabe
Do you want to import a file or use the EUROPA database? (Input 'file', otherwise EUROPA is used) 
> europa
We will use the default sanctions list on ec.europa.eu and 8110 records (as at 27.12.2016)

Give it a few seconds...

------------------------------------------------------------------------------
IMPORTANT!
If this text is followed by an error
it is most likely you requested # names that's more than there are in the list
The list has  8110  existing names
------------------------------------------------------------------------------
No strict match, looking for partial matches...
TERRORIST MATCHED!
Certainty:  100.0 %
Name:  Robert Gabriel Mugabe

TERRORIST MATCHED!
Certainty:  100.0 %
Name:  Robert Gabriel Mugabe

TERRORIST MATCHED!
Certainty:  50.0 %
Name:  Grace Mugabe

TERRORIST MATCHED!
Certainty:  50.0 %
Name:  Grace Mugabe

TERRORIST MATCHED!
Certainty:  50.0 %
Name:  Robert Konars

Program logic

[Figure: program logic diagram]

Program setup:

[Figure: program setup diagram]

Details

  • The input file has one name per row, as plain text (no XML, JSON, or other formatting)
  • Common file types: txt, csv, tsv
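
A minimal sketch of reading such a one-name-per-row file and cleaning it against a noise file is shown below. It does not reproduce importer.py, processor.py, or cleaner.py: the function names are hypothetical, UTF-8 encoding is an assumption, and so is the cleaning rule (remove noise words from each name, then drop names that end up empty).

```python
import io


def read_lines(path):
    """Read a text file and return its non-empty lines, stripped.

    Assumption: the file is UTF-8 encoded with one entry per row.
    """
    with io.open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def clean_names(names, noise_words):
    """Remove noise words from each name; drop names that become empty."""
    noise = set(word.lower() for word in noise_words)
    cleaned = []
    for name in names:
        kept = [w for w in name.split() if w.lower() not in noise]
        if kept:
            cleaned.append(" ".join(kept))
    return cleaned
```

Typical use would be `clean_names(read_lines("blacklist.txt"), read_lines("noisefile.txt"))`, using the data file names listed under Data.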

Tech used:

  • Python 2.7
  • Lubuntu 16.04

Current issues:

  • Every time a name is queried, the data is reimported and reprocessed
    • In practice, a cron job would refresh the data at a fixed interval
  • Partial matches are not ordered
  • Raw user input is not cleaned
  • Does not handle input from non-Latin keyboards (e.g. Cyrillic)
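
Two of these issues, uncleaned input and unordered partial matches, could be addressed along the lines of the sketch below. It is only an illustration written for Python 3 (the project itself targets Python 2.7): the function names are hypothetical, and matches are assumed to be (certainty, name) tuples. Note that it normalizes Unicode input but does not transliterate Cyrillic into Latin script.

```python
import unicodedata


def clean_query(raw):
    """Normalize raw user input: NFKC form, whitespace collapsed, lowercased."""
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.split()).lower()


def order_matches(matches):
    """Sort (certainty, name) pairs by descending certainty, then by name.

    Duplicate pairs are dropped before sorting.
    """
    unique = set(matches)
    return sorted(unique, key=lambda pair: (-pair[0], pair[1]))
```

Sorting on the negated certainty keeps the strongest matches at the top, and the name is used as a tie-breaker so the order is stable between runs.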

Working, but not yet integrated:

  • (levenshteinDistance.py): fuzzy search using Levenshtein distance, for use when strict and partial matching return nothing
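
For reference, a fuzzy fallback based on Levenshtein distance could be sketched as follows. This is a standard dynamic-programming formulation, not the contents of levenshteinDistance.py; the `fuzzy_matches` helper and its `max_distance` threshold are hypothetical.

```python
def levenshtein(a, b):
    """Return the edit distance between a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, names, max_distance=2):
    """Return (distance, name) pairs within max_distance, closest first."""
    q = query.lower()
    scored = [(levenshtein(q, name.lower()), name) for name in names]
    return sorted(pair for pair in scored if pair[0] <= max_distance)
```

A misspelled query such as "Robert Mugabi" would then still reach "Robert Mugabe", which is one substitution away.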