Searches for IP addresses inside unformatted text
-
The goal of the project is to take unstructured text that contains IPv4 addresses, lex it into a data structure for later use, then filter the resulting list and perform GeoIP queries against the IP addresses obtained.
-
This project is meant to be a fun and challenging example. It does not yet support the end goal, but it is close: the filtering isn't complete, while everything else seems to be in good working order. See the open issues for the missing features.
-
The GeoLite2 CSV download and data population are kept outside the library to decouple data bootstrapping. This is by design, so that the library stays small and flexible.
-
The lexing routine for finding IP addresses uses a real lexer very similar to flex or lex. This is 100% tested and works great.
-
The data population routine is handled by Python invoke, a Makefile-like library for Python task execution. You'll use invoke <cmd> to bootstrap the project and to populate the SQLite3 DB.
-
An attempt was made to use an ORM, SQLAlchemy, to transform the GeoLite2 CSV data into a small DB that has no external dependencies. The dataset is a little more than 4 million records, so it takes a significant amount of time to populate the DB. WARNING: You may experience OOM exceptions if your dataset is too large. The DB and models have some test coverage to ensure everything works.
-
There is automated, generated API documentation using Sphinx.
-
This runs on python 3.6 and python 3.7
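To illustrate the lexing idea described above, here is a minimal sketch of a regex-driven tokenizer that splits text into tokens and keeps the IPv4 ones. It is not the project's actual lexer: the names Token, lex and find_ipv4, the token kinds, and the rule that an address must not be glued to further digits or dotted groups are all assumptions made for this example.

```python
import re
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str    # "IPV4", "NUMBER", "WORD" or "OTHER"
    value: str   # matched text
    offset: int  # position of the match in the input


# One octet in the range 0-255, without leading zeros.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Alternatives are tried in order, so IPV4 wins over NUMBER where both apply.
# The look-arounds reject candidates embedded in longer dotted numbers
# such as "1.2.3.4.5" (hypothetical rule chosen for this sketch).
_TOKEN_RE = re.compile(
    r"(?P<IPV4>(?<!\d)(?<!\d\.)" + _OCTET + r"(?:\." + _OCTET + r"){3}(?!\d|\.\d))"
    r"|(?P<NUMBER>\d+)"
    r"|(?P<WORD>[A-Za-z_]+)"
    r"|(?P<SPACE>\s+)"
    r"|(?P<OTHER>.)"
)


def lex(text: str) -> Iterator[Token]:
    """Yield every non-whitespace token found in text, in order."""
    for match in _TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "SPACE":
            continue  # whitespace only separates tokens
        yield Token(kind, match.group(), match.start())


def find_ipv4(text: str) -> List[str]:
    """Return the IPv4 addresses found in text, in order of appearance."""
    return [tok.value for tok in lex(text) if tok.kind == "IPV4"]


sample = "Failed login from 192.168.1.10, port 22; then 10.0.0.256 and 8.8.8.8."
addresses = find_ipv4(sample)
```

With this sample, addresses holds 192.168.1.10 and 8.8.8.8, while 10.0.0.256 is skipped because 256 is not a valid octet.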
- Ensure that you have cloned the repository
- Check out the code repository and initialize your Python dev environment
cd ipcrawl
./init.sh
- Once the project has been initialized we just need to source it
. ./init.sh
-
NOTE: You only have to do this once and all further commands assume that you have executed this step.
-
We first want to download the Geolite2 free CSV files.
- NOTE: You can customize which files are present by modifying config.json. Do this before you invoke the following command.
- NOTE: This step isn't necessary unless you want to download a fresh version; init.sh will do this for you if the CSV files don't already exist.
inv download-geolite-dbs
- After the CSV files have been downloaded, we will attempt to populate the database.
- NOTE: This may take a long time depending on your machine. This is a synchronous operation and is not threaded due to time constraints. There are a little over 4 million records to populate into your database. Benchmarks on a 4.2 GHz Intel Core i7 with 64g RAM and an SSD show this took 8-12 minutes on average. We only need to do this when there is new data to populate, which depends on the GeoLite2 release cycle for CSVs. Consult their documentation for further details.
- NOTE: This step isn't necessary when the sqlite3 db does not exist.
inv populate-sqlite3
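Since population involves millions of rows and can run out of memory, here is a rough sketch of streaming a CSV into SQLite in fixed-size batches. It uses the standard library sqlite3 and csv modules rather than the project's SQLAlchemy models, and the function name populate_blocks, the blocks table, and the choice of the network and geoname_id columns are assumptions made for illustration only.

```python
import csv
import io
import sqlite3
from itertools import islice


def populate_blocks(conn, csv_file, batch_size=10000):
    """Stream CSV rows into SQLite in batches and return the row count.

    Only batch_size rows are held in memory at a time, instead of the
    whole dataset. Table and column names are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks ("
        "network TEXT PRIMARY KEY, geoname_id TEXT)"
    )
    reader = csv.DictReader(csv_file)
    # Lazy generator: rows are read from the file only as they are consumed.
    rows = ((row["network"], row["geoname_id"]) for row in reader)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        with conn:  # one transaction, committed per batch
            conn.executemany(
                "INSERT OR REPLACE INTO blocks VALUES (?, ?)", batch
            )
        total += len(batch)
    return total


# Usage with an in-memory database and a tiny inline CSV.
sample_csv = io.StringIO(
    "network,geoname_id\n"
    "1.0.0.0/24,2077456\n"
    "1.0.1.0/24,1814991\n"
    "1.0.2.0/23,1814991\n"
)
conn = sqlite3.connect(":memory:")
inserted = populate_blocks(conn, sample_csv, batch_size=2)
```

Committing once per batch keeps memory use bounded; a larger batch_size generally means fewer transactions at the cost of more rows held in memory.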
$ inv -l
Available tasks:
  bandit                        Runs bandit security linter
  benchmark-raw-csv             Perform timeit calculations on reading CSVs as raw file or into a dict
  build-sdist                   Builds the package
  clean                         Cleans all compiled artifacts recursively
  coverage                      Run code coverage
  docs-html                     Builds the sphinx documentation
  download-geolite-asn-db       Downloads the geolite2 asn db
  download-geolite-city-db      Downloads the geolite2 city db
  download-geolite-country-db   Downloads the geolite2 country db
  download-geolite-dbs          Metajob to run all other download_geolite_*_db tasks
  populate-sqlite3              Populate SQLite3 db with geolite2 CSV data
  prep-commit                   Preps the commit, runs [bandit, docs-html, coverage]
  prep-packaging                Preps the current state of this project for use with packaging as a tarball
  tests                         Runs all or specific tests
inv docs-html
- Then open the following URL in your browser:
file:///path/to/this/repo/docs/build/html/index.html
- The following will download the geolite data CSV files
inv download-geolite-asn-db
inv download-geolite-city-db
inv download-geolite-country-db
inv download-geolite-dbs
inv populate-sqlite3
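Once the data is populated, a GeoIP query boils down to finding the network block that contains an address. The sketch below shows that idea with the standard library ipaddress module and a plain linear scan. It is not the project's query code: the lookup function, the sample blocks, and the country labels are made up for this example, and a scan like this would be too slow for the full dataset.

```python
import ipaddress


def lookup(ip, blocks):
    """Return the value of the most specific network containing ip, or None.

    blocks is an iterable of (cidr_string, value) pairs.
    """
    addr = ipaddress.ip_address(ip)
    best = None
    for cidr, value in blocks:
        net = ipaddress.ip_network(cidr)
        # Prefer the longest prefix when several networks contain the address.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return best[1] if best else None


# Hypothetical sample data, not taken from the GeoLite2 files.
sample_blocks = [
    ("1.0.0.0/24", "AU"),
    ("8.0.0.0/8", "XX"),
    ("8.8.8.0/24", "US"),
]
result = lookup("8.8.8.8", sample_blocks)
```

Here result is "US" because the /24 block is more specific than the /8 block that also contains the address.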
- This project is currently tested only on the newest versions of Python, which at the time of writing are python 3.6 and python 3.7
- This runs all the tests
tox
- List the available tox environments
tox -l
py36
py37
bandit
coverage
docs-html
flake8
- Example of running only flake8
tox -e flake8
- Cody Lane 2019