Merge pull request #7 from peterskim12/master

Add campaign finance demo
elastic · Feb 20, 2015 · 3129d4d
2 parents 0e10d85 + 78d5a29
commit 3129d4d

Showing 481 changed files with 115,151 additions and 0 deletions.
diff --git a/usfec/README.md b/usfec/README.md
@@ -0,0 +1,90 @@
+US FEC Campaign Contributions Demo: 2013-2014 US Election cycle
+=====
+
+For background on this demo, see the blog post:
+[Kibana 4 for investigating PACs, Super PACs, and who your neighbor might be voting for](http://www.elasticsearch.org/blog/kibana-4-for-investigating-pacs-super-pacs-and-your-neighbors/)
+
+# Installation
+
+This demo consists of the following:
+
+* Instructions for restoring index snapshot with pre-indexed campaign contributions data
+* Python script for joining normalized files and outputting JSON
+* Elasticsearch index template
+* Logstash config
+
+
+## Restoring index snapshot
+
+After downloading and installing the ELK stack, download the index snapshot file for the campaign contributions data from the link below. (FYI: it’s a 1.4GB file; we take no responsibility for this download eating up your monthly mobile tethering quota.)
+
+http://download.elasticsearch.org/demos/usfec/snapshot_demo_usfec.tar.gz 
+
+Create a folder somewhere on your local drive called “snapshots” and uncompress the .tar.gz file into that directory. For example:
+```
+# Create snapshots directory
+mkdir -p ~/elk/snapshots
+# Copy snapshot download to your new snapshots directory
+cp ~/Downloads/snapshot_demo_usfec.tar.gz ~/elk/snapshots
+# Go to snapshots directory
+cd ~/elk/snapshots
+# Uncompress snapshot file
+tar xf snapshot_demo_usfec.tar.gz
+```
+Once you have Elasticsearch running, restoring the index is a two-step process:
+
+1) Register a file system repository for the snapshot (change the value of the “location” parameter below to the location of your usfec snapshot directory):
+```
+curl -XPUT 'http://localhost:9200/_snapshot/usfec' -d '{
+    "type": "fs",
+    "settings": {
+        "location": "/tmp/snapshots/usfec",
+        "compress": true,
+        "max_snapshot_bytes_per_sec": "1000mb",
+        "max_restore_bytes_per_sec": "1000mb"
+    }
+}'
+```
+2) Call the Restore API endpoint to start restoring the index data into your Elasticsearch instance:
+```
+curl -XPOST "localhost:9200/_snapshot/usfec/1/_restore"
+```
+At this point, go make yourself a [coffee](https://bluebottlecoffee.com/preparation-guides). When your delicious cup of single-origin, direct trade coffee has finished brewing, you can check to see if the restore operation is complete by calling the cat recovery API:
+```
+curl -XGET 'localhost:9200/_cat/recovery?v'
+```
+Or get a count of the documents in the expected indexes:
+```
+curl -XGET localhost:9200/usfec*/_count -d '{
+	"query": {
+		"match_all": {}
+	}
+}'
+```
+which should return a count of approximately 4250251.
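
Review note: the count check above can also be scripted. The following is a minimal Python 3 sketch, assuming Elasticsearch is reachable at `localhost:9200` as in the curl examples. The helper names and the 1% tolerance are illustrative assumptions and are not part of the demo.

```python
import json
import urllib.request

EXPECTED_DOCS = 4250251  # approximate count quoted in the README


def parse_count(body):
    """Extract the document count from a _count API JSON response body."""
    return int(json.loads(body)["count"])


def restore_looks_complete(count, expected=EXPECTED_DOCS, tolerance=0.01):
    """Return True when count is within `tolerance` (a fraction) of `expected`.

    The 1% default is an arbitrary assumption, since the README only
    promises an approximate figure.
    """
    return abs(count - expected) <= expected * tolerance


def fetch_count(base_url="http://localhost:9200", index_pattern="usfec*"):
    """Call the _count API; host and index pattern mirror the curl example."""
    url = f"{base_url}/{index_pattern}/_count"
    with urllib.request.urlopen(url) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

With a node running, `restore_looks_complete(fetch_count())` returns `False` while shards are still recovering and `True` once the count is close to the expected figure.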
+
+## Python script
+
+The raw FEC data is provided as 7 separate files. To do useful querying of the data in a search engine / NoSQL store like Elasticsearch, you typically have to go through a data modeling process to identify how to join data from the various tables.
+
+The Python script (in scripts/process_camfin.py) takes care of some of the obvious ways to join the various data files and produces four .json files which can then be loaded into Elasticsearch using Logstash. The script requires Python 3.
+
+You don't need to run the Python script, but it's here in case you want to modify how the data is joined, perform additional data cleansing/enrichment, re-process the latest raw data set from the FEC, etc.
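
Review note: for readers unfamiliar with this kind of denormalizing join, here is a minimal sketch of the idea in Python 3. It is not taken from `scripts/process_camfin.py`. The pipe delimiter, the presence of a header row, and the field names `CMTE_ID` and `CMTE_NM` are assumptions made for illustration, and the real script may join the files differently.

```python
import csv
import json


def load_committees(lines, id_field="CMTE_ID"):
    """Index committee rows by committee ID (pipe-delimited, header row assumed)."""
    reader = csv.DictReader(lines, delimiter="|")
    return {row[id_field]: row for row in reader}


def join_contributions(lines, committees, id_field="CMTE_ID"):
    """Yield one denormalized dict per contribution, with its committee nested in."""
    for row in csv.DictReader(lines, delimiter="|"):
        doc = dict(row)
        # Contributions with an unknown committee ID are kept, with an
        # empty committee object, rather than dropped.
        doc["committee"] = committees.get(row[id_field], {})
        yield doc


def to_json_lines(docs):
    """Serialize documents as newline-delimited JSON, one document per line."""
    return "\n".join(json.dumps(d, sort_keys=True) for d in docs)
```

Nesting the committee inside each contribution duplicates data, but it lets every document be queried on its own, since Elasticsearch does not join across documents at query time the way a relational database joins tables.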
+
+## Elasticsearch index template config
+
+The Elasticsearch mapping configuration is defined in the index template file: index\_template.json. Documentation:
+
+* [Mapping documentation](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html)
+* [Index templates documentation](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html)
+
+## Logstash config
+
+The Logstash configuration is defined in the file: logstash.conf. Documentation for Logstash plugins: [http://www.elasticsearch.org/guide/en/logstash/current/index.html](http://www.elasticsearch.org/guide/en/logstash/current/index.html).
+
+## Miscellaneous
+
+There are a few other files in this directory which probably deserve explanation:
+
+* data/US.txt, data/zip_codes.csv: Two zip code to lat/long mapping files which the Python script uses to enrich zip codes in the raw data with a lat/long that Elasticsearch can use for geo queries. If you run the Python script, make sure both files are in the current working directory at the time of execution.
+* Vagrant/Puppet files: The first demo released in this repo, the NYC traffic accidents demo, included these Vagrant/Puppet files to programmatically create a virtual machine that installs the ELK stack and restores the index snapshot with a simple 'vagrant up' command. You are still free to use these files, but we don't recommend them for this demo: the index snapshot is large, which can cause problems if your internet connection is slow or your laptop lacks the resources to run a larger VM.
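
Review note: the zip code enrichment described for `data/US.txt` and `data/zip_codes.csv` can be pictured with the sketch below. The column layout of those files is not shown in this diff, so the `zip`, `lat`, and `lon` column names, the `zip_code` and `location` field names, and both helper functions are hypothetical. The `{"lat": ..., "lon": ...}` object is one of the forms Elasticsearch accepts for a field mapped as `geo_point`.

```python
import csv


def load_zip_lookup(lines):
    """Build {zip: (lat, lon)} from CSV lines with assumed columns zip,lat,lon."""
    lookup = {}
    for row in csv.DictReader(lines):
        lookup[row["zip"].strip()] = (float(row["lat"]), float(row["lon"]))
    return lookup


def add_location(record, lookup, zip_field="zip_code"):
    """Return a copy of record with a 'location' geo point when the ZIP is known."""
    enriched = dict(record)
    # Use the first five characters so ZIP+4 values such as "941071234" still match.
    zip5 = str(record.get(zip_field, ""))[:5]
    if zip5 in lookup:
        lat, lon = lookup[zip5]
        enriched["location"] = {"lat": lat, "lon": lon}
    return enriched
```

Records whose zip code is missing or not in the lookup are returned without a `location` field, so they are still indexed but simply do not appear in geo queries.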
diff --git a/usfec/Vagrantfile b/usfec/Vagrantfile
@@ -0,0 +1,139 @@
+# -*- mode: ruby -*-
+# vi: set ft=ruby :
+
+# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
+VAGRANTFILE_API_VERSION = "2"
+
+Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
+  # All Vagrant configuration is done here. The most common configuration
+  # options are documented and commented below. For a complete reference,
+  # please see the online documentation at vagrantup.com.
+
+  # Every Vagrant virtual environment requires a box to build off of.
+  config.vm.box = "puppetlabs/ubuntu-14.04-64-puppet"
+
+  config.vm.provision :shell, path: "bootstrap.sh"
+  config.vm.provision :puppet do |puppet|
+    puppet.manifests_path = "puppet/manifests"
+    puppet.module_path = "puppet/modules"
+    # puppet.options = "--verbose --debug"
+  end
+
+  config.vm.network "forwarded_port", guest: 9200, host: 9200
+  config.vm.network "forwarded_port", guest: 5601, host: 5601
+
+  config.vm.provider "virtualbox" do |vb|
+    # Use VBoxManage to customize the VM. For example to change memory:
+    vb.customize ["modifyvm", :id, "--memory", "2048"]
+    vb.customize ["modifyvm", :id, "--ioapic", "on"]
+    vb.cpus = 2
+  end
+
+  # Disable automatic box update checking. If you disable this, then
+  # boxes will only be checked for updates when the user runs
+  # `vagrant box outdated`. This is not recommended.
+  # config.vm.box_check_update = false
+
+  # Create a forwarded port mapping which allows access to a specific port
+  # within the machine from a port on the host machine. In the example below,
+  # accessing "localhost:8080" will access port 80 on the guest machine.
+  # config.vm.network "forwarded_port", guest: 80, host: 8080
+
+  # Create a private network, which allows host-only access to the machine
+  # using a specific IP.
+  # config.vm.network "private_network", ip: "192.168.33.10"
+
+  # Create a public network, which is generally matched to a bridged network.
+  # Bridged networks make the machine appear as another physical device on
+  # your network.
+  # config.vm.network "public_network"
+
+  # If true, then any SSH connections made will enable agent forwarding.
+  # Default value: false
+  # config.ssh.forward_agent = true
+
+  # Share an additional folder to the guest VM. The first argument is
+  # the path on the host to the actual folder. The second argument is
+  # the path on the guest to mount the folder. And the optional third
+  # argument is a set of non-required options.
+  # config.vm.synced_folder "../data", "/vagrant_data"
+
+  # Provider-specific configuration so you can fine-tune various
+  # backing providers for Vagrant. These expose provider-specific options.
+  # Example for VirtualBox:
+  #
+  # config.vm.provider "virtualbox" do |vb|
+  #   # Don't boot with headless mode
+  #   vb.gui = true
+  #
+  #   # Use VBoxManage to customize the VM. For example to change memory:
+  #   vb.customize ["modifyvm", :id, "--memory", "1024"]
+  # end
+  #
+  # View the documentation for the provider you're using for more
+  # information on available options.
+
+  # Enable provisioning with CFEngine. CFEngine Community packages are
+  # automatically installed. For example, configure the host as a
+  # policy server and optionally a policy file to run:
+  #
+  # config.vm.provision "cfengine" do |cf|
+  #   cf.am_policy_hub = true
+  #   # cf.run_file = "motd.cf"
+  # end
+  #
+  # You can also configure and bootstrap a client to an existing
+  # policy server:
+  #
+  # config.vm.provision "cfengine" do |cf|
+  #   cf.policy_server_address = "10.0.2.15"
+  # end
+
+  # Enable provisioning with Puppet stand alone.  Puppet manifests
+  # are contained in a directory path relative to this Vagrantfile.
+  # You will need to create the manifests directory and a manifest in
+  # the file default.pp in the manifests_path directory.
+  #
+  # config.vm.provision "puppet" do |puppet|
+  #   puppet.manifests_path = "manifests"
+  #   puppet.manifest_file  = "site.pp"
+  # end
+
+  # Enable provisioning with chef solo, specifying a cookbooks path, roles
+  # path, and data_bags path (all relative to this Vagrantfile), and adding
+  # some recipes and/or roles.
+  #
+  # config.vm.provision "chef_solo" do |chef|
+  #   chef.cookbooks_path = "../my-recipes/cookbooks"
+  #   chef.roles_path = "../my-recipes/roles"
+  #   chef.data_bags_path = "../my-recipes/data_bags"
+  #   chef.add_recipe "mysql"
+  #   chef.add_role "web"
+  #
+  #   # You may also specify custom JSON attributes:
+  #   chef.json = { :mysql_password => "foo" }
+  # end
+
+  # Enable provisioning with chef server, specifying the chef server URL,
+  # and the path to the validation key (relative to this Vagrantfile).
+  #
+  # The Opscode Platform uses HTTPS. Substitute your organization for
+  # ORGNAME in the URL and validation key.
+  #
+  # If you have your own Chef Server, use the appropriate URL, which may be
+  # HTTP instead of HTTPS depending on your configuration. Also change the
+  # validation key to validation.pem.
+  #
+  # config.vm.provision "chef_client" do |chef|
+  #   chef.chef_server_url = "https://api.opscode.com/organizations/ORGNAME"
+  #   chef.validation_key_path = "ORGNAME-validator.pem"
+  # end
+  #
+  # If you're using the Opscode platform, your validator client is
+  # ORGNAME-validator, replacing ORGNAME with your organization name.
+  #
+  # If you have your own Chef Server, the default validation client name is
+  # chef-validator, unless you changed the configuration.
+  #
+  #   chef.validation_client_name = "ORGNAME-validator"
+end
diff --git a/usfec/bootstrap.sh b/usfec/bootstrap.sh
@@ -0,0 +1,3 @@
+#!/usr/bin/env bash
+
+apt-get update