DataWave Quickstart

Purpose

This quickstart is intended to speed up your development workflow by automating the steps that you'd otherwise need to perform manually to configure, build, deploy, and test a standalone DataWave environment
It includes setup and tear-down automation for DataWave, Hadoop, Accumulo, ZooKeeper, etc, and provides convenience methods for testing DataWave's ingest and query components
For simplicity, all cluster services are downloaded/installed automatically, they are extracted and installed under the quickstart home directory, and all are owned/executed by the current user
Docker alternative: rather than installing and running services locally, you may prefer to run the entire quickstart environment as a Docker container. This will provide the exact same capabilities described below, but hosted on a CentOS 7 base image.

Prerequisites

None, other than Linux, Bash, and an Internet connection to wget tarballs
JDK 1.8 and Maven 3.x are required. However, if it is determined that they are not already installed/configured in your environment, they'll be downloaded and installed automatically under datawave-quickstart/bin/services/java and datawave-quickstart/bin/services/maven respectively
To date, this quickstart has been tested on CentOS 6x, 7x and Ubuntu 16x. Any Linux with at least Bash 4x will probably suffice

Recommended (minimum) system configuration

RAM: 8GB, CPU: 1 x quad-core
Storage: To install everything, you'll need at least a few GB free
ulimit -u (max user processes): 32768
ulimit -n (open files): 32768
vm.swappiness: 0

Get started

The quickstart is comprised primarily of ../{serviceName}/boostrap.sh scripts. These define supporting bash variables and bash functions, encapsulating both configuration and key functionality for each service in a consistent manner.

All boostrap scripts are sourced into your environment via bin/env.sh, which itself should be sourced in your ~/.bashrc as described below.

Note that the four bash commands shown in Step 1 below will complete the entire setup process, but first you may want to read through subsequent sections to better understand how the setup works and how to customize it for your own preferences

Update your ~/.bashrc as indicated below and then run the commands that follow to bootstrap and install all services...
```
$ echo "source /path/to/datawave-quickstart/bin/env.sh" >> ~/.bashrc  # Step 1
$ source ~/.bashrc                                                    # Step 2
$ allInstall                                                          # Step 3
$ datawaveWebStart && datawaveWebTest                                 # Step 4
# Setup is now complete
```
If you happen to update ~/.bashrc via some other method, such as with vi or other, beware that the source /path/to/datawave-quickstart/bin/env.sh line should be placed after any existing Hadoop-, Accumulo-, Wildfly-, or ZooKeeper-related variables you may have defined, in order to prevent those existing configs from interfering with the quickstart install. Making it the last line in the file will mitigate most conflicts
After executing source ~/.bashrc as shown above, tarballs for registered services (Hadoop, Accumulo, ZooKeeper, Wildfly, etc) will begin downloading automatically, and the DataWave build will be started automatically in turn. This can take a while to complete, so this is a good time for a coffee break
- It's also easy to affect which version of a binary gets downloaded/installed for a given service. Just override the default DW_**_DIST_URI value defined in the service's bootstrap.sh before performing this step
  
  For example...
```
$ vi ~/.bashrc

    ...

    export DW_ACCUMULO_DIST_URI=http://my.favorite.apache.mirror/accumulo/1.8.X/accumulo-1.8.X-bin.tar.gz
    export DW_JAVA_DIST_URI=file:///my/local/binaries/jdk-8-linux-x64.tar.gz
    export DW_HADOOP_DIST_URI=file:///my/local/binaries/hadoop-2.6.0-cdh5.9.1.tar.gz
    export DW_ZOOKEEPER_DIST_URI=file:///my/local/binaries/zookeeper-3.4.5-cdh5.9.1.tar.gz

    source /path/to/datawave-quickstart/bin/env.sh

$ source ~/.bashrc
```
- As for DataWave's ingest and web service binaries, a Maven build is kicked off automatically and the respective tarballs will be moved into place upon conclusion of the build
- The DW_DATAWAVE_BUILD_COMMAND variable in datawave/bootstrap.sh defines the Maven command used for the build. It may be overridden in ~/.bashrc or from the command line as needed
When tarball downloads are finished and the DataWave build has completed, the allInstall command will initialize and configure all services...
- Alternatively, individual services may be installed one at a time, if desired, using {serviceName}Install
- All binaries are extracted/installed under datawave-quickstart/bin/services
- As part of the installation of DataWave Ingest, example data will be ingested automatically via M/R for demonstration purposes
- Example datasets in JSON, CSV, and XML formats are among those ingested automatically
Lastly, datawaveWebStart && datawaveWebTest will start up DataWave Web and run preconfigured tests against DataWave's REST API. Note any test failures, if present, and check logs for error messages.
- Troubleshooting help:
  - Get help on web test options: datawaveWebTest --help
  - Rerun web tests with more output: datawaveWebTest --verbose --pretty-print
  - Wildfly/DataWave Web logs: datawave-quickstart/wildfly/standalone/log
  - Yarn/MR logs: datawave-quickstart/data/hadoop/yarn/log
  - DataWave Ingest logs: datawave-quickstart/datawave-ingest/logs
- Note that an individual service may be started at any time with its associated {serviceName}Start command
- Moreover, using a DataWave-specific command, such as datawaveWebStart for starting DataWave web services in Wildfly, will automatically start up service dependencies, if necessary, such as Hadoop, Accumulo, ZooKeeper, etc
- Alternatively, you may execute the allStart command to bring up all services in the proper order

Check the status of your services...

Execute the allStatus command or use {serviceName}Status for a given service to display its associated PID(s)
Try out the following queries from the command line and inspect the results...

# Find some Wikipedia data based on page title, with --verbose flag to see the CURL command used...

$ datawaveQuery --query "PAGE_TITLE:AccessibleComputing OR PAGE_TITLE:Anarchism" --verbose

# TV show data from API.TVMAZE.COM (graph edge queries)

$ datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'kevin bacon' && TYPE == 'TV_COSTARS'" --pagesize 30
$ datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'william shatner' && TYPE == 'TV_CHARACTERS'"
$ datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'westworld' && TYPE == 'TV_SHOW_CAST'"  --pagesize 20

# Check out all 'datawaveQuery' function options, inspect default parameter values, etc...

$ datawaveQuery --help

Visit any of the following in your browser
- Hadoop HDFS: http://localhost:50070/dfshealth.html#tab-overview
- Hadoop M/R: http://localhost:8088/cluster
- Accumulo: http://localhost:9995
- DataWave Web: https://localhost:8443/DataWave/doc

Stop services, uninstall services, reset the environment, etc

To stop all services gracefully, execute the allStop command. Likewise, you may stop individual services with the associated {serviceName}Stop command
allStop --hard will bring down all services via kill -9 with no prompting
To tear down all services and remove all associated data directories, execute the allUninstall [ --remove-binaries, -rb ] command. Individual services may be uninstalled at any time with {serviceName}Uninstall [ --remove-binaries, -rb ]
To re-source all bootstrap.sh and related scripts without starting a new bash session: resetQuickstartEnvironment
Nuclear options:
- resetQuickstartEnvironment --hard : kill -9 all services, remove everything including tarballs, re-download and re-build all
- resetQuickstartEnvironment --hard && allInstall : same as above but install everything after reset

Convenience functions implemented for DataWave Web and DataWave Ingest

DataWave-specific functions are implemented in a variety scripts. For example...

bin/query.sh: Query-related functions for interacting with DataWave Web's REST API
datawave/bootstrap.sh: Common functions. Parent wrapper for the DataWave Web and DataWave Ingest bootstraps
datawave/bootstrap-web.sh: Bootstrap for DataWave Web and associated functions
datawave/bootstrap-ingest.sh: Bootstrap for DataWave Ingest and associated functions
datawave/bootstrap-user.sh: User-specific configs/functions for your DataWave Web test user

The functions below are a small subset of those implemented for DataWave, but will likely be the most commonly used

datawaveWebStart [ --debug ]
- Start up DataWave's web services in Wildfly. Pass the --debug flag to start Wildfly in debug mode
datawaveQuery --query <query-expression>
- Submit queries on demand and inspect results
- Supports several options, use the --help flag for more info
- After installation, try datawaveQuery --query "PAGE_TITLE:AccessibleComputing OR PAGE_TITLE:Anarchism" to view some Wikipedia search results
- Also see bin/query.sh for more details
- Query syntax guidance: http://localhost:8443/DataWave/doc/query_help.html
datawaveWebTest
- Wrapper function for the datawave/test-web/run.sh script, for executing pre-configured, curl-based tests against DataWave's REST API
- Supports several options, use the --help flag for more information
datawaveIngestJson /path/to/local/file.json
- Kick off M/R job to ingest raw JSON file
- Ingest config file: myjson-ingest-config.xml
- Raw file ingested automatically by the quickstart install: tvmaze-api.json
- Execute the ingest-tv-shows.sh script to download and ingest a JSON dataset dynamically via http://tvmaze.com/api
  - Upon completion, you should be able to execute queries related to cast and characters for several TV shows...
  - Shard query: datawaveQuery --query "EMBEDDED_CAST_PERSON_NAME == 'bryan cranston'" --syntax JEXL
  - Edge query: datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'don knotts' && TYPE == 'TV_COSTARS'" --pagesize 30
  - Edge-to-Shard query: datawaveQuery --logic EdgeEventQuery --syntax JEXL --query "SOURCE == 'don knotts' && SINK == 'andy griffith' && RELATION == 'PERSON-PERSON' && TYPE == 'TV_COSTARS'"
datawaveIngestWikipedia /path/to/local/enwiki*.xml
- Kick off M/R job to ingest a raw Wikipedia file. Any standard enwiki-flavored (raw XML) should suffice
- Ingest config file: wikipedia-ingest-config.xml
- Raw file ingested automatically by the quickstart install: enwiki-20130305-pages-articles-brief.xml
- Example shard query: datawaveQuery --query "PAGE_TITLE:AccessibleComputing OR PAGE_TITLE:Anarchism"
- Example edge query: datawaveQuery --logic EdgeQuery --syntax JEXL --query "SOURCE == 'computer accessibility' && TYPE == 'REDIRECT'"
datawaveIngestCsv /path/to/local/file.csv
- Kick off M/R job to ingest a raw CSV file
- Ingest config file: mycsv-ingest-config.xml
- Raw file ingested automatically by the quickstart install: my.csv
- Example query: datawaveQuery --query "FOO_FIELD:myfoo"
datawaveBuild and datawaveBuildDeploy
- Rebuild and/or redeploy DataWave as needed (i.e., after the initial install/deploy)

Functions implemented by all service bootstraps

See bin/services/{servicename}/bootstrap.sh


`{servicename}Start`	Start the service
`{servicename}Stop`	Stop the service
`{servicename}Status`	Display current status of the service, including PIDs if running
`{servicename}Install`	Install the service
`{servicename}Uninstall`	Uninstall the service, leaving tarballs in place. Optional `--remove-binaries, -rb` flag
`{servicename}IsRunning`	Returns 0 if running, non-zero otherwise. Mostly for internal use
`{servicename}IsInstalled`	Returns 0 if installed, non-zero otherwise. Mostly for internal use
`{servicename}Printenv`	Display current state of the service configuration, bash variables, etc
`{servicename}PidList`	Display all service PIDs on a single line, space-delimited

Where {servicename} is one of hadoop, accumulo, zookeeper, datawave

Common functions implemented for convenience

See bin/common.sh


`allStart`	Start all services
`allStop`	Stop all services gracefully. Use `--hard` flag to `kill -9`
`allStatus`	Display current status of all services, including PIDs if running
`allInstall`	Install all services
`allUninstall`	Uninstall all services, leaving tarballs in place by default. Optional '--remove-binaries' flag
`allPrintenv`	Display current state of all service configurations

PKI Notes

Note that DataWave Web is PKI enabled by default and requires two-way authentication. The following self-signed materials are used for demonstration purposes
- Server Truststore [JKS]: ca.jks
- Server Keystore [PKCS12]: testServer.p12
- Client Cert [PKCS12]: testUser.p12
Passwords for all of the above: secret
This purpose of this PKI setup is to demonstrate DataWave's ability to be integrated easily into an organization's existing private key infrastructure and user authorization services. See datawave/bootstrap-user.sh for more information on configuration of the test user's roles and associated Accumulo authorizations
To access DataWave Web endpoints in a browser, you'll need to import the client cert into the browser's certificate store
If you'd like to test with your own certificate materials, see datawave/bootstrap.sh and override the trustore/keystore variables there prior to performing Step 2 of the install process above

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

DataWave Quickstart

Purpose

Prerequisites

Recommended (minimum) system configuration

Get started

Stop services, uninstall services, reset the environment, etc

Convenience functions implemented for DataWave Web and DataWave Ingest

Functions implemented by all service bootstraps

Common functions implemented for convenience

PKI Notes

Files

README.md

Latest commit

History

README.md

File metadata and controls

DataWave Quickstart

Purpose

Prerequisites

Recommended (minimum) system configuration

Get started

Stop services, uninstall services, reset the environment, etc

Convenience functions implemented for DataWave Web and DataWave Ingest

Functions implemented by all service bootstraps

Common functions implemented for convenience

PKI Notes