SMV provides a shell script to easily create an example application. The example app can be used for exploring SMV, and it can also serve as the starting point for a new project.
The `smv-init` script can be used to create the initial SMV app. `smv-init` only requires two parameters:
- The name of the directory to create for the application
- The FQN of the package to use for Maven and source files. For example:
$ _SMV_HOME_/tools/smv-init MyApp com.mycompany.myapp
The above command will create the `MyApp` directory and install the source, configuration, and build files required for a minimal example SMV app.
The rest of this document assumes the above command was run to show what is generated and how to use it.
Note: Users can skip to the "Run Example App" section if they are not interested in exploring the output of `smv-init`.
The generated example app contains two configuration files.
- `smv-app-conf.props`: the application-level configuration parameters. This file should define the application name and the configured stages.
- `smv-user-conf.props`: the user-level configuration parameters. This file is normally NOT checked in to source control.
See Application Configuration for more details about available configuration parameters.
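As an illustration, a minimal `smv-app-conf.props` for the example app might look like the sketch below. The property names shown are assumptions based on a typical SMV setup; the Application Configuration doc is the authoritative reference.

```
# Assumed property names -- verify against the Application Configuration doc.
# Human-readable application name:
smv.appName = My Application

# The stages that make up this application, as fully qualified package names:
smv.stages = com.mycompany.myapp.stage1, com.mycompany.myapp.stage2
```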
The data directory contains a sample file extracted from US employment data.
$ wget http://www2.census.gov/econ2012/CB/sector00/CB1200CZ11.zip
$ unzip CB1200CZ11.zip
$ mv CB1200CZ11.dat CB1200CZ11.csv
More info can be found on the US Census site.
A valid pom.xml file is generated using the package name provided on the command line.
The last part of the provided package name is used as the Maven `artifactId` and the remaining prefix is used as the `groupId`. In the example above, the `artifactId` will be set to `myapp` and the `groupId` to `com.mycompany`.
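For instance, the generated `pom.xml` would contain Maven coordinates along these lines (the `1.0-SNAPSHOT` version matches the jar name shown later in this document; treat this as an illustrative fragment, not the full generated file):

```xml
<groupId>com.mycompany</groupId>
<artifactId>myapp</artifactId>
<version>1.0-SNAPSHOT</version>
```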
The example app is generated with two stages, `stage1` and `stage2`. Having two stages for such a tiny example is overkill, but they are there for demonstration purposes.
The generated source files are:
- `stage1/input/InputSetS1.scala`: contains definitions of all input files into stage1 and their DQM rules/policies.
- `stage1/EmploymentByState.scala`: contains a sample ETL module for processing the provided employment data.
- `stage2/input/InputFilesS2.scala`: defines the input links to the output modules in stage1.
- `stage2/StageEmpCategory.scala`: a sample "modeling" module for creating categorical variables.
Note: In practice, a single stage will have multiple module files and possibly additional input files (depending on the number and complexity of inputs)
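To give a feel for what these module files contain, here is a sketch of what an ETL module like `EmploymentByState` typically looks like in SMV. The input file name and the exact column operations are illustrative assumptions, not the generated code verbatim:

```scala
package com.mycompany.myapp.stage1

import org.apache.spark.sql.functions._
import org.tresamigos.smv._

// An SMV module declares its dependencies via requiresDS() and computes its
// result DataFrame in run(). `input.employment` is assumed to be an input
// file dataset defined in stage1/input/InputSetS1.scala.
object EmploymentByState extends SmvModule("Employment by state") {
  override def requiresDS() = Seq(input.employment)

  override def run(i: runParams) = {
    // Aggregate raw employment records up to the state (ST) level.
    i(input.employment).groupBy("ST").agg(sum("EMP") as "EMP")
  }
}
```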
The generated application must be built before it is run. This is done by running the following Maven command:
$ mvn clean install
The above command should generate a `target` directory that contains the application "fat" jar, `myapp-1.0-SNAPSHOT-jar-with-dependencies.jar`. This jar file contains the compiled application class files, all the SMV class files, and everything else that SMV depends on (except for the Spark libraries).
The built app can be run by two methods:
- `smv-run`: used to run specific modules, stages, or the entire app from the command line.
- `smv-shell`: uses the Spark shell to interactively run and explore the output of individual modules and files.
# run entire app (run all output modules in all stages)
$ _SMV_HOME_/tools/smv-run --run-app
# run stage1 (all output modules in stage1)
$ _SMV_HOME_/tools/smv-run -s stage1
# or
$ _SMV_HOME_/tools/smv-run -s com.mycompany.myapp.stage1
# run specific module (any module can be run this way, does not have to be an output module)
$ _SMV_HOME_/tools/smv-run -m com.mycompany.myapp.stage1.EmploymentByState
See SMV Output Modules for more details on how to mark a module as an output module.
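As a sketch, marking a module as an output module is typically just a matter of mixing a trait into the module definition. The trait name `SmvOutput` is assumed here; the SMV Output Modules doc is the authoritative reference:

```scala
// Mixing in SmvOutput makes the module eligible to run as part of a whole
// stage (-s) or whole app (--run-app) run; requiresDS() and run() are
// unchanged from the module definition shown earlier.
object EmploymentByState extends SmvModule("Employment by state") with SmvOutput {
  // ... requiresDS() and run() as before ...
}
```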
The output csv file and schema can be found in the `data/output` directory (as configured in the `conf/smv-user-conf.props` file).
$ cat data/output/com.mycompany.myapp.stage1.EmploymentByState_XXXXXXXX.csv/part-* | head -5
"32",981295
"33",508120
"34",3324188
"35",579916
"36",7279345
$ cat data/output/com.mycompany.myapp.stage1.EmploymentByState_XXXXXXXX.schema/part-*
FIRST('ST): String
EMP: Long
Note: the output above may be different as it depends on order of execution of partitions.
$ _SMV_HOME_/tools/smv-run -m com.mycompany.myapp.stage1.EmploymentByState -g
With the `-g` flag, instead of producing and persisting the module output, the module dependency graph will be created as a `dot` file. It can be converted to `png` using the `dot` command.
$ dot -Tpng com.mycompany.MyApp.stage1.EmploymentByState.dot -o graph.png
You may need to install `graphviz` on your system to use the `dot` command.
See Run SMV Application for further details.
Spark shell can be used to allow the user to run individual modules interactively.
The `smv-shell` script is provided by SMV to make it easy to launch the Spark shell with the "fat" jar attached.
$ _SMV_HOME_/tools/smv-shell
See Run Spark Shell for details.
Once we are inside the Spark shell, we can "source" (using the `s()` SMV shell helper function) any `SmvFile` or `SmvModule` instance and inspect its contents (because the `s` function returns a standard Spark `DataFrame` object):
scala> val d1=s(stage1.input.employment)
scala> d1.count
res1: Long = 38818
scala> d1.printSchema
root
|-- ST: string (nullable = true)
|-- ZIPCODE: string (nullable = true)
...
scala> d1.select("ZIPCODE", "YEAR", "ESTAB", "EMP").show(10)
ZIPCODE YEAR ESTAB EMP
35004 2012 167 2574
35005 2012 88 665
...
You can also access SmvModules defined in the code. This is not limited to output modules.
scala> val d2 = s(stage1.EmploymentByState)
d2: org.apache.spark.sql.DataFrame = [ST: string, EMP: bigint]
scala> d2.printSchema
root
|-- ST: string (nullable = true)
|-- EMP: long (nullable = true)
scala> d2.count
res2: Long = 52
`EmploymentByState` is defined in the `stage1/EmploymentByState.scala` file.
As you can see above, when you refer to an `SmvModule`, SMV will run the calculation and then persist the result for future use. You can now use `d2`, a `DataFrame`, to refer to the SmvModule output, although in this example there is nothing interesting in that data other than the `EMP` field.
To quickly get an overall idea of the input data, we can use the SMV EDD (Extended Data Dictionary) tool.
scala> d1.select("ZIPCODE", "YEAR", "ESTAB", "EMP").edd.summary().eddShow
ZIPCODE Non-Null Count 38818
ZIPCODE Min Length 5
ZIPCODE Max Length 5
ZIPCODE Approx Distinct Count 38989
YEAR Non-Null Count 38818
YEAR Min Length 4
YEAR Max Length 4
YEAR Approx Distinct Count 1
ESTAB Non-Null Count 38818
ESTAB Average 191.45262507084342
ESTAB Standard Deviation 371.37743343837866
ESTAB Min 1.0
ESTAB Max 16465.0
EMP Non-Null Count 38818
EMP Average 2907.469241073729
EMP Standard Deviation 15393.485966796263
EMP Min 0.0
EMP Max 2733406.0
scala> d1.edd.histogram("ESTAB", "EMP").eddShow
Histogram of ESTAB: with BIN size 100.0
key count Pct cumCount cumPct
0.0 26060 67.13% 26060 67.13%
100.0 3129 8.06% 29189 75.19%
200.0 1960 5.05% 31149 80.24%
...
-------------------------------------------------
Histogram of EMP: with BIN size 100.0
key count Pct cumCount cumPct
0.0 15792 40.68% 15792 40.68%
100.0 3132 8.07% 18924 48.75%
200.0 1844 4.75% 20768 53.50%
300.0 1235 3.18% 22003 56.68%
400.0 988 2.55% 22991 59.23%
500.0 738 1.90% 23729 61.13%
...
Please see the EDD doc for more details.