Apache Gobblin Initialization Action

This initialization action installs version 0.12.0 RC2 of Apache Gobblin on all nodes within a Google Cloud Dataproc cluster.

The distribution is hosted in the Dataproc-team-owned Google Cloud Storage bucket gobblin-dist.

Using this initialization action

You can use this initialization action to create a new Dataproc cluster with Gobblin installed:

  1. Use the gcloud command to create a new cluster with this initialization action. The following command creates a new cluster named <CLUSTER_NAME>.

    gcloud dataproc clusters create <CLUSTER_NAME> \
        --initialization-actions gs://dataproc-initialization-actions/gobblin/gobblin.sh
  2. Submit jobs:

    gcloud dataproc jobs submit hadoop --cluster=<CLUSTER_NAME> \
        --class org.apache.gobblin.runtime.mapreduce.CliMRJobLauncher \
        --properties mapreduce.job.user.classpath.first=true -- \
        -sysconfig /usr/local/lib/gobblin/conf/gobblin-mapreduce.properties \
        -jobconfig gs://<PATH_TO_JOB_CONFIG>

    Alternatively, you can submit jobs through the Gobblin launcher scripts located in /usr/local/lib/gobblin/bin. By default, Gobblin is configured only for MapReduce mode.

  3. To learn how to use Gobblin, read the Getting Started guide in the Gobblin documentation.
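The -jobconfig flag in step 2 points to a Gobblin job configuration file stored in Cloud Storage. As an illustrative sketch only (the job name and class names below follow Gobblin's bundled Wikipedia example and are not part of this initialization action; they may differ across Gobblin versions), a minimal job configuration might look like:

```properties
# Hypothetical minimal Gobblin job configuration.
# job.name and job.group are arbitrary labels chosen by you.
job.name=ExampleWikipediaPull
job.group=Examples

# Source and converter classes must match your data source;
# these are from Gobblin's example package.
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter

# Write Avro output and publish it with the default publisher.
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

Upload the file to a bucket (for example with gsutil cp) and pass its gs:// path as the -jobconfig value.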

Important notes

  1. For Gobblin to work with the Dataproc Job API, any additional client libraries (for example, Kafka or MySQL) must be symlinked into the /usr/lib/hadoop/lib directory on each node.
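As a sketch of that symlinking step (the helper function name and the assumption that extra client jars live under /usr/local/lib/gobblin/lib are hypothetical, not defined by this initialization action), you could run something like the following on each node:

```shell
#!/bin/sh
# Hypothetical helper: symlink extra client jars (e.g. Kafka, MySQL connectors)
# into Hadoop's library directory so Dataproc-submitted jobs can load them.
link_client_jars() {
  src_dir="$1"   # directory holding the client jars
  dest_dir="$2"  # Hadoop lib directory, typically /usr/lib/hadoop/lib
  for jar in "$src_dir"/*.jar; do
    # Skip the literal glob when no jars are present.
    [ -e "$jar" ] || continue
    ln -sf "$jar" "$dest_dir/$(basename "$jar")"
  done
}

# On a Dataproc node you would run (paths assumed, run as root or via sudo):
# link_client_jars /usr/local/lib/gobblin/lib /usr/lib/hadoop/lib
```

Running this during cluster creation (for example, from a custom initialization action) keeps every node's classpath consistent.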