Commit

Add a notice for using regional init action buckets. (#708)
medb authored Jan 14, 2020
1 parent d8e3645 commit fa16d0f
Showing 43 changed files with 645 additions and 305 deletions.
41 changes: 29 additions & 12 deletions README.md
@@ -4,19 +4,36 @@ When creating a [Google Cloud Dataproc](https://cloud.google.com/dataproc/) clus

## How initialization actions are used

Initialization actions are stored in a [Google Cloud Storage](https://cloud.google.com/storage) bucket and can be passed as a parameter to the `gcloud` command or the `clusters.create` API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the `gcloud` command, you can run:
Initialization actions must be stored in a [Google Cloud Storage](https://cloud.google.com/storage) bucket and can be passed as a parameter to the `gcloud` command or the `clusters.create` API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the `gcloud` command, you can run:

gcloud dataproc clusters create <CLUSTER_NAME> \
[--initialization-actions [GCS_URI,...]] \
[--initialization-action-timeout TIMEOUT]
```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
[--initialization-actions [GCS_URI,...]] \
[--initialization-action-timeout TIMEOUT]
```
During development, you can create a Dataproc cluster using the Dataproc-provided
[regional](https://cloud.google.com/dataproc/docs/concepts/regional-endpoints) initialization
action buckets (for example, `goog-dataproc-initialization-actions-us-east1`):
```bash
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh
```
Before creating clusters, you need to copy initialization actions to your own GCS bucket. For example:
**:warning: NOTICE:** For production usage, it is strongly recommended to copy initialization
actions to your own Cloud Storage bucket before creating clusters. This guarantees consistent use
of the same initialization action code across all Dataproc cluster nodes and prevents unintended
upgrades from upstream in the cluster:
```bash
MY_BUCKET=<gcs-bucket>
gsutil cp presto/presto.sh gs://$MY_BUCKET/
gcloud dataproc clusters create my-presto-cluster \
--initialization-actions gs://$MY_BUCKET/presto.sh
BUCKET=<your_init_actions_bucket>
CLUSTER=<cluster_name>
gsutil cp presto/presto.sh gs://${BUCKET}/
gcloud dataproc clusters create ${CLUSTER} --initialization-actions gs://${BUCKET}/presto.sh
```
You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the GitHub repository. This is also useful if you want to modify initialization actions to fit your needs.
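For example, a minimal sync workflow might look like the following sketch. It assumes a local clone of this repository (the directory name shown is illustrative), and `BUCKET` is a placeholder for your own Cloud Storage bucket:
```bash
# Sketch only: refresh a local clone of this repository and re-upload the
# initialization action you use. BUCKET is a placeholder for your own bucket.
BUCKET=<your_init_actions_bucket>
cd initialization-actions
git pull
gsutil cp presto/presto.sh gs://${BUCKET}/presto.sh
```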
@@ -92,9 +109,9 @@ custom metadata:
```bash
gcloud dataproc clusters create cluster-name \
--initialization-actions ... \
--metadata name1=value1,name2=value2... \
... other flags ...
--initialization-actions ... \
--metadata name1=value1,name2=value2,... \
... other flags ...
```
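As a rough sketch of the other side of this mechanism, an initialization action running on a cluster node can read such a value back from the instance metadata server. The key `name1` below is just the placeholder from the example above:
```bash
# Sketch: read the custom metadata value "name1" from the GCE metadata server.
# The -f flag makes curl exit non-zero if the key is not set.
name1=$(curl -f -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/name1")
echo "name1=${name1}"
```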
## For more information
33 changes: 20 additions & 13 deletions alluxio/README.MD
@@ -7,17 +7,20 @@ will be Alluxio workers.

## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

You can use this initialization action to create a new Dataproc cluster with
Alluxio installed:

1. Using the `gcloud` command to create a new cluster with this initialization
action. The following command will create a new cluster named
`<CLUSTER_NAME>`.
action.

```bash
gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://$my_bucket/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS>
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS>
```

You can find more information about using initialization actions with Dataproc
@@ -48,19 +51,23 @@ must precede the Alluxio action.
`alluxio_site_properties` delimited using `;`.

```bash
gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://$my_bucket/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS>
--metadata alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY>"
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
--metadata alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY>"
```

* Additional files can be downloaded into `/opt/alluxio/conf` using the
metadata key `alluxio_download_files_list` by specifying `http(s)` or `gs`
uris delimited using `;`.

```bash
gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://$my_bucket/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
--metadata alluxio_download_files_list="gs://$my_bucket/$my_file;https://$server/$file"
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
--metadata alluxio_download_files_list="gs://goog-dataproc-initialization-actions-${REGION}/$my_file;https://$server/$file"
```
29 changes: 17 additions & 12 deletions beam/README.md
@@ -14,6 +14,10 @@ Due to the current development
portability framework, you are responsible for building and maintaining your
own Beam artifacts manually. Instructions are included below.

## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

## Building Beam Artifacts

You will generate two categories of artifacts for this initialization action:
@@ -110,10 +114,11 @@ You should explicitly set the Beam and Flink metadata variables (use a script as
shown later).

```bash
CLUSTER_NAME="$1
INIT_ACTIONS="gs://$MY_BUCKET/docker/docker.sh"
INIT_ACTIONS+=",gs://$MY_BUCKET/flink/flink.sh"
INIT_ACTIONS+=",gs://$MY_BUCKET/beam/beam.sh"
REGION=<region>
CLUSTER_NAME="$1"
INIT_ACTIONS="gs://goog-dataproc-initialization-actions-${REGION}/docker/docker.sh"
INIT_ACTIONS+=",gs://goog-dataproc-initialization-actions-${REGION}/flink/flink.sh"
INIT_ACTIONS+=",gs://goog-dataproc-initialization-actions-${REGION}/beam/beam.sh"
FLINK_SNAPSHOT="https://archive.apache.org/dist/flink/flink-1.5.3/flink-1.5.3-bin-hadoop28-scala_2.11.tgz"
METADATA="beam-job-service-snapshot=<...>"
METADATA+=",beam-image-enable-pull=true"
@@ -123,9 +128,9 @@ METADATA+=",flink-start-yarn-session=true"
METADATA+=",flink-snapshot-url=${FLINK_SNAPSHOT}"

gcloud dataproc clusters create "${CLUSTER_NAME}" \
--initialization-actions="${INIT_ACTIONS}" \
--image-version="1.2" \
--metadata="${METADATA}"
--initialization-actions "${INIT_ACTIONS}" \
--image-version "1.2" \
--metadata "${METADATA}"
```

The Beam Job Service runs on port `8099` of the master node. You can submit
@@ -135,11 +140,11 @@ on the master node, upload the wordcount job binary, and then run:

```bash
./wordcount \
--runner flink \
--endpoint localhost:8099 \
--experiments beam_fn_api \
--output=<out> \
--container_image <BEAM_CONTAINER_DESTINATION>/go:<BEAM_SOURCE_VERSION>
--runner flink \
--endpoint localhost:8099 \
--experiments beam_fn_api \
--output=<out> \
--container_image <BEAM_CONTAINER_DESTINATION>/go:<BEAM_SOURCE_VERSION>
```

The Beam Job Service port must be opened to submit beam jobs from machines
26 changes: 20 additions & 6 deletions bigdl/README.md
@@ -10,11 +10,19 @@ More information [project's website](https://analytics-zoo.github.io)

## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

You can use this initialization action to create a new Dataproc cluster with BigDL's Spark and PySpark libraries installed.

Because of the time needed to install BigDL on the cluster nodes, you need to set the
`--initialization-action-timeout 10m` property to prevent cluster creation from timing out.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m
```

@@ -28,19 +36,25 @@ The URL should end in `-dist.zip`.
For example, for Dataproc 1.0 (Spark 1.6 and Scala 2.10) and BigDL v0.7.2:

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--image-version 1.0 \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/bigdl/dist-spark-1.6.2-scala-2.10.5-all/0.7.2/dist-spark-1.6.2-scala-2.10.5-all-0.7.2-dist.zip'
```

Or, for example, to download Analytics Zoo 0.4.0 with BigDL v0.7.2 for Dataproc 1.3 (Spark 2.3), use this:

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--image-version 1.3 \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/zoo/analytics-zoo-bigdl_0.7.2-spark_2.3.1/0.4.0/analytics-zoo-bigdl_0.7.2-spark_2.3.1-0.4.0-dist-all.zip'
```
23 changes: 17 additions & 6 deletions bigtable/README.MD
@@ -1,17 +1,22 @@
# Google Cloud Bigtable via Apache HBase
This initialization action installs Apache HBase libraries and the [Google Cloud Bigtable](https://cloud.google.com/bigtable/) [HBase Client](https://github.com/GoogleCloudPlatform/cloud-bigtable-client).


## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

You can use this initialization action to create a Dataproc cluster configured to connect to Cloud Bigtable:

1. Create a Bigtable instance by following [these directions](https://cloud.google.com/bigtable/docs/creating-instance).
1. Using the `gcloud` command to create a new cluster with this initialization action.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://$MY_BUCKET/bigtable/bigtable.sh \
--metadata bigtable-instance=<BIGTABLE INSTANCE>
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigtable/bigtable.sh \
--metadata bigtable-instance=<BIGTABLE INSTANCE>
```
1. The cluster will have HBase libraries, the Bigtable client, and the [Apache Spark - Apache HBase Connector](https://github.com/hortonworks-spark/shc) installed.
1. In addition to running Hadoop and Spark jobs, you can SSH to the master (`gcloud compute ssh <CLUSTER_NAME>-m`) and use `hbase shell` to [connect](https://cloud.google.com/bigtable/docs/installing-hbase-shell#connect) to your Bigtable instance.
@@ -28,8 +33,14 @@ You can use this initialization action to create a Dataproc cluster configured t
```
1. Submit the jar with dependencies as a Dataproc job. Note that `OUTPUT_TABLE` should not already exist. This job will create the table with the correct column family.

```bash
gcloud dataproc jobs submit hadoop --cluster <CLUSTER_NAME> --class com.example.bigtable.sample.WordCountDriver --jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar -- wordcount-hbase gs://$MY_BUCKET/README.md <OUTPUT_TABLE>
```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc jobs submit hadoop --cluster ${CLUSTER_NAME} \
--class com.example.bigtable.sample.WordCountDriver \
--jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar \
-- \
wordcount-hbase gs://goog-dataproc-initialization-actions-${REGION}/README.md <OUTPUT_TABLE>
```
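After the job completes, one rough way to sanity-check the result (assuming the `hbase shell` connection to Bigtable described above) is to scan the output table from the master node. `<OUTPUT_TABLE>` is the table name passed to the job above:
```bash
# Sketch only: scan the output table via the HBase shell on the master node.
echo "scan '<OUTPUT_TABLE>'" | hbase shell
```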

## Running an example Spark job on a cluster using SHC