Commit

Add a notice for using regional init action buckets. (#708)
medb authored Jan 14, 2020
1 parent d8e3645 commit fa16d0f
Showing 43 changed files with 645 additions and 305 deletions.
41 changes: 29 additions & 12 deletions README.md
@@ -4,19 +4,36 @@ When creating a [Google Cloud Dataproc](https://cloud.google.com/dataproc/) clus

## How initialization actions are used

Initialization actions are stored in a [Google Cloud Storage](https://cloud.google.com/storage) bucket and can be passed as a parameter to the `gcloud` command or the `clusters.create` API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the `gcloud` command, you can run:
Initialization actions must be stored in a [Google Cloud Storage](https://cloud.google.com/storage) bucket and can be passed as a parameter to the `gcloud` command or the `clusters.create` API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the `gcloud` command, you can run:

gcloud dataproc clusters create <CLUSTER_NAME> \
[--initialization-actions [GCS_URI,...]] \
[--initialization-action-timeout TIMEOUT]
```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
[--initialization-actions [GCS_URI,...]] \
[--initialization-action-timeout TIMEOUT]
```
During development, you can create a Dataproc cluster using the Dataproc-provided
[regional](https://cloud.google.com/dataproc/docs/concepts/regional-endpoints) initialization
action buckets (for example, `goog-dataproc-initialization-actions-us-east1`):
```bash
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh
```
Before creating clusters, you need to copy initialization actions to your own GCS bucket. For example:
**:warning: NOTICE:** For production usage, it is strongly recommended to copy initialization
actions to your own Cloud Storage bucket before creating clusters. This guarantees consistent use
of the same initialization action code across all Dataproc cluster nodes and prevents unintended
upgrades from upstream in the cluster:
```bash
MY_BUCKET=<gcs-bucket>
gsutil cp presto/presto.sh gs://$MY_BUCKET/
gcloud dataproc clusters create my-presto-cluster \
--initialization-actions gs://$MY_BUCKET/presto.sh
BUCKET=<your_init_actions_bucket>
CLUSTER=<cluster_name>
gsutil cp presto/presto.sh gs://${BUCKET}/
gcloud dataproc clusters create ${CLUSTER} --initialization-actions gs://${BUCKET}/presto.sh
```
You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the GitHub repository. This is also useful if you want to modify initialization actions to fit your needs.
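For example, a minimal sync workflow might look like the following sketch. It assumes a local clone of this repository (the directory name shown is illustrative), and `BUCKET` is a placeholder for your own Cloud Storage bucket:
```bash
# Sketch only: refresh a local clone of this repository and re-upload the
# initialization action you use. BUCKET is a placeholder for your own bucket.
BUCKET=<your_init_actions_bucket>
cd initialization-actions
git pull
gsutil cp presto/presto.sh gs://${BUCKET}/presto.sh
```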
@@ -92,9 +109,9 @@ custom metadata:
```bash
gcloud dataproc clusters create cluster-name \
--initialization-actions ... \
--metadata name1=value1,name2=value2... \
... other flags ...
--initialization-actions ... \
--metadata name1=value1,name2=value2,... \
... other flags ...
```
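As a rough sketch of the other side of this mechanism, an initialization action running on a cluster node can read such a value back from the instance metadata server. The key `name1` below is just the placeholder from the example above:
```bash
# Sketch: read the custom metadata value "name1" from the GCE metadata server.
# The -f flag makes curl exit non-zero if the key is not set.
name1=$(curl -f -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/name1")
echo "name1=${name1}"
```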
## For more information
33 changes: 20 additions & 13 deletions alluxio/README.MD
@@ -7,17 +7,20 @@ will be Alluxio workers.

## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

You can use this initialization action to create a new Dataproc cluster with
Alluxio installed:

1. Using the `gcloud` command to create a new cluster with this initialization
action. The following command will create a new cluster named
`<CLUSTER_NAME>`.
action.

```bash
gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://$my_bucket/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS>
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS>
```

You can find more information about using initialization actions with Dataproc
@@ -48,19 +51,23 @@ must precede the Alluxio action.
`alluxio_site_properties` delimited using `;`.

```bash
gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://$my_bucket/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS>
--metadata alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY>"
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
--metadata alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY>"
```

* Additional files can be downloaded into `/opt/alluxio/conf` using the
metadata key `alluxio_download_files_list` by specifying `http(s)` or `gs`
uris delimited using `;`.

```bash
gcloud dataproc clusters create <cluster_name> \
--initialization-actions gs://$my_bucket/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
--metadata alluxio_download_files_list="gs://$my_bucket/$my_file;https://$server/$file"
REGION=<region>
CLUSTER=<cluster_name>
gcloud dataproc clusters create ${CLUSTER} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
--metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
--metadata alluxio_download_files_list="gs://goog-dataproc-initialization-actions-${REGION}/$my_file;https://$server/$file"
```
29 changes: 17 additions & 12 deletions beam/README.md
@@ -14,6 +14,10 @@ Due to the current development
portability framework, you are responsible for building and maintaining your
own Beam artifacts manually. Instructions are included below.

## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

## Building Beam Artifacts

You will generate two categories of artifacts for this initialization action:
@@ -110,10 +114,11 @@ You should explicitly set the Beam and Flink metadata variables (use a script as
shown later).

```bash
CLUSTER_NAME="$1
INIT_ACTIONS="gs://$MY_BUCKET/docker/docker.sh"
INIT_ACTIONS+=",gs://$MY_BUCKET/flink/flink.sh"
INIT_ACTIONS+=",gs://$MY_BUCKET/beam/beam.sh"
REGION=<region>
CLUSTER_NAME="$1"
INIT_ACTIONS="gs://goog-dataproc-initialization-actions-${REGION}/docker/docker.sh"
INIT_ACTIONS+=",gs://goog-dataproc-initialization-actions-${REGION}/flink/flink.sh"
INIT_ACTIONS+=",gs://goog-dataproc-initialization-actions-${REGION}/beam/beam.sh"
FLINK_SNAPSHOT="https://archive.apache.org/dist/flink/flink-1.5.3/flink-1.5.3-bin-hadoop28-scala_2.11.tgz"
METADATA="beam-job-service-snapshot=<...>"
METADATA+=",beam-image-enable-pull=true"
@@ -123,9 +128,9 @@ METADATA+=",flink-start-yarn-session=true"
METADATA+=",flink-snapshot-url=${FLINK_SNAPSHOT}"

gcloud dataproc clusters create "${CLUSTER_NAME}" \
--initialization-actions="${INIT_ACTIONS}" \
--image-version="1.2" \
--metadata="${METADATA}"
--initialization-actions "${INIT_ACTIONS}" \
--image-version "1.2" \
--metadata "${METADATA}"
```

The Beam Job Service runs on port `8099` of the master node. You can submit
@@ -135,11 +140,11 @@ on the master node, upload the wordcount job binary, and then run:

```bash
./wordcount \
--runner flink \
--endpoint localhost:8099 \
--experiments beam_fn_api \
--output=<out> \
--container_image <BEAM_CONTAINER_DESTINATION>/go:<BEAM_SOURCE_VERSION>
--runner flink \
--endpoint localhost:8099 \
--experiments beam_fn_api \
--output=<out> \
--container_image <BEAM_CONTAINER_DESTINATION>/go:<BEAM_SOURCE_VERSION>
```

The Beam Job Service port must be opened to submit beam jobs from machines
26 changes: 20 additions & 6 deletions bigdl/README.md
@@ -10,11 +10,19 @@ More information [project's website](https://analytics-zoo.github.io)

## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

You can use this initialization action to create a new Dataproc cluster with BigDL's Spark and PySpark libraries installed.

Because of the time needed to install BigDL on the cluster nodes, you need to set the
`--initialization-action-timeout 10m` property to prevent cluster creation from timing out.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m
```

@@ -28,19 +36,25 @@ The URL should end in `-dist.zip`.
For example, for Dataproc 1.0 (Spark 1.6 and Scala 2.10) and BigDL v0.7.2:

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--image-version 1.0 \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/bigdl/dist-spark-1.6.2-scala-2.10.5-all/0.7.2/dist-spark-1.6.2-scala-2.10.5-all-0.7.2-dist.zip'
```

Or, for example, to download Analytics Zoo 0.4.0 with BigDL v0.7.2 for Dataproc 1.3 (Spark 2.3), use this:

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--image-version 1.3 \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/zoo/analytics-zoo-bigdl_0.7.2-spark_2.3.1/0.4.0/analytics-zoo-bigdl_0.7.2-spark_2.3.1-0.4.0-dist-all.zip'
```
23 changes: 17 additions & 6 deletions bigtable/README.MD
@@ -1,17 +1,22 @@
# Google Cloud Bigtable via Apache HBase
This initialization action installs Apache HBase libraries and the [Google Cloud Bigtable](https://cloud.google.com/bigtable/) [HBase Client](https://github.com/GoogleCloudPlatform/cloud-bigtable-client).


## Using this initialization action

**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) for using initialization actions in production.

You can use this initialization action to create a Dataproc cluster configured to connect to Cloud Bigtable:

1. Create a Bigtable instance by following [these directions](https://cloud.google.com/bigtable/docs/creating-instance).
1. Using the `gcloud` command to create a new cluster with this initialization action.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://$MY_BUCKET/bigtable/bigtable.sh \
--metadata bigtable-instance=<BIGTABLE INSTANCE>
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigtable/bigtable.sh \
--metadata bigtable-instance=<BIGTABLE INSTANCE>
```
1. The cluster will have HBase libraries, the Bigtable client, and the [Apache Spark - Apache HBase Connector](https://github.com/hortonworks-spark/shc) installed.
1. In addition to running Hadoop and Spark jobs, you can SSH to the master (`gcloud compute ssh <CLUSTER_NAME>-m`) and use `hbase shell` to [connect](https://cloud.google.com/bigtable/docs/installing-hbase-shell#connect) to your Bigtable instance.
@@ -28,8 +33,14 @@ You can use this initialization action to create a Dataproc cluster configured t
```
1. Submit the jar with dependencies as a Dataproc job. Note that `OUTPUT_TABLE` should not already exist. This job will create the table with the correct column family.

```bash
gcloud dataproc jobs submit hadoop --cluster <CLUSTER_NAME> --class com.example.bigtable.sample.WordCountDriver --jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar -- wordcount-hbase gs://$MY_BUCKET/README.md <OUTPUT_TABLE>
```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc jobs submit hadoop --cluster ${CLUSTER_NAME} \
--class com.example.bigtable.sample.WordCountDriver \
--jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar \
-- \
wordcount-hbase gs://goog-dataproc-initialization-actions-${REGION}/README.md <OUTPUT_TABLE>
```
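After the job completes, one rough way to sanity-check the result (assuming the `hbase shell` connection to Bigtable described above) is to scan the output table from the master node. `<OUTPUT_TABLE>` is the table name passed to the job above:
```bash
# Sketch only: scan the output table via the HBase shell on the master node.
echo "scan '<OUTPUT_TABLE>'" | hbase shell
```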

## Running an example Spark job on a cluster using SHC