diff --git a/README.md b/README.md index 44d58d975..6fc385489 100644 --- a/README.md +++ b/README.md @@ -4,19 +4,36 @@ When creating a [Google Cloud Dataproc](https://cloud.google.com/dataproc/) clus ## How initialization actions are used -Initialization actions are stored in a [Google Cloud Storage](https://cloud.google.com/storage) bucket and can be passed as a parameter to the `gcloud` command or the `clusters.create` API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the `gcloud` command, you can run: +Initialization actions must be stored in a [Google Cloud Storage](https://cloud.google.com/storage) bucket and can be passed as a parameter to the `gcloud` command or the `clusters.create` API when creating a Cloud Dataproc cluster. For example, to specify an initialization action when creating a cluster with the `gcloud` command, you can run: - gcloud dataproc clusters create \ - [--initialization-actions [GCS_URI,...]] \ - [--initialization-action-timeout TIMEOUT] +```bash +gcloud dataproc clusters create \ + [--initialization-actions [GCS_URI,...]] \ + [--initialization-action-timeout TIMEOUT] +``` + +During development, you can create a Dataproc cluster using Dataproc-provided +[regional](https://cloud.google.com/dataproc/docs/concepts/regional-endpoints) initialization +actions buckets (for example `goog-dataproc-initialization-actions-us-east1`): + +```bash +REGION= +CLUSTER= +gcloud dataproc clusters create ${CLUSTER} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh +``` -Before creating clusters, you need to copy initialization actions to your own GCS bucket. For example: +**:warning: NOTICE:** For production usage, before creating clusters it's strongly recommended +to copy initialization actions to your own Cloud Storage bucket to guarantee consistent use of the +same initialization action code across all Dataproc cluster nodes and to prevent unintended upgrades +from upstream in the cluster: ```bash -MY_BUCKET= -gsutil cp presto/presto.sh gs://$MY_BUCKET/ -gcloud dataproc clusters create my-presto-cluster \ - --initialization-actions gs://$MY_BUCKET/presto.sh +BUCKET= +CLUSTER= +gsutil cp presto/presto.sh gs://${BUCKET}/ +gcloud dataproc clusters create ${CLUSTER} --initialization-actions gs://${BUCKET}/presto.sh ``` You can decide when to sync your copy of the initialization action with any changes to the initialization action that occur in the GitHub repository. This is also useful if you want to modify initialization actions to fit your needs. @@ -92,9 +109,9 @@ custom metadata: ```bash gcloud dataproc clusters create cluster-name \ - --initialization-actions ... \ - --metadata name1=value1,name2=value2... \ - ... other flags ... + --initialization-actions ... \ + --metadata name1=value1,name2=value2,... \ + ... other flags ... ``` ## For more information diff --git a/alluxio/README.MD b/alluxio/README.MD index 06f0d847a..9d7b0bc3a 100644 --- a/alluxio/README.MD +++ b/alluxio/README.MD @@ -7,17 +7,20 @@ will be Alluxio workers. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Alluxio installed: 1. Using the `gcloud` command to create a new cluster with this initialization - action.
The following command will create a new cluster named - ``. + action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$my_bucket/alluxio/alluxio.sh \ - --metadata alluxio_root_ufs_uri= + REGION= + CLUSTER= + gcloud dataproc clusters create ${CLUSTER} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \ + --metadata alluxio_root_ufs_uri= ``` You can find more information about using initialization actions with Dataproc @@ -48,10 +51,12 @@ must precede the Alluxio action. `alluxio_site_properties` delimited using `;`. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$my_bucket/alluxio/alluxio.sh \ - --metadata alluxio_root_ufs_uri= - --metadata alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=" + REGION= + CLUSTER= + gcloud dataproc clusters create ${CLUSTER} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \ + --metadata alluxio_root_ufs_uri= \ + --metadata alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=" ``` * Additional files can be downloaded into `/opt/alluxio/conf` using the @@ -59,8 +64,10 @@ must precede the Alluxio action. uris delimited using `;`. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$my_bucket/alluxio/alluxio.sh \ - --metadata alluxio_root_ufs_uri= \ - --metadata alluxio_download_files_list="gs://$my_bucket/$my_file;https://$server/$file" + REGION= + CLUSTER= + gcloud dataproc clusters create ${CLUSTER} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \ + --metadata alluxio_root_ufs_uri= \ + --metadata alluxio_download_files_list="gs://goog-dataproc-initialization-actions-${REGION}/$my_file;https://$server/$file" ``` diff --git a/beam/README.md b/beam/README.md index 7b6b395f3..e03de8c27 100644 --- a/beam/README.md +++ b/beam/README.md @@ -14,6 +14,10 @@ Due to the current development portability framework, you are responsible for building and maintaining your own Beam artifacts manually. Instructions are included below. +## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + ## Building Beam Artifacts You will generate two categories of artifacts for this initialization action: @@ -110,10 +114,11 @@ You should explicitly set the Beam and Flink metadata variables (use a script as shown later).
```bash -CLUSTER_NAME="$1 -INIT_ACTIONS="gs://$MY_BUCKET/docker/docker.sh" -INIT_ACTIONS+=",gs://$MY_BUCKET/flink/flink.sh" -INIT_ACTIONS+=",gs://$MY_BUCKET/beam/beam.sh" +REGION= +CLUSTER_NAME="$1" +INIT_ACTIONS="gs://goog-dataproc-initialization-actions-${REGION}/docker/docker.sh" +INIT_ACTIONS+=",gs://goog-dataproc-initialization-actions-${REGION}/flink/flink.sh" +INIT_ACTIONS+=",gs://goog-dataproc-initialization-actions-${REGION}/beam/beam.sh" FLINK_SNAPSHOT="https://archive.apache.org/dist/flink/flink-1.5.3/flink-1.5.3-bin-hadoop28-scala_2.11.tgz" METADATA="beam-job-service-snapshot=<...>" METADATA+=",beam-image-enable-pull=true" @@ -123,9 +128,9 @@ METADATA+=",flink-start-yarn-session=true" METADATA+=",flink-snapshot-url=${FLINK_SNAPSHOT}" gcloud dataproc clusters create "${CLUSTER_NAME}" \ - --initialization-actions="${INIT_ACTIONS}" \ - --image-version="1.2" \ - --metadata="${METADATA}" + --initialization-actions "${INIT_ACTIONS}" \ + --image-version "1.2" \ + --metadata "${METADATA}" ``` The Beam Job Service runs on port `8099` of the master node. You can submit @@ -135,11 +140,11 @@ on the master node, upload the wordcount job binary, and then run: ```bash ./wordcount \ - --runner flink \ - --endpoint localhost:8099 \ - --experiments beam_fn_api \ - --output= \ - --container_image /go: + --runner flink \ + --endpoint localhost:8099 \ + --experiments beam_fn_api \ + --output= \ + --container_image /go: ``` The Beam Job Service port must be opened to submit beam jobs from machines diff --git a/bigdl/README.md b/bigdl/README.md index 485d9066f..382be5f5e 100644 --- a/bigdl/README.md +++ b/bigdl/README.md @@ -10,11 +10,19 @@ More information [project's website](https://analytics-zoo.github.io) ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with BigDL's Spark and PySpark libraries installed. +Because of the time needed to install BigDL on the cluster nodes, you need to set the +`--initialization-action-timeout 10m` property to prevent a cluster creation timeout. + ``` -gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \ --initialization-action-timeout 10m ``` @@ -28,9 +36,12 @@ The URL should end in `-dist.zip`.
For example, for Dataproc 1.0 (Spark 1.6 and Scala 2.10) and BigDL v0.7.2: ``` -gcloud dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --image-version 1.0 \ - --initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \ --initialization-action-timeout 10m \ --metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/bigdl/dist-spark-1.6.2-scala-2.10.5-all/0.7.2/dist-spark-1.6.2-scala-2.10.5-all-0.7.2-dist.zip' ``` @@ -38,9 +49,12 @@ gcloud dataproc clusters create \ Or, for example, to download Analytics Zoo 0.4.0 with BigDL v0.7.2 for Dataproc 1.3 (Spark 2.3) use this: ``` -gcloud dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --image-version 1.3 \ - --initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigdl/bigdl.sh \ --initialization-action-timeout 10m \ --metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/zoo/analytics-zoo-bigdl_0.7.2-spark_2.3.1/0.4.0/analytics-zoo-bigdl_0.7.2-spark_2.3.1-0.4.0-dist-all.zip' ``` diff --git a/bigtable/README.MD b/bigtable/README.MD index 68cae3adf..611e46104 100644 --- a/bigtable/README.MD +++ b/bigtable/README.MD @@ -1,17 +1,22 @@ # Google Cloud Bigtable via Apache HBase This initialization action installs Apache HBase libraries and the [Google Cloud Bigtable](https://cloud.google.com/bigtable/) [HBase Client](https://github.com/GoogleCloudPlatform/cloud-bigtable-client). - ## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a Dataproc cluster configured to connect to Cloud Bigtable: 1. Create a Bigtable instance by following [these directions](https://cloud.google.com/bigtable/docs/creating-instance). 1. Using the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/bigtable/bigtable.sh \ - --metadata bigtable-instance= + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/bigtable/bigtable.sh \ + --metadata bigtable-instance= ``` 1. The cluster will have HBase libraries, the Bigtable client, and the [Apache Spark - Apache HBase Connector](https://github.com/hortonworks-spark/shc) installed. 1. In addition to running Hadoop and Spark jobs, you can SSH to the master (`gcloud compute ssh -m`) and use `hbase shell` to [connect](https://cloud.google.com/bigtable/docs/installing-hbase-shell#connect) to your Bigtable instance. @@ -28,8 +33,14 @@ You can use this initialization action to create a Dataproc cluster configured t ``` 1. Submit the jar with dependecies as a Dataproc job. Note that `OUTPUT_TABLE` should not already exist. This job will create the table with the correct column family. 
- ```bash - gcloud dataproc jobs submit hadoop --cluster --class com.example.bigtable.sample.WordCountDriver --jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar -- wordcount-hbase gs://$MY_BUCKET/README.md + ```bash + REGION= + CLUSTER_NAME= + gcloud dataproc jobs submit hadoop --cluster ${CLUSTER_NAME} \ + --class com.example.bigtable.sample.WordCountDriver \ + --jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar \ + -- \ + wordcount-hbase gs://goog-dataproc-initialization-actions-${REGION}/README.md ``` ## Running an example Spark job on cluster using SHC diff --git a/cloud-sql-proxy/README.MD b/cloud-sql-proxy/README.MD index fbdfd3851..0d77ad57e 100644 --- a/cloud-sql-proxy/README.MD +++ b/cloud-sql-proxy/README.MD @@ -8,6 +8,8 @@ metadata on a given Cloud SQL instance. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + Prerequisite: If this is your first time using Cloud SQL, enable the [Cloud SQL Admin API](https://cloud.google.com/sql/docs/mysql/admin-api/#enabling_the_api) before continuing. @@ -22,7 +24,7 @@ same region. You can use this initialization action to create a Dataproc cluster using a shared hive metastore. -1. Use the `gcloud` command to create a new 2nd generation Cloud SQL intance +1. Use the `gcloud` command to create a new 2nd generation Cloud SQL instance (or use a previously created instance). ```bash @@ -39,12 +41,17 @@ shared hive metastore. action. ```bash - gcloud dataproc clusters create \ - --region \ + HIVE_DATA_BUCKET= + PROJECT_ID= + REGION= + INSTANCE_NAME= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes sql-admin \ - --initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \ - --properties hive:hive.metastore.warehouse.dir=gs:///hive-warehouse \ - --metadata "hive-metastore-instance=::" + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \ + --properties hive:hive.metastore.warehouse.dir=gs://${HIVE_DATA_BUCKET}/hive-warehouse \ + --metadata "hive-metastore-instance=${PROJECT_ID}:${REGION}:${INSTANCE_NAME}" ``` a. Optionally add other instances, paired with distinct TCP ports for further @@ -67,10 +74,15 @@ shared hive metastore. 1. Create another dataproc cluster with the same Cloud SQL metastore. ```bash - gcloud dataproc clusters create \ + PROJECT_ID= + REGION= + INSTANCE_NAME= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes sql-admin \ - --initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \ - --metadata "hive-metastore-instance=::" + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \ + --metadata "hive-metastore-instance=${PROJECT_ID}:${REGION}:${INSTANCE_NAME}" ``` 1. The two clusters should now be sharing Hive Tables and Spark SQL Dataframes @@ -103,11 +115,16 @@ write to Cloud SQL. Set the `enable-cloud-sql-hive-metastore` metadata key to `additional-cloud-sql-instances` to install one or more proxies.
For example: ```bash -gcloud dataproc clusters create \ +PROJECT_ID= +REGION= +INSTANCE_NAME= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes sql-admin \ - --initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \ --metadata "enable-cloud-sql-hive-metastore=false" \ - --metadata "additional-cloud-sql-instances=::" + --metadata "additional-cloud-sql-instances=${PROJECT_ID}:${REGION}:${INSTANCE_NAME}" ``` ## Private IP Clusters and Cloud SQL Instances @@ -178,14 +195,20 @@ additional setup. the following: ```bash - gcloud dataproc clusters create \ + HIVE_DATA_BUCKET= + PROJECT_ID= + REGION= + INSTANCE_NAME= + CLUSTER_NAME= + SUBNET= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes sql-admin \ - --initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \ - --properties hive:hive.metastore.warehouse.dir=gs:///hive-warehouse \ - --metadata "hive-metastore-instance=::" \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \ + --properties hive:hive.metastore.warehouse.dir=gs://${HIVE_DATA_BUCKET}/hive-warehouse \ + --metadata "hive-metastore-instance=${PROJECT_ID}:${REGION}:${INSTANCE_NAME}" \ --metadata "use-cloud-sql-private-ip=true" \ - --subnet \ - --region \ + --subnet ${SUBNET} \ --no-address ``` @@ -197,7 +220,7 @@ additional setup. **Important notes:** - Make sure to pass the flag `--metadata=use-cloud-sql-private-ip=true`. This + Make sure to pass the flag `--metadata use-cloud-sql-private-ip=true`. This tells the Cloud SQL proxy to use the private IP address of the Cloud SQL instance, not the public one. @@ -272,26 +295,41 @@ the `hive` user does not already exist in MySQL. 
Proceed as follows: following command to specify both the `root` and `hive` passwords: ```bash - gcloud dataproc clusters create \ + HIVE_DATA_BUCKET= + SECRETS_BUCKET= + PROJECT_ID= + REGION= + INSTANCE_NAME= + CLUSTER_NAME= + SUBNET= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes cloud-platform \ - --initialization-actions gs:///cloud-sql-proxy.sh \ - --properties hive:hive.metastore.warehouse.dir=gs:///hive-warehouse \ - --metadata "hive-metastore-instance=::" \ - --metadata "kms-key-uri=projects//locations/global/keyRings/my-key-ring/cryptoKeys/my-key" \ - --metadata "db-admin-password-uri=gs:///admin-password.encrypted" \ - --metadata "db-hive-password-uri=gs:///hive-password.encrypted" + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \ + --properties hive:hive.metastore.warehouse.dir=gs://${HIVE_DATA_BUCKET}/hive-warehouse \ + --metadata "hive-metastore-instance=${PROJECT_ID}:${REGION}:${INSTANCE_NAME}" \ + --metadata "kms-key-uri=projects/${PROJECT_ID}/locations/global/keyRings/my-key-ring/cryptoKeys/my-key" \ + --metadata "db-admin-password-uri=gs://${SECRETS_BUCKET}/admin-password.encrypted" \ + --metadata "db-hive-password-uri=gs://${SECRETS_BUCKET}/hive-password.encrypted" ``` If you have already created a `hive` user in MySQL, use the following command, which does not require the `root` password: ```bash - gcloud dataproc clusters create \ + SECRETS_BUCKET= + PROJECT_ID= + REGION= + INSTANCE_NAME= + CLUSTER_NAME= + SUBNET= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes cloud-platform \ - --initialization-actions gs:///cloud-sql-proxy.sh \ - --metadata "hive-metastore-instance=::" \ - --metadata "kms-key-uri=projects//locations/global/keyRings/my-key-ring/cryptoKeys/my-key" \ - --metadata "db-hive-password-uri=gs:///hive-password.encrypted" + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \ + --metadata "hive-metastore-instance=${PROJECT_ID}:${REGION}:${INSTANCE_NAME}" \ + --metadata "kms-key-uri=projects/${PROJECT_ID}/locations/global/keyRings/my-key-ring/cryptoKeys/my-key" \ + --metadata "db-hive-password-uri=gs://${SECRETS_BUCKET}/hive-password.encrypted" ``` 8. Upgrading schema (create cluster step failed on new Dataproc version): diff --git a/conda/README.MD b/conda/README.MD index a0e241e45..003f8ab5d 100644 --- a/conda/README.MD +++ b/conda/README.MD @@ -23,11 +23,17 @@ Please see the following tutorial for full details https://cloud.google.com/data ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + ### Just install and configure conda environment ``` -gcloud dataproc clusters create foo --initialization-actions \ - gs://$MY_BUCKET/conda/bootstrap-conda.sh,gs://$MY_BUCKET/conda/install-conda-env.sh +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions \ + gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh ``` ### Install extra conda and/or pip packages @@ -35,17 +41,23 @@ You can add extra packages by using the metadata entries `CONDA_PACKAGES` and `PIP_PACKAGES`. These variables provide a space separated list of additional packages to install.
``` -gcloud dataproc clusters create foo \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --metadata 'CONDA_PACKAGES="numpy pandas",PIP_PACKAGES=pandas-gbq' \ --initialization-actions \ - gs://$MY_BUCKET/conda/bootstrap-conda.sh,gs://$MY_BUCKET/conda/install-conda-env.sh + gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh ``` Alternatively, you can use environment variables, e.g.: ``` -gcloud dataproc clusters create foo \ - --initialization-actions gs:///create-my-cluster.sh +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/create-my-cluster.sh ``` Where `create-my-cluster.sh` specifies a list of conda and/or pip packages to install: @@ -53,8 +65,8 @@ Where `create-my-cluster.sh` specifies a list of conda and/or pip packages to in ``` #!/usr/bin/env bash -gsutil -m cp -r gs://$MY_BUCKET/conda/bootstrap-conda.sh . -gsutil -m cp -r gs://$MY_BUCKET/conda/install-conda-env.sh . +gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh . +gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh . chmod 755 ./*conda*.sh @@ -77,8 +89,8 @@ CONDA_ENV_YAML_GSC_LOC="gs://my-bucket/path/to/conda-environment.yml" CONDA_ENV_YAML_PATH="/root/conda-environment.yml" echo "Downloading conda environment at $CONDA_ENV_YAML_GSC_LOC to $CONDA_ENV_YAML_PATH ... " gsutil -m cp -r $CONDA_ENV_YAML_GSC_LOC $CONDA_ENV_YAML_PATH -gsutil -m cp -r gs://$MY_BUCKET/conda/bootstrap-conda.sh . -gsutil -m cp -r gs://$MY_BUCKET/conda/install-conda-env.sh . +gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh . +gsutil -m cp -r gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh . chmod 755 ./*conda*.sh diff --git a/connectors/README.md b/connectors/README.md index f1c757329..b7ed17cb5 100644 --- a/connectors/README.md +++ b/connectors/README.md @@ -16,12 +16,17 @@ on a [Google Cloud Dataproc](https://cloud.google.com/dataproc) cluster. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. 
+ You can use this initialization action to create a new Dataproc cluster with specific version of Google Cloud Storage and BigQuery connector installed: ``` -gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/connectors/connectors.sh \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \ --metadata gcs-connector-version=2.0.0 \ --metadata bigquery-connector-version=1.0.0 ``` @@ -42,15 +47,21 @@ For example: specified, then Google Cloud Storage connector will be updated to 1.7.0 version and BigQuery connector will be updated to 0.11.0 version: ``` - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/connectors/connectors.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \ --metadata gcs-connector-version=1.7.0 ``` * if Google Cloud Storage connector 1.8.0 version is specified and BigQuery connector version is not specified, then only Google Cloud Storage connector will be updated to 1.8.0 version and BigQuery - connector will be left intact: + connector will be left unchanged: ``` - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/connectors/connectors.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/connectors/connectors.sh \ --metadata gcs-connector-version=1.8.0 ``` diff --git a/datalab/README.md b/datalab/README.md index 23adc2e2c..97559c8fd 100644 --- a/datalab/README.md +++ b/datalab/README.md @@ -6,13 +6,17 @@ Dataproc cluster. You will need to connect to Datalab using an SSH tunnel. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + 1. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/datalab/datalab.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/datalab/datalab.sh \ --scopes cloud-platform ``` @@ -40,10 +44,13 @@ must be at the same minor version. Currently, Datalab uses Python 3.5. Here is how to set up Python 3.5 on workers: ```bash -gcloud dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --metadata 'CONDA_PACKAGES="python==3.5"' \ --scopes cloud-platform \ - --initialization-actions gs://$MY_BUCKET/conda/bootstrap-conda.sh,gs://$MY_BUCKET/conda/install-conda-env.sh,gs://$MY_BUCKET/datalab/datalab.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh,gs://goog-dataproc-initialization-actions-${REGION}/datalab/datalab.sh ``` In effect, this means that a particular Datalab-on-Dataproc cluster can only run @@ -64,11 +71,11 @@ Python 2 or Python 3 kernels, but not both. 
can cause problems on moderately small clusters. * If you [build your own Datalab images](https://github.com/googledatalab/datalab/wiki/Development-Environment), - you can specify `--metadata=docker-image=gcr.io//` to point + you can specify `--metadata docker-image=gcr.io//` to point to your image. * If you normally only run Datalab kernels on VMs and connect to them with a local Docker frontend, set the flag - `--metadata=docker-image=gcr.io/cloud-datalab/datalab-gateway` and then set + `--metadata docker-image=gcr.io/cloud-datalab/datalab-gateway` and then set `GATEWAY_VM` to your cluster's master node in your local `docker` command [as described here](https://cloud.google.com/datalab/docs/quickstarts/quickstart-gce#install_the_datalab_docker_container_on_your_computer). * You can pass Spark packages as a comma separated list with `--metadata diff --git a/docker/README.md b/docker/README.md index 1b6ac60e6..c08d15de0 100644 --- a/docker/README.md +++ b/docker/README.md @@ -8,13 +8,17 @@ applications can access Docker. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + 1. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``: + action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/docker/docker.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/docker/docker.sh ``` 1. Docker is installed and configured on all nodes of the cluster (both master diff --git a/dr-elephant/README.MD b/dr-elephant/README.MD index 8c2d60980..d3a824034 100644 --- a/dr-elephant/README.MD +++ b/dr-elephant/README.MD @@ -5,12 +5,17 @@ dataproc clusters. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Dr. Elephant installed. ```bash -gcloud dataproc clusters ${CLUSTER_NAME} \ - --initialization-actions=gs://$MY_BUCKET/dr-elephant/dr-elephant.sh +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/dr-elephant/dr-elephant.sh ``` Once the cluster has been created, Dr. Elephant is configured to run on port diff --git a/drill/README.md b/drill/README.md index 093a4fda7..1be2d8deb 100644 --- a/drill/README.md +++ b/drill/README.md @@ -4,6 +4,8 @@ This initialization action installs [Apache Drill](http://drill.apache.org) on a ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + Check the variables set in the script to ensure they're to your liking. 1. Use the `gcloud` command to create a new cluster with Drill installed. Run one of the following commands depending on your desired cluster type. @@ -11,25 +13,33 @@ Check the variables set in the script to ensure they're to your liking. 
Standard cluster (requires Zookeeper init action) ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/zookeeper/zookeeper.sh \ - --initialization-actions gs://$MY_BUCKET/drill/drill.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/zookeeper/zookeeper.sh,gs://goog-dataproc-initialization-actions-${REGION}/drill/drill.sh ``` High availability cluster (Zookeeper comes pre-installed) ```bash - gcloud dataproc clusters create \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --num-masters 3 \ - --initialization-actions gs://$MY_BUCKET/drill/drill.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/drill/drill.sh ``` Single node cluster (Zookeeper is unnecessary) ```bash - gcloud dataproc clusters create \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --single-node \ - --initialization-actions gs://$MY_BUCKET/drill/drill.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/drill/drill.sh ``` 1. Once the cluster has been created, Drillbits will start on all nodes. You can log into any node of the cluster to run Drill queries. Drill is installed in `/usr/lib/drill` (unless you change the setting) which contains a `bin` directory with `sqlline`. diff --git a/flink/README.md b/flink/README.md index 0c42a6090..af0bd755b 100644 --- a/flink/README.md +++ b/flink/README.md @@ -5,11 +5,16 @@ Flink and start a Flink session running on YARN. ## Using this initialization action -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/flink/flink.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/flink/flink.sh ``` 1. You can log into the master node of the cluster to submit jobs to Flink. Flink is installed in `/usr/lib/flink` (unless you change the setting) which contains a `bin` directory with Flink. **Note** - you need to specify `HADOOP_CONF_DIR=/etc/hadoop/conf` before your Flink commands for them to execute properly. diff --git a/ganglia/README.MD b/ganglia/README.MD index d7eb387fc..628adce43 100644 --- a/ganglia/README.MD +++ b/ganglia/README.MD @@ -4,11 +4,16 @@ This initialization action installs [Ganglia](http://ganglia.info/), a scalable ## Using this initialization action -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + +1. Use the `gcloud` command to create a new cluster with this initialization action. 
```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/ganglia/ganglia.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/ganglia/ganglia.sh ``` 1. Once the cluster has been created, Ganglia is served on port `80` on the master node at `/ganglia`. To connect to the Ganglia web interface, you will need to create an SSH tunnel and use a SOCKS 5 Proxy with your web browser as described in the [dataproc web interfaces](https://cloud.google.com/dataproc/cluster-web-interfaces) documentation. In the opened web browser, go to `http://CLUSTER_NAME-m/ganglia` on Standard/Single Node clusters, or `http://CLUSTER_NAME-m-0/ganglia` on High Availability clusters. diff --git a/gobblin/README.md b/gobblin/README.md index 368f558a0..09b6e8924 100644 --- a/gobblin/README.md +++ b/gobblin/README.md @@ -8,13 +8,18 @@ The distribution is hosted in Dataproc-team owned Google Cloud Storage bucket `g ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Gobblin installed by: -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/gobblin/gobblin.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gobblin/gobblin.sh ``` 1. Submit jobs @@ -22,7 +27,8 @@ You can use this initialization action to create a new Dataproc cluster with Gob ```bash gcloud dataproc jobs submit hadoop --cluster= \ --class org.apache.gobblin.runtime.mapreduce.CliMRJobLauncher \ - --properties mapreduce.job.user.classpath.first=true -- \ + --properties mapreduce.job.user.classpath.first=true \ + -- \ -sysconfig /usr/local/lib/gobblin/conf/gobblin-mapreduce.properties \ -jobconfig gs:// ``` diff --git a/gpu/README.md b/gpu/README.md index f42e846a8..28d93489c 100644 --- a/gpu/README.md +++ b/gpu/README.md @@ -10,34 +10,40 @@ GPU drivers for NVIDIA on master and workers node in a ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with GPU support: this initialization action will install GPU drivers and CUDA. If you need a more recent GPU driver please visit NVIDIA [site](https://www.nvidia.com/Download/index.aspx?lang=en-us). 1. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - `` and install GPU drivers. + action. 
```bash - gcloud beta dataproc clusters create \ - --master-accelerator type=nvidia-tesla-v100 \ - --worker-accelerator type=nvidia-tesla-v100,count=4 \ - --initialization-actions gs://$MY_BUCKET/gpu/install_gpu_driver.sh \ - --metadata install_gpu_agent=false + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --master-accelerator type=nvidia-tesla-v100 \ + --worker-accelerator type=nvidia-tesla-v100,count=4 \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \ + --metadata install_gpu_agent=false ``` 2. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``, install GPU drivers and add the GPU monitoring service. + action. The following command will create a new cluster, install GPU drivers and add the GPU monitoring service. ```bash - gcloud beta dataproc clusters create \ - --master-accelerator type=nvidia-tesla-v100 \ - --worker-accelerator type=nvidia-tesla-v100,count=4 \ - --initialization-actions gs://$MY_BUCKET/gpu/install_gpu_driver.sh \ - --metadata install_gpu_agent=true \ - --scopes https://www.googleapis.com/auth/monitoring.write + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --master-accelerator type=nvidia-tesla-v100 \ + --worker-accelerator type=nvidia-tesla-v100,count=4 \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \ + --metadata install_gpu_agent=true \ + --scopes https://www.googleapis.com/auth/monitoring.write ``` #### Supported metadata parameters: diff --git a/hbase/README.md b/hbase/README.md index 6e5ee4b11..8a5cd9465 100644 --- a/hbase/README.md +++ b/hbase/README.md @@ -5,16 +5,20 @@ clusters. Apache HBase is a distributed and scalable Hadoop database. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Apache HBase installed on every node: 1. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/hbase/hbase.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/hbase/hbase.sh \ --num-masters 3 --num-workers 2 ``` @@ -28,17 +32,17 @@ Apache HBase installed on every node: command: ```bash - gcloud compute ssh -m-0 -- -L 16010:-m-0:16010 + gcloud compute ssh ${CLUSTER_NAME}-m-0 -- -L 16010:-m-0:16010 ``` Then just open a browser and type `localhost:16010` address. 1. HBase running on Dataproc can be easily scaled up. The following command will add three additional workers (RegionServers) to previously created - cluster named ``. + cluster named `${CLUSTER_NAME}`. ```bash - gcloud dataproc clusters update --num-workers 5 + gcloud dataproc clusters update ${CLUSTER_NAME} --region ${REGION} --num-workers 5 ``` ## Using different storage for HBase data @@ -51,8 +55,11 @@ metadata during the cluster creation process. path to your storage bucket. 
```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/hbase/hbase.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/hbase/hbase.sh \ --metadata 'hbase-root-dir=gs:///' \ --metadata 'hbase-wak-dir=hdfs://path/to/wal' \ --num-masters 3 --num-workers 2 @@ -72,16 +79,19 @@ necessary configurations and creates all keytabs necessary for HBase. new cluster provisioning with the same cluster name. ```bash - gcloud beta dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/hbase/hbase.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/hbase/hbase.sh \ --metadata 'enable-kerberos=true,keytab-bucket=gs://' \ --num-masters 3 --num-workers 2 \ - --kerberos-root-principal-password-uri="Cloud Storage URI of KMS-encrypted password for Kerberos root principal" \ - --kerberos-kms-key="The URI of the KMS key used to decrypt the root password" \ - --image-version=1.3 + --kerberos-root-principal-password-uri "Cloud Storage URI of KMS-encrypted password for Kerberos root principal" \ + --kerberos-kms-key "The URI of the KMS key used to decrypt the root password" \ + --image-version 1.3 ``` -1. Login to master `-m-0` and add a principal to Kerberos key +1. Login to master `${CLUSTER_NAME}-m-0` and add a principal to Kerberos key distribution center to authenticate for HBase. ```bash @@ -108,7 +118,7 @@ necessary configurations and creates all keytabs necessary for HBase. pass additional init action when creating HBase standard cluster: ```bash - --initialization-actions gs://$MY_BUCKET/zookeeper/zookeeper.sh,gs://$MY_BUCKET/hbase/hbase.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/zookeeper/zookeeper.sh,gs://goog-dataproc-initialization-actions-${REGION}/hbase/hbase.sh ``` - The Kerberos version of this initialization action should be used in the HA diff --git a/hive-hcatalog/README.md b/hive-hcatalog/README.md index a9848f3dd..03e1c5931 100644 --- a/hive-hcatalog/README.md +++ b/hive-hcatalog/README.md @@ -4,13 +4,18 @@ This initialization action installs [Hive HCatalog](https://cwiki.apache.org/con ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Cloud Dataproc cluster with HCatalog installed by doing the following. -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/hive-hcatalog/hive-hcatalog.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/hive-hcatalog/hive-hcatalog.sh ``` 1. Once the cluster has been created HCatalog should be installed and configured for use with Pig. diff --git a/hue/README.md b/hue/README.md index 0d6ca737b..08cb521e3 100644 --- a/hue/README.md +++ b/hue/README.md @@ -6,16 +6,20 @@ a [Google Cloud Dataproc](https://cloud.google.com/dataproc) cluster. 
## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Hue installed: 1. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/hue/hue.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/hue/hue.sh ``` 1. Once the cluster has been created, Hue is configured to run on port `8888` diff --git a/jupyter/README.MD b/jupyter/README.MD index d70b74557..691d426bd 100644 --- a/jupyter/README.MD +++ b/jupyter/README.MD @@ -19,19 +19,22 @@ without using SSH tunnels. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Jupyter installed: 1. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. ```bash # Simple one-liner; just use all default settings for your cluster. # Jupyter will run on port 8123 of your master node. - CLUSTER= - gcloud dataproc clusters create $CLUSTER \ - --initialization-actions gs://$MY_BUCKET/jupyter/jupyter.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/jupyter/jupyter.sh ``` 1. Run `./launch-jupyter-interface` to connect to the Jupyter notebook running @@ -57,13 +60,15 @@ For example to specify a different port and specify additional packages to install: ```bash -CLUSTER= -gcloud dataproc clusters create $CLUSTER \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --metadata "JUPYTER_PORT=8124,JUPYTER_CONDA_PACKAGES=numpy:pandas:scikit-learn" \ - --initialization-actions gs://$MY_BUCKET/jupyter/jupyter.sh \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/jupyter/jupyter.sh \ --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \ - --worker-machine-type=n1-standard-4 \ - --master-machine-type=n1-standard-4 + --worker-machine-type n1-standard-4 \ + --master-machine-type n1-standard-4 ``` Notebooks are stored and retrieved from the cluster staging bucket (Google Cloud @@ -81,7 +86,7 @@ by doing the following: - clones the GCS bucket or GitHub repo specified in the `INIT_ACTIONS_REPO` and `INIT_ACTIONS_BRANCH` (for GitHub repo) metadata keys - if `INIT_ACTIONS_REPO` metadata key is not set during cluster creation, - the default value `gs:///dataproc-initialization-actions` is used + the default value `gs://dataproc-initialization-actions` is used - this is provided so that a fork of the main repo can easily be used, eg, during development - executes `conda/bootstrap-conda.sh` from said repo/branch to ensure @@ -122,7 +127,7 @@ notebook process running on master node. 
## Important notes -* This initialization action clones `gs://$MY_BUCKET` GCS +* This initialization action clones `gs://goog-dataproc-initialization-actions-${REGION}` GCS bucket to run other scripts in the repo. If you plan to copy `jupyter.sh` to your own GCS bucket, you will also need to fork this repository and specify the `INIT_ACTIONS_REPO` metadata key. diff --git a/jupyter2/README.md b/jupyter2/README.md index ffc04a853..79df86ec8 100644 --- a/jupyter2/README.md +++ b/jupyter2/README.md @@ -6,11 +6,16 @@ __Use the Dataproc [Jupyter Optional Component](https://cloud.google.com/datapro ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + Usage is similar to the original `jupyter` init action. ``` -gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/jupyter2/jupyter2.sh +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/jupyter2/jupyter2.sh ``` ### Options @@ -24,9 +29,12 @@ A few of same options are supported here: For example: ``` -gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/jupyter2/jupyter2.sh \ - --bucket gs://mybucket \ - --metadata JUPYTER_PORT=80,JUPYTER_AUTH_TOKEN=mytoken +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/jupyter2/jupyter2.sh \ + --bucket gs://mybucket \ + --metadata JUPYTER_PORT=80,JUPYTER_AUTH_TOKEN=mytoken ``` diff --git a/jupyter_sparkmonitor/README.MD b/jupyter_sparkmonitor/README.MD index e817614d5..e0ce402bf 100644 --- a/jupyter_sparkmonitor/README.MD +++ b/jupyter_sparkmonitor/README.MD @@ -17,18 +17,23 @@ Component. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with SparkMonitor installed: 1. Use the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. ```bash # Jupyter will run on port 8123 of your master node. - gcloud dataproc clusters create \ - --optional-components ANACONDA,JUPYTER --enable-component-gateway \ - --initialization-actions gs://$MY_BUCKET/jupyter_sparkmonitor/sparkmonitor.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --optional-components ANACONDA,JUPYTER \ + --enable-component-gateway \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/jupyter_sparkmonitor/sparkmonitor.sh ``` 1. To access to the Jupyter web interface, you can just use the Component diff --git a/kafka/README.MD b/kafka/README.MD index be57a7dab..6bfbc5c42 100644 --- a/kafka/README.MD +++ b/kafka/README.MD @@ -6,15 +6,20 @@ By default, Kafka brokers run only on all worker nodes in the cluster, and Kafka ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Kafka installed: -1. Use the `gcloud` command to create a new cluster with this initialization action. 
The following command will create a new high availability cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --num-masters 3 \ --metadata "run-on-master=true" \ - --initialization-actions gs://$MY_BUCKET/kafka/kafka.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/kafka/kafka.sh ``` 1. You can test your Kafka setup by creating a simple topic and publishing to it with Kafka's command-line tools, after SSH'ing into one of your nodes: @@ -29,9 +34,9 @@ You can use this initialization action to create a new Dataproc cluster with Kaf # Use worker 0 as broker to publish 100 messages over 100 seconds # asynchronously. CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name) - for i in {0..100}; do echo "message${i}"; sleep 1; done | \ + for i in {0..100}; do echo "message${i}"; sleep 1; done | /usr/lib/kafka/bin/kafka-console-producer.sh \ - --broker-list ${CLUSTER_NAME}-w-0:9092 --topic test & + --broker-list ${CLUSTER_NAME}-w-0:9092 --topic test & # User worker 1 as broker to consume those 100 messages as they come. # This can also be run in any other master or worker node of the cluster. @@ -47,11 +52,14 @@ You can find more information about using initialization actions with Dataproc i If you would like to use [Kafka Manager](https://github.com/yahoo/kafka-manager) to manage your Kafka cluster through web UI, use the `gcloud` command to create a new Kafka cluster with the Kafka Manager initialization action. The following command will create a new high availability Kafka cluster with Kafka Manager running on the first master node. The default HTTP port for Kafka Manager is 9000. Follow the instructions at [Dataproc cluster web interfaces](https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces) to access the web UI. ```bash -gcloud dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --num-masters 3 \ --metadata "run-on-master=true" \ --metadata "kafka-enable-jmx=true" \ - --initialization-actions gs://$MY_BUCKET/kafka/kafka.sh,gs://$MY_BUCKET/kafka/kafka-manager.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/kafka/kafka.sh,gs://goog-dataproc-initialization-actions-${REGION}/kafka/kafka-manager.sh ``` ## Installing Cruise Control with Kafka @@ -59,10 +67,13 @@ gcloud dataproc clusters create \ If you would like to use [Cruise Control](https://github.com/linkedin/cruise-control) to automate common Kafka operations, e.g., automatically fixing under-replicated partitions caused by broker failures, use the `gcloud` command to create a new Kafka cluster with the Cruise Control initialization action. The following command will create a new high availability Kafka cluster with Cruise Control running on the first master node. The default HTTP port for Cruise Control is 9090. Follow the instructions at [Dataproc cluster web interfaces](https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces) to access the web UI. 
```bash -gcloud dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --num-masters 3 \ --metadata "run-on-master=true" \ - --initialization-actions gs://$MY_BUCKET/kafka/kafka.sh,gs://$MY_BUCKET/kafka/cruise-control.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/kafka/kafka.sh,gs://goog-dataproc-initialization-actions-${REGION}/kafka/cruise-control.sh ``` ## Installing Prometheus with Kafka @@ -70,11 +81,14 @@ gcloud dataproc clusters create \ If you would like to use [Prometheus](https://github.com/prometheus/prometheus) to monitor your Kafka cluster, use the `gcloud` command to create a new Kafka cluster with the Prometheus initialization action. The following command will create a new high availability Kafka cluster with Prometheus server listening on port 9096 of each node. Follow the instructions at [Dataproc cluster web interfaces](https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces) to access the web UI. ```bash -gcloud dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --num-masters 3 \ --metadata "run-on-master=true" \ --metadata "prometheus-http-port=9096" \ - --initialization-actions gs://$MY_BUCKET/kafka/kafka.sh,gs://$MY_BUCKET/prometheus/prometheus.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/kafka/kafka.sh,gs://goog-dataproc-initialization-actions-${REGION}/prometheus/prometheus.sh ``` ## Important notes diff --git a/kafka/kafka.sh b/kafka/kafka.sh index c6ab0401d..1edcf71d5 100755 --- a/kafka/kafka.sh +++ b/kafka/kafka.sh @@ -121,7 +121,7 @@ function install_and_configure_kafka_server() { # If all attempts failed, error out. if [[ -z "${zookeeper_list}" ]]; then - err 'Failed to find configured Zookeeper list; try --num-masters=3 for HA' + err 'Failed to find configured Zookeeper list; try "--num-masters=3" for HA' fi ZOOKEEPER_ADDRESS="${zookeeper_list%%,*}" diff --git a/livy/README.md b/livy/README.md index 51c92f173..9d364263f 100644 --- a/livy/README.md +++ b/livy/README.md @@ -7,6 +7,8 @@ installs version 0.6.0 (version 0.5.0 for Dataproc 1.0 and 1.1) of ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Livy installed: @@ -14,8 +16,11 @@ Livy installed: action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/livy/livy.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/livy/livy.sh ``` 1. Once the cluster has been created, Livy is configured to run on port `8998` diff --git a/oozie/README.MD b/oozie/README.MD index 95379841f..683c6404a 100644 --- a/oozie/README.MD +++ b/oozie/README.MD @@ -10,13 +10,18 @@ This initialization action installs the [Oozie](http://oozie.apache.org) workflo ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Oozie installed: 1. Use the `gcloud` command to create a new cluster with this initialization action. 
The following command will create a new cluster named ``: ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/oozie/oozie.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/oozie/oozie.sh ``` 1. Once the cluster has been created Oozie should be running on the master node. diff --git a/openssl/README.md b/openssl/README.md index a9b48a5c6..9fa8b2e3c 100644 --- a/openssl/README.md +++ b/openssl/README.md @@ -5,12 +5,18 @@ from Jessie-backports for Dataproc clusters running Dataproc 1.0 through 1.2 with debian 8. This init action is unnecessary on debian 9. ## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with the backports version of OpenSSL using the following command: ```bash -gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/openssl/openssl.sh +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/openssl/openssl.sh ``` ## Important notes diff --git a/post-init/README.md b/post-init/README.md index c4a36654c..503db88c5 100644 --- a/post-init/README.md +++ b/post-init/README.md @@ -8,6 +8,8 @@ or submitting a job. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + For now, the script relies on polling the Dataproc API to determine an authoritative state of cluster health on startup, so requires the `--scopes cloud-platform` flag; do not use this initialization action if you are unwilling to grant your Dataproc clusters' service @@ -25,16 +27,20 @@ logs also in the `/var/log` directory. 
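Before wiring up a real post-init command, it can help to confirm that the hook ran at all. A minimal sketch (not part of this repo's scripts) of inspecting the master node over SSH — the cluster name and zone are placeholders, and the exact log file written under `/var/log` is not pinned down here, so list the directory first:

```bash
# Hypothetical cluster name and zone - substitute your own values.
CLUSTER_NAME=my-cluster
ZONE=us-east1-b

# See which logs exist, then tail the most likely candidate; adjust the file
# name to whatever master-post-init.sh actually writes on your image version.
gcloud compute ssh "${CLUSTER_NAME}-m" --zone "${ZONE}" \
  --command 'sudo ls -lt /var/log | head -n 20 && sudo tail -n 50 /var/log/daemon.log'
```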
Simply modify the `POST_INIT_COMMAND` to whatever actual job submission command you want to run: + export REGION= export CLUSTER_NAME=${USER}-shortlived-cluster export POST_INIT_COMMAND=" \ gcloud dataproc jobs submit hadoop \ + --region ${REGION} \ --cluster ${CLUSTER_NAME} \ --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \ + -- \ teragen 10000 /tmp/teragen; \ - gcloud dataproc clusters delete -q ${CLUSTER_NAME}" + gcloud dataproc clusters delete -q ${CLUSTER_NAME} --region ${REGION}" gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes cloud-platform \ - --initialization-actions gs://$MY_BUCKET/post-init/master-post-init.sh \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/post-init/master-post-init.sh \ --metadata post-init-command="${POST_INIT_COMMAND}" @@ -46,13 +52,16 @@ the cluster will only be deleted if the job was successful: export CLUSTER_NAME=${USER}-shortlived-cluster export POST_INIT_COMMAND=" \ gcloud dataproc jobs submit hadoop \ + --region ${REGION} \ --cluster ${CLUSTER_NAME} \ --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \ + -- \ teragen 10000 /tmp/teragen && \ - gcloud dataproc clusters delete -q ${CLUSTER_NAME}" + gcloud dataproc clusters delete -q ${CLUSTER_NAME} --region ${REGION}" gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes cloud-platform \ - --initialization-actions gs://$MY_BUCKET/post-init/master-post-init.sh \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/post-init/master-post-init.sh \ --metadata post-init-command="${POST_INIT_COMMAND}" ## Create a cluster which automatically resubmits a job on restart or job termination (e.g. for streaming processing) @@ -71,14 +80,17 @@ to be orphaned from a failed job driver/client program; this makes it safe to re the job on reboot without manually tracking down orphaned YARN applications which may be consuming resources for the job that you want to resubmit. + export REGION= export CLUSTER_NAME=${USER}-longrunning-job-cluster export POST_INIT_COMMAND=" \ while true; do \ gcloud dataproc jobs submit spark \ + --region ${REGION} \ --cluster ${CLUSTER_NAME} \ --jar gs://${BUCKET}/my-longlived-job.jar foo args; \ done" gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --scopes cloud-platform \ - --metadata startup-script-url=gs://$MY_BUCKET/post-init/master-post-init.sh,post-init-command="${POST_INIT_COMMAND}" + --metadata startup-script-url=gs://goog-dataproc-initialization-actions-${REGION}/post-init/master-post-init.sh,post-init-command="${POST_INIT_COMMAND}" diff --git a/presto/README.MD b/presto/README.MD index 821655507..d8e00dfb2 100644 --- a/presto/README.MD +++ b/presto/README.MD @@ -8,16 +8,20 @@ Dataproc workers will be Presto workers. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Presto installed: 1. Using the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. 
```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/presto/presto.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh ``` 1. Once the cluster has been created, Presto is configured to run on port diff --git a/prometheus/README.md b/prometheus/README.md index b20a25b4f..59a4d4d85 100644 --- a/prometheus/README.md +++ b/prometheus/README.md @@ -2,13 +2,19 @@ This script installs [Prometheus](https://prometheus.io/) on Dataproc clusters, performs necessary configurations and pulls metrics from Hadoop, Spark and Kafka if installed. Prometheus is a time series database that allows visualizing, querying metrics gathered from different cluster components during job execution. ## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Prometheus installed on every node: -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/prometheus/prometheus.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/prometheus/prometheus.sh ``` 1. Prometheus UI on the master node can be accessed after connecting with the command: ```bash diff --git a/python/README.md b/python/README.md index 7bf2ef3c3..e42cc0b8c 100644 --- a/python/README.md +++ b/python/README.md @@ -1,6 +1,8 @@ # Python setup and configuration tools -## Overview +## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. ## pip install packages @@ -13,17 +15,23 @@ Note: when using this initialization action with automation, pinning package ver Example 1: installing one package at head ``` -gcloud dataproc clusters create my-cluster \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --metadata 'PIP_PACKAGES=pandas' \ - --initialization-actions gs://$MY_BUCKET/python/pip-install.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh ``` Example 2: installing several packages with version selectors ``` -gcloud dataproc clusters create my-cluster \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \ - --initialization-actions gs://$MY_BUCKET/python/pip-install.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh ``` ## conda install packages @@ -34,15 +42,21 @@ as `CONDA_PACKAGES` metadata key. 
Packages are space separated and can contain v Example 1: installing one package at head ``` -gcloud dataproc clusters create my-cluster \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --metadata 'CONDA_PACKAGES=scipy' \ - --initialization-actions gs://$MY_BUCKET/python/conda-install.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh ``` Example 2: installing several packages with version selectors ``` -gcloud dataproc clusters create my-cluster \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --metadata 'CONDA_PACKAGES=scipy=0.15.0 curl=7.26.0' \ - --initialization-actions gs://$MY_BUCKET/python/conda-install.sh + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh ``` diff --git a/ranger/README.md b/ranger/README.md index 2582a9fd6..b555cfded 100644 --- a/ranger/README.md +++ b/ranger/README.md @@ -5,16 +5,21 @@ This initialization action installs [Apache Ranger](https://ranger.apache.org/) ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Apache Ranger installed: 1. Use the `gcloud` command to create a new cluster with this initialization action. -The following command will create a new standard cluster named `` with the Ranger Policy Manager accessible via user `admin` and ``. +The following command will create a new cluster with the Ranger Policy Manager accessible via user `admin` and ``. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/solr/solr.sh,\ - gs://$MY_BUCKET/ranger/ranger.sh \ - --metadata "default-admin-password=" + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions \ + gs://goog-dataproc-initialization-actions-${REGION}/solr/solr.sh,gs://goog-dataproc-initialization-actions-${REGION}/ranger/ranger.sh \ + --metadata "default-admin-password=" ``` 1. Once the cluster has been created Apache Ranger Policy Manager should be running on master node and use Solr in standalone mode for audits. 1. The Policy Manager Web UI is served by default on port 6080. You can login using username `admin` and password provided in metadata. diff --git a/rapids/README.md b/rapids/README.md index ef910de54..0bff0ca68 100644 --- a/rapids/README.md +++ b/rapids/README.md @@ -15,7 +15,7 @@ On the Dataproc worker nodes: - `dask-cuda-worker` -Our initialization action does the following: +This initialization action does the following: 1. [install nvidia GPU driver](internal/install-gpu-driver.sh) 1. [install RAPIDS](rapids.sh) - @@ -25,23 +25,25 @@ Our initialization action does the following: ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with RAPIDS installed: 1. Using the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. 
```bash - DATAPROC_BUCKET=dataproc-initialization-actions - - gcloud beta dataproc clusters create \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --master-accelerator type=nvidia-tesla-t4,count=4 \ --master-machine-type n1-standard-32 \ --worker-accelerator type=nvidia-tesla-t4,count=4 \ --worker-machine-type n1-standard-32 \ - --initialization-actions gs://$DATAPROC_BUCKET/rapids/rapids.sh \ - --optional-components=ANACONDA + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \ + --optional-components ANACONDA ``` 1. Once the cluster has been created, the Dask scheduler listens for workers on @@ -78,22 +80,23 @@ to install a different driver version, [find the appropriate driver download URL](https://www.nvidia.com/Download/index.aspx?lang=en-us) for your driver's `.run` file. -* `--metadata=gpu-driver-url=http://us.download.nvidia.com/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run` - +* `--metadata gpu-driver-url=http://us.download.nvidia.com/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run` - to specify alternate driver download URL. For example: ```bash -DATAPROC_BUCKET=dataproc-initialization-actions - -gcloud beta dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --master-accelerator type=nvidia-tesla-t4,count=4 \ --master-machine-type n1-standard-32 \ --worker-accelerator type=nvidia-tesla-t4,count=4 \ --worker-machine-type n1-standard-32 \ --metadata "gpu-driver-url=http://us.download.nvidia.com/tesla/410.104/NVIDIA-Linux-x86_64-410.104.run" \ - --initialization-actions gs://$DATAPROC_BUCKET/rapids/rapids.sh \ - --optional-components=ANACONDA + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \ + --optional-components ANACONDA ``` RAPIDS works with @@ -117,16 +120,17 @@ configurable via a metadata key using `--metadata`. For example: ```bash -DATAPROC_BUCKET=dataproc-initialization-actions - -gcloud beta dataproc clusters create \ +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ --master-accelerator type=nvidia-tesla-t4,count=4 \ --master-machine-type n1-standard-32 \ --worker-accelerator type=nvidia-tesla-t4,count=4 \ --worker-machine-type n1-standard-32 \ --metadata "run-cuda-worker-on-master=false" \ - --initialization-actions gs://$DATAPROC_BUCKET/rapids/rapids.sh \ - --optional-components=ANACONDA + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \ + --optional-components ANACONDA ``` #### Initialization Action Source @@ -144,7 +148,7 @@ GCS bucket by default: * RAPIDS init actions depend on the [Anaconda](https://cloud.google.com/dataproc/docs/concepts/components/anaconda) component, which should be included at cluster creation time via the - `--optional-components=ANACONDA` argument. + `--optional-components ANACONDA` argument. * RAPIDS is supported on Pascal or newer GPU architectures (Tesla K80s will _not_ work with RAPIDS). See [list](https://cloud.google.com/compute/docs/gpus/) of available GPU types diff --git a/rstudio/README.MD b/rstudio/README.MD index 3bf7e0d3c..2b0c3ffc7 100644 --- a/rstudio/README.MD +++ b/rstudio/README.MD @@ -3,13 +3,19 @@ This initialization action installs the Open Source Edition of [RStudio Server](https://www.rstudio.com/products/rstudio/#Server) on the master node of a [Google Cloud Dataproc](https://cloud.google.com/dataproc) cluster. 
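As with the other web UIs referenced throughout these READMEs, the RStudio interface lives on the master node and is typically reached through an SSH tunnel, per the Dataproc cluster web interfaces documentation. A hedged sketch of the SOCKS-proxy variant — the cluster name and zone are placeholders, and the port shown is RStudio Server's upstream default (8787) rather than anything this script guarantees:

```bash
# Placeholders - substitute your own cluster name and zone.
CLUSTER_NAME=my-cluster
ZONE=us-east1-b

# Open a SOCKS proxy through the master node; leave this running.
gcloud compute ssh "${CLUSTER_NAME}-m" --zone "${ZONE}" -- -D 1080 -N

# In a browser configured to use localhost:1080 as a SOCKS proxy, visit
# http://my-cluster-m:8787/ (assuming RStudio Server's default port).
```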
## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with RStudio Server installed by: -1. Using the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. The password must be at least 7 characters. +1. Using the `gcloud` command to create a new cluster with this initialization action. The password must be at least 7 characters. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/rstudio/rstudio.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/rstudio/rstudio.sh \ --metadata rstudio-user=rstudio \ --metadata rstudio-password= ``` diff --git a/solr/README.md b/solr/README.md index 67d499696..19af95c18 100644 --- a/solr/README.md +++ b/solr/README.md @@ -4,13 +4,18 @@ This initialization action installs [Apache Solr](https://lucene.apache.org/solr ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Apache Solr installed: -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/solr/solr.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/solr/solr.sh ``` ## Solr UI diff --git a/spark-nlp/README.md b/spark-nlp/README.md index a31266e89..26df1c977 100644 --- a/spark-nlp/README.md +++ b/spark-nlp/README.md @@ -5,21 +5,26 @@ on all nodes within a [Google Cloud Dataproc](https://cloud.google.com/dataproc) ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Cloud Dataproc cluster with spark-nlp version 2.0.8 installed. You must also include Anaconda as an [Optional Component](https://cloud.google.com/dataproc/docs/concepts/components/overview) when creating the cluster: 1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named `my_cluster`: ```bash - gcloud dataproc clusters create my_cluster \ - --optional-components=ANACONDA \ - --initialization-actions gs://$MY_BUCKET/spark-nlp/spark-nlp.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --optional-components ANACONDA \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/spark-nlp/spark-nlp.sh ``` 2. 
To use `spark-nlp` in your code, you must include `spark-nlp` with the --properties flag when submitting a job (example shows a Python job): ```bash gcloud dataproc jobs submit pyspark --cluster my-cluster \ - --properties=spark:spark.jars.packages=JohnSnowLabs:spark-nlp:2.0.8 \ - my_job.py + --properties spark:spark.jars.packages=JohnSnowLabs:spark-nlp:2.0.8 \ + my_job.py ``` Note: `spark-nlp` is available for Java and Scala as well. diff --git a/stackdriver/README.md b/stackdriver/README.md index ba252eb16..f96dcc8d1 100644 --- a/stackdriver/README.md +++ b/stackdriver/README.md @@ -8,10 +8,10 @@ installations script. This will enable monitoring for a Cloud Dataproc cluster w you can, for example, look at fine-grained resource use across the cluster, alarm on various triggers, or analyze the performance of your cluster. - - ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + **You need to configure Stackdriver before you use this initialization action.** Specifically, you must create a group based on the cluster name prefix of your cluster. Once you do, Stackdriver will detect any new instances created with that prefix and use this group as the basis for your alerting policies and dashboards. You can create a new @@ -20,11 +20,14 @@ group through the [Stackdriver user interface](https://app.google.stackdriver.co Once you have configured a copy of this script, you can use this initialization action to create a new Dataproc cluster with the Stackdriver agent installed: -1. Use the `gcloud` command to create a new cluster with this initialization action. You must add the [requisite stackdriver monitoring scope(s)](https://cloud.google.com/monitoring/api/authentication#cloud_monitoring_scopes). The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. You must add the [requisite stackdriver monitoring scope(s)](https://cloud.google.com/monitoring/api/authentication#cloud_monitoring_scopes). ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/stackdriver/stackdriver.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/stackdriver/stackdriver.sh \ --scopes https://www.googleapis.com/auth/monitoring.write ``` 1. Once the cluster is online, Stackdriver should automatically start capturing data from your cluster. You can visit @@ -37,18 +40,24 @@ You can find more information about using initialization actions with Dataproc i To better identify your cluster in a Stackdriver dashboard, you'll likely want to append a unique tag when creating your cluster: - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/stackdriver/stackdriver.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/stackdriver/stackdriver.sh \ --scopes https://www.googleapis.com/auth/monitoring.write \ --tags my-dataproc-cluster-20160901-1518 -This way, even if you reuse your `` in the future, you can easily disambiguate which incarnation +This way, even if you reuse your cluster in the future, you can easily disambiguate which incarnation of the cluster you want to look at in your Stackdriver dashboards. 
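The same tag can also be used outside of Stackdriver to confirm exactly which VM instances belong to a given incarnation of the cluster. A small sketch using a standard `gcloud` list filter — the tag value is just the example tag from the command above:

```bash
# List the Compute Engine instances that carry the example tag used above;
# replace the value with whatever you passed to --tags at cluster creation.
gcloud compute instances list \
  --filter="tags.items=my-dataproc-cluster-20160901-1518"
```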
For convenience, you may also want to use to Google-hosted copy of the dataproc-initialization-actions repo; for example, once you've enabled the Stackdriver APIs you can simply copy/paste: - gcloud dataproc clusters create ${USER}-dataproc-cluster \ - --initialization-actions gs://$MY_BUCKET/stackdriver/stackdriver.sh \ + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/stackdriver/stackdriver.sh \ --scopes https://www.googleapis.com/auth/monitoring.write \ --tags ${USER}-dataproc-cluster-$(date +%Y%m%d-%H%M%S) diff --git a/starburst-presto/README.MD b/starburst-presto/README.MD index f323a8094..eca3320d0 100644 --- a/starburst-presto/README.MD +++ b/starburst-presto/README.MD @@ -9,16 +9,20 @@ Dataproc workers will be Presto workers. ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Presto installed: 1. Using the `gcloud` command to create a new cluster with this initialization - action. The following command will create a new cluster named - ``. + action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/starburst-presto/presto.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/starburst-presto/presto.sh ``` 1. Once the cluster has been created, Presto is configured to run on port diff --git a/tez/README.MD b/tez/README.MD index 94510fb55..d92834fdc 100644 --- a/tez/README.MD +++ b/tez/README.MD @@ -6,20 +6,25 @@ Note that Tez is pre-installed on Dataproc 1.3+ clusters, so you should not run ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Apache Tez installed: -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/tez/tez.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/tez/tez.sh ``` 1. On Dataproc 1.3+ clusters in order to use pre-installed Tez is necessary to add the flag `--properties 'hadoop-env:HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/etc/tez/conf:/usr/lib/tez/*:/usr/lib/tez/lib/*'`. ```bash gcloud dataproc clusters create \ - --image-version 1.3 \ - --properties 'hadoop-env:HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/etc/tez/conf:/usr/lib/tez/*:/usr/lib/tez/lib/*' + --image-version 1.3 \ + --properties 'hadoop-env:HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/etc/tez/conf:/usr/lib/tez/*:/usr/lib/tez/lib/*' ``` 1. Hive is be configured to use Tez, rather than MapReduce, as its execution engine. This can significantly speed up some Hive queries. Read more in the [Hive on Tez documentation](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez). 
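A quick way to sanity-check that Hive really is running on Tez is to submit a trivial Hive job and print the engine setting. A hedged sketch — the region and cluster name below are placeholders, not values from this repo:

```bash
# Placeholders - substitute your own region and cluster name.
REGION=us-east1
CLUSTER_NAME=my-cluster

# The first statement should report hive.execution.engine=tez in the job
# output; the second runs a trivial query through that engine.
gcloud dataproc jobs submit hive \
  --region "${REGION}" \
  --cluster "${CLUSTER_NAME}" \
  -e "SET hive.execution.engine; SELECT 1;"
```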
diff --git a/tony/README.md b/tony/README.md index 70befc362..1bc7db336 100644 --- a/tony/README.md +++ b/tony/README.md @@ -5,21 +5,29 @@ on a master node within a [Google Cloud Dataproc](https://cloud.google.com/datap ## Using this initialization action +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with TonY installed: -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/tony/tony.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/tony/tony.sh ``` You can also pass specific metadata: ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/tony/tony.sh \ - --metadata name1=value1,name2=value2... + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/tony/tony.sh \ + --metadata name1=value1,name2=value2... ``` Supported metadata parameters: @@ -38,9 +46,12 @@ You can use this initialization action to create a new Dataproc cluster with Ton Example: ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/tony/tony.sh \ - --metadata worker_instances=4,worker_memory=4g,ps_instances=1,ps_memory=2g + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/tony/tony.sh \ + --metadata worker_instances=4,worker_memory=4g,ps_instances=1,ps_memory=2g ``` **Note:** For settings not defined in this configuration, you can pass a separate configuration when launching tasks diff --git a/user-environment/README.MD b/user-environment/README.MD index 865ccf22b..e1af1e648 100644 --- a/user-environment/README.MD +++ b/user-environment/README.MD @@ -5,11 +5,16 @@ This initialization action customizes the environment of all current and future By default it only enables the options already present in the .bashrc that Debian provides, but it documents where further changes can be made and gives some commented out examples. ## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action by: 1. Editing and uploading a copy of this initialization action (`user-environment.sh`) to [Google Cloud Storage](https://cloud.google.com/storage). -1. Using the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``, specify the initialization action stored in ``: +1. Using the `gcloud` command to create a new cluster with this initialization action. 
```bash - gcloud dataproc clusters create \ - --initialization-actions gs:///user-environment + REGION= + CLUSTER= + gcloud dataproc clusters create ${CLUSTER} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/user-environment diff --git a/zeppelin/README.MD b/zeppelin/README.MD index 54c6afe65..dccc7a22f 100644 --- a/zeppelin/README.MD +++ b/zeppelin/README.MD @@ -5,13 +5,19 @@ This initialization action installs the latest version of [Apache Zeppelin](http __Use the Dataproc [Zeppelin Optional Component](https://cloud.google.com/dataproc/docs/concepts/components/zeppelin)__. Clusters created with Cloud Dataproc image version 1.3 and later can install Zeppelin Notebook without using this initialization action. The Zeppelin Optional Component's web interface can be accessed via [Component Gateway](https://cloud.google.com/dataproc/docs/concepts/accessing/dataproc-gateways) without using SSH tunnels. ## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with Apache Zeppelin installed: -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/zeppelin/zeppelin.sh + REGION= + CLUSTER_NAME= + gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/zeppelin/zeppelin.sh ``` 1. Once the cluster has been created, Zeppelin is configured to run on port `8080` on the master node in a Dataproc cluster. To connect to the Apache Zeppelin web interface, you will need to create an SSH tunnel and use a SOCKS 5 Proxy as described in the [dataproc web interfaces](https://cloud.google.com/dataproc/cluster-web-interfaces) documentation. @@ -24,9 +30,12 @@ This option can be provided as a metadata key using `--metadata`. For example: ```bash -gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/zeppelin/zeppelin.sh \ - --metadata zeppelin-port=8081 +REGION= +CLUSTER_NAME= +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region ${REGION} \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/zeppelin/zeppelin.sh \ + --metadata zeppelin-port=8081 ``` ## Important notes diff --git a/zookeeper/README.MD b/zookeeper/README.MD index 0b37b68dc..07f64e779 100644 --- a/zookeeper/README.MD +++ b/zookeeper/README.MD @@ -9,14 +9,20 @@ This script installs ZooKeeper on the **three required** nodes for a Cloud Datap * Worker 2 (`-w-1`) ## Using this initialization action + +**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production. + You can use this initialization action to create a new Dataproc cluster with ZooKeeper installed: -1. Use the `gcloud` command to create a new cluster with this initialization action. The following command will create a new cluster named ``. +1. Use the `gcloud` command to create a new cluster with this initialization action. ```bash - gcloud dataproc clusters create \ - --initialization-actions gs://$MY_BUCKET/zookeeper/zookeeper.sh \ - [--properties zookeeper:=,...] 
+    REGION=
+    CLUSTER_NAME=
+    gcloud dataproc clusters create ${CLUSTER_NAME} \
+      --region ${REGION} \
+      --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/zookeeper/zookeeper.sh \
+      [--properties zookeeper:=,...]
    ```
 1. Once the cluster has been created, ZooKeeper is configured to run on port `2181` (though you can change this in the script).
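Once the cluster is up, a quick health check from any node can confirm the ZooKeeper ensemble is serving. A hedged sketch — it assumes `nc` is available on the image and that the bundled client lives at the usual Bigtop path, neither of which this script guarantees:

```bash
# Run after SSH'ing into any cluster node. "imok" means the local ZooKeeper
# server is alive; "srvr" prints its role (leader/follower) and basic stats.
echo ruok | nc localhost 2181
echo srvr | nc localhost 2181

# Alternatively, browse the znode tree with the bundled CLI; the path below
# is the usual Bigtop/Dataproc location but may differ on your image.
/usr/lib/zookeeper/bin/zkCli.sh -server localhost:2181 ls /
```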