diff --git a/bigdl/README.md b/bigdl/README.md
index 382be5f5e..9bb282a9e 100644
--- a/bigdl/README.md
+++ b/bigdl/README.md
@@ -1,21 +1,17 @@
 # Intel BigDL and Analytics Zoo
 
-This initialization action installs [BigDL](https://github.com/intel-analytics/BigDL)
-on a [Google Cloud Dataproc](https://cloud.google.com/dataproc) cluster.
-BigDL is a distributed deep learning library for Apache Spark. More information can be found on the
-[project's website](https://bigdl-project.github.io/)
+This initialization action installs [BigDL](https://github.com/intel-analytics/BigDL) on a [Dataproc](https://cloud.google.com/dataproc) cluster. BigDL is a distributed deep learning library for Apache Spark. See the [BigDL website](https://bigdl-project.github.io/) for more information.
 
-This script also supports Intel Analytics Zoo which includes BigDL as well.
-More information [project's website](https://analytics-zoo.github.io)
+This script also supports the Intel [Analytics Zoo](https://software.intel.com/content/www/us/en/develop/topics/ai/analytics-zoo.html),
+which includes BigDL. See the [Analytics Zoo website](https://analytics-zoo.github.io) for more information.
 
-## Using this initialization action
+## Using the initialization action
 
-**:warning: NOTICE:** See [best practices](/README.md#how-initialization-actions-are-used) of using initialization actions in production.
+**:warning: NOTICE:** See [How initialization actions are used](/README.md#how-initialization-actions-are-used) and [Important considerations and guidelines](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions#important_considerations_and_guidelines) for additional information.
 
-You can use this initialization to create a new Dataproc cluster with BigDL's Spark and PySpark libraries installed.
+Use this initialization action to create a Dataproc cluster with BigDL's Spark and PySpark libraries installed.
 
-Because of a time needed to install BigDL on the cluster nodes we need to set
-`--initialization-action-timeout 10m` property to prevent cluster creation timeout.
+Note: In the following examples, a 10-minute timeout is set with the `--initialization-action-timeout 10m` flag to allow time for BigDL to install on the cluster nodes.
 
 ```
 REGION=
@@ -26,14 +22,9 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
     --initialization-action-timeout 10m
 ```
 
-This script downloads BigDL 0.7.2 for Dataproc 1.3 (Spark 2.3.0 and Scala 2.11.8).
-To download a different version of BigDL or Analytics Zoo distribution
-or one targeted to a different version of Spark/Scala,
-find the download URL from the [BigDL releases page](https://bigdl-project.github.io/master/#release-download), and set the metadata key `bigdl-download-url`
-or beside [maven packages](https://repo1.maven.org/maven2/com/intel/analytics/).
-The URL should end in `-dist.zip`.
+By default, this initialization action downloads BigDL 0.7.2 for Dataproc 1.3 (Spark 2.3.0 and Scala 2.11.8). To download a different BigDL or Analytics Zoo distribution version, or one targeted to a different version of Spark/Scala, find the download URL on the [BigDL releases page](https://bigdl-project.github.io/master/#release-download) or in the [Maven repository](https://repo1.maven.org/maven2/com/intel/analytics/), then set the `bigdl-download-url` metadata key. The URL should end in `-dist.zip`.
-For example, for Dataproc 1.0 (Spark 1.6 and Scala 2.10) and BigDL v0.7.2:
+Example for Dataproc 1.0 (Spark 1.6 and Scala 2.10) and BigDL v0.7.2:
 
 ```
 REGION=
@@ -46,7 +37,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
     --metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/bigdl/dist-spark-1.6.2-scala-2.10.5-all/0.7.2/dist-spark-1.6.2-scala-2.10.5-all-0.7.2-dist.zip'
 ```
 
-Or, for example, to download Analytics Zoo 0.4.0 with BigDL v0.7.2 for Dataproc 1.3 (Spark 2.3) use this:
+Example for Dataproc 1.3 (Spark 2.3) and Analytics Zoo 0.4.0 with BigDL v0.7.2:
 
 ```
 REGION=
@@ -58,11 +49,8 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
     --initialization-action-timeout 10m \
     --metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/zoo/analytics-zoo-bigdl_0.7.2-spark_2.3.1/0.4.0/analytics-zoo-bigdl_0.7.2-spark_2.3.1-0.4.0-dist-all.zip'
 ```
-
-
-You can find more information about using initialization actions with Dataproc in the [Dataproc documentation](https://cloud.google.com/dataproc/init-actions).
 
 ## Important notes
 
-* You cannot use preemptible VMs with this init action, nor scale (add or remove workers from) the cluster. BigDL needs to know the exact number of Spark executors and cores per executor to make optimizations for Intel's MKL library (which BigDL ships with). This init action statically sets `spark.executor.instances` based on the original size of the cluster, and **disables** dynamic allocation (`spark.dynamicAllocation.enabled=false`).
-* The init action sets `spark.executor.instances` such that a single application takes up all the resources in a cluster. To run multiple applications simulatenously, override `spark.executor.instances` on each job using `--properties` to `gcloud dataproc jobs submit [spark|pyspark|spark-sql]` or `--conf` to `spark-shell`/`spark-submit`. Note that each application needs to schedule an app master in addition to the executors.
+* You cannot use preemptible VMs with this initialization action, and cannot scale (add or remove workers from) the cluster. BigDL expects a fixed number of Spark executors and cores per executor to make optimizations for Intel's MKL library (shipped with BigDL). This initialization action statically sets `spark.executor.instances` based on the original size of the cluster, and **disables** dynamic allocation (`spark.dynamicAllocation.enabled=false`).
+* This initialization action sets `spark.executor.instances` so that a single application uses all cluster resources. To run multiple applications simultaneously, override `spark.executor.instances` on each job by adding the `--properties` flag to `gcloud dataproc jobs submit [spark|pyspark|spark-sql]` or the `--conf` flag to `spark-shell`/`spark-submit`. Note that each application schedules an app master in addition to the executors (see the sketch below).
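As an illustration of the last note above, here is a minimal sketch of capping a job's executor count so that two applications can run side by side. The job type, main class, jar path, and executor count below are hypothetical placeholders chosen for this example, not values from this repository:

```
REGION=
CLUSTER_NAME=
# Cap this job at two executors so a second application (and its app
# master) can be scheduled on the remaining cluster resources.
# com.example.MyBigDLJob and the jar path are hypothetical placeholders.
gcloud dataproc jobs submit spark \
    --cluster ${CLUSTER_NAME} \
    --region ${REGION} \
    --class com.example.MyBigDLJob \
    --jars gs://my-bucket/my-bigdl-job.jar \
    --properties spark.executor.instances=2
```

The same override works interactively, for example `spark-shell --conf spark.executor.instances=2` on the cluster's master node.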