diff --git a/software/analytics/ibm-wml-ce.rst b/software/analytics/distributed-dl-env.rst
similarity index 94%
rename from software/analytics/ibm-wml-ce.rst
rename to software/analytics/distributed-dl-env.rst
index 14f6743d..81aad722 100644
--- a/software/analytics/ibm-wml-ce.rst
+++ b/software/analytics/distributed-dl-env.rst
@@ -1,11 +1,20 @@
 
 *************************************************************************************
-IBM Watson Machine Learning CE -> Open CE
+Distributed Deep Learning Environment
 *************************************************************************************
 
 Getting Started
 ===============
 
+The deep learning frameworks TensorFlow and PyTorch are available
+on Summit in a conda environment via the ``ibm-wml-ce`` or latest ``open-ce``
+module. In addition to each framework's built-in data-parallel library,
+such as ``tf.distribute.Strategy`` or ``torch.nn.parallel``, the Horovod
+library is also available for distributed training. For performance
+profiling, the NVIDIA deep learning profiler is provided via the ``dlprof``
+module. For parallel hyperparameter search, ``ray-tune`` is also available
+on Summit.
+
 IBM Watson Machine Learning Community Edition is provided on Summit
 through the module ``ibm-wml-ce``, and after version ``1.7.0``, the module
 has been renamed to ``open-ce``, which is built based on the
@@ -71,6 +80,7 @@ Comparing to IBM WML CE, `Open-CE `_ no long
 | WML-CE on Summit (`slides `__ | `recording `__)
 | Scaling up deep learning application on Summit (`slides `__ | `recording `__)
 | ML/DL on Summit (`slides `__ | `recording `__)
+| OpenCE on Summit (`slides `__)
 
 Running Distributed Deep Learning Jobs
 ======================================
@@ -174,7 +184,8 @@ Performance Profiling
 
 There are several tools that can be used to profile the performance of a
 deep learning job. Below are links to several tools that are available
-as part of the ibm-wml-ce and open-ce modules.
+as part of the ibm-wml-ce and open-ce modules, or the NVIDIA deep learning
+profiler.
 
 NVIDIA Profiling Tools
 ^^^^^^^^^^^^^^^^^^^^^^
@@ -185,6 +196,16 @@
 different CUDA kernels are being launched and how long they take to
 complete. For more information on using the NVIDA profiling tools on
 Summit, please see these `slides `_.
 
+The `NVIDIA deep learning profiler `_ is also available on Summit via
+
+.. code-block:: console
+
+    $ module use /sw/aaims/summit/modulefiles
+    $ module load dlprof
+
+Note that only PyTorch in the ``open-ce`` module is currently supported; usage
+`examples `_ are provided.
+
 Horovod Timeline
 ^^^^^^^^^^^^^^^^
 
@@ -250,6 +271,11 @@ We can break the reservation string down to understand each piece.
 
 * The ``maxcus=1`` specifies that the nodes can come from at most 1 rack.
 
+
+Hyperparameter Search
+==========================================
+(coming soon!)
+
 Example
 ===================
 
diff --git a/software/analytics/index.rst b/software/analytics/index.rst
index fac87d2f..e5e4bed2 100644
--- a/software/analytics/index.rst
+++ b/software/analytics/index.rst
@@ -9,7 +9,7 @@ and data analytics tasks on OLCF systems.
 .. toctree::
    :maxdepth: 2
 
-   ibm-wml-ce
+   distributed-dl-env
    pbdR
    nvidia-rapids
    blazingsql