Spark XGBoost examples here showcase the need for end-to-end GPU acceleration. The Scala based XGBoost examples here use DMLC’s version. For PySpark based XGBoost, please refer to the Spark-RAPIDS-examples 22.04 branch that uses NVIDIA’s Spark XGBoost version. Most data scientists spend a lot of time not only on Training models but also processing the large amounts of data needed to train these models. As you can see below, XGBoost training on GPUs can be upto 7X and data processing using RAPIDS Accelerator can also be accelerated with an end-to-end speed-up of 7X on GPU compared to CPU. In the public cloud, better performance can lead to significantly lower costs as demonstrated in this blog.
In this folder, there are three blue prints for users to learn about using Spark XGBoost and RAPIDS Accelerator on GPUs :
- Mortgage Prediction
- Agaricus Classification
- Taxi Fare Prediction
For each of these examples we have prepared a sample dataset in this folder for testing. These datasets are only provided for convenience. In order to test for performance, please download the larger dataset from their respectives sources.
There are three sections in this readme section. In the first section, we will list the notebooks that can be run on Jupyter with Python or Scala (Spylon Kernel or Apache Toree Kernel).
In the second section, we have sample jar files and source code if users would like to build and run this as a Scala or a PySpark Spark-XGBoost application.
In the last section, we provide basic “Getting Started Guides” for setting up GPU Spark-XGBoost on different environments based on the Apache Spark scheduler such as YARN, Standalone or Kubernetes.
- Mortgage Notebooks
- Agaricus Notebooks
- Taxi Notebook
- Python
- Scala
The first step to build a Spark application is preparing packages and datasets needed to build the jars. Please use the instructions below for building the
In addition, we have the source code for building reference applications. Below are source codes for the example Spark jobs:
Please follow below steps to run the example Spark jobs in different Spark environments:
- Getting started on on-premises clusters
- Getting started on cloud service providers
- Amazon AWS
- Databricks
Please follow below steps to run the example notebooks in different notebook environments:
- Getting started for Jupyter Notebook applications
Note:
For the CrossValidator job, we need to set spark.task.resource.gpu.amount=1
to allow only 1 training task running on 1 GPU(executor),
otherwise the customized CrossValidator may schedule more than 1 xgboost training tasks into one executor simultaneously and trigger
issue-131.
For XGBoost job, if the number of shuffle stage tasks before training is less than the num_worker,
the training tasks will be scheduled to run on part of nodes instead of all nodes due to Spark Data Locality feature.
The workaround is to increase the partitions of the shuffle stage by setting spark.sql.files.maxPartitionBytes=RightNum
.