diff --git a/.gitignore b/.gitignore index f4f35fd..b0943e8 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ +env env_rml .idea .ipynb_checkpoints diff --git a/lecture_4.ipynb b/lecture_4.ipynb new file mode 100644 index 0000000..b9cdaad --- /dev/null +++ b/lecture_4.ipynb @@ -0,0 +1,2390 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## License \n", + "\n", + "Copyright 2020 Patrick Hall (jphall@gwu.edu)\n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the License.\n", + "You may obtain a copy of the License at\n", + "\n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the License for the specific language governing permissions and\n", + "limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**DISCLAIMER:** This notebook is not legal compliance advice." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "***" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Attacking a Constrained Machine Learning Model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Global hyperparameters" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "SEED = 12345 # global random seed for better reproducibility" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Python imports and inits" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.\n", + "Attempting to start a local H2O server...\n", + " Java Version: openjdk version \"1.8.0_252\"; OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09); OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)\n", + " Starting server from /home/patrickh/Workspace/GWU_rml/env_rml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar\n", + " Ice root: /tmp/tmpwbigsuw9\n", + " JVM stdout: /tmp/tmpwbigsuw9/h2o_patrickh_started_from_python.out\n", + " JVM stderr: /tmp/tmpwbigsuw9/h2o_patrickh_started_from_python.err\n", + " Server is running at http://127.0.0.1:54321\n", + "Connecting to H2O server at http://127.0.0.1:54321 ... successful.\n", + "Warning: Your H2O cluster version is too old (9 months and 10 days)! Please download and install the latest version from http://h2o.ai/download/\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
H2O cluster uptime:00 secs
H2O cluster timezone:America/New_York
H2O data parsing timezone:UTC
H2O cluster version:3.26.0.3
H2O cluster version age:9 months and 10 days !!!
H2O cluster name:H2O_from_python_patrickh_8fev5r
H2O cluster total nodes:1
H2O cluster free memory:1.879 Gb
H2O cluster total cores:24
H2O cluster allowed cores:24
H2O cluster status:accepting new members, healthy
H2O connection url:http://127.0.0.1:54321
H2O connection proxy:None
H2O internal security:False
H2O API Extensions:Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
Python version:3.6.9 final
" + ], + "text/plain": [ + "-------------------------- ---------------------------------------------------\n", + "H2O cluster uptime: 00 secs\n", + "H2O cluster timezone: America/New_York\n", + "H2O data parsing timezone: UTC\n", + "H2O cluster version: 3.26.0.3\n", + "H2O cluster version age: 9 months and 10 days !!!\n", + "H2O cluster name: H2O_from_python_patrickh_8fev5r\n", + "H2O cluster total nodes: 1\n", + "H2O cluster free memory: 1.879 Gb\n", + "H2O cluster total cores: 24\n", + "H2O cluster allowed cores: 24\n", + "H2O cluster status: accepting new members, healthy\n", + "H2O connection url: http://127.0.0.1:54321\n", + "H2O connection proxy:\n", + "H2O internal security: False\n", + "H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4\n", + "Python version: 3.6.9 final\n", + "-------------------------- ---------------------------------------------------" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from rmltk import debug, evaluate, model # simple module for evaluating, debugging, and training models\n", + "\n", + "# h2o Python API with specific classes\n", + "import h2o \n", + "from h2o.estimators.gbm import H2OGradientBoostingEstimator # for GBM\n", + "\n", + "import numpy as np # array, vector, matrix calculations\n", + "import pandas as pd # DataFrame handling\n", + "\n", + "import matplotlib.pyplot as plt # general plotting\n", + "pd.options.display.max_columns = 999 # enable display of all columns in notebook\n", + "\n", + "# display plots in-notebook\n", + "%matplotlib inline \n", + "\n", + "h2o.init(max_mem_size='2G') # start h2o\n", + "h2o.remove_all() # remove any existing data structures from h2o memory" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Download, Explore, and Prepare UCI Credit Card Default Data\n", + "\n", + "UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients\n", + "\n", + "The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input variables: \n", + "\n", + "* **`LIMIT_BAL`**: Amount of given credit (NT dollar)\n", + "* **`SEX`**: 1 = male; 2 = female\n", + "* **`EDUCATION`**: 1 = graduate school; 2 = university; 3 = high school; 4 = others \n", + "* **`MARRIAGE`**: 1 = married; 2 = single; 3 = others\n", + "* **`AGE`**: Age in years \n", + "* **`PAY_0`, `PAY_2` - `PAY_6`**: History of past payment; `PAY_0` = the repayment status in September, 2005; `PAY_2` = the repayment status in August, 2005; ...; `PAY_6` = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above. \n", + "* **`BILL_AMT1` - `BILL_AMT6`**: Amount of bill statement (NT dollar). `BILL_AMT1` = amount of bill statement in September, 2005; `BILL_AMT2` = amount of bill statement in August, 2005; ...; `BILL_AMT6` = amount of bill statement in April, 2005. \n", + "* **`PAY_AMT1` - `PAY_AMT6`**: Amount of previous payment (NT dollar). `PAY_AMT1` = amount paid in September, 2005; `PAY_AMT2` = amount paid in August, 2005; ...; `PAY_AMT6` = amount paid in April, 2005. \n", + "\n", + "As is common in credit scoring models, demographic variables will not be used as model inputs. However, they will be used after model training to test for disparate impact."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Import data and clean" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# import XLS file\n", + "path = 'default_of_credit_card_clients.xls'\n", + "data = pd.read_excel(path,\n", + " skiprows=1)\n", + "\n", + "# remove spaces from target column name \n", + "data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'}) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Assign modeling roles" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "y = DEFAULT_NEXT_MONTH\n", + "X = ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']\n" + ] + } + ], + "source": [ + "# assign target and inputs for GBM\n", + "y_name = 'DEFAULT_NEXT_MONTH'\n", + "x_names = [name for name in data.columns if name not in [y_name, 'ID', 'AGE', 'EDUCATION', 'MARRIAGE', 'SEX']]\n", + "print('y =', y_name)\n", + "print('X =', x_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Helper function for recoding values in the UCI credit card default data\n", + "This simple function maps the original integer values of the input variables found in the dataset to longer, more understandable character string values taken from the UCI credit card default data dictionary."
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Parse progress: |█████████████████████████████████████████████████████████| 100%\n" + ] + } + ], + "source": [ + "def recode_cc_data(frame):\n", + " \n", + " \"\"\" Recodes numeric categorical variables into categorical character variables\n", + " with more transparent values. \n", + " \n", + " Args:\n", + " frame: Pandas DataFrame version of UCI credit card default data.\n", + " \n", + " Returns: \n", + " H2OFrame with recoded values.\n", + " \n", + " \"\"\"\n", + " \n", + " # define recoded values\n", + " sex_dict = {1:'male', 2:'female'}\n", + " education_dict = {0:'other', 1:'graduate school', 2:'university', 3:'high school', \n", + " 4:'other', 5:'other', 6:'other'}\n", + " marriage_dict = {0:'other', 1:'married', 2:'single', 3:'divorced'}\n", + " \n", + " # recode values using apply() and lambda function\n", + " frame['SEX'] = frame['SEX'].apply(lambda i: sex_dict[i])\n", + " frame['EDUCATION'] = frame['EDUCATION'].apply(lambda i: education_dict[i]) \n", + " frame['MARRIAGE'] = frame['MARRIAGE'].apply(lambda i: marriage_dict[i]) \n", + " \n", + " return h2o.H2OFrame(frame)\n", + "\n", + "data = recode_cc_data(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Split data into training and validation partitions\n", + "Fairness metrics will be calculated for the validation data to give a better idea of how these metrics will look on future unseen data."
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train data rows = 21060, columns = 25\n", + "Validation data rows = 8940, columns = 25\n" + ] + } + ], + "source": [ + "# split into training and validation\n", + "train, valid = data.split_frame([0.7], seed=12345)\n", + "\n", + "# summarize split\n", + "print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))\n", + "print('Validation data rows = %d, columns = %d' % (valid.shape[0], valid.shape[1]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load Pre-trained Monotonic GBM\n", + "Load the model known as `mgbm5` from the first lecture." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model Details\n", + "=============\n", + "H2OGradientBoostingEstimator : Gradient Boosting Machine\n", + "Model Key: best_mgbm\n", + "\n", + "\n", + "Model Summary: " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
number_of_treesnumber_of_internal_treesmodel_size_in_bytesmin_depthmax_depthmean_depthmin_leavesmax_leavesmean_leaves
046.046.06939.03.03.03.05.08.07.369565
\n", + "
" + ], + "text/plain": [ + " number_of_trees number_of_internal_trees model_size_in_bytes \\\n", + "0 46.0 46.0 6939.0 \n", + "\n", + " min_depth max_depth mean_depth min_leaves max_leaves mean_leaves \n", + "0 3.0 3.0 3.0 5.0 8.0 7.369565 " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "ModelMetricsBinomial: gbm\n", + "** Reported on train data. **\n", + "\n", + "MSE: 0.13637719864300343\n", + "RMSE: 0.3692928358945018\n", + "LogLoss: 0.4351274080189972\n", + "Mean Per-Class Error: 0.2913939696264273\n", + "AUC: 0.7716491282246187\n", + "pr_auc: 0.5471826859054356\n", + "Gini: 0.5432982564492375\n", + "\n", + "Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.21968260039166268: " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01ErrorRate
0013482.02814.00.1727(2814.0/16296.0)
111907.02743.00.4101(1907.0/4650.0)
2Total15389.05557.00.2254(4721.0/20946.0)
\n", + "
" + ], + "text/plain": [ + " 0 1 Error Rate\n", + "0 0 13482.0 2814.0 0.1727 (2814.0/16296.0)\n", + "1 1 1907.0 2743.0 0.4101 (1907.0/4650.0)\n", + "2 Total 15389.0 5557.0 0.2254 (4721.0/20946.0)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Maximum Metrics: Maximum metrics at their respective thresholds\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
metricthresholdvalueidx
0max f10.2196830.537474248.0
1max f20.1278590.630227329.0
2max f0point50.4466990.583033147.0
3max accuracy0.4466990.821493147.0
4max precision0.9502471.0000000.0
5max recall0.0506091.000000395.0
6max specificity0.9502471.0000000.0
7max absolute_mcc0.3251590.413494194.0
8max min_per_class_accuracy0.1775420.698495281.0
9max mean_per_class_accuracy0.2196830.708606248.0
\n", + "
" + ], + "text/plain": [ + " metric threshold value idx\n", + "0 max f1 0.219683 0.537474 248.0\n", + "1 max f2 0.127859 0.630227 329.0\n", + "2 max f0point5 0.446699 0.583033 147.0\n", + "3 max accuracy 0.446699 0.821493 147.0\n", + "4 max precision 0.950247 1.000000 0.0\n", + "5 max recall 0.050609 1.000000 395.0\n", + "6 max specificity 0.950247 1.000000 0.0\n", + "7 max absolute_mcc 0.325159 0.413494 194.0\n", + "8 max min_per_class_accuracy 0.177542 0.698495 281.0\n", + "9 max mean_per_class_accuracy 0.219683 0.708606 248.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Gains/Lift Table: Avg response rate: 22.20 %, avg score: 22.00 %\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
groupcumulative_data_fractionlower_thresholdliftcumulative_liftresponse_ratescorecumulative_response_ratecumulative_scorecapture_ratecumulative_capture_rategaincumulative_gain
010.0100740.8139273.6078833.6078830.8009480.8434460.8009480.8434460.0363440.036344260.788259260.788259
120.0203380.7955753.5198083.5634320.7813950.8051530.7910800.8241190.0361290.072473251.980795256.343177
230.0303160.7636793.4053283.5113940.7559810.7839700.7795280.8109050.0339780.106452240.532798251.139446
340.0400080.7151383.2618913.4509540.7241380.7398150.7661100.7936840.0316130.138065226.189099245.095388
450.0500810.6644163.1168693.3837550.6919430.6866950.7511920.7721640.0313980.169462211.686898238.375473
560.1000190.5433842.8594633.1219840.6347990.6017940.6930790.6871010.1427960.312258185.946339212.198445
670.1500050.3662372.2242932.8228490.4937920.4469510.6266710.6070760.1111830.423441122.429306182.284922
780.2056720.2927651.5955102.4906590.3542020.3127770.5529250.5274220.0888170.51225859.551043149.065864
890.3012510.1966481.1745042.0730770.2607390.2344990.4602220.4344850.1122580.62451617.450421107.307684
9100.4000290.1738170.8643271.7746040.1918800.1848440.3939610.3728420.0853760.709892-13.56728477.460410
10110.5002860.1514310.7014181.5595370.1557140.1613350.3462160.3304550.0703230.780215-29.85824955.953665
11120.6003060.1312140.6192371.4028700.1374700.1407090.3114360.2988410.0619350.842151-38.07634240.286982
12130.7006590.1147940.5593141.2820500.1241670.1228170.2846140.2736300.0561290.898280-44.06856828.204987
13140.8008210.1022260.3692931.1678870.0819830.1080620.2592700.2529210.0369890.935269-63.07069716.788724
14150.9045640.0918610.4021521.0800660.0892770.0975240.2397740.2350990.0417200.976989-59.7848088.006633
15161.0000000.0348100.2411121.0000000.0535270.0769890.2219990.2200100.0230111.000000-75.8887830.000000
\n", + "
" + ], + "text/plain": [ + " group cumulative_data_fraction lower_threshold lift \\\n", + "0 1 0.010074 0.813927 3.607883 \n", + "1 2 0.020338 0.795575 3.519808 \n", + "2 3 0.030316 0.763679 3.405328 \n", + "3 4 0.040008 0.715138 3.261891 \n", + "4 5 0.050081 0.664416 3.116869 \n", + "5 6 0.100019 0.543384 2.859463 \n", + "6 7 0.150005 0.366237 2.224293 \n", + "7 8 0.205672 0.292765 1.595510 \n", + "8 9 0.301251 0.196648 1.174504 \n", + "9 10 0.400029 0.173817 0.864327 \n", + "10 11 0.500286 0.151431 0.701418 \n", + "11 12 0.600306 0.131214 0.619237 \n", + "12 13 0.700659 0.114794 0.559314 \n", + "13 14 0.800821 0.102226 0.369293 \n", + "14 15 0.904564 0.091861 0.402152 \n", + "15 16 1.000000 0.034810 0.241112 \n", + "\n", + " cumulative_lift response_rate score cumulative_response_rate \\\n", + "0 3.607883 0.800948 0.843446 0.800948 \n", + "1 3.563432 0.781395 0.805153 0.791080 \n", + "2 3.511394 0.755981 0.783970 0.779528 \n", + "3 3.450954 0.724138 0.739815 0.766110 \n", + "4 3.383755 0.691943 0.686695 0.751192 \n", + "5 3.121984 0.634799 0.601794 0.693079 \n", + "6 2.822849 0.493792 0.446951 0.626671 \n", + "7 2.490659 0.354202 0.312777 0.552925 \n", + "8 2.073077 0.260739 0.234499 0.460222 \n", + "9 1.774604 0.191880 0.184844 0.393961 \n", + "10 1.559537 0.155714 0.161335 0.346216 \n", + "11 1.402870 0.137470 0.140709 0.311436 \n", + "12 1.282050 0.124167 0.122817 0.284614 \n", + "13 1.167887 0.081983 0.108062 0.259270 \n", + "14 1.080066 0.089277 0.097524 0.239774 \n", + "15 1.000000 0.053527 0.076989 0.221999 \n", + "\n", + " cumulative_score capture_rate cumulative_capture_rate gain \\\n", + "0 0.843446 0.036344 0.036344 260.788259 \n", + "1 0.824119 0.036129 0.072473 251.980795 \n", + "2 0.810905 0.033978 0.106452 240.532798 \n", + "3 0.793684 0.031613 0.138065 226.189099 \n", + "4 0.772164 0.031398 0.169462 211.686898 \n", + "5 0.687101 0.142796 0.312258 185.946339 \n", + "6 0.607076 0.111183 0.423441 122.429306 \n", + "7 0.527422 0.088817 0.512258 
59.551043 \n", + "8 0.434485 0.112258 0.624516 17.450421 \n", + "9 0.372842 0.085376 0.709892 -13.567284 \n", + "10 0.330455 0.070323 0.780215 -29.858249 \n", + "11 0.298841 0.061935 0.842151 -38.076342 \n", + "12 0.273630 0.056129 0.898280 -44.068568 \n", + "13 0.252921 0.036989 0.935269 -63.070697 \n", + "14 0.235099 0.041720 0.976989 -59.784808 \n", + "15 0.220010 0.023011 1.000000 -75.888783 \n", + "\n", + " cumulative_gain \n", + "0 260.788259 \n", + "1 256.343177 \n", + "2 251.139446 \n", + "3 245.095388 \n", + "4 238.375473 \n", + "5 212.198445 \n", + "6 182.284922 \n", + "7 149.065864 \n", + "8 107.307684 \n", + "9 77.460410 \n", + "10 55.953665 \n", + "11 40.286982 \n", + "12 28.204987 \n", + "13 16.788724 \n", + "14 8.006633 \n", + "15 0.000000 " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "ModelMetricsBinomial: gbm\n", + "** Reported on validation data. **\n", + "\n", + "MSE: 0.13326994104124376\n", + "RMSE: 0.3650615578792757\n", + "LogLoss: 0.4278285715046422\n", + "Mean Per-Class Error: 0.2856607030196092\n", + "AUC: 0.7776380047998697\n", + "pr_auc: 0.5486322626112021\n", + "Gini: 0.5552760095997393\n", + "\n", + "Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.27397344199105433: " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01ErrorRate
006093.0975.00.1379(975.0/7068.0)
11863.01123.00.4345(863.0/1986.0)
2Total6956.02098.00.203(1838.0/9054.0)
\n", + "
" + ], + "text/plain": [ + " 0 1 Error Rate\n", + "0 0 6093.0 975.0 0.1379 (975.0/7068.0)\n", + "1 1 863.0 1123.0 0.4345 (863.0/1986.0)\n", + "2 Total 6956.0 2098.0 0.203 (1838.0/9054.0)" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Maximum Metrics: Maximum metrics at their respective thresholds\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
metricthresholdvalueidx
0max f10.2739730.549951217.0
1max f20.1478350.634488307.0
2max f0point50.4366200.590736153.0
3max accuracy0.4569630.825271147.0
4max precision0.9470691.0000000.0
5max recall0.0451061.000000397.0
6max specificity0.9470691.0000000.0
7max absolute_mcc0.3472460.429999184.0
8max min_per_class_accuracy0.1815850.709970275.0
9max mean_per_class_accuracy0.2305180.714339240.0
\n", + "
" + ], + "text/plain": [ + " metric threshold value idx\n", + "0 max f1 0.273973 0.549951 217.0\n", + "1 max f2 0.147835 0.634488 307.0\n", + "2 max f0point5 0.436620 0.590736 153.0\n", + "3 max accuracy 0.456963 0.825271 147.0\n", + "4 max precision 0.947069 1.000000 0.0\n", + "5 max recall 0.045106 1.000000 397.0\n", + "6 max specificity 0.947069 1.000000 0.0\n", + "7 max absolute_mcc 0.347246 0.429999 184.0\n", + "8 max min_per_class_accuracy 0.181585 0.709970 275.0\n", + "9 max mean_per_class_accuracy 0.230518 0.714339 240.0" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Gains/Lift Table: Avg response rate: 21.94 %, avg score: 22.52 %\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
groupcumulative_data_fractionlower_thresholdliftcumulative_liftresponse_ratescorecumulative_response_ratecumulative_scorecapture_ratecumulative_capture_rategaincumulative_gain
010.0111550.8150103.2950553.2950550.7227720.8398580.7227720.8398580.0367570.036757229.505549229.505549
120.0205430.7955753.7007643.4804600.8117650.8056310.7634410.8242170.0347430.071501270.076417248.045999
230.0300420.7835503.6047213.5197490.7906980.7924410.7720590.8141700.0342400.105740260.472142251.974853
340.0400930.7431923.0058763.3909270.6593410.7613350.7438020.8009250.0302110.135952200.587630239.092657
450.0500330.6977023.4445123.4015730.7555560.7230910.7461370.7854610.0342400.170191244.451158240.157260
560.1012810.5531933.1047773.2513940.6810340.6147360.7131950.6990750.1591140.329305210.477654225.139444
670.1503200.3835642.1870462.9041710.4797300.4660670.6370320.6230610.1072510.436556118.704581190.417123
780.2000220.2969151.5804232.5752440.3466670.3278170.5648810.5496980.0785500.51510658.042296157.524427
890.3013030.2035391.1335142.0906160.2486370.2506480.4585780.4491740.1148040.62990913.351366109.061561
9100.4034680.1769700.9610681.8045950.2108110.1871900.3958390.3828360.0981870.728097-3.89319880.459549
10110.5002210.1520280.6557341.5823820.1438360.1635660.3470960.3404240.0634440.791541-34.42660358.238248
11120.5999560.1330090.5553491.4116510.1218160.1416510.3096470.3073810.0553880.846928-44.46507641.165144
12130.7022310.1150620.4923231.2777570.1079910.1235490.2802770.2806070.0503520.897281-50.76768527.775745
13140.8019660.1023800.3534041.1628020.0775190.1078340.2550610.2591210.0352470.932528-64.65959416.280206
14150.9053460.0918610.3799091.0734050.0833330.0975850.2354520.2406750.0392750.971803-62.0090637.340501
15161.0000000.0348100.2978991.0000000.0653440.0768840.2193510.2251720.0281971.000000-70.2101410.000000
\n", + "
" + ], + "text/plain": [ + " group cumulative_data_fraction lower_threshold lift \\\n", + "0 1 0.011155 0.815010 3.295055 \n", + "1 2 0.020543 0.795575 3.700764 \n", + "2 3 0.030042 0.783550 3.604721 \n", + "3 4 0.040093 0.743192 3.005876 \n", + "4 5 0.050033 0.697702 3.444512 \n", + "5 6 0.101281 0.553193 3.104777 \n", + "6 7 0.150320 0.383564 2.187046 \n", + "7 8 0.200022 0.296915 1.580423 \n", + "8 9 0.301303 0.203539 1.133514 \n", + "9 10 0.403468 0.176970 0.961068 \n", + "10 11 0.500221 0.152028 0.655734 \n", + "11 12 0.599956 0.133009 0.555349 \n", + "12 13 0.702231 0.115062 0.492323 \n", + "13 14 0.801966 0.102380 0.353404 \n", + "14 15 0.905346 0.091861 0.379909 \n", + "15 16 1.000000 0.034810 0.297899 \n", + "\n", + " cumulative_lift response_rate score cumulative_response_rate \\\n", + "0 3.295055 0.722772 0.839858 0.722772 \n", + "1 3.480460 0.811765 0.805631 0.763441 \n", + "2 3.519749 0.790698 0.792441 0.772059 \n", + "3 3.390927 0.659341 0.761335 0.743802 \n", + "4 3.401573 0.755556 0.723091 0.746137 \n", + "5 3.251394 0.681034 0.614736 0.713195 \n", + "6 2.904171 0.479730 0.466067 0.637032 \n", + "7 2.575244 0.346667 0.327817 0.564881 \n", + "8 2.090616 0.248637 0.250648 0.458578 \n", + "9 1.804595 0.210811 0.187190 0.395839 \n", + "10 1.582382 0.143836 0.163566 0.347096 \n", + "11 1.411651 0.121816 0.141651 0.309647 \n", + "12 1.277757 0.107991 0.123549 0.280277 \n", + "13 1.162802 0.077519 0.107834 0.255061 \n", + "14 1.073405 0.083333 0.097585 0.235452 \n", + "15 1.000000 0.065344 0.076884 0.219351 \n", + "\n", + " cumulative_score capture_rate cumulative_capture_rate gain \\\n", + "0 0.839858 0.036757 0.036757 229.505549 \n", + "1 0.824217 0.034743 0.071501 270.076417 \n", + "2 0.814170 0.034240 0.105740 260.472142 \n", + "3 0.800925 0.030211 0.135952 200.587630 \n", + "4 0.785461 0.034240 0.170191 244.451158 \n", + "5 0.699075 0.159114 0.329305 210.477654 \n", + "6 0.623061 0.107251 0.436556 118.704581 \n", + "7 0.549698 0.078550 0.515106 
58.042296 \n", + "8 0.449174 0.114804 0.629909 13.351366 \n", + "9 0.382836 0.098187 0.728097 -3.893198 \n", + "10 0.340424 0.063444 0.791541 -34.426603 \n", + "11 0.307381 0.055388 0.846928 -44.465076 \n", + "12 0.280607 0.050352 0.897281 -50.767685 \n", + "13 0.259121 0.035247 0.932528 -64.659594 \n", + "14 0.240675 0.039275 0.971803 -62.009063 \n", + "15 0.225172 0.028197 1.000000 -70.210141 \n", + "\n", + " cumulative_gain \n", + "0 229.505549 \n", + "1 248.045999 \n", + "2 251.974853 \n", + "3 239.092657 \n", + "4 240.157260 \n", + "5 225.139444 \n", + "6 190.417123 \n", + "7 157.524427 \n", + "8 109.061561 \n", + "9 80.459549 \n", + "10 58.238248 \n", + "11 41.165144 \n", + "12 27.775745 \n", + "13 16.280206 \n", + "14 7.340501 \n", + "15 0.000000 " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "Scoring History: " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "(Scoring-history HTML table omitted; the same data follows in the text/plain rendering.)\n", + "
" + ], + "text/plain": [ + " timestamp duration number_of_trees training_rmse \\\n", + "0 2020-05-28 14:33:23 43.415 sec 0.0 0.415591 \n", + "1 2020-05-28 14:33:23 43.443 sec 1.0 0.407822 \n", + "2 2020-05-28 14:33:23 43.467 sec 2.0 0.401483 \n", + "3 2020-05-28 14:33:23 43.489 sec 3.0 0.396471 \n", + "4 2020-05-28 14:33:23 43.515 sec 4.0 0.392442 \n", + "5 2020-05-28 14:33:23 43.535 sec 5.0 0.389141 \n", + "6 2020-05-28 14:33:23 43.570 sec 6.0 0.386399 \n", + "7 2020-05-28 14:33:23 43.592 sec 7.0 0.384191 \n", + "8 2020-05-28 14:33:23 43.614 sec 8.0 0.382341 \n", + "9 2020-05-28 14:33:23 43.639 sec 9.0 0.380701 \n", + "10 2020-05-28 14:33:23 43.668 sec 10.0 0.379202 \n", + "11 2020-05-28 14:33:23 43.697 sec 11.0 0.378052 \n", + "12 2020-05-28 14:33:23 43.729 sec 12.0 0.377043 \n", + "13 2020-05-28 14:33:23 43.762 sec 13.0 0.376137 \n", + "14 2020-05-28 14:33:23 43.796 sec 14.0 0.375357 \n", + "15 2020-05-28 14:33:23 43.848 sec 15.0 0.374699 \n", + "16 2020-05-28 14:33:23 43.903 sec 16.0 0.374098 \n", + "17 2020-05-28 14:33:23 43.949 sec 17.0 0.373534 \n", + "18 2020-05-28 14:33:23 44.004 sec 18.0 0.373121 \n", + "19 2020-05-28 14:33:23 44.054 sec 19.0 0.372722 \n", + "\n", + " training_logloss training_auc training_pr_auc training_lift \\\n", + "0 0.529427 0.500000 0.000000 1.000000 \n", + "1 0.511864 0.716131 0.534717 3.474912 \n", + "2 0.498746 0.744646 0.532172 3.529706 \n", + "3 0.489013 0.748189 0.535621 3.529706 \n", + "4 0.481430 0.750121 0.535358 3.529706 \n", + "5 0.475375 0.750058 0.535198 3.529706 \n", + "6 0.470332 0.756986 0.535024 3.529706 \n", + "7 0.466316 0.757005 0.535418 3.529706 \n", + "8 0.462760 0.761106 0.540176 3.514359 \n", + "9 0.459589 0.762515 0.540880 3.518279 \n", + "10 0.456705 0.762522 0.541424 3.518279 \n", + "11 0.454467 0.761648 0.541505 3.521332 \n", + "12 0.452420 0.762767 0.541658 3.521332 \n", + "13 0.450517 0.764795 0.543264 3.525899 \n", + "14 0.448963 0.765145 0.543113 3.525899 \n", + "15 0.447543 0.766118 0.544037 
3.528417 \n", + "16 0.446341 0.766529 0.543896 3.560713 \n", + "17 0.445115 0.766312 0.544208 3.568370 \n", + "18 0.444171 0.766785 0.544720 3.568370 \n", + "19 0.443360 0.767145 0.545059 3.568370 \n", + "\n", + " training_classification_error validation_rmse validation_logloss \\\n", + "0 0.778001 0.413815 0.526105 \n", + "1 0.236370 0.405538 0.507496 \n", + "2 0.228731 0.398808 0.493698 \n", + "3 0.228636 0.393394 0.483273 \n", + "4 0.210780 0.389030 0.475135 \n", + "5 0.245059 0.385453 0.468630 \n", + "6 0.243961 0.382447 0.463157 \n", + "7 0.243961 0.380045 0.458834 \n", + "8 0.247446 0.378063 0.455049 \n", + "9 0.235654 0.376184 0.451464 \n", + "10 0.235606 0.374583 0.448380 \n", + "11 0.231023 0.373354 0.445973 \n", + "12 0.229972 0.372199 0.443670 \n", + "13 0.234317 0.371369 0.441932 \n", + "14 0.235654 0.370549 0.440335 \n", + "15 0.233219 0.369999 0.439161 \n", + "16 0.229161 0.369390 0.437926 \n", + "17 0.231452 0.368810 0.436669 \n", + "18 0.229352 0.368496 0.435909 \n", + "19 0.226439 0.368047 0.435006 \n", + "\n", + " validation_auc validation_pr_auc validation_lift \\\n", + "0 0.500000 0.000000 1.000000 \n", + "1 0.726731 0.537125 3.444264 \n", + "2 0.752909 0.534588 3.422307 \n", + "3 0.756448 0.535692 3.422307 \n", + "4 0.758511 0.536095 3.422307 \n", + "5 0.758505 0.535659 3.422307 \n", + "6 0.764722 0.536039 3.422307 \n", + "7 0.764634 0.536411 3.422307 \n", + "8 0.770340 0.542043 3.457524 \n", + "9 0.772358 0.543522 3.457524 \n", + "10 0.772982 0.543893 3.457524 \n", + "11 0.772925 0.544553 3.460882 \n", + "12 0.773412 0.543195 3.460882 \n", + "13 0.774161 0.543632 3.448038 \n", + "14 0.774176 0.543202 3.448038 \n", + "15 0.774592 0.543709 3.448038 \n", + "16 0.775021 0.544851 3.424855 \n", + "17 0.774927 0.545957 3.442929 \n", + "18 0.775256 0.545586 3.442929 \n", + "19 0.775474 0.545922 3.442929 \n", + "\n", + " validation_classification_error \n", + "0 0.780649 \n", + "1 0.187652 \n", + "2 0.232825 \n", + "3 0.214491 \n", + "4 0.217915 \n", + 
"5 0.214270 \n", + "6 0.229843 \n", + "7 0.220013 \n", + "8 0.204330 \n", + "9 0.223548 \n", + "10 0.226309 \n", + "11 0.228960 \n", + "12 0.224542 \n", + "13 0.227413 \n", + "14 0.228076 \n", + "15 0.228297 \n", + "16 0.226751 \n", + "17 0.225425 \n", + "18 0.226530 \n", + "19 0.224652 " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "See the whole table with table.as_data_frame()\n", + "\n", + "Variable Importances: " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "(Variable-importance HTML table omitted; the same data follows in the text/plain rendering.)\n", + "
" + ], + "text/plain": [ + " variable relative_importance scaled_importance percentage\n", + "0 PAY_0 2794.444824 1.000000 0.693347\n", + "1 PAY_2 307.237366 0.109946 0.076231\n", + "2 PAY_3 215.152893 0.076993 0.053383\n", + "3 PAY_4 155.434448 0.055623 0.038566\n", + "4 PAY_AMT1 127.986313 0.045800 0.031755\n", + "5 PAY_5 127.538628 0.045640 0.031644\n", + "6 PAY_6 102.351601 0.036627 0.025395\n", + "7 LIMIT_BAL 82.432350 0.029499 0.020453\n", + "8 PAY_AMT2 58.934135 0.021090 0.014623\n", + "9 PAY_AMT4 58.858047 0.021063 0.014604" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# load saved best model from lecture 1 \n", + "best_mgbm = h2o.load_model('best_mgbm')\n", + "\n", + "# display model details\n", + "best_mgbm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Shutdown H2O" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Are you sure you want to shutdown the H2O instance running at http://127.0.0.1:54321 (Y/N)? 
n\n" + ] + } + ], + "source": [ + "# be careful, this can erase your work!\n", + "h2o.cluster().shutdown(prompt=True)" + ] + } + ], + "metadata": { + "anaconda-cloud": {}, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/tex/lecture_4.pdf b/tex/lecture_4.pdf index 8d12b5d..26805ac 100644 Binary files a/tex/lecture_4.pdf and b/tex/lecture_4.pdf differ diff --git a/tex/lecture_4.tex b/tex/lecture_4.tex index 0b5217e..c9249d8 100644 --- a/tex/lecture_4.tex +++ b/tex/lecture_4.tex @@ -166,12 +166,11 @@ \frametitle{Backdoors and Watermarks: \textbf{What?}} \begin{itemize} - \item Hackers gain unauthorized access to your production scoring code OR ... - \item Malicious or extorted data science or IT insiders change your production scoring code ... - \end{itemize} - \vspace{20pt} -\hspace{10pt} ... adding a backdoor that can be exploited using water-marked data. - + \Large + \item Hackers gain unauthorized access to your production scoring code \\ OR ... + \item Malicious or extorted data science or IT insiders change your production scoring code + \item{Also adding a backdoor that can be exploited using water-marked data} + \end{itemize} \end{frame} \begin{frame} @@ -208,7 +207,6 @@ \frametitle{Surrogate Model Inversion Attacks: \textbf{What?}} Due to lax security or a distributed attack on your model API or other model endpoint, hackers or competitors simulate data, submit it, receive predictions, and train a surrogate model between their simulated data and your model predictions. This surrogate can ... - \vspace{10pt} \begin{itemize} \item expose your proprietary business logic, i.e. 
``model stealing'' \cite{model_stealing}. \item reveal sensitive aspects of your training data. @@ -279,6 +277,7 @@ \frametitle{Membership Inference Attacks: \textbf{Defenses}} \begin{itemize} + \Large \item See Slide \ref{slide:inversion_defense}. \item \textbf{Monitor for training data}: Monitor your production scoring queue for data that closely resembles any individual used to train your model. Real-time scoring of rows that are extremely similar or identical to data used in training, validation, or testing should be recorded and investigated. \end{itemize} @@ -341,6 +340,7 @@ \frametitle{Impersonation Attacks: \textbf{What?}} Bad actors learn ... \begin{itemize} + \large \item by inversion or adversarial example attacks (see Slides \ref{slide:inversion}, \ref{slide:adversary}), the attributes favored by your model, and then impersonate them. \item by disparate impact analysis (see Slide \ref{slide:data_poisoning_defense}), that your model is discriminatory (e.g. \href{https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing}{ProPublica and COMPAS}, \href{https://medium.com/@Joy.Buolamwini/response-racial-and-gender-bias-in-amazon-rekognition-commercial-ai-system-for-analyzing-faces-a289222eeced}{Gender Shades and Rekognition}), and impersonate your model's privileged class to receive a favorable outcome.\footnote{This presentation makes no claim about the quality of the analysis in Angwin et al. (2016), which has been criticized, but simply states that such cracking is possible \cite{angwin16}, \cite{flores2016false}.} \end{itemize} @@ -364,6 +364,7 @@ \frametitle{Impersonation Attacks: \textbf{Defenses}} \begin{itemize} + \Large \item \textbf{Authentication}: See Slide \ref{slide:inversion_defense}. \item \textbf{Disparate impact analysis}: See Slide \ref{slide:data_poisoning_defense}. \item \textbf{Model monitoring}: Watch in real time for too many similar predictions and for too many similar input rows. 
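The ``monitor for training data'' defense described above can be sketched in a few lines of Python. This is an illustrative sketch only, not production code: the distance metric, the assumption that features are pre-scaled, and the `threshold` value are all assumptions that would need tuning per deployment.

```python
# Hypothetical sketch of the "monitor for training data" defense: flag incoming
# scoring rows that are nearly identical to rows used to train the model.
# Distance metric, scaling, and threshold below are illustrative assumptions.

def l1_distance(row_a, row_b):
    """Sum of absolute per-feature differences (features assumed pre-scaled)."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b))

def suspicious_rows(incoming, training, threshold=0.05):
    """Return indices of incoming rows within `threshold` of any training row."""
    return [i for i, row in enumerate(incoming)
            if any(l1_distance(row, t) <= threshold for t in training)]

training_rows = [(0.10, 0.80, 0.30), (0.90, 0.20, 0.60)]
incoming_rows = [(0.11, 0.80, 0.29),   # near-duplicate of a training row
                 (0.50, 0.50, 0.50)]   # ordinary traffic

print(suspicious_rows(incoming_rows, training_rows))  # -> [0]
```

In practice the training set would be indexed with a nearest-neighbor structure rather than scanned linearly, and flagged rows would be logged and investigated rather than silently dropped.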
@@ -372,12 +373,12 @@ \end{frame} %------------------------------------------------------------------------------- - \section{General Concerns} + \section{General Concerns \& Solutions} %------------------------------------------------------------------------------- - + \subsection{Concerns} \begin{frame}[t, allowframebreaks] - \frametitle{General concerns} + \frametitle{General Concerns} \footnotesize \begin{itemize} \item \textbf{Black-box models}: Over time a motivated, malicious actor could learn more about your own black-box model than you know and use this knowledge imbalance to attack your model \cite{papernot2018marauder}. @@ -390,10 +391,6 @@ \end{frame} -%------------------------------------------------------------------------------- - \section{General Solutions} -%------------------------------------------------------------------------------- - %------------------------------------------------------------------------------- \subsection{General Solutions} %------------------------------------------------------------------------------- @@ -401,14 +398,22 @@ \begin{frame}[t, allowframebreaks] \frametitle{General Solutions} - \footnotesize + %\footnotesize + \small \begin{itemize} \item \textbf{Authenticated access and prediction throttling}: for prediction APIs and other model endpoints. \item \textbf{Benchmark models}: Compare complex model predictions to less complex (and hopefully less hackable) model predictions. For traditional, low signal-to-noise data mining problems, predictions should not be too different. If they are, investigate them. \item \textbf{Encrypted, differentially private, or federated training data}: Properly implemented, these technologies can thwart many types of attacks. Improperly implemented, they simply create a broader attack surface or hinder forensic efforts. 
\item \textbf{Interpretable, fair, or private models}: In addition to models like LFR and PATE, also check out \href{https://github.com/h2oai/h2o-3/blob/master/h2o-py/demos/H2O_tutorial_gbm_monotonicity.ipynb}{monotonic GBMs}, \href{https://christophm.github.io/interpretable-ml-book/rulefit.html}{RuleFit}, \href{https://github.com/IBM/AIF360}{AIF360}, and the \href{https://users.cs.duke.edu/~cynthia/code.html}{Rudin group} at Duke. - \framebreak - \item \textbf{Model documentation, management, and monitoring}: + \end{itemize} + \end{frame} + + \begin{frame} + \frametitle{General Solutions} + \normalsize + \begin{itemize} + %\framebreak + \item \textbf{Model documentation, management, and monitoring}: \begin{itemize} \item Take an inventory of your predictive models. \item Document production models well enough that a new employee can diagnose whether their current behavior is notably different from their intended behavior. @@ -419,7 +424,6 @@ \item \textbf{System monitoring and profiling}: Use a meta anomaly detection system on your entire production modeling system's operating statistics (e.g., number of predictions in some time period, latency, CPU, memory, and disk loads, number of concurrent users, etc.), then closely monitor for anomalies. \end{itemize} \normalsize - \end{frame} %-------------------------------------------------------------------------------
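The system-monitoring defense above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: a real deployment would track many operating statistics, and the single statistic (predictions per minute), the z-score rule, and the `z_cutoff` value are stand-ins that would be chosen from historical, non-attack traffic.

```python
# Hypothetical sketch of meta anomaly detection over one operating statistic
# (predictions per minute). A traffic burst far outside the baseline could
# indicate, e.g., a distributed surrogate-model extraction attempt.
from statistics import mean, stdev

def anomalous_minutes(counts, baseline, z_cutoff=3.0):
    """Return indices of counts more than z_cutoff deviations from baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [i for i, c in enumerate(counts) if abs(c - mu) > z_cutoff * sigma]

baseline = [98, 102, 101, 99, 100, 97, 103, 100]  # normal traffic, made up
recent   = [101, 99, 5000, 98]                    # sudden burst in minute 2

print(anomalous_minutes(recent, baseline))  # -> [2]
```

Flagged time windows would feed the same investigation workflow as the other defenses: record, alert, and review rather than block automatically.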