Logistic Regression is a statistical model used in classification and predictive analysis. This model, also known as Logit, is characterized by a single binary dependent variable, i.e. a variable that can only take two values, often labeled as 0 and 1. Binary values are widely used in statistics to model the probability of a certain event occurring, such as the probability of a patient being healthy, a tumor being malignant or not, an email being spam or not, or a team winning or losing. Therefore, Logistic Regression has a large variety of applications. In this tutorial we will present a clear explanation of logistic regression models, build a model from scratch, and compare it with the Scikit Learn implementation.
The logistic function is defined by

$$p(t) = \frac{1}{1 + e^{-t}},$$

where $t = \beta_0 + \beta_1 x$ is a linear function of the explanatory variable $x$, with intercept $\beta_0$ and coefficient $\beta_1$.
The logistic function is used to generate predictions, as we will see below. Furthermore, it is interesting to write this function as

$$t = \ln\left(\frac{p}{1 - p}\right),$$

where $p/(1-p)$ is the odds of the event; this form shows that the log-odds is linear in $x$. There are three main types of logistic regression:
- Binary Logistic Regression: the categorical target variable has only two possible outcomes. Examples: spam or not, a positive or negative diagnosis, an equipment failure or not, students passing or failing an exam.
- Multinomial Logistic Regression: the target contains three or more categories without ordering. Examples: predicting whether an individual is vegetarian, meat-eater, or vegan; customer segmentation.
- Ordinal Logistic Regression: the target contains three or more categories with ordering. Example: a movie rating from 1 to 5.
In this tutorial we will implement a binary logistic regression model from scratch.
Binary logistic regression can be used with one explanatory variable and two categories to answer a question. As an example, consider a group of students who are preparing for an exam. The question is: how does the number of study hours affect the probability of a student passing the exam? In this case the explanatory variable is the number of hours spent studying and the two categories are passing or failing. Let's consider 10 students who spend between 0 and 10 hours studying, and use 1 to identify "pass" and 0 to identify "fail". Then we have the following synthetic data:
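A minimal way to build such a data set (the specific hours and outcomes below are illustrative values, not drawn from any real source) is:

import numpy as np

# Hours studied by each of the 10 students (one feature per column)
x_hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [5.5], [7.0], [8.5], [10.0]])
# Exam outcome: 1 = pass, 0 = fail
y_result = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])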
To predict whether a student will pass or not we need to consider the loss function, which is a function that represents the "price to pay" for inaccurate predictions. Usually the logistic loss (log loss) is used as the measure of goodness of fit in logistic regression. Considering that $\hat{p}$ is the predicted probability of the positive class, the log loss of a single observation is $-\log(\hat{p})$ when the true label is $y = 1$ and $-\log(1 - \hat{p})$ when $y = 0$.
Note that the log loss is always greater than or equal to 0, equals 0 only in the case of a perfect prediction, and approaches infinity as predictions get worse. The two cases can be combined into a single function called the cross entropy:

$$-\,y \log(\hat{p}) - (1 - y)\log(1 - \hat{p}).$$
The sum of this term over all $n$ observations, scaled by $1/n$, is the negative log-likelihood function:

$$L(\beta_0, \beta_1) = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\,\right].$$
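To get a feeling for how the log loss behaves, we can evaluate it for a few predicted probabilities when the true label is $y = 1$ (a quick numerical check, not part of the model itself):

import numpy as np

# Log loss -log(p_hat) for a single observation with true label y = 1
for p_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p_hat = {p_hat:5}: log loss = {-np.log(p_hat):.3f}")
# The loss vanishes for confident correct predictions and grows without
# bound as p_hat approaches 0 (a confident wrong prediction)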
To estimate the probability of an outcome we need to find the values of $\beta_0$ and $\beta_1$ that minimize $L$. Setting its partial derivatives to zero gives

$$\frac{\partial L}{\partial \beta_0} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i) = 0, \qquad \frac{\partial L}{\partial \beta_1} = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)\,x_i = 0.$$

Then we need to solve the above two equations for $\beta_0$ and $\beta_1$. Since there is no closed-form solution, we use gradient descent, repeatedly updating the parameters in the direction opposite to the gradient:

$$\beta_0 \leftarrow \beta_0 - \alpha \frac{\partial L}{\partial \beta_0}, \qquad \beta_1 \leftarrow \beta_1 - \alpha \frac{\partial L}{\partial \beta_1},$$

where $\alpha$ is the learning rate.
We will start with $\beta_0 = \beta_1 = 0$ and iterate the updates above. The implementation below follows this recipe step by step.
import numpy as np

# Defining the initial parameters (zeros)
def weightInitialization(n_features):
    beta_1 = np.zeros(n_features)  # Note that beta_1 must have the same dimension as x
    beta_0 = 0
    return beta_1, beta_0
# Defining the sigmoid function
def sigmoid_vec(t):
    p = 1/(1 + np.exp(-t))
    return p
# Calculates the negative log-likelihood (cost) and its derivatives
def log_model(beta_1, beta_0, x, y):
    # Prediction
    t = np.dot(beta_1, x.T) + beta_0  # The .T means the transpose of x and np.dot the dot product
    prob = sigmoid_vec(t)
    y = np.array(y)
    n = x.shape[0]  # Number of samples
    likelihood = np.sum(-y*np.log(prob) - (1 - y)*np.log(1 - prob))/n  # Average the loss over the n samples
    # Gradient calculation
    dbeta_1 = np.dot(x.T, (prob - y))/n
    dbeta_0 = np.sum(prob - y)/n
    gradients = {"dbeta_1": dbeta_1, "dbeta_0": dbeta_0}
    return gradients, likelihood, prob
# Iterate the model to make it learn
def model_predict(n_features, x, y, learning_rate, n_iterations):
    beta_1, beta_0 = weightInitialization(n_features)
    costs = []
    for i in range(n_iterations):
        gradients, likelihood, prob = log_model(beta_1, beta_0, x, y)
        dbeta_1 = gradients["dbeta_1"]
        dbeta_0 = gradients["dbeta_0"]
        # Weight update
        beta_1 = beta_1 - (learning_rate * dbeta_1)
        beta_0 = beta_0 - (learning_rate * dbeta_0)
        costs.append(likelihood)
    # Final parameters
    y_prob = prob
    parameters = {"beta_1": beta_1, "beta_0": beta_0}
    gradient = {"dbeta_1": dbeta_1, "dbeta_0": dbeta_0}
    return parameters, gradient, costs, y_prob
# Create an array with the final predictions
# Given a bias beta_0, weights beta_1 and an input x, return the predicted class:
# 1 ("pass") if the predicted probability exceeds 0.5, otherwise 0 ("fail")
def predict(beta_1, beta_0, x):
    t = np.dot(beta_1, x.T) + beta_0
    z = sigmoid_vec(t)
    y_pred = np.zeros(x.shape[0])
    for i in range(y_pred.shape[0]):
        if z[i] > 0.5:
            y_pred[i] = 1
    return y_pred
Finally, we define a function that ties together all the functions above:
# Defining our logistic regression model
def my_Logistic_Regression(x_train, y_train, x_test, learning_rate, n_iterations):
    n_features = x_train.shape[1]  # Dimension of the input
    # model_predict initializes the weights internally via weightInitialization
    parameters, gradients, costs, y_prob = model_predict(n_features, x_train, y_train, learning_rate, n_iterations)  # Defined above
    y_pred = predict(parameters["beta_1"], parameters["beta_0"], x_test)  # Defined above
    return y_pred, y_prob
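As a quick sanity check (a sketch only: training and predicting on the same tiny synthetic set, with an arbitrary learning rate and iteration count), we can run the model on the student data defined earlier:

# Train on the synthetic student data and predict on the same points
y_pred, y_prob = my_Logistic_Regression(x_hours, y_result, x_hours,
                                        learning_rate=0.1, n_iterations=10000)
print(y_pred)  # Predicted pass/fail labels for each student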
The Scikit Learn linear_model.LogisticRegression() class implements a regularized logistic regression. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression.
Let us import a data set to make predictions. As an example we will use a data set from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
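A sketch of loading the data and fitting the Scikit Learn model (the file name diabetes.csv, the target column name Outcome, and the split parameters are assumptions; adapt them to your copy of the dataset):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("diabetes.csv")         # Assumed file name for the Pima data set
X = df.drop(columns="Outcome")           # 'Outcome' is the usual target column name
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)  # Default L2-regularized logistic regression
clf.fit(X_train, y_train)
y_pred_sk = clf.predict(X_test)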
The confusion matrix tells us the number of true/false positive/negative predictions. From it, we can calculate the precision, the misclassification rate, the accuracy, the prevalence, etc. Let's see in more detail the quantities that we can calculate using the confusion matrix (a code sketch follows the list):
- Accuracy = (TP + TN)/total. The overall evaluation of the classifier.
- Sensitivity = TP/(actual number of 1's). Indicates how many times 1 is predicted correctly.
- Fall-out = FP/(actual number of 0's). Indicates how many times 1 is predicted when the actual answer is 0.
- Specificity = TN/(actual number of 0's). Indicates how many times 0 is predicted correctly.
- Misclassification Rate = (FP + FN)/total. Tells how often our model is wrong.
- Precision = TP/(TP + FP). Gives the rate of correct positive predictions.
- Prevalence = (actual number of 1's)/total. Indicates how often condition 1 really occurs.
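All of these quantities can be read directly off the confusion matrix. A sketch using sklearn.metrics, assuming the y_test and y_pred_sk arrays from the fit above:

from sklearn.metrics import confusion_matrix

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_sk).ravel()
total = tn + fp + fn + tp

print("Accuracy:              ", (tp + tn) / total)
print("Sensitivity:           ", tp / (tp + fn))
print("Fall-out:              ", fp / (fp + tn))
print("Specificity:           ", tn / (tn + fp))
print("Misclassification rate:", (fp + fn) / total)
print("Precision:             ", tp / (tp + fp))
print("Prevalence:            ", (tp + fn) / total)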
This means that the Scikit Learn logistic model with the parameter choices above gets the predictions right nearly
The Cumulative Distribution Function (CDF) helps us understand how good this model is:
Another important validation method is the Area Under the Curve (AUC), which can be calculated using the Receiver Operating Characteristic (ROC) curve plotted below. The ROC curve is generated by plotting the true positive rate against the false positive rate.
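A sketch of how the ROC curve and AUC can be computed for the Scikit Learn model (assuming the fitted clf from above; predict_proba provides the scores the curve needs):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_score = clf.predict_proba(X_test)[:, 1]  # Predicted probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC =", roc_auc_score(y_test, y_score))

plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()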
The Kolmogorov-Smirnov (KS) coefficient is the maximum distance between the two curves; as with the AUC, higher values are preferable. Although the model could still be improved using techniques like cross-validation and standardization, this simple model shows reasonable values for these metrics with a very low p-value (which means high confidence).
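One way to compute the KS coefficient (and its p-value) is to compare the score distributions of the two classes with scipy, a sketch assuming the y_score array from the ROC step:

import numpy as np
from scipy.stats import ks_2samp

# Compare the predicted scores of actual positives against actual negatives
y_true = np.asarray(y_test)
ks_stat, p_value = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])
print("KS =", ks_stat, "p-value =", p_value)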
As expected, the negative log-likelihood tends to a minimum as the number of iterations grows.
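The cost curve can be produced with model_predict, which exposes the per-iteration costs that my_Logistic_Regression discards (a sketch; the learning rate and iteration count are arbitrary choices for the diabetes data):

import matplotlib.pyplot as plt

params_ours, _, costs, _ = model_predict(X_train.shape[1], X_train.to_numpy(),
                                         y_train.to_numpy(), learning_rate=0.001,
                                         n_iterations=10000)
plt.plot(costs)
plt.xlabel("Iteration")
plt.ylabel("Negative log-likelihood")
plt.show()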
Let's plot the confusion matrix of our model and calculate the precision, recall, and f1-score:
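A sketch of this step, reusing the params_ours fitted above on the same train/test split:

from sklearn.metrics import confusion_matrix, classification_report

y_pred_ours = predict(params_ours["beta_1"], params_ours["beta_0"], X_test.to_numpy())
print(confusion_matrix(y_test, y_pred_ours))
print(classification_report(y_test, y_pred_ours))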
Also, let's plot the ROC curve and calculate the AUC of our logistic regression implementation:
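A sketch of this step. Note that the y_prob returned by my_Logistic_Regression refers to the training set, so we recompute the scores on the test set from the fitted parameters:

from sklearn.metrics import roc_curve, roc_auc_score

# Test-set probabilities from our fitted parameters
scores_ours = sigmoid_vec(np.dot(params_ours["beta_1"], X_test.to_numpy().T) + params_ours["beta_0"])
fpr_o, tpr_o, _ = roc_curve(y_test, scores_ours)
print("AUC =", roc_auc_score(y_test, scores_ours))

plt.plot(fpr_o, tpr_o, label="From-scratch model")
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()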
This model presents a much lower AUC and KS coefficient than the Scikit Learn implementation. This was expected, since it is a model for didactic purposes only.
We built a logistic regression model from scratch and compared it to the default Scikit Learn implementation. Our focus was on understanding the mechanisms behind the black box of logistic regression models and explaining them clearly. As expected for a simple didactic model, our implementation performed worse than the Scikit Learn model, but the objective of the tutorial was successfully achieved.