Logistic Regression From Scratch
Logistic regression is among the most famous classification algorithms. It is probably the first classifier that Data Scientists employ to establish a baseline model on a new project. In this article we will implement logistic regression from scratch using gradient descent. The Jupyter Notebook of this article can be found HERE.
Background
Given some data D:
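$$D = \{(X_1, y_1), (X_2, y_2), \dots, (X_n, y_n)\}$$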
where each Xi is a vector of length k and yi is the label (either 0 or 1), the logistic regression algorithm will find a function that maps each feature vector Xi to its label. This function specifies a decision boundary, which would be a line with two-dimensional data (Figure 1), a plane with three-dimensional data, or a hyperplane with higher-dimensional data.
Figure 1. Logistic regression decision boundary
Each data point lying on the decision boundary will have a probability equal to 0.5. On the right side of the decision boundary the probability will be higher than 0.5 and data points will be assigned to the positive class. On the left of the decision boundary the probability will be lower than 0.5 and data points will belong to the negative class.
The Math - The Sigmoid Function
The core of logistic regression is the sigmoid function:
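$$\sigma(z_i) = \frac{1}{1 + e^{-z_i}}$$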
Figure 2. Shape of sigmoid function
zi can be any real number, negative or positive. The sigmoid transforms each zi into a number between 0 and 1 (Figure 2). Note that:
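$$\sigma(z_i) \to 1 \ \text{as} \ z_i \to +\infty, \qquad \sigma(z_i) \to 0 \ \text{as} \ z_i \to -\infty, \qquad \sigma(0) = 0.5$$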
By mapping every zi to a number between 0 and 1, the sigmoid function is well suited for giving the input zi a probabilistic interpretation.
zi is given by:
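$$z_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_k x_{ik} = W \cdot X_i \tag{3}$$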
where Xi are the features and Wi the weights of each feature. (3) is the same formula used by a linear regression model.
The Math - The Loss Function
We will use the sigmoid function to make predictions. To evaluate the model's predictions we need an objective function. The loss function commonly used in logistic regression is the Binary Cross-Entropy Loss Function:
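$$\mathcal{L}(y_i, \hat{y}_i) = \begin{cases} -\log(\hat{y}_i) & \text{if } y_i = 1 \\ -\log(1 - \hat{y}_i) & \text{if } y_i = 0 \end{cases} \tag{4}$$
where $\hat{y}_i = \sigma(z_i)$ is the predicted probability for the i-th data point.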
Figure 3. Binary Cross Entropy Loss Function plot
As Figure 3 depicts, the binary cross-entropy loss heavily penalizes predictions that are far away from the true value. The loss function (4) can be rearranged into a single formula:
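$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
where n is the number of data points.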
The Math - Model's Parameters Update
The pillar of every machine learning model is reducing the value of the loss function during training. The most crucial and complicated part of this process is calculating the derivative, aka the gradient, of the loss function with respect to the model's parameters (W). Once the gradient is calculated, the model's parameters can be readily updated with gradient descent in an iterative manner. As the model learns from the data, parameters are updated at each iteration and the loss decreases.
The gradient of the Binary Cross-Entropy Loss Function w.r.t. the model's parameters is:
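$$\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{n} X^{T} (\hat{y} - y) \tag{6}$$
where X is the feature matrix, ŷ the vector of predicted probabilities and y the vector of true labels.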
If you are interested in the mathematical derivation of (6), click HERE. Once the gradient is calculated, the model parameters are updated with gradient descent at each iteration:
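$$W := W - \alpha \, \frac{\partial \mathcal{L}}{\partial W}$$
where $\alpha$ is the learning rate.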
Logistic Regression implementation
Example 1 - Non-overlapping classes
We will train a logistic regressor on the data depicted below (Figure 4). The two classes are disjoint, and a line (the decision boundary) separating them can easily be drawn between the two clusters.
Figure 4. Observed data (non-overlapping classes)
The zi (3) for each data point of Figure 4 is given by:
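$$z_i = w_0 x_{i0} + w_1 x_{i1} + w_2 x_{i2}$$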
Therefore, the regressor model will learn the values of w0, w1 and w2. w0 is the intercept, and to learn its value we need to append a column of ones (Xi0) to the original feature array:
Figure 5. Addition of the X0 column
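A minimal NumPy sketch of this step could look as follows; the random arrays are only illustrative stand-ins for the observed data of Figure 4.

```python
import numpy as np

# Illustrative stand-in for the observed data of Figure 4: n samples, 2 features.
X = np.random.randn(100, 2)
y = np.random.randint(0, 2, size=100)   # binary labels (0 or 1)

# Prepend a column of ones (Xi0) so that w0 is learned as the intercept.
X = np.hstack([np.ones((X.shape[0], 1)), X])
```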
We need a few functions to calculate the loss, compute the gradient and make predictions:
Figure 6. Helper functions
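The exact helpers live in the notebook; a minimal NumPy sketch of what they might look like is shown below, reusing the arrays defined above (the function names are illustrative).

```python
def sigmoid(z):
    # Squash each z_i to a number between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def loss(y, y_hat):
    # Binary cross-entropy averaged over the n samples.
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient(X, y, y_hat):
    # Gradient of the loss with respect to the weights W (equation 6).
    return X.T @ (y_hat - y) / y.shape[0]

def predict(X, W):
    # Predicted probability for each sample.
    return sigmoid(X @ W)
```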
The vector W is initialized with random values between 0 and 1. Next, we use a for loop to train the model:
Figure 7. Training loop
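A sketch of such a training loop, reusing the arrays and helper functions from the snippets above, might look like this; the learning rate of 0.1 is an illustrative choice, not necessarily the one used in the notebook.

```python
np.random.seed(42)                  # for reproducibility
W = np.random.rand(X.shape[1])      # weights initialized between 0 and 1
learning_rate = 0.1                 # illustrative value

losses = []
for i in range(750):
    y_hat = predict(X, W)                           # forward pass
    losses.append(loss(y, y_hat))                   # track the training loss
    W = W - learning_rate * gradient(X, y, y_hat)   # gradient descent update
```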
During training the loss dropped consistently, and after 750 iterations the trained model is able to accurately classify 100 percent of the training data:
Figure 8. Training loss and confusion matrix
Finally, let's look at how the decision boundary changed during training:
Figure 9. Decision boundary at 4 different iterations
Figure 9 shows that during training the decision boundary moved from the bottom left (random initialization) to between the two clusters.
Example 2 - Overlapping classes
In the previous example the two classes were so easily separable that we could draw the decision boundary on our own. In this second example, the two classes significantly overlap:
Figure 10. Observed data (Overlapping classes)
The trained model positioned the decision boundary somewhere between the two clusters, where the loss was smallest.
Figure 11. Decision boundary at 4 different iterations
In this second example the data is not linearly separable, thus the best we can aim for is the highest accuracy possible (and the smallest loss). The trained model has an accuracy of 93%.
Closing remarks
In this tutorial we learned how to implement and train a logistic regressor from scratch. The two models were trained for a predefined number of iterations. Another, more efficient approach is to train the model until the accuracy reaches a plateau or the decrease in the loss becomes negligible (i.e. smaller than a predetermined threshold). Implementing these two options is pretty straightforward, and I encourage you to modify the training loop accordingly.