Understanding Logistic Regression

Hello, Machine Learning enthusiasts! This article is about a well-known classification algorithm called 'Logistic Regression'. Since it is a classification algorithm, it falls under the umbrella of 'Supervised Learning'.

The following topics will be covered in the upcoming sections:

(1) What is Logistic Regression?

(2) How does Logistic Regression work?

(3) Evaluation Metrics for Logistic Regression

(4) Applications of Logistic Regression

(5) Conclusion

Consider the following example: you would like to predict whether an email you received is spam or not spam, based on the features present in the dataset. This type of classification, where two class labels (Spam/Not Spam) exist, is known as a Binary Classification problem, and Logistic Regression is one of the basic algorithms used to solve such problems.

Figure 1: Logistic Regression

Figure 1 shows a simple plot of Logistic Regression. Here x1 and x2 are the independent features and y is the dependent feature (0/1). As we can see, the regression line divides the plane into two halves, each representing one of the two classes. So, based on the x1 and x2 values, the algorithm labels unknown data by assigning it to the class it thinks the point belongs to. To get a better understanding of the algorithm, let us look at its working mechanism in the next section.
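To make this concrete, here is a toy sketch (not from the article's figure) of how such a line labels a point; the weights w1, w2 and intercept b are made-up values chosen purely for illustration:

```python
# Hypothetical line w1*x1 + w2*x2 + b = 0 dividing the plane into two halves.
w1, w2, b = 1.0, -1.0, 0.5

def classify(x1, x2):
    """Label a point by which side of the line it falls on."""
    return 1 if (w1 * x1 + w2 * x2 + b) >= 0 else 0

print(classify(2.0, 1.0))  # 1: w1*2.0 + w2*1.0 + b = 1.5 >= 0
print(classify(0.0, 2.0))  # 0: w1*0.0 + w2*2.0 + b = -1.5 < 0
```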

In the previous section we understood what Logistic Regression is and its basic working principle. In this section we are going to dive deeper into its working mechanism.

So far we know that Logistic Regression is a binary classification algorithm that predicts one of two labels: '0 or 1', 'Yes/No', etc. So how does it decide which class label to assign when given new data? This decision is made with the help of something known as the Sigmoid function. Sigmoid is a non-linear function which transforms the output and returns a probability value between 0 and 1. A threshold (generally 0.5) is then applied: if the output value is greater than 0.5 the model assigns the label 1, and if the output value is less than 0.5 it assigns the label 0. In this way the Sigmoid function helps the model determine the class label for a new set of data.

The equation for the Sigmoid activation function is as follows:

sigmoid(x) = 1 / (1 + e^(-x))

where x is the input given to the function.
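As a minimal sketch of the equation and thresholding described above (the function names are my own, not from the article):

```python
import numpy as np

def sigmoid(x):
    """Squash any real-valued input into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

def predict_label(x, threshold=0.5):
    """Apply the 0.5 threshold described above to get a class label."""
    return 1 if sigmoid(x) >= threshold else 0

print(sigmoid(0.0))        # 0.5: right on the decision boundary
print(predict_label(2.0))  # 1, since sigmoid(2.0) is about 0.88 > 0.5
```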

To visualize what the Sigmoid activation looks like in action, the following image gives a brief idea.

Figure 3: Sigmoid activation graph

Now that we have understood how the Sigmoid function influences Logistic Regression, we are going to understand the role of the Cost Function.

In Figure 1, we saw that a regression line divides the plane into two halves representing the two classes, with each point belonging to its respective class. Such a line is drawn under ideal conditions. In real-life scenarios, where you are given a dataset for a binary classification problem and choose Logistic Regression as your algorithm, identifying the best-fit line to separate the two classes is harder: if the line is not optimal, the error rate increases and the accuracy of the model decreases. That is, the chances of new data falling into the wrong category increase, which can prove costly when solving crucial business problems.

Thus, to identify the best-fit line, we penalize the model for every wrong prediction using something known as a cost function. This happens while we train the model on a dataset. The input, along with the weights, is passed to the model to make a prediction. Once the model has produced a prediction, the actual value (say y) and the predicted value (say y1) are compared and the error is calculated. This feedback is sent back to the model, the weights are adjusted, and a new set of inputs from the dataset is fed into the model. These iterations are repeated over the training data, and by the time training has finished, optimal weights have been identified, giving the best-fit line for the model. This optimization algorithm is famously known as Gradient Descent.

The equation for the Cost function (log loss) is given by:

Cost = -(1/m) * Σ [ y * log(h(x)) + (1 - y) * log(1 - h(x)) ]

where m is the number of training records, y is the actual label, and h(x) is the probability predicted by the model.

The model has to “pay a price” for every wrong prediction it has made, thus improving the accuracy at each iteration.
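Below is a minimal NumPy sketch of this training loop. It uses batch gradient descent (updating on the whole dataset at once rather than record by record), and all function and variable names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat):
    """Cost from the equation above: heavy penalty for confident wrong predictions."""
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def train(X, y, lr=0.1, epochs=1000):
    """Plain batch gradient descent: predict, measure error, adjust weights, repeat."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)           # predictions for the whole dataset
        grad_w = X.T @ (y_hat - y) / len(y)  # gradient of the cost w.r.t. the weights
        grad_b = np.mean(y_hat - y)          # gradient w.r.t. the intercept
        w -= lr * grad_w                     # feedback step: adjust the weights
        b -= lr * grad_b
    return w, b

# Tiny made-up dataset: two features per record, binary labels.
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [4.0, 1.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train(X, y)
print(log_loss(y, sigmoid(X @ w + b)))  # the cost shrinks as training proceeds
```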

Visualizing the cost function graphically gives a better picture of how it works: the cost is high when the predicted probability is far from the actual label, and drops toward zero as the prediction approaches it.

Now that we have understood how Logistic Regression works, it is important to understand how to evaluate your model's performance, so that the necessary steps can be taken to improve it.

Now consider the situation where you have built your Logistic Regression model; the next step is to check the model's performance, i.e., how accurately it predicts for both known and unknown data. There are various metrics that assist you in this. Some of the well-known ones are as follows:

(1) Accuracy

(2) Precision

(3) Recall

(4) F1-Score

Before we get into the metrics mentioned above, we need to understand something known as the Confusion Matrix. It is an N x N matrix (where N is the number of classes) and is used to describe the performance of the model on test data where the actual values are known. The four cells of a typical 2 x 2 confusion matrix are described below, with a short code sketch after the definitions.

TN (True Negative): These are correctly predicted negative values, which means the actual value is no and the predicted value is also no.

FP (False Positive): This is when the actual class is no but the predicted class is yes. It is also known as a Type I error. Which of the two error types is more dangerous depends on the problem, so whether you work harder to reduce FP or FN is a use-case decision.

FN (False Negative): This is when actual class is yes but the predicted class is no. It is also known as Type II error.

TP (True Positive): These are correctly predicted positive values, which means the actual value is yes and the predicted value is also yes.
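Here is a small sketch (with made-up labels) showing how these four counts can be read off using scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual vs. predicted labels (1 = yes, 0 = no).
y_actual    = [0, 0, 1, 1, 1, 0, 1, 0]
y_predicted = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels, sklearn lays the 2 x 2 matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```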

Now that we understand what Confusion Matrix is, let us now understand how it relates to the metrics we had listed previously.

Accuracy: It is the ratio of correct predictions to the total number of predictions. The formula for Accuracy is given by,

Accuracy = (TN + TP) / (TN + FP + FN + TP)

To achieve higher accuracy it is obvious that values of TN and TP should be higher, since they are the correctly predicted classes by the model.

Precision: It is the ratio of correctly predicted positive observations to the total predicted positive observations. The formula for Precision is given by,

Precision = TP / (TP + FP)

Recall (Sensitivity): It is the ratio of correctly predicted positive observations to all the observations in the actual positive class. The formula for Recall is given by,

Recall = TP / (TP + FN)

F1-Score: It is the harmonic mean of Precision and Recall, giving a single score that balances the two.

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
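Putting the four formulas together, using the hypothetical counts from the confusion-matrix sketch above:

```python
# Counts carried over from the earlier (made-up) confusion-matrix example.
tn, fp, fn, tp = 3, 1, 1, 3

accuracy  = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1_score  = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.75
print(f"Precision: {precision:.2f}")  # 0.75
print(f"Recall:    {recall:.2f}")     # 0.75
print(f"F1-Score:  {f1_score:.2f}")   # 0.75
```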

Now that we have understood what Logistic Regression is, its working principle and metrics involved, let us now look at some of the applications of Logistic Regression across various domains.

Although Logistic Regression seems to be a simple algorithm, it can be powerful when applied in the right manner. It is often used as a starting point for any binary classification problem, since it is easy to implement and its results are easy to interpret.
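As a rough illustration of how little code this takes, here is a minimal end-to-end sketch with scikit-learn; the synthetic dataset is a stand-in for a real one:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic binary-classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# Prints precision, recall and F1 for each class on the held-out test data.
print(classification_report(y_test, model.predict(X_test)))
```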

Here are some of the use-cases across various domains:

Banking: In financial sectors such as banking, Logistic Regression is used to check whether a person is a loan defaulter or not, based on his/her previous history of transactions. If the model predicts that he/she is a defaulter, the bank will not provide a loan to that person in the future. Likewise, if the model predicts that the person is not a defaulter, the bank will provide a loan whenever he/she asks for one.

Another application in the banking sector is predicting whether a customer will leave the bank in the future. This problem is known as Customer Retention (or churn prediction), and it is important for banks to retain their customers for as long as possible.

Medical: In the medical sector, Logistic Regression is used to predict whether a patient is likely to suffer from diabetes, or whether a tumour is malignant or benign. Preventive measures can then be taken at an early stage, reducing both the treatment cost and the risk of the disease reaching an advanced stage.

This algorithm is also used to check whether medical insurance claims are genuine or fraudulent, so that only genuine claims are accepted and fraudulent ones are rejected.

Business: In the business sector, Logistic Regression is used to check whether an employee is likely to quit the company. If the chances of quitting are high, incentives such as a bonus or promotion can be offered, thus retaining that employee.

Another important application is improving product sales, where the model predicts whether buyers are going to purchase a product. If the chances of not buying are high, modifications can be made to the product so that it attracts more customers.

There are many more applications involved using Logistic Regression, but these are some of the prominent use-cases.

In this article we have understood what Logistic Regression is, its working principle, the evaluation metrics used to test model performance, and finally its applications across various sectors.

I hope you have enjoyed reading this article!
