Health Insurance Premium Price Prediction

Before starting, let’s first understand what health insurance is and what a premium is.

Health insurance is a type of insurance that covers medical expenses arising from an illness. These expenses could be hospitalization costs, the cost of medicines, or doctor consultation fees. Insurance covers a portion of a policyholder’s medical costs. How much the insurance covers, and how much the policyholder pays via copays, deductibles, and coinsurance, depends on the details of the policy itself, with specific rules and regulations that apply to some plans. If you are wondering what copay, deductible, and coinsurance mean, let’s go through them briefly, because it is important to know these three terms.

Copay: A copay is the fixed part of the medical treatment expenses that the policyholder has to bear, while the rest is borne by the insurer. It can be either a fixed amount or a fixed percentage of the treatment costs. For example, if your insurance policy comes with a copay clause of Rs. 2000 and the treatment costs Rs. 10,000, you will be required to pay Rs. 2000 towards your treatment, while the remaining Rs. 8000 will be covered by the insurer.

Deductible: A deductible is a fixed sum of money that policyholders are required to pay before their insurance policy starts contributing to their medical treatment. The insurance provider decides whether the deductible applies per year or per treatment. For example, if your policy mandates a deductible of Rs. 5000, you will be required to pay the first Rs. 5000 of your treatment expenses, after which your insurance policy will kick in.

Coinsurance: Coinsurance refers to the percentage of treatment costs that you have to bear after paying the deductible. It is generally set as a fixed percentage and is similar to the copayment provision under health insurance. For example, if your coinsurance is 20%, you will be liable to bear 20% of the treatment cost while the remaining 80% will be borne by your insurance provider.
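The worked examples above reduce to simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes that coinsurance is charged on the part of the bill above the deductible, which real policies may handle differently.

```python
def copay_split(bill, copay):
    """Fixed copay: the policyholder pays `copay`, the insurer pays the rest."""
    patient = min(bill, copay)
    return patient, bill - patient


def deductible_then_coinsurance(bill, deductible, coinsurance_pct):
    """Policyholder pays the deductible first, then a percentage of the remainder.

    Assumption for this sketch: coinsurance applies only to the amount
    above the deductible.
    """
    deductible_part = min(bill, deductible)
    remainder = bill - deductible_part
    patient = deductible_part + remainder * coinsurance_pct / 100
    return patient, bill - patient


# Copay example from the text: Rs. 2000 copay on a Rs. 10,000 bill.
print(copay_split(10_000, 2_000))  # (2000, 8000)

# Rs. 5000 deductible followed by 20% coinsurance on a Rs. 10,000 bill.
print(deductible_then_coinsurance(10_000, 5_000, 20))  # (6000.0, 4000.0)
```

Each function returns a pair: what the policyholder pays and what the insurer pays.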

What is a Premium?

A health insurance premium is an upfront payment made on behalf of an individual or family to keep their health insurance policy active. Premiums are typically paid monthly when purchased on the individual market, although individuals who receive insurance through their employer usually pay their portion of the premium through payroll deductions. In addition to the premium, consumers may have to pay out-of-pocket costs (deductibles, copays, and coinsurance) when they seek medical care.

Why Health Insurance?

It’s important to have health insurance as a safety net. If you unexpectedly get sick or injured, health insurance is there to help cover costs that you likely can’t afford to pay on your own. Health care can be very expensive. It can be an enormous financial burden. Surgery, emergency care, prescription drugs, lab work, scans and examinations — these sorts of costs can add up very quickly.

Problem Statement: Based on the given features, we need to predict the premium price. It is a regression problem.

Why should this model be used?

The model lets the insurer predict the premium price for a policyholder based on their health records. Suppose a person wants to take health insurance and, according to his health records, the premium should be Rs. 25,000, but we suggest that he pay Rs. 20,000; that would be a loss for the company. With this model, we get a prediction of what the premium price should be, and we can do business with individuals on that basis. This can be a very good way to increase the profit of the business.

Drawback: Every advantage comes with disadvantages as well. Using the model requires each employee to have some computer knowledge; otherwise they won’t understand what is happening. Training costs are also higher. Data collection can be hard and time consuming, and without sufficient data it is difficult to build a machine learning model.

Dataset:

The dataset has a total of 986 rows and 11 columns (features).

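The first performance metric is the mean absolute percentage error (MAPE):

$$M = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{A_t - F_t}{A_t} \right|$$

Where,

M = MAPE

N = Number of terms in the summation (the number of samples)

A_t = Actual value

F_t = Predicted or forecast value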

Here the objective is to predict the premium price, which is the target variable. Because MAPE is a relative error, the range of the target values does not affect the score. A score of 0 means a perfect model and larger values are worse; scores usually fall between 0 and 1, although MAPE can exceed 1 when predictions are far from the actual values. MAPE is also intuitive to interpret as a relative error, which makes it easier to compare the results of different models.
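As a minimal sketch of how MAPE can be computed, the function below averages the absolute relative errors and returns a fraction rather than a percentage. The function name and the premium values are made up for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0 means a perfect fit)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, predicted))
    return total / len(actual)


# Hypothetical premium prices, for illustration only.
actual = [20000, 25000, 40000]
predicted = [22000, 25000, 36000]

# Relative errors are 0.1, 0.0 and 0.1, so the score is their mean.
print(round(mape(actual, predicted), 4))
```

Because every error is divided by the actual value, multiplying all premiums by a constant leaves the score unchanged.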

The R2 score is also used, because it measures how much of the variability in the actual values is explained by the model, which tells us how the predicted values vary with respect to the actual values.

R-squared for the regression model on the left is 15%, and for the model on the right it is 85%. When a regression model accounts for more of the variance, the data points are closer to the regression line. Hence a scatter plot is suitable for observing these R2 score values.
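A small sketch of the R2 calculation may help here. It uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares; the function name and the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: share of variance explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical premium prices, for illustration only.
actual = [15000, 23000, 28000, 38000]
predicted = [17000, 22000, 29000, 36000]
print(round(r_squared(actual, predicted), 4))

# Always predicting the mean explains none of the variance.
mean_value = sum(actual) / len(actual)
print(r_squared(actual, [mean_value] * len(actual)))  # 0.0
```

A perfect model scores 1, and a model that is no better than predicting the mean scores 0.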

Data preprocessing includes handling null values, removing duplicates, and converting categorical data into numerical data for further processing.

The good part of this dataset is that it has no null values, no duplicate values, and all the features are numerical.

Checking for Outliers:

Observation: In the figure above, the box plot is not clear for any feature except Premium Price. Let’s check three other features, Age, Weight, and Height, separately, because they have a wider range of values, while the other features are just binary.

Code and Results:

Observation: Looking at the figures above, the Weight feature has a lot of outliers. Let’s apply a log transformation.

Because we have a small number of rows, we cannot remove these outliers, since that would change the shape of the dataset. A log transformation compresses large values and brings the data to a normal or close to normal distribution.
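To illustrate why a log transformation can reduce the number of outliers, the sketch below counts points outside the usual box-plot whiskers (1.5 times the interquartile range) before and after taking logs. The weights are invented for the example and are not taken from the dataset, and the quartile method of Python’s statistics module may differ slightly from the one a plotting library uses.

```python
import math
from statistics import quantiles


def count_iqr_outliers(values):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < lower or v > upper)


# Hypothetical weights in kg, not taken from the dataset.
weights = [51, 55, 58, 60, 62, 65, 67, 70, 72, 75, 78, 80, 85, 118, 132]
log_weights = [math.log(w) for w in weights]

outliers_before = count_iqr_outliers(weights)
outliers_after = count_iqr_outliers(log_weights)
print(f"outliers before: {outliers_before}, after log: {outliers_after}")
```

The transformation keeps every row, so the shape of the dataset is unchanged; it only pulls the large values closer to the rest.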

Observation: From the figure above, we can see that the number of outliers in the Weight feature is reduced. For Premium Price, a few points are still out of range, because there are not enough samples at those prices to say much about them. These cannot be altered; the only way is to increase the number of samples for these prices.

Observation: From the figure above, we can say that most of the people paying premiums are between 40 and 50 years old. Similarly, the fewest premium payers are between 35 and 40 years old.

Observation: From the figure above, we can say that most people pay a premium of roughly 23500 to 24500, and very few pay premiums of 20000 or around 27000.

Observation: From the two figures above, we can say that the known allergy feature has little effect on the premium price, as people with and without allergies pay almost the same premium. It is also clear that most of the people paying premiums do not have any known allergy.

Observation: From the figures above, we can also say that the Diabetes feature has little effect on the premium price. People who have diabetes pay a slightly higher premium than people who don’t, but the difference is small. The second figure makes it clear that more of the people paying premiums do not have diabetes.

Observation: From the figures above, we can also say that the HistoryOfCancerInFamily feature has little effect on the premium price. People who have a history of cancer in the family pay a slightly higher premium than people who don’t, but the difference is small. The second figure makes it clear that a very large share of the people paying premiums do not have a history of cancer in the family.

Observation: From the figures above, we can say that people who have had 2 or 3 major surgeries pay premiums of more than 25000. The second figure makes it very clear that most of the people paying premiums have had no major surgeries.

Observation: From the figures above, we can also say that the BloodPressureProblems feature has little effect on the premium price. People who have blood pressure problems pay a slightly higher premium than people who don’t, but the difference is small. The second figure makes it clear that slightly more of the people paying premiums do not have blood pressure problems.

From all the figures above, we can see that it makes sense to create new features from Age and Premium Price, because the remaining features have little effect on the premium price.

The features are created by dividing the Age and Premium Price columns into groups: Age is divided into Teen, Young, Middle, Old, and Oldest, and Premium Price is divided into salary-style bins: Low, Basic, Average, High, and Super high.
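The binning step can be sketched as follows. The labels come from the text, but the cut points are assumptions made up for this example, since the actual bin edges are not stated; the helper name `to_bin` is also hypothetical.

```python
from bisect import bisect_right

# Labels from the text; the edges are hypothetical cut points.
AGE_EDGES = [20, 30, 45, 60]
AGE_LABELS = ["Teen", "Young", "Middle", "Old", "Oldest"]

PRICE_EDGES = [17000, 22000, 27000, 33000]
PRICE_LABELS = ["Low", "Basic", "Average", "High", "Super high"]


def to_bin(value, edges, labels):
    """Return the label of the bin that `value` falls into.

    Each bin includes its lower edge and excludes its upper edge.
    """
    return labels[bisect_right(edges, value)]


for age, price in [(18, 15000), (25, 23000), (50, 28000), (66, 39000)]:
    age_group = to_bin(age, AGE_EDGES, AGE_LABELS)
    price_group = to_bin(price, PRICE_EDGES, PRICE_LABELS)
    print(age, age_group, price, price_group)
```

Four edges produce five bins, which matches the five labels used for each column.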

Before going to the next step, let’s visualize the bins that were created.

Observation: From the figure above, we can say that premiums in the Low category are paid by people aged around 22, and this is the smallest group compared to the other labels. Most people aged 50 and above pay premiums that fall under the Average label.

Observation: From the figure above, it’s very clear that most people in the Old and Oldest groups pay higher premiums.

Now let’s convert this categorical data into numerical data, as required for further processing.

As we can observe from the columns above, the columns are measured on different scales: age is measured in years, weight is measured in kilograms (but was replaced by its log values because it had outliers), and height is measured in centimeters. Using the data as it is may affect the modelling.

Standardizing the data means rescaling the attributes so that they have a mean of 0 and a variance of 1. The ultimate goal of standardization is to bring all the features to a common scale without distorting the differences in the ranges of the values.
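Standardization can be sketched in a few lines: subtract the mean and divide by the standard deviation. The example below uses the population standard deviation so that the resulting variance is exactly 1; the height values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to mean 0 and variance 1 (population standard deviation)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


# Hypothetical heights in centimeters.
heights = [150, 160, 165, 170, 185]
scaled = standardize(heights)
print([round(v, 3) for v in scaled])
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

The order and the relative spacing of the values are preserved; only the scale changes.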

In this project, a total of 6 models are used, so that the dataset is tested on several regression models and their results can be compared, instead of checking with just one model. Hyperparameter tuning is done on the model that has good results and no overfitting.

Models are:

1. Linear Regression

2. XGBRegressor

3. RandomForestRegressor

4. ExtraTreeRegressor

5. SVR

6. KNeighborsRegressor
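A comparison loop over these models could look like the sketch below. It is not the original code: the data is synthetic, the hyperparameters are scikit-learn defaults, XGBRegressor is left out so that the example depends only on scikit-learn and NumPy, and ExtraTreeRegressor is taken from sklearn.tree as an assumption based on the name in the list.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import ExtraTreeRegressor

# Synthetic stand-in for the standardized features and the premium price.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 24000 + 3000 * X[:, 0] + 1000 * X[:, 1] ** 2 + rng.normal(scale=500, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreeRegressor": ExtraTreeRegressor(random_state=0),
    "SVR": SVR(),
    "KNeighborsRegressor": KNeighborsRegressor(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    results[name] = {
        "train_mape": mean_absolute_percentage_error(y_train, pred_train),
        "test_mape": mean_absolute_percentage_error(y_test, pred_test),
        "test_r2": r2_score(y_test, pred_test),
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 4) for k, v in scores.items()})
```

Comparing the train and test MAPE for each model gives a quick check for overfitting: a large gap between the two suggests the model has memorized the training data.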

Before going to the coding part, let’s understand a little of the theory behind all the models.

Linear Regression:

RandomForestRegressor:

XGBRegressor:

ExtraTreeRegressor:

Support Vector Regression (SVR)

KNeighborsRegressor:

Observation: Looking at all the results, the ExtraTreeRegressor model looks very good and well suited to this dataset. Let’s do hyperparameter tuning on it and check whether we can improve the model.

From the output above, we can say that there is not much change in the results, but one observation is that the difference between the train error and the test error is reduced.
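One possible way to run such a tuning step is a grid search with cross-validation, sketched below. The parameter grid, the synthetic data, and the choice of GridSearchCV are all assumptions for illustration; the search method and parameters actually used are not stated here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import ExtraTreeRegressor

# Synthetic stand-in data, as the real dataset is not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 24000 + 3000 * X[:, 0] + 1000 * X[:, 1] ** 2 + rng.normal(scale=500, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical grid: limiting depth and leaf size tends to reduce overfitting.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    ExtraTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_percentage_error",
    cv=5,
)
search.fit(X_train, y_train)

best = search.best_estimator_
train_mape = mean_absolute_percentage_error(y_train, best.predict(X_train))
test_mape = mean_absolute_percentage_error(y_test, best.predict(X_test))

print(search.best_params_)
print(f"train MAPE={train_mape:.4f}  test MAPE={test_mape:.4f}")
```

A smaller gap between the train and test MAPE after tuning indicates less overfitting, even when the test score itself barely changes.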

· With more data samples, the models could likely be trained to a much lower error.

· This dataset has only 986 samples, which is not really enough.

· Looking at the value counts of the target variable, there are a few premium prices with just 1 or 2 samples, so for those prices the model only predicts values near them.

· Increasing the number of samples for those premium prices should reduce the error further.

· Another option is to add other features, such as the person’s income or regular check-up schedule, which can help to predict the premium price.

Thank you for reading.

This is my first blog. Please support it; any suggestions are most welcome.
