Problem Statement: You are given data on individuals and the healthcare charges billed by their insurance provider. The data set has the following columns:
- sex: Gender of the individual (female, male)
- bmi: Body mass index
- children: Number of children covered under health insurance / number of dependants
- smoker: Indicates whether the person smokes or not
- region: The individual's residential area in the US
- charges: Medical costs borne by the health insurance provider for the individual
Here, "charges" will be the dependent variable and all the other variables are independent variables.
Question 1: The following questions require you to do some EDA and data preparation, and finally to perform linear regression on the data set to predict the healthcare charges for the individuals.
Create another feature called BMI_group, which groups people based on their BMI. The groups should be as follows:
- Underweight: BMI is less than 18.5.
- Normal: BMI is 18.5 to 24.9.
- Overweight: BMI is 25 to 29.9.
- Obese: BMI is 30 or more.
The grouping is based on WHO standards.
The output should have first five rows of the resulting dataframe.
import pandas as pd
pd.set_option('display.max_columns', 500)
df=pd.read_csv("")
def bmi_group(val):
    # Half-open ranges, so values such as 24.9, 24.95 or 29.95 always land in the intended group
    if val < 18.5:
        return "Underweight"
    if val < 25:
        return "Normal"
    if val < 30:
        return "Overweight"
    if val >= 30:
        return "Obese"
df["BMI_group"] = df.bmi.apply(bmi_group)
print(df.head())
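The same grouping can also be sketched with pd.cut instead of a row-wise function. This is only an illustration on hypothetical toy values (the name toy and its numbers are not from the data set), and it assumes the groups are meant as half-open ranges, so that a BMI of 24.95 counts as Normal and 30 counts as Obese.

```python
import pandas as pd

# Hypothetical toy values; the real "bmi" column is assumed to be numeric.
toy = pd.DataFrame({"bmi": [17.0, 18.5, 24.95, 25.0, 29.95, 30.0, 41.2]})

# right=False makes every bin closed on the left: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
toy["BMI_group"] = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
print(toy)
```

pd.cut returns a categorical column, so you may need astype(str) before passing it to an encoder.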
Question 2: Encode all categorical features so that they can be used in a regression model, i.e. sex, BMI_group, smoker and region should be label-encoded. Use the label encoder for all features.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
pd.set_option('display.max_columns', 500)
df=pd.read_csv("")
# Change the data type of BMI_group to str before encoding
df["BMI_group"] = df["BMI_group"].astype(str)
# Fit a fresh encoder per column; codes follow the sorted order of each column's labels
for col in ["sex", "smoker", "region", "BMI_group"]:
    df[col] = LabelEncoder().fit_transform(df[col])
print(df.head())
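To see which integer each label receives, you can inspect the encoder's classes_ attribute. The sketch below uses hypothetical toy rows; in particular, the "yes"/"no" values for smoker are an assumption, since the column description above does not state the actual labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; "yes"/"no" for smoker is an assumed encoding of the raw data.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "BMI_group": ["Obese", "Normal", "Underweight", "Overweight"],
})

mappings = {}
for col in ["smoker", "BMI_group"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    # classes_ is sorted, so the position of each label is its integer code
    mappings[col] = {str(label): code for code, label in enumerate(le.classes_)}

print(mappings)
print(toy)
```

Note that the codes follow alphabetical order, so for BMI_group they do not reflect the natural order from Underweight to Obese.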
Question 3: Smoking is widely known to be a major cause of poor health. Here, try to find out whether smoking is affecting people's health. Print the correlation of the "smoker" column with the "bmi", "age" and "charges" columns, on three separate lines in that order. Note: You should round all three values to four decimal places using the round() function.
import pandas as pd
df=pd.read_csv("")
print(round(df.smoker.corr(df.bmi),4))
print(round(df.smoker.corr(df.age),4))
print(round(df.smoker.corr(df.charges),4))
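The three correlations can also be computed in one call with DataFrame.corrwith. This is a sketch on hypothetical toy numbers (the frame toy and its values are made up for illustration), assuming "smoker" has already been encoded as 0/1.

```python
import pandas as pd

# Hypothetical toy data with "smoker" already label-encoded as 0/1.
toy = pd.DataFrame({
    "smoker": [1, 0, 0, 1, 0, 1],
    "bmi": [31.2, 24.5, 27.9, 35.1, 22.0, 29.4],
    "age": [45, 23, 36, 52, 19, 61],
    "charges": [39000.0, 2500.0, 5200.0, 44000.0, 1700.0, 30500.0],
})

# Pearson correlation of each listed column with "smoker", in the listed order
corrs = toy[["bmi", "age", "charges"]].corrwith(toy["smoker"]).round(4)
for value in corrs:
    print(value)
```

The column order in the selection decides the order of the printed lines, so keep it as "bmi", "age", "charges".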
Question 4: The dataset has now been divided into train and test sets. You already saw that being a smoker and healthcare charges are highly correlated, so try to create a linear regression model using only the "smoker" variable as the independent variable and "charges" as the dependent variable.
Note: All operations you performed in the previous questions have already been performed on the dataset here.
You can take any other measures to ensure a better outcome if you want. The dataset has been divided into train and test sets and both have been loaded in the coding console. You have to write the predictions in the file: /code/output/predictions.csv. You have to add the predictions in a column titled "predicted_charges" in the test dataset. Make sure you use the same column name otherwise your score won't be evaluated.
Your model's R-squared will be evaluated on an unseen test dataset. The R-squared of your model should be greater than 0.6.
import numpy as np
import pandas as pd
# Read training data
train = pd.read_csv("insurance_training.csv")
# Read test data
test = pd.read_csv("insurance_test.csv")
# Linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(np.array(train['smoker']).reshape(-1,1),train['charges'])
y_test_pred=lr.predict(np.array(test['smoker']).reshape(-1,1))
# Write the output
test["predicted_charges"]=y_test_pred
test.to_csv("/code/output/predictions.csv")
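With a single 0/1 feature, the fitted line has a simple reading: the intercept is the mean charge of non-smokers, and the coefficient is the difference between the two group means. On the training data, R-squared also equals the squared correlation between the two columns. The sketch below checks this on hypothetical toy numbers (not the real data).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; "smoker" is assumed to be encoded as 0/1.
toy = pd.DataFrame({
    "smoker": [0, 0, 0, 1, 1, 1],
    "charges": [2000.0, 4000.0, 6000.0, 28000.0, 32000.0, 36000.0],
})

lr = LinearRegression().fit(toy[["smoker"]], toy["charges"])
means = toy.groupby("smoker")["charges"].mean()

# Intercept vs. mean of non-smokers, coefficient vs. difference of group means
print(lr.intercept_, means.loc[0])
print(lr.coef_[0], means.loc[1] - means.loc[0])

# Training R-squared vs. squared Pearson correlation
r2 = lr.score(toy[["smoker"]], toy["charges"])
r = toy["smoker"].corr(toy["charges"])
print(round(r2, 4), round(r ** 2, 4))
```

This identity holds only for the data the model was fitted on; the R-squared on an unseen test set can differ.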
Question 5: You saw that by using only the "smoker" variable, you can get an R-squared of 0.66 easily. Now your task is to perform linear regression using the entire dataset.
Note: All operations you performed in questions 1-3 have already been performed on the dataset here.
You can take any other measures to ensure a better outcome if you want (for example, normalising or standardising any values, or adding any other columns).
You have to write the predictions in the file: /code/output/predictions.csv. You have to add the predictions in a column titled "predicted_charges" in the test dataset. Make sure you use the same column name otherwise your score won't be evaluated.
Your model's adjusted R-squared will be evaluated on an unseen test dataset. It should be greater than 0.72.
import numpy as np
import pandas as pd
# Read training data
train = pd.read_csv("insurance_training.csv")
# Read test data
test = pd.read_csv("insurance_test.csv")
# Linear regression on all features except region
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train.drop(["region","charges"],axis=1),train['charges'])
y_test_predicted=lr.predict(test.drop("region",axis=1))
# Write the output
#Do not edit the last two lines here
#reload test set before this step if you have made any changes to the test set
test["predicted_charges"]=y_test_predicted
test.to_csv("/code/output/predictions.csv")
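Adjusted R-squared penalises R-squared for the number of features, using 1 - (1 - R^2)(n - 1)/(n - p - 1) for n rows and p features. The helper below is a minimal sketch of that formula; adjusted_r2 is a hypothetical name, the numbers are toy values, and how the evaluation console computes its score is not stated here.

```python
from sklearn.metrics import r2_score


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R-squared for n = len(y_true) rows and n_features predictors."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


# Hypothetical toy values: plain R-squared is 0.99 here
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

print(r2_score(y_true, y_pred))
print(adjusted_r2(y_true, y_pred, n_features=2))
```

You can use such a helper on a hold-out slice of the training data to check whether adding a column improves the score enough to justify the extra feature.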