Linear Regression Interview Questions

Coding Practice (Linear Regression): Given the data of individuals and their healthcare charges billed by their insurance provider

Problem Statement: Given the data of individuals and their healthcare charges billed by their insurance provider, the following are the columns in the data set:

  • age: Age of the individual
  • sex: Gender of the individual (female, male)
  • bmi: Body mass index
  • children: Number of children covered under health insurance / number of dependants
  • smoker: Indicates whether the person smokes or not
  • region: The individual's residential area in the US
  • charges: Medical costs borne by the health insurance provider for the individual

Here, "charges" will be the dependent variable and all the other variables are independent variables.

Question 1: The following questions require you to do some EDA and data preparation, and finally to perform linear regression on the data set to predict the healthcare charges for the individuals.

Create another feature called BMI_group, which groups people based on their BMI. The groups should be as follows:

  • Underweight: BMI is less than 18.5.
  • Normal: BMI is 18.5 to 24.9.
  • Overweight: BMI is 25 to 29.9.
  • Obese: BMI is 30 or more.

The grouping is based on WHO standards.

The output should show the first five rows of the resulting dataframe.

import pandas as pd

pd.set_option('display.max_columns', 500)
df = pd.read_csv("")  # path to the insurance data set

def bmi_group(val):
    # Bin a BMI value into the WHO categories listed above
    if val < 18.5:
        return "Underweight"
    elif val < 25:
        return "Normal"
    elif val < 30:
        return "Overweight"
    else:
        return "Obese"

df["BMI_group"] = df.bmi.apply(bmi_group)
print(df.head())
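
The same binning can also be written declaratively with pd.cut, which makes the bin edges explicit; a minimal sketch, reusing the df from the block above:

import numpy as np

# Equivalent binning with pd.cut; right=False makes each interval
# closed on the left, so 25 falls in "Overweight" and 30 in "Obese"
df["BMI_group"] = pd.cut(
    df.bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)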

Question 2: Encode all categorical features so that they can be used in a regression model, i.e., sex, BMI_group, smoker and region should be labelled properly. Use the label encoder for all of these features.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_columns', 500)
df = pd.read_csv("")  # path to the prepared data set

le = LabelEncoder()
# Integer-encode each categorical column in place
df.sex = le.fit_transform(df.sex)
df.smoker = le.fit_transform(df.smoker)
df.region = le.fit_transform(df.region)
# Cast to string first so the encoder handles the column uniformly
df.BMI_group = le.fit_transform(df.BMI_group.astype(str))
print(df.head())
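
One caveat worth noting: LabelEncoder assigns arbitrary integer codes, which a linear model will interpret as an ordering. The question asks for label encoding, but for a nominal feature such as region, one-hot encoding is often the safer choice; a minimal sketch of the alternative:

# One-hot encode 'region' instead of label-encoding it;
# drop_first=True avoids perfect multicollinearity with the intercept
df = pd.get_dummies(df, columns=["region"], drop_first=True)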

Question 3: As everyone knows, smoking is a major cause of bad health. Here, try to find out whether smoking affects people's health. Print the correlation of the "smoker" column with the "bmi", "age" and "charges" columns, on three lines respectively. Note: You should round off all three values to four decimal places using the round() function.

import pandas as pd

df = pd.read_csv("")  # path to the encoded data set

# Pearson correlation of the 'smoker' column with each of the three columns
print(round(df.smoker.corr(df.bmi), 4))
print(round(df.smoker.corr(df.age), 4))
print(round(df.smoker.corr(df.charges), 4))

Question 4: The dataset has now been divided into train and test sets. Since you have already seen that being a smoker and healthcare charges are highly correlated, try to create a linear regression model using only the "smoker" variable as the independent variable and "charges" as the dependent variable.

Note: All operations you performed in the previous questions have already been performed on the dataset here. 

You can take any other measures to ensure a better outcome if you want. Both the train and test sets have been loaded in the coding console. You have to write the predictions to the file /code/output/predictions.csv, adding them in a column titled "predicted_charges" in the test dataset. Make sure you use this exact column name; otherwise, your score won't be evaluated.

Your model's R-squared will be evaluated on an unseen test dataset. The R-squared of your model should be greater than 0.6. 

import pandas as pd
from sklearn.linear_model import LinearRegression

# Read training and test data
train = pd.read_csv("insurance_training.csv")
test = pd.read_csv("insurance_test.csv")

# Fit a simple linear regression on the single 'smoker' feature
lr = LinearRegression()
lr.fit(train[['smoker']], train['charges'])
y_test_pred = lr.predict(test[['smoker']])

# Write the output
test["predicted_charges"] = y_test_pred
test.to_csv("/code/output/predictions.csv")
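
Since the evaluation set is unseen, you can approximate the score beforehand on a hold-out split of the training data; a minimal sketch (the 20% split size and random seed are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hold out part of the training data to estimate the unseen-test R-squared
X_tr, X_val, y_tr, y_val = train_test_split(
    train[['smoker']], train['charges'], test_size=0.2, random_state=42)
lr_check = LinearRegression().fit(X_tr, y_tr)
print(r2_score(y_val, lr_check.predict(X_val)))  # aim for > 0.6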

Question 5: You saw that by using only the "smoker" variable, you can easily get an R-squared of 0.66. Now your task is to perform linear regression using the entire dataset.

Note: All operations you performed in questions 1-3 have already been performed on the dataset here.

You can take any other measures to ensure a better outcome if you want (for example, normalising or standardising values or adding other columns).

You have to write the predictions to the file /code/output/predictions.csv, adding them in a column titled "predicted_charges" in the test dataset. Make sure you use this exact column name; otherwise, your score won't be evaluated.

Your model's adjusted R-squared will be evaluated on an unseen test dataset and should be greater than 0.72.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Read training and test data
train = pd.read_csv("insurance_training.csv")
test = pd.read_csv("insurance_test.csv")

# Fit on all features except the target; 'region' is left out as well
lr = LinearRegression()
lr.fit(train.drop(["region", "charges"], axis=1), train['charges'])
y_test_predicted = lr.predict(test.drop("region", axis=1))

# Write the output
# Do not edit the last two lines here
# Reload the test set before this step if you have made any changes to it
test["predicted_charges"] = y_test_predicted
test.to_csv("/code/output/predictions.csv")

What is the likelihood function?

The likelihood function is the joint probability of observing the data. For example, let’s assume that a coin is tossed 100 times and you want to know the probability of getting 60 heads from the tosses. This example follows the binomial distribution formula.

  • p = Probability of heads from a single coin toss
  • n = 100 (the number of coin tosses)
  • x = 60 (the number of heads – success)
  • n - x = 40 (the number of tails)
  • Pr(X = 60 | n = 100, p)

The likelihood function is the probability that the number of heads is 60 in a trial of 100 coin tosses, where the probability of getting heads in each coin toss is p. Here, the coin-toss result follows a binomial distribution.

This can be reframed as follows:

  • Pr(X = 60 | n = 100, p) = c × p^60 × (1 − p)^(100 − 60)
  • c = constant
  • p = unknown parameter

The likelihood function gives the probability of observing the results using unknown parameters.
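
As a quick illustration, the likelihood can be evaluated numerically and maximised over candidate values of p; a minimal sketch using scipy.stats (the grid of candidate values is an arbitrary choice):

import numpy as np
from scipy.stats import binom

# Likelihood of observing 60 heads in 100 tosses for each candidate p
p_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(60, 100, p_grid)
print(p_grid[np.argmax(likelihood)])  # maximum-likelihood estimate, 0.6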

What are odds?

It is the ratio of the probability of an event occurring to the probability of the event not occurring. For example, let’s assume that the probability of winning a lottery is 0.01. Then, the probability of not winning is 1 - 0.01 = 0.99.

Now, as per the definition,

The odds of winning the lottery = (Probability of winning) / (Probability of not winning)

The odds of winning the lottery = 0.01/0.99

Hence, the odds of winning the lottery are 1 to 99, and the odds of not winning the lottery are 99 to 1.
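
A one-line sanity check of the arithmetic:

p_win = 0.01
print(p_win / (1 - p_win))  # odds of winning: ~0.0101, i.e. 1 to 99
print((1 - p_win) / p_win)  # odds of not winning: 99.0, i.e. 99 to 1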

How can the probability of a logistic regression model be expressed as a conditional probability?

The conditional probability can be given as:

P(discrete value of target variable | X1, X2, X3, ..., XN)

It is the probability of the target variable taking a discrete value (either 0 or 1 in the case of binary classification problems) when the values of the independent variables are given, for example, the probability that an employee will attrite (target variable) given their attributes such as age, salary, KRAs, etc.
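
In scikit-learn, this conditional probability is exactly what predict_proba returns; a minimal sketch on made-up data (the attributes and values below are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical attributes [age, salary in thousands]; target: 1 = attrited
X = np.array([[25, 30], [40, 80], [30, 45], [50, 120]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
# P(attrition = 1 | age = 28, salary = 40)
print(clf.predict_proba([[28, 40]])[0, 1])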

What are the differences between logistic regression and linear regression?

The main differences between logistic and linear regression are:

1. The dependent/response variable in linear regression is continuous, whereas in logistic regression it is discrete.

2. The cost function in linear regression minimises the squared error term Sum(Actual(Y) − Predicted(Y))^2, whereas logistic regression uses the maximum likelihood method to maximise the probabilities.
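
Both models share the same scikit-learn API, which makes the contrast easy to see: the same feature matrix fits either model, but one predicts a continuous value and the other a discrete class; a minimal sketch on toy data:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10).reshape(-1, 1)
y_continuous = 2.0 * X.ravel() + 1.0        # continuous response
y_discrete = (X.ravel() > 4).astype(int)    # binary response

print(LinearRegression().fit(X, y_continuous).predict([[7]]))  # a real number
print(LogisticRegression().fit(X, y_discrete).predict([[7]]))  # a class label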

What is scaling? Why is scaling performed?

It is a data pre-processing step applied to the independent variables to normalise the data within a particular range. It also helps speed up the calculations in an algorithm.

Most of the time, the collected data set contains features that vary widely in magnitude, units and range. If scaling is not done, the algorithm takes only the magnitudes into account and not the units, which leads to incorrect modelling. To solve this issue, scaling is done to bring all the variables to the same level of magnitude.

It is important to note that scaling affects only the coefficients and none of the other parameters, such as the t-statistic, F-statistic, p-values and R-squared.
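
A minimal sketch of the two most common approaches in scikit-learn:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

print(StandardScaler().fit_transform(X))  # standardising: zero mean, unit variance
print(MinMaxScaler().fit_transform(X))    # normalising: rescale to [0, 1]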

Explain the bias-variance trade-off.

Bias refers to the difference between the values predicted by the model and the real values. It is an error. One of the goals of an ML algorithm is to have a low bias.

Variance refers to the sensitivity of the model to small fluctuations in the training data set. Another goal of an ML algorithm is to have low variance.

For a data set that is not exactly linear, it is not possible to have both bias and variance low at the same time. A straight line model will have low variance but high bias, whereas a high-degree polynomial will have low bias but high variance.

There is no escaping the relationship between bias and variance in machine learning.

  1. Decreasing the bias increases the variance.
  2. Decreasing the variance increases the bias.

So, there is a trade-off between the two; the ML specialist has to decide, based on the assigned problem, how much bias and variance can be tolerated. Based on this, the final model is built.
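
The trade-off can be made concrete by fitting polynomials of increasing degree to noisy data and comparing train and validation errors; a minimal sketch (degrees 1 and 10 stand in for the straight-line and high-degree models mentioned above):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.3, 60)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 10):  # high bias vs high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),    # train error
          mean_squared_error(y_val, model.predict(X_val)))  # validation error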

Explain gradient descent with respect to linear regression.

Gradient descent is an optimisation algorithm. In linear regression, it is used to minimise the cost function and find the values of the estimators (the Θs) corresponding to the optimised value of the cost function.

Gradient descent works like a ball rolling down a graph (ignoring the inertia). The ball moves along the direction of the greatest gradient and comes to rest at the flat surface (minima).

[Figure: gradient descent visualised as a ball rolling down the cost surface]

Mathematically, the aim of gradient descent for linear regression is to find the solution of argmin J(Θ0, Θ1), where J(Θ0, Θ1) is the cost function of the linear regression. It is given by:

J(Θ0, Θ1) = (1/2m) × Σ_{i=1 to m} (h(x_i) − y_i)^2

Here, h is the linear hypothesis model, h = Θ0 + Θ1x, y is the true output, and m is the number of datapoints in the training set.

Gradient descent starts with a random solution, and then, based on the direction of the gradient, the solution is updated to the new value, where the cost function has a lower value.

The update is:

Repeat until convergence:

Θ_j := Θ_j − α × ∂J(Θ0, Θ1)/∂Θ_j,  for j = 0 and 1 (updated simultaneously), where α is the learning rate.
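
A minimal NumPy sketch of this update rule for simple linear regression (the learning rate and iteration count are arbitrary choices):

import numpy as np

# Toy data generated from y = 1 + 2x plus noise
rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, 100)

theta0, theta1 = 0.0, 0.0  # starting solution
alpha = 0.1                # learning rate

for _ in range(5000):
    h = theta0 + theta1 * x       # current hypothesis
    grad0 = (h - y).mean()        # dJ/dtheta0
    grad1 = ((h - y) * x).mean()  # dJ/dtheta1
    # Simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # should approach 1 and 2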

What is the major difference between R-squared and adjusted R-squared? Or, why is it advised to use adjusted R-squared in case of multiple linear regression?

The major difference between R-squared and adjusted R-squared is that R-squared does not penalise the model for having a larger number of variables. Thus, if you keep adding variables to the model, the R-squared value will always increase (or remain the same, if the added variable is completely uncorrelated with the dependent variable). R-squared can therefore misleadingly suggest that any variable added to the model increases its predictive power.

Adjusted R-squared, on the other hand, penalises the model based on the number of variables present in it. Its formula is given as:

                                                             Adjusted R² = 1 − [(1 − R²)(N − 1)] / (N − k − 1)

where 'N' is the number of datapoints and 'k' is the number of features.

So, if you add a variable and the adjusted R-squared drops, you can be certain that that variable is insignificant to the model and, hence, should not be used. Thus, in the case of multiple linear regression, you should always look at the adjusted R-squared value in order to keep redundant variables out of your regression model.
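
The formula translates directly into a helper; a minimal sketch (assumes the true values, predictions and the number of features k are already at hand):

from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, k):
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1) / (N - k - 1)."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)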