Linear Regression Interview Questions


Coding Practice (Linear Regression)

Problem Statement: You are given the data of individuals and their healthcare charges billed by their insurance provider. The columns in the data set are:

• age: Age of the individual
• sex: Gender of the individual (female, male)
• bmi: Body mass index
• children: Number of children covered under health insurance / number of dependants
• smoker: Indicates whether the person smokes or not
• region: The individual's residential area in the US
• charges: Medical costs borne by the health insurance provider for the individual

Here, "charges" will be the dependent variable and all the other variables are independent variables.

Question 1: The following questions require you to do some EDA and data preparation, and finally to perform linear regression on the data set to predict the healthcare charges for the individuals.

Create a new feature called BMI_group which groups people based on their BMI. The groups should be as follows:

• Underweight: BMI is less than 18.5.
• Normal: BMI is 18.5 to 24.9.
• Overweight: BMI is 25 to 29.9.
• Obese: BMI is 30 or more.

The grouping is based on WHO standards.

The output should show the first five rows of the resulting dataframe.

import pandas as pd

pd.set_option('display.max_columns', 500)

# df is preloaded with the insurance data in the coding console
def bmi_group(val):
    # The WHO cut-offs 24.9 and 29.9 are upper bounds of their ranges,
    # so use < 25 and < 30 to ensure no BMI value falls through a gap
    if val < 18.5:
        return "Underweight"
    elif val < 25:
        return "Normal"
    elif val < 30:
        return "Overweight"
    else:
        return "Obese"

df["BMI_group"] = df.bmi.apply(bmi_group)
print(df.head())

Question 2: Encode all categorical features so that they can be used in a regression model, i.e. sex, BMI_group, smoker and region should be labelled properly. Use the label encoder for all features.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_columns', 500)

le = LabelEncoder()

# sex
le.fit(df.sex.drop_duplicates())
df.sex = le.transform(df.sex)

# smoker or not
le.fit(df.smoker.drop_duplicates())
df.smoker = le.transform(df.smoker)

# region
le.fit(df.region.drop_duplicates())
df.region = le.transform(df.region)

# BMI_group: cast to string before encoding
df.BMI_group = df.BMI_group.astype(str)
le.fit(df.BMI_group.drop_duplicates())
df.BMI_group = le.transform(df.BMI_group)

print(df.head())

Question 3: As everyone knows, smoking is a major cause of poor health. Here, try to find out whether smoking affects people's health. Print the correlation of the "smoker" column with the "bmi", "age" and "charges" columns on three lines, respectively. Note: Round all three values to four decimal places using the round() function.

import pandas as pd

print(round(df.smoker.corr(df.bmi),4))
print(round(df.smoker.corr(df.age),4))
print(round(df.smoker.corr(df.charges),4))

Question 4: The dataset has now been divided into train and test sets. Since you have already seen that being a smoker is highly correlated with healthcare charges, try to create a linear regression model using only the "smoker" variable as the independent variable and "charges" as the dependent variable.

Note: All operations you performed in the previous questions have already been performed on the dataset here.

You can take any other measures to ensure a better outcome if you want. The dataset has been divided into train and test sets and both have been loaded in the coding console.  You have to write the predictions in the file: /code/output/predictions.csv. You have to add the predictions in a column titled "predicted_charges" in the test dataset. Make sure you use the same column name otherwise your score won't be evaluated.

Your model's R-squared will be evaluated on an unseen test dataset. The R-squared of your model should be greater than 0.6.

import numpy as np
import pandas as pd

# Linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(np.array(train['smoker']).reshape(-1,1),train['charges'])
y_test_pred=lr.predict(np.array(test['smoker']).reshape(-1,1))

# Write the output
test["predicted_charges"]=y_test_pred
test.to_csv("/code/output/predictions.csv")

Question 5: You saw that by using only the "smoker" variable, you can easily get an R-squared of 0.66. Now your task is to perform linear regression using the entire dataset.

Note: All operations you performed in questions 1-3 have already been performed on the dataset here.

You can take any other measures to ensure a better outcome if you want (for example, normalising or standardising any values, or adding any other columns).

You have to write the predictions in the file: /code/output/predictions.csv. You have to add the predictions in a column titled "predicted_charges" in the test dataset. Make sure you use the same column name otherwise your score won't be evaluated.

Your model's adjusted R-squared will be evaluated on an unseen test dataset. The adjusted R-squared of your model should be greater than 0.72.

import numpy as np
import pandas as pd

# Linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train.drop(["region","charges"],axis=1),train['charges'])
y_test_predicted=lr.predict(test.drop("region",axis=1))

# Write the output
#Do not edit the last two lines here
#reload test set before this step if you have made any changes to the test set
test["predicted_charges"]=y_test_predicted
test.to_csv("/code/output/predictions.csv")

What is the likelihood function?

The likelihood function is the joint probability of observing the data. For example, let’s assume that a coin is tossed 100 times and you want to know the probability of getting 60 heads from the tosses. This example follows the binomial distribution formula.

• p = Probability of heads from a single coin toss
• n = 100 (the number of coin tosses)
• x = 60 (the number of heads – success)
• n - x = 40 (the number of tails)
• Pr (X=60 | n = 100, p)

The likelihood function is the probability that the number of heads is 60 in a trial of 100 coin tosses, where the probability of heads in each coin toss is p. Here the coin toss result follows a binomial distribution.

This can be reframed as follows:

• $\Pr(X=60 \mid n=100, p) = c \times p^{60} \times (1-p)^{100-60}$
• c = constant (the binomial coefficient $\binom{100}{60}$)
• p = unknown parameter

The likelihood function gives the probability of observing the results using unknown parameters.
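The coin-toss likelihood above can be evaluated numerically. A minimal sketch (the counts n = 100 and x = 60 come from the example; the candidate values of p are arbitrary):

```python
from math import comb

n, x = 100, 60  # 100 tosses, 60 heads (from the example above)

def likelihood(p):
    """Binomial likelihood: Pr(X = x | n, p) = C(n, x) * p^x * (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The likelihood is maximised at the sample proportion p = x/n = 0.6,
# which is the maximum likelihood estimate of the unknown parameter
print(likelihood(0.5))
print(likelihood(0.6))
```

Evaluating the likelihood over a grid of p values and picking the maximiser is exactly maximum likelihood estimation for this one-parameter model.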

Why can’t linear regression be used in place of logistic regression for binary classification?

The reasons why linear regression cannot be used for binary classification are as follows:

Distribution of error terms: The distribution of data in the case of linear and logistic regression is different. Linear regression assumes that error terms are normally distributed. In the case of binary classification, this assumption does not hold true.

Model output: In linear regression, the output is continuous. In the case of binary classification, an output of a continuous value does not make sense. For binary classification problems, linear regression may predict values that can go beyond 0 and 1. If we want the output in the form of probabilities, which can be mapped to two different classes, then its range should be restricted to 0 and 1. As the logistic regression model can output probabilities with logistic/sigmoid function, it is preferred over linear regression.

Variance of residual errors: Linear regression assumes that the variance of the random errors is constant (homoscedasticity). This assumption is also violated in the case of binary classification.
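The "model output" point above can be seen concretely. A small numpy sketch (toy data assumed for illustration): an ordinary least squares line fitted to 0/1 labels can predict values outside [0, 1], while the sigmoid used by logistic regression always stays inside (0, 1).

```python
import numpy as np

# Toy binary data: the label is 1 when x is large (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Ordinary least squares fit: y_hat = a*x + b
a, b = np.polyfit(x, y, 1)
print(a * 10 + b)           # the prediction at x = 10 exceeds 1, so it is not a probability

# The logistic/sigmoid function maps any real score into (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(a * 10 + b))  # strictly between 0 and 1
```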

What are odds?

It is the ratio of the probability of an event occurring to the probability of the event not occurring. For example, let’s assume that the probability of winning a lottery is 0.01. Then, the probability of not winning is 1 - 0.01 = 0.99.

Now, as per the definition,

The odds of winning the lottery = (Probability of winning) / (Probability of not winning)

The odds of winning the lottery = 0.01/0.99

Hence, the odds of winning the lottery are 1 to 99, and the odds of not winning the lottery are 99 to 1.
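The lottery arithmetic above, as a quick sketch (the log-odds line is an addition for context, since the log of the odds is the quantity logistic regression models linearly):

```python
from math import log

p_win = 0.01               # probability of winning (from the example)
p_lose = 1 - p_win         # 0.99

odds_win = p_win / p_lose  # odds of winning: 1 to 99
odds_lose = p_lose / p_win # odds of not winning: 99 to 1

# Log-odds (logit): negative here because winning is less likely than not
log_odds = log(odds_win)

print(round(odds_win, 4), round(odds_lose, 2), round(log_odds, 4))
```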

How can the probability of a logistic regression model be expressed as a conditional probability?

The conditional probability can be given as:

P (Discrete value of target variable|X1, X2, X3 ... XN)

It is the probability of the target variable taking a discrete value (either 0 or 1 in binary classification problems) when the values of the independent variables are given. For example, the probability that an employee will attrite (target variable) given attributes such as age, salary, KRAs, etc.

What is the formula for the logistic regression function?

In general, the formula for logistic regression is given by the following expression:

$f(z) = \frac{1}{1+e^{-(\beta_{0} +\beta_{1} X_{1} +\beta_{2} X_{2}+ \dots +\beta_{k} X_{k})}}$
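The formula above can be sketched directly in code. The coefficient values here are arbitrary placeholders, not fitted values:

```python
import numpy as np

def logistic(X, beta0, beta):
    """f(z) = 1 / (1 + exp(-(beta0 + beta . X))) -- the expression above."""
    z = beta0 + X @ beta
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example inputs and coefficients (placeholders for illustration)
X = np.array([[0.5, 1.2],
              [-1.0, 0.3]])
print(logistic(X, beta0=0.1, beta=np.array([0.8, -0.4])))
```

Note that when z = 0 the output is exactly 0.5, the usual decision boundary for binary classification.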

What are the differences between logistic regression and linear regression?

The main differences between logistic and linear regression are:

1. The dependent/response variable in linear regression is continuous, whereas in logistic regression it is discrete.

2. The cost function in linear regression minimises the sum of squared errors, Sum(Actual(Y) - Predicted(Y))^2, whereas logistic regression uses the maximum likelihood method to maximise the probability of the observed data.

What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear regression

Q-Q plots (quantile-quantile plots) plot the quantiles of two data sets against each other. A quantile is a point below which a given fraction of the data falls; for example, the median is the quantile below which 50% of the data lie and above which the other 50% lie. The purpose of a Q-Q plot is to find out whether two sets of data come from the same distribution. A 45-degree reference line is drawn on the plot; if the two data sets come from a common distribution, the points will fall on that line.

(Figure: a Q-Q plot showing the 45-degree reference line.)

If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q–Q plot will approximately lie on a line, but not necessarily on the line y = x. Q–Q plots can also be used as a graphical means of estimating parameters in a location-scale family of distributions.

A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.
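The idea can be sketched numerically: the points of a Q-Q plot are matching quantiles of the two samples, and for a common distribution they hug the line y = x. In practice one would use scipy.stats.probplot or statsmodels' qqplot; this is a minimal numpy sketch with two samples drawn (by assumption) from the same normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=0, scale=1, size=500)  # two samples drawn from the
b = rng.normal(loc=0, scale=1, size=500)  # same distribution (assumed)

# Matching quantiles of the two samples: these are the points of a Q-Q plot
q = np.linspace(0.01, 0.99, 99)
qa, qb = np.quantile(a, q), np.quantile(b, q)

# For a common distribution the points stay close to the 45-degree line y = x
max_dev = np.max(np.abs(qa - qb))
print(max_dev)  # small if the samples share a distribution
```

In linear regression this technique is typically applied to the residuals against theoretical normal quantiles, to check the normality-of-errors assumption.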

You might have observed that sometimes the value of VIF is infinite. Why does this happen?

If there is perfect correlation between two independent variables, VIF is infinite. In that case, regressing one of them on the others gives R² = 1, so VIF = 1/(1 - R²) tends to infinity. To solve this problem, drop one of the variables causing the perfect multicollinearity from the dataset.

An infinite VIF value indicates that the corresponding variable may be expressed exactly by a linear combination of other variables (which show an infinite VIF as well).
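The infinite-VIF case can be reproduced with a deliberately duplicated column. In practice statsmodels' variance_inflation_factor is used; this is a self-contained numpy sketch of the same computation:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns; VIF = 1/(1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(y))])  # add an intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return np.inf if r2 >= 1 else 1 / (1 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
X = np.column_stack([x1, x2, x1])  # third column duplicates the first

print(vif(X, 0))  # effectively infinite: perfect collinearity with column 3
print(vif(X, 1))  # finite and small: x2 is independent of the others
```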

What is the difference between normalized scaling and standardized scaling?

Normalisation typically means rescaling values into the range [0, 1]. Standardisation typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).

S.No. | Normalisation | Standardisation
1. | The minimum and maximum values of a feature are used for scaling. | The mean and standard deviation are used for scaling.
2. | Used when features are on different scales. | Used when we want zero mean and unit standard deviation.
3. | Scales values to [0, 1] or [-1, 1]. | Not bounded to a particular range.
4. | Strongly affected by outliers. | Much less affected by outliers.
5. | Scikit-Learn provides the MinMaxScaler transformer for normalisation. | Scikit-Learn provides the StandardScaler transformer for standardisation.
6. | Squishes the n-dimensional data into an n-dimensional unit hypercube. | Translates the data so its mean moves to the origin, and squishes or expands it to unit variance.
7. | Useful when the feature distribution is unknown. | Useful when the feature distribution is normal (Gaussian).
8. | Often called min-max scaling. | Often called z-score normalisation.
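Both scalings can be sketched with numpy on a toy column of values (sklearn's MinMaxScaler and StandardScaler apply the same formulas column-wise):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])  # assumed toy values

# Normalisation (min-max scaling): maps values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardisation (z-score): zero mean, unit standard deviation
x_std = (x - x.mean()) / x.std()

print(x_norm)                          # every value lies in [0, 1]
print(x_std.mean(), x_std.std())       # mean ~0, standard deviation 1
```

Note how the outlier 100 compresses the normalised values of the other points toward 0, illustrating row 4 of the table.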