Common Machine Learning Interview Questions


Coding Practice (Linear Regression): Given the data of individuals and their healthcare charges billed by their insurance provider

Problem Statement: Given the data of individuals and the healthcare charges billed by their insurance provider, the data set contains the following columns:

  • age: Age of the individual
  • sex: Gender of the individual (female, male)
  • bmi: Body mass index
  • children: Number of children covered under health insurance / number of dependants
  • smoker: Indicates whether the person smokes or not
  • region: The individual's residential area in the US
  • charges: Medical costs borne by the health insurance provider for the individual

Here, "charges" will be the dependent variable and all the other variables are independent variables.

Question 1: The following questions require you to do some EDA and data preparation, and finally to perform linear regression on the data set to predict the healthcare charges for the individuals.

Create another feature called BMI_group, which groups people based on their BMI. The groups should be as follows:

  • Underweight: BMI is less than 18.5.
  • Normal: BMI is 18.5 to 24.9.
  • Overweight: BMI is 25 to 29.9.
  • Obese: BMI is 30 or more.

The grouping is based on WHO standards.

The output should show the first five rows of the resulting dataframe.

import pandas as pd

pd.set_option('display.max_columns', 500)
df = pd.read_csv("")

def bmi_group(val):
    # Map a BMI value to its WHO category.
    # Half-open ranges ensure that values such as 24.95 or 29.95 are not left unassigned.
    if val < 18.5:
        return "Underweight"
    if val < 25:
        return "Normal"
    if val < 30:
        return "Overweight"
    return "Obese"

df["BMI_group"] = df.bmi.apply(bmi_group)
print(df.head())

Question 2: Encode all categorical features so that they can be used in a regression model, i.e. sex, BMI_group, smoker and region should be encoded as numeric labels. Use the label encoder for all features.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_columns', 500)
df = pd.read_csv("")

# Label-encode each categorical column with its own encoder
for col in ["sex", "smoker", "region", "BMI_group"]:
    # Cast to str first so that every value in the column has a consistent type
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

print(df.head())

Question 3: As everyone knows, smoking is a major cause of bad health. Here, try to find out whether smoking is affecting the health of people. Print the correlation of the "smoker" column with the "bmi", "age" and "charges" columns on three lines respectively. Note: You should round all three values to four decimal places using the round() function.

import pandas as pd 
df=pd.read_csv("")

print(round(df.smoker.corr(df.bmi),4))
print(round(df.smoker.corr(df.age),4))
print(round(df.smoker.corr(df.charges),4))
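
For reference, `Series.corr` computes the Pearson correlation by default. The sketch below shows that calculation using only the standard library. The `pearson_r` helper and the toy values are illustrative assumptions, not taken from the data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: 1 = smoker, 0 = non-smoker
smoker = [0, 0, 1, 1]
charges = [1000.0, 2000.0, 9000.0, 10000.0]
print(round(pearson_r(smoker, charges), 4))
```

With a binary column such as the encoded "smoker" flag, a high value simply means that the two groups have clearly separated average charges.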

Question 4: The dataset has now been divided into test and train sets. Since you have already seen that being a smoker and healthcare charges are highly correlated, try to create a linear regression model using only the "smoker" variable as the independent variable and "charges" as the dependent variable.

Note: All operations you performed in the previous questions have already been performed on the dataset here. 


You can take any other measures to ensure a better outcome if you want. The dataset has been divided into train and test sets, and both have been loaded in the coding console. You have to write the predictions to the file /code/output/predictions.csv, in a column titled "predicted_charges" added to the test dataset. Make sure you use the same column name; otherwise your score won't be evaluated.

Your model's R-squared will be evaluated on an unseen test dataset. The R-squared of your model should be greater than 0.6. 

import numpy as np
import pandas as pd

# Read training data
train = pd.read_csv("insurance_training.csv")

# Read test data
test = pd.read_csv("insurance_test.csv")

# Linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(np.array(train['smoker']).reshape(-1,1),train['charges'])
y_test_pred=lr.predict(np.array(test['smoker']).reshape(-1,1))

# Write the output
test["predicted_charges"]=y_test_pred
test.to_csv("/code/output/predictions.csv")
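
To make the evaluation criterion concrete, here is a minimal sketch of how R-squared is usually computed from true and predicted values. The `r_squared` helper and the example numbers are illustrative assumptions. When the true test charges are available, scikit-learn's `LinearRegression.score` reports the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(r_squared(y_true, y_pred))
```

A value of 1 means perfect predictions, while 0 means the model does no better than predicting the mean of the true values.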

Question 5: You saw that by using only the "smoker" variable, you can get an r-squared of 0.66 easily. Now your task is to perform linear regression using the entire dataset.

Note: All operations you performed in questions 1-3 have already been performed on the dataset here.

You can take any other measures to ensure a better outcome if you want (for example, normalising or standardising any values, or adding any other columns).


You have to write the predictions to the file /code/output/predictions.csv, in a column titled "predicted_charges" added to the test dataset. Make sure you use the same column name; otherwise your score won't be evaluated.

Your model's adjusted R-squared will be evaluated on an unseen test dataset. The R-squared of your model should be greater than 0.72.

import numpy as np
import pandas as pd

# Read training data
train = pd.read_csv("insurance_training.csv")

# Read test data
test = pd.read_csv("insurance_test.csv")

# Linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train.drop(["region","charges"],axis=1),train['charges'])
y_test_predicted=lr.predict(test.drop("region",axis=1))

# Write the output
#Do not edit the last two lines here
#reload test set before this step if you have made any changes to the test set 
test["predicted_charges"]=y_test_predicted
test.to_csv("/code/output/predictions.csv")
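
Since this question is scored on adjusted R-squared, the sketch below shows the usual adjustment for the number of predictors. The `adjusted_r_squared` helper and the example figures are illustrative assumptions and do not describe the grader's actual implementation.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R-squared: penalises R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical example: R-squared of 0.75 from 101 rows and 5 predictors
print(round(adjusted_r_squared(0.75, 101, 5), 4))
```

Unlike plain R-squared, this value can fall when a predictor that adds little explanatory power is included, so it is a fairer way to compare models with different numbers of features.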

What is the Central Limit Theorem and why is it important?

Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can’t obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes: what can we say about the average height of the entire population given a single sample? The Central Limit Theorem addresses this question exactly.

The Central Limit Theorem states that the mean of a sufficiently large sample of independent, identically distributed observations with finite variance is approximately normally distributed, whatever the shape of the underlying distribution. Because the theorem is an approximation, reasoning based on it is no longer exact. That said, for large enough sample sizes, the approximation is good enough to use for practical predictions. Assume for the moment that we know the variance $\sigma^2$ exactly. In this case we know that the mean $\bar{X}_m$ of a sample of size $m$ is approximately normal with mean $\mu_B$ and variance $m^{-1}\sigma^2$. We are interested in the interval $[\mu - \epsilon, \mu + \epsilon]$ that contains 95% of the probability mass of a normal distribution.
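
The 95% interval can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the Exponential(1) population, the sample size, the number of trials and the seed are arbitrary choices rather than anything from the text. For a normal distribution, 95% of the mass lies within about 1.96 standard deviations of the mean, so the half-width is $\epsilon = 1.96\,\sigma/\sqrt{m}$.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

mu, sigma = 1.0, 1.0   # mean and standard deviation of an Exponential(rate=1) population
m = 50                 # size of each sample
trials = 2000          # number of independent samples drawn

# Half-width of the interval holding 95% of a normal distribution's mass
epsilon = 1.96 * sigma / math.sqrt(m)

# Draw many samples from a clearly non-normal population and record each sample mean
sample_means = [
    sum(random.expovariate(1.0) for _ in range(m)) / m
    for _ in range(trials)
]

# Fraction of sample means that land inside [mu - epsilon, mu + epsilon]
coverage = sum(abs(x - mu) <= epsilon for x in sample_means) / trials
print(f"epsilon = {epsilon:.4f}, fraction inside interval = {coverage:.3f}")
```

Even though the population is strongly skewed, the fraction of sample means inside the interval should come out close to 0.95, which is what the approximation predicts.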

What is the bias-variance trade-off?

Bias: Bias is the error introduced in your model by oversimplifying the machine learning algorithm. It can lead to underfitting. During training, a high-bias model makes simplifying assumptions so that the target function is easier to learn.

Low bias machine learning algorithms: Decision Trees, k-NN and SVM

High bias machine learning algorithms : Linear Regression, Logistic Regression

Variance: Variance is the error introduced in your model by an overly complex machine learning algorithm: the model also learns the noise in the training data set and performs poorly on the test data set. It can lead to high sensitivity and overfitting.

Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens till a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.
 

Bias-variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.

The k-nearest neighbours algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.

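The support vector machine algorithm has low bias and high variance, but the trade-off can be changed through the C parameter, which influences the number of margin violations allowed in the training data. Allowing more violations increases the bias but decreases the variance. Note that the direction depends on how C is defined: where C is a budget for violations, this means increasing C, whereas in libraries such as scikit-learn, where C is a penalty on violations, it means decreasing C.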

There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
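
The trade-off can be made precise with the standard decomposition of expected squared error. It is given here under the usual assumption, not stated in the text above, that $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that the expectation is taken over training sets and noise:

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why reducing one of them by changing model complexity typically comes at the cost of the other.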

List the differences between supervised and unsupervised machine learning.

Supervised learning: Supervised learning is where you have input variables (say, X) and an output variable (say, Y), and you use an algorithm to learn the mapping from the input to the output.

That is, Y = f(X)

Unsupervised Learning: Unsupervised learning is where only the input data (say, X) is present and no corresponding output variable is there.

Here are the differences:

Criteria    | Supervised Learning             | Unsupervised Learning
Input Data  | Input data is labelled.         | Input data is unlabelled.
Data Set    | Uses training data set.         | Uses the input data set.
Use         | Used for prediction.            | Used for analysis.
Enables     | Classification and regression.  | Clustering, density estimation and dimension reduction.

Let's say in order to predict the churn rate of the customers you came up with 2 machine learning approaches: Logistic Regression and Neural Networks

Novelty vs Utility

Let's say that, in order to predict the churn rate of the customers, you came up with 2 machine learning approaches: logistic regression and neural networks.

You know that the logistic regression model will be highly interpretable and you will be able to identify the important features, whereas the neural network model, even though it will give better performance, will be less interpretable since it is a black-box model (it won't explain clearly why it made a certain prediction).

Which modelling technique should you ideally prefer?

Answer: You need to convey to the client which customers are leaving as well as the features that matter most for their departure. Therefore interpretability matters to them. Hence you should preferably go with the logistic regression model.

What is the objective function for measuring the quality of clustering in case of the K-means algorithm with Euclidean distance?

Sum of squared errors (SSE) is used as the objective function for K-means clustering with Euclidean distance. The Euclidean distance is calculated from each data point to its nearest centroid. These distances are squared and summed to obtain the SSE. The aim of the algorithm is to minimize the SSE. Note that SSE considers all the clusters formed using the K-means algorithm.
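
As an illustration, the SSE for a given set of centroids can be computed as below. This is a minimal standard-library sketch; the `sse` helper and the toy points are assumptions made for the example.

```python
def sse(points, centroids):
    """Sum over all points of the squared Euclidean distance to the nearest centroid."""
    total = 0.0
    for p in points:
        # Squared distance from this point to its closest centroid
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total

# Hypothetical toy data: two tight groups, one centroid in the middle of each
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids = [(0, 1), (10, 1)]
print(sse(points, centroids))
```

Every point here is exactly one unit from its nearest centroid, so each contributes 1 to the total.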

Explain the steps of K-means Clustering algorithm. Mention the key steps that need to be followed and how the algorithm works.

The K-means algorithm works as follows:

  • Select initial centroids. The input regarding the number of centroids should be given by the user.
  • Assign the data points to the closest centroid
  • Recalculate the centroid for each cluster and assign the data objects again
  • Follow the same procedure until convergence. Convergence is achieved when there is no more assignment of data objects from one cluster to another, or when there is no change in the centroid of clusters.
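
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function names, passing the initial centroids in explicitly, and keeping the old centroid when a cluster becomes empty are all choices made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, initial_centroids, max_iter=100):
    """Run K-means from the given initial centroids; return (centroids, assignments)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid
        new_assignments = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Convergence: no point changed cluster since the previous iteration
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments

# Hypothetical toy data with two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments = k_means(points, initial_centroids=[(1, 1), (8, 8)])
print(centroids)
print(assignments)
```

The result depends on the initial centroids, so practical implementations usually run the algorithm from several starting points and keep the solution with the lowest SSE.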

How to choose a cutoff point in case of a logistic regression model?

The cutoff point depends on the business objective and needs to be selected according to the goals of your business. For example, let’s consider loan defaults. If the business objective is to reduce the loss, then the specificity needs to be high. If the aim is to increase the profits, then it is an entirely different matter. It may not be the case that profits will increase by avoiding giving loans to all predicted default cases.

Instead, the business may have to disburse loans to predicted default cases that are slightly less risky in order to increase the profits. In such a case, a different cutoff point, one that maximises profit, will be required. In most instances, businesses operate under many constraints, and the cutoff point that satisfies the business objective will not be the same with and without these constraints. The cutoff point needs to be selected considering all these points. If the business context doesn't matter much and you want to create a balanced model, then you can use an ROC curve to see the tradeoff between sensitivity and specificity and accordingly choose a cutoff point where both these values, along with accuracy, are decent.
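
For the balanced case, one common heuristic is to pick the cutoff that maximises Youden's J statistic (sensitivity + specificity - 1). The text above does not name this criterion, so treat it as one reasonable choice among several. The helper names and toy data below are illustrative assumptions.

```python
def confusion_counts(y_true, y_prob, cutoff):
    """Return (tp, fp, tn, fn) when predicting 1 for probabilities >= cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def best_cutoff(y_true, y_prob, cutoffs):
    """Return (cutoff, J) for the candidate cutoff with the highest Youden's J."""
    best = None
    for c in cutoffs:
        tp, fp, tn, fn = confusion_counts(y_true, y_prob, c)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        # Strict comparison keeps the first cutoff when several tie
        if best is None or j > best[1]:
            best = (c, j)
    return best

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
print(best_cutoff(y_true, y_prob, cutoffs=[0.25, 0.35, 0.5, 0.65, 0.75]))
```

When the business objective is profit or loss rather than balance, the same loop can be reused by replacing J with a cost or profit function of the four confusion counts.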

Explain the use of ROC curves and the AUC of an ROC Curve.

An ROC (Receiver Operating Characteristic) curve illustrates the performance of a binary classification model. It is basically a TPR versus FPR (true positive rate versus false positive rate) curve for all the threshold values ranging from 0 to 1.

In an ROC curve, each point in the ROC space will be associated with a different confusion matrix. A diagonal line from the bottom-left to the top-right on the ROC graph represents random guessing.

The Area Under the Curve (AUC) signifies how good the classifier model is. If the value for AUC is high (near 1), then the model is working satisfactorily, whereas if the value is low (around 0.5), then the model is not working properly and just guessing randomly. From the image below, curve C (green) is the best ROC curve among the three and curve A (brown) is the worst ROC curve among the three.

[Figure: ROC curves]
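
To make the construction concrete, the sketch below builds the ROC points by sweeping the threshold over the observed scores and computes the AUC with the trapezoidal rule. It uses only the standard library. The function names and toy data are illustrative assumptions.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) points, starting at (0, 0), as the threshold is lowered."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Each distinct score is used as a threshold, from strictest to most lenient
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve through the given points, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and classifier scores
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
points = roc_points(y_true, y_score)
print(points)
print(auc(points))
```

The AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. In this toy data that is 15 of the 16 pairs.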