
Common Machine Learning Interview Questions


Coding Practice (Linear Regression): Given the data of individuals and their healthcare charges billed by their insurance provider

Problem Statement: Given the data of individuals and their healthcare charges billed by their insurance provider (Click here to download data), following are the columns in the data set:

  • age: Age of the individual
  • sex: Gender of the individual (female, male)
  • bmi: Body mass index (You can read more about BMI here.)
  • children: Number of children covered under health insurance / number of dependants
  • smoker: Indicates whether the person smokes or not
  • region: The individual's residential area in the US
  • charges: Medical costs borne by the health insurance provider for the individual

Here, "charges" will be the dependent variable and all the other variables are independent variables.

Question 1: The following questions require you to do some EDA and data preparation and, finally, perform linear regression on the data set to predict the healthcare charges for the individuals.

Create another feature called BMI_group, which groups people based on their BMI. The groups should be as follows:

  • Underweight: BMI is less than 18.5.
  • Normal: BMI is 18.5 to 24.9.
  • Overweight: BMI is 25 to 29.9.
  • Obese: BMI is 30 or more.

The grouping is based on WHO standards.

The output should show the first five rows of the resulting dataframe.

import pandas as pd 
pd.set_option('display.max_columns', 500)
df=pd.read_csv("")
def bmi_group(val):
    # WHO BMI categories; half-open boundaries so every value falls into exactly one group
    if val < 18.5:
        return "Underweight"
    if val < 25:
        return "Normal"
    if val < 30:
        return "Overweight"
    return "Obese"
    
df["BMI_group"] = df.bmi.apply(bmi_group)
print(df.head())
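An equivalent, more concise grouping can be done with pd.cut; the sketch below is optional and assumes the same dataframe df with a bmi column as above:

import pandas as pd

# Optional alternative: bin BMI with pd.cut using the same WHO cut-offs.
# right=False makes each bin include its left edge, e.g. [18.5, 25) is "Normal".
df["BMI_group"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
print(df.head())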

Question 2: Encode all categorical features so that they can be used in a regression model, i.e. sex, BMI_group, smoker and region should be labelled properly. Use the label encoder for all features.

import pandas as pd 
from sklearn.preprocessing import LabelEncoder
pd.set_option('display.max_columns', 500)
df=pd.read_csv("")

le = LabelEncoder()
# Label-encode each categorical column in place (sex, smoker, region and the BMI_group created earlier)
for col in ["sex", "smoker", "region", "BMI_group"]:
    df[col] = le.fit_transform(df[col].astype(str))
print(df.head())
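Note that label encoding assigns an arbitrary integer order to the categories. That is what the question asks for, but for a linear model it imposes a spurious ordering on multi-class columns such as region. If you want to avoid that, a minimal optional sketch using one-hot encoding (not required here, and assuming df already has the BMI_group column from Question 1) would be:

import pandas as pd

# Optional alternative: one-hot encode the multi-class columns so no artificial
# ordering is introduced; drop_first avoids perfect multicollinearity.
df_encoded = pd.get_dummies(df, columns=["region", "BMI_group"], drop_first=True)
print(df_encoded.head())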

Question 3: As everyone knows, smoking is a major cause of poor health. Here, try to find out whether smoking affects people's health. Print the correlation of the "smoker" column with the "bmi", "age" and "charges" columns on three lines, respectively. Note: You should round off all three values to four decimal places using the round() function.

import pandas as pd 
df=pd.read_csv("")

print(round(df.smoker.corr(df.bmi),4))
print(round(df.smoker.corr(df.age),4))
print(round(df.smoker.corr(df.charges),4))

Question 4: We have now divided the dataset into train and test sets. Since you have already seen that being a smoker and healthcare charges are highly correlated, try to create a linear regression model using only the "smoker" variable as the independent variable and "charges" as the dependent variable.

Note: All operations you performed in the previous questions have already been performed on the dataset here. 

Click here to download train data

You can take any other measures to ensure a better outcome if you want. The dataset has been divided into train and test sets, and both have been loaded in the coding console. You have to write the predictions to the file /code/output/predictions.csv, adding them to the test dataset in a column titled "predicted_charges". Make sure you use this exact column name; otherwise, your score won't be evaluated.

Your model's R-squared will be evaluated on an unseen test dataset. The R-squared of your model should be greater than 0.6. 

import numpy as np
import pandas as pd

# Read training data
train = pd.read_csv("insurance_training.csv")

# Read test data
test = pd.read_csv("insurance_test.csv")

# Fit a linear regression using only the "smoker" feature
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(np.array(train['smoker']).reshape(-1, 1), train['charges'])
y_test_pred = lr.predict(np.array(test['smoker']).reshape(-1, 1))

# Write the output
test["predicted_charges"]=y_test_pred
test.to_csv("/code/output/predictions.csv")
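If you want a rough local check of the R-squared before writing the submission file, one minimal sketch (assuming the same train dataframe loaded above) is to hold out part of the training data:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

train = pd.read_csv("insurance_training.csv")

# Hold out 20% of the training data as a local validation set.
X = np.array(train["smoker"]).reshape(-1, 1)
y = train["charges"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
# An R-squared well above 0.6 here suggests the submission threshold should be met.
print(round(r2_score(y_val, model.predict(X_val)), 4))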

Question 5: You saw that by using only the "smoker" variable, you can easily get an R-squared of 0.66. Now, your task is to perform linear regression using the entire dataset.

Note: All operations you performed in questions 1-3 have already been performed on the dataset here.

You can take any other measures to ensure a better outcome if you want (for example, normalising or standardising values, or adding other columns).

Click here to download train data

You have to write the predictions to the file /code/output/predictions.csv, adding them to the test dataset in a column titled "predicted_charges". Make sure you use this exact column name; otherwise, your score won't be evaluated.

Your model's adjusted R-squared will be evaluated on an unseen test dataset and should be greater than 0.72.

import numpy as np
import pandas as pd

# Read training data
train = pd.read_csv("insurance_training.csv")

# Read test data
test = pd.read_csv("insurance_test.csv")

# Fit a linear regression on all features except the target "charges"; "region" is also excluded
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train.drop(["region", "charges"], axis=1), train['charges'])
y_test_predicted = lr.predict(test.drop("region", axis=1))

# Write the output
#Do not edit the last two lines here
#reload test set before this step if you have made any changes to the test set 
test["predicted_charges"]=y_test_predicted
test.to_csv("/code/output/predictions.csv")
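Since this question is scored on adjusted R-squared, it can be useful to estimate it locally on a hold-out split before submitting. A minimal sketch, assuming the same train dataframe and feature set as above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

train = pd.read_csv("insurance_training.csv")

X = train.drop(["region", "charges"], axis=1)
y = train["charges"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
r2 = r2_score(y_val, model.predict(X_val))

# Adjusted R-squared penalises R-squared for the number of predictors p:
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
n, p = X_val.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))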

What is the Central Limit Theorem and why is it important?

Suppose that we are interested in estimating the average height of all people. Collecting data for every person in the world is impossible. While we can't obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes: what can we say about the average height of the entire population, given a single sample? The Central Limit Theorem addresses this question exactly.

The Central Limit Theorem states that, for a sufficiently large sample size, the distribution of the sample mean is approximately normal, regardless of the distribution of the underlying population. It is an approximation, so any reasoning based on it is not exact; for large enough sample sizes, however, the approximation is good enough for practical predictions. Assume for the moment that we know the variance σ² exactly. In that case, the sample mean X̄_m of m observations is approximately normal with mean µ and variance σ²/m, and we are interested in the interval [µ − ε, µ + ε] that contains 95% of the probability mass of this normal distribution.
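A quick way to see the theorem in action is a small simulation: sample means drawn from a very non-normal population still end up approximately normal, with variance close to σ²/m. The sketch below uses NumPy; the exponential population and the sample sizes are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

# Population: a heavily skewed exponential distribution (mean 1, variance 1).
population = rng.exponential(scale=1.0, size=1_000_000)

# Draw many samples of size m and record each sample mean.
m = 50
sample_means = np.array([rng.choice(population, size=m).mean() for _ in range(5_000)])

# The sample means cluster around the population mean, with variance close to sigma^2 / m.
print("population mean:          ", round(population.mean(), 4))
print("mean of sample means:     ", round(sample_means.mean(), 4))
print("variance of sample means: ", round(sample_means.var(), 4))
print("sigma^2 / m:              ", round(population.var() / m, 4))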

What is the bias-variance trade-off?

Bias: Bias is the error introduced in your model due to oversimplification by the machine learning algorithm. It can lead to underfitting. When you train your model, it makes simplified assumptions to make the target function easier to learn.

Low bias machine learning algorithms: Decision Trees, k-NN and SVM

High bias machine learning algorithms: Linear Regression, Logistic Regression

Variance: Variance is the error introduced in your model due to an overly complex machine learning algorithm; the model learns the noise in the training data set as well and performs badly on the test data set. It can lead to high sensitivity and overfitting.

Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a certain point. As you continue to make your model more complex, you end up overfitting it, and your model will start suffering from high variance.

Bias-variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance in order to achieve good prediction performance.

The k-nearest neighbours algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model.

The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.

There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
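The k-NN point above can be illustrated directly: small k gives a flexible, high-variance model, while large k smooths the predictions and raises the bias. A minimal sketch on a synthetic dataset; the dataset and the values of k are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A large gap between train and test accuracy at small k suggests high variance;
# as k grows, the gap narrows but both scores can drop (higher bias).
for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(knn.score(X_tr, y_tr), 3), round(knn.score(X_te, y_te), 3))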

List the differences between supervised and unsupervised machine learning.

Supervised learning: Supervised learning is learning a model from data in which both the input variables (say, X) and an output variable (say, Y) are present, and an algorithm is used to learn the mapping from the input to the output.

That is, Y = f(X)

Unsupervised Learning: Unsupervised learning is where only the input data (say, X) is present and there is no corresponding output variable.

Here are the differences:

  • Input data: Labelled in supervised learning; unlabelled in unsupervised learning.
  • Data set: Supervised learning uses a training data set; unsupervised learning works on the input data set alone.
  • Use: Supervised learning is used for prediction; unsupervised learning is used for analysis.
  • Enables: Supervised learning enables classification and regression; unsupervised learning enables clustering, density estimation and dimensionality reduction.
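The practical difference also shows up in code: a supervised estimator is fit on both X and y, while an unsupervised one is fit on X alone. A minimal sketch with scikit-learn; the data here is random and purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Supervised: the model is fit on labelled pairs (X, y) and used for prediction.
reg = LinearRegression().fit(X, y)

# Unsupervised: the model is fit on X alone and used to analyse its structure.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(reg.predict(X[:2]), km.labels_[:5])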

Let's say that, in order to predict the churn rate of customers, you came up with two machine learning approaches: logistic regression and neural networks.

You know that the logistic regression model will be highly interpretable and will let you identify the important features, whereas the neural network model, even though it will give better performance, will be less interpretable since it is a black-box model (it won't clearly explain why it made a certain prediction).

Which modelling technique should you ideally prefer?

Answer: You need to convey to the client which customers are leaving, as well as the features that are most important in driving their departure. Therefore, interpretability matters to them. Hence, you should preferably go with the logistic regression model.

Is validation required for clustering? If yes, then why is it required?

Clustering algorithms have a tendency to form clusters even when the data is random. It is therefore essential to validate that a non-random structure is actually present in the data. It is also necessary to validate whether the number of clusters formed is appropriate.

Evaluation of clusters is done with or without an external reference to check how well the clustering fits the data. Evaluation is also done to compare clusterings and decide which one is better.
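One common internal validation measure (one that needs no external reference) is the silhouette coefficient. A minimal sketch comparing different numbers of clusters on a synthetic, illustrative dataset:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Higher silhouette scores indicate better-separated, more compact clusters,
# which helps check for non-random structure and compare candidate values of k.
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))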

What are the disadvantages of agglomerative hierarchical clustering?

Objective function: SSE is the global objective function for K-means. In contrast, hierarchical clustering has no global objective function; it only considers proximity locally before merging two clusters.

Time and space complexity: The time and space complexity of agglomerative clustering (typically O(n² log n) time and O(n²) space for the proximity matrix) is higher than that of K-means clustering, and in some cases it is prohibitive.

Final merging decisions: The merging decisions, once made by the algorithm, cannot be undone at a later point in time. Because of this, a local optimisation criterion cannot be turned into a global criterion. Note that there are some advanced approaches available to overcome this problem.

What are the types of hierarchical clustering?

There are two types of hierarchical clustering. They are agglomerative clustering and divisive clustering.

Agglomerative clustering: In this algorithm, every data object is initially treated as its own cluster. In each step, the two nearest clusters fuse together to form a bigger cluster. Ultimately, all the clusters merge, and a single cluster that encompasses all the data points remains.

Divisive clustering: This is the opposite of agglomerative clustering. Here, all the data objects start in a single, all-inclusive cluster. In each step, the algorithm splits a cluster, and this repeats until only single data points remain, which are then treated as singleton clusters.
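A minimal sketch of agglomerative clustering with SciPy on synthetic, illustrative data; linkage() records the bottom-up merge sequence and fcluster() cuts the resulting hierarchy into a flat clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# linkage() performs agglomerative clustering, merging the two closest clusters at each step;
# fcluster() then cuts the hierarchy to obtain a flat clustering (2 clusters here).
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)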

Is K-means clustering suitable for all shapes and sizes of clusters?

K-means is not suitable for all shapes, sizes, and densities of clusters. If the natural clusters of a dataset are vastly different from a spherical shape, K-means will have great difficulty detecting them. K-means will also struggle if the sizes and densities of the clusters differ by a large margin. This is mostly due to the use of SSE as the objective function, which is better suited to compact, spherical clusters and handles non-spherical shapes, varied cluster sizes, and varied densities poorly.
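This limitation is easy to reproduce: on two interleaving half-moons, K-means cuts straight through the natural clusters, while a density-based method such as DBSCAN can recover them. A minimal sketch; the dataset and parameters are illustrative assumptions:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-means assumes roughly spherical clusters, so it splits the moons incorrectly.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points by density and can recover the non-spherical shapes.
db_labels = DBSCAN(eps=0.2).fit_predict(X)

print("k-means ARI:", round(adjusted_rand_score(y_true, km_labels), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y_true, db_labels), 3))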

What is the objective function for measuring the quality of clustering in case of the K-means algorithm with Euclidean distance?

Sum of squared errors (SSE) is used as the objective function for K-means clustering with Euclidean distance. The Euclidean distance is calculated from each data point to its nearest centroid. These distances are squared and summed to obtain the SSE. The aim of the algorithm is to minimize the SSE. Note that SSE considers all the clusters formed using the K-means algorithm.
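In scikit-learn this quantity is exposed as the inertia_ attribute of a fitted KMeans model; the short sketch below also recomputes the SSE by hand to make the definition concrete (synthetic, illustrative data):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# SSE: sum over all points of the squared Euclidean distance to their nearest (assigned) centroid.
sse = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)
print(round(sse, 2), round(km.inertia_, 2))  # the two values should match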

