Pandas Interview Questions


Python Pandas: Bucketing in Python and calculating the mean for each bucket

07/11/2021 - 11:54 by devraj

Code:

from pathlib import Path

import pandas as pd

files = Path(".").glob("1*.csv")
my_df_list = []

# Bin edges and matching labels for the seven buckets
bins = [650, 1000, 1350, 1700, 2050, 2400, 2750, 3000]
columns = ("650-1000", "1000-1350", "1350-1700", "1700-2050",
           "2050-2400", "2400-2750", "2750-3000")

for file in files:
    file_id = file.stem  # file name without the .csv extension
    df = pd.read_csv(file)
    # Bucket column A into the bins and take the mean of column B per bucket
    a_bins = df.B.groupby(pd.cut(df['A'], bins))
    a_bins = a_bins.agg(['mean']).reset_index(drop=True)
    # Transpose so the buckets become columns and this file becomes one row
    a_bins_df = a_bins.T.copy()
    a_bins_df.columns = columns
    a_bins_df.index = [file_id]
    my_df_list.append(a_bins_df)

df_total = pd.concat(my_df_list, axis=0)
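
To see what the loop body does, here is a minimal, hypothetical demo of pd.cut bucketing: values of column A are assigned to the bins, and the mean of column B is computed per bucket.

import pandas as pd

demo = pd.DataFrame({'A': [700, 900, 1200, 2900], 'B': [10, 20, 30, 40]})
demo_bins = [650, 1000, 1350, 1700, 2050, 2400, 2750, 3000]
# Buckets with no rows show NaN; (650, 1000] averages 10 and 20 to 15.0
print(demo.B.groupby(pd.cut(demo['A'], demo_bins)).mean())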

For the given dataframe you have to clean the "Installs" column and print its correlation with the other numeric columns of the dataframe (print df.corr()).

03/23/2021 - 01:45 by devraj

You have to do the following:

1. Remove characters like ',' and '+' from the number of installs.
2. Delete rows where the Installs column has irrelevant strings like 'Free'.
3. Convert the column to int type.

You can download the dataframe from here

Sample Output:

           Rating  Installs
Rating    1.000000  0.051355
Installs  0.051355  1.000000

Cleaning Correlation DataFrame

import pandas as pd

df = pd.read_csv("filename.csv")

# Strip thousands separators and trailing '+' signs from the install counts.
# regex=False treats '+' literally instead of as a regex quantifier.
df.Installs = df.Installs.str.replace(',', '', regex=False)
df.Installs = df.Installs.str.replace('+', '', regex=False)

# Drop rows where Installs holds an irrelevant string such as 'Free'
df = df[df.Installs != 'Free']

# Convert the cleaned column to int
df.Installs = df.Installs.astype(int)

# numeric_only=True restricts the correlation to numeric columns (required in pandas >= 2.0)
print(df.corr(numeric_only=True))
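
An alternative sketch of the same cleanup: coerce anything that is not a number to NaN with pd.to_numeric and drop it, which also handles unexpected strings other than 'Free'.

import pandas as pd

df = pd.read_csv("filename.csv")
cleaned = df.Installs.str.replace(',', '', regex=False).str.replace('+', '', regex=False)
# errors='coerce' turns strings like 'Free' into NaN instead of raising
df.Installs = pd.to_numeric(cleaned, errors='coerce')
df = df.dropna(subset=['Installs'])
df.Installs = df.Installs.astype(int)
print(df.corr(numeric_only=True))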

Given two pandas series find the position of elements in series2 in series1. You can assume that all elements in series2 will be present in series1.

03/23/2021 - 01:43 by devraj

Given two pandas series, find the position of the elements of series2 in series1. You can assume that all elements in series2 will be present in series1. The input will contain two lines with series1 and series2 respectively. The output should be a list of indexes indicating the positions of the elements of series2 in series1.

Note: In the output list, the indexes should be in ascending order.

Sample Input:

[1,2,3,4,5,6,7]
[1,3,7]

Sample Output: [0,2,6]

Code:

import ast
import sys

import pandas as pd

# The input contains two lines, one list literal per series
lines = sys.stdin.read().splitlines()
series1 = pd.Series(ast.literal_eval(lines[0]))
series2 = pd.Series(ast.literal_eval(lines[1]))

# Find the position of each element of series2 in series1
out_list = [pd.Index(series1).get_loc(num) for num in series2]

# Sort so the printed indexes are in ascending order
print(sorted(map(int, out_list)))
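
A shorter sketch, replacing the last two lines of the solution above: since every element of series2 is guaranteed to be in series1, a boolean mask with isin returns the matching positions already in ascending order.

out_list = series1[series1.isin(series2)].index.tolist()
print(out_list)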

Write a program to select all columns of a dataframe except the ones specified.

03/23/2021 - 01:38 by devraj

Write a program to select all columns of a dataframe except the ones specified. The input will contain a list of columns that you should skip. You should print the first five rows of the dataframe as output, with the columns sorted alphabetically.

Click here to download csv

Sample Input: ['PassengerId', 'Pclass', 'Name', 'Sex','Embarked']

Sample Output:

    Age Cabin     Fare  Parch  SibSp   Ticket
0  34.5   NaN   7.8292      0      0   330911
1  47.0   NaN   7.0000      0      1   363272
2  62.0   NaN   9.6875      0      0   240276
3  27.0   NaN   8.6625      0      0   315154
4  22.0   NaN  12.2875      1      1  3101298

Pandas Select All Columns Except the Ones Specified

import ast
import sys

import pandas as pd

df = pd.read_csv("test.csv")

# Read the list of columns to skip from stdin
to_omit = ast.literal_eval(sys.stdin.read())

# Keep only the columns that are not in the skip list
df = df[df.columns[~df.columns.isin(to_omit)]]

# Print the first five rows with the columns sorted alphabetically
print(df.loc[:, sorted(df.columns)].head())
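
An equivalent sketch of the filtering step using drop, run in place of the boolean mask above rather than after it. Note the design difference: drop raises a KeyError if a listed column is missing from the csv, while the isin mask silently ignores it.

df = pd.read_csv("test.csv")
to_omit = ast.literal_eval(sys.stdin.read())
df = df.drop(columns=to_omit)
print(df[sorted(df.columns)].head())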

You have been given a dataset called 'heart'. Here are its first few rows:

02/07/2021 - 09:08 by devraj

You have been given a dataset called 'heart'. Here are its first few rows:

(image: the first few rows of the heart dataframe, with all column names in lowercase)

As you can see, the column names are all lowercase. Your task is to capitalize the first character of every column present in it such that your dataframe looks like the following:

(image: the same dataframe with each column name capitalized)

You just need to write the code for capitalizing; you need not print anything as the print statement has already been provided in the stub.

Pandas Capitalize Column Names

# Importing the pandas package
import pandas as pd

# Reading the dataframe
df = pd.read_csv('https://media-doselect.s3.amazonaws.com/generic/XWvQjYY4LZWdxLvPWOj2pPwn/heart.csv')

# Capitalise the first character of each column name using ASCII arithmetic:
# lowercase letters start at 97 ('a') and uppercase at 65 ('A'), so subtracting
# 32 converts a lowercase letter to uppercase. str.replace with count=1 swaps
# only the first occurrence of x[0], which is always position 0. This assumes
# every column name starts with a lowercase letter.
df.columns = df.columns.map(lambda x: x.replace(x[0], chr(ord(x[0]) - 32), 1))

# Printing the final columns. Do not edit this part.
print(df.columns)
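
A simpler route, as a sketch, is pandas' built-in string accessor. Note that str.capitalize also lowercases everything after the first character, which is harmless here because the heart column names are single lowercase words:

df.columns = df.columns.str.capitalize()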

You've been given the pima_indian_diabetes dataset. Here are its first few rows:

02/07/2021 - 09:06 by devraj

You've been given the pima_indian_diabetes dataset. Here are its first few rows:

(image: the first few rows of the pima_indian_diabetes dataframe)

Notice the BMI and Diabetes columns. You need to find the BMI which is most likely to indicate diabetes. First round the BMI to an integer value, then return the BMI with the highest diabetes risk, based on the 0/1 Diabetes values provided in the dataframe.

Expected Output: Print a single integer denoting the value of the required BMI. (Output only an integer; for example, if the result is 30.0, convert it to int and print 30.)

Code

# Import the Pandas package
import pandas as pd 

# Reading the input dataframe
pima = pd.read_csv('https://media-doselect.s3.amazonaws.com/generic/pLZK3n22ezVwAG2XOYW5qEx7V/pima_indian_diabetes.csv')

# Round the BMI column
pima['BMI'] = round(pima['BMI'])

# Group the pima dataframe using the Diabetes column as values and the BMI
# column as the index. Aggregating with 'sum' counts the diabetics per BMI.
pima_g = pima.pivot_table(values = ['Diabetes'], index = 'BMI', aggfunc = 'sum')

# Sort the dataframe by the diabetes column
pima_g.sort_values(by = 'Diabetes', inplace = True, ascending = False)

# Since BMI is present in the index of the grouped dataframe, just return the first
# index
print(int(pima_g.index[0]))
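
The same answer as a one-expression sketch, using groupby and idxmax on the rounded BMI instead of a pivot table and a sort:

print(int(pima.groupby('BMI')['Diabetes'].sum().idxmax()))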

You have a dataset called 'Gun ViolenceSA' which has the instances of gun violence in San Antonio, Texas.

02/07/2021 - 08:09 by devraj

Here are the first few rows of the dataset:

(image: the first few rows of the Gun ViolenceSA dataset)

This time your aim is to find the safest and least safe locations in San Antonio, Texas. The safeness of a location is measured as 1/(Kills + 0.8 x Injuries + 1). So if a particular location has, say, 3 kills and 5 injuries, its safeness index would be 1/(3 + 0.8 x 5 + 1) = 0.125

You're given the latitude and longitude of each instance of gun violence. Round the latitude and longitude to a single decimal place first, then find the safest location. Print the safest location first and, on the next line, the least safe location.

Print these locations in a dictionary format. For example:
{'Latitude': 30.1, 'Longitude': -94.5}
{'Latitude': 72.3, 'Longitude': -96.4}

import pandas as pd 

# Reading the gun violence dataset
gun = pd.read_csv('https://query.data.world/s/3iebgxp57luarsikz5wtcwahpjwed7')

# Round Latitude and Longitude to one decimal place 
gun['Latitude'] = round(gun['Latitude'], 1)
gun['Longitude'] = round(gun['Longitude'], 1)

# Group the gun dataframe using '# Killed' and '# Injured' as values, and
# 'Latitude' and 'Longitude' as the index
gun_g = gun.pivot_table(values = ['# Killed', '# Injured'], 
                        index = ['Latitude', 'Longitude'], 
                        aggfunc = 'sum')

# Add a column 'Safeness Index', calculated using the formula given above
gun_g['Safeness Index'] = 1/(gun_g['# Killed'] + 0.8 * gun_g['# Injured'] + 1)

# Sort the values in the dataframe using the Safeness Index
gun_g.sort_values(by = ['Safeness Index'], inplace = True, ascending = False)

# Reindex the gun_g dataframe so you can access the Latitude and Longitude easily
# as columns
gun_g.reset_index(inplace = True)

# Store the safest and least safe locations in two variables as dictionaries,
# casting to plain floats so the dictionaries print cleanly
safest = {'Latitude': float(gun_g.loc[0, 'Latitude']),
          'Longitude': float(gun_g.loc[0, 'Longitude'])}
least_safe = {'Latitude': float(gun_g['Latitude'].iloc[-1]),
              'Longitude': float(gun_g['Longitude'].iloc[-1])}

# Print the safest location and then the least safe one, in dictionary format
print(safest)
print(least_safe)
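
As an alternative sketch that skips the sorting and index reset entirely, idxmax and idxmin on the pivot table's (Latitude, Longitude) MultiIndex return each answer directly. This must run right after the Safeness Index column is added, while gun_g still has its MultiIndex:

lat, lon = gun_g['Safeness Index'].idxmax()
print({'Latitude': float(lat), 'Longitude': float(lon)})
lat, lon = gun_g['Safeness Index'].idxmin()
print({'Latitude': float(lat), 'Longitude': float(lon)})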

You have a dataset called 'Gun ViolenceSA': Print a single value denoting the month in which most kills + injuries happened.

02/07/2021 - 08:05 by devraj

Here are the first few rows of the dataset.

(image: the first few rows of the Gun ViolenceSA dataset)

As you can see, there are 10 columns in the dataset.

Notice the Incident Date column. Using this column, find out the month in which the most kills + injuries happened.

Print a single value denoting the month in which most kills + injuries happened.

import pandas as pd 

# Read the gun violence dataset
gun = pd.read_csv('https://query.data.world/s/3iebgxp57luarsikz5wtcwahpjwed7')

# Extract the month from the Incident Date column
gun['Incident Month'] = gun['Incident Date'].apply(lambda x: x.split('-')[1])

# Group the function by Incident Month
gun_g = gun.pivot_table(values = ['# Killed', '# Injured'], index = ['Incident Month'], aggfunc = 'sum')

# Add a column 'Total' which will contain the sum of # Injured and # Killed 
gun_g['Total'] = gun_g['# Injured'] + gun_g['# Killed']

# Now, since the incident month will be in the index, you need to remove it from
# the index of the dataframe so that you can access it as a column easily
gun_g.reset_index(inplace=True)

# Extract the month in which most kills + injuries happened and store it in a
# variable month
month = gun_g[gun_g['Total'] == max(gun_g['Total'])]

# Print the final value of month using indexing since month will have all the 
# columns for the single row that it has
print(month['Incident Month'].iloc[0])


# Alternatively, instead of the last two lines of code, you could have sorted
# the values in descending order and taken the top row:

# gun_g.sort_values(by='Total', inplace = True, ascending = False)

# Note iloc, not loc: after sorting, the label 0 is no longer the first row
# print(gun_g.iloc[0]['Incident Month'])
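
A compact sketch of the same computation, parsing the dates once with pd.to_datetime (assuming the Incident Date format is one pandas can infer) and picking the top month with idxmax:

import pandas as pd

gun = pd.read_csv('https://query.data.world/s/3iebgxp57luarsikz5wtcwahpjwed7')
gun['Incident Month'] = pd.to_datetime(gun['Incident Date']).dt.month
totals = gun.groupby('Incident Month')[['# Killed', '# Injured']].sum().sum(axis=1)
print(totals.idxmax())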

Python Pandas: Air Pollution in Seoul Exercise

12/07/2020 - 00:50 by devraj

The air quality data for this segment has been divided into three different csv files.

Measurement_info.csv has hour-by-hour data about the concentration of pollutants in the air and the status of the instruments. Measurement_item_info.csv has the data for the measured items and their concentration levels. Measurement_station_info.csv has the data for the measuring stations.

You can download the dataset from kaggle website: https://www.kaggle.com/bappekim/air-pollution-in-seoul

Q1: Read in all the three datasets and then print the first five rows.

import pandas as pd

data = pd.read_csv("Measurement_info.csv", header = 0)
item = pd.read_csv("Measurement_item_info.csv", header = 0)
station = pd.read_csv("Measurement_station_info.csv", header = 0)
data.head(5)
item.head(5)
station.head(5)

Q2: Create a new DataFrame which has information about the item code and item name.

sub_item = item[['Item code', 'Item name']]
sub_item

Q3: In the `data` DataFrame add in a column displaying the names of the items. 

data_i = data.merge(sub_item, on = "Item code", how = "left")
data_i.head()

Q4: In the data_i DataFrame add in a column displaying the names of the stations.

# sub_station maps station codes to station names. In the Kaggle csv the name
# column is called 'Station name(district)'; adjust it if your copy differs.
sub_station = station[['Station code', 'Station name(district)']]
data_s = data_i.merge(sub_station, on = "Station code", how = "left")
data_s.head()

Q5: In the data_s DataFrame, drop the columns Station code and Item code, as these columns have now become redundant.

data = data_s.drop(['Station code', 'Item code'], axis = 1)
data.head()

Q6: Given below are the meanings of the values in the Instrument status.

  • 0: Normal
  • 1: Need for calibration
  • 2: Abnormal
  • 4: Power cut off
  • 8: Under repair
  • 9: Abnormal data

Using the information given above, add a column to the data DataFrame giving the status of the instruments. Then drop the Instrument status column.

# Build a small lookup DataFrame mapping each status code to its meaning
status_dict = {"Instrument status": [0, 1, 2, 4, 8, 9],
               "Status": ["Normal", "Need for calibration", "Abnormal",
                          "Power cut off", "Under repair", "Abnormal data"]}
dictdf = pd.DataFrame(status_dict)
dictdf

# Merge the lookup onto data, then drop the now-redundant numeric column
data = data.merge(dictdf, on = "Instrument status", how = "left")
data.head()
data = data.drop(["Instrument status"], axis = 1)
data.head()
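
A shorter route for the same mapping, sketched with Series.map and a plain dict, run instead of the merge above:

status_map = {0: "Normal", 1: "Need for calibration", 2: "Abnormal",
              4: "Power cut off", 8: "Under repair", 9: "Abnormal data"}
data["Status"] = data["Instrument status"].map(status_map)
data = data.drop(["Instrument status"], axis = 1)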

Q7: Extract the time series data, that is the year, month, day and hour, from the Measurement date column. Once all the data is extracted, drop the Measurement date column.

This operation might take some time as the dataset we are working with is very large.

data['Year'] = pd.DatetimeIndex(data['Measurement date']).year
data['Month'] = pd.DatetimeIndex(data['Measurement date']).month
data['Date'] = pd.DatetimeIndex(data['Measurement date']).day
data['Hour'] = pd.DatetimeIndex(data['Measurement date']).hour
data.head()
data = data.drop(["Measurement date"], axis =1)
data.head()
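
Parsing the Measurement date column four separate times is what makes this slow. A sketch that parses it once with pd.to_datetime and reuses the .dt accessor, run in place of the block above, should be considerably faster:

ts = pd.to_datetime(data['Measurement date'])
data['Year'] = ts.dt.year
data['Month'] = ts.dt.month
data['Date'] = ts.dt.day
data['Hour'] = ts.dt.hour
data = data.drop(["Measurement date"], axis = 1)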