The air quality data for this segment has been divided into three different csv files.
info.csv has the data hour by hour data about the concentration of polutants in the air and the status of the intruments. item_info has the data for items and levels of concetration. station_info has the data for measutring stations.
You can download the dataset from kaggle website: https://www.kaggle.com/bappekim/air-pollution-in-seoul
Q1: Read in all the three datasets and then print the first five rows.
import pandas as pd
data = pd.read_csv("Measurement_info.csv", header = 0)
item = pd.read_csv("Measurement_item_info.csv", header = 0)
station = pd.read_csv("Measurement_station_info.csv", header = 0)
data.head(5)
item.head(5)
station.head(5)
Q2: Create a new Dataframe whcih has information about the item code and item name
sub_item = item[['Item code', 'Item name']]
sub_item
Q3: In the `data` DataFrame add in a column displaying the names of the items.
data_i = data.merge(sub_item, on = "Item code", how = "left")
data_i.head()
Q4: In the data_i
DataFrame add in a column displaying the names of the stations.
data_s = data_i.merge(sub_station, on = "Station code", how = "left")
data_s.head()
Q5: In the data_s
DataFrame drop the columns Station code
and Item code
. As these columns have not become redundant.
data = data_s.drop(['Station code', 'Item code'], axis = 1)
data.head()
Q6: Given below are the meanings of the values in the Instrument status
.
- 0: Normal
- 1: Need for calibration
- 2: Abnormal
- 4: Power cut off
- 8: Under repair
- 9: Abnormal data
Using the information given above, add a column in the data
DataFrame to give the status of the intsruments. Then drop the Instrument status
column.
status_dict = {"Instrument status": [0,1,2,4,8,9],
"Status": ["Normal", "Need for calibration", "Abnormal", "Power cut off", "Under repair", "Abnormal data"]}
status_dict
dictdf = pd.DataFrame(status_dict)
dictdf
data = data.merge(dictdf, on = "Instrument status", how = "left")
data.head()
data = data.drop(["Instrument status"], axis = 1)
data.head()
Q7: Extract the time series data, that is year, month, date and hour form the Measurement date
column. Once all the data is extrcted drop the Measurement date
column.
This operation might take some time as the dataset we are working with is very large.
data['Year'] = pd.DatetimeIndex(data['Measurement date']).year
data['Month'] = pd.DatetimeIndex(data['Measurement date']).month
data['Date'] = pd.DatetimeIndex(data['Measurement date']).day
data['Hour'] = pd.DatetimeIndex(data['Measurement date']).hour
data.head()
data = data.drop(["Measurement date"], axis =1)
data.head()