pandas is a fast, powerful, flexible, and easy-to-use data analysis library that is built on top of NumPy and provides features NumPy lacks. The name pandas is derived from "panel data", an econometrics term for multidimensional structured datasets. While pandas adopts many coding idioms from NumPy’s array-based computing style, the biggest difference is that pandas is designed for working with tabular, heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

The key to learning pandas is to understand its data structures. A data structure is a collection of data values that also defines the relationships among those values and the operations that can be performed on them.

The most widely used pandas data structures are the Series and the DataFrame. Simply put, a Series is similar to a single column of data, while a DataFrame is similar to a sheet with rows and columns. Likewise, a Panel can hold many DataFrames.

pandas has historically provided three main data structures (the Panel has since been removed from the library):

  • Series — 1D
  • DataFrame — 2D
  • Panel — 3D

Pandas Series

Think of a Series as a single column in an Excel sheet. You can also think of it as a 1D NumPy array whose elements carry index labels. A Series is composed of two arrays associated with each other. The main array (the array of values) holds the one-dimensional data, and each of its elements is associated with a label held in the other array (the array of labels), called the index. To see the two arrays individually, use the index and values attributes of the Series. Because a Series is one-dimensional, it has a single axis (dimension): the index. The values of the index, such as 0, 1, 2, 3, are called axis labels.
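As a minimal sketch of the two arrays behind a Series, the snippet below builds a small Series from made-up values (the numbers are purely illustrative) and prints its index and values attributes.

```python
import pandas as pd

# Hypothetical values; no index is given, so pandas assigns the labels 0, 1, 2, 3.
s = pd.Series([10, 20, 30, 40])

print(s.index)   # the array of labels (the index)
print(s.values)  # the array of values, as a NumPy array
```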

The basic syntax to create a pandas Series is as follows:

newSeries = pd.Series(data, index)

A series consists of two components.

  • One-dimensional data (Values)
  • Index

Introduction

To create a Series, call the Series() class constructor and pass it the data to be included as an argument. Here, data can be one of the following:

  • A one-dimensional ndarray
  • A Python list
  • A Python dictionary
  • A scalar value

If an index is not specified, the default index [0,… n-1] will be created, where n is the length of the data. A series can be created from a variety of sources as shown in the following subsections.

Using a one-dimensional ndarray

A Series can be built directly from a one-dimensional NumPy array, for example an array holding the first five odd numbers.

If you do not specify an index when defining the series, pandas assigns integer labels counting up from 0. In this case, the labels coincide with the positions of the elements in the series. To use meaningful labels instead, pass them through the index parameter when creating the series. The labels go in a list of the same length as the input array, an_array.
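The original screenshot is not available, so here is one way this could look. The name an_array comes from the text; the labels "a" to "e" are an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# The first five odd numbers as a one-dimensional ndarray.
an_array = np.arange(1, 10, 2)

# No index given: pandas assigns the default labels 0..4.
default_series = pd.Series(an_array)

# Illustrative labels; the list must have the same length as an_array.
labeled_series = pd.Series(an_array, index=["a", "b", "c", "d", "e"])

print(default_series)
print(labeled_series)
```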

Using a Python list

To create a series using a Python list, you can just pass a list to the data parameter of the Series() class constructor.
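A short sketch of the list case, using a made-up list of color names:

```python
import pandas as pd

# Hypothetical list used only for illustration.
a_list = ["red", "green", "blue"]

# The list supplies the values; the index defaults to 0, 1, 2.
series_from_list = pd.Series(a_list)

print(series_from_list)
```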


Using a scalar value

We can also create a Series from a scalar value. If you do not specify the index argument, the Series holds a single element labeled 0. If you do specify an index, the value is repeated for every label in it.
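Both scalar cases can be sketched as follows; the value 7 and the labels are arbitrary choices for illustration.

```python
import pandas as pd

# Without an index, the result is a one-element Series labeled 0.
single = pd.Series(7)

# With an index, the scalar is repeated once per label.
repeated = pd.Series(7, index=["x", "y", "z"])

print(single)
print(repeated)
```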

Using a Python dictionary

To create a series using a Python dictionary, you can just pass a dictionary to the data parameter of the Series() class constructor. This time, the arrays of the index and values are filled with the corresponding keys and values of the dictionary.
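A minimal sketch of the dictionary case, with a hypothetical mapping of names to ages:

```python
import pandas as pd

# Hypothetical dictionary: keys become index labels, values become the data.
a_dict = {"Ahmad": 20, "Ali": 21, "Ismail": 19}

series_from_dict = pd.Series(a_dict)

print(series_from_dict.index)   # labels taken from the dictionary keys
print(series_from_dict.values)  # data taken from the dictionary values
```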


Pandas DataFrame

The DataFrame is the most commonly used and most important data structure in pandas. Think of a DataFrame as an Excel sheet.

A DataFrame is a two-dimensional data structure composed of rows and columns, much like a simple spreadsheet or a SQL table. Each column of a DataFrame is a pandas Series. The columns must all have the same length, but they can have different data types: float, int, bool, and so on. DataFrames are both value-mutable and size-mutable. (The pandas documentation describes a Series, by contrast, as value-mutable only: its values can be changed, but its length cannot.) This lets us perform operations that alter the values held in the DataFrame or that add columns to it or delete columns from it.
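The points above can be sketched on a small made-up table: each column is a Series with its own data type, values can be changed in place, and columns can be added or removed. The column names and values here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with a different type in each column.
df = pd.DataFrame({
    "Name": ["Ahmad", "Ali", "Ismail"],
    "Age": [20, 21, 19],
    "Height": [5.1, 5.6, 6.1],
})

print(type(df["Age"]))  # each column is a pandas Series
print(df.dtypes)        # one data type per column

# Value mutation: change a single cell.
df.loc[0, "Age"] = 22

# Size mutation: add a column, then delete another.
df["Adult"] = df["Age"] >= 18
del df["Height"]

print(df)
```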

A DataFrame consists of three components.

  • Two-dimensional data (Values)
  • Row index
  • Column index

The main ways to create a DataFrame are:

  • Reading a CSV/Excel File
  • Python Dictionary
  • ndarray

import pandas as pd

# Create a dictionary of equal-length lists, one list per column
data = {"Name": ["Ahmad", "Ali", "Ismail", "John"],
        "Age": [20, 21, 19, 17],
        "Height": [5.1, 5.6, 6.1, 5.7]}

# Convert the dictionary into a DataFrame
df1 = pd.DataFrame(data)
df1

DataFrame creation: Introduction

The DataFrame() class constructor accepts many different types of arguments:

  • A two-dimensional ndarray
  • A dictionary of dictionaries
  • A dictionary of lists
  • A dictionary of series

Row labels and column labels can be specified along with the data. If they are not specified, pandas generates them from the input: dictionary keys are used as labels where available, and otherwise default integer labels starting at 0 are assigned. A DataFrame can be created from a variety of sources, as discussed in the following subsections.

DataFrame creation: Using a two-dimensional ndarray
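The original screenshot is not available; a minimal sketch of this case, with made-up numbers and labels, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 array used only for illustration.
array_2d = np.array([[1, 2, 3],
                     [4, 5, 6]])

# Without labels, rows and columns are numbered from 0.
df_default = pd.DataFrame(array_2d)

# With explicit row and column labels.
df_labeled = pd.DataFrame(array_2d,
                          index=["row1", "row2"],
                          columns=["a", "b", "c"])

print(df_default)
print(df_labeled)
```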

If you want to see the individual components that make up the DataFrame, you can call the values, index, and columns attributes of the DataFrame.
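As a small illustration, assuming a DataFrame built from a made-up 2 x 2 array, the three components can be inspected like this:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with explicit row and column labels.
df = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                  index=["r1", "r2"],
                  columns=["a", "b"])

print(df.values)   # the two-dimensional data as a NumPy array
print(df.index)    # the row index
print(df.columns)  # the column index
```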


DataFrame creation: Using a dictionary of dictionaries


Column names are created from the keys of the main dictionary, and the row index is created from the keys of the sub dictionaries.
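A sketch of the mapping described above, using a hypothetical nested dictionary: the outer keys become column names and the inner keys become row labels.

```python
import pandas as pd

# Hypothetical nested dictionary: outer keys -> columns, inner keys -> rows.
nested = {
    "Age": {"Ahmad": 20, "Ali": 21},
    "Height": {"Ahmad": 5.1, "Ali": 5.6},
}

df = pd.DataFrame(nested)

print(df)
```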

DataFrame creation: Using a dictionary of lists
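In this case each dictionary key becomes a column name and each list supplies that column's values, so all lists must have the same length. The sketch below uses made-up data and, as an assumption for illustration, passes custom row labels through the index parameter.

```python
import pandas as pd

# Hypothetical data: one equal-length list per column.
data = {
    "Name": ["Ahmad", "Ali", "Ismail"],
    "Age": [20, 21, 19],
}

# Illustrative row labels; omit index to get the default 0, 1, 2.
df = pd.DataFrame(data, index=["s1", "s2", "s3"])

print(df)
```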

If you want to see the individual components that make up the DataFrame, you can call the values, index, and columns attributes of the DataFrame.


DataFrame creation: Using a dictionary of series
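When the dictionary values are Series, pandas aligns them on their index labels: the row index of the result is the union of the individual indexes, and missing entries are filled with NaN. A minimal sketch with made-up data:

```python
import pandas as pd

# Two hypothetical Series whose indexes only partly overlap.
ages = pd.Series([20, 21], index=["Ahmad", "Ali"])
heights = pd.Series([5.6, 6.1], index=["Ali", "Ismail"])

# Dictionary keys become column names; rows are aligned by label.
df = pd.DataFrame({"Age": ages, "Height": heights})

print(df)  # labels missing from one Series show up as NaN in that column
```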


The Pandas Panel

A Panel is a 3D array. It is used less widely than Series or DataFrames, and its 3D nature makes it harder to display on screen or visualize. It is generally used for 3D time-series data. The names of the three axes are as follows:

  • items: This is axis 0. Each item corresponds to a DataFrame structure.
  • major_axis: This is axis 1. Each item corresponds to the rows of the DataFrame structure.
  • minor_axis: This is axis 2. Each item corresponds to the columns of each DataFrame structure.

As with Series and DataFrames, there are different ways to create Panel objects.

Panel creation: Using a 3D NumPy array

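In older pandas versions, a Panel could be created by passing a 3D NumPy array to the Panel() constructor. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so it is not available in current releases. The recommended way to represent this kind of 3-dimensional data is to use multi-indexing in DataFrames instead of Panels (the xarray package is another option). The DataFrame.to_panel() method, which converted a multi-indexed DataFrame to a Panel in older versions, was removed together with Panel.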
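As a sketch of the multi-indexing approach, the snippet below stores a made-up 3D NumPy array in a DataFrame whose rows carry a two-level MultiIndex. The level names "item" and "major" and all labels are assumptions that mirror the Panel axis names for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3D data: 2 items, each with 3 rows and 4 columns.
data = np.arange(24).reshape(2, 3, 4)

items = ["item1", "item2"]
major = ["r1", "r2", "r3"]
minor = ["a", "b", "c", "d"]

# One row per (item, major) pair; the columns play the role of the minor axis.
index = pd.MultiIndex.from_product([items, major], names=["item", "major"])
df = pd.DataFrame(data.reshape(-1, 4), index=index, columns=minor)

# Selecting on the first level returns the 2D table for one item.
print(df.loc["item2"])
```

Selecting with a full (item, major) tuple, such as df.loc[("item2", "r1"), "a"], picks out a single cell of the original 3D array.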

Submitted by shiksha.dahiya on February 9, 2021

Shiksha is working as a Data Scientist at iVagus. She has expertise in Data Science and Machine Learning.
