These operations are some of the most fundamental (and important) things you can do with dataframes.

Good, let's get started!

In data science we often extract and scrape data from multiple sources. While analyzing that data, we regularly need to compare data frames, for example to check what differs between them or what they have in common. The tools for this are the set operations: union, intersection, and difference. In this article, we will see what these set operations are and how they come into play when data frames are combined. We will first create two data frames and then perform operations between them.
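
To make the three set operations concrete before moving on, here is a minimal sketch on two pandas indexes. The labels are made up for illustration and the variable names are assumptions, not part of the original article.

```python
import pandas as pd

# Hypothetical row labels for two data frames (illustrative only)
left = pd.Index(['Fi', 'Ma', 'Th', 'Fo'])
right = pd.Index(['Fi', 'Se', 'Th', 'Fo'])

union = left.union(right)                # labels present in either index
intersection = left.intersection(right)  # labels present in both indexes
difference = left.difference(right)      # labels present only in the left index

print(list(union))
print(list(intersection))
print(list(difference))
```

The union is the one that matters most for arithmetic between data frames, because it decides which labels appear in the result.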

Same index, obvious behavior

If two (or more) dataframes share the same index (both the row and the column index, in the case of dataframes), operations follow the obvious element-wise behavior you would expect if you've used NumPy in the past:

import numpy as np
import pandas as pd
df_1 = pd.DataFrame(np.arange(1,17).reshape(4,4),
                    index= ['Fi', 'Se', 'Th', 'Fo'],
                    columns = ['a', 'b', 'c', 'd'])

df_2 = pd.DataFrame(np.arange(1,17).reshape(4,4) * 10,
                    index= ['Fi', 'Se', 'Th', 'Fo'],
                    columns = ['a', 'b', 'c', 'd'])
df_1
  a b c d
Fi 1 2 3 4
Se 5 6 7 8
Th 9 10 11 12
Fo 13 14 15 16
df_2
  a b c d
Fi 10 20 30 40
Se 50 60 70 80
Th 90 100 110 120
Fo 130 140 150 160
# Addition of two dataframes with the same index
df_1 + df_2
  a b c d
Fi 11 22 33 44
Se 55 66 77 88
Th 99 110 121 132
Fo 143 154 165 176
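
The other arithmetic operators behave the same way when the indexes match. The sketch below rebuilds the two dataframes and compares the result with plain NumPy arithmetic. The helper name `labels` and the result names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Shared row and column labels (same as in the example above)
labels = dict(index=['Fi', 'Se', 'Th', 'Fo'], columns=['a', 'b', 'c', 'd'])
df_1 = pd.DataFrame(np.arange(1, 17).reshape(4, 4), **labels)
df_2 = pd.DataFrame(np.arange(1, 17).reshape(4, 4) * 10, **labels)

difference = df_2 - df_1   # element-wise subtraction
product = df_1 * df_2      # element-wise multiplication
ratio = df_2 / df_1        # element-wise division (gives floats)

# With identical indexes the values match plain NumPy arithmetic
same_as_numpy = np.array_equal(product.to_numpy(),
                               df_1.to_numpy() * df_2.to_numpy())
print(product)
print(same_as_numpy)
```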

It's also possible to perform operations between dataframes and series that share an index. The default behavior is to align the index of the series with the column index of the dataframe and perform the operations between each row and the series.

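# Sum a series and a dataframe
# (ser_1 was not defined in the listing; these values are inferred from the output below)
ser_1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
ser_1 + df_1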
  a b c d
Fi 2 4 6 8
Se 6 8 10 12
Th 10 12 14 16
Fo 14 16 18 20
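
If you want the series to line up with the row index instead, one option is the `add` method with `axis=0`. This is a minimal sketch. The series `row_ser` and its values are hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4),
                  index=['Fi', 'Se', 'Th', 'Fo'],
                  columns=['a', 'b', 'c', 'd'])

# Hypothetical series indexed by the row labels of the dataframe
row_ser = pd.Series([100, 200, 300, 400], index=['Fi', 'Se', 'Th', 'Fo'])

# axis=0 aligns the series with the row index,
# so the series is added to every column
by_rows = df.add(row_ser, axis=0)
print(by_rows)
```

Using the `+` operator here would try to match the row labels of the series against the column labels of the dataframe, which is why the method form is used.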

Different index, outer joins

If you perform operations between dataframes with different indexes, the result will be a new data structure whose index is the union of the original indexes. Labels that are missing from either dataframe produce NaN in the result. If you have worked with databases before, this is similar to an outer join on the indexes of the original dataframes. This is much easier to see with an example:

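import numpy as np
import pandas as pd

# In this case, the union is [a, b, c, d, e] in the columns and [Fi, Fo, Ma, Se, Th] in the rows.
# Only the labels present in both dataframes ([a, b, c] and [Fi, Fo, Th]) get non-NaN values.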

df_1 = pd.DataFrame(np.arange(1,17).reshape(4,4),
                    index= ['Fi', 'Ma', 'Th', 'Fo'],
                    columns = ['a', 'b', 'c', 'd'])

df_2 = pd.DataFrame(np.arange(1,17).reshape(4,4) * 10,
                    index= ['Fi', 'Se', 'Th', 'Fo'],
                    columns = ['a', 'b', 'c', 'e'])

df_1 + df_2
        a      b      c   d   e
Fi   11.0   22.0   33.0 NaN NaN
Fo  143.0  154.0  165.0 NaN NaN
Ma    NaN    NaN    NaN NaN NaN
Se    NaN    NaN    NaN NaN NaN
Th   99.0  110.0  121.0 NaN NaN
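
If the NaN values are not what you want, one possible approach is the `add` method with `fill_value`, which substitutes a default for a value that is missing from only one of the two dataframes. This is a sketch of that option, not something covered by the original example. The name `filled` is an assumption.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.arange(1, 17).reshape(4, 4),
                    index=['Fi', 'Ma', 'Th', 'Fo'],
                    columns=['a', 'b', 'c', 'd'])
df_2 = pd.DataFrame(np.arange(1, 17).reshape(4, 4) * 10,
                    index=['Fi', 'Se', 'Th', 'Fo'],
                    columns=['a', 'b', 'c', 'e'])

# Treat a value that is missing from one dataframe as 0 before adding.
# Cells that are missing from both dataframes are still NaN.
filled = df_1.add(df_2, fill_value=0)
print(filled)
```

The result still uses the union of the labels, so the outer join behavior does not change. Only the cells where exactly one side had a value are filled in.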

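In the case of operations between dataframes and series with different indexes, the result's columns are the union of the dataframe's column index and the series' index, while the rows keep the dataframe's row index. Judging by the output, ser_2 is a series indexed by the labels c through g: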

df_1 + ser_2
     a   b     c     d   e   f   g
Fi NaN NaN   8.0   9.0 NaN NaN NaN
Ma NaN NaN  12.0  13.0 NaN NaN NaN
Th NaN NaN  16.0  17.0 NaN NaN NaN
Fo NaN NaN  20.0  21.0 NaN NaN NaN
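
The listing never defines `ser_2`, so the sketch below uses an assumed definition. Only the labels c through g and the value 5 at c and d can be read off the output. The values at e, f, and g are placeholders. It also shows one way to keep only the columns where the labels matched.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.arange(1, 17).reshape(4, 4),
                    index=['Fi', 'Ma', 'Th', 'Fo'],
                    columns=['a', 'b', 'c', 'd'])

# Assumed definition: the values at e, f and g are placeholders
ser_2 = pd.Series([5, 5, 5, 5, 5], index=['c', 'd', 'e', 'f', 'g'])

# Columns become the union of df_1's columns and ser_2's index
result = df_1 + ser_2

# Drop the columns that are entirely NaN, keeping only matched labels
matched = result.dropna(axis=1, how='all')
print(result)
print(matched)
```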

Conclusion 

Now you know the fundamental operations. The toughest thing about arithmetic operations on pandas data structures is understanding how they work when the indexes are not the same. As long as you remember that the alignment behaves like an outer join, everything will be clear and easy.

In this article, we discussed the basic operations performed between data frames to find what is similar, what differs, and what is common between them. We first looked at the union, followed by the intersection and difference operations. These are very useful operations for manipulating your data frames and understanding your data.

Thanks for reading!

Submitted by shiksha.dahiya on February 12, 2021

Shiksha is working as a Data Scientist at iVagus. She has expertise in Data Science and Machine Learning.
