Skip to main content

The versatile merge() method was employed to execute the intersection operation. This method can be used to combine or join DataFrames in different ways. However, when used without the specification of any parameter in an operation involving two compatible DataFrames, it yields their intersection:

sql_and_python = P.merge(S)

The intersection is opposite of union where we only keep the common between the two data frames. Consider we have to pick those students that are enrolled for both ML and NLP courses or students that are there in ML and CV. Refer to the below to code to understand how to compute the intersection between two data frames. 

all_students = pd.concat([ML_df,NLP_df,CV_df], ignore_index = True)

all_students = all_students.drop_duplicates()

print(all_students)

Common_ML_CV = ML ∩ CV

Common_ML_CV = ML_df.merge(CV_df)

print(Common_ML_CV)

Join Pandas DataFrames using Merge

If so, I’ll show you how to join Pandas DataFrames using Merge.

In particular, I’ll review the steps to create the following joins:

  • Inner Join
  • Left Join
  • Right Join
  • Outer Join

But before we dive into few examples, here is a template that you may refer to when joining DataFrames:

pd.merge(df1, df2, how='type of join', on=['df1 key', 'df2 key'])

Steps to Join Pandas DataFrames using Merge

Step 1: Create the DataFrames to be joined

Let’s say that you have two datasets that you’d like to join:

(1) The clients dataset:

Client_ID Client_Name
111 Jon Snow
222 Maria Green
333 Bill Jones
444 Rick Lee
555 Pamela Lopez

(2) The countries dataset:

Client_ID Client_Country
111 UK
222 Canada
333 Spain
444 China
777 Brazil

The goal is to join the above two datasets using the common Client_ID key.

To start, you may create two DataFrames, where:

  • df1 will capture the clients data
  • df2 will capture the countries data

Here is the code that you can use to create the DataFrames:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients, columns= ['Client_ID','Client_Name'])
print(df1)

countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries, columns= ['Client_ID', 'Client_Country'])
print(df2)

Run the code in Python, and you’ll get the following two DataFrames:

Two datasets in Python

Step 2: Merge the pandas DataFrames using an inner join

You may add this syntax in order to merge the two DataFrames using an innerjoin:

Inner_Join = pd.merge(df1, df2, how='inner', on=['Client_ID', 'Client_ID'])

You may notice that the how is equal to ‘inner’ to represent an inner join.

Here is the complete code that you may apply in Python:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients, columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries, columns= ['Client_ID', 'Client_Country'])


Inner_Join = pd.merge(df1, df2, how='inner', on=['Client_ID', 'Client_ID'])
print(Inner_Join)

When performing an inner join, only the records that have the same key in both of the datasets will be captured. In our example, there are only 4 records in each of the datasets that contain the same Client_ID key. Only those 4 records will be displayed:

Inner join pandas dataframes

  • The record with the ‘555’ Client_ID, which is present in the clients dataset, but not in the countries dataset, will not be displayed, when applying the inner join
  • Similarly, the record with the ‘777’ Client_ID, which is present in the countries dataset, but not in the clients dataset, will not be displayed

Applying a Left Join

Now let’s see what will happen when you apply a left join. Simply add the following syntax to the code:

Left_Join = pd.merge(df1, df2, how='left', on=['Client_ID', 'Client_ID'])

You’ll notice that the ‘how’ is now set to ‘left’ to represent the left join.

The complete Python code is therefore:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients, columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries, columns= ['Client_ID', 'Client_Country'])


Left_Join = pd.merge(df1, df2, how='left', on=['Client_ID', 'Client_ID'])
print(Left_Join)

With a left join, all the records from the first dataset will be displayed, irrespective of whether the keys in the first dataset can be found in the second dataset.

However, for the second dataset, only the records with the keys in the second dataset that can be found in the first dataset will be displayed.

How to Join Pandas DataFrames using Merge

  • The record with the ‘555’ Client_ID, which is present in the clients dataset, but not in the countries dataset, will be displayed, when using a left join
  • However, the record with the ‘777’ Client_ID, which is present in the countries dataset, but not in the clients dataset, will not be displayed

Also note that since there is no 555 key in the second dataset, there will be a NaN value under the Client_Country column from the second dataset.

Applying a Right Join

You may now apply this syntax to get the right join (where the ‘how’ is now set to ‘right’):

Right_Join = pd.merge(df1, df2, how='right', on=['Client_ID', 'Client_ID'])

So the complete Python code is:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients, columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries, columns= ['Client_ID', 'Client_Country'])


Right_Join = pd.merge(df1, df2, how='right', on=['Client_ID', 'Client_ID'])
print(Right_Join)

For a right join, all the records from the second dataset will be displayed. However, only the records with the keys in the first dataset that can be found in the second dataset will be displayed.

Join Pandas DataFrames using Merge

  • The record with the ‘555’ Client_ID from the first dataset will not be displayed when applying a right join
  • While, the record with the ‘777’ Client_ID from the second dataset will be displayed

Applying an Outer Join

For the final case, you can apply an outer join by setting the ‘how’ to ‘outer’:

Outer_Join = pd.merge(df1, df2, how='outer', on=['Client_ID', 'Client_ID'])

Here is the Python code:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients, columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries, columns= ['Client_ID', 'Client_Country'])


Outer_Join = pd.merge(df1, df2, how='outer', on=['Client_ID', 'Client_ID'])
print(Outer_Join)

With an outer join, all the records from both of the datasets will be displayed, irrespective of whether a key is missing in one of the datasets:

Merge pandas dataframes

As you may observe, both the ‘555’ Client_ID from the first dataset, as well as the ‘777’ Client_ID from the second dataset, will be displayed.

Tags
Submitted by shiksha.dahiya on February 15, 2021

Shiksha is working as a Data Scientist at iVagus. She has expertise in Data Science and Machine Learning.

About

Elix is a premium wordpress theme for portfolio, freelancer, design agencies and a wide range of other design institutions.