
To perform the union operation, apply two methods in sequence: concat() followed by drop_duplicates(). The first concatenates the data, which means placing the rows of one DataFrame below the rows of another DataFrame. The second removes any rows that appear more than once in the combined result.
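As a minimal sketch of that two-step union, assume two small hypothetical DataFrames (`left` and `right`, which are not part of the customer data used later) that share one row:

```python
import pandas as pd

# Hypothetical sample data: the ('Maria', 'Canada') row appears in both frames
left = pd.DataFrame({'name': ['Jon', 'Maria'], 'country': ['US', 'Canada']})
right = pd.DataFrame({'name': ['Maria', 'Lili'], 'country': ['Canada', 'China']})

# Step 1: stack the rows of `right` below the rows of `left`
stacked = pd.concat([left, right], ignore_index=True)

# Step 2: drop repeated rows so that each distinct row appears once
union_result = stacked.drop_duplicates().reset_index(drop=True)

print(union_result)
```

Here `stacked` holds four rows and `union_result` holds three, because the shared row is kept only once.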

Union Pandas DataFrames using Concat

You can union Pandas DataFrames using concat:

pd.concat([df1, df2])

You may concatenate additional DataFrames by adding them within the brackets.
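For instance, a sketch with three hypothetical single-row DataFrames (`df_a`, `df_b` and `df_c` are illustrative names only) might look like this:

```python
import pandas as pd

# Three hypothetical DataFrames that share the same column
df_a = pd.DataFrame({'country': ['US']})
df_b = pd.DataFrame({'country': ['Canada']})
df_c = pd.DataFrame({'country': ['Italy']})

# Any number of DataFrames can go in the list passed to pd.concat
combined = pd.concat([df_a, df_b, df_c])

print(combined)
```

The rows appear in the same order as the DataFrames in the list.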

Steps to Union Pandas DataFrames using Concat

Step 1: Create the first DataFrame

For example, let’s say that you have the following data about your customers:

clientFirstName clientLastName country
Jon Smith US
Maria Lam Canada
Bruce Jones Italy
Lili Chang China

You can then create a DataFrame to capture the above data in Python:

import pandas as pd

clients1 = {'clientFirstName': ['Jon','Maria','Bruce','Lili'],
            'clientLastName': ['Smith','Lam','Jones','Chang'],
            'country': ['US','Canada','Italy','China']
           }

df1 = pd.DataFrame(clients1, columns=['clientFirstName', 'clientLastName', 'country'])

print(df1)

Run the code in Python and you'll get the four customer rows above, with a default index running from 0 to 3.

Step 2: Create the second DataFrame

Now suppose that you received additional data about new customers:

clientFirstName clientLastName country
Bill Jackson UK
Jack Green Germany
Elizabeth Gross Brazil
Jenny Sing Japan

You can then create the second DataFrame as follows:

import pandas as pd

clients2 = {'clientFirstName': ['Bill','Jack','Elizabeth','Jenny'],
            'clientLastName': ['Jackson','Green','Gross','Sing'],
            'country': ['UK','Germany','Brazil','Japan']
           }

df2 = pd.DataFrame(clients2, columns=['clientFirstName', 'clientLastName', 'country'])

print(df2)

Run the code, and you'll see the four new customer rows, again indexed from 0 to 3.

Your goal is to union those two DataFrames. You can use Pandas concat to accomplish this goal.

Step 3: Union Pandas DataFrames using Concat

Finally, to union the two Pandas DataFrames together, you can apply the generic syntax that you saw at the beginning of this guide:

pd.concat([df1, df2])

And here is the complete Python code to union Pandas DataFrames using concat:

import pandas as pd

clients1 = {'clientFirstName': ['Jon','Maria','Bruce','Lili'],
            'clientLastName': ['Smith','Lam','Jones','Chang'],
            'country': ['US','Canada','Italy','China']
           }

df1 = pd.DataFrame(clients1, columns=['clientFirstName', 'clientLastName', 'country'])


clients2 = {'clientFirstName': ['Bill','Jack','Elizabeth','Jenny'],
            'clientLastName': ['Jackson','Green','Gross','Sing'],
            'country': ['UK','Germany','Brazil','Japan']
           }

df2 = pd.DataFrame(clients2, columns=['clientFirstName', 'clientLastName', 'country'])

union = pd.concat([df1, df2])
print(union)

Once you run the code, you'll get the concatenated DataFrame, with the four rows of df1 followed by the four rows of df2.

Notice that the index values repeat (from 0 to 3 for the first DataFrame, and then from 0 to 3 again for the second DataFrame).
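The repeated labels can be checked directly. The sketch below uses simplified one-column versions of df1 and df2 (an assumption made to keep the example short) and inspects the index of the result:

```python
import pandas as pd

# Simplified stand-ins for df1 and df2, keeping only the country column
df1 = pd.DataFrame({'country': ['US', 'Canada', 'Italy', 'China']})
df2 = pd.DataFrame({'country': ['UK', 'Germany', 'Brazil', 'Japan']})

union = pd.concat([df1, df2])

# Each frame contributes its own 0..3 labels, so the labels repeat
print(union.index.tolist())
print(union.index.is_unique)

# A label-based lookup therefore returns one row from each original frame
print(union.loc[0])
```

Because label 0 exists twice, `union.loc[0]` returns two rows rather than one, which is a common reason to renumber the index.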

You may then choose to assign the index values incrementally once you have concatenated the two DataFrames.

To do so, simply set ignore_index=True within the pd.concat call:

import pandas as pd

clients1 = {'clientFirstName': ['Jon','Maria','Bruce','Lili'],
            'clientLastName': ['Smith','Lam','Jones','Chang'],
            'country': ['US','Canada','Italy','China']
           }

df1 = pd.DataFrame(clients1, columns=['clientFirstName', 'clientLastName', 'country'])


clients2 = {'clientFirstName': ['Bill','Jack','Elizabeth','Jenny'],
            'clientLastName': ['Jackson','Green','Gross','Sing'],
            'country': ['UK','Germany','Brazil','Japan']
           }

df2 = pd.DataFrame(clients2, columns=['clientFirstName', 'clientLastName', 'country'])

union = pd.concat([df1, df2], ignore_index=True)
print(union)

The result is a single DataFrame of eight rows whose index now runs from 0 to 7.
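Another way to renumber the rows is to concatenate first and then call reset_index(drop=True). The sketch below shows this on simplified two-row frames (the data is hypothetical):

```python
import pandas as pd

# Hypothetical two-row frames with the same column
df1 = pd.DataFrame({'country': ['US', 'Canada']})
df2 = pd.DataFrame({'country': ['UK', 'Germany']})

# Concatenate first; the index repeats as 0, 1, 0, 1
union = pd.concat([df1, df2])

# Replace the index with 0..n-1; drop=True discards the old labels
# instead of keeping them as a new column
union = union.reset_index(drop=True)

print(union)
```

This is useful when the DataFrame has already been concatenated, or has been filtered afterwards, and the index needs to be renumbered at that later point.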

Remove Duplicates from Pandas DataFrame

After concatenating, the combined DataFrame may contain duplicate rows. If so, you can apply the following syntax in Python to remove duplicates from your DataFrame:

pd.DataFrame.drop_duplicates(df)
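The same operation is more commonly written as a method call on the DataFrame itself. A small sketch, using a hypothetical three-row DataFrame, suggests the two spellings give the same result and leave the original DataFrame unchanged:

```python
import pandas as pd

# Hypothetical data with one repeated value
df = pd.DataFrame({'Color': ['Green', 'Green', 'Blue']})

# Calling the function through the class and calling it as a method
# on the instance are the same operation
via_class = pd.DataFrame.drop_duplicates(df)
via_method = df.drop_duplicates()

print(via_method)
print(via_class.equals(via_method))

# drop_duplicates returns a new DataFrame; df itself still has 3 rows
print(len(df))
```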

Steps to Remove Duplicates from Pandas DataFrame

Step 1: Gather the data that contains duplicates

Firstly, you’ll need to gather the data that contains the duplicates.

For example, let’s say that you have the following data about boxes, where each box may have a different color or shape:

Color Shape
Green Rectangle
Green Rectangle
Green Square
Blue Rectangle
Blue Square
Red Square
Red Square
Red Rectangle

As you can see, values repeat under both columns, and two rows (Green Rectangle and Red Square) appear twice in full.

Before you remove those duplicates, you'll need to create a Pandas DataFrame to capture that data in Python.

Step 2: Create Pandas DataFrame

Next, create a Pandas DataFrame using this code:

import pandas as pd

boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = pd.DataFrame(boxes, columns=['Color', 'Shape'])

print(df)

Once you run the code in Python, you'll get the same eight rows as in step 1, indexed from 0 to 7.
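Before removing anything, it can help to see which rows pandas treats as duplicates. A minimal sketch, using the same boxes data and the duplicated() method (which the original steps do not use):

```python
import pandas as pd

boxes = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
         'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square',
                   'Square', 'Square', 'Rectangle']
        }
df = pd.DataFrame(boxes)

# True marks a row that repeats an earlier row across all columns
is_repeat = df.duplicated()

# Show only the rows that drop_duplicates() would remove
print(df[is_repeat])
```

With this data, the rows at index 1 (Green Rectangle) and index 6 (Red Square) are flagged.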

Step 3: Remove duplicates from Pandas DataFrame

To remove duplicates from the DataFrame, you may use the syntax that you saw earlier:

pd.DataFrame.drop_duplicates(df)

Let’s say that you want to remove the duplicates across the two columns of Color and Shape.

In that case, apply the code below in order to remove those duplicates:

import pandas as pd

boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = pd.DataFrame(boxes, columns=['Color', 'Shape'])

df_duplicates_removed = pd.DataFrame.drop_duplicates(df)
print(df_duplicates_removed)

Only the six distinct combinations of Color and Shape remain. The second Green Rectangle row and the second Red Square row are dropped.
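By default, drop_duplicates() keeps the first occurrence of each repeated row. As a sketch of the alternatives, its keep parameter (not covered in the steps above) also accepts 'last' and False:

```python
import pandas as pd

boxes = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
         'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square',
                   'Square', 'Square', 'Rectangle']
        }
df = pd.DataFrame(boxes)

# Default behaviour: keep the first copy of each repeated row
keep_first = df.drop_duplicates()

# Keep the last copy of each repeated row instead
keep_last = df.drop_duplicates(keep='last')

# Drop every row that has a duplicate, keeping no copy at all
keep_none = df.drop_duplicates(keep=False)

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```

The original index labels are preserved, so printing them shows which rows survived under each option.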

But what if you want to remove the duplicates under a single column?

For example, what if you want to remove the duplicates under the Color column only?

In that case, you should just keep the Color column when assigning the columns to the DataFrame:

df = pd.DataFrame(boxes, columns=['Color'])

So the full Python code to remove the duplicates under the Color column would look like this:

import pandas as pd

boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = pd.DataFrame(boxes, columns=['Color'])

df_duplicates_removed = pd.DataFrame.drop_duplicates(df)
print(df_duplicates_removed)

Only the three distinct values under the Color column remain: Green, Blue and Red.
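Keeping only the Color column discards the Shape data. If you would rather keep all columns and retain one row per color, a possible alternative (an addition to the original steps) is the subset parameter of drop_duplicates():

```python
import pandas as pd

boxes = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
         'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square',
                   'Square', 'Square', 'Rectangle']
        }
df = pd.DataFrame(boxes)

# Compare rows on the Color column only; the first row seen for each
# color is kept, together with its Shape value
first_per_color = df.drop_duplicates(subset=['Color'])

print(first_per_color)
```

Note that the Shape value kept for each color is simply the one from the first matching row, so this approach suits cases where any representative row will do.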

Submitted by shiksha.dahiya on February 15, 2021

Shiksha is working as a Data Scientist at iVagus. She has expertise in Data Science and Machine Learning.
