Skip to main content

Set operations are the operation that are used for comparison purposes. Consider we have two data frames having 2 columns each containing students with their ID who are enrolled in different courses that are ML, NLP, and COMPUTER VISION. Now we want to look for all the students or students who are in ML but now in NLP and combinations like these. Refer to the below tables for all the three courses. 

Now we will create these three tables using pandas. Use the below code to do the same. First, we will import the pandas’ package, and then we will create these tables. 

import pandas as pd
ML_df = pd.DataFrame ({"Name":["Rohit","Arpit","Chiranjeev","Piyush"],
        "Student_ID":["101","102","103","104"]})
NLP_df = pd.DataFrame ({"Name":["Rohit","Aman","Ayush","Piyush"],
        "Student_ID":["101","105","106","104"]})
CV_df = pd.DataFrame ({"Name":["Rohit","Arpit","Pawan","Ayush"],
        "Student_ID":["101","102","107","106"]})

In certain practical situations, it might be interesting to treat a pandas DataFrame as a mathematical set. In this case, each row of the DataFrame can be considered as an element or member of the set.

The question then becomes: Why would it be useful? Here’s the answer. As we know, data science problems typically require the analysis of data obtained from multiple sources. At some point in the analysis of data from a study, you may face the problem of having to compare the contents of two or more DataFrames to determine if they have elements (rows) in common. In this tutorial you will learn that set operations are one of the best and most natural techniques you can choose to perform such a task.

Set Operations in Pandas

Although pandas does not offer specific methods for performing set operations, we can easily mimic them using the below methods:

  • Union: concat() + drop_duplicates()
  • Intersection: merge()
  • Difference: isin() + Boolean indexing

In the following program, we demonstrate how to do it. A detailed explanation is given after the code listing.

#Set Operations applied to pandas DataFrames
import pandas as pd

#(1)-Defining the DataFrames

# 1.1 Python students
P = pd.DataFrame ({"name":["Elizabeth","Darcy"],
        "email":["bennet@xyz.com","darcy@acmecorpus.com"]})

# 1.2 SQL students
S = pd.DataFrame ({"name":["Bingley","Elizabeth"],
        "email": ["bingley@xyz.com","bennet@xyz.com"]})

#(2)-Performing set operations

# 2.1 Union
all_students = pd.concat([P, S], ignore_index = True)
all_students = all_students.drop_duplicates()

# 2.2 Intersection
sql_and_python = P.merge(S)

# 2.3 Difference
python_only = P[P.email.isin(S.email) == False]
sql_only = S[S.email.isin(P.email) == False]

#(3)-Showing Results
print ('------------------------------')
print ('all students (UNION):')
print (all_students)

print ('------------------------------')
print ('Students enrolled in both courses (INTERSECTION):')
print (sql_and_python)

print ('------------------------------')
print ('Python students who are not taking SQL (DIFFERENCE):')
print (python_only)

print ('------------------------------')
print ('SQL students who are not taking Python (DIFFERENCE):')
print (sql_only)

Results are shown below:

------------------------------
all students (UNION):
        name                 email
0  Elizabeth        bennet@xyz.com
1      Darcy  darcy@acmecorpus.com
2    Bingley       bingley@xyz.com
------------------------------
Students enrolled in both courses (INTERSECTION):
        name           email
0  Elizabeth  bennet@xyz.com
------------------------------
Python students who are not taking SQL (DIFFERENCE):
    name                 email
1  Darcy  darcy@acmecorpus.com
------------------------------
SQL students who are not taking Python (DIFFERENCE):
      name            email
0  Bingley  bingley@xyz.com

A Practical Example

Suppose you have two DataFrames, named P and S, which respectively contain the names and emails from students enrolled in two different courses, SQL and Python.

Figure

Figure

Consider that you need answers to the following questions:

  1. How many different students are in the two DataFrames?
  2. Are there students enrolled in both courses, Python and SQL?
  3. Which students are taking the Python course, but not the SQL course (and vice versa)?

Answers can be obtained in a straightforward way if you treat the DataFrames as two distinct mathematical sets. Then, all you will have to do is to apply the basic unionintersection, and difference set operations:

P ∪ S, the union of P and S, is the set of elements that are in P or S or both. Note that the element (student) Elizabeth appears only once in the result.

Figure

P ∩ S, the intersection of P and S, is the set of elements that are in both P and S. Now, only Elizabeth appears, because she is the only in both sets.

Figure

P − S, the difference of P and S, is the set that includes all elements that are in P but not in S:

Figure

Note that S − P is different from P − S:

Figure

It is important to remark that the DataFrames on which any of these three operations are applied must have identical attributes (as shown in the example).

Tags
Submitted by shiksha.dahiya on February 15, 2021

Shiksha is working as a Data Scientist at iVagus. She has expertise in Data Science and Machine Learning.

About

Elix is a premium wordpress theme for portfolio, freelancer, design agencies and a wide range of other design institutions.