How can you handle categorical variables present in the data set?

Data sets often contain categorical variables that are potentially good predictors of the response variable, so handling them correctly is crucial.

One way to handle a categorical variable with just two levels is binary mapping, wherein one of the levels is mapped to 0 and the other to 1.
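
As a minimal sketch of binary mapping in plain Python, assume a hypothetical two-level variable 'Smoker' with the levels 'yes' and 'no' (the variable name and values are illustrative and not part of any particular data set):

```python
# Hypothetical two-level variable; the name and values are illustrative.
smoker = ["yes", "no", "no", "yes"]

# Map one level to 0 and the other to 1.
level_to_code = {"no": 0, "yes": 1}
smoker_encoded = [level_to_code[value] for value in smoker]

print(smoker_encoded)  # [1, 0, 0, 1]
```

Which level gets 0 and which gets 1 is arbitrary; it only changes how the resulting coefficient is read.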

Another way of handling categorical variables with a few levels is to perform dummy encoding. The key idea behind dummy encoding is that a variable with, say, 'N' levels can be represented by 'N-1' new indicator variables. To see why, start with one indicator variable per level. For a variable, say 'Relationship', with three levels, namely 'Single', 'In a Relationship', and 'Married', the full dummy table would look like the following:

Relationship Status   Single   In a Relationship   Married
Single                1        0                   0
In a Relationship     0        1                   0
Married               0        0                   1

But you can clearly see that there is no need to define three different dummy variables. If you drop one of them, say 'Single', the remaining two still identify all three levels.

Let's drop the dummy variable 'Single' from the columns and see what the table looks like:

Relationship Status   In a Relationship   Married
Single                0                   0
In a Relationship     1                   0
Married               0                   1

If both dummy variables, namely 'In a Relationship' and 'Married', are equal to 0, the person is single. If 'In a Relationship' is 1 and 'Married' is 0, the person is in a relationship, and finally, if 'In a Relationship' is 0 and 'Married' is 1, the person is married.
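
The same encoding can be sketched in a few lines of plain Python. The helper `dummy_encode` below is a hypothetical name written for this illustration, not a library function; it treats the first level in `levels` as the baseline and gives it no column of its own:

```python
def dummy_encode(values, levels, drop_first=True):
    """Return one dict of 0/1 indicators per value.

    `levels` fixes the column order. With drop_first=True the first
    level becomes the baseline and is represented by all zeros.
    """
    kept = levels[1:] if drop_first else levels
    return [{level: int(value == level) for level in kept} for value in values]


levels = ["Single", "In a Relationship", "Married"]
values = ["Single", "In a Relationship", "Married"]

rows = dummy_encode(values, levels)
for value, row in zip(values, rows):
    print(value, row)
```

In practice a library routine is usually used instead; for example, pandas provides `get_dummies` with a `drop_first` option, which drops the first level in its category order rather than a level you name.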

Now, creating dummy variables is useful when the number of levels in a categorical variable is small, but if a categorical variable has a hundred levels, creating 99 new variables quickly becomes impractical. In such cases, grouping the levels can be useful. For example, for the variable 'Cities in India', you can group the cities in ways such as the following:

  • Keep the 'n' largest cities, group the rest.
  • Geographical hierarchy:
    • City -> District -> State -> Zone
  • Group cities with similar values for the outcome variable.
  • Cluster cities with similar values for the predictor variables.
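
Two of the grouping strategies above can be sketched in plain Python. This is only an illustration under stated assumptions: 'largest' is interpreted here as 'most frequent in the data', the helper `keep_top_n` and the label 'Other' are hypothetical names, and the city-to-state lookup is hand-written for the example rather than taken from reference data.

```python
from collections import Counter


def keep_top_n(values, n, other_label="Other"):
    """Keep the n most frequent levels; replace all other levels with other_label."""
    top = {level for level, _ in Counter(values).most_common(n)}
    return [value if value in top else other_label for value in values]


# Illustrative data: city names as they might appear in a 'City' column.
cities = ["Mumbai", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Kochi"]

# Strategy 1: keep the 2 most frequent cities and group the rest.
top_grouped = keep_top_n(cities, n=2)

# Strategy 2: roll each city up one level of the hierarchy (City -> State).
city_to_state = {
    "Mumbai": "Maharashtra",
    "Pune": "Maharashtra",
    "Delhi": "Delhi",
    "Kochi": "Kerala",
}
state_grouped = [city_to_state[city] for city in cities]

print(top_grouped)
print(state_grouped)
```

Either grouped variable has far fewer levels than the original and can then be dummy encoded as described earlier.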