Python: one hot encoding pandas

Introduction to One-Hot Encoding

One-hot encoding converts categorical data into a numerical format that machine learning algorithms can work with. It creates a new binary column for each unique value of a categorical feature.

Each new column holds 0 or 1, where 1 marks the presence of the value and 0 its absence. In Python, one-hot encoding is commonly performed with the pd.get_dummies() function or with scikit-learn's OneHotEncoder class.

What is One-Hot Encoding?

One-hot encoding is the process of transforming categorical data into a numerical representation in which each category is represented by a separate binary column.

For example, suppose a categorical column named colors has three unique values ('red', 'green', 'blue'). One-hot encoding creates three new binary columns: colors_red, colors_green and colors_blue. In each row, the column that matches the value holds 1 and the other two hold 0.

colors   colors_red   colors_green   colors_blue
red      1            0              0
green    0            1              0
blue     0            0              1
green    0            1              0
green    0            1              0
red      1            0              0

Why is One-Hot Encoding Important in Machine Learning?

Most machine learning algorithms need a numerical representation of categorical columns in order to work. One-hot encoding provides one.

One-hot encoding prevents algorithms from assuming an ordinal relationship between categories. Such an assumption can reduce the accuracy of the model, because it learns a pattern that is not actually in the data.

One-Hot Encoding can often lead to better model performance when dealing with categorical features that are not inherently ordered.

Understanding Categorical Data

Categorical data represents categories or groups. It comes in two types: nominal and ordinal. Nominal data has no inherent order (for example, colors), whereas ordinal data has one (for example, small, medium, large).
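
As a minimal sketch of the difference, the example below marks one column as nominal and one as ordinal using pandas' Categorical type. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data: 'color' is nominal, 'size' is ordinal.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# Nominal: the categories have no order, so the categorical stays unordered.
df["color"] = pd.Categorical(df["color"])

# Ordinal: state the order explicitly, from smallest to largest.
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)

print(df["color"].cat.ordered)  # False
print(df["size"].cat.ordered)   # True

# Ordered categories support comparisons, and their integer codes follow the stated order.
print((df["size"] > "small").tolist())  # [False, True, True, False]
print(df["size"].cat.codes.tolist())    # [0, 2, 1, 0]
```

The integer codes are meaningful for size because the order is real. For color, any integer assignment would be arbitrary.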

When to Use One-Hot Encoding

One-hot encoding is generally suitable for nominal categorical data, that is, any categorical data that has no natural order.

Implementing One-Hot Encoding in pandas Using get_dummies()

Installing pandas:
pip install pandas
Importing Necessary Libraries:
import pandas as pd
Basic Syntax of get_dummies():
pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
               columns=None, sparse=False, drop_first=False, dtype=None)
Loading the dataset
import pandas as pd
data = {'color': ['red', 'green', 'blue', 'red']}
df = pd.DataFrame(data)

# Perform one-hot encoding on the DataFrame (df).
# dtype=int yields 0/1 columns; pandas 2.0+ returns True/False by default.
df_encoded = pd.get_dummies(df, dtype=int)
df_encoded.head()
Output
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1

Customizing One-Hot Encoding:

Prefixes:

Use the prefix and prefix_sep parameters of pd.get_dummies() to customize the names of the generated columns. The line below assumes that data is a DataFrame with a Fruit column:

encoded_data = pd.get_dummies(data, columns=['Fruit'], prefix='Fruit')
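
A small self-contained sketch of both parameters follows. The sample data and the prefix "is" are hypothetical; dtype=int is passed only so that the result holds 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical sample data for illustration.
data = pd.DataFrame({"Fruit": ["Apple", "Banana", "Apple"]})

# prefix sets the text before the separator; prefix_sep sets the separator itself.
encoded = pd.get_dummies(
    data, columns=["Fruit"], prefix="is", prefix_sep="-", dtype=int
)

print(encoded.columns.tolist())  # ['is-Apple', 'is-Banana']
```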

NaN Values:

By default pd.get_dummies() ignores NaN values, so a row with a missing value gets 0 in every generated column. Set dummy_na=True to add a separate indicator column for NaN:
encoded_data = pd.get_dummies(data, dummy_na=True)
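
To make the effect visible, here is a minimal sketch that encodes the same column with and without dummy_na. The sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
data = pd.DataFrame({"color": ["red", np.nan, "blue"]})

# Default: the row with NaN gets 0 in every dummy column.
default = pd.get_dummies(data, dtype=int)

# dummy_na=True: an extra indicator column marks the missing rows.
with_na = pd.get_dummies(data, dummy_na=True, dtype=int)

print(default.columns.tolist())       # ['color_blue', 'color_red']
print(default.iloc[1].tolist())       # [0, 0]
print(with_na.shape)                  # (3, 3)
print(with_na.iloc[:, -1].tolist())   # [0, 1, 0]
```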

Specific Columns:

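Use the columns parameter of pd.get_dummies() to encode only specific columns. The example below also sets drop_first=True, which drops the first category of each encoded column.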
data = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange'],
    'Units':[1,4,10]
})
df = pd.DataFrame(data)


# Specify columns to encode
columns_to_encode = ['Fruit','Color']

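# drop_first=True drops the first (alphabetically sorted) category of each column.
# dtype=int yields 0/1 columns; pandas 2.0+ returns True/False by default.
df_encoded = pd.get_dummies(df, columns=columns_to_encode, drop_first=True, dtype=int)

print(df_encoded)
Output
   Units  Fruit_Banana  Fruit_Orange  Color_Red  Color_Yellow
0      1             0             0          1             0
1      4             1             0          0             1
2     10             0             1          0             0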

One-Hot Encoding with scikit-learn

In scikit-learn we can use the OneHotEncoder class to perform one-hot encoding. Let's walk through an example.

Import the necessary library and modules

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

Creating a Sample DataFrame

data = {
    'Animal': ['cat', 'dog', 'cat', 'bird'],
    'Color': ['red', 'blue', 'blue', 'red'],
    'Count': [1, 2, 3, 4]
}
df = pd.DataFrame(data)

Creating an instance of OneHotEncoder

# sparse_output requires scikit-learn 1.2+; older versions use sparse=False instead.
encoder = OneHotEncoder(sparse_output=False, drop='first')

Fit and transform the specified columns

encoded_features = encoder.fit_transform(df[['Animal', 'Color']])

Create a DataFrame from the encoded features

encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Animal', 'Color']))

Concatenating the encoded DataFrame with the original DataFrame (excluding the original columns)

df_final = pd.concat([df.drop(['Animal', 'Color'], axis=1), encoded_df], axis=1)

print(df_final)

Output

   Count  Animal_cat  Animal_dog  Color_red
0      1         1.0         0.0        1.0
1      2         0.0         1.0        0.0
2      3         1.0         0.0        0.0
3      4         0.0         0.0        1.0

Label Encoding vs. One-Hot Encoding

  • Definition and usage:
    Label encoding assigns a unique integer to each category in the column and is suitable for ordinal data.
    One-hot encoding creates a binary column for each category in the column and is suitable for nominal data.
  • Dataset size:
    One-hot encoding increases the dimensionality of the dataset and can produce sparse data, whereas label encoding keeps the number of columns unchanged.
  • Relationship:
    One-hot encoding makes no ordinal assumptions, whereas label encoding may impose a misleading order on nominal categories.
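
The contrast can be sketched on a single column. This example uses pandas category codes as a stand-in for label encoding (an assumption made for brevity; scikit-learn's LabelEncoder is another common choice), and the data is hypothetical.

```python
import pandas as pd

# Hypothetical nominal column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one integer per category (assigned in alphabetical order here).
labels = df["color"].astype("category").cat.codes
print(labels.tolist())  # [2, 1, 0, 1]

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], dtype=int)
print(one_hot.shape)  # (4, 3)
```

The labels suggest that red (2) is "greater" than blue (0), which is meaningless for colors. The one-hot version avoids that at the cost of three columns instead of one.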

Alternatives to One-Hot Encoding

  • Label Encoding: Suitable for ordinal data.
  • Frequency Encoding: Replaces categories with their frequency in the dataset.
  • Target Encoding: Encodes categories based on the target variable.
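
Frequency and target encoding can each be sketched in a line of pandas. The example below is an illustration under stated assumptions, not a reference implementation: the data is hypothetical, frequency is taken as the share of rows (raw counts are equally common), and target encoding is taken as the plain per-category mean of the target with no smoothing.

```python
import pandas as pd

# Hypothetical data: 'city' is the categorical feature, 'bought' the binary target.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "A", "B"],
    "bought": [1, 0, 1, 0, 0, 1],
})

# Frequency encoding: share of rows that belong to each category.
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# Target encoding: mean of the target within each category.
# In practice, compute these means on training data only to avoid target leakage.
df["city_target"] = df["city"].map(df.groupby("city")["bought"].mean())

print(df)
```

Both encodings keep a single column per feature, which is why they are often considered when a column has many distinct categories.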

Real-World Example

The following framework applies one-hot encoding in pandas with the pd.get_dummies() function:

  1. Load the dataset.
  2. Identify categorical columns.
  3. Apply get_dummies() to encode categorical columns.
  4. Analyze the encoded DataFrame.

First we import pandas and seaborn. We use seaborn only to load one of its example datasets, 'iris'.

import pandas as pd
import seaborn as sns

We start by loading the iris example dataset from seaborn. Note that load_dataset() downloads the data, so it needs an internet connection.

df = sns.load_dataset('iris')
df.head()

Output

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa

In the second step we identify the categorical columns. The expression df.select_dtypes(include=['object', 'category']).columns returns the names of the columns whose dtype is object or category.

categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns:", categorical_cols)

Output

Categorical columns: ['species']

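In the third step we apply the pd.get_dummies() function to encode the categorical columns.

# drop_first=True drops the first category (setosa).
# dtype=int yields 0/1 columns; pandas 2.0+ returns True/False by default.
encoded_df = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded_df.head())

Finally, in the fourth step, we analyze the encoded DataFrame. Because drop_first=True, there is no species_setosa column; a row with 0 in both remaining species columns is a setosa.
Output

   sepal_length  sepal_width  petal_length  petal_width  species_versicolor  species_virginica
0           5.1          3.5           1.4          0.2                   0                  0
1           4.9          3.0           1.4          0.2                   0                  0
2           4.7          3.2           1.3          0.2                   0                  0
3           4.6          3.1           1.5          0.2                   0                  0
4           5.0          3.6           1.4          0.2                   0                  0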

By following these steps you can load any dataset, encode its categorical columns with the pd.get_dummies() function and analyze the resulting DataFrame. When loading your own dataset, supply the correct file path and any other parameters the loader needs.
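
The same four steps might look like this for a CSV file. This is a sketch under assumptions: the CSV content and column names are hypothetical, io.StringIO stands in for a real file path, and categorical columns are detected as "everything that is not numeric", which is a simplification of the dtype check used above.

```python
import io
import pandas as pd

# Step 1: load the dataset. io.StringIO is a stand-in for a real file path.
csv_text = """city,plan,age
Paris,basic,31
Oslo,premium,45
Paris,premium,27
"""
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: identify categorical columns (assumption: every non-numeric column).
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)  # ['city', 'plan']

# Step 3: encode them; dtype=int keeps the output as 0/1.
encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)

# Step 4: inspect the result.
print(encoded.columns.tolist())  # ['age', 'city_Paris', 'plan_premium']
```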

Common Pitfalls and How to Avoid Them

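  • Use prefix and prefix_sep in get_dummies() to keep the generated column names distinct and avoid duplicate column names.
  • Keep the encoding consistent across datasets (for example, training and test sets) so that both have the same columns in the same order. Inconsistent columns cause problems during model training and evaluation.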
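
One way to keep the encoding consistent with get_dummies() is to align the columns of the second dataset to those of the first. The sketch below uses hypothetical training and test data; reindexing is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical split: the test set lacks 'blue' and contains an unseen 'pink'.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "pink"]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# Align the test columns to the training columns:
# missing categories are filled with 0 and unseen ones are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
print(test_enc.values.tolist())   # [[0, 1, 0], [0, 0, 0]]
```

An alternative is to fit scikit-learn's OneHotEncoder on the training data and reuse the fitted encoder on every later dataset, which keeps the column layout fixed by construction.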

Conclusion

One-hot encoding is a vital technique for handling categorical data in machine learning. By understanding its principles and how to apply it, we can improve the performance of our models.