Introduction to One-Hot Encoding
One-hot encoding converts categorical data into a numerical format that machine learning algorithms can work with. It creates a new binary column for each unique value in a categorical feature.
Each new column holds 0 or 1, where 1 indicates the presence of that value and 0 its absence. In Python, one-hot encoding is commonly performed with the pd.get_dummies() function or with scikit-learn’s OneHotEncoder class.
What is One-Hot Encoding?
One-hot encoding is the process of transforming categorical data into a numerical representation in which each category is represented by a separate binary column.
For example, suppose a categorical column named Colors has three unique values (‘red’, ‘green’, ‘blue’). One-hot encoding creates three new binary columns: colors_red, colors_green, and colors_blue. In each row, the column that matches the row’s value is set to 1 and the other columns are set to 0.
Colors | colors_red | colors_green | colors_blue |
red | 1 | 0 | 0 |
green | 0 | 1 | 0 |
blue | 0 | 0 | 1 |
green | 0 | 1 | 0 |
green | 0 | 1 | 0 |
red | 1 | 0 | 0 |
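To make the mechanism concrete, here is a minimal hand-rolled sketch that reproduces the table above in plain Python. The variable names are illustrative, and the column order is fixed by hand to match the table; library functions such as pd.get_dummies() choose their own order.

```python
# Values of the 'Colors' column from the table above.
colors = ["red", "green", "blue", "green", "green", "red"]

# Column order chosen by hand so it matches the table (an assumption of this sketch).
categories = ["red", "green", "blue"]

# One row per value: 1 in the column of the matching category, 0 elsewhere.
one_hot_rows = [
    [1 if value == category else 0 for category in categories]
    for value in colors
]

for value, row in zip(colors, one_hot_rows):
    print(value, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.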
Why is One-Hot Encoding Important in Machine Learning?
Most machine learning algorithms require numerical input, so categorical columns must be converted to a numerical representation before training. One-hot encoding is one way to do this.
One-hot encoding prevents algorithms from assuming an ordinal relationship between categories. Such an assumption can reduce the accuracy of the model because the model learns a pattern that does not exist in the data.
One-Hot Encoding can often lead to better model performance when dealing with categorical features that are not inherently ordered.
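The ordinal-relationship problem can be illustrated with a small sketch. The integer codes below are hypothetical (any arbitrary assignment would show the same effect): with integer codes some colors look "closer" than others, while with one-hot vectors every pair of distinct colors is equally far apart.

```python
import math

# Hypothetical integer codes for three unordered colors.
label_codes = {"red": 0, "green": 1, "blue": 2}

# One-hot vectors for the same colors.
one_hot = {
    "red": (1, 0, 0),
    "green": (0, 1, 0),
    "blue": (0, 0, 1),
}

# With integer codes, 'blue' appears twice as far from 'red' as 'green' does.
label_red_green = abs(label_codes["red"] - label_codes["green"])
label_red_blue = abs(label_codes["red"] - label_codes["blue"])

# With one-hot vectors, the Euclidean distance is the same for every pair.
onehot_red_green = math.dist(one_hot["red"], one_hot["green"])
onehot_red_blue = math.dist(one_hot["red"], one_hot["blue"])

print("integer codes:", label_red_green, label_red_blue)
print("one-hot:", onehot_red_green, onehot_red_blue)
```

A distance-based or linear model would treat the unequal integer distances as meaningful, which is the misleading pattern described above.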
Understanding Categorical Data
Categorical data represents categories or groups. There are two types: nominal and ordinal. Nominal data has no inherent order (for example, colors), whereas ordinal data has a natural order (for example, small, medium, large).
When to Use One-Hot Encoding
One-hot encoding is generally suitable for nominal categorical data, that is, any categorical data that has no natural order.
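As a rough sketch of this rule of thumb, the example below treats a made-up ordinal column and a made-up nominal column differently. The column names, the data, and the size_order mapping are all assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordinal: small < medium < large
    "color": ["red", "blue", "green", "red"],       # nominal: no natural order
})

# Ordinal column: map each level to its rank so the order is preserved.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_rank"] = df["size"].map(size_order)

# Nominal column: one binary column per category.
# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default.
encoded = pd.get_dummies(df.drop(columns="size"), columns=["color"], dtype=int)
print(encoded)
```

The ordinal column stays a single numeric column, while the nominal column expands into one column per category.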
Implementing One-Hot Encoding in pandas Using get_dummies()
pip install pandas
Importing Necessary Libraries:
import pandas as pd
Basic Syntax of get_dummies():
pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Loading the dataset
import pandas as pd
data = {'color': ['red', 'green', 'blue', 'red']}
df = pd.DataFrame(data)
# Perform one-hot encoding on the DataFrame (df).
# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default.
df_encoded = pd.get_dummies(df, dtype=int)
df_encoded.head()
Output
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
Customizing One-Hot Encoding:
Prefixes:
Use the prefix and prefix_sep parameters of pd.get_dummies() to customize the generated column names:
encoded_data = pd.get_dummies(data, columns=['Fruit'], prefix='Fruit')
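A small self-contained sketch of how the two parameters shape the column names follows. The DataFrame and the "is" / ":" choices are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange"]})

# Default naming: '<column>_<category>'.
default_names = pd.get_dummies(df, columns=["Fruit"], dtype=int)

# Custom naming: '<prefix><prefix_sep><category>'.
custom_names = pd.get_dummies(
    df, columns=["Fruit"], prefix="is", prefix_sep=":", dtype=int
)

print(list(default_names.columns))
print(list(custom_names.columns))
```

Only the column labels change; the encoded values are the same in both results.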
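NaN Values:
Set dummy_na=True to add an extra indicator column for missing (NaN) values. By default, a row with a missing value gets 0 in every dummy column.
encoded_data = pd.get_dummies(data, dummy_na=True)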
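The effect of dummy_na can be seen on a tiny made-up column with one missing value. This is a sketch with illustrative names, not part of the original example data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", np.nan, "blue"]})

# Default: the missing value gets 0 in every dummy column.
without_na = pd.get_dummies(df, dtype=int)

# dummy_na=True: one extra column flags the missing value.
with_na = pd.get_dummies(df, dummy_na=True, dtype=int)

print(without_na)
print(with_na)
```

Comparing the two printed frames shows the additional indicator column, which is 1 only in the row where the value is missing.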
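Specific Columns:
Pass a list to the columns parameter to encode only those columns. Setting drop_first=True drops the first category of each encoded column (here Fruit_Apple and Color_Orange), since it is implied when all the remaining columns are 0.
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange'],
    'Units': [1, 4, 10]
})
# Specify columns to encode
columns_to_encode = ['Fruit', 'Color']
# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default.
df_encoded = pd.get_dummies(df, columns=columns_to_encode, drop_first=True, dtype=int)
print(df_encoded)
Output
   Units  Fruit_Banana  Fruit_Orange  Color_Red  Color_Yellow
0      1             0             0          1             0
1      4             1             0          0             1
2     10             0             1          0             0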
One-Hot Encoding with scikit-learn example
In scikit-learn we can use the OneHotEncoder class to perform one-hot encoding. Let’s walk through an example.
Import the necessary library and modules
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
Creating a Sample DataFrame
data = {
    'Animal': ['cat', 'dog', 'cat', 'bird'],
    'Color': ['red', 'blue', 'blue', 'red'],
    'Count': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
Creating an instance of OneHotEncoder
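# sparse_output=False returns a dense NumPy array (scikit-learn 1.2+; older versions use sparse=False).
# drop='first' drops the first category of each feature.
encoder = OneHotEncoder(sparse_output=False, drop='first')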
Fit and transform the specified columns
encoded_features = encoder.fit_transform(df[['Animal', 'Color']])
Create a DataFrame from the encoded features
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Animal', 'Color']))
Concatenating the encoded DataFrame with the original DataFrame (excluding the original columns)
df_final = pd.concat([df.drop(['Animal', 'Color'], axis=1), encoded_df], axis=1)
print(df_final)
Output
   Count  Animal_cat  Animal_dog  Color_red
0      1         1.0         0.0        1.0
1      2         0.0         1.0        0.0
2      3         1.0         0.0        0.0
3      4         0.0         0.0        1.0
Label Encoding vs. One-Hot Encoding
- Definition and usage:
In label encoding we assign a unique integer to each category in the column. Label encoding is suitable for ordinal data.
In one-hot encoding we create a binary column for each category in the column. It is suitable for nominal data.
- Dataset size:
One-hot encoding increases the dimensionality of the dataset and can lead to a sparse dataset, whereas label encoding keeps the number of columns unchanged.
- Relationship:
One-hot encoding makes no ordinal assumptions, whereas label encoding may impose misleading ordinal relationships.
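The size difference can be sketched on a small made-up column. Here pd.factorize() stands in for label encoding; that is an assumption of this sketch, and other tools such as scikit-learn’s LabelEncoder could be used instead.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer per category, still a single column.
# sort=True assigns codes in sorted category order.
codes, uniques = pd.factorize(colors, sort=True)

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(colors, dtype=int)

print(codes)          # integer code for each row
print(list(uniques))  # category behind each code
print(one_hot.shape)  # (rows, number of categories)
```

With k distinct categories, label encoding keeps one column while one-hot encoding produces k columns, which is why high-cardinality features can make the dataset wide and sparse.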
Alternatives to One-Hot Encoding
- Label Encoding: Suitable for ordinal data.
- Frequency Encoding: Replaces each category with its frequency in the dataset.
- Target Encoding: Replaces each category with a statistic of the target variable for that category, typically the mean.
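Frequency and target encoding from the list above can be sketched in a few lines of pandas. The data and column names are made up, and the target encoding shown here uses the plain per-category mean; in practice it is usually computed on the training data only (often with smoothing) to avoid leaking the target.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Oslo", "Paris", "Rome"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Frequency encoding: replace each category with its share of the rows.
frequency = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(frequency)

# Target encoding: replace each category with the mean target for that category.
target_mean = df.groupby("city")["target"].mean()
df["city_target"] = df["city"].map(target_mean)

print(df)
```

Both approaches keep a single numeric column per feature, regardless of how many categories it has.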
Real-World Example
You can follow the framework below to apply one-hot encoding in pandas with the pd.get_dummies() function.
- Load the dataset.
- Identify categorical columns.
- Apply get_dummies() to encode categorical columns.
- Analyze the encoded DataFrame.
First we import pandas and seaborn. We use seaborn only to load one of its built-in example datasets, ‘iris’.
import pandas as pd
import seaborn as sns
In step 1 we load the built-in iris dataset from seaborn.
df = sns.load_dataset('iris')
df.head()
Output
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
In step 2 we identify the categorical columns. The expression df.select_dtypes(include=['object', 'category']).columns returns the columns whose dtype is object or category.
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns:", categorical_cols)
Output
Categorical columns: ['species']
In the third step we apply pd.get_dummies() function to encode categorical columns
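# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default.
encoded_df = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded_df.head())
Finally, in the fourth step we analyze the encoded DataFrame. Because drop_first=True drops the first category (setosa), only species_versicolor and species_virginica remain, and a setosa row has 0 in both.
Output
   sepal_length  sepal_width  petal_length  petal_width  species_versicolor  species_virginica
0           5.1          3.5           1.4          0.2                   0                  0
1           4.9          3.0           1.4          0.2                   0                  0
2           4.7          3.2           1.3          0.2                   0                  0
3           4.6          3.1           1.5          0.2                   0                  0
4           5.0          3.6           1.4          0.2                   0                  0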
By following these steps you can load any dataset, encode its categorical columns with pd.get_dummies(), and analyze the resulting DataFrame. When loading your own dataset, supply the correct file path and any other required parameters.
Common Pitfalls and How to Avoid Them
- Use the prefix and prefix_sep parameters of get_dummies() to avoid duplicate or ambiguous column names.
- Keep the encoding consistent across datasets (for example, training and test sets) so that the model sees the same columns during training and evaluation.
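One way to keep the encoding consistent, sketched here under the assumption that the training and test sets are encoded separately with pd.get_dummies(), is to reindex the test columns to match the training columns. The data is made up, and ‘purple’ is a category that never appears in the training set.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})  # 'purple' is unseen in train

train_encoded = pd.get_dummies(train, dtype=int)
test_encoded = pd.get_dummies(test, dtype=int)

# Align test to the training columns:
# columns missing from test are added as 0, columns unseen in train are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(test_aligned)
```

After alignment the unseen category is encoded as all zeros, and both frames have the same columns in the same order, so a model trained on train_encoded can consume test_aligned directly.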
Conclusion
One-hot encoding is a vital technique for handling categorical data in machine learning. By understanding its principles and how to use it, we can improve the performance of our models.