In this chapter, we will use the following Python libraries: pandas, NumPy, Matplotlib, and scikit-learn. I recommend installing the free Anaconda Python distribution, which contains all of these packages.
For details on how to install the Anaconda Python distribution, visit the Technical requirements section in Chapter 1, Foreseeing Variable Problems in Building ML Models.
We will also use the open source Python libraries Feature-engine and Category Encoders, which can be installed using pip:
pip install feature-engine
pip install category_encoders
To learn more about Feature-engine, visit the following sites:
- Home page: https://www.trainindata.com/feature-engine
- GitHub: https://github.com/solegalli/feature_engine/
- Documentation: https://feature-engine.readthedocs.io
To learn more about Category Encoders, visit the following:
- Documentation: https://contrib.scikit-learn.org/categorical-encoding/
To run the recipes successfully, check that you have the same or higher versions of the Python libraries indicated in the requirements.txt file in the accompanying GitHub repository at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook.
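To compare the versions installed in your environment against those listed in requirements.txt, you can run a short check like the following (a minimal sketch; importlib.metadata is part of the standard library from Python 3.8 onward):
from importlib.metadata import version  # standard library, Python 3.8+
# Print the installed version of each library used in this chapter,
# then compare the output against requirements.txt. If a lookup fails,
# try swapping dashes and underscores in the package name.
for package in ['pandas', 'numpy', 'matplotlib',
                'scikit-learn', 'feature-engine', 'category_encoders']:
    print(package, version(package))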
We will use the Credit Approval dataset from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/ml/datasets/credit+approval.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
To prepare the dataset, follow these steps:
- Visit http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/.
- Click on crx.data to download the data.
- Save crx.data to the folder from which you will run the following commands.
After downloading the data, open up a Jupyter Notebook or a Python IDE and run the following commands.
- Import the required libraries:
import random
import pandas as pd
import numpy as np
- Load the data:
data = pd.read_csv('crx.data', header=None)
- Create a list with the variable names:
varnames = ['A'+str(s) for s in range(1,17)]
- Add the variable names to the dataframe:
data.columns = varnames
- Replace the question marks in the dataset with NumPy NaN values:
data = data.replace('?', np.nan)
- Re-cast the numerical variables to float types:
data['A2'] = data['A2'].astype('float')
data['A14'] = data['A14'].astype('float')
- Re-code the target variable as binary:
data['A16'] = data['A16'].map({'+':1, '-':0})
- Make lists with the categorical and numerical variables:
cat_cols = [c for c in data.columns if data[c].dtypes=='O']
num_cols = [c for c in data.columns if data[c].dtypes!='O']
- Fill in the missing data:
data[num_cols] = data[num_cols].fillna(0)
data[cat_cols] = data[cat_cols].fillna('Missing')
- Save the prepared data:
data.to_csv('creditApprovalUCI.csv', index=False)
You can find a Jupyter Notebook with these commands in the accompanying GitHub repository at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook.
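To confirm that the preparation worked, you can reload the saved file and run a few quick checks (an optional sanity check; the expected values follow from the dataset description and the steps above):
import pandas as pd
data = pd.read_csv('creditApprovalUCI.csv')
# The Credit Approval dataset contains 690 rows and 16 columns (A1 to A16).
print(data.shape)
# All missing values were filled in, so this should print 0.
print(data.isnull().sum().sum())
# The target A16 was re-coded as binary, so only 0 and 1 should appear.
print(data['A16'].unique())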
In one-hot encoding, we represent a categorical variable as a group of binary variables, where each binary variable represents one category. The binary variable indicates whether the category is present in an observation (1) or not (0). The following table shows the one-hot encoded representation of the Gender variable with the categories of Male and Female:
| Gender | Female | Male |
|--------|--------|------|
| Female | 1      | 0    |
| Male   | 0      | 1    |
| Male   | 0      | 1    |
| Female | 1      | 0    |
| Female | 1      | 0    |
As shown in the table, from the Gender variable we can derive the binary variable Female, which takes the value of 1 for females, or the binary variable Male, which takes the value of 1 for males in the dataset.
For the categorical variable of Color with the values of red, blue, and green, we can create three variables called red, blue, and green. These variables will take the value of 1 if the observation is red, blue, or green, respectively, or 0 otherwise.
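To see this in action, we can one-hot encode a toy Gender variable with pandas (a minimal sketch using pd.get_dummies(); depending on your pandas version, the dummies may display as 0/1 or as True/False):
import pandas as pd
# Toy dataframe reproducing the Gender example from the preceding table.
df = pd.DataFrame({'Gender': ['Female', 'Male', 'Male', 'Female', 'Female']})
# get_dummies() derives one binary variable per category.
dummies = pd.get_dummies(df['Gender'])
print(dummies)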
A categorical variable with k unique categories can be encoded in k-1 binary variables. For Gender, k is 2 as it contains two labels (male and female); therefore, we only need to create one binary variable (k - 1 = 1) to capture all of the information. For the Color variable, which has three categories (k = 3; red, blue, and green), we need to create two (k - 1 = 2) binary variables to capture all the information, so that the following occurs (see the sketch after this list):
- If the observation is red, it will be captured by the variable red (red = 1, blue = 0).
- If the observation is blue, it will be captured by the variable blue (red = 0, blue = 1).
- If the observation is green, it will be captured by the combination of red and blue (red = 0, blue = 0).
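To derive k-1 binary variables instead of k, pd.get_dummies() offers the drop_first parameter (a minimal sketch; note that pandas drops the first category in alphabetical order, blue in this example, rather than green as above):
import pandas as pd
df = pd.DataFrame({'Color': ['red', 'blue', 'green', 'red']})
# drop_first=True returns k - 1 = 2 dummies; blue becomes the reference
# category, captured by the rows where both green and red are 0.
dummies = pd.get_dummies(df['Color'], drop_first=True)
print(dummies)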
There are a few occasions in which we may prefer to encode a categorical variable with k binary variables (see the sketch after this list):
- When training decision trees, as they do not evaluate the entire feature space at the same time
- When selecting features recursively
- When determining the importance of each category within a variable
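For these cases, scikit-learn's OneHotEncoder keeps all k categories by default (a minimal sketch; Feature-engine's one-hot encoder offers a similar choice, in recent versions through its drop_last parameter):
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
df = pd.DataFrame({'Color': ['red', 'blue', 'green', 'red']})
# With the default drop=None, the encoder returns one binary variable
# per category; pass drop='first' to obtain the k - 1 encoding instead.
encoder = OneHotEncoder(drop=None)
encoded = encoder.fit_transform(df[['Color']])
print(encoder.categories_)
print(encoded.toarray())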