Introduction to Feature Creation & Dummy Variables
introml.analyticsdojo.com
!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv
!wget https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv
--2020-11-10 17:42:13-- https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61194 (60K) [text/plain]
Saving to: ‘train.csv.1’
train.csv.1 100%[===================>] 59.76K --.-KB/s in 0.009s
2020-11-10 17:42:13 (6.51 MB/s) - ‘train.csv.1’ saved [61194/61194]
--2020-11-10 17:42:13-- https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28629 (28K) [text/plain]
Saving to: ‘test.csv.1’
test.csv.1 100%[===================>] 27.96K --.-KB/s in 0.005s
2020-11-10 17:42:13 (6.00 MB/s) - ‘test.csv.1’ saved [28629/28629]
15. Feature Extraction
Here we will talk about an important piece of machine learning: the extraction of quantitative features from data. By the end of this section you will:
Know how features are extracted from real-world data.
See an example of extracting numerical features from textual data.
In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks.
15.1. What Are Features?
15.2. Numerical Features
Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size n_samples \(\times\) n_features.
Previously, we looked at the iris dataset, which has 150 samples and 4 features:
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.data.shape)
(150, 4)
These features are:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
Numerical features such as these are straightforward: each sample contains a list of floating-point numbers, one per feature.
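As a quick check of the layout described above, the sketch below pairs each feature name with the value it takes in the first sample. It assumes scikit-learn is installed and relies on the feature_names attribute of the object returned by load_iris.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Each column of iris.data corresponds to one named feature.
for name, value in zip(iris.feature_names, iris.data[0]):
    print(f"{name}: {value}")
```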
15.3. Categorical Features
What if you have categorical features? For example, imagine there is data on the color of each iris:
color in [red, blue, purple]
You might be tempted to assign numbers to these categories, e.g. red=1, blue=2, purple=3, but in general this is a bad idea. Estimators tend to operate under the assumption that numerical features lie on some continuous scale, so that, for example, 1 and 2 are more alike than 1 and 3. This is often not the case for categorical features.
In fact, the example above is a subcategory of “categorical” features, namely, “nominal” features. Nominal features don’t imply an order, whereas “ordinal” features are categorical features that do imply an order. An example of ordinal features would be T-shirt sizes, e.g., XL > L > M > S.
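For ordinal features, an integer code can be reasonable as long as it follows the real order. The following is a minimal sketch using a pandas ordered Categorical; the sizes series and the size_order list are made up for illustration.

```python
import pandas as pd

# Hypothetical T-shirt sizes; the category order is stated explicitly.
sizes = pd.Series(["M", "XL", "S", "L", "M"])
size_order = ["S", "M", "L", "XL"]

ordered = pd.Categorical(sizes, categories=size_order, ordered=True)

# Integer codes follow the declared order: S=0, M=1, L=2, XL=3.
size_codes = pd.Series(ordered.codes, index=sizes.index)
print(size_codes.tolist())
```

Stating the order explicitly matters here: without the categories argument, pandas would sort the labels alphabetically, which does not match the size order.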
One work-around that prevents the learning algorithm from assuming an order among nominal categories is the so-called one-hot encoding. Here, we give each category its own dimension.
The enriched iris feature set would then be:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
color=purple (1.0 or 0.0)
color=blue (1.0 or 0.0)
color=red (1.0 or 0.0)
Note that using many of these categorical features may result in data which is better represented as a sparse matrix, as we’ll see with the text classification example below.
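To make the contrast between integer codes and one-hot encoding concrete, here is a small sketch on a made-up color column (the colors frame is hypothetical and not part of the iris data). It compares the distances implied by each encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical color column for three flowers.
colors = pd.DataFrame({"color": ["red", "blue", "purple"]})

# Integer coding imposes distances that have no meaning for colors.
int_codes = colors["color"].map({"red": 1, "blue": 2, "purple": 3})
print(abs(int_codes[0] - int_codes[1]))  # red vs blue
print(abs(int_codes[0] - int_codes[2]))  # red vs purple

# One-hot encoding gives each category its own 0/1 column.
one_hot = pd.get_dummies(colors, columns=["color"], dtype=float)
print(one_hot)

# Every pair of distinct colors is now the same distance apart.
rows = one_hot.to_numpy()
d_red_blue = np.linalg.norm(rows[0] - rows[1])
d_red_purple = np.linalg.norm(rows[0] - rows[2])
print(d_red_blue, d_red_purple)
```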
15.4. Derived Features
Another common feature type is derived features, where some pre-processing step is applied to the data to generate features that are somehow more informative. Derived features may be based on feature extraction and dimensionality reduction (such as PCA or manifold learning), may be linear or nonlinear combinations of features (such as in polynomial regression), or may be some more sophisticated transform of the features.
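As one illustration of nonlinear combinations of features, the sketch below uses scikit-learn's PolynomialFeatures to expand two made-up features into their degree-2 terms. The input array is hypothetical; this is only one of many ways to derive features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical samples with two original features, x1 and x2.
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

# Degree-2 expansion without the constant column: x1, x2, x1^2, x1*x2, x2^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_derived = poly.fit_transform(X)
print(X_derived)
```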
15.5. Combining Numerical and Categorical Features
As an example of how to work with both categorical and numerical data, we will perform survival prediction for the passengers of the RMS Titanic.
import os
import pandas as pd
titanic = pd.read_csv('train.csv')
print(titanic.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Here is a broad description of the keys and what they mean:
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival: Survival (0 = No; 1 = Yes)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Lifeboat
body: Body Identification Number
home.dest: Home/Destination
In general, it looks like name, sex, cabin, embarked, boat, body, and home.dest may be candidates for categorical features, while the rest appear to be numerical features. Note that the train.csv loaded here does not contain the boat, body, or home.dest columns, as the column index printed above shows. We can also look at the first couple of rows in the dataset to get a better understanding:
titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Columns such as “boat” and “body” must be discarded for any classification into survived vs. not survived, because they already contain this information; they are not present in this version of the data. The name is unique to each person (probably) and also non-informative. For a first try, we will use “Pclass”, “Sex”, “Age”, “SibSp”, “Parch”, “Fare” and “Embarked” as our features:
labels = titanic.Survived.values
features = titanic[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].copy()
features.head()
| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
2 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
4 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
The data now contains only useful features, but they are not in a format that the machine learning algorithms can understand. We need to transform the strings “male” and “female” into binary variables that indicate the gender, and similarly for “embarked”.
We can do that using the pandas get_dummies function:
featuremodel=pd.get_dummies(features)
featuremodel
| | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
2 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | 1 |
4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 2 | 27.0 | 0 | 0 | 13.0000 | 0 | 1 | 0 | 0 | 1 |
887 | 1 | 19.0 | 0 | 0 | 30.0000 | 1 | 0 | 0 | 0 | 1 |
888 | 3 | NaN | 1 | 2 | 23.4500 | 1 | 0 | 0 | 0 | 1 |
889 | 1 | 26.0 | 0 | 0 | 30.0000 | 0 | 1 | 1 | 0 | 0 |
890 | 3 | 32.0 | 0 | 0 | 7.7500 | 0 | 1 | 0 | 1 | 0 |
891 rows × 10 columns
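Notice that each categorical feature with N levels is expanded into N dummy variables. One of them is redundant, because it is fully determined by the other N-1, and this redundancy makes the dummy columns perfectly collinear, which can cause problems for linear models. Passing drop_first=True keeps only N-1 dummy variables per feature: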
pd.get_dummies(features, drop_first=True).head()
| | Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|
0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
2 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
4 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
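The redundancy among the N dummy columns can be checked directly. The sketch below uses a made-up ports frame (hypothetical, not the Titanic data) and shows that the column removed by drop_first can be rebuilt from the ones that remain.

```python
import pandas as pd

# Hypothetical port column with N = 3 levels.
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(ports, dtype=int)                      # N columns
reduced = pd.get_dummies(ports, drop_first=True, dtype=int)  # N - 1 columns

# The N dummy columns always sum to 1, so any one of them is redundant.
print(full.sum(axis=1).tolist())

# The dropped column (Embarked_C) can be rebuilt from the remaining ones.
rebuilt_c = 1 - reduced.sum(axis=1)
print((rebuilt_c == full["Embarked_C"]).all())
```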
This transformation successfully encoded the string columns. However, one might argue that the passenger class is also a categorical variable. We can explicitly list the columns to encode using the columns parameter, and include Pclass:
features_dummies = pd.get_dummies(features, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
features_dummies
| | Age | SibSp | Parch | Fare | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 1 | 0 | 1 |
1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | 0 |
2 | 26.0 | 0 | 0 | 7.9250 | 0 | 1 | 0 | 0 | 1 |
3 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 0 | 0 | 1 |
4 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 27.0 | 0 | 0 | 13.0000 | 1 | 0 | 1 | 0 | 1 |
887 | 19.0 | 0 | 0 | 30.0000 | 0 | 0 | 0 | 0 | 1 |
888 | NaN | 1 | 2 | 23.4500 | 0 | 1 | 0 | 0 | 1 |
889 | 26.0 | 0 | 0 | 30.0000 | 0 | 0 | 1 | 0 | 0 |
890 | 32.0 | 0 | 0 | 7.7500 | 0 | 1 | 1 | 1 | 0 |
891 rows × 9 columns
#Transform from Pandas to numpy with .values
data = features_dummies.values
data
array([[22., 1., 0., ..., 1., 0., 1.],
[38., 1., 0., ..., 0., 0., 0.],
[26., 0., 0., ..., 0., 0., 1.],
...,
[nan, 1., 2., ..., 0., 0., 1.],
[26., 0., 0., ..., 1., 0., 0.],
[32., 0., 0., ..., 1., 1., 0.]])
type(data)
numpy.ndarray
16. Feature Preprocessing with Scikit Learn
Here we are going to look at a more efficient way to prepare our datasets using pipelines.
features.head()
| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
2 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
4 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
features.isna().sum()
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
Embarked 2
dtype: int64
#Quick example to show how the data imputer works.
from sklearn.impute import SimpleImputer
import numpy as np
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
#Keep the fitted imputer and its output under separate names.
imputed = imp_mean.fit_transform([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
imputed
array([[ 7. , 2. , 3. ],
[ 4. , 3.5, 6. ],
[10. , 5. , 9. ]])
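The mean only makes sense for numeric columns. For string-valued columns, the most_frequent strategy can be used instead, as in this small sketch on a made-up single-column array (the ports array is hypothetical).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single categorical column with one missing entry.
ports = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)

# Replace the missing entry with the most common value in the column.
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
ports_filled = imp_mode.fit_transform(ports)
print(ports_filled.ravel())
```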
The function below bundles imputation, scaling, and one-hot encoding into a single reusable preprocessing step. You will want to remember this one.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
missing_values = ['Age','Embarked']
features_num = ['Fare', 'Age']
features_cat = [ 'Sex', 'Embarked', 'Pclass', 'SibSp']
def pre_process_dataframe(df, numeric, categorical, missing=np.nan, missing_num='mean', missing_cat='most_frequent'):
    """Impute missing values, standardize numeric features, and one-hot encode categorical features.
    """
    #Create a data imputer for numeric values.
    imp_num = SimpleImputer(missing_values=missing, strategy=missing_num)
    #Create a pipeline which imputes values and then uses the standard scaler.
    pipe_num = make_pipeline(imp_num, StandardScaler())
    #Create a different imputer for categorical values.
    imp_cat = SimpleImputer(missing_values=missing, strategy=missing_cat)
    #Impute, then one-hot encode, dropping the first level of each feature.
    pipe_cat = make_pipeline(imp_cat, OneHotEncoder(drop='first'))
    #Use the column lists passed as arguments rather than the global variables.
    preprocessor = make_column_transformer((pipe_num, numeric), (pipe_cat, categorical))
    return pd.DataFrame(preprocessor.fit_transform(df))
df=pre_process_dataframe(features, features_num, features_cat )
df
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.502445 | -0.592481 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.786845 | 0.638789 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | -0.488854 | -0.284663 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.420730 | 0.407926 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | -0.486337 | 0.407926 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | -0.386671 | -0.207709 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
887 | -0.044381 | -0.823344 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
888 | -0.176263 | 0.000000 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
889 | -0.044381 | -0.284663 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
890 | -0.492378 | 0.177063 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
891 rows × 13 columns
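The transformed frame has only integer column labels, which makes it hard to tell which column came from which feature. One possible way to recover readable names is sketched below on a small made-up frame. The toy data, the transformer names "num" and "cat", and the naming scheme are assumptions for illustration; the step name "onehotencoder" is the one make_pipeline assigns automatically.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small hypothetical frame with the same kinds of columns as the Titanic data.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "Age": [22.0, 38.0, np.nan, 35.0, 35.0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", np.nan, "S"],
})
num_cols = ["Fare", "Age"]
cat_cols = ["Sex", "Embarked"]

pipe_num = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
pipe_cat = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         OneHotEncoder(drop="first"))

# sparse_threshold=0 asks for a dense array regardless of how many zeros there are.
ct = ColumnTransformer([("num", pipe_num, num_cols), ("cat", pipe_cat, cat_cols)],
                       sparse_threshold=0)
values = ct.fit_transform(toy)

# Rebuild column names: with drop="first", the first level of each feature is omitted.
encoder = ct.named_transformers_["cat"].named_steps["onehotencoder"]
cat_names = [f"{col}_{level}"
             for col, levels in zip(cat_cols, encoder.categories_)
             for level in levels[1:]]
labeled = pd.DataFrame(values, columns=num_cols + cat_names, index=toy.index)
print(labeled.round(3))
```

Output columns follow the order in which the transformers are listed, so the numeric columns come first and the encoded categorical columns after them.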
features
| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
2 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
4 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
... | ... | ... | ... | ... | ... | ... | ... |
886 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S |
887 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S |
888 | 3 | female | NaN | 1 | 2 | 23.4500 | S |
889 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C |
890 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q |
891 rows × 7 columns
df.isna().sum()
0 0
1 0
2 0
3 0
4 0
5 0
6 0
dtype: int64
imp=SimpleImputer(strategy="most_frequent")
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
missing_values = ['Age','Embarked']
features_num = ['Fare', 'Age']
features_cat = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
cat_preprocess = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())
preprocessor = make_column_transformer(
(SimpleImputer(strategy="most_frequent"), missing_values),
(StandardScaler(), features_num),
(cat_preprocess, features_cat),
)
X = preprocessor.fit_transform(features)
#X_valid = preprocessor.transform(X_valid)
pd.DataFrame(X).head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | S | -0.502445 | -0.530377 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 38 | C | 0.786845 | 0.571831 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 26 | S | -0.488854 | -0.254825 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 35 | S | 0.42073 | 0.365167 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 35 | S | -0.486337 | 0.365167 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
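One thing to notice in the output above: columns 0 and 1 hold the imputed but otherwise raw Age and Embarked values. The entries of a column transformer each receive the original columns and run side by side, so the imputer listed first does not feed the scaler listed second. To impute and then scale the same column, the two steps need to be chained in one pipeline. The sketch below shows the difference on a made-up Age column (the toy frame is hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical Age column with one missing value.
toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})

# Separate entries: each transformer receives the original Age column.
side_by_side = make_column_transformer(
    (SimpleImputer(strategy="most_frequent"), ["Age"]),
    (StandardScaler(), ["Age"]),
)
out_parallel = side_by_side.fit_transform(toy)

# Chained in a pipeline: the scaler receives the imputed values.
chained = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"), StandardScaler()), ["Age"]),
)
out_chained = chained.fit_transform(toy)

# Side by side: two output columns, and the scaled one still contains NaN.
print(out_parallel.shape, np.isnan(out_parallel).any())
# Chained: one output column with no missing values.
print(out_chained.shape, np.isnan(out_chained).any())
```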