
Introduction

Intuitively, feature engineering is the process of getting to know the data intimately so that we can handcraft new features that represent the dataset better and improve the model's predictions.

Some common methods are:

  • Binning/bucketing: For example, in a dataset about home credit default rates, when collecting clients’ ages it can make better sense to divide the range into categories: less than 20 years old, 20 to 30, 30 to 40, 40 to 50, and above 50. The reasoning behind this division is that clients younger than 20 are not allowed to take a loan, and clients above 50 can be grouped together since the most popular ages for taking loans are from 20 to 50. We then divide the range from 20 to 50 equally. This unequal division of ages into buckets actually makes better sense and generalizes the age groups better.

  • Polynomial features: We can square a feature, for example, when we assume it has a nonlinear relationship with the target.

  • Log transform: Apply a logarithm to variables with long-tailed distributions so that the transformed feature is closer to a normal distribution.

  • Feature interaction: This is a way to combine different features, assuming they have relationships among themselves. For example, we can combine a client’s family-related features (with a simple linear combination or a more complicated equation). The new feature would represent an overview of the client’s family status.

  • Categorical feature handling: Since we usually need to transform categorical features into numerical ones, there are several ways to do it, such as one-hot encoding (encode the value as a vector of 1s and 0s, with a 1 marking the category it belongs to) or label encoding (encode each category as a different number).

  • Date-time variables: If we have data on dates and times, we can add a lagged variable (the value of the feature at some point in the past) or calculate the interval between two dates (for example, the age of the house or car of the client who requests a loan).

  • Scaling the features: Since features are different in nature, they naturally use different units and scales. That can make the model less accurate, since the model doesn’t really grasp the differences in scale. We can bring all features onto one scale so that, in a way, the machine understands the dataset a bit better. The two most popular methods are min-max scaling and standardization. In min-max scaling, we rescale each feature to a fixed range, for example from 0 to 1; this is also called normalization. In standardization, we subtract the mean from each value and divide by the standard deviation of the sample. A small sketch of binning, the log transform, and scaling follows this list.
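
To make a few of these ideas concrete, here is a minimal sketch on a small made-up frame; the column names, values, and bucket edges are purely illustrative and are not taken from the Home Credit data.

import numpy as np
import pandas as pd

# toy data, invented for illustration only
df = pd.DataFrame({'age': [19, 24, 33, 47, 58],
                   'income': [12000, 30000, 45000, 250000, 52000]})

# binning/bucketing: the unequal age buckets described above
df['age_group'] = pd.cut(df['age'], bins=[0, 20, 30, 40, 50, np.inf],
                         labels=['<20', '20-30', '30-40', '40-50', '50+'])

# log transform: compress a long-tailed income distribution
df['log_income'] = np.log1p(df['income'])

# min-max scaling (normalization) to the [0, 1] range
df['income_minmax'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# standardization: subtract the mean, divide by the standard deviation
df['income_std'] = (df['income'] - df['income'].mean()) / df['income'].std()

print(df)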

Code example

The notes above are general advice. In reality, the feature engineering process depends on the nature of the dataset itself (its dimensions, its purpose, the underlying patterns). Today we explore the dataset for the Home Credit default risk competition. Looking into the features, we can see that there are about 50 features about the building the client lives in. We can combine those features into a new one named “living_condition” using machine learning techniques such as kernel PCA and the K-means algorithm. Then we can add a financial_stress variable as a weighted combination of common factors such as the credit-to-income ratio, current income, number of children, family size, spouse situation, and other bills. Thirdly, we add a statistical aggregation of previous loan applications to bring the client’s credibility and creditworthiness into the dataset. Finally, we can consider some other features, such as a red flag that takes into account credit evaluations from external sources, and their nonlinear relationship with the target.

Preprocessing

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
np.set_printoptions(suppress=True)

labels = pd.read_csv('home-credit-default-risk/HomeCredit_columns_description.csv',encoding='ISO-8859-1')
data = pd.read_csv('home-credit-default-risk/application_train.csv')
labels.head()
Unnamed: 0 Table Row Description Special
0 1 application_{train|test}.csv SK_ID_CURR ID of loan in our sample NaN
1 2 application_{train|test}.csv TARGET Target variable (1 - client with payment diffi... NaN
2 5 application_{train|test}.csv NAME_CONTRACT_TYPE Identification if loan is cash or revolving NaN
3 6 application_{train|test}.csv CODE_GENDER Gender of the client NaN
4 7 application_{train|test}.csv FLAG_OWN_CAR Flag if the client owns a car NaN
# First, take all the names of the features related to the building
living_condition = labels['Row'][44:91]
living_condition
44                  APARTMENTS_AVG
45                BASEMENTAREA_AVG
46     YEARS_BEGINEXPLUATATION_AVG
47                 YEARS_BUILD_AVG
48                  COMMONAREA_AVG
49                   ELEVATORS_AVG
50                   ENTRANCES_AVG
51                   FLOORSMAX_AVG
52                   FLOORSMIN_AVG
53                    LANDAREA_AVG
54            LIVINGAPARTMENTS_AVG
55                  LIVINGAREA_AVG
56         NONLIVINGAPARTMENTS_AVG
57               NONLIVINGAREA_AVG
58                 APARTMENTS_MODE
59               BASEMENTAREA_MODE
60    YEARS_BEGINEXPLUATATION_MODE
61                YEARS_BUILD_MODE
62                 COMMONAREA_MODE
63                  ELEVATORS_MODE
64                  ENTRANCES_MODE
65                  FLOORSMAX_MODE
66                  FLOORSMIN_MODE
67                   LANDAREA_MODE
68           LIVINGAPARTMENTS_MODE
69                 LIVINGAREA_MODE
70        NONLIVINGAPARTMENTS_MODE
71              NONLIVINGAREA_MODE
72                 APARTMENTS_MEDI
73               BASEMENTAREA_MEDI
74    YEARS_BEGINEXPLUATATION_MEDI
75                YEARS_BUILD_MEDI
76                 COMMONAREA_MEDI
77                  ELEVATORS_MEDI
78                  ENTRANCES_MEDI
79                  FLOORSMAX_MEDI
80                  FLOORSMIN_MEDI
81                   LANDAREA_MEDI
82           LIVINGAPARTMENTS_MEDI
83                 LIVINGAREA_MEDI
84        NONLIVINGAPARTMENTS_MEDI
85              NONLIVINGAREA_MEDI
86              FONDKAPREMONT_MODE
87                  HOUSETYPE_MODE
88                  TOTALAREA_MODE
89              WALLSMATERIAL_MODE
90             EMERGENCYSTATE_MODE
Name: Row, dtype: object
# Now preprocess the data a bit
data.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

y_train = data['TARGET']
X_train = data.drop(['TARGET'], axis=1)
y_train = y_train.to_frame()
y_train
TARGET
0 1
1 0
2 0
3 0
4 0
... ...
307506 0
307507 0
307508 0
307509 1
307510 0

307511 rows × 1 columns

# Let's handle categorical / numerical variables and missing values

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']

X_train_categorical = X_train.select_dtypes(include=categoricals)
X_train_numerical = X_train.select_dtypes(include=numerics)

categorical_columns = X_train_categorical.columns
numerical_columns = X_train_numerical.columns

categorical_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
numerical_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# imputer = imputer.fit(X_train)
  
X_train_categorical = categorical_imputer.fit_transform(X_train_categorical)
X_train_categorical = pd.DataFrame(data=X_train_categorical, columns=categorical_columns)

X_train_numerical = numerical_imputer.fit_transform(X_train_numerical)
X_train_numerical = pd.DataFrame(data=X_train_numerical, columns=numerical_columns)

The thing about using a label encoder instead of a one-hot encoder is that the label encoder carries an inherent assumption that the values are hierarchically meaningful. This might or might not reflect the qualitative meaning of the values in reality. For example, suppose we categorize houses into 3 districts: district 1, district 2, district 3, and encode them as the numbers 0, 1, and 2. Since 2 > 1, this might suggest that district 2 is better than district 1, which may not reflect the real situation, where there is no inherent difference between the two geographical locations (they are both equally distant from the center, for example). We might take this inherent bias into account and create a new variable (via clustering or via distance to the center) to compensate for it in the model. The same goes for the days of the week: the meaning of Monday through Sunday is not inherently linear. We can hope that the model has enough data to learn this representation. One-hot encoding, on the other hand, assumes the categories are all equal: it puts a 1 for that category and 0s for the others in the representation vector. For example, a house in district 1 can be represented as [1,0,0].
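
To make the contrast concrete, here is a minimal sketch on a hypothetical district column (the values are invented and not part of the Home Credit data):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

districts = pd.DataFrame({'district': ['district 1', 'district 2', 'district 3', 'district 1']})

# label encoding: each category becomes an integer, which implies an ordering
districts['district_label'] = LabelEncoder().fit_transform(districts['district'])

# one-hot encoding: each category becomes its own 0/1 column, with no implied order
district_onehot = pd.get_dummies(districts['district'], prefix='district')

print(districts)
print(district_onehot)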

Building living_condition feature

X_train_categorical = X_train_categorical.apply(LabelEncoder().fit_transform)
X_train_categorical
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 0 1 0 1 6 7 4 3 1 8 6 5 2 0 5 0
1 0 0 0 0 1 4 1 1 1 3 1 39 2 0 0 0
2 1 1 1 1 6 7 4 3 1 8 1 11 2 0 4 0
3 0 0 0 1 6 7 4 0 1 8 6 5 2 0 4 0
4 0 1 0 1 6 7 4 3 1 3 4 37 2 0 4 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 0 1 0 0 6 7 4 2 5 14 4 43 2 0 5 0
307507 0 0 0 1 6 3 4 5 1 8 1 57 2 0 5 0
307508 0 0 0 1 6 7 1 2 1 10 4 39 2 0 4 0
307509 0 0 0 1 6 1 4 1 1 8 6 3 2 0 5 0
307510 0 0 0 0 6 1 1 1 1 8 4 5 2 0 4 0

307511 rows × 16 columns

# Some of the features are categorical ('FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE')
# the rest are numerical
living_condition_categoricals = ['FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
# equivalently: [e for e in living_condition if e not in living_condition_categoricals]
living_condition_numericals = np.setdiff1d(living_condition, living_condition_categoricals)
X_train_numerical[living_condition_numericals]
APARTMENTS_AVG APARTMENTS_MEDI APARTMENTS_MODE BASEMENTAREA_AVG BASEMENTAREA_MEDI BASEMENTAREA_MODE COMMONAREA_AVG COMMONAREA_MEDI COMMONAREA_MODE ELEVATORS_AVG ... NONLIVINGAREA_AVG NONLIVINGAREA_MEDI NONLIVINGAREA_MODE TOTALAREA_MODE YEARS_BEGINEXPLUATATION_AVG YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_AVG YEARS_BUILD_MEDI YEARS_BUILD_MODE
0 0.02470 0.02500 0.025200 0.036900 0.036900 0.038300 0.014300 0.014400 0.014400 0.000000 ... 0.000000 0.000000 0.000000 0.014900 0.972200 0.972200 0.972200 0.619200 0.624300 0.634100
1 0.09590 0.09680 0.092400 0.052900 0.052900 0.053800 0.060500 0.060800 0.049700 0.080000 ... 0.009800 0.010000 0.000000 0.071400 0.985100 0.985100 0.985100 0.796000 0.798700 0.804000
2 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.102547 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637
3 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.102547 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637
4 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.102547 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 0.20210 0.20400 0.100800 0.088700 0.088700 0.017200 0.020200 0.020300 0.017200 0.220000 ... 0.109500 0.111800 0.012500 0.289800 0.987600 0.987600 0.978200 0.830000 0.832300 0.712500
307507 0.02470 0.02500 0.025200 0.043500 0.043500 0.045100 0.002200 0.002200 0.002200 0.000000 ... 0.000000 0.000000 0.000000 0.021400 0.972700 0.972700 0.972700 0.626000 0.631000 0.640600
307508 0.10310 0.10410 0.105000 0.086200 0.086200 0.089400 0.012300 0.012400 0.012400 0.000000 ... 0.000000 0.000000 0.000000 0.797000 0.981600 0.981600 0.981600 0.748400 0.751800 0.758300
307509 0.01240 0.01250 0.012600 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.008600 0.977100 0.977100 0.977200 0.752471 0.755746 0.759637
307510 0.07420 0.07490 0.075600 0.052600 0.052600 0.054600 0.017600 0.017700 0.017800 0.080000 ... 0.000000 0.000000 0.000000 0.071800 0.988100 0.988100 0.988100 0.752471 0.755746 0.759637

307511 rows × 43 columns

X_train_categorical[living_condition_categoricals]
FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 2 0 5 0
1 2 0 0 0
2 2 0 4 0
3 2 0 4 0
4 2 0 4 0
... ... ... ... ...
307506 2 0 5 0
307507 2 0 5 0
307508 2 0 4 0
307509 2 0 5 0
307510 2 0 4 0

307511 rows × 4 columns

X_train_living_condition = pd.concat([X_train_numerical[living_condition_numericals], X_train_categorical[living_condition_categoricals]],axis=1)
X_train_living_condition
APARTMENTS_AVG APARTMENTS_MEDI APARTMENTS_MODE BASEMENTAREA_AVG BASEMENTAREA_MEDI BASEMENTAREA_MODE COMMONAREA_AVG COMMONAREA_MEDI COMMONAREA_MODE ELEVATORS_AVG ... YEARS_BEGINEXPLUATATION_AVG YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_AVG YEARS_BUILD_MEDI YEARS_BUILD_MODE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 0.02470 0.02500 0.025200 0.036900 0.036900 0.038300 0.014300 0.014400 0.014400 0.000000 ... 0.972200 0.972200 0.972200 0.619200 0.624300 0.634100 2 0 5 0
1 0.09590 0.09680 0.092400 0.052900 0.052900 0.053800 0.060500 0.060800 0.049700 0.080000 ... 0.985100 0.985100 0.985100 0.796000 0.798700 0.804000 2 0 0 0
2 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
3 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
4 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 0.20210 0.20400 0.100800 0.088700 0.088700 0.017200 0.020200 0.020300 0.017200 0.220000 ... 0.987600 0.987600 0.978200 0.830000 0.832300 0.712500 2 0 5 0
307507 0.02470 0.02500 0.025200 0.043500 0.043500 0.045100 0.002200 0.002200 0.002200 0.000000 ... 0.972700 0.972700 0.972700 0.626000 0.631000 0.640600 2 0 5 0
307508 0.10310 0.10410 0.105000 0.086200 0.086200 0.089400 0.012300 0.012400 0.012400 0.000000 ... 0.981600 0.981600 0.981600 0.748400 0.751800 0.758300 2 0 4 0
307509 0.01240 0.01250 0.012600 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977100 0.977100 0.977200 0.752471 0.755746 0.759637 2 0 5 0
307510 0.07420 0.07490 0.075600 0.052600 0.052600 0.054600 0.017600 0.017700 0.017800 0.080000 ... 0.988100 0.988100 0.988100 0.752471 0.755746 0.759637 2 0 4 0

307511 rows × 47 columns

Since the dataset is quite large, I use only the first 10,000 observations. We then transform X_train_living_condition into a higher-dimensional space with the RBF (radial basis function) kernel, where the points can be separated better, and apply K-means to determine the clusters. The elbow plot shows that 3 clusters is optimal, which roughly means that clients' living conditions can be grouped into three clusters. We will create a new feature from those cluster labels.

X_train_living_condition = X_train_living_condition[:10000].copy()
X_train_living_condition
APARTMENTS_AVG APARTMENTS_MEDI APARTMENTS_MODE BASEMENTAREA_AVG BASEMENTAREA_MEDI BASEMENTAREA_MODE COMMONAREA_AVG COMMONAREA_MEDI COMMONAREA_MODE ELEVATORS_AVG ... YEARS_BEGINEXPLUATATION_AVG YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_AVG YEARS_BUILD_MEDI YEARS_BUILD_MODE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 0.02470 0.02500 0.025200 0.036900 0.036900 0.038300 0.014300 0.014400 0.014400 0.000000 ... 0.972200 0.972200 0.972200 0.619200 0.624300 0.634100 2 0 5 0
1 0.09590 0.09680 0.092400 0.052900 0.052900 0.053800 0.060500 0.060800 0.049700 0.080000 ... 0.985100 0.985100 0.985100 0.796000 0.798700 0.804000 2 0 0 0
2 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
3 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
4 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 0.01630 0.01670 0.016800 0.000000 0.000000 0.000000 0.044621 0.044595 0.042553 0.080000 ... 0.980600 0.980600 0.980600 0.632800 0.637700 0.647200 2 0 5 0
9996 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
9997 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
9998 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
9999 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0

10000 rows × 47 columns

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# assume df is your DataFrame
X = X_train_living_condition.values

# Kernel PCA transformation using RBF
kpca = KernelPCA(kernel="rbf")
X_kpca = kpca.fit_transform(X)

# finding the optimal number of clusters for KMeans after transformation
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X_kpca)
    distortions.append(sum(np.min(cdist(X_kpca, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X_kpca.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

# Apply KMeans with the optimal number of clusters (you can find it from the above plot)
kmeans = KMeans(n_clusters=3)  # here 3 is used as an example, replace it with your optimal number
kmeans.fit(X_kpca)

# get the cluster label for each sample (named cluster_labels so we don't
# shadow the `labels` DataFrame of column descriptions loaded earlier)
cluster_labels = kmeans.labels_

Elbow plot for the optimal number of clusters in living condition

X_train_living_condition['living_condition_cluster_label'] = cluster_labels
the_rest = []
for e in X_train.columns:
    if e not in living_condition:
        the_rest.append(e)
the_rest
X_train_the_rest = X_train[the_rest][:10000]
X_train_new = pd.concat([X_train_the_rest, X_train_living_condition], axis=1)

Now we can create more features based on this new clustering of living conditions. For example, we can take the mean, min, max, and sum of income within each living-condition cluster. The clusters differ in total income, but their mean incomes are fairly similar.

living_condition_cluster_label  count  income_mean  income_min  income_max  income_sum
0                               7300   164439       25650       1935000     1000000000
1                               454    188565       36000       810000      80000000
2                               2246   172962       33300       810000      300000000

We can see that the smallest cluster (label 1, with 454 of the 10,000 clients) has the highest mean income (about 189,000), even though its total income is the lowest (about 80 million). The largest cluster (label 0, with 7,300 clients) has the lowest mean income (about 164,000), even though its total income is the highest. The largest cluster also has the widest income range, from about 26,000 to 1,935,000. This is the kind of internal structure in the dataset that kernel PCA and K-means discover. We can then merge this living-situation feature into the original dataset.

np.bincount(cluster_labels)
X_train_living_situation = X_train_new.groupby('living_condition_cluster_label').agg({'AMT_INCOME_TOTAL': ['mean', 'min', 'max', 'sum'],
                                                        # Add more columns as needed
                                                        }).reset_index()
X_train_living_situation
X_train_living_situation.columns = ['living_condition_cluster_label', 'AMT_INCOME_TOTAL_mean','AMT_INCOME_TOTAL_min','AMT_INCOME_TOTAL_max','AMT_INCOME_TOTAL_sum' ] 
X_train_new = X_train_new.merge(X_train_living_situation, on='living_condition_cluster_label', how='left')

X_train_new

Adding financial_stress feature

There are multiple factors behind financial stress, but some of the main ones are: the credit-to-income ratio, current income, current credit line, the number of children, the number of family members, the spouse’s income, the spouse’s credit line, and bills. We can also weight those factors differently, since they affect the client differently.

original = pd.concat([X_train[['AMT_INCOME_TOTAL','AMT_CREDIT','CNT_FAM_MEMBERS','CNT_CHILDREN']],X_train_categorical['NAME_FAMILY_STATUS']],axis=1)
original
AMT_INCOME_TOTAL AMT_CREDIT CNT_FAM_MEMBERS CNT_CHILDREN NAME_FAMILY_STATUS
0 202500.0 406597.5 1.0 0 3
1 270000.0 1293502.5 2.0 0 1
2 67500.0 135000.0 1.0 0 3
3 135000.0 312682.5 2.0 0 0
4 121500.0 513000.0 1.0 0 3
... ... ... ... ... ...
307506 157500.0 254700.0 1.0 0 2
307507 72000.0 269550.0 1.0 0 5
307508 153000.0 677664.0 1.0 0 2
307509 171000.0 370107.0 2.0 0 1
307510 157500.0 675000.0 2.0 0 1

307511 rows × 5 columns

def compute_financial_stress(row):
    w_credit_income_ratio = 2
    w_family_size = 1
    w_family_status = 1 
    w_children = 1.5
    stress_score = (row['AMT_CREDIT'] / row['AMT_INCOME_TOTAL']) * w_credit_income_ratio + row['CNT_FAM_MEMBERS'] * w_family_size + row['NAME_FAMILY_STATUS'] * w_family_status + row['CNT_CHILDREN'] * w_children
    return stress_score

original['financial_stress'] = original.apply(compute_financial_stress, axis=1)
original
AMT_INCOME_TOTAL AMT_CREDIT CNT_FAM_MEMBERS CNT_CHILDREN NAME_FAMILY_STATUS financial_stress
0 202500.0 406597.5 1.0 0 3 8.015778
1 270000.0 1293502.5 2.0 0 1 12.581500
2 67500.0 135000.0 1.0 0 3 8.000000
3 135000.0 312682.5 2.0 0 0 6.632333
4 121500.0 513000.0 1.0 0 3 12.444444
... ... ... ... ... ... ...
307506 157500.0 254700.0 1.0 0 2 6.234286
307507 72000.0 269550.0 1.0 0 5 13.487500
307508 153000.0 677664.0 1.0 0 2 11.858353
307509 171000.0 370107.0 2.0 0 1 7.328737
307510 157500.0 675000.0 2.0 0 1 11.571429

307511 rows × 6 columns

Add previous application situation

The previous application situation is an aggregation of previous loans taken by the same person. It can roughly indicate the person’s credibility, along with other important information.

previous_application = pd.read_csv('home-credit-default-risk/previous_application.csv')
previous_application.head()
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

# Aggregate based on the SK_ID_CURR
prev_app_agg = previous_application.groupby('SK_ID_CURR').agg({'AMT_CREDIT': ['mean', 'max', 'min', 'sum'],
                                                    'AMT_ANNUITY': ['mean', 'max', 'min'],
                                                    'DAYS_DECISION': ['mean', 'max', 'min'],
                                                    'CNT_PAYMENT': ['mean', 'sum']})

# Handle multi-level column names
prev_app_agg.columns = ['_'.join(col).strip() for col in prev_app_agg.columns.values]

# Reset the index
prev_app_agg.reset_index(inplace=True)
prev_app_agg
SK_ID_CURR AMT_CREDIT_mean AMT_CREDIT_max AMT_CREDIT_min AMT_CREDIT_sum AMT_ANNUITY_mean AMT_ANNUITY_max AMT_ANNUITY_min DAYS_DECISION_mean DAYS_DECISION_max DAYS_DECISION_min CNT_PAYMENT_mean CNT_PAYMENT_sum
0 100001 23787.00 23787.0 23787.0 23787.0 3951.000000 3951.000 3951.000 -1740.000 -1740 -1740 8.00 8.0
1 100002 179055.00 179055.0 179055.0 179055.0 9251.775000 9251.775 9251.775 -606.000 -606 -606 24.00 24.0
2 100003 484191.00 1035882.0 68053.5 1452573.0 56553.990000 98356.995 6737.310 -1305.000 -746 -2341 10.00 30.0
3 100004 20106.00 20106.0 20106.0 20106.0 5357.250000 5357.250 5357.250 -815.000 -815 -815 4.00 4.0
4 100005 20076.75 40153.5 0.0 40153.5 4813.200000 4813.200 4813.200 -536.000 -315 -757 12.00 12.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
338852 456251 40455.00 40455.0 40455.0 40455.0 6605.910000 6605.910 6605.910 -273.000 -273 -273 8.00 8.0
338853 456252 56821.50 56821.5 56821.5 56821.5 10074.465000 10074.465 10074.465 -2497.000 -2497 -2497 6.00 6.0
338854 456253 20625.75 27306.0 13945.5 41251.5 4770.405000 5567.715 3973.095 -2380.000 -1909 -2851 5.00 10.0
338855 456254 134439.75 247423.5 21456.0 268879.5 10681.132500 19065.825 2296.440 -299.500 -277 -322 15.00 30.0
338856 456255 424431.00 1271929.5 45000.0 3395448.0 20775.391875 54022.140 2250.000 -587.625 -171 -991 21.75 174.0

338857 rows × 13 columns

X_train_new = X_train.merge(prev_app_agg, on='SK_ID_CURR', how='left')
X_train_new
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... AMT_CREDIT_min AMT_CREDIT_sum AMT_ANNUITY_mean AMT_ANNUITY_max AMT_ANNUITY_min DAYS_DECISION_mean DAYS_DECISION_max DAYS_DECISION_min CNT_PAYMENT_mean CNT_PAYMENT_sum
0 100002 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 ... 179055.0 179055.0 9251.775000 9251.775 9251.775 -606.000000 -606.0 -606.0 24.000000 24.0
1 100003 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 ... 68053.5 1452573.0 56553.990000 98356.995 6737.310 -1305.000000 -746.0 -2341.0 10.000000 30.0
2 100004 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 ... 20106.0 20106.0 5357.250000 5357.250 5357.250 -815.000000 -815.0 -815.0 4.000000 4.0
3 100006 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 ... 0.0 2625259.5 23651.175000 39954.510 2482.920 -272.444444 -181.0 -617.0 23.000000 138.0
4 100007 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 ... 14616.0 999832.5 12278.805000 22678.785 1834.290 -1222.833333 -374.0 -2357.0 20.666667 124.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 Cash loans M N N 0 157500.0 254700.0 27558.0 225000.0 ... 40455.0 40455.0 6605.910000 6605.910 6605.910 -273.000000 -273.0 -273.0 8.000000 8.0
307507 456252 Cash loans F N Y 0 72000.0 269550.0 12001.5 225000.0 ... 56821.5 56821.5 10074.465000 10074.465 10074.465 -2497.000000 -2497.0 -2497.0 6.000000 6.0
307508 456253 Cash loans F N Y 0 153000.0 677664.0 29979.0 585000.0 ... 13945.5 41251.5 4770.405000 5567.715 3973.095 -2380.000000 -1909.0 -2851.0 5.000000 10.0
307509 456254 Cash loans F N Y 0 171000.0 370107.0 20205.0 319500.0 ... 21456.0 268879.5 10681.132500 19065.825 2296.440 -299.500000 -277.0 -322.0 15.000000 30.0
307510 456255 Cash loans F N N 0 157500.0 675000.0 49117.5 675000.0 ... 45000.0 3395448.0 20775.391875 54022.140 2250.000 -587.625000 -171.0 -991.0 21.750000 174.0

307511 rows × 133 columns

Misc

A red-flag feature can be added based on credit ratings from external sources. As before, the external sources can be weighted differently according to their credibility. Then some polynomial features can be added to capture nonlinear relationships. Here we only demonstrate how to apply a polynomial transformation to one feature: the price of the goods the client applies for. Since it is not clear that the goods price behaves linearly with the default risk, we apply a quadratic transformation.

original = X_train_numerical[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].copy()  # copy so the column assignment below doesn't raise a warning
original
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3
0 0.083037 0.262949 0.139376
1 0.311267 0.622246 0.510853
2 0.502130 0.555912 0.729567
3 0.502130 0.650442 0.510853
4 0.502130 0.322738 0.510853
... ... ... ...
307506 0.145570 0.681632 0.510853
307507 0.502130 0.115992 0.510853
307508 0.744026 0.535722 0.218859
307509 0.502130 0.514163 0.661024
307510 0.734460 0.708569 0.113922

307511 rows × 3 columns

def compute_red_flag(row):
    w_source_1 = 2
    w_source_2 = 1
    w_source_3 = 1 
    red_flag = row['EXT_SOURCE_1'] * w_source_1 + row['EXT_SOURCE_2'] * w_source_2 + row['EXT_SOURCE_3'] * w_source_3
    return red_flag

original['red_flag'] = original.apply(compute_red_flag, axis=1)
original
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 red_flag
0 0.083037 0.262949 0.139376 0.568398
1 0.311267 0.622246 0.510853 1.755633
2 0.502130 0.555912 0.729567 2.289738
3 0.502130 0.650442 0.510853 2.165554
4 0.502130 0.322738 0.510853 1.837851
... ... ... ... ...
307506 0.145570 0.681632 0.510853 1.483626
307507 0.502130 0.115992 0.510853 1.631105
307508 0.744026 0.535722 0.218859 2.242634
307509 0.502130 0.514163 0.661024 2.179446
307510 0.734460 0.708569 0.113922 2.291411

307511 rows × 4 columns

# Quadratic transformation
original = X_train_numerical['AMT_GOODS_PRICE'][:10000]

from sklearn.preprocessing import PolynomialFeatures
# reshape to a (n_samples, 1) column and generate [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
goods_price_poly = poly.fit_transform(original.values.reshape(-1, 1))
original
0        351000.0
1       1129500.0
2        135000.0
3        297000.0
4        513000.0
          ...    
9995     270000.0
9996     900000.0
9997     450000.0
9998     315000.0
9999     270000.0
Name: AMT_GOODS_PRICE, Length: 10000, dtype: float64

Conclusion

So far we have studied some forms of feature engineering in the wild, and the examples use non-trivial methods to engineer the features. They reflect the domain knowledge a financial institution needs for due diligence when extending credit lines or cash to consumers, as well as the knowledge a data scientist needs when processing a large dataset. To recap: there are many input features describing the building the client lives in; these are an indirect reflection of their living condition, so we created a new variable based on them. Then we saw that a financial stress score can be computed from the client’s situation. That indicator is helpful since we are wondering whether the client can pay back the loan. For example, the credit-to-income ratio can tell how much the credit line would weigh the person’s income down, especially if they have children. We also saw that the dataset lacks information on the spouse’s situation (their income and credit), so one logical step is to go back into the field and collect that information to make our prediction more sensible. The third feature we engineered summarizes the client’s previous loan applications: the past can say a lot about the present and the future of the same client, so we aggregate those statistics into the current model’s inputs.

Together, these examples provide an overview of how to do feature engineering on a dataset, which is a crucial process and can affect the model’s performance directly. Translated into real-world business, it can help the institution make better decisions in aiding people in need.