
Introduction

Intuitively, feature engineering is the process of getting to know the data intimately so that we can handcraft new features that represent the dataset better and improve the model's predictions.

Some common methods are:

  • Binning/bucketing: For example, in a dataset about home credit default rates, when collecting clients’ ages it can make better sense to divide the range into categories: less than 20 years old, 20 to 30, 30 to 40, 40 to 50, and above 50. The reasoning behind this division is that clients younger than 20 are not allowed to take a loan, and clients above 50 can be grouped together since the most popular ages for taking loans are from 20 to 50. We then divide the range from 20 to 50 equally. This unequal division of ages into buckets actually makes better sense and generalizes the age groups better.

  • Polynomial features: We can square a feature, for example, when we assume it has a nonlinear relationship with the target.

  • Log transform: Apply a logarithm to variables with long-tailed distributions so that the transformed feature is closer to a normal distribution.

  • Feature interaction: This is a way to combine different features, assuming they have relationships among themselves. For example, we can combine a client’s family-related features (with a simple linear combination or a more complicated equation). The new feature would represent an overview of the client’s family status.

  • Categorical feature handling: Since we usually need to transform categorical features into numerical ones, there are several ways to do it, such as one-hot encoding (encode the value as a vector of 1s and 0s, with a 1 marking the category it belongs to) or label encoding (encode each category as a different number).

  • Date-time variables: If we have data on dates and times, we can add a lagged variable (the value of the feature at some point in the past) or calculate the interval between two dates (for example, the age of the house or car of the client who requests a loan).

  • Scaling the features: Since features are different in nature, they naturally use different units and scales. That can make the model less accurate, since the model doesn’t really grasp the differences in scale. We can bring all features onto one scale so that, in a way, the machine understands the dataset a bit better. The two most popular methods are min-max scaling and standardization. In min-max scaling, we rescale each feature to a fixed range, for example from 0 to 1; this is also called normalization. In standardization, we subtract the mean from each value and divide by the standard deviation of the sample. A small sketch of binning, the log transform, and scaling follows this list.
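
To make a few of these ideas concrete, here is a minimal sketch on a small made-up frame; the column names, values, and bucket edges are purely illustrative and are not taken from the Home Credit data.

import numpy as np
import pandas as pd

# toy data, invented for illustration only
df = pd.DataFrame({'age': [19, 24, 33, 47, 58],
                   'income': [12000, 30000, 45000, 250000, 52000]})

# binning/bucketing: the unequal age buckets described above
df['age_group'] = pd.cut(df['age'], bins=[0, 20, 30, 40, 50, np.inf],
                         labels=['<20', '20-30', '30-40', '40-50', '50+'])

# log transform: compress a long-tailed income distribution
df['log_income'] = np.log1p(df['income'])

# min-max scaling (normalization) to the [0, 1] range
df['income_minmax'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# standardization: subtract the mean, divide by the standard deviation
df['income_std'] = (df['income'] - df['income'].mean()) / df['income'].std()

print(df)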

Code example

The notes above are general advice. In reality, the feature engineering process depends on the nature of the dataset itself (its dimensions, its purpose, the underlying patterns). Today we explore the dataset for the Home Credit default risk competition. Looking into the features, we can see that there are about 50 features about the building the client lives in. We can combine those features into a new one named “living_condition” using machine learning techniques such as kernel PCA and the K-means algorithm. Then we can add a financial_stress variable as a weighted combination of common factors such as the credit-to-income ratio, current income, number of children, family size, spouse situation, and other bills. Thirdly, we add a statistical aggregation of previous loan applications to bring the client’s credibility and creditworthiness into the dataset. Finally, we can consider some other features, such as a red flag that takes into account credit evaluations from external sources, and their nonlinear relationship with the target.

Preprocessing

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
np.set_printoptions(suppress=True)

labels = pd.read_csv('home-credit-default-risk/HomeCredit_columns_description.csv',encoding='ISO-8859-1')
data = pd.read_csv('home-credit-default-risk/application_train.csv')
labels.head()
Unnamed: 0 Table Row Description Special
0 1 application_{train|test}.csv SK_ID_CURR ID of loan in our sample NaN
1 2 application_{train|test}.csv TARGET Target variable (1 - client with payment diffi... NaN
2 5 application_{train|test}.csv NAME_CONTRACT_TYPE Identification if loan is cash or revolving NaN
3 6 application_{train|test}.csv CODE_GENDER Gender of the client NaN
4 7 application_{train|test}.csv FLAG_OWN_CAR Flag if the client owns a car NaN
# First, take all the names of the features related to the building
living_condition = labels['Row'][44:91]
living_condition
44                  APARTMENTS_AVG
45                BASEMENTAREA_AVG
46     YEARS_BEGINEXPLUATATION_AVG
47                 YEARS_BUILD_AVG
48                  COMMONAREA_AVG
49                   ELEVATORS_AVG
50                   ENTRANCES_AVG
51                   FLOORSMAX_AVG
52                   FLOORSMIN_AVG
53                    LANDAREA_AVG
54            LIVINGAPARTMENTS_AVG
55                  LIVINGAREA_AVG
56         NONLIVINGAPARTMENTS_AVG
57               NONLIVINGAREA_AVG
58                 APARTMENTS_MODE
59               BASEMENTAREA_MODE
60    YEARS_BEGINEXPLUATATION_MODE
61                YEARS_BUILD_MODE
62                 COMMONAREA_MODE
63                  ELEVATORS_MODE
64                  ENTRANCES_MODE
65                  FLOORSMAX_MODE
66                  FLOORSMIN_MODE
67                   LANDAREA_MODE
68           LIVINGAPARTMENTS_MODE
69                 LIVINGAREA_MODE
70        NONLIVINGAPARTMENTS_MODE
71              NONLIVINGAREA_MODE
72                 APARTMENTS_MEDI
73               BASEMENTAREA_MEDI
74    YEARS_BEGINEXPLUATATION_MEDI
75                YEARS_BUILD_MEDI
76                 COMMONAREA_MEDI
77                  ELEVATORS_MEDI
78                  ENTRANCES_MEDI
79                  FLOORSMAX_MEDI
80                  FLOORSMIN_MEDI
81                   LANDAREA_MEDI
82           LIVINGAPARTMENTS_MEDI
83                 LIVINGAREA_MEDI
84        NONLIVINGAPARTMENTS_MEDI
85              NONLIVINGAREA_MEDI
86              FONDKAPREMONT_MODE
87                  HOUSETYPE_MODE
88                  TOTALAREA_MODE
89              WALLSMATERIAL_MODE
90             EMERGENCYSTATE_MODE
Name: Row, dtype: object
# Now preprocess the data a bit
data.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

y_train = data['TARGET']
X_train = data.drop(['TARGET'], axis=1)
y_train = y_train.to_frame()
y_train
TARGET
0 1
1 0
2 0
3 0
4 0
... ...
307506 0
307507 0
307508 0
307509 1
307510 0

307511 rows × 1 columns

# Let's handle categorical / numerical variables and missing values

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']

X_train_categorical = X_train.select_dtypes(include=categoricals)
X_train_numerical = X_train.select_dtypes(include=numerics)

categorical_columns = X_train_categorical.columns
numerical_columns = X_train_numerical.columns

categorical_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
numerical_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# imputer = imputer.fit(X_train)
  
X_train_categorical = categorical_imputer.fit_transform(X_train_categorical)
X_train_categorical = pd.DataFrame(data=X_train_categorical, columns=categorical_columns)

X_train_numerical = numerical_imputer.fit_transform(X_train_numerical)
X_train_numerical = pd.DataFrame(data=X_train_numerical, columns=numerical_columns)

The thing about using a label encoder instead of a one-hot encoder is that the label encoder carries an inherent assumption that the values are hierarchically meaningful. This might or might not reflect the qualitative meaning of the values in reality. For example, suppose we categorize houses into 3 districts: district 1, district 2, district 3, and encode them as the numbers 0, 1, and 2. Since 2 > 1, this might suggest that district 2 is better than district 1, which may not reflect the real situation, where there is no inherent difference between the two geographical locations (they are both equally distant from the center, for example). We might take this inherent bias into account and create a new variable (via clustering or via distance to the center) to compensate for it in the model. The same goes for the days of the week: the meaning of Monday through Sunday is not inherently linear. We can hope that the model has enough data to learn this representation. One-hot encoding, on the other hand, assumes the categories are all equal: it puts a 1 for that category and 0s for the others in the representation vector. For example, a house in district 1 can be represented as [1,0,0].
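
To make the contrast concrete, here is a minimal sketch on a hypothetical district column (the values are invented and not part of the Home Credit data):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

districts = pd.DataFrame({'district': ['district 1', 'district 2', 'district 3', 'district 1']})

# label encoding: each category becomes an integer, which implies an ordering
districts['district_label'] = LabelEncoder().fit_transform(districts['district'])

# one-hot encoding: each category becomes its own 0/1 column, with no implied order
district_onehot = pd.get_dummies(districts['district'], prefix='district')

print(districts)
print(district_onehot)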

Building living_condition feature

X_train_categorical = X_train_categorical.apply(LabelEncoder().fit_transform)
X_train_categorical
NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE OCCUPATION_TYPE WEEKDAY_APPR_PROCESS_START ORGANIZATION_TYPE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 0 1 0 1 6 7 4 3 1 8 6 5 2 0 5 0
1 0 0 0 0 1 4 1 1 1 3 1 39 2 0 0 0
2 1 1 1 1 6 7 4 3 1 8 1 11 2 0 4 0
3 0 0 0 1 6 7 4 0 1 8 6 5 2 0 4 0
4 0 1 0 1 6 7 4 3 1 3 4 37 2 0 4 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 0 1 0 0 6 7 4 2 5 14 4 43 2 0 5 0
307507 0 0 0 1 6 3 4 5 1 8 1 57 2 0 5 0
307508 0 0 0 1 6 7 1 2 1 10 4 39 2 0 4 0
307509 0 0 0 1 6 1 4 1 1 8 6 3 2 0 5 0
307510 0 0 0 0 6 1 1 1 1 8 4 5 2 0 4 0

307511 rows × 16 columns

# Some of the features are categorical ('FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE')
# the rest are numerical
living_condition_categoricals = ['FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
# equivalently: [e for e in living_condition if e not in living_condition_categoricals]
living_condition_numericals = np.setdiff1d(living_condition, living_condition_categoricals)
X_train_numerical[living_condition_numericals]
APARTMENTS_AVG APARTMENTS_MEDI APARTMENTS_MODE BASEMENTAREA_AVG BASEMENTAREA_MEDI BASEMENTAREA_MODE COMMONAREA_AVG COMMONAREA_MEDI COMMONAREA_MODE ELEVATORS_AVG ... NONLIVINGAREA_AVG NONLIVINGAREA_MEDI NONLIVINGAREA_MODE TOTALAREA_MODE YEARS_BEGINEXPLUATATION_AVG YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_AVG YEARS_BUILD_MEDI YEARS_BUILD_MODE
0 0.02470 0.02500 0.025200 0.036900 0.036900 0.038300 0.014300 0.014400 0.014400 0.000000 ... 0.000000 0.000000 0.000000 0.014900 0.972200 0.972200 0.972200 0.619200 0.624300 0.634100
1 0.09590 0.09680 0.092400 0.052900 0.052900 0.053800 0.060500 0.060800 0.049700 0.080000 ... 0.009800 0.010000 0.000000 0.071400 0.985100 0.985100 0.985100 0.796000 0.798700 0.804000
2 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.102547 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637
3 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.102547 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637
4 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.102547 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 0.20210 0.20400 0.100800 0.088700 0.088700 0.017200 0.020200 0.020300 0.017200 0.220000 ... 0.109500 0.111800 0.012500 0.289800 0.987600 0.987600 0.978200 0.830000 0.832300 0.712500
307507 0.02470 0.02500 0.025200 0.043500 0.043500 0.045100 0.002200 0.002200 0.002200 0.000000 ... 0.000000 0.000000 0.000000 0.021400 0.972700 0.972700 0.972700 0.626000 0.631000 0.640600
307508 0.10310 0.10410 0.105000 0.086200 0.086200 0.089400 0.012300 0.012400 0.012400 0.000000 ... 0.000000 0.000000 0.000000 0.797000 0.981600 0.981600 0.981600 0.748400 0.751800 0.758300
307509 0.01240 0.01250 0.012600 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.028358 0.028236 0.027022 0.008600 0.977100 0.977100 0.977200 0.752471 0.755746 0.759637
307510 0.07420 0.07490 0.075600 0.052600 0.052600 0.054600 0.017600 0.017700 0.017800 0.080000 ... 0.000000 0.000000 0.000000 0.071800 0.988100 0.988100 0.988100 0.752471 0.755746 0.759637

307511 rows × 43 columns

X_train_categorical[living_condition_categoricals]
FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 2 0 5 0
1 2 0 0 0
2 2 0 4 0
3 2 0 4 0
4 2 0 4 0
... ... ... ... ...
307506 2 0 5 0
307507 2 0 5 0
307508 2 0 4 0
307509 2 0 5 0
307510 2 0 4 0

307511 rows × 4 columns

X_train_living_condition = pd.concat([X_train_numerical[living_condition_numericals], X_train_categorical[living_condition_categoricals]],axis=1)
X_train_living_condition
APARTMENTS_AVG APARTMENTS_MEDI APARTMENTS_MODE BASEMENTAREA_AVG BASEMENTAREA_MEDI BASEMENTAREA_MODE COMMONAREA_AVG COMMONAREA_MEDI COMMONAREA_MODE ELEVATORS_AVG ... YEARS_BEGINEXPLUATATION_AVG YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_AVG YEARS_BUILD_MEDI YEARS_BUILD_MODE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 0.02470 0.02500 0.025200 0.036900 0.036900 0.038300 0.014300 0.014400 0.014400 0.000000 ... 0.972200 0.972200 0.972200 0.619200 0.624300 0.634100 2 0 5 0
1 0.09590 0.09680 0.092400 0.052900 0.052900 0.053800 0.060500 0.060800 0.049700 0.080000 ... 0.985100 0.985100 0.985100 0.796000 0.798700 0.804000 2 0 0 0
2 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
3 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
4 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 0.20210 0.20400 0.100800 0.088700 0.088700 0.017200 0.020200 0.020300 0.017200 0.220000 ... 0.987600 0.987600 0.978200 0.830000 0.832300 0.712500 2 0 5 0
307507 0.02470 0.02500 0.025200 0.043500 0.043500 0.045100 0.002200 0.002200 0.002200 0.000000 ... 0.972700 0.972700 0.972700 0.626000 0.631000 0.640600 2 0 5 0
307508 0.10310 0.10410 0.105000 0.086200 0.086200 0.089400 0.012300 0.012400 0.012400 0.000000 ... 0.981600 0.981600 0.981600 0.748400 0.751800 0.758300 2 0 4 0
307509 0.01240 0.01250 0.012600 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977100 0.977100 0.977200 0.752471 0.755746 0.759637 2 0 5 0
307510 0.07420 0.07490 0.075600 0.052600 0.052600 0.054600 0.017600 0.017700 0.017800 0.080000 ... 0.988100 0.988100 0.988100 0.752471 0.755746 0.759637 2 0 4 0

307511 rows × 47 columns

Since the dataset is quite large, I use only the first 10,000 observations. We then transform X_train_living_condition into a higher-dimensional space with the RBF (radial basis function) kernel, where the points can be separated better, and apply K-means to determine the clusters. The elbow plot shows that 3 clusters is optimal, which roughly means that clients' living conditions can be grouped into three clusters. We will create a new feature from those cluster labels.

X_train_living_condition = X_train_living_condition[:10000].copy()
X_train_living_condition
APARTMENTS_AVG APARTMENTS_MEDI APARTMENTS_MODE BASEMENTAREA_AVG BASEMENTAREA_MEDI BASEMENTAREA_MODE COMMONAREA_AVG COMMONAREA_MEDI COMMONAREA_MODE ELEVATORS_AVG ... YEARS_BEGINEXPLUATATION_AVG YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_AVG YEARS_BUILD_MEDI YEARS_BUILD_MODE FONDKAPREMONT_MODE HOUSETYPE_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE
0 0.02470 0.02500 0.025200 0.036900 0.036900 0.038300 0.014300 0.014400 0.014400 0.000000 ... 0.972200 0.972200 0.972200 0.619200 0.624300 0.634100 2 0 5 0
1 0.09590 0.09680 0.092400 0.052900 0.052900 0.053800 0.060500 0.060800 0.049700 0.080000 ... 0.985100 0.985100 0.985100 0.796000 0.798700 0.804000 2 0 0 0
2 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
3 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
4 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 0.01630 0.01670 0.016800 0.000000 0.000000 0.000000 0.044621 0.044595 0.042553 0.080000 ... 0.980600 0.980600 0.980600 0.632800 0.637700 0.647200 2 0 5 0
9996 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
9997 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
9998 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0
9999 0.11744 0.11785 0.114231 0.088442 0.087955 0.087543 0.044621 0.044595 0.042553 0.078942 ... 0.977735 0.977752 0.977065 0.752471 0.755746 0.759637 2 0 4 0

10000 rows × 47 columns

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# assume df is your DataFrame
X = X_train_living_condition.values

# Kernel PCA transformation using RBF
kpca = KernelPCA(kernel="rbf")
X_kpca = kpca.fit_transform(X)

# finding the optimal number of clusters for KMeans after transformation
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X_kpca)
    distortions.append(sum(np.min(cdist(X_kpca, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X_kpca.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

# Apply KMeans with the optimal number of clusters (you can find it from the above plot)
kmeans = KMeans(n_clusters=3)  # here 3 is used as an example, replace it with your optimal number
kmeans.fit(X_kpca)

# get the cluster label for each sample (named cluster_labels so we don't
# shadow the `labels` DataFrame of column descriptions loaded earlier)
cluster_labels = kmeans.labels_

Elbow plot for the optimal number of clusters in living condition

X_train_living_condition['living_condition_cluster_label'] = cluster_labels
the_rest = []
for e in X_train.columns:
    if e not in living_condition:
        the_rest.append(e)
the_rest
X_train_the_rest = X_train[the_rest][:10000]
X_train_new = pd.concat([X_train_the_rest, X_train_living_condition], axis=1)

Now we can create more features based on this new clustering of living conditions. For example, we can take the mean, min, max, and sum of income within each living-condition cluster. The clusters differ in total income, but their mean incomes are fairly similar.

living_condition_cluster_label  count  income_mean  income_min  income_max  income_sum
0                               7300   164439       25650       1935000     1000000000
1                               454    188565       36000       810000      80000000
2                               2246   172962       33300       810000      300000000

We can see that the smallest cluster (label 1, with 454 of the 10,000 clients) has the highest mean income (about 189,000), even though its total income is the lowest (about 80 million). The largest cluster (label 0, with 7,300 clients) has the lowest mean income (about 164,000), even though its total income is the highest. The largest cluster also has the widest income range, from about 26,000 to 1,935,000. This is the kind of internal structure in the dataset that kernel PCA and K-means discover. We can then merge this living-situation feature into the original dataset.

np.bincount(cluster_labels)
X_train_living_situation = X_train_new.groupby('living_condition_cluster_label').agg({'AMT_INCOME_TOTAL': ['mean', 'min', 'max', 'sum'],
                                                        # Add more columns as needed
                                                        }).reset_index()
X_train_living_situation
X_train_living_situation.columns = ['living_condition_cluster_label', 'AMT_INCOME_TOTAL_mean','AMT_INCOME_TOTAL_min','AMT_INCOME_TOTAL_max','AMT_INCOME_TOTAL_sum' ] 
X_train_new = X_train_new.merge(X_train_living_situation, on='living_condition_cluster_label', how='left')

X_train_new

Adding financial_stress feature

There are multiple factors behind financial stress, but some of the main ones are: the credit-to-income ratio, current income, current credit line, the number of children, the number of family members, the spouse’s income, the spouse’s credit line, and bills. We can also weight those factors differently, since they affect the client differently.

original = pd.concat([X_train[['AMT_INCOME_TOTAL','AMT_CREDIT','CNT_FAM_MEMBERS','CNT_CHILDREN']],X_train_categorical['NAME_FAMILY_STATUS']],axis=1)
original
AMT_INCOME_TOTAL AMT_CREDIT CNT_FAM_MEMBERS CNT_CHILDREN NAME_FAMILY_STATUS
0 202500.0 406597.5 1.0 0 3
1 270000.0 1293502.5 2.0 0 1
2 67500.0 135000.0 1.0 0 3
3 135000.0 312682.5 2.0 0 0
4 121500.0 513000.0 1.0 0 3
... ... ... ... ... ...
307506 157500.0 254700.0 1.0 0 2
307507 72000.0 269550.0 1.0 0 5
307508 153000.0 677664.0 1.0 0 2
307509 171000.0 370107.0 2.0 0 1
307510 157500.0 675000.0 2.0 0 1

307511 rows × 5 columns

def compute_financial_stress(row):
    w_credit_income_ratio = 2
    w_family_size = 1
    w_family_status = 1 
    w_children = 1.5
    stress_score = (row['AMT_CREDIT'] / row['AMT_INCOME_TOTAL']) * w_credit_income_ratio + row['CNT_FAM_MEMBERS'] * w_family_size + row['NAME_FAMILY_STATUS'] * w_family_status + row['CNT_CHILDREN'] * w_children
    return stress_score

original['financial_stress'] = original.apply(compute_financial_stress, axis=1)
original
AMT_INCOME_TOTAL AMT_CREDIT CNT_FAM_MEMBERS CNT_CHILDREN NAME_FAMILY_STATUS financial_stress
0 202500.0 406597.5 1.0 0 3 8.015778
1 270000.0 1293502.5 2.0 0 1 12.581500
2 67500.0 135000.0 1.0 0 3 8.000000
3 135000.0 312682.5 2.0 0 0 6.632333
4 121500.0 513000.0 1.0 0 3 12.444444
... ... ... ... ... ... ...
307506 157500.0 254700.0 1.0 0 2 6.234286
307507 72000.0 269550.0 1.0 0 5 13.487500
307508 153000.0 677664.0 1.0 0 2 11.858353
307509 171000.0 370107.0 2.0 0 1 7.328737
307510 157500.0 675000.0 2.0 0 1 11.571429

307511 rows × 6 columns

Add previous application situation

The previous application situation is an aggregation of previous loans taken by the same person. It can roughly indicate the person’s credibility, along with other important information.

previous_application = pd.read_csv('home-credit-default-risk/previous_application.csv')
previous_application.head()
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

# Aggregate based on the SK_ID_CURR
prev_app_agg = previous_application.groupby('SK_ID_CURR').agg({'AMT_CREDIT': ['mean', 'max', 'min', 'sum'],
                                                    'AMT_ANNUITY': ['mean', 'max', 'min'],
                                                    'DAYS_DECISION': ['mean', 'max', 'min'],
                                                    'CNT_PAYMENT': ['mean', 'sum']})

# Handle multi-level column names
prev_app_agg.columns = ['_'.join(col).strip() for col in prev_app_agg.columns.values]

# Reset the index
prev_app_agg.reset_index(inplace=True)
prev_app_agg
SK_ID_CURR AMT_CREDIT_mean AMT_CREDIT_max AMT_CREDIT_min AMT_CREDIT_sum AMT_ANNUITY_mean AMT_ANNUITY_max AMT_ANNUITY_min DAYS_DECISION_mean DAYS_DECISION_max DAYS_DECISION_min CNT_PAYMENT_mean CNT_PAYMENT_sum
0 100001 23787.00 23787.0 23787.0 23787.0 3951.000000 3951.000 3951.000 -1740.000 -1740 -1740 8.00 8.0
1 100002 179055.00 179055.0 179055.0 179055.0 9251.775000 9251.775 9251.775 -606.000 -606 -606 24.00 24.0
2 100003 484191.00 1035882.0 68053.5 1452573.0 56553.990000 98356.995 6737.310 -1305.000 -746 -2341 10.00 30.0
3 100004 20106.00 20106.0 20106.0 20106.0 5357.250000 5357.250 5357.250 -815.000 -815 -815 4.00 4.0
4 100005 20076.75 40153.5 0.0 40153.5 4813.200000 4813.200 4813.200 -536.000 -315 -757 12.00 12.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
338852 456251 40455.00 40455.0 40455.0 40455.0 6605.910000 6605.910 6605.910 -273.000 -273 -273 8.00 8.0
338853 456252 56821.50 56821.5 56821.5 56821.5 10074.465000 10074.465 10074.465 -2497.000 -2497 -2497 6.00 6.0
338854 456253 20625.75 27306.0 13945.5 41251.5 4770.405000 5567.715 3973.095 -2380.000 -1909 -2851 5.00 10.0
338855 456254 134439.75 247423.5 21456.0 268879.5 10681.132500 19065.825 2296.440 -299.500 -277 -322 15.00 30.0
338856 456255 424431.00 1271929.5 45000.0 3395448.0 20775.391875 54022.140 2250.000 -587.625 -171 -991 21.75 174.0

338857 rows × 13 columns

X_train_new = X_train.merge(prev_app_agg, on='SK_ID_CURR', how='left')
X_train_new
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... AMT_CREDIT_min AMT_CREDIT_sum AMT_ANNUITY_mean AMT_ANNUITY_max AMT_ANNUITY_min DAYS_DECISION_mean DAYS_DECISION_max DAYS_DECISION_min CNT_PAYMENT_mean CNT_PAYMENT_sum
0 100002 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 ... 179055.0 179055.0 9251.775000 9251.775 9251.775 -606.000000 -606.0 -606.0 24.000000 24.0
1 100003 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 ... 68053.5 1452573.0 56553.990000 98356.995 6737.310 -1305.000000 -746.0 -2341.0 10.000000 30.0
2 100004 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 ... 20106.0 20106.0 5357.250000 5357.250 5357.250 -815.000000 -815.0 -815.0 4.000000 4.0
3 100006 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 ... 0.0 2625259.5 23651.175000 39954.510 2482.920 -272.444444 -181.0 -617.0 23.000000 138.0
4 100007 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 ... 14616.0 999832.5 12278.805000 22678.785 1834.290 -1222.833333 -374.0 -2357.0 20.666667 124.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
307506 456251 Cash loans M N N 0 157500.0 254700.0 27558.0 225000.0 ... 40455.0 40455.0 6605.910000 6605.910 6605.910 -273.000000 -273.0 -273.0 8.000000 8.0
307507 456252 Cash loans F N Y 0 72000.0 269550.0 12001.5 225000.0 ... 56821.5 56821.5 10074.465000 10074.465 10074.465 -2497.000000 -2497.0 -2497.0 6.000000 6.0
307508 456253 Cash loans F N Y 0 153000.0 677664.0 29979.0 585000.0 ... 13945.5 41251.5 4770.405000 5567.715 3973.095 -2380.000000 -1909.0 -2851.0 5.000000 10.0
307509 456254 Cash loans F N Y 0 171000.0 370107.0 20205.0 319500.0 ... 21456.0 268879.5 10681.132500 19065.825 2296.440 -299.500000 -277.0 -322.0 15.000000 30.0
307510 456255 Cash loans F N N 0 157500.0 675000.0 49117.5 675000.0 ... 45000.0 3395448.0 20775.391875 54022.140 2250.000 -587.625000 -171.0 -991.0 21.750000 174.0

307511 rows × 133 columns

Misc

A red-flag feature can be added based on credit ratings from external sources. As before, the external sources can be weighted differently according to their credibility. Then some polynomial features can be added to capture nonlinear relationships. Here we only demonstrate how to apply a polynomial transformation to one feature: the price of the goods the client applies for. Since it is not clear that the goods price behaves linearly with the default risk, we apply a quadratic transformation.

original = X_train_numerical[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].copy()  # copy so the column assignment below doesn't raise a warning
original
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3
0 0.083037 0.262949 0.139376
1 0.311267 0.622246 0.510853
2 0.502130 0.555912 0.729567
3 0.502130 0.650442 0.510853
4 0.502130 0.322738 0.510853
... ... ... ...
307506 0.145570 0.681632 0.510853
307507 0.502130 0.115992 0.510853
307508 0.744026 0.535722 0.218859
307509 0.502130 0.514163 0.661024
307510 0.734460 0.708569 0.113922

307511 rows × 3 columns

def compute_red_flag(row):
    w_source_1 = 2
    w_source_2 = 1
    w_source_3 = 1 
    red_flag = row['EXT_SOURCE_1'] * w_source_1 + row['EXT_SOURCE_2'] * w_source_2 + row['EXT_SOURCE_3'] * w_source_3
    return red_flag

original['red_flag'] = original.apply(compute_red_flag, axis=1)
original
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 red_flag
0 0.083037 0.262949 0.139376 0.568398
1 0.311267 0.622246 0.510853 1.755633
2 0.502130 0.555912 0.729567 2.289738
3 0.502130 0.650442 0.510853 2.165554
4 0.502130 0.322738 0.510853 1.837851
... ... ... ... ...
307506 0.145570 0.681632 0.510853 1.483626
307507 0.502130 0.115992 0.510853 1.631105
307508 0.744026 0.535722 0.218859 2.242634
307509 0.502130 0.514163 0.661024 2.179446
307510 0.734460 0.708569 0.113922 2.291411

307511 rows × 4 columns

# Quadratic transformation
original = X_train_numerical['AMT_GOODS_PRICE'][:10000]

from sklearn.preprocessing import PolynomialFeatures
# reshape to a (n_samples, 1) column and generate [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
goods_price_poly = poly.fit_transform(original.values.reshape(-1, 1))
original
0        351000.0
1       1129500.0
2        135000.0
3        297000.0
4        513000.0
          ...    
9995     270000.0
9996     900000.0
9997     450000.0
9998     315000.0
9999     270000.0
Name: AMT_GOODS_PRICE, Length: 10000, dtype: float64

Conclusion

So far we have studied some forms of feature engineering in the wild, and the examples use non-trivial methods to engineer the features. They reflect the domain knowledge a financial institution needs for due diligence when extending credit lines or cash to consumers, as well as the knowledge a data scientist needs when processing a large dataset. To recap: there are many input features describing the building the client lives in; these are an indirect reflection of their living condition, so we created a new variable based on them. Then we saw that a financial stress score can be computed from the client’s situation. That indicator is helpful since we are wondering whether the client can pay back the loan. For example, the credit-to-income ratio can tell how much the credit line would weigh the person’s income down, especially if they have children. We also saw that the dataset lacks information on the spouse’s situation (their income and credit), so one logical step is to go back into the field and collect that information to make our prediction more sensible. The third feature we engineered summarizes the client’s previous loan applications: the past can say a lot about the present and the future of the same client, so we aggregate those statistics into the current model’s inputs.

Together, these examples provide an overview of how to do feature engineering on a dataset, which is a crucial process and can affect the model’s performance directly. Translated into real-world business, it can help the institution make better decisions in aiding people in need.