Introduction

Feature selection is a technique in machine learning that only choose a subset of all the available features to construct a model. We know that not all features contributes to the prediction equally. For example, to predict a house price, the size of house can be much more important than the address of the house. So feature selection can help to reduce the number of variables, that would reduce computation cost and training time and make the model more parsimonious. The more important thing is that those variables are qualitatively different, so if there is an algorithm or procedure to select only the most contributing attributes, it would be better. This can combat overfitting and help with the ultimate goal of generalization, since the model sees the underlying pattern of the data. That would also possible make the model perform better.

Feature selection is a part of feature engineering (create new variables that make better sense). And it is part of the process of understanding your data qualitatively. So make sure you run multiple feature selecting algorithms then use logics and domain knowledge to evaluate each feature thoroughly before deciding to drop or to add features. Another thing to note is that some feature selection algorithms consider features individually. They cannot take into account feature interactions. That might be the broader task of feature engineering (to combine and make meaningful variables from the feature set).

Automatically, there are ways to do feature selection:

Filter method: This is to score the relevance of each feature to the target variable, making use of statistical measures such as Chi-Squared test, information gain, correlation coefficients.
Wrapper method: Why not use a model to predict the importance of the features? The search can be an exhaustive algorithm, or heuristics such as forward selection, backward elemination, recursive feature eliminiation, genetic algorithm.
Embedded method: The model to select features is embeded into the final model construction process, using models such as LASSO, elastic net, decision tree, random forest. This method can be more efficient than the above.

Filter method

Chi squared test

This statistical test is to evaluate the likelihood of correlation between variables using their distribution. Intuitively, we start with the simple assumption that all variables appear equally in frequency, i.e. they are independent. If the observed frequency is different from the expected frequency, they are actually dependent.

Let’s say we have 2 categorical variables X and Y. Attribute X has two values X1 and X2, attribute Y has two values Y1 and Y2 likewise. To see whether X and Y are dependent, we aggregate their dependent frequency:

	X1	X2	Total
Y1	120	80	200
Y2	130	60	190
Total	250	140	390

We interpret the above table as follows: there are 100 observations that has both attributes X1 and Y1. Then we define the hypothesis H0 to be no association between X and Y, H1 to be there is association between X and Y.

In step 2, we calculate the expected frequency for each cell: \(Expected Frequency = \frac{Row total * Column total}{Total}\). For example, the expected observations in the first cell is \((X1,Y1)=\frac{250*200}{390} = 128\). So we expect to see 128 observations that have the combination (X1,Y1). Here is the expected frequencies:

	X1	X2	Total
Y1	128	71	200
Y2	121	68	190
Total	250	140	390

Then we calculate the difference between the expected and the observed frequencies: \(\frac{(observed - expected)^2}{expected}\)

	X1	X2	Total
Y1	0.5	1.1	200
Y2	0.7	0.9	190
Total	250	140	390

The test statistics chi squared would be the sum of all those differences: 0.5 + 1.1 + 0.7 + 0.9 = 3.2. Now we calculate the degree of freedom of this table: \((rows - 1) * (columns - 1) = (2-1) * (2-1)=1\)

With 1 degree of freedom, choose alpha to be 0.05, check this table:

We have the critical value to be 3.841 which is bigger than our chi squared statistics for this dataset. We cannot reject the null hypothesis that there is no link between X and Y. Which we can carefully conclude that there is no link between X and Y.

Usually we use the Chi squared test for category variables. For example, movie genres can have one of these values: action, comedy, thriller, children. However, we can apply the Chi squared test for numerical values too, but we need to bin the data first. Binning a continuous variable is also called bucketting: we divide the range of the variable into smaller intervals making up that range. For example, we divide [0,10] into 3 buckets: [0,3], (3,6], and (6,10]. This gives us something like three categories to work with. The only thing is that when you do this, consider that we lose some information grouping data into buckets. Also, the Chi square test is for frequencies so they are in nature all non negative values. If you have negative values in your dataset, a way to remedy is to turn them into zero.

Code example

In this example, we would look into the home credit default risk dataset, which we can use to predict the unability to pay back the loan. In this case, we would only demonstrate the chi squared test to select the most relevant features for the target variable. There are 1166 features in which 16 are categorical variables. We fill in the missing values with most frequent for categorical variables and mean for numerical variables. We also turn negative values into 0. We would then select the top 8 categorical variables then we select the top 800 numerical variables. The binning of numerical variables would be done automatically by sklearn.

import pandas as pd
import numpy as np
np.set_printoptions(suppress=True)

train = pd.read_csv('home-credit-manual-engineered-features/train_previous_raw.csv', nrows = 1000)
test = pd.read_csv('home-credit-manual-engineered-features/test_previous_raw.csv', nrows = 1000)

train.head()

	SK_ID_CURR	TARGET	NAME_CONTRACT_TYPE	CODE_GENDER	FLAG_OWN_CAR	FLAG_OWN_REALTY	AMT_INCOME_TOTAL	AMT_CREDIT	AMT_ANNUITY	...	client_installments_AMT_PAYMENT_min_sum	client_installments_AMT_PAYMENT_sum_mean	client_installments_AMT_PAYMENT_sum_max	client_installments_AMT_PAYMENT_sum_min	client_installments_AMT_PAYMENT_sum_sum	client_installments_counts_mean	client_installments_counts_max	client_installments_counts_min	client_installments_counts_sum	client_counts
0	100002	1	Cash loans	M	N	Y	202500.0	406597.5	24700.5	...	175783.725	219625.695000	219625.695	219625.695	4.172888e+06	19.000000	19.0	19.0	361.0	19.0
1	100003	0	Cash loans	F	N	N	270000.0	1293502.5	35698.5	...	1154108.295	453952.220400	1150977.330	80773.380	1.134881e+07	9.160000	12.0	6.0	229.0	25.0
2	100004	0	Revolving loans	M	Y	Y	67500.0	135000.0	6750.0	...	16071.750	21288.465000	21288.465	21288.465	6.386539e+04	3.000000	3.0	3.0	9.0	3.0
3	100006	0	Cash loans	F	N	Y	135000.0	312682.5	29686.5	...	994476.690	232499.719688	691786.890	25091.325	3.719996e+06	7.875000	10.0	1.0	126.0	16.0
4	100007	0	Cash loans	M	N	Y	121500.0	513000.0	21865.5	...	483756.390	172669.901591	280199.700	18330.390	1.139621e+07	13.606061	17.0	10.0	898.0	66.0

5 rows × 1166 columns

y_train = train['TARGET']
train = train.drop(['TARGET'], axis=1)
y_train = y_train.to_frame()
y_train

	TARGET
0	1
1	0
2	0
3	0
4	0
...	...
995	0
996	0
997	0
998	0
999	0

1000 rows × 1 columns

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=categoricals)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X_train = X_train.apply(LabelEncoder().fit_transform)

chi2_selector = SelectKBest(chi2, k=3) # Select top 5 features
X_train_kbest = chi2_selector.fit_transform(X_train, y_train)

# To get the selected feature names
selected_features = [feature for feature, mask in zip(X_train.columns, chi2_selector.get_support()) if mask]
print(selected_features)

['client_credit_SK_ID_CURR_sum_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_sum']

X_train = train.select_dtypes(include=numerics)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)
X_train[X_train < 0] = 0

chi2_selector = SelectKBest(chi2, k=800) # Select top 5 features
X_train_kbest = chi2_selector.fit_transform(X_train, y_train)

# To get the selected feature names
selected_features = [feature for feature, mask in zip(X_train.columns, chi2_selector.get_support()) if mask]
print(selected_features)

['SK_ID_CURR', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_EMPLOYED', 'OWN_CAR_AGE', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'COMMONAREA_AVG', 'LANDAREA_AVG', 'NONLIVINGAREA_AVG', 'COMMONAREA_MODE', 'LANDAREA_MODE', 'NONLIVINGAREA_MODE', 'COMMONAREA_MEDI', 'LANDAREA_MEDI', 'NONLIVINGAREA_MEDI', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_18', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'previous_loans_NAME_CONTRACT_TYPE_Cash loans_count', 'previous_loans_NAME_CONTRACT_TYPE_Cash loans_count_norm', 'previous_loans_NAME_CONTRACT_TYPE_Consumer loans_count', 'previous_loans_NAME_CONTRACT_TYPE_XNA_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_FRIDAY_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_FRIDAY_count_norm', 'previous_loans_WEEKDAY_APPR_PROCESS_START_MONDAY_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_SATURDAY_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_SUNDAY_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_SUNDAY_count_norm', 'previous_loans_WEEKDAY_APPR_PROCESS_START_THURSDAY_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_THURSDAY_count_norm', 'previous_loans_WEEKDAY_APPR_PROCESS_START_TUESDAY_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_TUESDAY_count_norm', 'previous_loans_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_count', 'previous_loans_WEEKDAY_APPR_PROCESS_START_WEDNESDAY_count_norm', 'previous_loans_FLAG_LAST_APPL_PER_CONTRACT_N_count', 'previous_loans_FLAG_LAST_APPL_PER_CONTRACT_N_count_norm', 'previous_loans_FLAG_LAST_APPL_PER_CONTRACT_Y_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Building a house or an annex_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a holiday home / land_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a new car_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a new car_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a used car_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Buying a used car_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Car repairs_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Car repairs_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Everyday expenses_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Everyday expenses_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Medicine_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Other_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Payments on other loans_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Payments on other loans_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase of electronic equipment_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase of electronic equipment_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Repairs_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Urgent needs_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_Wedding / gift / holiday_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_XAP_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_XAP_count_norm', 'previous_loans_NAME_CASH_LOAN_PURPOSE_XNA_count', 'previous_loans_NAME_CASH_LOAN_PURPOSE_XNA_count_norm', 'previous_loans_NAME_CONTRACT_STATUS_Approved_count', 'previous_loans_NAME_CONTRACT_STATUS_Approved_count_norm', 'previous_loans_NAME_CONTRACT_STATUS_Canceled_count', 'previous_loans_NAME_CONTRACT_STATUS_Canceled_count_norm', 'previous_loans_NAME_CONTRACT_STATUS_Refused_count', 'previous_loans_NAME_CONTRACT_STATUS_Refused_count_norm', 'previous_loans_NAME_CONTRACT_STATUS_Unused offer_count', 'previous_loans_NAME_CONTRACT_STATUS_Unused offer_count_norm', 'previous_loans_NAME_PAYMENT_TYPE_Cash through the bank_count', 'previous_loans_NAME_PAYMENT_TYPE_Cashless from the account of the employer_count', 'previous_loans_NAME_PAYMENT_TYPE_Non-cash from your account_count', 'previous_loans_NAME_PAYMENT_TYPE_Non-cash from your account_count_norm', 'previous_loans_NAME_PAYMENT_TYPE_XNA_count', 'previous_loans_CODE_REJECT_REASON_CLIENT_count', 'previous_loans_CODE_REJECT_REASON_CLIENT_count_norm', 'previous_loans_CODE_REJECT_REASON_HC_count', 'previous_loans_CODE_REJECT_REASON_HC_count_norm', 'previous_loans_CODE_REJECT_REASON_LIMIT_count_norm', 'previous_loans_CODE_REJECT_REASON_SCO_count', 'previous_loans_CODE_REJECT_REASON_SCO_count_norm', 'previous_loans_CODE_REJECT_REASON_SCOFR_count', 'previous_loans_CODE_REJECT_REASON_SCOFR_count_norm', 'previous_loans_CODE_REJECT_REASON_SYSTEM_count', 'previous_loans_CODE_REJECT_REASON_VERIF_count', 'previous_loans_CODE_REJECT_REASON_VERIF_count_norm', 'previous_loans_CODE_REJECT_REASON_XAP_count', 'previous_loans_CODE_REJECT_REASON_XAP_count_norm', 'previous_loans_CODE_REJECT_REASON_XNA_count', 'previous_loans_NAME_TYPE_SUITE_Children_count', 'previous_loans_NAME_TYPE_SUITE_Children_count_norm', 'previous_loans_NAME_TYPE_SUITE_Group of people_count', 'previous_loans_NAME_TYPE_SUITE_Group of people_count_norm', 'previous_loans_NAME_TYPE_SUITE_Other_A_count', 'previous_loans_NAME_TYPE_SUITE_Other_A_count_norm', 'previous_loans_NAME_TYPE_SUITE_Other_B_count', 'previous_loans_NAME_TYPE_SUITE_Other_B_count_norm', 'previous_loans_NAME_TYPE_SUITE_Spouse, partner_count', 'previous_loans_NAME_TYPE_SUITE_Spouse, partner_count_norm', 'previous_loans_NAME_TYPE_SUITE_Unaccompanied_count', 'previous_loans_NAME_TYPE_SUITE_Unaccompanied_count_norm', 'previous_loans_NAME_CLIENT_TYPE_New_count', 'previous_loans_NAME_CLIENT_TYPE_New_count_norm', 'previous_loans_NAME_CLIENT_TYPE_Repeater_count', 'previous_loans_NAME_CLIENT_TYPE_Repeater_count_norm', 'previous_loans_NAME_CLIENT_TYPE_XNA_count', 'previous_loans_NAME_GOODS_CATEGORY_Audio/Video_count', 'previous_loans_NAME_GOODS_CATEGORY_Auto Accessories_count', 'previous_loans_NAME_GOODS_CATEGORY_Clothing and Accessories_count', 'previous_loans_NAME_GOODS_CATEGORY_Clothing and Accessories_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Computers_count', 'previous_loans_NAME_GOODS_CATEGORY_Computers_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Construction Materials_count', 'previous_loans_NAME_GOODS_CATEGORY_Construction Materials_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Consumer Electronics_count', 'previous_loans_NAME_GOODS_CATEGORY_Consumer Electronics_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Furniture_count', 'previous_loans_NAME_GOODS_CATEGORY_Furniture_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Gardening_count', 'previous_loans_NAME_GOODS_CATEGORY_Homewares_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Jewelry_count', 'previous_loans_NAME_GOODS_CATEGORY_Jewelry_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Medical Supplies_count', 'previous_loans_NAME_GOODS_CATEGORY_Medical Supplies_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Medicine_count', 'previous_loans_NAME_GOODS_CATEGORY_Medicine_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Mobile_count', 'previous_loans_NAME_GOODS_CATEGORY_Mobile_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Office Appliances_count', 'previous_loans_NAME_GOODS_CATEGORY_Office Appliances_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Other_count', 'previous_loans_NAME_GOODS_CATEGORY_Other_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Photo / Cinema Equipment_count', 'previous_loans_NAME_GOODS_CATEGORY_Sport and Leisure_count', 'previous_loans_NAME_GOODS_CATEGORY_Tourism_count', 'previous_loans_NAME_GOODS_CATEGORY_Vehicles_count', 'previous_loans_NAME_GOODS_CATEGORY_Vehicles_count_norm', 'previous_loans_NAME_GOODS_CATEGORY_Weapon_count', 'previous_loans_NAME_PORTFOLIO_Cards_count', 'previous_loans_NAME_PORTFOLIO_Cars_count', 'previous_loans_NAME_PORTFOLIO_Cars_count_norm', 'previous_loans_NAME_PORTFOLIO_Cash_count', 'previous_loans_NAME_PORTFOLIO_POS_count', 'previous_loans_NAME_PORTFOLIO_XNA_count', 'previous_loans_NAME_PRODUCT_TYPE_XNA_count', 'previous_loans_NAME_PRODUCT_TYPE_walk-in_count', 'previous_loans_NAME_PRODUCT_TYPE_walk-in_count_norm', 'previous_loans_NAME_PRODUCT_TYPE_x-sell_count', 'previous_loans_NAME_PRODUCT_TYPE_x-sell_count_norm', 'previous_loans_CHANNEL_TYPE_AP+ (Cash loan)_count', 'previous_loans_CHANNEL_TYPE_AP+ (Cash loan)_count_norm', 'previous_loans_CHANNEL_TYPE_Car dealer_count', 'previous_loans_CHANNEL_TYPE_Car dealer_count_norm', 'previous_loans_CHANNEL_TYPE_Channel of corporate sales_count', 'previous_loans_CHANNEL_TYPE_Channel of corporate sales_count_norm', 'previous_loans_CHANNEL_TYPE_Contact center_count', 'previous_loans_CHANNEL_TYPE_Contact center_count_norm', 'previous_loans_CHANNEL_TYPE_Country-wide_count', 'previous_loans_CHANNEL_TYPE_Credit and cash offices_count', 'previous_loans_CHANNEL_TYPE_Credit and cash offices_count_norm', 'previous_loans_CHANNEL_TYPE_Regional / Local_count', 'previous_loans_CHANNEL_TYPE_Regional / Local_count_norm', 'previous_loans_CHANNEL_TYPE_Stone_count', 'previous_loans_NAME_SELLER_INDUSTRY_Auto technology_count', 'previous_loans_NAME_SELLER_INDUSTRY_Auto technology_count_norm', 'previous_loans_NAME_SELLER_INDUSTRY_Clothing_count', 'previous_loans_NAME_SELLER_INDUSTRY_Clothing_count_norm', 'previous_loans_NAME_SELLER_INDUSTRY_Connectivity_count_norm', 'previous_loans_NAME_SELLER_INDUSTRY_Consumer electronics_count', 'previous_loans_NAME_SELLER_INDUSTRY_Consumer electronics_count_norm', 'previous_loans_NAME_SELLER_INDUSTRY_Furniture_count', 'previous_loans_NAME_SELLER_INDUSTRY_Furniture_count_norm', 'previous_loans_NAME_SELLER_INDUSTRY_Industry_count', 'previous_loans_NAME_SELLER_INDUSTRY_Industry_count_norm', 'previous_loans_NAME_SELLER_INDUSTRY_Jewelry_count', 'previous_loans_NAME_SELLER_INDUSTRY_Jewelry_count_norm', 'previous_loans_NAME_SELLER_INDUSTRY_Tourism_count', 'previous_loans_NAME_SELLER_INDUSTRY_XNA_count', 'previous_loans_NAME_YIELD_GROUP_XNA_count', 'previous_loans_NAME_YIELD_GROUP_high_count', 'previous_loans_NAME_YIELD_GROUP_high_count_norm', 'previous_loans_NAME_YIELD_GROUP_low_action_count', 'previous_loans_NAME_YIELD_GROUP_low_action_count_norm', 'previous_loans_NAME_YIELD_GROUP_low_normal_count', 'previous_loans_NAME_YIELD_GROUP_middle_count', 'previous_loans_NAME_YIELD_GROUP_middle_count_norm', 'previous_loans_PRODUCT_COMBINATION_Card Street_count', 'previous_loans_PRODUCT_COMBINATION_Card Street_count_norm', 'previous_loans_PRODUCT_COMBINATION_Card X-Sell_count', 'previous_loans_PRODUCT_COMBINATION_Card X-Sell_count_norm', 'previous_loans_PRODUCT_COMBINATION_Cash_count', 'previous_loans_PRODUCT_COMBINATION_Cash_count_norm', 'previous_loans_PRODUCT_COMBINATION_Cash Street: high_count_norm', 'previous_loans_PRODUCT_COMBINATION_Cash Street: low_count', 'previous_loans_PRODUCT_COMBINATION_Cash Street: low_count_norm', 'previous_loans_PRODUCT_COMBINATION_Cash Street: middle_count', 'previous_loans_PRODUCT_COMBINATION_Cash X-Sell: high_count', 'previous_loans_PRODUCT_COMBINATION_Cash X-Sell: low_count', 'previous_loans_PRODUCT_COMBINATION_Cash X-Sell: low_count_norm', 'previous_loans_PRODUCT_COMBINATION_Cash X-Sell: middle_count', 'previous_loans_PRODUCT_COMBINATION_Cash X-Sell: middle_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS household with interest_count', 'previous_loans_PRODUCT_COMBINATION_POS household with interest_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS household without interest_count', 'previous_loans_PRODUCT_COMBINATION_POS household without interest_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS industry with interest_count', 'previous_loans_PRODUCT_COMBINATION_POS industry with interest_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS industry without interest_count', 'previous_loans_PRODUCT_COMBINATION_POS industry without interest_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS mobile with interest_count', 'previous_loans_PRODUCT_COMBINATION_POS mobile with interest_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS mobile without interest_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS other with interest_count', 'previous_loans_PRODUCT_COMBINATION_POS other with interest_count_norm', 'previous_loans_PRODUCT_COMBINATION_POS others without interest_count', 'previous_loans_PRODUCT_COMBINATION_POS others without interest_count_norm', 'previous_loans_AMT_ANNUITY_mean', 'previous_loans_AMT_ANNUITY_max', 'previous_loans_AMT_ANNUITY_min', 'previous_loans_AMT_ANNUITY_sum', 'previous_loans_AMT_APPLICATION_mean', 'previous_loans_AMT_APPLICATION_max', 'previous_loans_AMT_APPLICATION_min', 'previous_loans_AMT_APPLICATION_sum', 'previous_loans_AMT_CREDIT_mean', 'previous_loans_AMT_CREDIT_max', 'previous_loans_AMT_CREDIT_min', 'previous_loans_AMT_CREDIT_sum', 'previous_loans_AMT_DOWN_PAYMENT_mean', 'previous_loans_AMT_DOWN_PAYMENT_max', 'previous_loans_AMT_DOWN_PAYMENT_min', 'previous_loans_AMT_DOWN_PAYMENT_sum', 'previous_loans_AMT_GOODS_PRICE_mean', 'previous_loans_AMT_GOODS_PRICE_max', 'previous_loans_AMT_GOODS_PRICE_min', 'previous_loans_AMT_GOODS_PRICE_sum', 'previous_loans_HOUR_APPR_PROCESS_START_mean', 'previous_loans_HOUR_APPR_PROCESS_START_max', 'previous_loans_HOUR_APPR_PROCESS_START_min', 'previous_loans_HOUR_APPR_PROCESS_START_sum', 'previous_loans_NFLAG_LAST_APPL_IN_DAY_sum', 'previous_loans_RATE_DOWN_PAYMENT_mean', 'previous_loans_RATE_DOWN_PAYMENT_max', 'previous_loans_RATE_DOWN_PAYMENT_sum', 'previous_loans_SELLERPLACE_AREA_mean', 'previous_loans_SELLERPLACE_AREA_max', 'previous_loans_SELLERPLACE_AREA_min', 'previous_loans_SELLERPLACE_AREA_sum', 'previous_loans_CNT_PAYMENT_mean', 'previous_loans_CNT_PAYMENT_max', 'previous_loans_CNT_PAYMENT_min', 'previous_loans_CNT_PAYMENT_sum', 'previous_loans_DAYS_FIRST_DRAWING_mean', 'previous_loans_DAYS_FIRST_DRAWING_max', 'previous_loans_DAYS_FIRST_DRAWING_min', 'previous_loans_DAYS_FIRST_DRAWING_sum', 'previous_loans_DAYS_FIRST_DUE_mean', 'previous_loans_DAYS_FIRST_DUE_max', 'previous_loans_DAYS_FIRST_DUE_min', 'previous_loans_DAYS_FIRST_DUE_sum', 'previous_loans_DAYS_LAST_DUE_1ST_VERSION_mean', 'previous_loans_DAYS_LAST_DUE_1ST_VERSION_max', 'previous_loans_DAYS_LAST_DUE_1ST_VERSION_min', 'previous_loans_DAYS_LAST_DUE_1ST_VERSION_sum', 'previous_loans_DAYS_LAST_DUE_mean', 'previous_loans_DAYS_LAST_DUE_max', 'previous_loans_DAYS_LAST_DUE_min', 'previous_loans_DAYS_LAST_DUE_sum', 'previous_loans_DAYS_TERMINATION_mean', 'previous_loans_DAYS_TERMINATION_max', 'previous_loans_DAYS_TERMINATION_min', 'previous_loans_DAYS_TERMINATION_sum', 'previous_loans_NFLAG_INSURED_ON_APPROVAL_mean', 'previous_loans_NFLAG_INSURED_ON_APPROVAL_min', 'previous_loans_NFLAG_INSURED_ON_APPROVAL_sum', 'previous_loans_counts', 'client_cash_NAME_CONTRACT_STATUS_Active_count_mean', 'client_cash_NAME_CONTRACT_STATUS_Active_count_sum', 'client_cash_NAME_CONTRACT_STATUS_Active_count_norm_sum', 'client_cash_NAME_CONTRACT_STATUS_Approved_count_min', 'client_cash_NAME_CONTRACT_STATUS_Approved_count_sum', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_mean', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_max', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_min', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_sum', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_norm_mean', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_norm_max', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_norm_min', 'client_cash_NAME_CONTRACT_STATUS_Completed_count_norm_sum', 'client_cash_NAME_CONTRACT_STATUS_Returned to the store_count_mean', 'client_cash_NAME_CONTRACT_STATUS_Returned to the store_count_min', 'client_cash_NAME_CONTRACT_STATUS_Returned to the store_count_sum', 'client_cash_NAME_CONTRACT_STATUS_Returned to the store_count_norm_max', 'client_cash_NAME_CONTRACT_STATUS_Signed_count_mean', 'client_cash_NAME_CONTRACT_STATUS_Signed_count_max', 'client_cash_NAME_CONTRACT_STATUS_Signed_count_min', 'client_cash_NAME_CONTRACT_STATUS_Signed_count_sum', 'client_cash_NAME_CONTRACT_STATUS_Signed_count_norm_min', 'client_cash_SK_ID_CURR_mean_sum', 'client_cash_SK_ID_CURR_max_sum', 'client_cash_SK_ID_CURR_min_sum', 'client_cash_SK_ID_CURR_sum_mean', 'client_cash_SK_ID_CURR_sum_max', 'client_cash_SK_ID_CURR_sum_min', 'client_cash_SK_ID_CURR_sum_sum', 'client_cash_CNT_INSTALMENT_mean_mean', 'client_cash_CNT_INSTALMENT_mean_max', 'client_cash_CNT_INSTALMENT_mean_min', 'client_cash_CNT_INSTALMENT_mean_sum', 'client_cash_CNT_INSTALMENT_max_mean', 'client_cash_CNT_INSTALMENT_max_max', 'client_cash_CNT_INSTALMENT_max_min', 'client_cash_CNT_INSTALMENT_max_sum', 'client_cash_CNT_INSTALMENT_min_mean', 'client_cash_CNT_INSTALMENT_min_min', 'client_cash_CNT_INSTALMENT_min_sum', 'client_cash_CNT_INSTALMENT_sum_mean', 'client_cash_CNT_INSTALMENT_sum_max', 'client_cash_CNT_INSTALMENT_sum_min', 'client_cash_CNT_INSTALMENT_sum_sum', 'client_cash_CNT_INSTALMENT_FUTURE_mean_mean', 'client_cash_CNT_INSTALMENT_FUTURE_mean_max', 'client_cash_CNT_INSTALMENT_FUTURE_mean_min', 'client_cash_CNT_INSTALMENT_FUTURE_mean_sum', 'client_cash_CNT_INSTALMENT_FUTURE_max_mean', 'client_cash_CNT_INSTALMENT_FUTURE_max_max', 'client_cash_CNT_INSTALMENT_FUTURE_max_min', 'client_cash_CNT_INSTALMENT_FUTURE_max_sum', 'client_cash_CNT_INSTALMENT_FUTURE_min_max', 'client_cash_CNT_INSTALMENT_FUTURE_min_sum', 'client_cash_CNT_INSTALMENT_FUTURE_sum_mean', 'client_cash_CNT_INSTALMENT_FUTURE_sum_max', 'client_cash_CNT_INSTALMENT_FUTURE_sum_min', 'client_cash_CNT_INSTALMENT_FUTURE_sum_sum', 'client_cash_SK_DPD_mean_mean', 'client_cash_SK_DPD_mean_max', 'client_cash_SK_DPD_mean_min', 'client_cash_SK_DPD_mean_sum', 'client_cash_SK_DPD_max_mean', 'client_cash_SK_DPD_max_max', 'client_cash_SK_DPD_max_min', 'client_cash_SK_DPD_max_sum', 'client_cash_SK_DPD_min_mean', 'client_cash_SK_DPD_min_max', 'client_cash_SK_DPD_min_sum', 'client_cash_SK_DPD_sum_mean', 'client_cash_SK_DPD_sum_max', 'client_cash_SK_DPD_sum_min', 'client_cash_SK_DPD_sum_sum', 'client_cash_SK_DPD_DEF_mean_mean', 'client_cash_SK_DPD_DEF_mean_max', 'client_cash_SK_DPD_DEF_mean_min', 'client_cash_SK_DPD_DEF_mean_sum', 'client_cash_SK_DPD_DEF_max_mean', 'client_cash_SK_DPD_DEF_max_max', 'client_cash_SK_DPD_DEF_max_min', 'client_cash_SK_DPD_DEF_max_sum', 'client_cash_SK_DPD_DEF_sum_mean', 'client_cash_SK_DPD_DEF_sum_max', 'client_cash_SK_DPD_DEF_sum_min', 'client_cash_SK_DPD_DEF_sum_sum', 'client_cash_counts_mean', 'client_cash_counts_min', 'client_cash_counts_sum', 'client_counts_x', 'client_credit_NAME_CONTRACT_STATUS_Active_count_mean', 'client_credit_NAME_CONTRACT_STATUS_Active_count_max', 'client_credit_NAME_CONTRACT_STATUS_Active_count_min', 'client_credit_NAME_CONTRACT_STATUS_Active_count_sum', 'client_credit_NAME_CONTRACT_STATUS_Active_count_norm_sum', 'client_credit_NAME_CONTRACT_STATUS_Completed_count_mean', 'client_credit_NAME_CONTRACT_STATUS_Completed_count_max', 'client_credit_NAME_CONTRACT_STATUS_Completed_count_min', 'client_credit_NAME_CONTRACT_STATUS_Completed_count_sum', 'client_credit_NAME_CONTRACT_STATUS_Completed_count_norm_sum', 'client_credit_NAME_CONTRACT_STATUS_Sent proposal_count_mean', 'client_credit_NAME_CONTRACT_STATUS_Sent proposal_count_max', 'client_credit_NAME_CONTRACT_STATUS_Sent proposal_count_min', 'client_credit_NAME_CONTRACT_STATUS_Sent proposal_count_sum', 'client_credit_NAME_CONTRACT_STATUS_Sent proposal_count_norm_sum', 'client_credit_NAME_CONTRACT_STATUS_Signed_count_mean', 'client_credit_NAME_CONTRACT_STATUS_Signed_count_max', 'client_credit_NAME_CONTRACT_STATUS_Signed_count_min', 'client_credit_NAME_CONTRACT_STATUS_Signed_count_sum', 'client_credit_NAME_CONTRACT_STATUS_Signed_count_norm_sum', 'client_credit_SK_ID_CURR_mean_mean', 'client_credit_SK_ID_CURR_mean_max', 'client_credit_SK_ID_CURR_mean_min', 'client_credit_SK_ID_CURR_mean_sum', 'client_credit_SK_ID_CURR_max_mean', 'client_credit_SK_ID_CURR_max_max', 'client_credit_SK_ID_CURR_max_min', 'client_credit_SK_ID_CURR_max_sum', 'client_credit_SK_ID_CURR_min_mean', 'client_credit_SK_ID_CURR_min_max', 'client_credit_SK_ID_CURR_min_min', 'client_credit_SK_ID_CURR_min_sum', 'client_credit_SK_ID_CURR_sum_mean', 'client_credit_SK_ID_CURR_sum_max', 'client_credit_SK_ID_CURR_sum_min', 'client_credit_SK_ID_CURR_sum_sum', 'client_credit_AMT_BALANCE_mean_mean', 'client_credit_AMT_BALANCE_mean_max', 'client_credit_AMT_BALANCE_mean_min', 'client_credit_AMT_BALANCE_mean_sum', 'client_credit_AMT_BALANCE_max_mean', 'client_credit_AMT_BALANCE_max_max', 'client_credit_AMT_BALANCE_max_min', 'client_credit_AMT_BALANCE_max_sum', 'client_credit_AMT_BALANCE_min_mean', 'client_credit_AMT_BALANCE_min_max', 'client_credit_AMT_BALANCE_min_min', 'client_credit_AMT_BALANCE_min_sum', 'client_credit_AMT_BALANCE_sum_mean', 'client_credit_AMT_BALANCE_sum_max', 'client_credit_AMT_BALANCE_sum_min', 'client_credit_AMT_BALANCE_sum_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_mean_mean', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_mean_max', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_mean_min', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_mean_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_max_mean', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_max_max', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_max_min', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_max_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_min_mean', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_min_max', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_min_min', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_min_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_mean', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_max', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_min', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_sum', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_mean_mean', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_mean_max', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_mean_min', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_mean_sum', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_max_mean', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_max_max', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_max_min', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_max_sum', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_min_mean', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_min_max', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_min_min', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_min_sum', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_sum_mean', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_sum_max', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_sum_min', 'client_credit_AMT_DRAWINGS_ATM_CURRENT_sum_sum', 'client_credit_AMT_DRAWINGS_CURRENT_mean_mean', 'client_credit_AMT_DRAWINGS_CURRENT_mean_max', 'client_credit_AMT_DRAWINGS_CURRENT_mean_min', 'client_credit_AMT_DRAWINGS_CURRENT_mean_sum', 'client_credit_AMT_DRAWINGS_CURRENT_max_mean', 'client_credit_AMT_DRAWINGS_CURRENT_max_max', 'client_credit_AMT_DRAWINGS_CURRENT_max_min', 'client_credit_AMT_DRAWINGS_CURRENT_max_sum', 'client_credit_AMT_DRAWINGS_CURRENT_min_mean', 'client_credit_AMT_DRAWINGS_CURRENT_min_max', 'client_credit_AMT_DRAWINGS_CURRENT_min_min', 'client_credit_AMT_DRAWINGS_CURRENT_min_sum', 'client_credit_AMT_DRAWINGS_CURRENT_sum_mean', 'client_credit_AMT_DRAWINGS_CURRENT_sum_max', 'client_credit_AMT_DRAWINGS_CURRENT_sum_min', 'client_credit_AMT_DRAWINGS_CURRENT_sum_sum', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_mean_mean', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_mean_max', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_mean_min', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_mean_sum', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_max_mean', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_max_max', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_max_min', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_max_sum', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_sum_mean', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_sum_max', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_sum_min', 'client_credit_AMT_DRAWINGS_OTHER_CURRENT_sum_sum', 'client_credit_AMT_DRAWINGS_POS_CURRENT_mean_mean', 'client_credit_AMT_DRAWINGS_POS_CURRENT_mean_max', 'client_credit_AMT_DRAWINGS_POS_CURRENT_mean_min', 'client_credit_AMT_DRAWINGS_POS_CURRENT_mean_sum', 'client_credit_AMT_DRAWINGS_POS_CURRENT_max_mean', 'client_credit_AMT_DRAWINGS_POS_CURRENT_max_max', 'client_credit_AMT_DRAWINGS_POS_CURRENT_max_min', 'client_credit_AMT_DRAWINGS_POS_CURRENT_max_sum', 'client_credit_AMT_DRAWINGS_POS_CURRENT_min_mean', 'client_credit_AMT_DRAWINGS_POS_CURRENT_min_max', 'client_credit_AMT_DRAWINGS_POS_CURRENT_min_min', 'client_credit_AMT_DRAWINGS_POS_CURRENT_min_sum', 'client_credit_AMT_DRAWINGS_POS_CURRENT_sum_mean', 'client_credit_AMT_DRAWINGS_POS_CURRENT_sum_max', 'client_credit_AMT_DRAWINGS_POS_CURRENT_sum_min', 'client_credit_AMT_DRAWINGS_POS_CURRENT_sum_sum', 'client_credit_AMT_INST_MIN_REGULARITY_mean_mean', 'client_credit_AMT_INST_MIN_REGULARITY_mean_max', 'client_credit_AMT_INST_MIN_REGULARITY_mean_min', 'client_credit_AMT_INST_MIN_REGULARITY_mean_sum', 'client_credit_AMT_INST_MIN_REGULARITY_max_mean', 'client_credit_AMT_INST_MIN_REGULARITY_max_max', 'client_credit_AMT_INST_MIN_REGULARITY_max_min', 'client_credit_AMT_INST_MIN_REGULARITY_max_sum', 'client_credit_AMT_INST_MIN_REGULARITY_min_mean', 'client_credit_AMT_INST_MIN_REGULARITY_min_max', 'client_credit_AMT_INST_MIN_REGULARITY_min_min', 'client_credit_AMT_INST_MIN_REGULARITY_min_sum', 'client_credit_AMT_INST_MIN_REGULARITY_sum_mean', 'client_credit_AMT_INST_MIN_REGULARITY_sum_max', 'client_credit_AMT_INST_MIN_REGULARITY_sum_min', 'client_credit_AMT_INST_MIN_REGULARITY_sum_sum', 'client_credit_AMT_PAYMENT_CURRENT_mean_mean', 'client_credit_AMT_PAYMENT_CURRENT_mean_max', 'client_credit_AMT_PAYMENT_CURRENT_mean_min', 'client_credit_AMT_PAYMENT_CURRENT_mean_sum', 'client_credit_AMT_PAYMENT_CURRENT_max_mean', 'client_credit_AMT_PAYMENT_CURRENT_max_max', 'client_credit_AMT_PAYMENT_CURRENT_max_min', 'client_credit_AMT_PAYMENT_CURRENT_max_sum', 'client_credit_AMT_PAYMENT_CURRENT_min_mean', 'client_credit_AMT_PAYMENT_CURRENT_min_max', 'client_credit_AMT_PAYMENT_CURRENT_min_min', 'client_credit_AMT_PAYMENT_CURRENT_min_sum', 'client_credit_AMT_PAYMENT_CURRENT_sum_mean', 'client_credit_AMT_PAYMENT_CURRENT_sum_max', 'client_credit_AMT_PAYMENT_CURRENT_sum_min', 'client_credit_AMT_PAYMENT_CURRENT_sum_sum', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_mean_mean', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_mean_max', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_mean_min', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_mean_sum', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_max_mean', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_max_max', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_max_min', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_max_sum', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_min_mean', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_min_max', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_min_min', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_min_sum', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_sum_mean', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_sum_max', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_sum_min', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_sum_sum', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_mean_mean', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_mean_max', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_mean_min', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_mean_sum', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_max_mean', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_max_max', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_max_min', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_max_sum', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_min_mean', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_min_max', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_min_min', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_min_sum', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_sum_mean', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_sum_max', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_sum_min', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_sum_sum', 'client_credit_AMT_RECIVABLE_mean_mean', 'client_credit_AMT_RECIVABLE_mean_max', 'client_credit_AMT_RECIVABLE_mean_min', 'client_credit_AMT_RECIVABLE_mean_sum', 'client_credit_AMT_RECIVABLE_max_mean', 'client_credit_AMT_RECIVABLE_max_max', 'client_credit_AMT_RECIVABLE_max_min', 'client_credit_AMT_RECIVABLE_max_sum', 'client_credit_AMT_RECIVABLE_min_mean', 'client_credit_AMT_RECIVABLE_min_max', 'client_credit_AMT_RECIVABLE_min_min', 'client_credit_AMT_RECIVABLE_min_sum', 'client_credit_AMT_RECIVABLE_sum_mean', 'client_credit_AMT_RECIVABLE_sum_max', 'client_credit_AMT_RECIVABLE_sum_min', 'client_credit_AMT_RECIVABLE_sum_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_mean_mean', 'client_credit_AMT_TOTAL_RECEIVABLE_mean_max', 'client_credit_AMT_TOTAL_RECEIVABLE_mean_min', 'client_credit_AMT_TOTAL_RECEIVABLE_mean_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_max_mean', 'client_credit_AMT_TOTAL_RECEIVABLE_max_max', 'client_credit_AMT_TOTAL_RECEIVABLE_max_min', 'client_credit_AMT_TOTAL_RECEIVABLE_max_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_min_mean', 'client_credit_AMT_TOTAL_RECEIVABLE_min_max', 'client_credit_AMT_TOTAL_RECEIVABLE_min_min', 'client_credit_AMT_TOTAL_RECEIVABLE_min_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_mean', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_max', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_min', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_sum', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_mean_mean', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_mean_max', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_mean_min', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_mean_sum', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_max_mean', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_max_max', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_max_min', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_max_sum', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_min_mean', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_min_max', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_min_min', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_min_sum', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_sum_mean', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_sum_max', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_sum_min', 'client_credit_CNT_DRAWINGS_ATM_CURRENT_sum_sum', 'client_credit_CNT_DRAWINGS_CURRENT_mean_mean', 'client_credit_CNT_DRAWINGS_CURRENT_mean_max', 'client_credit_CNT_DRAWINGS_CURRENT_mean_min', 'client_credit_CNT_DRAWINGS_CURRENT_mean_sum', 'client_credit_CNT_DRAWINGS_CURRENT_max_mean', 'client_credit_CNT_DRAWINGS_CURRENT_max_max', 'client_credit_CNT_DRAWINGS_CURRENT_max_min', 'client_credit_CNT_DRAWINGS_CURRENT_max_sum', 'client_credit_CNT_DRAWINGS_CURRENT_min_mean', 'client_credit_CNT_DRAWINGS_CURRENT_min_max', 'client_credit_CNT_DRAWINGS_CURRENT_min_min', 'client_credit_CNT_DRAWINGS_CURRENT_min_sum', 'client_credit_CNT_DRAWINGS_CURRENT_sum_mean', 'client_credit_CNT_DRAWINGS_CURRENT_sum_max', 'client_credit_CNT_DRAWINGS_CURRENT_sum_min', 'client_credit_CNT_DRAWINGS_CURRENT_sum_sum', 'client_credit_CNT_DRAWINGS_OTHER_CURRENT_mean_sum', 'client_credit_CNT_DRAWINGS_OTHER_CURRENT_max_sum', 'client_credit_CNT_DRAWINGS_OTHER_CURRENT_sum_mean', 'client_credit_CNT_DRAWINGS_OTHER_CURRENT_sum_max', 'client_credit_CNT_DRAWINGS_OTHER_CURRENT_sum_min', 'client_credit_CNT_DRAWINGS_OTHER_CURRENT_sum_sum', 'client_credit_CNT_DRAWINGS_POS_CURRENT_mean_mean', 'client_credit_CNT_DRAWINGS_POS_CURRENT_mean_max', 'client_credit_CNT_DRAWINGS_POS_CURRENT_mean_min', 'client_credit_CNT_DRAWINGS_POS_CURRENT_mean_sum', 'client_credit_CNT_DRAWINGS_POS_CURRENT_max_mean', 'client_credit_CNT_DRAWINGS_POS_CURRENT_max_max', 'client_credit_CNT_DRAWINGS_POS_CURRENT_max_min', 'client_credit_CNT_DRAWINGS_POS_CURRENT_max_sum', 'client_credit_CNT_DRAWINGS_POS_CURRENT_min_mean', 'client_credit_CNT_DRAWINGS_POS_CURRENT_min_max', 'client_credit_CNT_DRAWINGS_POS_CURRENT_min_min', 'client_credit_CNT_DRAWINGS_POS_CURRENT_min_sum', 'client_credit_CNT_DRAWINGS_POS_CURRENT_sum_mean', 'client_credit_CNT_DRAWINGS_POS_CURRENT_sum_max', 'client_credit_CNT_DRAWINGS_POS_CURRENT_sum_min', 'client_credit_CNT_DRAWINGS_POS_CURRENT_sum_sum', 'client_credit_CNT_INSTALMENT_MATURE_CUM_mean_mean', 'client_credit_CNT_INSTALMENT_MATURE_CUM_mean_max', 'client_credit_CNT_INSTALMENT_MATURE_CUM_mean_min', 'client_credit_CNT_INSTALMENT_MATURE_CUM_mean_sum', 'client_credit_CNT_INSTALMENT_MATURE_CUM_max_mean', 'client_credit_CNT_INSTALMENT_MATURE_CUM_max_max', 'client_credit_CNT_INSTALMENT_MATURE_CUM_max_min', 'client_credit_CNT_INSTALMENT_MATURE_CUM_max_sum', 'client_credit_CNT_INSTALMENT_MATURE_CUM_min_mean', 'client_credit_CNT_INSTALMENT_MATURE_CUM_min_max', 'client_credit_CNT_INSTALMENT_MATURE_CUM_min_min', 'client_credit_CNT_INSTALMENT_MATURE_CUM_min_sum', 'client_credit_CNT_INSTALMENT_MATURE_CUM_sum_mean', 'client_credit_CNT_INSTALMENT_MATURE_CUM_sum_max', 'client_credit_CNT_INSTALMENT_MATURE_CUM_sum_min', 'client_credit_CNT_INSTALMENT_MATURE_CUM_sum_sum', 'client_credit_SK_DPD_mean_mean', 'client_credit_SK_DPD_mean_max', 'client_credit_SK_DPD_mean_min', 'client_credit_SK_DPD_mean_sum', 'client_credit_SK_DPD_max_mean', 'client_credit_SK_DPD_max_max', 'client_credit_SK_DPD_max_min', 'client_credit_SK_DPD_max_sum', 'client_credit_SK_DPD_sum_mean', 'client_credit_SK_DPD_sum_max', 'client_credit_SK_DPD_sum_min', 'client_credit_SK_DPD_sum_sum', 'client_credit_SK_DPD_DEF_mean_mean', 'client_credit_SK_DPD_DEF_mean_max', 'client_credit_SK_DPD_DEF_mean_min', 'client_credit_SK_DPD_DEF_mean_sum', 'client_credit_SK_DPD_DEF_max_mean', 'client_credit_SK_DPD_DEF_max_max', 'client_credit_SK_DPD_DEF_max_min', 'client_credit_SK_DPD_DEF_max_sum', 'client_credit_SK_DPD_DEF_sum_mean', 'client_credit_SK_DPD_DEF_sum_max', 'client_credit_SK_DPD_DEF_sum_min', 'client_credit_SK_DPD_DEF_sum_sum', 'client_credit_counts_mean', 'client_credit_counts_max', 'client_credit_counts_min', 'client_credit_counts_sum', 'client_counts_y', 'client_installments_SK_ID_CURR_mean_mean', 'client_installments_SK_ID_CURR_mean_max', 'client_installments_SK_ID_CURR_mean_min', 'client_installments_SK_ID_CURR_mean_sum', 'client_installments_SK_ID_CURR_max_mean', 'client_installments_SK_ID_CURR_max_max', 'client_installments_SK_ID_CURR_max_min', 'client_installments_SK_ID_CURR_max_sum', 'client_installments_SK_ID_CURR_min_mean', 'client_installments_SK_ID_CURR_min_max', 'client_installments_SK_ID_CURR_min_min', 'client_installments_SK_ID_CURR_min_sum', 'client_installments_SK_ID_CURR_sum_mean', 'client_installments_SK_ID_CURR_sum_max', 'client_installments_SK_ID_CURR_sum_min', 'client_installments_SK_ID_CURR_sum_sum', 'client_installments_NUM_INSTALMENT_VERSION_mean_max', 'client_installments_NUM_INSTALMENT_VERSION_mean_min', 'client_installments_NUM_INSTALMENT_VERSION_mean_sum', 'client_installments_NUM_INSTALMENT_VERSION_max_max', 'client_installments_NUM_INSTALMENT_VERSION_max_min', 'client_installments_NUM_INSTALMENT_VERSION_max_sum', 'client_installments_NUM_INSTALMENT_VERSION_min_mean', 'client_installments_NUM_INSTALMENT_VERSION_min_max', 'client_installments_NUM_INSTALMENT_VERSION_min_min', 'client_installments_NUM_INSTALMENT_VERSION_min_sum', 'client_installments_NUM_INSTALMENT_VERSION_sum_mean', 'client_installments_NUM_INSTALMENT_VERSION_sum_max', 'client_installments_NUM_INSTALMENT_VERSION_sum_min', 'client_installments_NUM_INSTALMENT_VERSION_sum_sum', 'client_installments_NUM_INSTALMENT_NUMBER_mean_mean', 'client_installments_NUM_INSTALMENT_NUMBER_mean_max', 'client_installments_NUM_INSTALMENT_NUMBER_mean_min', 'client_installments_NUM_INSTALMENT_NUMBER_mean_sum', 'client_installments_NUM_INSTALMENT_NUMBER_max_mean', 'client_installments_NUM_INSTALMENT_NUMBER_max_max', 'client_installments_NUM_INSTALMENT_NUMBER_max_min', 'client_installments_NUM_INSTALMENT_NUMBER_max_sum', 'client_installments_NUM_INSTALMENT_NUMBER_min_mean', 'client_installments_NUM_INSTALMENT_NUMBER_min_max', 'client_installments_NUM_INSTALMENT_NUMBER_min_sum', 'client_installments_NUM_INSTALMENT_NUMBER_sum_mean', 'client_installments_NUM_INSTALMENT_NUMBER_sum_max', 'client_installments_NUM_INSTALMENT_NUMBER_sum_min', 'client_installments_NUM_INSTALMENT_NUMBER_sum_sum', 'client_installments_AMT_INSTALMENT_mean_mean', 'client_installments_AMT_INSTALMENT_mean_max', 'client_installments_AMT_INSTALMENT_mean_min', 'client_installments_AMT_INSTALMENT_mean_sum', 'client_installments_AMT_INSTALMENT_max_mean', 'client_installments_AMT_INSTALMENT_max_max', 'client_installments_AMT_INSTALMENT_max_min', 'client_installments_AMT_INSTALMENT_max_sum', 'client_installments_AMT_INSTALMENT_min_mean', 'client_installments_AMT_INSTALMENT_min_max', 'client_installments_AMT_INSTALMENT_min_min', 'client_installments_AMT_INSTALMENT_min_sum', 'client_installments_AMT_INSTALMENT_sum_mean', 'client_installments_AMT_INSTALMENT_sum_max', 'client_installments_AMT_INSTALMENT_sum_min', 'client_installments_AMT_INSTALMENT_sum_sum', 'client_installments_AMT_PAYMENT_mean_mean', 'client_installments_AMT_PAYMENT_mean_max', 'client_installments_AMT_PAYMENT_mean_min', 'client_installments_AMT_PAYMENT_mean_sum', 'client_installments_AMT_PAYMENT_max_mean', 'client_installments_AMT_PAYMENT_max_max', 'client_installments_AMT_PAYMENT_max_min', 'client_installments_AMT_PAYMENT_max_sum', 'client_installments_AMT_PAYMENT_min_mean', 'client_installments_AMT_PAYMENT_min_max', 'client_installments_AMT_PAYMENT_min_min', 'client_installments_AMT_PAYMENT_min_sum', 'client_installments_AMT_PAYMENT_sum_mean', 'client_installments_AMT_PAYMENT_sum_max', 'client_installments_AMT_PAYMENT_sum_min', 'client_installments_AMT_PAYMENT_sum_sum', 'client_installments_counts_mean', 'client_installments_counts_max', 'client_installments_counts_min', 'client_installments_counts_sum', 'client_counts']

chi2_selector = SelectKBest(chi2, k=10) # Select top 5 features
X_train_kbest = chi2_selector.fit_transform(X_train, y_train)

# To get the selected feature names
selected_features = [feature for feature, mask in zip(X_train.columns, chi2_selector.get_support()) if mask]
print(selected_features)

['client_credit_SK_ID_CURR_sum_sum', 'client_credit_AMT_BALANCE_sum_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_sum', 'client_credit_AMT_PAYMENT_CURRENT_sum_sum', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_sum_sum', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_sum_sum', 'client_credit_AMT_RECIVABLE_sum_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_sum', 'client_installments_SK_ID_CURR_sum_sum', 'client_installments_AMT_PAYMENT_sum_sum']

We can see that the top categorical variables are ‘client_credit_SK_ID_CURR_sum_sum’, ‘client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_sum’, ‘client_credit_AMT_TOTAL_RECEIVABLE_sum_sum’. Some of the top 10 of numerical variables are the balance of the client, the credit limit, the principal receivable.

Information gain (mutual information)

We can use the method SelectKBest in sklearn with several different algorithm such as mutual information, ANNOVA. These algorithms can process negative values. Mutual information is a non negative value that measures the dependency between the features and the target. It comes from information theory. It is equal to zero iff the two random variables are independent. Higher value would mean higher dependency which means that one variable becomes less uncertain when we know the value of the other one. Note that mutual information method doesn’t make assumption about the nature of the relationship, so it can capture non linear and non monotonic relation as well. Some other indicator such as correlation efficient can only capture linear relationship. Here is how to calculate the mutual information \(I(X,Y) = H(X) - H(X\mid Y)\). We can see that it is the difference betweeen the entropy for X and the conditional entropy for X given Y. The result is in bits. We only need to choose the corresponding algorithm for the SelectKBest method in sklearn. Since we have a thousand features, the dependency measurement for each variable is low, with the maximum being 0.03. Some of the top variables are about previous loan status, installment payment, actual credit limit, and cash loan purpose. It makes sense to use these features to predict the default rate of the client.

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=numerics)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=mutual_info_classif, k=10)
fit = bestfeatures.fit(X_train, y_train)   # X, y defined as before

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

# concatenate two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)

# name the dataframe columns
featureScores.columns = ['Specs','Score']  

# print 10 best features
print(featureScores.nlargest(10,'Score'))  

/Users/nguyenlinhchi/.local/lib/python3.9/site-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)


                                                  Specs     Score
           previous_loans_NAME_PORTFOLIO_POS_count  0.030435
         client_installments_AMT_PAYMENT_mean_min  0.024786
        client_credit_AMT_TOTAL_RECEIVABLE_sum_min  0.024512
               client_cash_CNT_INSTALMENT_min_mean  0.023053
    client_credit_AMT_CREDIT_LIMIT_ACTUAL_mean_min  0.022686
    client_credit_AMT_DRAWINGS_ATM_CURRENT_sum_max  0.022160
 previous_loans_NAME_YIELD_GROUP_low_action_cou...  0.021542
 previous_loans_NAME_CASH_LOAN_PURPOSE_Journey_...  0.021407
                                       EXT_SOURCE_3  0.021029
   client_credit_CNT_DRAWINGS_POS_CURRENT_mean_min  0.020707

ANNOVA F-test

ANNOVA, short for analysis of variance, is a statistical test that can decode the correlation among the features. It can test the equality of the means among populations too. This method can see the difference between systematic component and random factors affecting the observed variability of a dataset. This information can be used to evaluate the impact of features on the target. The equation for ANNOVA for one feature is \(\frac{distance between classes}{compactness of classes}\). This ratio shows how good the feature is in predicting the classes of a variable (how good it contributes to separate them).

For example, we have two classes for feature X separating: Class 1: [800, 300, 1300] and Class 2: [1000, 2000]. First we take the average of all points: (800 + 300 + 1300 + 1000 + 2000) / 5 = 1080. Then the mean of class 1: 800, mean of class 2: 1500. The distance between classes would be calculated as follows: number of observations in class 1 * (mean 1 - mean)^2 + number of observations in class 2 * (mean 2 - mean)^2 = 3 * (800 - 1080)^2 + 2 * (1500 - 1080)^2 = 588000. The compactness of the classes is calculated as: \(\frac{sample variance 1 + sample variance 2}{number of observations 1 - 1 + number of observations 2 - 1}\), with sample variance for each class being \(\frac{\sum_i (x_i - \bar{x})^2}{n-1}\). \(Sample variance 1 = \frac{(800-800)^2+(300-800)^2+(1300-800)^2}{3-1} = 250000\). \(sample variance 2 = 500000\). So the compactness is \(250000\). The F-score = 2.352. The degrees of freedom for the numerator is the number of groups minus one which is 1 and the second degrees of freedom is the total number of observations minus the number of groups, which is 5 - 2 = 3. Looking up the critical value for our configuration in the table of F test, the critical value is 5.53 which is larger than the calculated value. We cannot reject the null hypothesis that the feature X doesn’t have significant impact on the target variable. Since the mean of the two classes divided by the feature is equal.

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=numerics)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=f_classif, k=10)
fit = bestfeatures.fit(X_train, y_train)   # X, y defined as before

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

# concatenate two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)

# name the dataframe columns
featureScores.columns = ['Specs','Score']  

# print 10 best features
print(featureScores.nlargest(10,'Score'))  

                                                 Specs      Score
                                      EXT_SOURCE_3  44.139397
                                      EXT_SOURCE_2  39.448303
previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase...  13.564337
previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase...  13.564337
previous_loans_NAME_CASH_LOAN_PURPOSE_Car repa...  12.410798
  previous_loans_CODE_REJECT_REASON_HC_count_norm  11.547177
 previous_loans_CODE_REJECT_REASON_XAP_count_norm  11.498345
previous_loans_NAME_GOODS_CATEGORY_Vehicles_co...  11.284404
previous_loans_NAME_CONTRACT_STATUS_Refused_co...  11.032108
  previous_loans_NAME_YIELD_GROUP_high_count_norm  10.263603


/Users/nguyenlinhchi/.local/lib/python3.9/site-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/Users/nguyenlinhchi/.local/lib/python3.9/site-packages/sklearn/feature_selection/_univariate_selection.py:112: UserWarning: Features [  12   79   81   84   87   89   94   96   97   98  133  134  135  136
140  153  154  155  156  161  162  169  170  237  238  239  240
256  257  258  265  266  267  268  341  342  416  470  471  472
474  475  476  477  486  487  488  489  490  491  492  493  502
504  505  506  507  508  509  526  527  528  529  530  531  532
608  622  623  624  625  643  644  645  646  647  648  649  650
660  661  662  663  664  665  666  667  668  669  670  671  672
674  795  796  797  798  955  956  957  958 1003 1004 1005 1006
1020 1021 1022] are constant.
  warnings.warn("Features %s are constant." % constant_features_idx, UserWarning)
/Users/nguyenlinhchi/.local/lib/python3.9/site-packages/sklearn/feature_selection/_univariate_selection.py:113: RuntimeWarning: invalid value encountered in divide
  f = msb / msw

Correlation coefficient

This is the indicator of correlation between target variable and the input variables, it selects the highly correlated features with the target. It is a simple technique.

import seaborn as sns
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = X_train.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

Wrapper method

Forward selection

This starts with an empty model. In each step, it considers all the available model and chooses the one that add the most to the performance of the model. The process is repeated until adding more variables to the model doesn’t improve it anymore. This is a greedy algorithm, meaning that it chooses the immediate best at the current step without thinking about the consequences.

So basically we start with a null main model, this assumes that none of the features can explain the variance in the target. Then we fit a simple sub-model for each feature. We choose the feature that is with the sub-model producing the lowest p-value (highest statistical significance) to add to our main model. Then we fit sub-models with two predictors, one of which is the one already selected, the other is each of the rest. Then we choose the second predictor for our main model by choosing the one associated with the sub-model having the lowest p-value. Repeat the process until a stopping criterion.

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=categoricals)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)
X_train = X_train.apply(LabelEncoder().fit_transform)

import statsmodels.api as sm

# List of all potential features
features = list(X_train.columns)

# The maximum p-value for features we'll accept as significant
p_max = 0.05

while len(features) > 0:
    features_with_constant = sm.add_constant(X_train[features])
    p_values = sm.OLS(y_train, features_with_constant).fit().pvalues[1:]
    
    max_p_value = p_values.max()
    
    if max_p_value >= p_max:
        excluded_feature = p_values.idxmax()
        features.remove(excluded_feature)
    else:
        break

print(f'Selected features: {features}')

Selected features: ['CODE_GENDER', 'NAME_EDUCATION_TYPE']

sm.OLS(y_train, features_with_constant).fit().summary()

OLS Regression Results
Dep. Variable:	TARGET	R-squared:	0.012
Model:	OLS	Adj. R-squared:	0.010
Method:	Least Squares	F-statistic:	6.041
Date:	Thu, 01 Jun 2023	Prob (F-statistic):	0.00247
Time:	20:05:18	Log-Likelihood:	-47.001
No. Observations:	1000	AIC:	100.0
Df Residuals:	997	BIC:	114.7
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	0.0231	0.016	1.414	0.158	-0.009	0.055
CODE_GENDER	0.0378	0.017	2.252	0.025	0.005	0.071
NAME_EDUCATION_TYPE	0.0155	0.006	2.551	0.011	0.004	0.027

Omnibus:	683.755	Durbin-Watson:	1.987
Prob(Omnibus):	0.000	Jarque-Bera (JB):	5303.637
Skew:	3.310	Prob(JB):	0.00
Kurtosis:	12.136	Cond. No.	6.48

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For the categorical variables, this method chooses only gender and education type.

Backward elimination

Backward elimination method is similar to the forward elimination but in reverse. It starts with a model full of features. It removes the least significant feature one at a time (having p-value smaller than the significance level of 0.05 for example), until no further improvement is observed. Basically, it fits the full model, then it plans to remove the feature with the highest p-value exceeding the chosen significance level. If there is no such feature, the process stops. Otherwise the feature is removed and the model would be fitted again without the eliminated.

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=categoricals)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)
X_train = X_train.apply(LabelEncoder().fit_transform)

import pandas as pd
import statsmodels.api as sm

# Add a column of ones as integer data type
X_train.insert(0, 'Intercept', 1)

# Set a significance level
SL = 0.05

# Convert the DataFrame into a list so that we can manipulate the features
features = list(X_train.columns)
while (len(features) > 0):
    features_with_pvalues = []
    excluded_feature = ''
    # Fit the model with the current features
    model = sm.OLS(y_train, X_train[features]).fit()
    # Get the p-values for the features
    pvalues = model.pvalues
    max_pvalue = pvalues.max()  # Get the feature with the highest p-value
    if max_pvalue > SL:
        excluded_feature = pvalues.idxmax()  # Get the feature with the highest p-value
        features.remove(excluded_feature)
    else:
        break

final_model = sm.OLS(y_train, X_train[features]).fit()
print(final_model.summary())

print(f"The final features are: {features}")

                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 TARGET   R-squared (uncentered):                   0.083
Model:                            OLS   Adj. R-squared (uncentered):              0.081
Method:                 Least Squares   F-statistic:                              30.24
Date:                Thu, 01 Jun 2023   Prob (F-statistic):                    1.03e-18
Time:                        20:06:00   Log-Likelihood:                         -45.766
No. Observations:                1000   AIC:                                      97.53
Df Residuals:                     997   BIC:                                      112.3
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
CODE_GENDER                    0.0393      0.016      2.438      0.015       0.008       0.071
NAME_EDUCATION_TYPE            0.0162      0.005      3.438      0.001       0.007       0.025
WEEKDAY_APPR_PROCESS_START     0.0065      0.003      2.114      0.035       0.000       0.013
==============================================================================
Omnibus:                      679.118   Durbin-Watson:                   1.983
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5201.242
Skew:                           3.285   Prob(JB):                         0.00
Kurtosis:                      12.036   Cond. No.                         8.60
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The final features are: ['CODE_GENDER', 'NAME_EDUCATION_TYPE', 'WEEKDAY_APPR_PROCESS_START']

Interestingly, this process chooses the same two variables above: the gender and the education type, it chooses another one named the day of the week when the client starts the process.

Recursive feature elimination (RFE)

This algorithm performs a greedy search to find the best performing feature subset. In each iteration, it creates model and determine the best and worst features. Then it builds the next model with the best ones until the number of features we need is reached. The features are then ranked based on the order of their elimination. The top three categorical variables that are chosen are the gender, contract type and whether the client owns a car. The difference with backward elimination is that this method uses a metric of model performance to decide the weakest feature. For example it can use accuracy instead of p-value.

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=categoricals)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)
X_train = X_train.apply(LabelEncoder().fit_transform)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# feature extraction
model = LogisticRegression()
rfe = RFE(estimator=model, n_features_to_select=3)
fit = rfe.fit(X_train, y_train) 
# get column names
col_names = X_train.columns

# get selected features
selected_features = col_names[fit.support_]

print("Selected Features: ")
for feature in selected_features:
    print(feature)

Selected Features: 
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR

Embedded method

This method chooses the features as a part of the creation of the model.

LASSO

LASSO has been introduced in a previous post. Here we will talk about how it incorporates feature selection. Remember that LASSO’s loss function has an extra term to penalize the weight of the model, apart from the usual mean squared error. The term is simply \(\lambda \mid \beta \mid\). It doesn’t just prevent the weights to become too large in absolute, it only can force some of the coefficients to be zero when \(\lambda\) is sufficiently large. In our example, the chosen variables are the ones with coefficients different from zero: the type of suite the client lives in, the education, the occupation, the day in the week that the client applies for the loan, and the organization type. One interesting variable is the week day that the client starts the loan process, it seems to be significant across feature selection algorithms.

from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5) 
lasso.fit(X_train, y_train)

# To get the selected feature names
selected_features = [feature for feature, coef in zip(X_train.columns, lasso.coef_) if coef != 0]
print(selected_features)

['NAME_TYPE_SUITE', 'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE']


/Users/nguyenlinhchi/.local/lib/python3.9/site-packages/sklearn/linear_model/_coordinate_descent.py:1568: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Elastic net

Since Elastic net is the mix between Lasso and Ridge, they also have that quality of selecting the relevant variables.

Decision tree and Random forest

A decision tree uses Gini or entropy impurity to estimate the quality of each split. Features at the top contribute to the final prediction more than others at the leaf. The expected fraction of the samples they contribute to thus can be used as an indicator of their importance. Random forest is also an algorithm that has the feature importance built in in sklearn library since it is an ensemble of decision trees.

from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rf.fit(X_train, y_train)

# Get feature importances
importances = rf.feature_importances_

# Convert feature importances to a DataFrame
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
})

# Sort DataFrame by the importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

print(importance_df)

                       Feature  Importance
         ORGANIZATION_TYPE    0.203309
WEEKDAY_APPR_PROCESS_START    0.148354
            OCCUPATION_TYPE    0.118677
         NAME_FAMILY_STATUS    0.090337
            FLAG_OWN_REALTY    0.065496
        WALLSMATERIAL_MODE    0.061458
               FLAG_OWN_CAR    0.056228
           NAME_INCOME_TYPE    0.052931
            NAME_TYPE_SUITE    0.042914
                CODE_GENDER    0.041188
        NAME_EDUCATION_TYPE    0.038794
        FONDKAPREMONT_MODE    0.031838
          NAME_HOUSING_TYPE    0.025070
         NAME_CONTRACT_TYPE    0.017171
       EMERGENCYSTATE_MODE    0.005946
            HOUSETYPE_MODE    0.000288


/var/folders/kf/5_ggvsz93vxdbx_h0tvy66xh0000gn/T/ipykernel_7037/1195280182.py:7: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  rf.fit(X_train, y_train)

The top four features are organization type, the week day that the process start (this one again, even though it seems that it should be an irrelevant feature before we run all the algorithms), the occupation of the client and the family status.

In conclusion, we have explored multiple ways to select the most relevant features for our model. If done right, there would be many benefits to enjoy from reducing the number of variables: better run time, more efficient computation, better accuracy even. Doing together with good data collecting and preprocessing practices, it would triple our performance. However, remember that which algorithm to take varies from dataset to dataset. And it is required that the person in charge has a sufficient body of knowledge on the matter.

Feature Selection

TOC