Perform various feature engineering on the application dataset.
- Run EDA, such as distribution plots, on the main feature values of application_train(test).
- Run feature engineering by further processing the main features of application_train(test).
Loading the libraries and the app dataset¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc,os,sys
import random
from sklearn.model_selection import KFold, StratifiedKFold
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_colwidth', 30)
In [3]:
app_train = pd.read_csv('데이터셋/home-credit-default-risk/application_train.csv')
app_test = pd.read_csv('데이터셋/home-credit-default-risk/application_test.csv')
In [4]:
app_train.isnull().sum()
Out[4]:
SK_ID_CURR 0
TARGET 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
FLAG_OWN_REALTY 0
CNT_CHILDREN 0
AMT_INCOME_TOTAL 0
AMT_CREDIT 0
AMT_ANNUITY 12
AMT_GOODS_PRICE 278
NAME_TYPE_SUITE 1292
NAME_INCOME_TYPE 0
NAME_EDUCATION_TYPE 0
NAME_FAMILY_STATUS 0
NAME_HOUSING_TYPE 0
REGION_POPULATION_RELATIVE 0
DAYS_BIRTH 0
DAYS_EMPLOYED 0
DAYS_REGISTRATION 0
DAYS_ID_PUBLISH 0
OWN_CAR_AGE 202929
FLAG_MOBIL 0
FLAG_EMP_PHONE 0
FLAG_WORK_PHONE 0
FLAG_CONT_MOBILE 0
FLAG_PHONE 0
FLAG_EMAIL 0
OCCUPATION_TYPE 96391
CNT_FAM_MEMBERS 2
REGION_RATING_CLIENT 0
REGION_RATING_CLIENT_W_CITY 0
WEEKDAY_APPR_PROCESS_START 0
HOUR_APPR_PROCESS_START 0
REG_REGION_NOT_LIVE_REGION 0
REG_REGION_NOT_WORK_REGION 0
LIVE_REGION_NOT_WORK_REGION 0
REG_CITY_NOT_LIVE_CITY 0
REG_CITY_NOT_WORK_CITY 0
LIVE_CITY_NOT_WORK_CITY 0
ORGANIZATION_TYPE 0
EXT_SOURCE_1 173378
EXT_SOURCE_2 660
EXT_SOURCE_3 60965
APARTMENTS_AVG 156061
BASEMENTAREA_AVG 179943
YEARS_BEGINEXPLUATATION_AVG 150007
YEARS_BUILD_AVG 204488
COMMONAREA_AVG 214865
ELEVATORS_AVG 163891
ENTRANCES_AVG 154828
FLOORSMAX_AVG 153020
FLOORSMIN_AVG 208642
LANDAREA_AVG 182590
LIVINGAPARTMENTS_AVG 210199
LIVINGAREA_AVG 154350
NONLIVINGAPARTMENTS_AVG 213514
NONLIVINGAREA_AVG 169682
APARTMENTS_MODE 156061
BASEMENTAREA_MODE 179943
YEARS_BEGINEXPLUATATION_MODE 150007
YEARS_BUILD_MODE 204488
COMMONAREA_MODE 214865
ELEVATORS_MODE 163891
ENTRANCES_MODE 154828
FLOORSMAX_MODE 153020
FLOORSMIN_MODE 208642
LANDAREA_MODE 182590
LIVINGAPARTMENTS_MODE 210199
LIVINGAREA_MODE 154350
NONLIVINGAPARTMENTS_MODE 213514
NONLIVINGAREA_MODE 169682
APARTMENTS_MEDI 156061
BASEMENTAREA_MEDI 179943
YEARS_BEGINEXPLUATATION_MEDI 150007
YEARS_BUILD_MEDI 204488
COMMONAREA_MEDI 214865
ELEVATORS_MEDI 163891
ENTRANCES_MEDI 154828
FLOORSMAX_MEDI 153020
FLOORSMIN_MEDI 208642
LANDAREA_MEDI 182590
LIVINGAPARTMENTS_MEDI 210199
LIVINGAREA_MEDI 154350
NONLIVINGAPARTMENTS_MEDI 213514
NONLIVINGAREA_MEDI 169682
FONDKAPREMONT_MODE 210295
HOUSETYPE_MODE 154297
TOTALAREA_MODE 148431
WALLSMATERIAL_MODE 156341
EMERGENCYSTATE_MODE 145755
OBS_30_CNT_SOCIAL_CIRCLE 1021
DEF_30_CNT_SOCIAL_CIRCLE 1021
OBS_60_CNT_SOCIAL_CIRCLE 1021
DEF_60_CNT_SOCIAL_CIRCLE 1021
DAYS_LAST_PHONE_CHANGE 1
FLAG_DOCUMENT_2 0
FLAG_DOCUMENT_3 0
FLAG_DOCUMENT_4 0
FLAG_DOCUMENT_5 0
FLAG_DOCUMENT_6 0
FLAG_DOCUMENT_7 0
FLAG_DOCUMENT_8 0
FLAG_DOCUMENT_9 0
FLAG_DOCUMENT_10 0
FLAG_DOCUMENT_11 0
FLAG_DOCUMENT_12 0
FLAG_DOCUMENT_13 0
FLAG_DOCUMENT_14 0
FLAG_DOCUMENT_15 0
FLAG_DOCUMENT_16 0
FLAG_DOCUMENT_17 0
FLAG_DOCUMENT_18 0
FLAG_DOCUMENT_19 0
FLAG_DOCUMENT_20 0
FLAG_DOCUMENT_21 0
AMT_REQ_CREDIT_BUREAU_HOUR 41519
AMT_REQ_CREDIT_BUREAU_DAY 41519
AMT_REQ_CREDIT_BUREAU_WEEK 41519
AMT_REQ_CREDIT_BUREAU_MON 41519
AMT_REQ_CREDIT_BUREAU_QRT 41519
AMT_REQ_CREDIT_BUREAU_YEAR 41519
dtype: int64
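Raw null counts are easier to compare as a share of rows. The following is a minimal sketch on a toy frame; `toy` and `missing_ratio` are illustrative names and the values are made up, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for app_train (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'AMT_ANNUITY': [24700.5, np.nan, 6750.0, 29686.5],
    'OWN_CAR_AGE': [np.nan, np.nan, 26.0, np.nan],
    'TARGET': [1, 0, 0, 0],
})

# isnull().mean() gives the fraction of missing rows per column
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

On the real data the same call would be `app_train.isnull().mean()`. For example, OWN_CAR_AGE is missing in 202929 of 307511 rows, about 66%.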
In [5]:
app_train['TARGET'].value_counts()
Out[5]:
0 282686
1 24825
Name: TARGET, dtype: int64
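The counts above show a strong class imbalance. As a small worked example, the class shares can be computed directly from the printed counts; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 282686, 1: 24825}, name='TARGET')

# Share of each class; on the DataFrame this is equivalent to
# app_train['TARGET'].value_counts(normalize=True)
ratio = counts / counts.sum()
print(ratio.round(4))
```

Only about 8% of the rows have TARGET=1, which is one reason to compare distributions per TARGET value rather than raw counts.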
In [6]:
app_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
Histograms of continuous numeric features for TARGET = 0 and TARGET = 1¶
- Compare the distribution of numeric features by TARGET value using violinplot and distplot.
In [10]:
num_columns = app_train.dtypes[app_train.dtypes != 'object']
In [8]:
len(num_columns)
# Equals the combined count of app_train's float64(65) and int64(41) dtypes.
Out[8]:
106
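Note that `num_columns` above is a Series of dtypes indexed by column name, not a list of names. A minimal sketch of two ways to get the names themselves, on a toy frame with made-up values (`toy` and the `num_column_names*` variables are illustrative):

```python
import pandas as pd

# Toy frame: two numeric columns and one object column (hypothetical values)
toy = pd.DataFrame({
    'AMT_CREDIT': [406597.5, 1293502.5],
    'CNT_CHILDREN': [0, 1],
    'CODE_GENDER': pd.Series(['M', 'F'], dtype='object'),
})

# Same filter as in the notebook, keeping only the column names
num_column_names = toy.dtypes[toy.dtypes != 'object'].index.tolist()

# Alternative: let pandas select every numeric dtype
num_column_names_alt = toy.select_dtypes(include='number').columns.tolist()
print(num_column_names, num_column_names_alt)
```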
In [13]:
def show_hist_by_target(df, columns):
    # Boolean indexing conditions for each TARGET value
    cond1 = (df['TARGET'] == 1)
    cond0 = (df['TARGET'] == 0)
    for column in columns:
        # Create two subplots: violin plot on the left, distplot on the right
        fig, axs = plt.subplots(figsize=(12, 4), nrows=1, ncols=2, squeeze=False)
        # Draw the violin plot in the left subplot.
        sns.violinplot(x='TARGET', y=column, data=df, ax=axs[0][0])
        # Draw the histograms in the right subplot.
        sns.distplot(df[cond0][column], ax=axs[0][1], label='0', color='blue')
        sns.distplot(df[cond1][column], ax=axs[0][1], label='1', color='red')
        # Show the legend so the TARGET=0 and TARGET=1 curves can be told apart.
        axs[0][1].legend()
In [14]:
columns = ['AMT_INCOME_TOTAL','AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH',
'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'EXT_SOURCE_1',
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
show_hist_by_target(app_train, columns)
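- AMT_INCOME_TOTAL and AMT_CREDIT show no large difference.
- For AMT_ANNUITY, relatively small values are slightly more common when TARGET=1.
- AMT_GOODS_PRICE shows no large difference.
- For DAYS_BIRTH, younger ages are relatively more common when TARGET=1.
- For DAYS_EMPLOYED, shorter employment periods (values closer to 0) are slightly more common when TARGET=1.
- For DAYS_ID_PUBLISH and DAYS_REGISTRATION, recent values are slightly more common when TARGET=1.
- DAYS_LAST_PHONE_CHANGE shows no large difference.
- CNT_FAM_MEMBERS shows no difference; the histograms only look different because of outliers.
- REGION_RATING_CLIENT shows no large difference.
- EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 all differ slightly.
- The remaining columns show no large difference.
- Overall, the share of defaults is relatively higher among younger applicants (with less work experience) and for small loans.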
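The visual comparison can be cross-checked numerically by summarizing each feature per TARGET value. This is a minimal sketch on made-up rows; `toy` and `summary` are illustrative names, and the numbers are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for app_train (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'TARGET':       [0, 0, 0, 1, 1, 1],
    'DAYS_BIRTH':   [-19046, -19005, -16765, -9461, -11000, -12500],
    'EXT_SOURCE_2': [0.62, 0.65, 0.56, 0.26, 0.30, 0.41],
})

# Median of each feature within TARGET=0 and TARGET=1
summary = toy.groupby('TARGET')[['DAYS_BIRTH', 'EXT_SOURCE_2']].median()
print(summary)
```

On the real data, `app_train.groupby('TARGET')[columns].median()` would give one row per TARGET value for the columns plotted above. The median is used here because several of these features have heavy outliers.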
Compare counts of category features (object columns) by TARGET value using seaborn's countplot() or catplot()¶
In [15]:
object_columns = app_train.dtypes[app_train.dtypes=='object'].index.tolist()
object_columns
len(object_columns)
Out[15]:
16
In [16]:
# With countplot, the colors are not kept consistent between the two charts and keep changing
# (likely because the colors follow the category order within each subset),
# so countplot is not used much here.
# e.g. chart 1: female (blue), male (red)
# e.g. chart 2: female (red), male (blue)
def show_count_by_target(df, columns):
    cond_1 = (df['TARGET'] == 1)
    cond_0 = (df['TARGET'] == 0)
    for column in columns:
        fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(18, 4), squeeze=False)
        # Use countplot to show the histogram of the category values
        chart0 = sns.countplot(x=df[cond_0][column], ax=axs[0][0])
        # Rotate the x-axis tick labels by 45 degrees because there are many distinct values
        chart0.set_xticklabels(chart0.get_xticklabels(), rotation=45)
        chart1 = sns.countplot(x=df[cond_1][column], ax=axs[0][1])
        chart1.set_xticklabels(chart1.get_xticklabels(), rotation=45)
show_count_by_target(app_train, object_columns)
In [17]:
# Use catplot to split a countplot into several subplots according to the value of a given column.
sns.catplot(x="CODE_GENDER",col="TARGET", data=app_train, kind="count")
# Gender breakdown where TARGET is 0
# Gender breakdown where TARGET is 1
# As above, a single call produces one chart per TARGET value.
Out[17]:
<seaborn.axisgrid.FacetGrid at 0x2211fb86610>
In [19]:
# Use catplot to show the category-value histograms of several columns by TARGET.
def show_category_by_target(df, columns):
    for column in columns:
        chart = sns.catplot(x=column, col="TARGET", data=df, kind="count")
        chart.set_xticklabels(rotation=65)
show_category_by_target(app_train, object_columns)
Relative to the number of loans, the default rate is higher for men than for women. Confirm this with value_counts().¶
In [20]:
cond_1 = (app_train['TARGET'] == 1)
cond_0 = (app_train['TARGET'] == 0)
cond_f = (app_train['CODE_GENDER'] == 'F')
cond_m = (app_train['CODE_GENDER'] == 'M')
# Share of men and women among all rows
print(app_train['CODE_GENDER'].value_counts()/app_train.shape[0])
# Share of men and women when TARGET=1
print(app_train[cond_1]['CODE_GENDER'].value_counts()/app_train[cond_1].shape[0])
# Share of men and women when TARGET=0
print(app_train[cond_0]['CODE_GENDER'].value_counts()/app_train[cond_0].shape[0])
F 0.658344
M 0.341643
XNA 0.000013
Name: CODE_GENDER, dtype: float64
F 0.570796
M 0.429204
Name: CODE_GENDER, dtype: float64
F 0.666032
M 0.333954
XNA 0.000014
Name: CODE_GENDER, dtype: float64
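The shares printed above can be turned into a default rate per gender. The sketch below is plain arithmetic on figures copied from the notebook outputs (the TARGET=1 count, the gender shares within TARGET=1, and the CODE_GENDER counts); the variable names are illustrative.

```python
import pandas as pd

# Figures copied from the notebook outputs
n_default = 24825                                           # rows with TARGET=1
share_of_defaults = pd.Series({'F': 0.570796, 'M': 0.429204})
n_loans = pd.Series({'F': 202448, 'M': 105059})             # rows per gender

# Default rate per gender = defaults of that gender / loans of that gender
default_rate = share_of_defaults * n_default / n_loans
print(default_rate.round(3))
```

This comes to roughly 7% for F and 10% for M: women hold about 66% of the loans but only about 57% of the defaults. On the DataFrame itself, `app_train.groupby('CODE_GENDER')['TARGET'].mean()` should give the same rates directly.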
Correlation analysis of the key columns with TARGET¶
In [23]:
corr_columns = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'DAYS_EMPLOYED','DAYS_ID_PUBLISH', 'DAYS_REGISTRATION', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL','TARGET']
corr = app_train[corr_columns].corr()
# These columns were chosen because they were the top 12 (importance of 400 or more) in the feature-importance plot from app_baseline_01.
In [24]:
app_train[corr_columns].head()
Out[24]:
EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_BIRTH | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | DAYS_EMPLOYED | DAYS_ID_PUBLISH | DAYS_REGISTRATION | DAYS_LAST_PHONE_CHANGE | AMT_INCOME_TOTAL | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.083037 | 0.262949 | 0.139376 | -9461 | 406597.5 | 24700.5 | 351000.0 | -637 | -2120 | -3648.0 | -1134.0 | 202500.0 | 1 |
1 | 0.311267 | 0.622246 | NaN | -16765 | 1293502.5 | 35698.5 | 1129500.0 | -1188 | -291 | -1186.0 | -828.0 | 270000.0 | 0 |
2 | NaN | 0.555912 | 0.729567 | -19046 | 135000.0 | 6750.0 | 135000.0 | -225 | -2531 | -4260.0 | -815.0 | 67500.0 | 0 |
3 | NaN | 0.650442 | NaN | -19005 | 312682.5 | 29686.5 | 297000.0 | -3039 | -2437 | -9833.0 | -617.0 | 135000.0 | 0 |
4 | NaN | 0.322738 | NaN | -19932 | 513000.0 | 21865.5 | 513000.0 | -3038 | -3458 | -4311.0 | -1106.0 | 121500.0 | 0 |
In [25]:
plt.figure(figsize=(9, 9))
sns.heatmap(corr, annot=True)
Out[25]:
<AxesSubplot:>
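Besides the heatmap, the TARGET column of the correlation matrix can be read off as a ranked list. This is a minimal sketch on a toy frame with made-up values; `toy` and `target_corr` are illustrative names.

```python
import pandas as pd

# Toy stand-in for app_train[corr_columns] (hypothetical values)
toy = pd.DataFrame({
    'EXT_SOURCE_2': [0.26, 0.62, 0.56, 0.65, 0.32, 0.20],
    'AMT_CREDIT': [406597.5, 1293502.5, 135000.0, 312682.5, 513000.0, 270000.0],
    'TARGET': [1, 0, 0, 0, 0, 1],
})
corr = toy.corr()

# Correlation of every feature with TARGET, strongest absolute value first
target_corr = corr['TARGET'].drop('TARGET')
target_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(target_corr)
```

Sorting by absolute value keeps the sign visible while still ranking strongly negative features, such as the EXT_SOURCE columns, near the top.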
Checking outliers and replacing the DAYS_EMPLOYED outlier value¶
In [26]:
### 365243 appears very often. It corresponds to about 1,000 years in days.
app_train['DAYS_EMPLOYED'].value_counts()
# DAYS_EMPLOYED = number of days worked at the current job as of the application date
Out[26]:
365243 55374
-200 156
-224 152
-230 151
-199 151
...
-13961 1
-11827 1
-10176 1
-9459 1
-8694 1
Name: DAYS_EMPLOYED, Length: 12574, dtype: int64
In [27]:
# CODE_GENDER has only 4 XNA rows. Since there are so few, keep them as they are.
app_train['CODE_GENDER'].value_counts()
Out[27]:
F 202448
M 105059
XNA 4
Name: CODE_GENDER, dtype: int64
In [32]:
# LightGBM can use NULL values when building its trees, so replace every occurrence with null
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace(365243, np.nan)
app_train['DAYS_EMPLOYED'].value_counts(dropna=False)
Out[32]:
NaN 55374
-200.0 156
-224.0 152
-230.0 151
-199.0 151
...
-13961.0 1
-11827.0 1
-10176.0 1
-9459.0 1
-8694.0 1
Name: DAYS_EMPLOYED, Length: 12574, dtype: int64
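As a quick check of why 365243 is treated as a placeholder rather than a real value, the sketch below converts it to years and repeats the replacement on a toy Series. The toy values are picked for illustration only.

```python
import numpy as np
import pandas as pd

# 365243 days is roughly 1000 years, so it cannot be a real employment length
years = 365243 / 365.25
print(round(years))

# Toy stand-in for app_train['DAYS_EMPLOYED']
days_employed = pd.Series([-637, 365243, -1188, 365243, -225], name='DAYS_EMPLOYED')

# Replacing the placeholder with NaN turns the integer column into float
cleaned = days_employed.replace(365243, np.nan)
print(cleaned.value_counts(dropna=False))
```

The switch to a float dtype is why the values in the output above are printed as -200.0 instead of -200.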
Feature engineering on the key features¶
Inspect the EXT_SOURCE values: mean/max/min/standard deviation of the EXT_SOURCE_X features¶
In [33]:
app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].isnull().sum()
Out[33]:
EXT_SOURCE_1 173378
EXT_SOURCE_2 660
EXT_SOURCE_3 60965
dtype: int64
In [34]:
app_train['EXT_SOURCE_1'].value_counts(dropna=False)
Out[34]:
NaN 173378
0.581015 5
0.546426 5
0.443982 5
0.765724 5
...
0.658473 1
0.318295 1
0.834644 1
0.485406 1
0.734460 1
Name: EXT_SOURCE_1, Length: 114585, dtype: int64
In [35]:
app_train['EXT_SOURCE_2'].value_counts(dropna=False)
Out[35]:
0.285898 721
NaN 660
0.262258 417
0.265256 343
0.159679 322
...
0.004725 1
0.257313 1
0.282030 1
0.181540 1
0.267834 1
Name: EXT_SOURCE_2, Length: 119832, dtype: int64
In [36]:
app_train['EXT_SOURCE_3'].value_counts(dropna=False)
Out[36]:
NaN 60965
0.746300 1460
0.713631 1315
0.694093 1276
0.670652 1191
...
0.028674 1
0.025272 1
0.021492 1
0.014556 1
0.043227 1
Name: EXT_SOURCE_3, Length: 815, dtype: int64
In [37]:
# Check the mean/max/min/standard deviation of the EXT_SOURCE_X features
print('### mean ###\n', app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean())
print('### max ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].max())
print('### min ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].min())
print('### std ###\n',app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std())
### mean ###
EXT_SOURCE_1 0.502130
EXT_SOURCE_2 0.514393
EXT_SOURCE_3 0.510853
dtype: float64
### max ###
EXT_SOURCE_1 0.962693
EXT_SOURCE_2 0.855000
EXT_SOURCE_3 0.896010
dtype: float64
### min ###
EXT_SOURCE_1 1.456813e-02
EXT_SOURCE_2 8.173617e-08
EXT_SOURCE_3 5.272652e-04
dtype: float64
### std ###
EXT_SOURCE_1 0.211062
EXT_SOURCE_2 0.191060
EXT_SOURCE_3 0.194844
dtype: float64
Combine the training and test datasets before processing¶
In [38]:
apps = pd.concat([app_train, app_test])
print(apps.shape)
(356255, 122)
EXT_SOURCE_X feature processing¶
- Combine the EXT_SOURCE_X features to create new mean and standard deviation features.
In [39]:
# Combine the EXT_SOURCE_X features to create new mean and standard deviation features.
apps['APPS_EXT_SOURCE_MEAN'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
apps['APPS_EXT_SOURCE_STD'] = apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)
apps[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APPS_EXT_SOURCE_MEAN', 'APPS_EXT_SOURCE_STD']].head(10)
Out[39]:
EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APPS_EXT_SOURCE_MEAN | APPS_EXT_SOURCE_STD | |
---|---|---|---|---|---|
0 | 0.083037 | 0.262949 | 0.139376 | 0.161787 | 0.092026 |
1 | 0.311267 | 0.622246 | NaN | 0.466757 | 0.219895 |
2 | NaN | 0.555912 | 0.729567 | 0.642739 | 0.122792 |
3 | NaN | 0.650442 | NaN | 0.650442 | NaN |
4 | NaN | 0.322738 | NaN | 0.322738 | NaN |
5 | NaN | 0.354225 | 0.621226 | 0.487726 | 0.188799 |
6 | 0.774761 | 0.724000 | 0.492060 | 0.663607 | 0.150717 |
7 | NaN | 0.714279 | 0.540654 | 0.627467 | 0.122771 |
8 | 0.587334 | 0.205747 | 0.751724 | 0.514935 | 0.280096 |
9 | NaN | 0.746644 | NaN | 0.746644 | NaN |
In [40]:
apps['APPS_EXT_SOURCE_STD'].isnull().sum()
Out[40]:
40950
In [41]:
# Where the newly created APPS_EXT_SOURCE_STD is null, fill it with the mean of APPS_EXT_SOURCE_STD.
apps['APPS_EXT_SOURCE_STD'] = apps['APPS_EXT_SOURCE_STD'].fillna(apps['APPS_EXT_SOURCE_STD'].mean())
apps['APPS_EXT_SOURCE_STD'].isnull().sum()
Out[41]:
0
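The null values in APPS_EXT_SOURCE_STD come from rows with only one non-null EXT_SOURCE value: `mean(axis=1)` skips NaN, but the sample standard deviation (pandas default `ddof=1`) needs at least two values. A minimal sketch using the first four rows shown in the table above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First four rows of the EXT_SOURCE columns, copied from the table above
ext = pd.DataFrame({
    'EXT_SOURCE_1': [0.083037, 0.311267, np.nan, np.nan],
    'EXT_SOURCE_2': [0.262949, 0.622246, 0.555912, 0.650442],
    'EXT_SOURCE_3': [0.139376, np.nan, 0.729567, np.nan],
})

row_mean = ext.mean(axis=1)   # skips NaN, so a single value is enough
row_std = ext.std(axis=1)     # sample std (ddof=1): NaN with fewer than two values

# Same strategy as the notebook: fill the gaps with the column-wide mean
filled_std = row_std.fillna(row_std.mean())
print(pd.DataFrame({'mean': row_mean, 'std': row_std, 'filled_std': filled_std}))
```

Row 3 has only EXT_SOURCE_2, so its mean equals that value and its standard deviation is NaN before filling.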
Feature processing with AMT_CREDIT ratios¶
In [42]:
# Amount to pay each month (including interest) / loan amount
apps['APPS_ANNUITY_CREDIT_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_CREDIT']
# Price of the goods relative to the loan amount, and the difference between the two
apps['APPS_GOODS_CREDIT_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_CREDIT']
apps['APPS_CREDIT_GOODS_DIFF'] = apps['AMT_CREDIT'] - apps['AMT_GOODS_PRICE']
Feature processing with AMT_INCOME_TOTAL ratios¶
In [43]:
# Loan-amount features expressed as ratios of AMT_INCOME_TOTAL
apps['APPS_ANNUITY_INCOME_RATIO'] = apps['AMT_ANNUITY']/apps['AMT_INCOME_TOTAL']
apps['APPS_CREDIT_INCOME_RATIO'] = apps['AMT_CREDIT']/apps['AMT_INCOME_TOTAL']
apps['APPS_GOODS_INCOME_RATIO'] = apps['AMT_GOODS_PRICE']/apps['AMT_INCOME_TOTAL']
# Income per family member, as a disposable-income feature that accounts for family size.
apps['APPS_CNT_FAM_INCOME_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['CNT_FAM_MEMBERS']
Feature processing with DAYS_BIRTH and DAYS_EMPLOYED ratios.¶
In [44]:
# Income/asset features expressed as ratios of DAYS_BIRTH and DAYS_EMPLOYED.
apps['APPS_EMPLOYED_BIRTH_RATIO'] = apps['DAYS_EMPLOYED']/apps['DAYS_BIRTH']
apps['APPS_INCOME_EMPLOYED_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_EMPLOYED']
apps['APPS_INCOME_BIRTH_RATIO'] = apps['AMT_INCOME_TOTAL']/apps['DAYS_BIRTH']
apps['APPS_CAR_BIRTH_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_BIRTH']
apps['APPS_CAR_EMPLOYED_RATIO'] = apps['OWN_CAR_AGE'] / apps['DAYS_EMPLOYED']
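As a worked example of the ratio features, the sketch below recomputes a few of them for the first application row (SK_ID_CURR 100002), with the input values copied from the notebook's output tables. The one-row frame `row` is illustrative.

```python
import numpy as np
import pandas as pd

# Row 0 of the application data, values copied from the notebook outputs
row = pd.DataFrame({
    'AMT_INCOME_TOTAL': [202500.0], 'AMT_CREDIT': [406597.5],
    'AMT_ANNUITY': [24700.5], 'AMT_GOODS_PRICE': [351000.0],
    'DAYS_BIRTH': [-9461], 'DAYS_EMPLOYED': [-637.0],
    'OWN_CAR_AGE': [np.nan],
})

row['APPS_ANNUITY_CREDIT_RATIO'] = row['AMT_ANNUITY'] / row['AMT_CREDIT']
row['APPS_CREDIT_INCOME_RATIO'] = row['AMT_CREDIT'] / row['AMT_INCOME_TOTAL']
row['APPS_EMPLOYED_BIRTH_RATIO'] = row['DAYS_EMPLOYED'] / row['DAYS_BIRTH']
# A missing numerator or denominator propagates: the result is NaN, not an error
row['APPS_CAR_BIRTH_RATIO'] = row['OWN_CAR_AGE'] / row['DAYS_BIRTH']

print(row.filter(like='APPS_').round(6).T)
```

Both DAYS columns are negative, so their ratio is positive, while ratios that divide an income by a DAYS column come out negative. Rows without a car keep NaN in the car-related ratios.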
In [80]:
apps.head()
Out[80]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | APPS_EXT_SOURCE_MEAN | APPS_EXT_SOURCE_STD | APPS_ANNUITY_CREDIT_RATIO | APPS_GOODS_CREDIT_RATIO | APPS_CREDIT_GOODS_DIFF | APPS_ANNUITY_INCOME_RATIO | APPS_CREDIT_INCOME_RATIO | APPS_GOODS_INCOME_RATIO | APPS_CNT_FAM_INCOME_RATIO | APPS_EMPLOYED_BIRTH_RATIO | APPS_INCOME_EMPLOYED_RATIO | APPS_INCOME_BIRTH_RATIO | APPS_CAR_BIRTH_RATIO | APPS_CAR_EMPLOYED_RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1.0 | 0 | 0 | 0 | 0 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0 | 0 | 0 | 0 | 0 | 0.018801 | -9461 | -637.0 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1.0 | 2 | 2 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | 0 | 0 | 0.0149 | 0 | 0 | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.161787 | 0.092026 | 0.060749 | 0.863262 | 55597.5 | 0.121978 | 2.007889 | 1.733333 | 202500.0 | 0.067329 | -317.896389 | -21.403657 | NaN | NaN |
1 | 100003 | 0.0 | 0 | 1 | 0 | 1 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 1 | 1 | 1 | 1 | 0 | 0.003541 | -16765 | -1188.0 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2.0 | 1 | 1 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | 0 | 0 | 0.0714 | 1 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.466757 | 0.219895 | 0.027598 | 0.873211 | 164002.5 | 0.132217 | 4.790750 | 4.183333 | 135000.0 | 0.070862 | -227.272727 | -16.104981 | NaN | NaN |
2 | 100004 | 0.0 | 1 | 0 | 1 | 0 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0 | 0 | 0 | 0 | 0 | 0.010032 | -19046 | -225.0 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1.0 | 2 | 2 | 1 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.642739 | 0.122792 | 0.050000 | 1.000000 | 0.0 | 0.100000 | 2.000000 | 2.000000 | 67500.0 | 0.011814 | -300.000000 | -3.544051 | -0.001365 | -0.115556 |
3 | 100006 | 0.0 | 0 | 1 | 0 | 0 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0 | 0 | 0 | 2 | 0 | 0.008019 | -19005 | -3039.0 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2.0 | 2 | 2 | 0 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.650442 | 0.151008 | 0.094941 | 0.949845 | 15682.5 | 0.219900 | 2.316167 | 2.200000 | 67500.0 | 0.159905 | -44.422507 | -7.103394 | NaN | NaN |
4 | 100007 | 0.0 | 0 | 0 | 0 | 0 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0 | 0 | 0 | 0 | 0 | 0.028663 | -19932 | -3038.0 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1.0 | 2 | 2 | 2 | 11 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.322738 | 0.151008 | 0.042623 | 1.000000 | 0.0 | 0.179963 | 4.222222 | 4.222222 | 121500.0 | 0.152418 | -39.993417 | -6.095725 | NaN | NaN |
Label-encode the data; NULL values are left unchanged so that LightGBM handles them internally.¶
In [48]:
apps.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 356255 entries, 0 to 48743
Columns: 136 entries, SK_ID_CURR to APPS_CAR_EMPLOYED_RATIO
dtypes: float64(81), int64(55)
memory usage: 372.4 MB
In [50]:
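# Note: the apps.info() outputs around this cell show no object columns, most likely because
# this cell had already been run once before those outputs were captured.
# Label-encode every object-type column.
object_columns = apps.dtypes[apps.dtypes == 'object'].index.tolist()
for column in object_columns:
    # pd.factorize converts object values into integer labels
    apps[column] = pd.factorize(apps[column])[0]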
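A small sketch of what `pd.factorize` returns, on a toy Series (the values are only examples). It shows where the -1 entries in the encoded columns come from.

```python
import numpy as np
import pandas as pd

# Toy object column with one missing value
occupation = pd.Series(['Laborers', 'Core staff', 'Laborers', np.nan], dtype='object')

# factorize returns (codes, uniques); [0] in the notebook keeps only the codes
codes, uniques = pd.factorize(occupation)
print(codes)     # integer label per row, numbered in order of first appearance
print(uniques)   # the distinct non-null values, indexed by their code
```

Missing values are encoded as -1 by default, so after encoding they are no longer NaN. For the label-encoded columns, LightGBM therefore sees -1 as an ordinary integer value rather than as a missing value.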
In [47]:
apps.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 356255 entries, 0 to 48743
Columns: 136 entries, SK_ID_CURR to APPS_CAR_EMPLOYED_RATIO
dtypes: float64(81), int64(55)
memory usage: 372.4 MB
Split the data back into training and test sets¶
In [56]:
apps_train = apps[~apps['TARGET'].isnull()]
apps_test = apps[apps['TARGET'].isnull()]
apps_test = apps_test.drop('TARGET', axis=1)
In [73]:
apps_train.head()
Out[73]:
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | APPS_EXT_SOURCE_MEAN | APPS_EXT_SOURCE_STD | APPS_ANNUITY_CREDIT_RATIO | APPS_GOODS_CREDIT_RATIO | APPS_CREDIT_GOODS_DIFF | APPS_ANNUITY_INCOME_RATIO | APPS_CREDIT_INCOME_RATIO | APPS_GOODS_INCOME_RATIO | APPS_CNT_FAM_INCOME_RATIO | APPS_EMPLOYED_BIRTH_RATIO | APPS_INCOME_EMPLOYED_RATIO | APPS_INCOME_BIRTH_RATIO | APPS_CAR_BIRTH_RATIO | APPS_CAR_EMPLOYED_RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1.0 | 0 | 0 | 0 | 0 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0 | 0 | 0 | 0 | 0 | 0.018801 | -9461 | -637.0 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1.0 | 2 | 2 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | 0 | 0 | 0.0149 | 0 | 0 | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.161787 | 0.092026 | 0.060749 | 0.863262 | 55597.5 | 0.121978 | 2.007889 | 1.733333 | 202500.0 | 0.067329 | -317.896389 | -21.403657 | NaN | NaN |
1 | 100003 | 0.0 | 0 | 1 | 0 | 1 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 1 | 1 | 1 | 1 | 0 | 0.003541 | -16765 | -1188.0 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2.0 | 1 | 1 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | 0 | 0 | 0.0714 | 1 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.466757 | 0.219895 | 0.027598 | 0.873211 | 164002.5 | 0.132217 | 4.790750 | 4.183333 | 135000.0 | 0.070862 | -227.272727 | -16.104981 | NaN | NaN |
2 | 100004 | 0.0 | 1 | 0 | 1 | 0 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0 | 0 | 0 | 0 | 0 | 0.010032 | -19046 | -225.0 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1.0 | 2 | 2 | 1 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.642739 | 0.122792 | 0.050000 | 1.000000 | 0.0 | 0.100000 | 2.000000 | 2.000000 | 67500.0 | 0.011814 | -300.000000 | -3.544051 | -0.001365 | -0.115556 |
3 | 100006 | 0.0 | 0 | 1 | 0 | 0 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0 | 0 | 0 | 2 | 0 | 0.008019 | -19005 | -3039.0 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2.0 | 2 | 2 | 0 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.650442 | 0.151008 | 0.094941 | 0.949845 | 15682.5 | 0.219900 | 2.316167 | 2.200000 | 67500.0 | 0.159905 | -44.422507 | -7.103394 | NaN | NaN |
4 | 100007 | 0.0 | 0 | 0 | 0 | 0 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0 | 0 | 0 | 0 | 0 | 0.028663 | -19932 | -3038.0 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1.0 | 2 | 2 | 2 | 11 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.322738 | 0.151008 | 0.042623 | 1.000000 | 0.0 | 0.179963 | 4.222222 | 4.222222 | 121500.0 | 0.152418 | -39.993417 | -6.095725 | NaN | NaN |
In [59]:
apps.shape, apps_train.shape, apps_test.shape
Out[59]:
((356255, 136), (307511, 136), (48744, 135))
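The combine-then-split pattern relies on the test set having no TARGET column, so `pd.concat` fills TARGET with NaN for the test rows. A minimal sketch on toy frames; the IDs and names are illustrative.

```python
import pandas as pd

# Toy stand-ins for app_train and app_test (hypothetical rows)
train = pd.DataFrame({'SK_ID_CURR': [100002, 100003, 100004], 'TARGET': [1, 0, 0]})
test = pd.DataFrame({'SK_ID_CURR': [100001, 100005]})   # no TARGET column

# After concat, TARGET is NaN for every row that came from the test set
both = pd.concat([train, test])

# A null TARGET therefore identifies the test rows when splitting back
train_back = both[~both['TARGET'].isnull()]
test_back = both[both['TARGET'].isnull()].drop('TARGET', axis=1)
print(both.shape, train_back.shape, test_back.shape)
```

Because NaN forces a float dtype, TARGET is shown as 1.0/0.0 after the concat. The original row indexes are also kept, which is why `apps.info()` reports entries "0 to 48743".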
Split the training data into training and validation sets and train with an LGBM Classifier.¶
In [79]:
print(app_train.shape)
print(train_x.shape[0]+valid_x.shape[0])
#train_x.head()
valid_x.head()
(307511, 122)
307511
Out[79]:
NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | APPS_EXT_SOURCE_MEAN | APPS_EXT_SOURCE_STD | APPS_ANNUITY_CREDIT_RATIO | APPS_GOODS_CREDIT_RATIO | APPS_CREDIT_GOODS_DIFF | APPS_ANNUITY_INCOME_RATIO | APPS_CREDIT_INCOME_RATIO | APPS_GOODS_INCOME_RATIO | APPS_CNT_FAM_INCOME_RATIO | APPS_EMPLOYED_BIRTH_RATIO | APPS_INCOME_EMPLOYED_RATIO | APPS_INCOME_BIRTH_RATIO | APPS_CAR_BIRTH_RATIO | APPS_CAR_EMPLOYED_RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
277882 | 1 | 1 | 0 | 1 | 0 | 135000.0 | 382500.0 | 19125.0 | 382500.0 | 0 | 3 | 0 | 0 | 0 | 0.018209 | -20857 | NaN | -393.0 | -3748 | NaN | 1 | 0 | 0 | 1 | 0 | 0 | -1 | 1.0 | 3 | 3 | 4 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | NaN | 0.511305 | 0.612704 | 0.4351 | 0.3400 | 0.9816 | 0.7484 | 0.2205 | 0.48 | 0.4138 | 0.3333 | 0.375 | 0.1834 | 0.3387 | 0.4620 | 0.0347 | 0.0303 | 0.4433 | 0.3528 | 0.9816 | 0.7583 | 0.2225 | 0.4834 | 0.4138 | 0.3333 | 0.375 | 0.1876 | 0.3701 | 0.4813 | 0.0350 | 0.0321 | 0.4393 | 0.3400 | 0.9816 | 0.7518 | 0.2219 | 0.48 | 0.4138 | 0.3333 | 0.375 | 0.1866 | 0.3446 | 0.4703 | 0.0349 | 0.0310 | 0 | 0 | 0.4839 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -1417.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.562005 | 0.071700 | 0.050000 | 1.000000 | 0.0 | 0.141667 | 2.833333 | 2.833333 | 135000.0 | NaN | NaN | -6.472647 | NaN | NaN |
99911 | 1 | 0 | 0 | 0 | 0 | 225000.0 | 270000.0 | 13500.0 | 270000.0 | 0 | 2 | 0 | 1 | 0 | 0.032561 | -16666 | -958.0 | -585.0 | -216 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2.0 | 1 | 1 | 6 | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | NaN | 0.758884 | NaN | 0.3041 | 0.1829 | 0.9990 | 0.9864 | 0.0920 | 0.32 | 0.2759 | 0.3333 | 0.375 | 0.2138 | 0.2480 | 0.3079 | 0.0116 | 0.0407 | 0.3099 | 0.1898 | 0.9990 | 0.9869 | 0.0928 | 0.3222 | 0.2759 | 0.3333 | 0.375 | 0.2187 | 0.2709 | 0.3208 | 0.0117 | 0.0431 | 0.3071 | 0.1829 | 0.9990 | 0.9866 | 0.0926 | 0.32 | 0.2759 | 0.3333 | 0.375 | 0.2175 | 0.2522 | 0.3135 | 0.0116 | 0.0416 | 0 | 0 | 0.3538 | 6 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | -185.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.758884 | 0.151008 | 0.050000 | 1.000000 | 0.0 | 0.060000 | 1.200000 | 1.200000 | 112500.0 | 0.057482 | -234.864301 | -13.500540 | NaN | NaN |
51357 | 0 | 1 | 0 | 0 | 0 | 49500.0 | 71955.0 | 7137.0 | 67500.0 | 0 | 3 | 0 | 1 | 0 | 0.010276 | -22350 | NaN | -11780.0 | -5024 | NaN | 1 | 0 | 0 | 1 | 1 | 0 | -1 | 2.0 | 2 | 2 | 6 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | NaN | 0.191319 | 0.683269 | 0.0196 | 0.0000 | 0.9627 | NaN | NaN | 0.00 | 0.0690 | 0.0417 | NaN | 0.0156 | NaN | 0.0100 | NaN | 0.0000 | 0.0200 | 0.0000 | 0.9628 | NaN | NaN | 0.0000 | 0.0690 | 0.0417 | NaN | 0.0160 | NaN | 0.0104 | NaN | 0.0000 | 0.0198 | 0.0000 | 0.9627 | NaN | NaN | 0.00 | 0.0690 | 0.0417 | NaN | 0.0159 | NaN | 0.0101 | NaN | 0.0000 | -1 | 0 | 0.0078 | 4 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | -725.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.437294 | 0.347861 | 0.099187 | 0.938086 | 4455.0 | 0.144182 | 1.453636 | 1.363636 | 24750.0 | NaN | NaN | -2.214765 | NaN | NaN |
205461 | 1 | 0 | 1 | 0 | 0 | 135000.0 | 202500.0 | 10125.0 | 202500.0 | 0 | 2 | 0 | 1 | 0 | 0.035792 | -19559 | -3631.0 | -9757.0 | -3106 | 16.0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 2.0 | 2 | 2 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.901493 | 0.493923 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 0.0 | 0.0 | 0.0 | 0.0 | -3332.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.697708 | 0.288196 | 0.050000 | 1.000000 | 0.0 | 0.075000 | 1.500000 | 1.500000 | 67500.0 | 0.185643 | -37.179840 | -6.902193 | -0.000818 | -0.004406 |
55584 | 0 | 0 | 1 | 0 | 1 | 135000.0 | 539590.5 | 19381.5 | 445500.0 | 0 | 3 | 0 | 1 | 0 | 0.046220 | -23060 | NaN | -1493.0 | -5133 | 15.0 | 1 | 0 | 0 | 1 | 0 | 0 | -1 | 3.0 | 1 | 1 | 5 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | NaN | 0.740859 | 0.259468 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1 | -1 | NaN | -1 | -1 | 0.0 | 0.0 | 0.0 | 0.0 | -1031.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.500163 | 0.340395 | 0.035919 | 0.825626 | 94090.5 | 0.143567 | 3.996967 | 3.300000 | 45000.0 | NaN | NaN | -5.854293 | -0.000650 | NaN |
In [75]:
from sklearn.model_selection import train_test_split
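# ftr_app: the feature matrix, with the ID and target columns dropped.
ftr_app = apps_train.drop(['SK_ID_CURR', 'TARGET'], axis=1)  # apps_train is the label-encoded feature set
# target_app holds only the target values.
# app_train is the original DataFrame, exactly as first loaded from the CSV file.
# Lineage: app_train -> apps -> apps_train (loaded, concatenated, then split back out).
target_app = app_train['TARGET']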
train_x, valid_x, train_y, valid_y = train_test_split(ftr_app, target_app, test_size=0.3, random_state=2020)
train_x.shape, valid_x.shape
Out[75]:
((215257, 134), (92254, 134))
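The split above takes features from `apps_train` but the target from the original `app_train`, which only lines up if the concatenate-and-split round trip preserved row order. A minimal sketch of how that could be checked, and of taking the target from `apps_train` itself so that no such assumption is needed. The toy frames and their values are hypothetical, and the `pd.concat` / null-`TARGET` split is an assumed reconstruction of the lineage described in the comments, not code shown in this post.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values).
app_train = pd.DataFrame(
    {"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0], "AMT_CREDIT": [100.0, 200.0, 300.0]}
)
app_test = pd.DataFrame({"SK_ID_CURR": [4, 5], "AMT_CREDIT": [400.0, 500.0]})

# Assumed lineage: concatenate, engineer features on the combined frame,
# then split back using the rows where TARGET is missing as the test set.
apps = pd.concat([app_train, app_test])
apps_train = apps[apps["TARGET"].notnull()]
apps_test = apps[apps["TARGET"].isnull()].drop(columns="TARGET")

# Row order survived the round trip if the IDs match position by position.
ids_match = bool(
    (apps_train["SK_ID_CURR"].to_numpy() == app_train["SK_ID_CURR"].to_numpy()).all()
)

# Taking features and target from the same frame removes the ordering assumption.
ftr_app = apps_train.drop(columns=["SK_ID_CURR", "TARGET"])
target_app = apps_train["TARGET"].astype(int)
print(ids_match, ftr_app.shape, target_app.tolist())
```

If `ids_match` is ever `False`, the features and labels passed to `train_test_split` would be silently misaligned, so the check is cheap insurance.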
In [87]:
from lightgbm import LGBMClassifier
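clf = LGBMClassifier(
    n_jobs=-1,
    n_estimators=1000,
    learning_rate=0.02,
    num_leaves=32,
    subsample=0.8,
    max_depth=12,
    silent=-1,
    verbose=-1
)
# Track AUC on both the training and validation sets, log every 100 rounds,
# and stop if the validation metric has not improved for 100 rounds.
# Note: these arguments match the LightGBM version used here. LightGBM 4.0 removed
# `silent` from the constructor and `verbose` / `early_stopping_rounds` from fit();
# there, pass callbacks=[lightgbm.early_stopping(100), lightgbm.log_evaluation(100)] instead.
clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)], eval_metric='auc', verbose=100,
        early_stopping_rounds=100)
# boost_from_average=True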
[100] training's auc: 0.759726 training's binary_logloss: 0.24754 valid_1's auc: 0.749339 valid_1's binary_logloss: 0.249516
[200] training's auc: 0.780471 training's binary_logloss: 0.240508 valid_1's auc: 0.759905 valid_1's binary_logloss: 0.245532
[300] training's auc: 0.794494 training's binary_logloss: 0.235945 valid_1's auc: 0.763886 valid_1's binary_logloss: 0.244172
[400] training's auc: 0.806007 training's binary_logloss: 0.232261 valid_1's auc: 0.765383 valid_1's binary_logloss: 0.243635
[500] training's auc: 0.816276 training's binary_logloss: 0.229006 valid_1's auc: 0.765464 valid_1's binary_logloss: 0.243539
[600] training's auc: 0.825884 training's binary_logloss: 0.225871 valid_1's auc: 0.765668 valid_1's binary_logloss: 0.243463
[700] training's auc: 0.834999 training's binary_logloss: 0.222851 valid_1's auc: 0.76584 valid_1's binary_logloss: 0.243374
[800] training's auc: 0.843362 training's binary_logloss: 0.21994 valid_1's auc: 0.766152 valid_1's binary_logloss: 0.243268
Out[87]:
LGBMClassifier(learning_rate=0.02, max_depth=12, n_estimators=1000,
num_leaves=32, silent=-1, subsample=0.8, verbose=-1)
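One way to read the training log above is to look at the gap between training and validation AUC at each logged checkpoint. The sketch below copies the logged values by hand and is only meant as an illustration of that reading; the variable names are hypothetical.

```python
# AUC values copied from the training log above (logged every 100 iterations).
iterations = [100, 200, 300, 400, 500, 600, 700, 800]
train_auc = [0.759726, 0.780471, 0.794494, 0.806007, 0.816276, 0.825884, 0.834999, 0.843362]
valid_auc = [0.749339, 0.759905, 0.763886, 0.765383, 0.765464, 0.765668, 0.765840, 0.766152]

# Gap between training and validation AUC at each checkpoint.
gaps = [round(t - v, 6) for t, v in zip(train_auc, valid_auc)]

# Validation AUC gained over each 100-iteration step.
valid_gain = [round(b - a, 6) for a, b in zip(valid_auc, valid_auc[1:])]

for it, gap in zip(iterations, gaps):
    print(f"iter {it:4d}: train-valid AUC gap = {gap:.6f}")
print("validation gain per 100 iterations:", valid_gain)
```

In this log the gap grows at every checkpoint while validation AUC gains less than 0.001 per 100 iterations from iteration 400 onward, which suggests the later trees mostly fit the training set.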
In [82]:
from lightgbm import plot_importance
plot_importance(clf, figsize=(16, 32))
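# Compared with the importance plot from the LGBMClassifier in _app_baseline_01, the features engineered above from the important features now rank very high.
# EXT_SOURCE_3 was originally at the top, but APPS_ANNUITY_CREDIT_RATIO is now at the top.
# Combining the important features identified from the histograms with other related features into new columns can improve predictive performance.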
Out[82]:
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>
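The engineered `APPS_*` columns mentioned in the comments above appear to be simple ratios and differences of the raw amount columns. The sketch below recomputes a few of them for three rows copied from the table shown earlier. The formulas are an assumption inferred from the column names, although they do reproduce the values displayed in that table.

```python
import pandas as pd

# Raw amounts for three rows copied from the table shown earlier
# (the index is the row label in that table).
rows = pd.DataFrame(
    {
        "AMT_INCOME_TOTAL": [135000.0, 49500.0, 135000.0],
        "AMT_CREDIT": [382500.0, 71955.0, 539590.5],
        "AMT_ANNUITY": [19125.0, 7137.0, 19381.5],
        "AMT_GOODS_PRICE": [382500.0, 67500.0, 445500.0],
    },
    index=[277882, 51357, 55584],
)

# Assumed formulas, inferred from the column names.
feats = pd.DataFrame(index=rows.index)
feats["APPS_ANNUITY_CREDIT_RATIO"] = rows["AMT_ANNUITY"] / rows["AMT_CREDIT"]
feats["APPS_GOODS_CREDIT_RATIO"] = rows["AMT_GOODS_PRICE"] / rows["AMT_CREDIT"]
feats["APPS_CREDIT_GOODS_DIFF"] = rows["AMT_CREDIT"] - rows["AMT_GOODS_PRICE"]
feats["APPS_ANNUITY_INCOME_RATIO"] = rows["AMT_ANNUITY"] / rows["AMT_INCOME_TOTAL"]
feats["APPS_CREDIT_INCOME_RATIO"] = rows["AMT_CREDIT"] / rows["AMT_INCOME_TOTAL"]

print(feats.round(6))
```

Ratios like these put loan size, repayment burden, and income on a common scale, which may be why a tree model finds them easier to split on than the raw amounts.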
Predict on the test data with the trained classifier and submit the results to Kaggle¶
In [83]:
preds = clf.predict_proba(apps_test.drop(['SK_ID_CURR'], axis=1))[:, 1]
In [84]:
app_test['TARGET'] = preds
app_test[['SK_ID_CURR', 'TARGET']].to_csv('apps_baseline_02.csv', index=False)
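Before uploading, it can help to sanity-check the submission. The sketch below uses hypothetical toy values in place of the real test frame and model output. It builds the submission from the same frame that produced the predictions, so the IDs and probabilities cannot drift apart, and then checks row count, ID uniqueness, and the probability range.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): the test features and the predicted
# probabilities that clf.predict_proba(...)[:, 1] would return for them.
apps_test = pd.DataFrame(
    {"SK_ID_CURR": [100001, 100005, 100013], "AMT_CREDIT": [1.0, 2.0, 3.0]}
)
preds = [0.04, 0.12, 0.03]

# Pair each prediction with the ID from the frame it was computed from.
submission = pd.DataFrame(
    {"SK_ID_CURR": apps_test["SK_ID_CURR"].to_numpy(), "TARGET": preds}
)

# Basic checks: one row per test ID, no missing values, valid probabilities.
assert len(submission) == len(apps_test)
assert submission["SK_ID_CURR"].is_unique
assert submission["TARGET"].notnull().all()
assert submission["TARGET"].between(0.0, 1.0).all()

# Render to CSV text here; pass a file path to to_csv() to write the file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```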