Data Cleansing and Preprocessing

What Is RMSLE?

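If RMSLE is 0 (is that even possible? Yes, when every prediction p_i equals its actual value a_i), then exp(RMSLE) is 1, which means the average ratio of predicted to actual values is 1. This matches the interpretation of RMSLE as a measure of the ratio between predictions and actual values.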
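To make the ratio interpretation concrete, here is a minimal sketch. It assumes the common log(1 + x) form of RMSLE, with a_i as the actual values and p_i as the predictions; the helper name `rmsle_sketch` and the sample numbers are made up for illustration and do not come from the bike sharing data.

```python
import numpy as np

def rmsle_sketch(actual, predicted):
    """RMSLE using log(1 + x); hypothetical helper for illustration."""
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return np.sqrt(np.mean(log_diff ** 2))

# Illustrative values only, not taken from the dataset.
actual = np.array([9.0, 99.0, 999.0])

# Perfect predictions: every log difference is 0, so RMSLE is 0 and exp(RMSLE) is 1.
perfect = rmsle_sketch(actual, actual)

# Predictions chosen so that (p_i + 1) is exactly twice (a_i + 1):
# every log difference is log(2), so exp(RMSLE) is (approximately) 2.
doubled = rmsle_sketch(actual, 2 * (actual + 1) - 1)

print(perfect, np.exp(perfect))
print(doubled, np.exp(doubled))
```

Under this definition the "ratio" is between p_i + 1 and a_i + 1, which is close to the plain ratio p_i / a_i when the values are large.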

MSE

The closer the MSE is to 0, the closer the predicted values are to the actual values, so the model can be considered more accurate.
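As a small worked example, the sketch below computes MSE by hand and with scikit-learn's `mean_squared_error`. The arrays are illustrative values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# MSE is the mean of the squared differences: (1 + 0 + 4) / 3.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# Predictions identical to the targets give an MSE of exactly 0.
mse_perfect = mean_squared_error(y_true, y_true)

print(mse_manual, mse_sklearn, mse_perfect)
```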

Root Mean Square Error (RMSE)

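A lower RMSE is better: it means the predictions deviate less from the actual data.
Ridge and Lasso are both judged this way, since the goal of each is to bring its predictions close to the actual values.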
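RMSE is the square root of MSE, so it is expressed in the same units as the target. The following sketch, using made-up numbers, compares it with MAE: both prediction sets have the same MAE, but the one containing a single large miss has a higher RMSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Illustrative values only.
y_true = np.array([100.0, 100.0, 100.0, 100.0])
pred_even = np.array([90.0, 110.0, 90.0, 110.0])       # four errors of 10
pred_outlier = np.array([100.0, 100.0, 100.0, 140.0])  # one error of 40

rmse_even = np.sqrt(mean_squared_error(y_true, pred_even))
rmse_outlier = np.sqrt(mean_squared_error(y_true, pred_outlier))
mae_even = mean_absolute_error(y_true, pred_even)
mae_outlier = mean_absolute_error(y_true, pred_outlier)

print(rmse_even, rmse_outlier)  # squaring makes the single large error dominate
print(mae_even, mae_outlier)    # the mean absolute error is the same in both cases
```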

In [21]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

bike_df = pd.read_csv('./train.csv')
print(bike_df.shape)
bike_df.head(3)

(10886, 12)

Out[21]:

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32

In [22]:

bike_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB

In [23]:

bike_df.datetime.apply(pd.to_datetime)

Out[23]:

0       2011-01-01 00:00:00
1       2011-01-01 01:00:00
2       2011-01-01 02:00:00
3       2011-01-01 03:00:00
4       2011-01-01 04:00:00
                ...        
10881   2012-12-19 19:00:00
10882   2012-12-19 20:00:00
10883   2012-12-19 21:00:00
10884   2012-12-19 22:00:00
10885   2012-12-19 23:00:00
Name: datetime, Length: 10886, dtype: datetime64[ns]

In [24]:

# Convert the string column to datetime type.
bike_df['datetime'] = pd.to_datetime(bike_df['datetime'])

# Extract year, month, day, and hour from the datetime column
bike_df['year'] = bike_df['datetime'].dt.year
bike_df['month'] = bike_df['datetime'].dt.month
bike_df['day'] = bike_df['datetime'].dt.day
bike_df['hour'] = bike_df['datetime'].dt.hour
bike_df.head(3)

Out[24]:

	datetime	season	weather	temp	atemp	humidity	casual	registered	count	year	month	day	hour
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16	2011	1	1	0
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40	2011	1	1	1
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32	2011	1	1	2

Dropping Unnecessary Columns

Passing axis=1 to drop() removes columns (axis=0 would remove rows).

In [25]:

drop_columns = ['datetime','casual','registered']
bike_df.drop(drop_columns, axis=1,inplace=True) 
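One likely reason for dropping casual and registered is that, in the rows displayed earlier, count equals their sum, so keeping them as features would give the target away. The sketch below checks this on those three rows only; the frame name `sample` is made up for illustration.

```python
import pandas as pd

# The casual, registered, and count values from the three rows shown by bike_df.head(3).
sample = pd.DataFrame({
    'casual': [3, 8, 5],
    'registered': [13, 32, 27],
    'count': [16, 40, 32],
})

# In these rows, count is exactly casual + registered.
leaks_target = (sample['casual'] + sample['registered'] == sample['count']).all()

# axis=1 removes columns; without inplace=True the original frame is left untouched.
features = sample.drop(['casual', 'registered'], axis=1)

print(leaks_target)
print(features.columns.tolist())
```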

In [26]:

bike_df.isnull().sum()

Out[26]:

season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
count         0
year          0
month         0
day           0
hour          0
dtype: int64

In [27]:

from sklearn.metrics import mean_squared_error, mean_absolute_error

# RMSLE is computed with log1p() rather than log(): log(0) is -inf, whereas log1p(0) is 0.
# bike_df has no null values, but a prediction can still be 0, so log1p() is the safer choice.
def rmsle(y, pred):
    log_y = np.log1p(y)
    log_pred = np.log1p(pred)
    squared_error = (log_y - log_pred) ** 2
    rmsle_val = np.sqrt(np.mean(squared_error))
    return rmsle_val

# Compute RMSE using scikit-learn's mean_squared_error()
def rmse(y, pred):
    return np.sqrt(mean_squared_error(y, pred))

# Compute RMSLE, RMSE, and MAE together
def evaluate_regr(y, pred):
    rmsle_val = rmsle(y, pred)
    rmse_val = rmse(y, pred)
    # MAE is computed with scikit-learn's mean_absolute_error()
    mae_val = mean_absolute_error(y, pred)
    print('RMSLE: {0:.3f}, RMSE: {1:.3f}, MAE: {2:.3f}'.format(rmsle_val, rmse_val, mae_val))
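As a quick check of how these two metrics differ, the sketch below restates minimal versions of `rmsle` and `rmse` so that it runs on its own, then compares an under-prediction and an over-prediction with the same absolute error. The numbers are illustrative.

```python
import numpy as np

def rmsle(y, pred):
    # Same log1p-based definition as in the cell above.
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(pred)) ** 2))

def rmse(y, pred):
    # Plain NumPy version of RMSE, to keep this sketch self-contained.
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(pred)) ** 2))

y = np.array([100.0])
under = np.array([50.0])   # 50 below the actual value
over = np.array([150.0])   # 50 above the actual value

# RMSE only sees the absolute error, so both cases score the same.
print(rmse(y, under), rmse(y, over))

# RMSLE works on log ratios, so the under-prediction is penalized more.
print(rmsle(y, under), rmsle(y, over))
```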

Log Transformation, Feature Encoding, and Model Training/Prediction/Evaluation

In [28]:

from sklearn.model_selection import train_test_split , GridSearchCV
from sklearn.linear_model import LinearRegression , Ridge , Lasso

y_target = bike_df['count']
X_features = bike_df.drop(['count'],axis=1,inplace=False)

X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.3, random_state=0)

lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)
pred = lr_reg.predict(X_test)

evaluate_regr(y_test ,pred)

RMSLE: 1.165, RMSE: 140.900, MAE: 105.924

In [29]:

def get_top_error_data(y_test, pred, n_tops = 5):
    # Build a DataFrame that puts the actual rental count next to the predicted value for comparison.
    result_df = pd.DataFrame(y_test.values, columns=['real_count'])
    result_df['predicted_count']= np.round(pred)
    result_df['diff'] = np.abs(result_df['real_count'] - result_df['predicted_count'])
    # Print the rows with the largest gap between predicted and actual values first.
    print(result_df.sort_values('diff', ascending=False)[:n_tops])
    
get_top_error_data(y_test,pred,n_tops=5)

      real_count  predicted_count   diff
1618         890            322.0  568.0
3151         798            241.0  557.0
966          884            327.0  557.0
412          745            194.0  551.0
2817         856            310.0  546.0

In [30]:

y_target.hist()

Out[30]:

<AxesSubplot:>

In [31]:

y_log_transform = np.log1p(y_target)
y_log_transform.hist()

Out[31]:

<AxesSubplot:>
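The histograms themselves are not reproduced here. A common reason for this step is that a count-like target tends to be right-skewed, and log1p pulls the long tail in. The sketch below illustrates that effect on synthetic log-normal data, which is an assumption standing in for the real count column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (not the real dataset); the seed and parameters are arbitrary.
rng = np.random.default_rng(0)
synthetic_target = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

# Skewness near 0 indicates a roughly symmetric distribution.
skew_raw = synthetic_target.skew()
skew_log = np.log1p(synthetic_target).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```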

In [32]:

# Log-transform the target column count with log1p
y_target_log = np.log1p(y_target)

# Split into train/test sets using the log-transformed y_target_log
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target_log, test_size=0.3, random_state=0)
lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)
pred = lr_reg.predict(X_test)

# The test targets were log-transformed, so convert them back to the original scale with expm1
y_test_exp = np.expm1(y_test)

# The predictions come from a model trained on the log-transformed target, so convert them back with expm1 as well
pred_exp = np.expm1(pred)

evaluate_regr(y_test_exp ,pred_exp)

RMSLE: 1.017, RMSE: 162.594, MAE: 109.286
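The conversion back to the original scale works because expm1 is the inverse of log1p. A minimal round-trip check, with illustrative count values:

```python
import numpy as np

# Illustrative counts; 0 is included to show that log1p is defined there.
counts = np.array([0.0, 1.0, 16.0, 40.0, 977.0])

logged = np.log1p(counts)     # log(1 + x)
restored = np.expm1(logged)   # exp(x) - 1, which undoes log1p

print(logged)
print(restored)
```

The metrics in the cell above are therefore computed on the original count scale, even though the model was fitted on the log scale.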

In [33]:

coef = pd.Series(lr_reg.coef_, index=X_features.columns)
coef_sort = coef.sort_values(ascending=False)
sns.barplot(x=coef_sort.values, y=coef_sort.index)

Out[33]:

<AxesSubplot:>

In [34]:

# One-hot encode the categorical features: 'year', 'month', 'day', 'hour', and the others listed below
X_features_ohe = pd.get_dummies(X_features, columns=['year', 'month','day', 'hour', 'holiday',
                                              'workingday','season','weather'])
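To show what `pd.get_dummies` does with the `columns` argument, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with one numeric column and two categorical columns.
toy = pd.DataFrame({
    'temp': [9.84, 9.02, 15.58],
    'season': [1, 1, 4],
    'weather': [1, 2, 1],
})

# Each listed column is replaced by one indicator column per distinct value;
# columns that are not listed (temp) are kept unchanged.
toy_ohe = pd.get_dummies(toy, columns=['season', 'weather'])

print(toy_ohe.columns.tolist())
```

Depending on the pandas version, the indicator columns may be displayed as 0/1 or as True/False.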

In [38]:

X_features_ohe

Out[38]:

	temp	atemp	humidity	windspeed	year_2011	year_2012	month_1	month_2	month_3	month_4	...	workingday_0	workingday_1	season_1	season_2	season_3	season_4	weather_1	weather_2	weather_3	weather_4
0	9.84	14.395	81	0.0000	1	0	1	0	0	0	...	1	0	1	0	0	0	1	0	0	0
1	9.02	13.635	80	0.0000	1	0	1	0	0	0	...	1	0	1	0	0	0	1	0	0	0
2	9.02	13.635	80	0.0000	1	0	1	0	0	0	...	1	0	1	0	0	0	1	0	0	0
3	9.84	14.395	75	0.0000	1	0	1	0	0	0	...	1	0	1	0	0	0	1	0	0	0
4	9.84	14.395	75	0.0000	1	0	1	0	0	0	...	1	0	1	0	0	0	1	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
10881	15.58	19.695	50	26.0027	0	1	0	0	0	0	...	0	1	0	0	0	1	1	0	0	0
10882	14.76	17.425	57	15.0013	0	1	0	0	0	0	...	0	1	0	0	0	1	1	0	0	0
10883	13.94	15.910	61	15.0013	0	1	0	0	0	0	...	0	1	0	0	0	1	1	0	0	0
10884	13.94	17.425	61	6.0032	0	1	0	0	0	0	...	0	1	0	0	0	1	1	0	0	0
10885	13.12	16.665	66	8.9981	0	1	0	0	0	0	...	0	1	0	0	0	1	1	0	0	0

10886 rows × 73 columns

In [35]:

# Split the one-hot encoded feature set into train/test data.
X_train, X_test, y_train, y_test = train_test_split(X_features_ohe, y_target_log,
                                                    test_size=0.3, random_state=0)

# Fit the given model on the train set and print its evaluation metrics on the test set
def get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=False):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    if is_expm1 :
        y_test = np.expm1(y_test)
        pred = np.expm1(pred)
    print('###',model.__class__.__name__,'###')
    evaluate_regr(y_test, pred)
# end of function get_model_predict    

# Evaluate each model
lr_reg = LinearRegression()
ridge_reg = Ridge(alpha=10)
lasso_reg = Lasso(alpha=0.01)

for model in [lr_reg, ridge_reg, lasso_reg]:
    get_model_predict(model,X_train, X_test, y_train, y_test,is_expm1=True)

### LinearRegression ###
RMSLE: 0.590, RMSE: 97.690, MAE: 63.383
### Ridge ###
RMSLE: 0.590, RMSE: 98.529, MAE: 63.893
### Lasso ###
RMSLE: 0.635, RMSE: 113.219, MAE: 72.803
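The alpha values above are fixed by hand. GridSearchCV is imported earlier in the post but never used; one way it could be applied to choose alpha is sketched below. The synthetic data from `make_regression` and the alpha grid are assumptions for illustration, not the post's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real features and target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate alpha values (hypothetical grid).
alphas = [0.1, 1, 10, 100]
grid = GridSearchCV(Ridge(), {'alpha': alphas},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(X, y)

# best_score_ is the negated mean MSE across folds, so flip the sign before the square root.
best_alpha = grid.best_params_['alpha']
best_rmse = np.sqrt(-grid.best_score_)

print(best_alpha, best_rmse)
```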

In [36]:

coef = pd.Series(lr_reg.coef_ , index=X_features_ohe.columns)
coef_sort = coef.sort_values(ascending=False)[:10]
sns.barplot(x=coef_sort.values , y=coef_sort.index)

Out[36]:

<AxesSubplot:>

In [37]:

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


# Evaluate the random forest and GBM models
rf_reg = RandomForestRegressor(n_estimators=500)
gbm_reg = GradientBoostingRegressor(n_estimators=500)

for model in [rf_reg, gbm_reg]:
    # ndarrays are passed instead of DataFrames; depending on the version, XGBoost can raise an error on DataFrame input.
    get_model_predict(model,X_train.values, X_test.values, y_train.values, y_test.values,is_expm1=True)

### RandomForestRegressor ###
RMSLE: 0.354, RMSE: 50.196, MAE: 31.034
### GradientBoostingRegressor ###
RMSLE: 0.330, RMSE: 53.344, MAE: 32.747


with_open_형준

5.9 Regression Practice - Bike Sharing Demand (Revised)__UCI Dataset
