Polynomial linear regression: extending the linear model with quadratic and cubic terms.
01_Polynomial Regression
In [18]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# A simple 2x2 feature matrix
X = np.arange(4).reshape(2, 2)
print('First-degree polynomial feature matrix:\n', X)
First-degree polynomial feature matrix:
[[0 1]
[2 3]]
In [19]:
# Use PolynomialFeatures to expand X into degree-2 polynomial features
poly = PolynomialFeatures(degree=2)
poly.fit(X)
poly_ftr = poly.transform(X)
print('Degree-2 polynomial feature matrix:\n', poly_ftr)
Degree-2 polynomial feature matrix:
[[1. 0. 1. 0. 0. 1.]
[1. 2. 3. 4. 6. 9.]]
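The six columns are the degree-2 terms generated from the two input features: [1, x1, x2, x1^2, x1*x2, x2^2]. A quick way to confirm the ordering (a small addition, not in the original cell; `get_feature_names_out` is available from scikit-learn 1.0):
In [ ]:
# Print the names of the generated polynomial terms
print(poly.get_feature_names_out())
# -> ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']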
- Train LinearRegression on degree-3 polynomial features against cubic target values, then check the regression coefficients
In [21]:
def polynomial_func(X):
    # Element-wise cubic: y = 1 + 2x + x^2 + x^3
    y = 1 + 2*X + X**2 + X**3
    return y

X = np.arange(4).reshape(2, 2)
print('First-degree polynomial feature matrix:\n', X)
y = polynomial_func(X)
print('Cubic polynomial target values:\n', y)
First-degree polynomial feature matrix:
[[0 1]
[2 3]]
Cubic polynomial target values:
[[ 1 5]
[17 43]]
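As a quick sanity check, for x = 3 (the last element of X) the function gives 1 + 2*3 + 3^2 + 3^3 = 1 + 6 + 9 + 27 = 43, which matches the last entry of y.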
In [23]:
from sklearn.linear_model import LinearRegression
# Expand X into degree-3 polynomial features
poly_ftr = PolynomialFeatures(degree=3).fit_transform(X)
print('Degree-3 polynomial feature matrix:\n', poly_ftr)

# Fit a linear regression on the expanded features
model = LinearRegression()
model.fit(poly_ftr, y)
print('Polynomial regression coefficients:\n', np.round(model.coef_, 2))
print('Polynomial regression coefficient shape:\n', model.coef_.shape)
Degree-3 polynomial feature matrix:
[[ 1. 0. 1. 0. 0. 1. 0. 0. 0. 1.]
[ 1. 2. 3. 4. 6. 9. 8. 12. 18. 27.]]
Polynomial regression coefficients
[[0. 0.02 0.02 0.05 0.07 0.1 0.1 0.14 0.22 0.31]
[0. 0.06 0.06 0.11 0.17 0.23 0.23 0.34 0.51 0.74]]
Polynomial regression coefficient shape
(2, 10)
02_Polynomial Regression Using a Pipeline
In [24]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np
def polynomial_func(X):
    # Element-wise cubic: y = 1 + 2x + x^2 + x^3
    y = 1 + 2*X + X**2 + X**3
    return y

# Chain the feature expansion and the regression into a single estimator
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression())])
X = np.arange(4).reshape(2, 2)
y = polynomial_func(X)
model = model.fit(X, y)
print('PolynomialFeatures regression coefficients:\n', np.round(model.named_steps['linear'].coef_, 2))
PolynomialFeatures regression coefficients:
[[0. 0.02 0.02 0.05 0.07 0.1 0.1 0.14 0.22 0.31]
[0. 0.06 0.06 0.11 0.17 0.23 0.23 0.34 0.51 0.74]]
03_Predicting Boston House Prices with Polynomial Regression
In [25]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
boston = load_boston()
bostondf = pd.DataFrame(boston.data, columns=boston.feature_names)
bostondf['PRICE'] = boston.target
print(bostondf.shape)
bostondf.head()
(506, 14)
C:\Users\82105\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original
source::
import pandas as pd
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
Alternative datasets include the California housing dataset (i.e.
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows::
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
for the California housing dataset and::
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
for the Ames housing dataset.
warnings.warn(msg, category=FutureWarning)
Out[25]:
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
In [27]:
from sklearn.model_selection import train_test_split

y_target = bostondf['PRICE']
X_data = bostondf.drop(['PRICE'], axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_target, test_size=0.3, random_state=156)
In [28]:
p_model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                    ('linear', LinearRegression())])
p_model
Out[28]:
Pipeline(steps=[('poly', PolynomialFeatures(degree=3)),
('linear', LinearRegression())])
Caution: raising the degree in polynomial regression increases the risk of overfitting
In [32]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

p_model.fit(X_train, y_train)
y_preds = p_model.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)
print('mse: {0:.3f}, rmse: {1:.3f}'.format(mse, rmse))
print('variance score: {0:.3f}'.format(r2_score(y_test, y_preds)))  # the degree-3 model badly overfits
mse: 79625.594, rmse: 282.180
variance score: -1116.598
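To see how quickly things deteriorate, here is a minimal sketch (an addition, reusing the split, pipeline pieces, and metric imports from above) that sweeps the degree and compares test-set RMSE; the exact numbers depend on the split, but RMSE typically degrades sharply as the degree grows:
In [ ]:
# Sweep the polynomial degree and compare test-set RMSE on the same split
for degree in [1, 2, 3]:
    m = Pipeline([('poly', PolynomialFeatures(degree=degree)),
                  ('linear', LinearRegression())])
    m.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, m.predict(X_test)))
    print('degree={0}: test rmse={1:.3f}'.format(degree, rmse))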
Overview of Regularized Linear Regression
- Regularization is needed to prevent overfitting.
- The original goal was simply to minimize RSS (the error term), but pursuing that alone lets the regression coefficients grow large, which leads to overfitting.
- So it became necessary to balance minimizing RSS against controlling the size of the regression coefficients.
- -> The goal of the cost function becomes this balance: minimize RSS while keeping the regression coefficient values under control (see the cost-function sketch just below).
- Root Mean Square Error (RMSE): lower is better, meaning the predictions deviate less from the actual values.
- Ridge regression (L2 regularization), demonstrated next.
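To make the balance concrete, Ridge regression minimizes RSS plus an L2 penalty on the coefficient vector $w$, weighted by the hyperparameter $\alpha$; this is the objective that `sklearn.linear_model.Ridge` optimizes:

$$\text{cost}(w) = \underbrace{\sum_{i=1}^{N}\bigl(y_i - w^{\top}x_i\bigr)^{2}}_{\text{RSS}} \;+\; \alpha\,\lVert w \rVert_2^2$$

With $\alpha = 0$ this reduces to ordinary least squares; the larger $\alpha$, the more strongly the coefficients are shrunk toward zero.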
In [11]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

boston = load_boston()
bostondf = pd.DataFrame(boston.data, columns=boston.feature_names)
bostondf['PRICE'] = boston.target
y_target = bostondf['PRICE']
X_data = bostondf.drop(['PRICE'], axis=1, inplace=False)

# Ridge regression with alpha=10, scored by 5-fold cross-validation
ridge = Ridge(alpha=10)
neg_mse_scores = cross_val_score(ridge, X_data, y_target, scoring='neg_mean_squared_error', cv=5)
rmse_scores = np.sqrt(-1 * neg_mse_scores)
avg_rmse = np.mean(rmse_scores)
print('5-fold individual negative MSE scores:', np.round(neg_mse_scores, 3))
print('5-fold individual RMSE scores:', np.round(rmse_scores, 3))
print('5-fold average RMSE: {0:.3f}'.format(avg_rmse))
5-fold individual negative MSE scores: [-11.422 -24.294 -28.144 -74.599 -28.517]
5-fold individual RMSE scores: [3.38  4.929 5.305 8.637 5.34 ]
5-fold average RMSE: 5.518
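Note the sign convention: scikit-learn scorers treat higher as better, so `scoring='neg_mean_squared_error'` returns negated MSE values; multiplying by -1 before taking the square root recovers the usual RMSE.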
Measuring RMSE while varying alpha over 0, 0.1, 1, 10, 100
In [13]:
alphas = [0, 0.1, 1, 10, 100]
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    neg_mse_scores = cross_val_score(ridge, X_data, y_target, scoring='neg_mean_squared_error', cv=5)
    avg_rmse = np.mean(np.sqrt(-1 * neg_mse_scores))
    print('alpha {0}: 5-fold average RMSE: {1:.3f}'.format(alpha, avg_rmse))
alpha 0: 5-fold average RMSE: 5.829
alpha 0.1: 5-fold average RMSE: 5.788
alpha 1: 5-fold average RMSE: 5.653
alpha 10: 5-fold average RMSE: 5.518
alpha 100: 5-fold average RMSE: 5.330
- On this dataset, RMSE falls steadily as alpha grows, i.e., stronger regularization keeps improving cross-validated performance over the range tried (up to alpha = 100).
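Why does a larger penalty help here? A minimal sketch (an addition, not part of the original notebook) that refits Ridge on the full data at each alpha and prints the mean coefficient magnitude; the L2 penalty visibly shrinks the coefficients as alpha grows:
In [ ]:
# Fit Ridge at each alpha and watch the average coefficient magnitude shrink
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_data, y_target)
    print('alpha {0}: mean |coef| = {1:.3f}'.format(alpha, np.mean(np.abs(ridge.coef_))))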