Polynomial linear regression: extending the linear model with quadratic and cubic terms.
01_Polynomial Regression
In [18]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# A simple 2x2 feature matrix
X = np.arange(4).reshape(2, 2)
print('First-degree polynomial feature matrix:\n', X)
First-degree polynomial feature matrix:
[[0 1]
[2 3]]
In [19]:
# Use PolynomialFeatures to expand X into degree-2 polynomial features
poly = PolynomialFeatures(degree=2)
poly.fit(X)
poly_ftr = poly.transform(X)
print('Degree-2 polynomial feature matrix:\n', poly_ftr)
Degree-2 polynomial feature matrix:
[[1. 0. 1. 0. 0. 1.]
[1. 2. 3. 4. 6. 9.]]
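The six columns are the degree-2 terms generated from the two input features: [1, x1, x2, x1^2, x1*x2, x2^2]. A quick way to confirm the ordering (a small addition, not in the original cell; `get_feature_names_out` is available from scikit-learn 1.0):
In [ ]:
# Print the names of the generated polynomial terms
print(poly.get_feature_names_out())
# -> ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']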
- Train LinearRegression on degree-3 polynomial features against cubic target values, then check the regression coefficients
In [21]:
def polynomial_func(X):
    # Element-wise cubic: y = 1 + 2x + x^2 + x^3
    y = 1 + 2*X + X**2 + X**3
    return y

X = np.arange(4).reshape(2, 2)
print('First-degree polynomial feature matrix:\n', X)
y = polynomial_func(X)
print('Cubic polynomial target values:\n', y)
First-degree polynomial feature matrix:
[[0 1]
[2 3]]
Cubic polynomial target values:
[[ 1 5]
[17 43]]
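As a quick sanity check, for x = 3 (the last element of X) the function gives 1 + 2*3 + 3^2 + 3^3 = 1 + 6 + 9 + 27 = 43, which matches the last entry of y.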
In [23]:
from sklearn.linear_model import LinearRegression
# Expand X into degree-3 polynomial features
poly_ftr = PolynomialFeatures(degree=3).fit_transform(X)
print('Degree-3 polynomial feature matrix:\n', poly_ftr)

# Fit a linear regression on the expanded features
model = LinearRegression()
model.fit(poly_ftr, y)
print('Polynomial regression coefficients:\n', np.round(model.coef_, 2))
print('Polynomial regression coefficient shape:\n', model.coef_.shape)
Degree-3 polynomial feature matrix:
[[ 1. 0. 1. 0. 0. 1. 0. 0. 0. 1.]
[ 1. 2. 3. 4. 6. 9. 8. 12. 18. 27.]]
Polynomial regression coefficients
[[0. 0.02 0.02 0.05 0.07 0.1 0.1 0.14 0.22 0.31]
[0. 0.06 0.06 0.11 0.17 0.23 0.23 0.34 0.51 0.74]]
Polynomial regression coefficient shape
(2, 10)
02_Polynomial Regression Using a Pipeline
In [24]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np
def polynomial_func(X):
    # Element-wise cubic: y = 1 + 2x + x^2 + x^3
    y = 1 + 2*X + X**2 + X**3
    return y

# Chain the feature expansion and the regression into a single estimator
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression())])
X = np.arange(4).reshape(2, 2)
y = polynomial_func(X)
model = model.fit(X, y)
print('PolynomialFeatures regression coefficients:\n', np.round(model.named_steps['linear'].coef_, 2))
PolynomialFeatures regression coefficients:
[[0. 0.02 0.02 0.05 0.07 0.1 0.1 0.14 0.22 0.31]
[0. 0.06 0.06 0.11 0.17 0.23 0.23 0.34 0.51 0.74]]
03_Predicting Boston House Prices with Polynomial Regression
In [25]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
boston = load_boston()
bostondf = pd.DataFrame(boston.data, columns=boston.feature_names)
bostondf['PRICE'] = boston.target
print(bostondf.shape)
bostondf.head()
(506, 14)
C:\Users\82105\anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; `load_boston` is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original
source::
import pandas as pd
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
Alternative datasets include the California housing dataset (i.e.
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows::
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
for the California housing dataset and::
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
for the Ames housing dataset.
warnings.warn(msg, category=FutureWarning)
Out[25]:
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
In [27]:
from sklearn.model_selection import train_test_split

y_target = bostondf['PRICE']
X_data = bostondf.drop(['PRICE'], axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_target, test_size=0.3, random_state=156)
In [28]:
p_model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                    ('linear', LinearRegression())])
p_model
Out[28]:
Pipeline(steps=[('poly', PolynomialFeatures(degree=3)),
('linear', LinearRegression())])
Caution: raising the degree in polynomial regression increases the risk of overfitting
In [32]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

p_model.fit(X_train, y_train)
y_preds = p_model.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)
print('mse: {0:.3f}, rmse: {1:.3f}'.format(mse, rmse))
print('variance score: {0:.3f}'.format(r2_score(y_test, y_preds)))  # the degree-3 model badly overfits
mse: 79625.594, rmse: 282.180
variance score: -1116.598
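To see how quickly things deteriorate, here is a minimal sketch (an addition, reusing the split, pipeline pieces, and metric imports from above) that sweeps the degree and compares test-set RMSE; the exact numbers depend on the split, but RMSE typically degrades sharply as the degree grows:
In [ ]:
# Sweep the polynomial degree and compare test-set RMSE on the same split
for degree in [1, 2, 3]:
    m = Pipeline([('poly', PolynomialFeatures(degree=degree)),
                  ('linear', LinearRegression())])
    m.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, m.predict(X_test)))
    print('degree={0}: test rmse={1:.3f}'.format(degree, rmse))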
Overview of Regularized Linear Regression
- Regularization is needed to prevent overfitting.
- The original goal was simply to minimize RSS (the error term), but pursuing that alone lets the regression coefficients grow large, which leads to overfitting.
- So it became necessary to balance minimizing RSS against controlling the size of the regression coefficients.
- -> The goal of the cost function becomes this balance: minimize RSS while keeping the regression coefficient values under control (see the cost-function sketch just below).
- Root Mean Square Error (RMSE): lower is better, meaning the predictions deviate less from the actual values.
- Ridge regression (L2 regularization), demonstrated next.
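To make the balance concrete, Ridge regression minimizes RSS plus an L2 penalty on the coefficient vector $w$, weighted by the hyperparameter $\alpha$; this is the objective that `sklearn.linear_model.Ridge` optimizes:

$$\text{cost}(w) = \underbrace{\sum_{i=1}^{N}\bigl(y_i - w^{\top}x_i\bigr)^{2}}_{\text{RSS}} \;+\; \alpha\,\lVert w \rVert_2^2$$

With $\alpha = 0$ this reduces to ordinary least squares; the larger $\alpha$, the more strongly the coefficients are shrunk toward zero.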
In [11]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

boston = load_boston()
bostondf = pd.DataFrame(boston.data, columns=boston.feature_names)
bostondf['PRICE'] = boston.target
y_target = bostondf['PRICE']
X_data = bostondf.drop(['PRICE'], axis=1, inplace=False)

# Ridge regression with alpha=10, scored by 5-fold cross-validation
ridge = Ridge(alpha=10)
neg_mse_scores = cross_val_score(ridge, X_data, y_target, scoring='neg_mean_squared_error', cv=5)
rmse_scores = np.sqrt(-1 * neg_mse_scores)
avg_rmse = np.mean(rmse_scores)
print('5-fold individual negative MSE scores:', np.round(neg_mse_scores, 3))
print('5-fold individual RMSE scores:', np.round(rmse_scores, 3))
print('5-fold average RMSE: {0:.3f}'.format(avg_rmse))
5-fold individual negative MSE scores: [-11.422 -24.294 -28.144 -74.599 -28.517]
5-fold individual RMSE scores: [3.38  4.929 5.305 8.637 5.34 ]
5-fold average RMSE: 5.518
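Note the sign convention: scikit-learn scorers treat higher as better, so `scoring='neg_mean_squared_error'` returns negated MSE values; multiplying by -1 before taking the square root recovers the usual RMSE.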
Measuring RMSE while varying alpha over 0, 0.1, 1, 10, 100
In [13]:
alphas = [0, 0.1, 1, 10, 100]
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    neg_mse_scores = cross_val_score(ridge, X_data, y_target, scoring='neg_mean_squared_error', cv=5)
    avg_rmse = np.mean(np.sqrt(-1 * neg_mse_scores))
    print('alpha {0}: 5-fold average RMSE: {1:.3f}'.format(alpha, avg_rmse))
alpha 0: 5-fold average RMSE: 5.829
alpha 0.1: 5-fold average RMSE: 5.788
alpha 1: 5-fold average RMSE: 5.653
alpha 10: 5-fold average RMSE: 5.518
alpha 100: 5-fold average RMSE: 5.330
- On this dataset, RMSE falls steadily as alpha grows, i.e., stronger regularization keeps improving cross-validated performance over the range tried (up to alpha = 100).
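Why does a larger penalty help here? A minimal sketch (an addition, not part of the original notebook) that refits Ridge on the full data at each alpha and prints the mean coefficient magnitude; the L2 penalty visibly shrinks the coefficients as alpha grows:
In [ ]:
# Fit Ridge at each alpha and watch the average coefficient magnitude shrink
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_data, y_target)
    print('alpha {0}: mean |coef| = {1:.3f}'.format(alpha, np.mean(np.abs(ridge.coef_))))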