Predicting Housing Prices with Random Forest
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)
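1. Data
This exercise uses a dataset for predicting house prices in Boston.
1.1 Data Load
The data is loaded with load_boston from sklearn.datasets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older release.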
In [2]:
from sklearn.datasets import load_boston
housing = load_boston()
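On releases where the loader is unavailable, one possible workaround is to read the original copy of the data directly. The following is a minimal sketch, assuming network access and that the file at `http://lib.stat.cmu.edu/datasets/boston` is still served in its original two-rows-per-record layout; the helper names `split_boston_records` and `load_boston_from_source` are hypothetical and not part of scikit-learn.

```python
import numpy as np
import pandas as pd

# Column names as shown in the describe() table of this notebook.
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def split_boston_records(raw_values):
    """Rebuild (data, target) from the raw layout.

    Assumption: each record spans two rows. The first row holds 11 features;
    the second row holds 2 more features followed by the target.
    """
    raw_values = np.asarray(raw_values, dtype=float)
    data = np.hstack([raw_values[::2, :], raw_values[1::2, :2]])
    target = raw_values[1::2, 2]
    return data, target


def load_boston_from_source(url="http://lib.stat.cmu.edu/datasets/boston"):
    """Download the raw file and return a dict shaped like the old loader's result."""
    raw_df = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
    data, target = split_boston_records(raw_df.values)
    return {"data": data, "target": target, "feature_names": FEATURE_NAMES}


# housing = load_boston_from_source()  # uncomment to download (requires network)
```

The returned dictionary exposes the same three keys the rest of the notebook reads: `"data"`, `"target"`, and `"feature_names"`.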
In [3]:
data, target = housing["data"], housing["target"]
1.2 Data EDA
In [4]:
pd.DataFrame(data, columns=housing["feature_names"]).describe()
Out[4]:
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 |
In [5]:
pd.Series(target).describe()
Out[5]:
count 506.000000
mean 22.532806
std 9.197104
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
dtype: float64
In [6]:
fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(20, 10))
for i, feature_name in enumerate(housing["feature_names"]):
    ax = axes[i // 7, i % 7]
    ax.scatter(data[:, i], target)
    ax.set_xlabel(feature_name)
    ax.set_ylabel("price")
1.3 Data Split
In [7]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021
)
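With `train_size=0.7`, `train_test_split` shuffles the rows and keeps 70% of them for training. The sketch below uses a placeholder array with the same shape as the Boston data (506 rows, 13 features; the zeros are stand-ins, not the real values) to show how many rows land in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data.
X = np.zeros((506, 13))
y = np.zeros(506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2021
)

# The training part gets floor(0.7 * 506) rows; the rest go to the test part.
print(X_train.shape, X_test.shape)
```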
2. Random Forest
In [8]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor()
2.1 Training
In [9]:
rf_regressor.fit(train_data, train_target)
Out[9]:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=None, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
2.2 Prediction
In [10]:
train_pred = rf_regressor.predict(train_data)
test_pred = rf_regressor.predict(test_data)
In [11]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axes[0].scatter(train_target, train_pred)
axes[0].set_xlabel("real")
axes[0].set_ylabel("predict")
axes[1].scatter(test_target, test_pred)
axes[1].set_xlabel("real")
axes[1].set_ylabel("predict")
Out[11]:
Text(0, 0.5, 'predict')
2.3 Evaluation
In [12]:
from sklearn.metrics import mean_squared_error
train_mse = mean_squared_error(train_target, train_pred)
test_mse = mean_squared_error(test_target, test_pred)
In [13]:
print(f"train mean squared error is {train_mse:.4f}")
print(f"test mean squared error is {test_mse:.4f}")
train mean squared error is 1.3500
test mean squared error is 12.1859
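`mean_squared_error` averages the squared differences between the targets and the predictions. A minimal pure-Python sketch of the same formula follows; the function name `mse` and the toy numbers are illustrative and do not come from the notebook.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]
toy_mse = mse(toy_true, toy_pred)  # (1 + 0 + 4) / 3
print(f"{toy_mse:.4f}")
```

In the run above, the training error (1.3500) is far lower than the test error (12.1859), which suggests the forest fits the training rows much more closely than unseen rows.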
2.4 Feature Importance
In [14]:
feature_importance = pd.Series(rf_regressor.feature_importances_, index=housing["feature_names"])
feature_importance.sort_values(ascending=True).plot(kind="barh")
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f829f517350>
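`feature_importances_` holds impurity-based importances that are normalized to sum to 1. To see how they behave, here is a sketch on synthetic data (not the Boston data) in which only the first column drives the target; all variable names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only column 0 carries signal; columns 1 and 2 are pure noise.
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

print(importances, importances.sum())
```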
3. Best Parameter
In [15]:
from sklearn.model_selection import GridSearchCV
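The main arguments of RandomForestRegressor are as follows.
- n_estimators
  - Sets how many decision trees to build.
- criterion
  - Sets the error measure that each split tries to reduce.
    - "mae": Mean Absolute Error
    - "mse": Mean Squared Error
  - scikit-learn 1.0 renamed these values to "absolute_error" and "squared_error", and the old names were removed in 1.2.
- max_depth
  - Sets the maximum depth each decision tree may reach.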
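To make the role of `n_estimators` concrete: a fitted forest keeps its trees in `estimators_`, and for regression its prediction is the average of the individual tree predictions. The sketch below checks this on synthetic data; the data and variable names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 4))
y = X[:, 0] + 2.0 * X[:, 1]

forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# One row of predictions per tree, then the mean over trees.
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X)

print(len(forest.estimators_), np.allclose(manual_average, forest_pred))
```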
3.1 Setting the Search Range
In [16]:
params = {
    "n_estimators": [100, 200, 500, 1000],
    "criterion": ["mae", "mse"],
    "max_depth": list(range(1, 10, 2)),
}
In [17]:
params
Out[17]:
{'criterion': ['mae', 'mse'],
'max_depth': [1, 3, 5, 7, 9],
'n_estimators': [100, 200, 500, 1000]}
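A grid search tries every combination of these values, so the cost grows multiplicatively. The sketch below enumerates the grid with `itertools.product` to count the work involved, assuming the 3-fold cross-validation used in section 3.2 and `GridSearchCV`'s default of refitting the best candidate once on all training rows.

```python
from itertools import product

params = {
    "n_estimators": [100, 200, 500, 1000],
    "criterion": ["mae", "mse"],
    "max_depth": list(range(1, 10, 2)),
}

# Every combination of one value per argument.
candidates = [dict(zip(params, values)) for values in product(*params.values())]

n_candidates = len(candidates)
n_folds = 3                          # cv=3 in the search
n_fits = n_candidates * n_folds + 1  # +1 for the final refit of the best candidate

print(n_candidates, n_fits)
```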
In [18]:
cv_rf_regressor = RandomForestRegressor()
3.2 Search
Start the search. cv is the number of folds k used for k-fold cross-validation.
In [19]:
grid = GridSearchCV(estimator=cv_rf_regressor, param_grid=params, cv=3)
grid = grid.fit(train_data, train_target)
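For an integer `cv` and a regressor, `GridSearchCV` uses unshuffled k-fold splitting: the training rows are cut into k consecutive blocks, and each block serves once as the validation set. A pure-Python sketch of that idea follows; the helper `kfold_indices` is hypothetical and is not scikit-learn's implementation.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k consecutive, unshuffled folds."""
    # The first n_samples % k folds receive one extra row.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


# 354 is floor(0.7 * 506), the number of training rows after the 70/30 split.
folds = list(kfold_indices(354, 3))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```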
In [20]:
print(f"Best score of parameter search is: {grid.best_score_:.4f}")
Best score of parameter search is: 0.8692
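No `scoring` argument was passed, so `best_score_` is the mean cross-validated value of the estimator's default `score`, which for regressors is R² rather than MSE; higher is better. A minimal sketch of the R² formula follows (the function name `r2` and the toy numbers are illustrative).

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


toy_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(toy_true, toy_true)    # predictions match the targets exactly
baseline = r2(toy_true, [2.5] * 4)  # always predicting the mean of the targets

print(perfect, baseline)
```

To select candidates by MSE instead, `GridSearchCV` accepts `scoring="neg_mean_squared_error"`.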
In [21]:
grid.best_params_
Out[21]:
{'criterion': 'mse', 'max_depth': 9, 'n_estimators': 1000}
In [22]:
print("Best parameter of best score is")
for key, value in grid.best_params_.items():
    print(f"\t {key}: {value}")
Best parameter of best score is
criterion: mse
max_depth: 9
n_estimators: 1000
3.3 Evaluation
In [23]:
best_rf = grid.best_estimator_
In [24]:
cv_train_pred = best_rf.predict(train_data)
cv_test_pred = best_rf.predict(test_data)
In [25]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axes[0].scatter(train_target, cv_train_pred)
axes[0].set_xlabel("real")
axes[0].set_ylabel("predict")
axes[1].scatter(test_target, cv_test_pred)
axes[1].set_xlabel("real")
axes[1].set_ylabel("predict")
Out[25]:
Text(0, 0.5, 'predict')
In [26]:
cv_train_mse = mean_squared_error(train_target, cv_train_pred)
cv_test_mse = mean_squared_error(test_target, cv_test_pred)
In [27]:
print(f"Best model Train mean squared error is {cv_train_mse:.4f}")
print(f"Best model Test mean squared error is {cv_test_mse:.4f}")
Best model Train mean squared error is 1.8319
Best model Test mean squared error is 12.0698
In [28]:
cv_feature_importance = pd.Series(best_rf.feature_importances_, index=housing["feature_names"])
cv_feature_importance.sort_values(ascending=True).plot(kind="barh")
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f829f2d2610>
4. Wrap-up
In [29]:
print(f"Test mean squared error is {test_mse:.4f}")
print(f"Best model Test mean squared error is {cv_test_mse:.4f}")
Test mean squared error is 12.1859
Best model Test mean squared error is 12.0698
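The two test errors can be compared directly. A small calculation using only the values printed above:

```python
test_mse = 12.1859     # default forest, from the output above
cv_test_mse = 12.0698  # grid-searched forest, from the output above

absolute_gain = test_mse - cv_test_mse
relative_gain = absolute_gain / test_mse

print(f"absolute: {absolute_gain:.4f}, relative: {relative_gain:.2%}")
```

The tuned model lowers the test MSE by roughly 1%, a modest gain over the default settings.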
In [30]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
feature_importance.sort_values(ascending=True).plot(kind="barh", ax=axes[0])
cv_feature_importance.sort_values(ascending=True).plot(kind="barh", ax=axes[1])
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f829f0a7fd0>