Predicting Real Estate Prices with Random Forest

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)

1. Data

The dataset used in this exercise is the Boston housing dataset, where the task is to predict house prices.

1.1 Data Load

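The data is available through load_boston in sklearn.datasets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this notebook needs an older version to run as written.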

In [2]:

from sklearn.datasets import load_boston

housing = load_boston()

In [3]:

data, target = housing["data"], housing["target"]

1.2 Data EDA

In [4]:

pd.DataFrame(data, columns=housing["feature_names"]).describe()

Out[4]:

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000

In [5]:

pd.Series(target).describe()

Out[5]:

count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
dtype: float64

In [6]:

fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(20, 10))
for i, feature_name in enumerate(housing["feature_names"]):
    ax = axes[i // 7, i % 7]
    ax.scatter(data[:, i], target)
    ax.set_xlabel(feature_name)
    ax.set_ylabel("price")

1.3 Data Split

In [7]:

from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021
)
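
As a rough check of what train_size=0.7 does to 506 rows, the sketch below splits placeholder arrays with the same shape as the Boston data. The names fake_data and fake_target are made up for illustration and hold no real values; only the resulting sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data (values are meaningless).
fake_data = np.zeros((506, 13))
fake_target = np.zeros(506)

tr_x, te_x, tr_y, te_y = train_test_split(
    fake_data, fake_target, train_size=0.7, random_state=2021
)

# train_size=0.7 keeps floor(0.7 * 506) = 354 rows for training;
# the remaining 152 rows go to the test set.
print(tr_x.shape, te_x.shape)
```

random_state only controls which rows land in each part, not how many.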

2. Random Forest

In [8]:

from sklearn.ensemble import RandomForestRegressor

rf_regressor = RandomForestRegressor()

2.1 Training

In [9]:

rf_regressor.fit(train_data, train_target)

Out[9]:

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

2.2 Prediction

In [10]:

train_pred = rf_regressor.predict(train_data)
test_pred = rf_regressor.predict(test_data)

In [11]:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

# x-axis: predicted price, y-axis: actual price (argument order matches the axis labels)
axes[0].scatter(train_pred, train_target)
axes[0].set_xlabel("predict")
axes[0].set_ylabel("real")

axes[1].scatter(test_pred, test_target)
axes[1].set_xlabel("predict")
axes[1].set_ylabel("real")

Out[11]:

Text(0, 0.5, 'real')

2.3 Evaluation

In [12]:

from sklearn.metrics import mean_squared_error

train_mse = mean_squared_error(train_target, train_pred)
test_mse = mean_squared_error(test_target, test_pred)

In [13]:

print(f"train mean squared error is {train_mse:.4f}")
print(f"test mean squared error is {test_mse:.4f}")

train mean squared error is 1.3500
test mean squared error is 12.1859
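
The training error is far lower than the test error, which suggests the forest fits the training rows much more closely than unseen ones. To make the metric concrete, here is a minimal sketch of what mean_squared_error computes, using toy numbers invented for illustration. It also takes the square root (RMSE), which expresses the error in the units of the target; for the test MSE above that is about 3.49.

```python
import numpy as np

# Toy targets and predictions, invented for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSE is the mean of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)
# The square root (RMSE) puts the error back in the units of the target.
rmse = np.sqrt(mse)

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```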

2.4 Feature Importance

In [14]:

feature_importance = pd.Series(rf_regressor.feature_importances_, index=housing["feature_names"])
feature_importance.sort_values(ascending=True).plot(kind="barh")

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f829f517350>
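
To help read the bar chart, the following sketch fits a small forest on synthetic data where only one column carries signal. The data and the column names signal, noise_a, and noise_b are hypothetical. It illustrates two properties of feature_importances_: the values sum to 1, and a column that drives the target receives most of the weight.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(2021)
X = rng.rand(300, 3)
# Only the first column drives the target; the other two are pure noise.
y = 10 * X[:, 0] + 0.1 * rng.rand(300)

model = RandomForestRegressor(n_estimators=50, random_state=2021).fit(X, y)
importance = pd.Series(
    model.feature_importances_, index=["signal", "noise_a", "noise_b"]
)

print(importance.sort_values(ascending=False))
print("sum:", importance.sum())
```

Because the importances are normalized, they are best read as relative shares within one model rather than as absolute effect sizes.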

3. Best Parameter

In [15]:

from sklearn.model_selection import GridSearchCV

The arguments tuned for the Random Forest Regressor are as follows.

n_estimators
- Sets how many decision trees are built.
criterion
- Sets the metric that each split tries to reduce.
- "mae": Mean Absolute Error
- "mse": Mean Squared Error
max_depth
- Sets the maximum depth a decision tree may reach.
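
The effect of n_estimators and max_depth can be checked directly on a fitted model. The sketch below uses synthetic data invented for illustration. criterion is left at its default because the accepted names differ between scikit-learn versions: "mse" and "mae" were renamed to "squared_error" and "absolute_error" in 1.0.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, invented for illustration.
rng = np.random.RandomState(2021)
X = rng.rand(200, 5)
y = 10 * X[:, 0] + rng.rand(200)

model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=2021)
model.fit(X, y)

# One fitted tree per requested estimator.
n_trees = len(model.estimators_)
# No tree grows deeper than max_depth.
deepest = max(tree.get_depth() for tree in model.estimators_)

print(n_trees, deepest)
```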

3.1 Setting the Search Range

In [16]:

params = {
    "n_estimators": [100, 200, 500, 1000],
    "criterion": ["mae", "mse"],
    "max_depth": list(range(1, 10, 2)),
}

In [17]:

params

Out[17]:

{'criterion': ['mae', 'mse'],
 'max_depth': [1, 3, 5, 7, 9],
 'n_estimators': [100, 200, 500, 1000]}
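
Before running a grid search, it helps to know how many models it will fit. This sketch counts the combinations in the grid above with ParameterGrid, which only enumerates the settings and does not train anything. The fold count of 3 is taken from the cv value used in the search that follows.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "n_estimators": [100, 200, 500, 1000],
    "criterion": ["mae", "mse"],
    "max_depth": list(range(1, 10, 2)),
}

# ParameterGrid enumerates every combination without fitting anything.
n_candidates = len(ParameterGrid(params))
n_folds = 3  # the cv value passed to GridSearchCV

# Each candidate is fitted once per fold; the final refit on the
# full training set is one extra fit on top of this.
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With 4 x 2 x 5 = 40 candidates and 3 folds, the search performs 120 fits, many of them with 500 or 1000 trees, so it can take a while.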

In [18]:

cv_rf_regressor = RandomForestRegressor()

3.2 Search

Now run the search.
The cv argument is the number of folds k used in k-fold cross-validation.
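
To illustrate what k = 3 means, the sketch below splits nine toy samples with KFold. When cv is an integer and the estimator is a regressor, GridSearchCV uses an unshuffled KFold of this kind. The nine-row array is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine toy samples, invented for illustration.
X = np.arange(9).reshape(-1, 1)

folds = []
for train_idx, valid_idx in KFold(n_splits=3).split(X):
    # Each sample is used for validation exactly once across the folds.
    folds.append((train_idx.tolist(), valid_idx.tolist()))
    print("train:", train_idx.tolist(), "valid:", valid_idx.tolist())
```

Each parameter combination is trained three times, once per fold, and its score is the average over the three validation parts.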

In [19]:

grid = GridSearchCV(estimator=cv_rf_regressor, param_grid=params, cv=3)
grid = grid.fit(train_data, train_target)

In [20]:

print(f"Best score of parameter search is: {grid.best_score_:.4f}")

Best score of parameter search is: 0.8692
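
Note that this score is not an MSE. No scoring argument was passed to GridSearchCV, so it falls back on the estimator's own score method, which for a regressor is the coefficient of determination R^2, averaged over the folds. A minimal sketch of that metric, on toy numbers invented for illustration:

```python
import numpy as np

# Toy targets and predictions, invented for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - ss_res / ss_tot
print(f"R^2: {r2:.4f}")
```

A value of 1 means perfect predictions and 0 means no better than predicting the mean, so higher is better, unlike MSE.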

In [21]:

grid.best_params_

Out[21]:

{'criterion': 'mse', 'max_depth': 9, 'n_estimators': 1000}

In [22]:

print("Best parameters found by the search:")
for key, value in grid.best_params_.items():
    print(f"\t {key}: {value}")

Best parameters found by the search:
	 criterion: mse
	 max_depth: 9
	 n_estimators: 1000

3.3 Evaluation

In [23]:

best_rf = grid.best_estimator_

In [24]:

cv_train_pred = best_rf.predict(train_data)
cv_test_pred = best_rf.predict(test_data)

In [25]:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

# x-axis: predicted price, y-axis: actual price (argument order matches the axis labels)
axes[0].scatter(cv_train_pred, train_target)
axes[0].set_xlabel("predict")
axes[0].set_ylabel("real")

axes[1].scatter(cv_test_pred, test_target)
axes[1].set_xlabel("predict")
axes[1].set_ylabel("real")

Out[25]:

Text(0, 0.5, 'real')

In [26]:

cv_train_mse = mean_squared_error(train_target, cv_train_pred)
cv_test_mse = mean_squared_error(test_target, cv_test_pred)

In [27]:

print(f"Best model Train mean squared error is {cv_train_mse:.4f}")
print(f"Best model Test mean squared error is {cv_test_mse:.4f}")

Best model Train mean squared error is 1.8319
Best model Test mean squared error is 12.0698

In [28]:

cv_feature_importance = pd.Series(best_rf.feature_importances_, index=housing["feature_names"])
cv_feature_importance.sort_values(ascending=True).plot(kind="barh")

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f829f2d2610>

4. Wrap-up

In [29]:

print(f"Test mean squared error is {test_mse:.4f}")
print(f"Best model Test mean squared error is {cv_test_mse:.4f}")

Test mean squared error is 12.1859
Best model Test mean squared error is 12.0698
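
To put the two numbers in perspective, the sketch below computes the absolute and relative improvement from the values printed above. The figures are copied by hand from that output rather than recomputed.

```python
# Test MSE values copied from the output above.
test_mse = 12.1859
cv_test_mse = 12.0698

absolute_gain = test_mse - cv_test_mse
relative_gain = absolute_gain / test_mse

print(f"absolute improvement: {absolute_gain:.4f}")
print(f"relative improvement: {relative_gain:.2%}")
```

The tuned model improves the test MSE by roughly 1%, a gap small enough that it may not be meaningful on a test set of this size.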

In [30]:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
feature_importance.sort_values(ascending=True).plot(kind="barh", ax=axes[0])
cv_feature_importance.sort_values(ascending=True).plot(kind="barh", ax=axes[1])

Out[30]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f829f0a7fd0>
