Real Estate Price Prediction

2024. 3. 12. 11:54·Machine Learning/Decision Tree

 

 

 
 

Predicting Real Estate Prices with Random Forest

 
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)
 
 

1. Data

 
 

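The data used in this exercise is the Boston housing dataset, where the task is to predict house prices. It has 506 samples and 13 features.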

 
 

1.1 Data Load

 
 

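The data can be loaded with load_boston from sklearn.datasets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell below only runs on older versions.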

 
In [2]:
from sklearn.datasets import load_boston

housing = load_boston()
 
In [3]:
data, target = housing["data"], housing["target"]
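On scikit-learn 1.2 or later, where load_boston no longer exists, one option is to parse the raw file directly. The following is a minimal sketch, not the method used in this post: load_boston_raw is a hypothetical helper, the URL is assumed to still be reachable, and the feature names are copied from the describe() output shown in section 1.2.

```python
import numpy as np
import pandas as pd

# Column names copied from the describe() output in this post.
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Assumption: the raw file is still hosted at this address.
DATA_URL = "http://lib.stat.cmu.edu/datasets/boston"


def load_boston_raw(source=DATA_URL):
    """Hypothetical helper: parse the raw Boston file into (data, target).

    The file starts with 22 lines of description. After that, each record
    spans two physical lines: 11 values, then 3 values.
    `source` may be a URL, a local path, or a file-like object.
    """
    raw_df = pd.read_csv(source, sep=r"\s+", skiprows=22, header=None)
    values = raw_df.values
    # Even rows hold the first 11 features; odd rows hold the last 2 features
    # followed by the target.
    data = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    return data, target
```

With this helper, `data, target = load_boston_raw()` would stand in for the two cells above, and FEATURE_NAMES would stand in for housing["feature_names"] in the later cells.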
 
 

1.2 Data EDA

 
In [4]:
pd.DataFrame(data, columns=housing["feature_names"]).describe()
 
Out[4]:
  CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000
 
In [5]:
pd.Series(target).describe()
 
Out[5]:
count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
dtype: float64
 
In [6]:
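# 13 features on a 2x7 grid: plot each feature against the target price.
fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(20, 10))
for i, feature_name in enumerate(housing["feature_names"]):
    ax = axes[i // 7, i % 7]
    ax.scatter(data[:, i], target)
    ax.set_xlabel(feature_name)
    ax.set_ylabel("price")
# The 14th subplot has no feature to show, so hide it.
axes[-1, -1].set_axis_off()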
 
 
 
 

1.3 Data Split

 
In [7]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021
)
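To see what train_size=0.7 means for 506 samples, the sketch below runs the same split on placeholder arrays. The zero-filled arrays are an assumption for illustration; only their shape matches the Boston data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Boston data (506 rows, 13 features).
dummy_data = np.zeros((506, 13))
dummy_target = np.zeros(506)

tr_x, te_x, tr_y, te_y = train_test_split(
    dummy_data, dummy_target, train_size=0.7, random_state=2021
)
print(tr_x.shape, te_x.shape)  # (354, 13) (152, 13)
```

The training size is rounded down (506 × 0.7 = 354.2), and the remaining 152 rows become the test set.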
 
 

2. Random Forest

 
In [8]:
from sklearn.ensemble import RandomForestRegressor

rf_regressor = RandomForestRegressor()
 
 

2.1 Training

 
In [9]:
rf_regressor.fit(train_data, train_target)
 
Out[9]:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
 
 

2.2 Prediction

 
In [10]:
train_pred = rf_regressor.predict(train_data)
test_pred = rf_regressor.predict(test_data)
 
In [11]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

# x-axis: actual target, y-axis: model prediction (left: train, right: test)
axes[0].scatter(train_target, train_pred)
axes[0].set_xlabel("real")
axes[0].set_ylabel("predict")

axes[1].scatter(test_target, test_pred)
axes[1].set_xlabel("real")
axes[1].set_ylabel("predict")
 
Out[11]:
Text(0, 0.5, 'predict')
 
 
 

2.3 Evaluation

 
In [12]:
from sklearn.metrics import mean_squared_error

train_mse = mean_squared_error(train_target, train_pred)
test_mse = mean_squared_error(test_target, test_pred)
 
In [13]:
print(f"train mean squared error is {train_mse:.4f}")
print(f"test mean squared error is {test_mse:.4f}")
 
 
train mean squared error is 1.3500
test mean squared error is 12.1859
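The train error (1.3500) is far below the test error (12.1859), which suggests the forest fits the training data much more closely than unseen data. MSE is in squared units of the target, so taking the square root (RMSE) can be easier to read: for the test set it is about 3.49. The sketch below shows how MSE and RMSE are computed; the numbers in it are made up for illustration and are not from the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 32.7, 33.4])

manual_mse = np.mean((y_true - y_pred) ** 2)      # average of the squared errors
sklearn_mse = mean_squared_error(y_true, y_pred)  # same quantity via scikit-learn
rmse = np.sqrt(sklearn_mse)                       # back in the units of the target

print(manual_mse, sklearn_mse, rmse)
```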
 
 

2.4 Feature Importance

 
In [14]:
feature_importance = pd.Series(rf_regressor.feature_importances_, index=housing["feature_names"])
feature_importance.sort_values(ascending=True).plot(kind="barh")
 
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f829f517350>
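As a rough guide to reading this plot: feature_importances_ holds impurity-based importances that are normalized to sum to 1, so each bar is a share of the total. The sketch below uses synthetic data (an assumption for illustration, not the Boston data) in which only the first two features affect the target, and checks that the forest ranks them highest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: only features 0 and 1 influence the target.
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_

print(importances.round(3))
print(importances.sum())  # normalized, so the total is 1 (up to rounding)

# Indices of the two features with the largest importance.
top_two = set(np.argsort(importances)[-2:].tolist())
print(top_two)
```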
 
 
 

3. Best Parameter

 
In [15]:
from sklearn.model_selection import GridSearchCV
 
 

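The main arguments of RandomForestRegressor tuned here are the following.

  • n_estimators
    • The number of decision trees to build.
  • criterion
    • The error measure that each split tries to reduce.
    • "mae": Mean Absolute Error (renamed "absolute_error" in scikit-learn 1.0)
    • "mse": Mean Squared Error (renamed "squared_error" in scikit-learn 1.0)
  • max_depth
    • The maximum depth each decision tree may reach.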
 
 

3.1 Setting the Search Range

 
In [16]:
params = {
    "n_estimators": [100, 200, 500, 1000],
    "criterion": ["mae", "mse"],
    "max_depth": list(range(1, 10, 2)),
}
 
In [17]:
params
 
Out[17]:
{'criterion': ['mae', 'mse'],
 'max_depth': [1, 3, 5, 7, 9],
 'n_estimators': [100, 200, 500, 1000]}
 
In [18]:
cv_rf_regressor = RandomForestRegressor()
 
 

3.2 Search

 
 

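Start the search.
cv is the k in k-fold cross-validation: with cv=3, every parameter combination is evaluated on three train/validation splits of the training data.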

 
In [19]:
grid = GridSearchCV(estimator=cv_rf_regressor, param_grid=params, cv=3)
grid = grid.fit(train_data, train_target)
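To get a feel for how much work this search does, the sketch below counts the parameter combinations and the fold sizes. It assumes the same grid as in section 3.1 and the 354-row training split that train_size=0.7 produces from 506 samples.

```python
import numpy as np
from sklearn.model_selection import KFold, ParameterGrid

params = {
    "n_estimators": [100, 200, 500, 1000],
    "criterion": ["mae", "mse"],
    "max_depth": list(range(1, 10, 2)),
}

# Every combination that the grid search will try.
n_candidates = len(ParameterGrid(params))
print(n_candidates)  # 40

# For a regressor, cv=3 corresponds to 3-fold splitting without shuffling.
indices = np.arange(354)  # 354 = size of the training split in this post
fold_sizes = [len(val_idx) for _, val_idx in KFold(n_splits=3).split(indices)]
print(fold_sizes)  # [118, 118, 118]

# Model fits performed before the final refit on the full training set.
print(n_candidates * len(fold_sizes))  # 120
```

Because half of these fits use 500 or 1000 trees, the search can take a while. GridSearchCV accepts n_jobs=-1 to run the fits in parallel.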
 
In [20]:
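# best_score_ is the mean cross-validated score of the best combination.
# With no scoring argument, a regressor is scored by R^2, not MSE.
print(f"Best score of parameter search is: {grid.best_score_:.4f}")
 
 
Best score of parameter search is: 0.8692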
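Since the rest of this post compares models by MSE, it may be more consistent to let the search optimize MSE as well. The sketch below is one way to do that, not what the post ran: it passes scoring="neg_mean_squared_error" and uses a small synthetic dataset and a reduced grid (both assumptions) so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search finishes quickly.
X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [1, 3, 5]},
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=3,
).fit(X, y)

# Flip the sign to read the best score as an ordinary (positive) MSE.
best_cv_mse = -grid.best_score_
print(grid.best_params_, best_cv_mse)
```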
 
In [21]:
grid.best_params_
 
Out[21]:
{'criterion': 'mse', 'max_depth': 9, 'n_estimators': 1000}
 
In [22]:
print("Best parameter of best score is")
for key, value in grid.best_params_.items():
    print(f"\t {key}: {value}")
 
 
Best parameter of best score is
	 criterion: mse
	 max_depth: 9
	 n_estimators: 1000
 
 

3.3 Evaluation

 
In [23]:
best_rf = grid.best_estimator_
 
In [24]:
cv_train_pred = best_rf.predict(train_data)
cv_test_pred = best_rf.predict(test_data)
 
In [25]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

# x-axis: actual target, y-axis: model prediction (left: train, right: test)
axes[0].scatter(train_target, cv_train_pred)
axes[0].set_xlabel("real")
axes[0].set_ylabel("predict")

axes[1].scatter(test_target, cv_test_pred)
axes[1].set_xlabel("real")
axes[1].set_ylabel("predict")
 
Out[25]:
Text(0, 0.5, 'predict')
 
 
In [26]:
cv_train_mse = mean_squared_error(train_target, cv_train_pred)
cv_test_mse = mean_squared_error(test_target, cv_test_pred)
 
In [27]:
print(f"Best model Train mean squared error is {cv_train_mse:.4f}")
print(f"Best model Test mean squared error is {cv_test_mse:.4f}")
 
 
Best model Train mean squared error is 1.8319
Best model Test mean squared error is 12.0698
 
In [28]:
cv_feature_importance = pd.Series(best_rf.feature_importances_, index=housing["feature_names"])
cv_feature_importance.sort_values(ascending=True).plot(kind="barh")
 
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f829f2d2610>
 
 
 

4. Conclusion

 
In [29]:
print(f"Test mean squared error is {test_mse:.4f}")
print(f"Best model Test mean squared error is {cv_test_mse:.4f}")
 
 
Test mean squared error is 12.1859
Best model Test mean squared error is 12.0698
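The tuned model lowers the test MSE only slightly, while its train MSE rises from 1.3500 to 1.8319. The sketch below puts the test improvement in relative terms, using the values printed above. Random forests are randomized, so a gap this small could plausibly change from run to run.

```python
# Test MSE values reported earlier in this post.
test_mse = 12.1859
cv_test_mse = 12.0698

absolute_gain = test_mse - cv_test_mse
relative_gain = absolute_gain / test_mse

print(f"absolute improvement: {absolute_gain:.4f}")
print(f"relative improvement: {relative_gain:.2%}")
```

The relative improvement comes out just under 1%.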
 
In [30]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
feature_importance.sort_values(ascending=True).plot(kind="barh", ax=axes[0])
cv_feature_importance.sort_values(ascending=True).plot(kind="barh", ax=axes[1])
 
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f829f0a7fd0>
 
 
Juson