Predicting Real Estate Prices¶
In [16]:
%pip install lightgbm
Collecting lightgbm
Downloading lightgbm-4.1.0-py3-none-win_amd64.whl (1.3 MB)
---------------------------------------- 1.3/1.3 MB 3.1 MB/s eta 0:00:00
Requirement already satisfied: numpy in c:\users\sjy99\appdata\local\programs\python\python311\lib\site-packages (from lightgbm) (1.23.5)
Requirement already satisfied: scipy in c:\users\sjy99\appdata\local\programs\python\python311\lib\site-packages (from lightgbm) (1.9.3)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.1.0
Note: you may need to restart the kernel to use updated packages.
In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)
1. Data¶
This exercise uses the California housing dataset, where the goal is to predict house values.
1.1 Data Load¶
The dataset can be loaded with fetch_california_housing from sklearn.datasets.
In [5]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
In [6]:
data, target = housing["data"], housing["target"]
1.2 Data EDA¶
In [7]:
pd.DataFrame(data, columns=housing["feature_names"]).describe()
Out[7]:
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | 3.870671 | 28.639486 | 5.429000 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.386050 | 2.135952 | 2.003532 |
| min | 0.499900 | 1.000000 | 0.846154 | 0.333333 | 3.000000 | 0.692308 | 32.540000 | -124.350000 |
| 25% | 2.563400 | 18.000000 | 4.440716 | 1.006079 | 787.000000 | 2.429741 | 33.930000 | -121.800000 |
| 50% | 3.534800 | 29.000000 | 5.229129 | 1.048780 | 1166.000000 | 2.818116 | 34.260000 | -118.490000 |
| 75% | 4.743250 | 37.000000 | 6.052381 | 1.099526 | 1725.000000 | 3.282261 | 37.710000 | -118.010000 |
| max | 15.000100 | 52.000000 | 141.909091 | 34.066667 | 35682.000000 | 1243.333333 | 41.950000 | -114.310000 |
In [8]:
pd.Series(target).describe()
Out[8]:
count 20640.000000
mean 2.068558
std 1.153956
min 0.149990
25% 1.196000
50% 1.797000
75% 2.647250
max 5.000010
dtype: float64
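The target is the median house value per block group, expressed in units of $100,000 according to scikit-learn's dataset description. The maximum of 5.00001 in the summary above hints that values may be clipped at an upper bound. Below is a minimal sketch for checking how many samples sit at that bound. It uses a small made-up array in place of target, and the helper name fraction_at_cap is hypothetical.

```python
import numpy as np


def fraction_at_cap(values, cap=None, tol=1e-9):
    """Return the share of entries that sit at the upper bound `cap`.

    If `cap` is None, the observed maximum is used as the bound.
    """
    values = np.asarray(values, dtype=float)
    if cap is None:
        cap = values.max()
    return float(np.mean(np.abs(values - cap) <= tol))


# Illustrative values only; in the notebook, pass `target` instead.
toy_target = np.array([1.2, 5.00001, 2.6, 5.00001, 0.9])
share = fraction_at_cap(toy_target)
print(f"{share:.1%} of samples are at the maximum")
```

A noticeable share at the bound would mean the models cannot learn prices above it, which is worth keeping in mind when reading the prediction plots.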
In [9]:
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(20, 10))
for i, feature_name in enumerate(housing["feature_names"]):
    # Place the 8 features on the 2 x 4 grid, row by row
    ax = axes[i // 4, i % 4]
    ax.scatter(data[:, i], target)
    ax.set_xlabel(feature_name)
    ax.set_ylabel("price")
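The scatter plots give a visual impression of how each feature relates to the price. A numeric complement is the Pearson correlation between each feature and the target, which only captures linear relationships. The sketch below is one way to compute it with NumPy. It runs on synthetic data so that it is self-contained, and the helper name feature_target_correlations is hypothetical.

```python
import numpy as np


def feature_target_correlations(X, y, names):
    """Pearson correlation between each column of X and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return {
        name: float(np.corrcoef(X[:, i], y)[0, 1])
        for i, name in enumerate(names)
    }


# Synthetic stand-in data: one informative column, one pure-noise column.
rng = np.random.default_rng(2021)
x_signal = rng.normal(size=200)
x_noise = rng.normal(size=200)
toy_X = np.column_stack([x_signal, x_noise])
toy_y = 2.0 * x_signal + rng.normal(scale=0.1, size=200)

corrs = feature_target_correlations(toy_X, toy_y, ["signal", "noise"])
# Print features ordered by absolute correlation, strongest first.
for name, r in sorted(corrs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>8}: {r:+.3f}")
```

In the notebook, the same helper could be called as feature_target_correlations(data, target, housing["feature_names"]).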
1.3 Data Split¶
In [10]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021
)
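To make the split step concrete, here is a rough NumPy sketch of what a shuffled 70/30 split does: permute the row indices, then cut them at the training fraction. This is not scikit-learn's implementation; the function name simple_shuffle_split is hypothetical, and details such as rounding and the random number generator differ from train_test_split.

```python
import numpy as np


def simple_shuffle_split(X, y, train_fraction=0.7, seed=2021):
    """Shuffle row indices once, then cut them into train and test parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(len(X) * train_fraction)
    train_idx, test_idx = order[:n_train], order[n_train:]
    # Same return order as train_test_split: X_train, X_test, y_train, y_test
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data: row i of toy_X is [2i, 2i + 1], and toy_y[i] is i.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)
X_tr, X_te, y_tr, y_te = simple_shuffle_split(toy_X, toy_y, train_fraction=0.5)
print(X_tr.shape, X_te.shape)
```

Indexing X and y with the same index arrays is what keeps each feature row paired with its target after shuffling.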
2. XGBoost¶
In [13]:
# Use the %pip magic: a bare "pip install" mixed with Python code in one cell is a SyntaxError
%pip install xgboost
import xgboost as xgb
xgb_reg = xgb.XGBRegressor()
2.1 Training¶
In [ ]:
xgb_reg.fit(train_data, train_target)
[08:16:58] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[ ]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
importance_type='gain', learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
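The warning above says that reg:linear is deprecated in favor of reg:squarederror. One way to avoid relying on the library default is to pass the objective explicitly. The sketch below only builds the keyword dictionary, copying the other values from the printed repr. It assumes xgb.XGBRegressor accepts these keywords, as the repr suggests, and the name xgb_params is hypothetical.

```python
# Keyword arguments mirroring the printed XGBRegressor repr,
# with the deprecated objective swapped for its replacement.
xgb_params = {
    "objective": "reg:squarederror",  # replaces the deprecated "reg:linear"
    "n_estimators": 100,
    "max_depth": 3,
    "learning_rate": 0.1,
    "random_state": 0,
}

# In the notebook, this could be used as (assumes xgboost is installed):
#   xgb_reg = xgb.XGBRegressor(**xgb_params)
#   xgb_reg.fit(train_data, train_target)
for key, value in xgb_params.items():
    print(f"{key} = {value}")
```

Writing the settings down also makes the run easier to compare across library versions, since defaults can change between releases.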
2.2 Prediction¶
In [ ]:
xgb_train_pred = xgb_reg.predict(train_data)
xgb_test_pred = xgb_reg.predict(test_data)
In [ ]:
plt.figure(figsize=(14, 7))
plt.subplot(121)
plt.scatter(xgb_train_pred, train_target)
plt.title("train data")
plt.xlabel("predict")
plt.ylabel("target")
plt.subplot(122)
plt.scatter(xgb_test_pred, test_target)
plt.title("test data")
plt.xlabel("predict")
plt.ylabel("target")
Out[ ]:
Text(0, 0.5, 'target')
2.3 Evaluation¶
In [ ]:
from sklearn.metrics import mean_squared_error
xgb_train_mse = mean_squared_error(train_target, xgb_train_pred)
xgb_test_mse = mean_squared_error(test_target, xgb_test_pred)
In [ ]:
print(f"XGBoost Train MSE is {xgb_train_mse:.4f}")
print(f"XGBoost Test MSE is {xgb_test_mse:.4f}")
XGBoost Train MSE is 0.2598
XGBoost Test MSE is 0.2873
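MSE is the mean of the squared differences between targets and predictions, so it is expressed in squared target units. Taking the square root (RMSE) brings it back to the target's own units, which for this dataset are multiples of $100,000. The sketch below computes MSE by hand on toy values and converts the reported test MSE of 0.2873 to an RMSE of about 0.536, or roughly $53,600. The helper name mse and the toy arrays are illustrative only.

```python
import math

import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Toy check: squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
toy_true = np.array([1.0, 2.0, 3.0])
toy_pred = np.array([1.5, 2.0, 2.0])
toy_mse = mse(toy_true, toy_pred)
print(f"toy MSE: {toy_mse:.4f}")

# Convert the test MSE reported above into an RMSE in target units.
reported_test_mse = 0.2873
rmse = math.sqrt(reported_test_mse)
print(f"RMSE: {rmse:.4f} (in units of $100,000)")
```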
3. LightGBM¶
In [ ]:
import lightgbm as lgb
lgb_reg = lgb.LGBMRegressor()
3.1 Training¶
In [ ]:
lgb_reg.fit(train_data, train_target)
Out[ ]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
3.2 Prediction¶
In [ ]:
lgb_train_pred = lgb_reg.predict(train_data)
lgb_test_pred = lgb_reg.predict(test_data)
In [ ]:
plt.figure(figsize=(14, 7))
plt.subplot(121)
plt.scatter(lgb_train_pred, train_target)
plt.title("train data")
plt.xlabel("predict")
plt.ylabel("target")
plt.subplot(122)
plt.scatter(lgb_test_pred, test_target)
plt.title("test data")
plt.xlabel("predict")
plt.ylabel("target")
Out[ ]:
Text(0, 0.5, 'target')
3.3 Evaluation¶
In [ ]:
lgb_train_mse = mean_squared_error(train_target, lgb_train_pred)
lgb_test_mse = mean_squared_error(test_target, lgb_test_pred)
In [ ]:
print(f"LightGBM Train MSE is {lgb_train_mse:.4f}")
print(f"LightGBM Test MSE is {lgb_test_mse:.4f}")
LightGBM Train MSE is 0.1543
LightGBM Test MSE is 0.2098
4. CatBoost¶
In [ ]:
import catboost as cb
cb_reg = cb.CatBoostRegressor()
4.1 Training¶
In [ ]:
cb_reg.fit(train_data, train_target, verbose=False)
Out[ ]:
<catboost.core.CatBoostRegressor at 0x7f134f92c4d0>
4.2 Prediction¶
In [ ]:
cb_train_pred = cb_reg.predict(train_data)
cb_test_pred = cb_reg.predict(test_data)
In [ ]:
plt.figure(figsize=(14, 7))
plt.subplot(121)
plt.scatter(cb_train_pred, train_target)
plt.title("train data")
plt.xlabel("predict")
plt.ylabel("target")
plt.subplot(122)
plt.scatter(cb_test_pred, test_target)
plt.title("test data")
plt.xlabel("predict")
plt.ylabel("target")
Out[ ]:
Text(0, 0.5, 'target')
4.3 Evaluation¶
In [ ]:
cb_train_mse = mean_squared_error(train_target, cb_train_pred)
cb_test_mse = mean_squared_error(test_target, cb_test_pred)
In [ ]:
print(f"CatBoost Train MSE is {cb_train_mse:.4f}")
print(f"CatBoost Test MSE is {cb_test_mse:.4f}")
CatBoost Train MSE is 0.1147
CatBoost Test MSE is 0.1927
5. Wrap-up¶
In [ ]:
print(f"XGBoost Test MSE is {xgb_test_mse:.4f}")
print(f"LightGBM Test MSE is {lgb_test_mse:.4f}")
print(f"CatBoost Test MSE is {cb_test_mse:.4f}")
XGBoost Test MSE is 0.2873
LightGBM Test MSE is 0.2098
CatBoost Test MSE is 0.1927
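Looking at train and test MSE side by side adds some context to the ranking. The sketch below tabulates the numbers reported in this post and computes the train-to-test gap for each model. The values are copied by hand from the outputs above, and every model was run with its default settings, so this is a rough comparison rather than a tuned benchmark.

```python
# Train/test MSE values copied from the outputs reported in this post.
results = {
    "XGBoost": {"train": 0.2598, "test": 0.2873},
    "LightGBM": {"train": 0.1543, "test": 0.2098},
    "CatBoost": {"train": 0.1147, "test": 0.1927},
}

# Gap between test and train error; a larger gap suggests more overfitting.
gaps = {name: s["test"] - s["train"] for name, s in results.items()}
best_model = min(results, key=lambda name: results[name]["test"])

print(f"{'model':<10}{'train':>8}{'test':>8}{'gap':>8}")
for name, scores in results.items():
    print(
        f"{name:<10}{scores['train']:>8.4f}"
        f"{scores['test']:>8.4f}{gaps[name]:>8.4f}"
    )
print(f"Lowest test MSE: {best_model}")
```

CatBoost has the lowest test MSE in this run, but it also shows the largest gap between train and test error, so its advantage on unseen data is smaller than the training scores alone would suggest.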
