Boosting Classification with Sample Data¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)
1. Data¶
1.1 Sample Data¶
Let's generate the data we will use in this exercise.
In [2]:
from sklearn.datasets import make_gaussian_quantiles
data_1, label_1 = make_gaussian_quantiles(
    cov=2, n_samples=200, n_features=2, n_classes=2, random_state=2021
)
data_2, label_2 = make_gaussian_quantiles(
    mean=(3, 3), cov=1.5, n_samples=300, n_features=2, n_classes=2, random_state=2021
)
In [3]:
data = np.concatenate((data_1, data_2))
label = np.concatenate((label_1, -label_2 + 1))  # flip the 0/1 labels of the second cluster
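The expression applied to `label_2` swaps its 0 and 1 labels before the two clusters are joined. A minimal sketch of that flip, using a hypothetical toy array in place of `label_2` (the values are made up for illustration):

```python
import numpy as np

# Hypothetical 0/1 labels standing in for label_2.
toy_label = np.array([0, 1, 1, 0, 1])

# Same arithmetic as in the cell above: 0 becomes 1 and 1 becomes 0.
flipped = -toy_label + 1

print(flipped)
```

Writing the flip as `1 - toy_label` gives the same result and may read more clearly.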
In [4]:
plt.scatter(data[:,0], data[:,1], c=label)
Out[4]:
<matplotlib.collections.PathCollection at 0x7f6195cea2d0>
1.2 Data Split¶
In [5]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_label, test_label = train_test_split(
    data, label, train_size=0.7, random_state=2021
)
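To check what `train_size=0.7` does to row counts, the sketch below splits stand-in arrays that have the same number of rows as the data above (200 + 300). The arrays `X` and `y` are placeholders, not the tutorial data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the tutorial data (200 + 300).
X = np.zeros((500, 2))
y = np.arange(500) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2021
)

# 70% of the rows go to the training set, the rest to the test set.
print(X_train.shape, X_test.shape)
```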
1.3 Data for Visualization¶
In [6]:
x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1
y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
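The grid built here is what the later visualization cells flatten with `np.c_[xx.ravel(), yy.ravel()]` before calling `predict`. A small sketch of that reshaping on a deliberately tiny grid (the names `gx`, `gy`, and `grid_points` are illustrative):

```python
import numpy as np

# A tiny grid with a coarse step so the arrays are easy to inspect.
gx, gy = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))

# Flatten both coordinate arrays and pair them column-wise:
# each row is one (x, y) point that a classifier could predict on.
grid_points = np.c_[gx.ravel(), gy.ravel()]

print(gx.shape)           # rows follow y, columns follow x
print(grid_points.shape)  # one row per grid point
```

Predictions on `grid_points` come back as a flat array, which is why the notebook reshapes them to `xx.shape` before plotting.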
2. Decision Tree¶
First, we train a basic Decision Tree so that we have a baseline to compare against.
In [7]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=2)
2.1 Training¶
In [8]:
tree.fit(train_data, train_label)
Out[8]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=2, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
2.2 Prediction¶
In [9]:
tree_train_pred = tree.predict(train_data)
tree_test_pred = tree.predict(test_data)
2.3 Evaluation¶
In [10]:
from sklearn.metrics import accuracy_score
tree_train_acc = accuracy_score(train_label, tree_train_pred)
tree_test_acc = accuracy_score(test_label, tree_test_pred)
In [11]:
print(f"Tree train accuracy is {tree_train_acc:.4f}")
print(f"Tree test accuracy is {tree_test_acc:.4f}")
Tree train accuracy is 0.7286
Tree test accuracy is 0.6867
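Accuracy is simply the fraction of predictions that match the labels. Assuming the test split holds 150 rows, a test accuracy of 0.6867 corresponds to 103 correct predictions. The sketch below compares a manual computation with `accuracy_score` on hypothetical arrays made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, made up for illustration.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

manual_acc = np.mean(y_true == y_pred)  # fraction of matching entries
sk_acc = accuracy_score(y_true, y_pred)

print(f"{manual_acc:.4f} {sk_acc:.4f}")
```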
2.4 Visualization¶
In [12]:
tree_Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
tree_Z = tree_Z.reshape(xx.shape)
In [13]:
plt.figure(figsize=(14, 7))
plt.subplot(121)
cs = plt.contourf(xx, yy, tree_Z, cmap=plt.cm.Paired)
plt.scatter(train_data[:,0], train_data[:,1], c=train_label)
plt.title("train data")
plt.subplot(122)
cs = plt.contourf(xx, yy, tree_Z, cmap=plt.cm.Paired)
plt.scatter(test_data[:,0], test_data[:,1], c=test_label)
plt.title("test data")
Out[13]:
Text(0.5, 1.0, 'test data')
3. AdaBoost¶
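Next, we train AdaBoost. It is available as AdaBoostClassifier in sklearn.ensemble. AdaBoostClassifier takes a base estimator as its first argument; if none is given, scikit-learn falls back to a depth-1 decision tree, but here we pass it explicitly.
So that each weak learner classifies the data with the simplest possible if/else rule, we use a tree of depth 1 as the base estimator.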
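To make the "single if/else" description concrete, the sketch below fits a depth-1 tree on hypothetical one-feature toy data and rebuilds its prediction as a plain Python rule. The data and the helper `stump_rule` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature toy data: small values are class 0, large are class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# Read the single split out of the fitted tree structure (node 0 is the root).
t = stump.tree_
feature, threshold = t.feature[0], t.threshold[0]
left_class = stump.classes_[np.argmax(t.value[t.children_left[0], 0])]
right_class = stump.classes_[np.argmax(t.value[t.children_right[0], 0])]

def stump_rule(row):
    # The whole model is one if/else on one feature.
    if row[feature] <= threshold:
        return left_class
    return right_class

manual_pred = np.array([stump_rule(row) for row in X])
print(feature, threshold, manual_pred)
```

AdaBoost combines many such one-split rules, each trained on a reweighted view of the training data.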
In [14]:
from sklearn.ensemble import AdaBoostClassifier
ada_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1))
3.1 Training¶
In [15]:
ada_boost.fit(train_data, train_label)
Out[15]:
AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=1,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
learning_rate=1.0, n_estimators=50, random_state=None)
3.2 Prediction¶
In [16]:
ada_boost_train_pred = ada_boost.predict(train_data)
ada_boost_test_pred = ada_boost.predict(test_data)
3.3 Evaluation¶
In [17]:
from sklearn.metrics import accuracy_score
ada_boost_train_acc = accuracy_score(train_label, ada_boost_train_pred)
ada_boost_test_acc = accuracy_score(test_label, ada_boost_test_pred)
In [18]:
print(f"Ada Boost train accuracy is {ada_boost_train_acc:.4f}")
print(f"Ada Boost test accuracy is {ada_boost_test_acc:.4f}")
Ada Boost train accuracy is 0.9486
Ada Boost test accuracy is 0.8600
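The scores above describe the ensemble only after all boosting rounds. One way to see how accuracy develops round by round is `staged_predict`, which yields predictions after each fitted estimator. The sketch below is self-contained and uses stand-in data (a single Gaussian-quantiles blob with `random_state=0`), so its numbers will not match the ones printed above.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: a single Gaussian-quantiles blob, not the tutorial's combined set.
X, y = make_gaussian_quantiles(n_samples=300, n_features=2, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting round.
staged_test_acc = [accuracy_score(y_test, pred) for pred in model.staged_predict(X_test)]

# Compare the first round with the last one.
for n_rounds in (1, len(staged_test_acc)):
    print(n_rounds, f"{staged_test_acc[n_rounds - 1]:.4f}")
```

Plotting `staged_test_acc` against the round index is a simple way to judge whether more estimators still help.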
3.4 Visualization¶
In [19]:
ada_boost_Z = ada_boost.predict(np.c_[xx.ravel(), yy.ravel()])
ada_boost_Z = ada_boost_Z.reshape(xx.shape)
In [20]:
plt.figure(figsize=(14, 7))
plt.subplot(121)
cs = plt.contourf(xx, yy, ada_boost_Z, cmap=plt.cm.Paired)
plt.scatter(train_data[:,0], train_data[:,1], c=train_label)
plt.title("train_data")
plt.subplot(122)
cs = plt.contourf(xx, yy, ada_boost_Z, cmap=plt.cm.Paired)
plt.scatter(test_data[:,0], test_data[:,1], c=test_label)
plt.title("test_data")
Out[20]:
Text(0.5, 1.0, 'test_data')
4. GradientBoost¶
Next is Gradient Boost.
Gradient Boost is available as GradientBoostingClassifier in sklearn.ensemble.
As with AdaBoost, we set max_depth to 1 so that each tree is a simple if/else rule.
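For intuition about what boosting depth-1 trees on a gradient means, here is a simplified, hand-rolled sketch for binary log loss: each round fits a regression stump to the current residuals and takes a small step. This is an illustration under stated assumptions, not scikit-learn's implementation, which differs in details such as how leaf values are computed. The data, `learning_rate`, and `n_rounds` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, raw):
    # Mean binary log loss for raw (log-odds) scores.
    p = np.clip(sigmoid(raw), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Stand-in data, not the tutorial's combined set.
X, y = make_gaussian_quantiles(n_samples=300, n_features=2, n_classes=2, random_state=0)

learning_rate, n_rounds = 0.1, 100
p0 = y.mean()
raw = np.full(len(y), np.log(p0 / (1 - p0)))  # start from the log-odds of class 1
initial_loss = log_loss(y, raw)

for _ in range(n_rounds):
    residual = y - sigmoid(raw)               # negative gradient of the log loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    raw += learning_rate * stump.predict(X)   # small step along the fitted stump

final_loss = log_loss(y, raw)
train_acc = np.mean((raw > 0).astype(int) == y)
print(f"{initial_loss:.4f} -> {final_loss:.4f}, train accuracy {train_acc:.4f}")
```

Unlike AdaBoost, which reweights samples, each stump here is fit to the errors the current ensemble still makes.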
In [21]:
from sklearn.ensemble import GradientBoostingClassifier
grad_boost = GradientBoostingClassifier(max_depth=1)
4.1 Training¶
In [22]:
grad_boost.fit(train_data, train_label)
Out[22]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=1,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
4.2 Prediction¶
In [23]:
grad_boost_train_pred = grad_boost.predict(train_data)
grad_boost_test_pred = grad_boost.predict(test_data)
4.3 Evaluation¶
In [24]:
from sklearn.metrics import accuracy_score
grad_boost_train_acc = accuracy_score(train_label, grad_boost_train_pred)
grad_boost_test_acc = accuracy_score(test_label, grad_boost_test_pred)
In [25]:
print(f"Gradient Boost train accuracy is {grad_boost_train_acc:.4f}")
print(f"Gradient Boost test accuracy is {grad_boost_test_acc:.4f}")
Gradient Boost train accuracy is 0.8886
Gradient Boost test accuracy is 0.8200
4.4 Visualization¶
In [26]:
grad_boost_Z = grad_boost.predict(np.c_[xx.ravel(), yy.ravel()])
grad_boost_Z = grad_boost_Z.reshape(xx.shape)
In [27]:
plt.figure(figsize=(14, 7))
plt.subplot(121)
cs = plt.contourf(xx, yy, grad_boost_Z, cmap=plt.cm.Paired)
plt.scatter(train_data[:,0], train_data[:,1], c=train_label)
plt.title("train_data")
plt.subplot(122)
cs = plt.contourf(xx, yy, grad_boost_Z, cmap=plt.cm.Paired)
plt.scatter(test_data[:,0], test_data[:,1], c=test_label)
plt.title("test_data")
Out[27]:
Text(0.5, 1.0, 'test_data')
5. Wrap-up¶
In [28]:
print(f"Tree test accuracy is {tree_test_acc:.4f}")
print(f"Gradient Boost test accuracy is {grad_boost_test_acc:.4f}")
print(f"Ada Boost test accuracy is {ada_boost_test_acc:.4f}")
Tree test accuracy is 0.6867
Gradient Boost test accuracy is 0.8200
Ada Boost test accuracy is 0.8600
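The printed scores can also be collected into one table for side-by-side comparison. The sketch below hard-codes the train and test accuracies reported in the sections above; the `gap` column is an added illustration of the train/test difference, not something the notebook computes.

```python
import pandas as pd

# Accuracies copied from the printed outputs in the sections above.
summary = pd.DataFrame(
    {
        "train_acc": [0.7286, 0.9486, 0.8886],
        "test_acc": [0.6867, 0.8600, 0.8200],
    },
    index=["Tree", "Ada Boost", "Gradient Boost"],
)

# Gap between train and test accuracy, a rough indicator of overfitting.
summary["gap"] = summary["train_acc"] - summary["test_acc"]

print(summary.sort_values("test_acc", ascending=False))
```

On these numbers both boosted models beat the single depth-2 tree on the test set, with Ada Boost highest and also showing the largest train/test gap.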
In [29]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
Z_name = [
("tree", tree_Z),
("Ada Boost", ada_boost_Z),
("Gradient Boost", grad_boost_Z)
]
for idx, (name, Z) in enumerate(Z_name):
ax = axes[idx]
ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
ax.scatter(train_data[:,0], train_data[:,1], c=train_label)
ax.set_title(name)