Sample Data and the Decision Tree Regressor
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)
1. Data
1.1 Data Load
We generate the sample data used in this example.
In [37]:
data = np.sort(np.random.uniform(low=0, high=5, size=(80, 1)))
label = np.sin(data).ravel()
label[::5] += 3 * (0.5 - np.random.uniform(0, 1, 16))
In [38]:
data
Out[38]:
array([[4.98555781],
[4.39531746],
[4.0968716 ],
[4.46267696],
[1.62082743],
[0.89626001],
[3.23132164],
[1.39412322],
[4.60394844],
[1.44702545],
[1.94905 ],
[4.62350621],
[2.79565164],
[3.40090045],
[1.24196102],
[1.55559205],
[3.73480957],
[3.33060183],
[4.96312259],
[2.37902113],
[0.34574107],
[0.37108979],
[0.26366322],
[0.13662432],
[2.94732145],
[1.36590232],
[2.0226209 ],
[2.2298019 ],
[3.1469101 ],
[3.81155277],
[1.75182344],
[3.03209918],
[4.01242566],
[2.46340263],
[2.19040695],
[0.62292664],
[4.57442919],
[2.4853704 ],
[1.70078095],
[3.85512632],
[4.94801485],
[2.85975008],
[4.47468291],
[2.97523582],
[4.55262731],
[2.61356773],
[1.11372545],
[3.27935218],
[1.07702405],
[2.80873149],
[0.58044317],
[2.20826445],
[0.08918632],
[3.21400392],
[2.24942722],
[4.59761292],
[1.43174295],
[4.7194593 ],
[0.82566808],
[2.72170941],
[1.73767529],
[0.7477184 ],
[2.40799657],
[1.21741249],
[3.7011057 ],
[2.25019692],
[0.83846088],
[3.33454411],
[3.97575281],
[1.74655661],
[2.07496174],
[3.38162782],
[0.90069671],
[4.87971069],
[4.96135556],
[3.96334816],
[1.21015591],
[2.14437313],
[2.89742252],
[0.05791837]])
In [39]:
label
Out[39]:
array([-0.47099749, -0.95015255, -0.81647483, -0.96898363, 0.99874871,
1.66828822, -0.08960862, 0.98443386, -0.99412608, 0.99235016,
1.15402488, -0.99605253, 0.33908209, -0.25641155, 0.94641911,
-0.4537738 , -0.55903121, -0.18788581, -0.96873066, 0.6907831 ,
1.68924227, 0.36263126, 0.26061892, 0.13619967, 0.1930515 ,
0.35115093, 0.89965198, 0.79060154, -0.00531742, -0.62095472,
0.68951082, 0.10927482, -0.76486582, 0.62738461, 0.81410464,
-0.54340734, -0.99049863, 0.6101281 , 0.99156389, -0.6545095 ,
-1.55001728, 0.278126 , -0.97188069, 0.16559058, -0.98726523,
-0.2792155 , 0.89734903, -0.13732421, 0.88055125, 0.32674848,
-0.17349289, 0.80360521, 0.08906814, -0.072348 , 0.77843287,
-0.73838726, 0.99034765, -0.99997501, 0.73500095, 0.40765384,
1.21209889, 0.67996756, 0.66954502, 0.93820703, -0.53077356,
-0.03559672, 0.74361494, -0.19175641, -0.74073257, 0.98459388,
1.00511254, -0.23773678, 0.7837598 , -0.98603435, -0.96916758,
-1.6120154 , 0.93567103, 0.83996544, 0.24175116, 0.05788599])
The data has a single feature, and the target as a function of that feature looks like this.
In [5]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
Out[5]:
<matplotlib.collections.PathCollection at 0x17da49596d0>
1.2 Viz Data
We also generate evenly spaced test points for visualization.
In [42]:
viz_test_data = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]  # [:, np.newaxis] adds a dimension -> (500, 1) column vector
In [43]:
viz_test_data[:5]
Out[43]:
array([[0. ],
[0.01],
[0.02],
[0.03],
[0.04]])
2. Decision Tree Regressor
Let's look at how the tree's predictions change as each split is made.
In [8]:
from sklearn.tree import DecisionTreeRegressor, plot_tree
2.1 No Splits
With no splits, the tree predicts the mean of the training labels.
In [9]:
viz_test_pred = np.repeat(label.mean(), len(viz_test_data))
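Why the mean? For squared error, the constant prediction that minimizes the training loss is the label mean. A minimal sketch (using freshly generated labels, not the data above) checks that shifting the prediction away from the mean only increases the sum of squared errors:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)

# SSE(c) = SSE(mean) + n * (c - mean)^2, so any constant other than
# the mean strictly increases the sum of squared errors.
mean_sse = np.sum((y - y.mean()) ** 2)
for shift in (-0.1, 0.1):
    assert np.sum((y - (y.mean() + shift)) ** 2) > mean_sse
```

This is exactly why a tree node with no split stores (and predicts) the mean of the labels that reach it.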
Plotted, this is the single horizontal line from the lecture.
In [10]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.plot(viz_test_data, viz_test_pred, color="C2")
Out[10]:
[<matplotlib.lines.Line2D at 0x17dc53e51d0>]
Computing the MSE with no splits (which is just the variance of the labels) gives the following.
In [11]:
train_pred = np.repeat(label.mean(), len(data))
mse_var = np.var(label - train_pred)
In [12]:
print(f"no divide mse variance: {mse_var:.3f}")
no divide mse variance: 0.580
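As a sanity check (a sketch with freshly generated labels, so the number differs from the 0.580 above), the MSE of the always-predict-the-mean model is exactly the variance of the labels:

```python
import numpy as np

rng = np.random.default_rng(2021)
y = rng.normal(size=80)

# Predicting the mean everywhere makes the MSE equal the label variance,
# since MSE = mean((y - y.mean())^2) is the definition of np.var(y).
pred = np.repeat(y.mean(), len(y))
mse_of_mean = np.mean((y - pred) ** 2)
assert np.isclose(mse_of_mean, np.var(y))
```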
2.2 First Split
In [13]:
first_divide = DecisionTreeRegressor(max_depth=1)
In [14]:
first_divide.fit(data, label)
Out[14]:
DecisionTreeRegressor(max_depth=1)
In [15]:
first_divide_pred = first_divide.predict(viz_test_data)
The region boundary created by the first split is drawn below.
In [16]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.axvline(first_divide.tree_.threshold[0], color="red")
Out[16]:
<matplotlib.lines.Line2D at 0x17dc5336f10>
Within each region created by the split, the mean is computed again.
In [17]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.axvline(first_divide.tree_.threshold[0], color="red")
plt.plot(viz_test_data, first_divide_pred, color="C2")
Out[17]:
[<matplotlib.lines.Line2D at 0x17dc738ead0>]
To visualize the tree, use the plot_tree function. The visualized tree looks like this.
In [18]:
plot_tree(first_divide)
Out[18]:
[Text(0.5, 0.75, 'X[0] <= 3.088\nsquared_error = 0.58\nsamples = 80\nvalue = 0.062'),
Text(0.25, 0.25, 'squared_error = 0.179\nsamples = 48\nvalue = 0.565'),
Text(0.75, 0.25, 'squared_error = 0.236\nsamples = 32\nvalue = -0.691')]
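The threshold the tree picked is the candidate cut that minimizes the sample-weighted variance of the two children. A brute-force sketch over midpoints between consecutive x values (on freshly generated data, so the numbers differ from the tree above) confirms this matches sklearn's choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel()

def weighted_mse(y_left, y_right):
    """Sample-weighted average of the two children's variances."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)) / n

# Candidate thresholds: midpoints between consecutive sorted x values,
# the same candidates sklearn considers.
candidates = (X[:-1, 0] + X[1:, 0]) / 2
best = min(candidates,
           key=lambda t: weighted_mse(y[X[:, 0] <= t], y[X[:, 0] > t]))

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
t_sk = tree.tree_.threshold[0]

# Both thresholds achieve the same (minimal) weighted child variance.
assert np.isclose(
    weighted_mse(y[X[:, 0] <= best], y[X[:, 0] > best]),
    weighted_mse(y[X[:, 0] <= t_sk], y[X[:, 0] > t_sk]),
)
```

Note that minimizing the weighted child variance is equivalent to maximizing the drop from the parent's `squared_error` shown in the plot above.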
2.3 Second Split
In [19]:
second_divide = DecisionTreeRegressor(max_depth=2)
In [20]:
second_divide.fit(data, label)
Out[20]:
DecisionTreeRegressor(max_depth=2)
In [21]:
second_divide_pred = second_divide.predict(viz_test_data)
In [22]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.plot(viz_test_data, second_divide_pred, color="C2")
Out[22]:
[<matplotlib.lines.Line2D at 0x17dc7459550>]
In [23]:
plot_tree(second_divide)
Out[23]:
[Text(0.5, 0.8333333333333334, 'X[0] <= 3.088\nsquared_error = 0.58\nsamples = 80\nvalue = 0.062'),
Text(0.25, 0.5, 'X[0] <= 2.185\nsquared_error = 0.179\nsamples = 48\nvalue = 0.565'),
Text(0.125, 0.16666666666666666, 'squared_error = 0.119\nsamples = 31\nvalue = 0.745'),
Text(0.375, 0.16666666666666666, 'squared_error = 0.12\nsamples = 17\nvalue = 0.235'),
Text(0.75, 0.5, 'X[0] <= 3.903\nsquared_error = 0.236\nsamples = 32\nvalue = -0.691'),
Text(0.625, 0.16666666666666666, 'squared_error = 0.147\nsamples = 16\nvalue = -0.358'),
Text(0.875, 0.16666666666666666, 'squared_error = 0.104\nsamples = 16\nvalue = -1.025')]
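Each leaf's `value` in the plot above is the mean of the training labels that fall into that leaf, and predict simply returns the leaf's value. A sketch using `tree.apply` (on freshly generated data) verifies this:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel()

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# apply() returns the leaf index of each sample; every sample routed to
# a leaf gets the mean of the training labels that landed in that leaf.
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    mask = leaves == leaf
    assert np.allclose(tree.predict(X[mask]), y[mask].mean())
```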
3. Effect of Depth
In [24]:
shallow_depth_tree = DecisionTreeRegressor(max_depth=2)
deep_depth_tree = DecisionTreeRegressor(max_depth=5)
In [25]:
shallow_depth_tree.fit(data, label)
deep_depth_tree.fit(data, label)
Out[25]:
DecisionTreeRegressor(max_depth=5)
In [26]:
shallow_pred = shallow_depth_tree.predict(viz_test_data)
deep_pred = deep_depth_tree.predict(viz_test_data)
In [27]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.plot(viz_test_data, shallow_pred, color="C2", label="shallow")
plt.plot(viz_test_data, deep_pred, color="C3", label="deep")
plt.legend()
Out[27]:
<matplotlib.legend.Legend at 0x17dc75a4bd0>
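Deeper trees fit the training data more and more closely: training error can only fall (or stay equal) as max_depth grows, which is why the deep curve in the plot above chases individual noisy points. A sketch on synthetic data (the 0.3 noise scale is an assumption, not from the original):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2021)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

# Train trees of increasing depth on the same data and record training MSE.
train_mse = []
for depth in (1, 2, 5, 10):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    train_mse.append(np.mean((tree.predict(X) - y) ** 2))

# A deeper tree refines the shallower tree's partition, so training error
# is non-increasing in depth -- eventually it memorizes the noise.
assert all(a >= b for a, b in zip(train_mse, train_mse[1:]))
```

Low training error at high depth is not generalization; to pick a depth in practice, the error should be measured on held-out data instead.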