Decision Tree Regressor

2024. 3. 12. 11:19·Machine Learning/Decision Tree

 

 

 
 

Sample Data and the Decision Tree Regressor

 
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)
 
 

1. Data

 
 

1.1 Data Load

 
 

We generate the sample data used in this example.

 
In [37]:
data = np.sort(np.random.uniform(low=0, high=5, size=(80, 1)))
label = np.sin(data).ravel()
label[::5] += 3 * (0.5 - np.random.uniform(0, 1, 16))
 
In [38]:
data
 
Out[38]:
array([[4.98555781],
       [4.39531746],
       [4.0968716 ],
       [4.46267696],
       [1.62082743],
       [0.89626001],
       [3.23132164],
       [1.39412322],
       [4.60394844],
       [1.44702545],
       [1.94905   ],
       [4.62350621],
       [2.79565164],
       [3.40090045],
       [1.24196102],
       [1.55559205],
       [3.73480957],
       [3.33060183],
       [4.96312259],
       [2.37902113],
       [0.34574107],
       [0.37108979],
       [0.26366322],
       [0.13662432],
       [2.94732145],
       [1.36590232],
       [2.0226209 ],
       [2.2298019 ],
       [3.1469101 ],
       [3.81155277],
       [1.75182344],
       [3.03209918],
       [4.01242566],
       [2.46340263],
       [2.19040695],
       [0.62292664],
       [4.57442919],
       [2.4853704 ],
       [1.70078095],
       [3.85512632],
       [4.94801485],
       [2.85975008],
       [4.47468291],
       [2.97523582],
       [4.55262731],
       [2.61356773],
       [1.11372545],
       [3.27935218],
       [1.07702405],
       [2.80873149],
       [0.58044317],
       [2.20826445],
       [0.08918632],
       [3.21400392],
       [2.24942722],
       [4.59761292],
       [1.43174295],
       [4.7194593 ],
       [0.82566808],
       [2.72170941],
       [1.73767529],
       [0.7477184 ],
       [2.40799657],
       [1.21741249],
       [3.7011057 ],
       [2.25019692],
       [0.83846088],
       [3.33454411],
       [3.97575281],
       [1.74655661],
       [2.07496174],
       [3.38162782],
       [0.90069671],
       [4.87971069],
       [4.96135556],
       [3.96334816],
       [1.21015591],
       [2.14437313],
       [2.89742252],
       [0.05791837]])
 
In [39]:
label
 
Out[39]:
array([-0.47099749, -0.95015255, -0.81647483, -0.96898363,  0.99874871,
        1.66828822, -0.08960862,  0.98443386, -0.99412608,  0.99235016,
        1.15402488, -0.99605253,  0.33908209, -0.25641155,  0.94641911,
       -0.4537738 , -0.55903121, -0.18788581, -0.96873066,  0.6907831 ,
        1.68924227,  0.36263126,  0.26061892,  0.13619967,  0.1930515 ,
        0.35115093,  0.89965198,  0.79060154, -0.00531742, -0.62095472,
        0.68951082,  0.10927482, -0.76486582,  0.62738461,  0.81410464,
       -0.54340734, -0.99049863,  0.6101281 ,  0.99156389, -0.6545095 ,
       -1.55001728,  0.278126  , -0.97188069,  0.16559058, -0.98726523,
       -0.2792155 ,  0.89734903, -0.13732421,  0.88055125,  0.32674848,
       -0.17349289,  0.80360521,  0.08906814, -0.072348  ,  0.77843287,
       -0.73838726,  0.99034765, -0.99997501,  0.73500095,  0.40765384,
        1.21209889,  0.67996756,  0.66954502,  0.93820703, -0.53077356,
       -0.03559672,  0.74361494, -0.19175641, -0.74073257,  0.98459388,
        1.00511254, -0.23773678,  0.7837598 , -0.98603435, -0.96916758,
       -1.6120154 ,  0.93567103,  0.83996544,  0.24175116,  0.05788599])
 
 

The data has a single feature, and the target as a function of that feature looks like this:

 
In [5]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
 
Out[5]:
<matplotlib.collections.PathCollection at 0x17da49596d0>
 
 
 

1.2 Viz Data

 
 

We also generate test data for visualization.

 
In [42]:
viz_test_data = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]  # [:, np.newaxis] adds a dimension: (500,) -> (500, 1)
 
In [43]:
viz_test_data[:5]
 
Out[43]:
array([[0.  ],
       [0.01],
       [0.02],
       [0.03],
       [0.04]])
 
 

2. Decision Tree Regressor

 
 

Let's look at how the tree's prediction changes each time a split is made.

 
In [8]:
from sklearn.tree import DecisionTreeRegressor, plot_tree
 
 

2.1 No Splits

 
 

With no splits, the tree predicts the mean of the training labels.

 
In [9]:
viz_test_pred = np.repeat(label.mean(), len(viz_test_data))
 
 

Plotting this, we get the single horizontal line seen in the lecture.

 
In [10]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.plot(viz_test_data, viz_test_pred, color="C2")
 
Out[10]:
[<matplotlib.lines.Line2D at 0x17dc53e51d0>]
 
 
 

The MSE with no splits (the variance of the residuals) is computed as follows.

 
In [11]:
train_pred = np.repeat(label.mean(), len(data))
mse_var = np.var(label - train_pred)
 
In [12]:
print(f"no divide mse variance: {mse_var:.3f}")
 
 
no divide mse variance: 0.580
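As a side note, the constant prediction that minimizes the MSE over a region is exactly the mean of the labels in that region. A minimal numerical check on stand-in labels (freshly generated here, not the notebook's arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=80)  # stand-in labels, not the notebook's data

# With no splits the tree predicts one constant c for every sample.
# MSE(c) = mean((y - c)^2) decomposes as var(y) + (c - mean)^2,
# so it is minimized at c = y.mean().
mse_at_mean = np.mean((y - y.mean()) ** 2)
mse_nearby = np.mean((y - (y.mean() + 0.1)) ** 2)

print(mse_at_mean < mse_nearby)  # shifting away from the mean always costs MSE
```

By the decomposition above, `mse_nearby` exceeds `mse_at_mean` by exactly (0.1)² = 0.01.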
 
 

2.2 The First Split

 
In [13]:
first_divide = DecisionTreeRegressor(max_depth=1)
 
In [14]:
first_divide.fit(data, label)
 
Out[14]:
DecisionTreeRegressor(max_depth=1)
 
In [15]:
first_divide_pred = first_divide.predict(viz_test_data)
 
 

Drawing the two regions produced by the first split gives the following.

 
In [16]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.axvline(first_divide.tree_.threshold[0], color="red")
 
Out[16]:
<matplotlib.lines.Line2D at 0x17dc5336f10>
 
 
 

Within each region created by the split, the mean is computed again.

 
In [17]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.axvline(first_divide.tree_.threshold[0], color="red")
plt.plot(viz_test_data, first_divide_pred, color="C2")
 
Out[17]:
[<matplotlib.lines.Line2D at 0x17dc738ead0>]
 
 
 

To visualize the tree itself, use the plot_tree function.
The visualized tree looks like this.

 
In [18]:
plot_tree(first_divide)
 
Out[18]:
[Text(0.5, 0.75, 'X[0] <= 3.088\nsquared_error = 0.58\nsamples = 80\nvalue = 0.062'),
 Text(0.25, 0.25, 'squared_error = 0.179\nsamples = 48\nvalue = 0.565'),
 Text(0.75, 0.25, 'squared_error = 0.236\nsamples = 32\nvalue = -0.691')]
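How does the tree find that threshold? A sketch of the idea, using fresh synthetic data rather than the notebook's arrays: scan every midpoint between neighboring x values and keep the one that minimizes the summed squared error of the two resulting regions. This should coincide with what a depth-1 sklearn tree learns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, size=80))  # synthetic, not the notebook's data
y = np.sin(x)

def sse(v):
    # sum of squared deviations from the region's mean
    return ((v - v.mean()) ** 2).sum() if v.size else 0.0

# brute-force search over candidate thresholds (midpoints between neighbors)
best_t, best_sse = None, np.inf
for left, right in zip(x[:-1], x[1:]):
    t = (left + right) / 2
    total = sse(y[x <= t]) + sse(y[x > t])
    if total < best_sse:
        best_t, best_sse = t, total

stump = DecisionTreeRegressor(max_depth=1).fit(x[:, None], y)
print(best_t, stump.tree_.threshold[0])  # the two thresholds should agree
```

Minimizing the summed SSE of the two children is equivalent to maximizing the impurity decrease that the squared-error criterion uses.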
 
 
 

2.3 The Second Split

 
In [19]:
second_divide = DecisionTreeRegressor(max_depth=2)
 
In [20]:
second_divide.fit(data, label)
 
Out[20]:
DecisionTreeRegressor(max_depth=2)
 
In [21]:
second_divide_pred = second_divide.predict(viz_test_data)
 
In [22]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.plot(viz_test_data, second_divide_pred, color="C2")
 
Out[22]:
[<matplotlib.lines.Line2D at 0x17dc7459550>]
 
 
In [23]:
plot_tree(second_divide)
 
Out[23]:
[Text(0.5, 0.8333333333333334, 'X[0] <= 3.088\nsquared_error = 0.58\nsamples = 80\nvalue = 0.062'),
 Text(0.25, 0.5, 'X[0] <= 2.185\nsquared_error = 0.179\nsamples = 48\nvalue = 0.565'),
 Text(0.125, 0.16666666666666666, 'squared_error = 0.119\nsamples = 31\nvalue = 0.745'),
 Text(0.375, 0.16666666666666666, 'squared_error = 0.12\nsamples = 17\nvalue = 0.235'),
 Text(0.75, 0.5, 'X[0] <= 3.903\nsquared_error = 0.236\nsamples = 32\nvalue = -0.691'),
 Text(0.625, 0.16666666666666666, 'squared_error = 0.147\nsamples = 16\nvalue = -0.358'),
 Text(0.875, 0.16666666666666666, 'squared_error = 0.104\nsamples = 16\nvalue = -1.025')]
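The labels that plot_tree draws (threshold, samples, value) all come from the fitted `tree_` object, which exposes the tree as flat arrays. A sketch of reading them directly, again on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, size=80))[:, None]  # synthetic stand-in data
y = np.sin(x).ravel()

tree = DecisionTreeRegressor(max_depth=2).fit(x, y).tree_

for node in range(tree.node_count):
    if tree.children_left[node] == -1:  # -1 marks a leaf
        print(f"leaf {node}: samples={tree.n_node_samples[node]}, "
              f"value={tree.value[node, 0, 0]:.3f}")
    else:
        print(f"node {node}: X[0] <= {tree.threshold[node]:.3f}")
```

For a regressor, `tree.value[node, 0, 0]` is the mean label of the samples in that node, so the root's value equals `y.mean()`.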
 
 
 

3. Effect of Depth

 
In [24]:
shallow_depth_tree = DecisionTreeRegressor(max_depth=2)
deep_depth_tree = DecisionTreeRegressor(max_depth=5)
 
In [25]:
shallow_depth_tree.fit(data, label)
deep_depth_tree.fit(data, label)
 
Out[25]:
DecisionTreeRegressor(max_depth=5)
 
In [26]:
shallow_pred = shallow_depth_tree.predict(viz_test_data)
deep_pred = deep_depth_tree.predict(viz_test_data)
 
In [27]:
plt.figure(figsize=(8, 8))
plt.scatter(data, label, edgecolor="black", c="darkorange")
plt.plot(viz_test_data, shallow_pred, color="C2", label="shallow")
plt.plot(viz_test_data, deep_pred, color="C3", label="deep")
plt.legend()
 
Out[27]:
<matplotlib.legend.Legend at 0x17dc75a4bd0>
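The deep tree hugs the noisy points because training MSE can only go down as max_depth grows: the greedy splits at the top are the same, and every extra split further reduces the squared error on the training set. A quick check on synthetic noisy sine data mirroring the notebook's setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, size=80))[:, None]  # synthetic, not the notebook's data
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)  # noisy targets

train_mse = []
for depth in (1, 2, 5, 10):
    pred = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(x, y).predict(x)
    train_mse.append(np.mean((y - pred) ** 2))

print([round(m, 3) for m in train_mse])  # non-increasing with depth
```

The shrinking *training* error is exactly why deeper trees overfit: past some depth they are fitting the noise term, not the underlying sine curve.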
 
 