Non-Hierarchical Clustering

2024. 3. 15. 14:10·Machine Learning/Clustering

 

 

 
 

Sample Data and Non-Hierarchical Clustering

 
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)
 
 

1. Data

 
 

1.1 Sample Data

 
In [2]:
from sklearn.datasets import make_blobs


data, label = make_blobs(n_samples=1500, random_state=170)
 
In [3]:
plt.scatter(data[:, 0], data[:, 1], c=label)
 
Out[3]:
<matplotlib.collections.PathCollection at 0x16fc6415810>
 
 
In [52]:
# value_counts is a method and must be called; make_blobs splits the
# 1500 samples evenly across its 3 default centers, so each label
# should appear 500 times
pd.DataFrame(label).value_counts()
 
In [54]:
data
 
Out[54]:
array([[-5.19811282e+00,  6.41869316e-01],
       [-5.75229538e+00,  4.18627111e-01],
       [-1.08448984e+01, -7.55352273e+00],
       ...,
       [ 1.36105255e+00, -9.07491863e-01],
       [-3.54141108e-01,  7.12241630e-01],
       [ 1.88577252e+00,  1.41185693e-03]])
 
 

2. K Means

 
 

2.1 When the Number of Clusters Is Correct

 
In [4]:
from sklearn.cluster import KMeans

correct_kmeans = KMeans(n_clusters=3)
 
In [5]:
correct_kmeans.fit(data)
 
Out[5]:
KMeans(n_clusters=3)
 
In [6]:
correct_pred = correct_kmeans.predict(data)
 
In [55]:
correct_pred
 
Out[55]:
array([2, 2, 1, ..., 0, 0, 0])
 
In [7]:
correct_kmeans.cluster_centers_
 
Out[7]:
array([[ 1.91176144,  0.40634045],
       [-8.94137566, -5.48137132],
       [-4.55490993,  0.02920864]])
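
The centers printed above are the result of an iterative assign-and-update procedure. As a rough illustration of that idea, here is a minimal NumPy sketch of Lloyd's algorithm. It is not scikit-learn's implementation (which, among other things, defaults to k-means++ initialization); the function name `lloyd_kmeans` and the toy data are hypothetical and exist only for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: random init, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two well-separated groups of 2-D points
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
toy_labels, toy_centers = lloyd_kmeans(toy, k=2)
print(toy_centers)
```

The cluster indices themselves are arbitrary, which is why the labels returned by K Means generally do not match the numbering of the true labels even when the grouping is the same.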
 
In [8]:
correct_center = correct_kmeans.cluster_centers_
 
In [9]:
plt.scatter(data[:, 0], data[:, 1], c=correct_pred)
plt.scatter(correct_center[:, 0], correct_center[:, 1], marker="*", s=100, color="red")
 
Out[9]:
<matplotlib.collections.PathCollection at 0x16fc89caa50>
 
 
 

2.2 When the Number of Clusters Is Wrong

 
 

2.2.1 Too Few Clusters

 
In [10]:
small_kmeans = KMeans(n_clusters=2)
 
In [11]:
small_kmeans.fit(data)
 
Out[11]:
KMeans(n_clusters=2)
 
In [12]:
small_pred = small_kmeans.predict(data)
 
In [13]:
small_center = small_kmeans.cluster_centers_
 
In [14]:
plt.scatter(data[:, 0], data[:, 1], c=small_pred)
plt.scatter(small_center[:, 0], small_center[:, 1], marker="*", s=100, color="red")
 
Out[14]:
<matplotlib.collections.PathCollection at 0x16fd4e16c50>
 
 
 

2.2.2 Too Many Clusters

 
In [15]:
large_kmeans = KMeans(n_clusters=4)
 
In [16]:
large_kmeans.fit(data)
 
Out[16]:
KMeans(n_clusters=4)
 
In [17]:
large_pred = large_kmeans.predict(data)
 
In [18]:
large_center = large_kmeans.cluster_centers_
 
In [19]:
plt.scatter(data[:, 0], data[:, 1], c=large_pred)
plt.scatter(large_center[:, 0], large_center[:, 1], marker="*", s=100, color="red")
 
Out[19]:
<matplotlib.collections.PathCollection at 0x16fc8957190>
 
 
 

2.3 Finding an Appropriate K

 
In [63]:
sse_per_n = []

# Fit K Means for K = 1, 3, 5, 7, 9, 11 and record the SSE (inertia_) of each fit
for n in range(1, 12, 2):
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(data)
    sse = kmeans.inertia_
    sse_per_n.append(sse)
 
In [64]:
plt.plot(range(1, 12, 2), sse_per_n)
plt.title("Sum of Squared Error")
 
Out[64]:
Text(0.5, 1.0, 'Sum of Squared Error')
 
 
In [65]:
sse_per_n
 
Out[65]:
[43533.29436635672,
 2862.731914078957,
 2212.3564930014495,
 1695.4863403202762,
 1390.4019613869775,
 1118.5211919485205]
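
One way to read the elbow from these numbers is to look at how much of the previous SSE each step in K removes. The sketch below does that with the values printed above (rounded to two decimals); the K values follow `range(1, 12, 2)`, so K=2 and the other even values were not evaluated in this run.

```python
# K values and SSE copied (rounded) from the output above
ks = [1, 3, 5, 7, 9, 11]
sse = [43533.29, 2862.73, 2212.36, 1695.49, 1390.40, 1118.52]

# Fraction of the previous SSE removed by each step in K
drops = [(prev - cur) / prev for prev, cur in zip(sse, sse[1:])]

for prev_k, k, drop in zip(ks, ks[1:], drops):
    print(f"K={prev_k} -> K={k}: SSE falls by {drop:.1%}")
```

Going from K=1 to K=3 removes more than 90% of the SSE, while every later step removes less than 30% of what remains, which is consistent with the elbow sitting at K=3 among the evaluated values. Using `range(1, 12)` instead would evaluate every K and give a finer curve.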
 
 

3. Limitations of K Means

 
 

3.1 Clusters of Different Sizes

 
In [68]:
size_data, size_label = make_blobs(
    n_samples=1500,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=170
)
size_data
 
Out[68]:
array([[ -6.11119721,   1.47153062],
       [ -7.49665361,   0.9134251 ],
       [-10.84489837,  -7.55352273],
       ...,
       [  1.64990343,  -0.20117787],
       [  0.79230661,   0.60868888],
       [  1.91226342,   0.25327399]])
 
In [69]:
size_label
 
Out[69]:
array([1, 1, 0, ..., 2, 2, 2])
 
In [23]:
size_data = np.vstack(
    (size_data[size_label == 0][:500],
     size_data[size_label == 1][:100],
     size_data[size_label == 2][:10])
)
size_label = [0] * 500 + [1] * 100 + [2] * 10
 
In [24]:
plt.scatter(size_data[:, 0], size_data[:, 1], c=size_label)
 
Out[24]:
<matplotlib.collections.PathCollection at 0x16fd4ee3190>
 
 
In [25]:
size_kmeans = KMeans(n_clusters=3, random_state=2021)
 
In [26]:
size_pred = size_kmeans.fit_predict(size_data)
 
In [27]:
size_center = size_kmeans.cluster_centers_
 
In [28]:
plt.scatter(size_data[:, 0], size_data[:, 1], c=size_pred)
plt.scatter(size_center[:, 0], size_center[:, 1], marker="*", s=100, color="red")
 
Out[28]:
<matplotlib.collections.PathCollection at 0x16fd4f99650>
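
Besides the scatter plot, the size mismatch can be checked numerically by counting how many points K Means puts in each cluster and comparing that with the true sizes of 500, 100, and 10. The sketch below rebuilds the data with the same calls as above; `n_init=10` is an addition made here so the result does not depend on the scikit-learn version's default, and converting `size_label` to an array is only for `np.bincount`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Rebuild the unevenly sized dataset exactly as in the cells above
size_data, size_label = make_blobs(
    n_samples=1500, cluster_std=[1.0, 2.5, 0.5], random_state=170
)
size_data = np.vstack(
    (size_data[size_label == 0][:500],
     size_data[size_label == 1][:100],
     size_data[size_label == 2][:10])
)
size_label = np.array([0] * 500 + [1] * 100 + [2] * 10)

# n_init=10 is an assumption added for reproducibility across versions
size_pred = KMeans(n_clusters=3, n_init=10, random_state=2021).fit_predict(size_data)

true_sizes = np.bincount(size_label)
# Cluster indices are arbitrary, so compare sizes in sorted order
pred_sizes = np.sort(np.bincount(size_pred, minlength=3))[::-1]
print("true sizes:     ", true_sizes)
print("predicted sizes:", pred_sizes)
```

If the predicted sizes are far more balanced than the true ones, K Means has split the large cluster rather than isolating the small one.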
 
 
 

3.2 Clusters of Different Densities

 
In [29]:
density_data, density_label = make_blobs(
    n_samples=1500,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=170
)
 
In [30]:
plt.scatter(density_data[:, 0], density_data[:, 1], c=density_label)
 
Out[30]:
<matplotlib.collections.PathCollection at 0x16fd4e646d0>
 
 
In [31]:
density_kmeans = KMeans(n_clusters=3, random_state=2021)
 
In [32]:
density_pred = density_kmeans.fit_predict(density_data)
 
In [33]:
density_center = density_kmeans.cluster_centers_
 
In [34]:
plt.scatter(density_data[:, 0], density_data[:, 1], c=density_pred)
plt.scatter(density_center[:, 0], density_center[:, 1], marker="*", s=100, color="red")
 
Out[34]:
<matplotlib.collections.PathCollection at 0x16fd60e7550>
 
 
 

3.3 Clusters with Local Patterns

 
In [35]:
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
pattern_data = np.dot(data, transformation)
 
In [36]:
plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=label)
 
Out[36]:
<matplotlib.collections.PathCollection at 0x16fd60cc1d0>
 
 
In [37]:
pattern_kmeans = KMeans(n_clusters=3, random_state=2021)
 
In [38]:
pattern_pred = pattern_kmeans.fit_predict(pattern_data)
 
In [39]:
pattern_center = pattern_kmeans.cluster_centers_
 
In [40]:
plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=pattern_pred)
plt.scatter(pattern_center[:, 0], pattern_center[:, 1], marker="*", s=100, color="red")
 
Out[40]:
<matplotlib.collections.PathCollection at 0x16fd5f9ee10>
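
The plots in this section compare the K Means result with the true labels by eye. Because `make_blobs` returns ground-truth labels, the agreement can also be scored, for example with `adjusted_rand_score`, which ignores how the cluster indices are numbered. This is a sketch under assumptions: the helper `kmeans_ari` is hypothetical, and `n_init=10` is added here so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same data and linear transformation as in the cells above
data, label = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
pattern_data = np.dot(data, transformation)

def kmeans_ari(X, y):
    """Fit 3-cluster K Means and score it against the true labels (hypothetical helper)."""
    pred = KMeans(n_clusters=3, n_init=10, random_state=2021).fit_predict(X)
    return adjusted_rand_score(y, pred)

ari_blobs = kmeans_ari(data, label)
ari_pattern = kmeans_ari(pattern_data, label)
print(f"ARI on original blobs:   {ari_blobs:.3f}")
print(f"ARI on transformed data: {ari_pattern:.3f}")
```

A score of 1.0 means the clustering matches the true labels exactly; a clearly lower score on the transformed data would confirm what the scatter plot suggests.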
 
 
 

4. DBSCAN

 
 

This time, we apply DBSCAN to the datasets on which K Means showed its limitations.

 
In [41]:
from sklearn.cluster import DBSCAN
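
Before applying DBSCAN to the datasets above, a small sketch of how its output is read may help. `eps` is the neighborhood radius, `min_samples` is the number of points (including the point itself) needed within that radius for a point to count as a core point, and points that belong to no cluster get the label -1. The toy data below is hypothetical and only serves this illustration; the parameter values shown are scikit-learn's defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical toy data: two tight groups plus one far-away isolated point
group_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
group_b = rng.normal(loc=5.0, scale=0.1, size=(30, 2))
lonely = np.array([[20.0, 20.0]])
toy = np.vstack([group_a, group_b, lonely])

# eps=0.5 and min_samples=5 are the scikit-learn defaults, written out explicitly
toy_pred = DBSCAN(eps=0.5, min_samples=5).fit_predict(toy)

# Noise is labeled -1 and is not counted as a cluster
n_clusters = len(set(toy_pred.tolist())) - (1 if -1 in toy_pred else 0)
n_noise = int(np.sum(toy_pred == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

Unlike K Means, the number of clusters is not passed in; it falls out of the density parameters.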
 
 

4.1 Clusters of Different Sizes

 
In [42]:
size_dbscan = DBSCAN(eps=1.0)
 
In [43]:
size_db_pred = size_dbscan.fit_predict(size_data)
 
In [44]:
plt.scatter(size_data[:, 0], size_data[:, 1], c=size_db_pred)  # purple points are outliers (DBSCAN label -1)
 
Out[44]:
<matplotlib.collections.PathCollection at 0x16fd6232b90>
 
 
 

4.2 Clusters of Different Densities

 
In [45]:
density_dbscan = DBSCAN()
 
In [46]:
density_db_pred = density_dbscan.fit_predict(density_data)
 
In [47]:
plt.scatter(density_data[:, 0], density_data[:, 1], c=density_db_pred)
 
Out[47]:
<matplotlib.collections.PathCollection at 0x16fd62a4c50>
 
 
 

4.3 Clusters with Local Patterns

 
In [48]:
pattern_db = DBSCAN(eps=.3, min_samples=20)
 
In [49]:
pattern_db_pred = pattern_db.fit_predict(pattern_data)
 
In [50]:
plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=pattern_db_pred)
 
Out[50]:
<matplotlib.collections.PathCollection at 0x16fd63314d0>
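
The `eps` and `min_samples` values differ between the three DBSCAN runs, and the post does not say how they were chosen. One common heuristic, offered here as an assumption rather than the author's method, is to look at the sorted distance from each point to its `min_samples`-th nearest neighbor and pick `eps` near the point where that curve bends sharply upward. The sketch below computes that curve for the transformed data and prints a few percentiles instead of plotting.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Same data and linear transformation as in section 3.3
data, label = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
pattern_data = np.dot(data, transformation)

min_samples = 20
# Querying the training points themselves returns each point as its own
# nearest neighbor (distance 0), which matches DBSCAN counting the point
# itself toward min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(pattern_data)
distances, _ = nn.kneighbors(pattern_data)

# Sorted distance to the farthest of the min_samples neighbors
k_dist = np.sort(distances[:, -1])

for q in (50, 90, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` shows the same information as a curve; points far to the right of the bend are the ones DBSCAN would tend to mark as noise for a given `eps`.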
 
 