Sample Data and Non-Hierarchical Clustering¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)
1. Data¶
1.1 Sample Data¶
In [2]:
from sklearn.datasets import make_blobs
data, label = make_blobs(n_samples=1500, random_state=170)
In [3]:
plt.scatter(data[:, 0], data[:, 1], c=label)
Out[3]:
<matplotlib.collections.PathCollection at 0x16fc6415810>
In [52]:
# value_counts is a method, so it has to be called to get the per-label counts.
# make_blobs splits n_samples evenly across its 3 default centers (500 samples each).
pd.DataFrame(label).value_counts()
In [54]:
data
Out[54]:
array([[-5.19811282e+00, 6.41869316e-01],
[-5.75229538e+00, 4.18627111e-01],
[-1.08448984e+01, -7.55352273e+00],
...,
[ 1.36105255e+00, -9.07491863e-01],
[-3.54141108e-01, 7.12241630e-01],
[ 1.88577252e+00, 1.41185693e-03]])
2. K Means¶
2.1 When the Number of Clusters Is Correct¶
In [4]:
from sklearn.cluster import KMeans
correct_kmeans = KMeans(n_clusters=3)
In [5]:
correct_kmeans.fit(data)
Out[5]:
KMeans(n_clusters=3)
In [6]:
correct_pred = correct_kmeans.predict(data)
In [55]:
correct_pred
Out[55]:
array([2, 2, 1, ..., 0, 0, 0])
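The cluster ids returned by `predict` are arbitrary, so they need not match the ids in `label` even when the grouping itself is right. A minimal sketch of one way to check this is a contingency table of true labels against predicted clusters. The `n_init` and `random_state` arguments are assumptions added here only to make the sketch reproducible; they are not part of the original cells.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same sample data as in section 1.1
data, label = make_blobs(n_samples=1500, random_state=170)

# n_init and random_state are assumptions, fixed only for reproducibility
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2021)
pred = kmeans.fit_predict(data)

# Rows are the true blob labels, columns are the K Means cluster ids.
# A good clustering puts almost all of each row into a single column,
# but not necessarily the column with the same number.
table = pd.crosstab(label, pred, rownames=["true label"], colnames=["predicted cluster"])
print(table)
```

Reading the table row by row gives the mapping from each true label to the cluster id that K Means happened to assign to it.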
In [7]:
correct_kmeans.cluster_centers_
Out[7]:
array([[ 1.91176144, 0.40634045],
[-8.94137566, -5.48137132],
[-4.55490993, 0.02920864]])
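As a rough sanity check, the fitted centers can be compared with the centers that `make_blobs` actually sampled from. The sketch below assumes the `return_centers=True` option of `make_blobs` (available in recent scikit-learn versions) and again fixes `n_init` and `random_state` purely for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# return_centers=True additionally returns the generating centers
data, label, true_centers = make_blobs(
    n_samples=1500, random_state=170, return_centers=True
)

# n_init and random_state are assumptions, fixed only for reproducibility
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2021).fit(data)

# Pairwise distances: rows are true centers, columns are fitted centers
dists = np.linalg.norm(
    true_centers[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)  # fitted center closest to each true center
offsets = dists.min(axis=1)     # how far that fitted center is from the truth

for i, (j, d) in enumerate(zip(nearest, offsets)):
    print(f"true center {i} -> fitted center {j}, distance {d:.3f}")
```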
In [8]:
correct_center = correct_kmeans.cluster_centers_
In [9]:
plt.scatter(data[:, 0], data[:, 1], c=correct_pred)
plt.scatter(correct_center[:, 0], correct_center[:, 1], marker="*", s=100, color="red")
Out[9]:
<matplotlib.collections.PathCollection at 0x16fc89caa50>
2.2 When the Number of Clusters Is Wrong¶
2.2.1 Too Few Clusters¶
In [10]:
small_kmeans = KMeans(n_clusters=2)
In [11]:
small_kmeans.fit(data)
Out[11]:
KMeans(n_clusters=2)
In [12]:
small_pred = small_kmeans.predict(data)
In [13]:
small_center = small_kmeans.cluster_centers_
In [14]:
plt.scatter(data[:, 0], data[:, 1], c=small_pred)
plt.scatter(small_center[:, 0], small_center[:, 1], marker="*", s=100, color="red")
Out[14]:
<matplotlib.collections.PathCollection at 0x16fd4e16c50>
2.2.2 Too Many Clusters¶
In [15]:
large_kmeans = KMeans(n_clusters=4)
In [16]:
large_kmeans.fit(data)
Out[16]:
KMeans(n_clusters=4)
In [17]:
large_pred = large_kmeans.predict(data)
In [18]:
large_center = large_kmeans.cluster_centers_
In [19]:
plt.scatter(data[:, 0], data[:, 1], c=large_pred)
plt.scatter(large_center[:, 0], large_center[:, 1], marker="*", s=100, color="red")
Out[19]:
<matplotlib.collections.PathCollection at 0x16fc8957190>
2.3 Finding an Appropriate K¶
In [63]:
sse_per_n = []
for n in range(1, 12, 2):
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(data)
    # inertia_ is the within-cluster sum of squared distances (SSE)
    sse = kmeans.inertia_
    sse_per_n.append(sse)
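The `inertia_` attribute used in the loop is the sum of squared distances from each sample to the center of the cluster it was assigned to. A small sketch that recomputes it by hand makes this concrete; `n_init` and `random_state` are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data, label = make_blobs(n_samples=1500, random_state=170)

# n_init and random_state are assumptions, fixed only for reproducibility
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2021).fit(data)

# Look up the center assigned to every sample, then sum the squared offsets
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_sse = np.sum((data - assigned_centers) ** 2)

print("manual SSE :", manual_sse)
print("inertia_   :", kmeans.inertia_)
```

The two numbers should agree up to floating-point rounding.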
In [64]:
plt.plot(range(1, 12, 2), sse_per_n)
plt.title("Sum of Squared Error")
Out[64]:
Text(0.5, 1.0, 'Sum of Squared Error')
In [65]:
sse_per_n
Out[65]:
[43533.29436635672,
2862.731914078957,
2212.3564930014495,
1695.4863403202762,
1390.4019613869775,
1118.5211919485205]
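The SSE always decreases as K grows, so the curve is read by looking for the "elbow" where the drop levels off (here, the large drop from K=1 to K=3). As a complementary check, one could compute the silhouette score for several values of K and pick the largest. This is a swapped-in technique, not something the original cells use, and `n_init` and `random_state` are assumptions added for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

data, label = make_blobs(n_samples=1500, random_state=170)

# The silhouette score needs at least 2 clusters, so K starts at 2
scores = {}
for k in range(2, 7):
    # n_init and random_state are assumptions, fixed only for reproducibility
    pred = KMeans(n_clusters=k, n_init=10, random_state=2021).fit_predict(data)
    scores[k] = silhouette_score(data, pred)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")

# The K with the highest mean silhouette is a reasonable candidate
best_k = max(scores, key=scores.get)
print("best K by silhouette:", best_k)
```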
3. Limitations of K Means¶
3.1 Clusters of Different Sizes¶
In [68]:
size_data, size_label = make_blobs(
    n_samples=1500,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=170
)
size_data
Out[68]:
array([[ -6.11119721, 1.47153062],
[ -7.49665361, 0.9134251 ],
[-10.84489837, -7.55352273],
...,
[ 1.64990343, -0.20117787],
[ 0.79230661, 0.60868888],
[ 1.91226342, 0.25327399]])
In [69]:
size_label
Out[69]:
array([1, 1, 0, ..., 2, 2, 2])
In [23]:
# Keep 500, 100, and 10 points from clusters 0, 1, and 2 to make the sizes uneven
size_data = np.vstack(
    (size_data[size_label == 0][:500],
     size_data[size_label == 1][:100],
     size_data[size_label == 2][:10])
)
size_label = [0] * 500 + [1] * 100 + [2] * 10
In [24]:
plt.scatter(size_data[:, 0], size_data[:, 1], c=size_label)
Out[24]:
<matplotlib.collections.PathCollection at 0x16fd4ee3190>
In [25]:
size_kmeans = KMeans(n_clusters=3, random_state=2021)
In [26]:
size_pred = size_kmeans.fit_predict(size_data)
In [27]:
size_center = size_kmeans.cluster_centers_
In [28]:
plt.scatter(size_data[:, 0], size_data[:, 1], c=size_pred)
plt.scatter(size_center[:, 0], size_center[:, 1], marker="*", s=100, color="red")
Out[28]:
<matplotlib.collections.PathCollection at 0x16fd4f99650>
3.2 Clusters of Different Densities¶
In [29]:
density_data, density_label = make_blobs(
    n_samples=1500,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=170
)
In [30]:
plt.scatter(density_data[:, 0], density_data[:, 1], c=density_label)
Out[30]:
<matplotlib.collections.PathCollection at 0x16fd4e646d0>
In [31]:
density_kmeans = KMeans(n_clusters=3, random_state=2021)
In [32]:
density_pred = density_kmeans.fit_predict(density_data)
In [33]:
density_center = density_kmeans.cluster_centers_
In [34]:
plt.scatter(density_data[:, 0], density_data[:, 1], c=density_pred)
plt.scatter(density_center[:, 0], density_center[:, 1], marker="*", s=100, color="red")
Out[34]:
<matplotlib.collections.PathCollection at 0x16fd60e7550>
3.3 Clusters with Local Patterns¶
In [35]:
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
pattern_data = np.dot(data, transformation)
In [36]:
plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=label)
Out[36]:
<matplotlib.collections.PathCollection at 0x16fd60cc1d0>
In [37]:
pattern_kmeans = KMeans(n_clusters=3, random_state=2021)
In [38]:
pattern_pred = pattern_kmeans.fit_predict(pattern_data)
In [39]:
pattern_center = pattern_kmeans.cluster_centers_
In [40]:
plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=pattern_pred)
plt.scatter(pattern_center[:, 0], pattern_center[:, 1], marker="*", s=100, color="red")
Out[40]:
<matplotlib.collections.PathCollection at 0x16fd5f9ee10>
4. DBSCAN¶
Next, we apply DBSCAN to the datasets on which K Means showed its limitations.
In [41]:
from sklearn.cluster import DBSCAN
4.1 Clusters of Different Sizes¶
In [42]:
size_dbscan = DBSCAN(eps=1.0)
In [43]:
size_db_pred = size_dbscan.fit_predict(size_data)
In [44]:
plt.scatter(size_data[:, 0], size_data[:, 1], c=size_db_pred)  # purple points are outliers (DBSCAN labels noise as -1)
Out[44]:
<matplotlib.collections.PathCollection at 0x16fd6232b90>
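Unlike K Means, DBSCAN does not take the number of clusters as input and marks points it cannot attach to any dense region as noise with the label `-1`. A short sketch of how the result could be summarized numerically, rebuilding the unevenly sized data from section 3.1 so that it runs on its own:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Rebuild the unevenly sized data from section 3.1
size_data, size_label = make_blobs(
    n_samples=1500, cluster_std=[1.0, 2.5, 0.5], random_state=170
)
size_data = np.vstack(
    (size_data[size_label == 0][:500],
     size_data[size_label == 1][:100],
     size_data[size_label == 2][:10])
)

size_db_pred = DBSCAN(eps=1.0).fit_predict(size_data)

# Count how many points fall into each DBSCAN label (-1 is noise)
labels, counts = np.unique(size_db_pred, return_counts=True)
n_noise = int(np.sum(size_db_pred == -1))
n_clusters = len(labels) - (1 if n_noise > 0 else 0)

for lab, cnt in zip(labels, counts):
    print(f"label {lab}: {cnt} points")
print("clusters found:", n_clusters, "| noise points:", n_noise)
```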
4.2 Clusters of Different Densities¶
In [45]:
density_dbscan = DBSCAN()
In [46]:
density_db_pred = density_dbscan.fit_predict(density_data)
In [47]:
plt.scatter(density_data[:, 0], density_data[:, 1], c=density_db_pred)
Out[47]:
<matplotlib.collections.PathCollection at 0x16fd62a4c50>
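`DBSCAN()` with no arguments uses `eps=0.5` and `min_samples=5`. When clusters differ in density, a single `eps` is hard to choose, and a common heuristic (not used in the original cells, so treat it as an assumption) is to inspect the sorted distance from each point to its k-th nearest neighbor and look for the value at which the curve bends sharply upward. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

density_data, density_label = make_blobs(
    n_samples=1500, cluster_std=[1.0, 2.5, 0.5], random_state=170
)

min_samples = 5  # DBSCAN's default; the point itself counts as a neighbor

# Querying the training data returns each point as its own first neighbor,
# so the last column is the distance to the (min_samples - 1)-th other point
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(density_data)
distances, _ = neighbors.kneighbors(density_data)
k_dist = np.sort(distances[:, -1])

# Percentiles give a rough picture of the curve without plotting it
for q in (50, 90, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` shows the same information as a curve. Points in the sparse cluster have much larger k-distances than points in the dense ones, which is why one `eps` rarely suits all three clusters here.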
4.3 Clusters with Local Patterns¶
In [48]:
pattern_db = DBSCAN(eps=.3, min_samples=20)
In [49]:
pattern_db_pred = pattern_db.fit_predict(pattern_data)
In [50]:
plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=pattern_db_pred)
Out[50]:
<matplotlib.collections.PathCollection at 0x16fd63314d0>
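The plots compare the two methods visually. To put a number on the comparison, one option is the adjusted Rand index (ARI) against the known blob labels. It ignores how cluster ids are numbered and equals 1.0 for a perfect match. This metric is an addition for illustration, the `n_init` argument is an assumption, and DBSCAN noise points (label `-1`) are simply treated as one more group.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Rebuild the linearly transformed data from section 3.3
data, label = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
pattern_data = np.dot(data, transformation)

# n_init is an assumption, fixed only for reproducibility
kmeans_pred = KMeans(n_clusters=3, n_init=10, random_state=2021).fit_predict(pattern_data)
dbscan_pred = DBSCAN(eps=0.3, min_samples=20).fit_predict(pattern_data)

kmeans_ari = adjusted_rand_score(label, kmeans_pred)
dbscan_ari = adjusted_rand_score(label, dbscan_pred)

print(f"K Means ARI: {kmeans_ari:.3f}")
print(f"DBSCAN  ARI: {dbscan_ari:.3f}")
```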
