Sample Data and Non-Hierarchical Clustering¶

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)

1. Data¶

1.1 Sample Data¶

In [2]:

from sklearn.datasets import make_blobs

data, label = make_blobs(n_samples=1500, random_state=170)

In [3]:

plt.scatter(data[:, 0], data[:, 1], c=label)

Out[3]:

<matplotlib.collections.PathCollection at 0x16fc6415810>

In [52]:

pd.DataFrame(label).value_counts()

make_blobs splits the 1500 samples evenly across its 3 default centers, so each of the labels 0, 1, and 2 occurs 500 times.

In [54]:

data

Out[54]:

array([[-5.19811282e+00,  6.41869316e-01],
       [-5.75229538e+00,  4.18627111e-01],
       [-1.08448984e+01, -7.55352273e+00],
       ...,
       [ 1.36105255e+00, -9.07491863e-01],
       [-3.54141108e-01,  7.12241630e-01],
       [ 1.88577252e+00,  1.41185693e-03]])

2. K Means¶

2.1 When the Number of Clusters Is Correct¶

In [4]:

from sklearn.cluster import KMeans

correct_kmeans = KMeans(n_clusters=3)

In [5]:

correct_kmeans.fit(data)

Out[5]:

KMeans(n_clusters=3)

In [6]:

correct_pred = correct_kmeans.predict(data)

In [55]:

correct_pred

Out[55]:

array([2, 2, 1, ..., 0, 0, 0])

In [7]:

correct_kmeans.cluster_centers_

Out[7]:

array([[ 1.91176144,  0.40634045],
       [-8.94137566, -5.48137132],
       [-4.55490993,  0.02920864]])

In [8]:

correct_center = correct_kmeans.cluster_centers_

In [9]:

plt.scatter(data[:, 0], data[:, 1], c=correct_pred)
plt.scatter(correct_center[:, 0], correct_center[:, 1], marker="*", s=100, color="red")

Out[9]:

<matplotlib.collections.PathCollection at 0x16fc89caa50>
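The cluster ids returned by K Means are arbitrary: above, the first points are predicted as 2 while their true label is 1, so an element-wise comparison of `correct_pred` and `label` would be misleading. The following is a minimal sketch, assuming the same `make_blobs` settings as in section 1.1, that scores the agreement with scikit-learn's `adjusted_rand_score`, which does not depend on how the clusters are numbered. The explicit `n_init=10` and `random_state=2021` are assumptions added here for reproducibility and are not part of the cells above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Recreate the sample data from section 1.1
data, label = make_blobs(n_samples=1500, random_state=170)

# n_init and random_state are set explicitly (assumption) so the run is reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=2021)
pred = kmeans.fit_predict(data)

# ARI is 1.0 for a perfect match up to renaming of cluster ids
# and close to 0.0 for a random assignment
ari = adjusted_rand_score(label, pred)
print(f"Adjusted Rand Index: {ari:.3f}")
```

The same score can be computed for `small_pred` and `large_pred` in section 2.2 to see how a wrong K lowers the agreement.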

2.2 When the Number of Clusters Is Wrong¶

2.2.1 Too Few Clusters¶

In [10]:

small_kmeans = KMeans(n_clusters=2)

In [11]:

small_kmeans.fit(data)

Out[11]:

KMeans(n_clusters=2)

In [12]:

small_pred = small_kmeans.predict(data)

In [13]:

small_center = small_kmeans.cluster_centers_

In [14]:

plt.scatter(data[:, 0], data[:, 1], c=small_pred)
plt.scatter(small_center[:, 0], small_center[:, 1], marker="*", s=100, color="red")

Out[14]:

<matplotlib.collections.PathCollection at 0x16fd4e16c50>

2.2.2 Too Many Clusters¶

In [15]:

large_kmeans = KMeans(n_clusters=4)

In [16]:

large_kmeans.fit(data)

Out[16]:

KMeans(n_clusters=4)

In [17]:

large_pred = large_kmeans.predict(data)

In [18]:

large_center = large_kmeans.cluster_centers_

In [19]:

plt.scatter(data[:, 0], data[:, 1], c=large_pred)
plt.scatter(large_center[:, 0], large_center[:, 1], marker="*", s=100, color="red")

Out[19]:

<matplotlib.collections.PathCollection at 0x16fc8957190>

2.3 Finding an Appropriate K¶

In [63]:

sse_per_n = []

for n in range(1, 12, 2):
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(data)
    sse = kmeans.inertia_
    sse_per_n.append(sse)

In [64]:

plt.plot(range(1, 12, 2), sse_per_n)
plt.title("Sum of Squared Error")

Out[64]:

Text(0.5, 1.0, 'Sum of Squared Error')

In [65]:

sse_per_n

Out[65]:

[43533.29436635672,
 2862.731914078957,
 2212.3564930014495,
 1695.4863403202762,
 1390.4019613869775,
 1118.5211919485205]
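The values above correspond to K = 1, 3, 5, 7, 9, 11. The SSE falls from about 43533 at K = 1 to about 2863 at K = 3 and decreases only slowly afterwards, which is the "elbow" that points to K = 3. To make clear what `inertia_` measures, here is a small sketch that recomputes it by hand as the sum of squared Euclidean distances from each point to its assigned centroid. The `n_init=10` and `random_state=2021` arguments are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Recreate the sample data from section 1.1
data, _ = make_blobs(n_samples=1500, random_state=170)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=2021).fit(data)

# Look up, for every point, the centroid of the cluster it was assigned to
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# Sum of squared Euclidean distances to the assigned centroids
manual_sse = np.sum((data - assigned_centers) ** 2)

print(manual_sse, kmeans.inertia_)
```

The two printed numbers should agree up to floating-point error.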

3. Limitations of K Means¶

3.1 Clusters of Different Sizes¶

In [68]:

size_data, size_label = make_blobs(
    n_samples=1500,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=170
)
size_data

Out[68]:

array([[ -6.11119721,   1.47153062],
       [ -7.49665361,   0.9134251 ],
       [-10.84489837,  -7.55352273],
       ...,
       [  1.64990343,  -0.20117787],
       [  0.79230661,   0.60868888],
       [  1.91226342,   0.25327399]])

In [69]:

size_label

Out[69]:

array([1, 1, 0, ..., 2, 2, 2])

In [23]:

size_data = np.vstack(
    (size_data[size_label == 0][:500],
     size_data[size_label == 1][:100],
     size_data[size_label == 2][:10])
)
size_label = [0] * 500 + [1] * 100 + [2] * 10

In [24]:

plt.scatter(size_data[:, 0], size_data[:, 1], c=size_label)

Out[24]:

<matplotlib.collections.PathCollection at 0x16fd4ee3190>

In [25]:

size_kmeans = KMeans(n_clusters=3, random_state=2021)

In [26]:

size_pred = size_kmeans.fit_predict(size_data)

In [27]:

size_center = size_kmeans.cluster_centers_

In [28]:

plt.scatter(size_data[:, 0], size_data[:, 1], c=size_pred)
plt.scatter(size_center[:, 0], size_center[:, 1], marker="*", s=100, color="red")

Out[28]:

<matplotlib.collections.PathCollection at 0x16fd4f99650>
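Besides the scatter plot, the effect of unequal cluster sizes can be checked numerically by comparing how many points each true label has with how many points K Means puts into each cluster. This is a sketch under the assumption that the data is built exactly as in the cells above (500, 100, and 10 points for labels 0, 1, and 2); `n_init=10` is an added assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1500, cluster_std=[1.0, 2.5, 0.5], random_state=170)

# Keep 500 / 100 / 10 points of labels 0 / 1 / 2, as in the cells above
X_sub = np.vstack((X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))
y_sub = np.array([0] * 500 + [1] * 100 + [2] * 10)

pred = KMeans(n_clusters=3, n_init=10, random_state=2021).fit_predict(X_sub)

# Number of points per true label and per predicted cluster
true_sizes = np.bincount(y_sub)
pred_sizes = np.bincount(pred, minlength=3)

print("true sizes:     ", true_sizes)
print("predicted sizes:", pred_sizes)
```

Because the cluster ids are arbitrary, compare the two arrays as sets of sizes rather than position by position.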

3.2 Clusters of Different Densities¶

In [29]:

density_data, density_label = make_blobs(
    n_samples=1500,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=170
)

In [30]:

plt.scatter(density_data[:, 0], density_data[:, 1], c=density_label)

Out[30]:

<matplotlib.collections.PathCollection at 0x16fd4e646d0>

In [31]:

density_kmeans = KMeans(n_clusters=3, random_state=2021)

In [32]:

density_pred = density_kmeans.fit_predict(density_data)

In [33]:

density_center = density_kmeans.cluster_centers_

In [34]:

plt.scatter(density_data[:, 0], density_data[:, 1], c=density_pred)
plt.scatter(density_center[:, 0], density_center[:, 1], marker="*", s=100, color="red")

Out[34]:

<matplotlib.collections.PathCollection at 0x16fd60e7550>

3.3 Clusters with Local Patterns¶

In [35]:

transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
pattern_data = np.dot(data, transformation)

In [36]:

plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=label)

Out[36]:

<matplotlib.collections.PathCollection at 0x16fd60cc1d0>

In [37]:

pattern_kmeans = KMeans(n_clusters=3, random_state=2021)

In [38]:

pattern_pred = pattern_kmeans.fit_predict(pattern_data)

In [39]:

pattern_center = pattern_kmeans.cluster_centers_

In [40]:

plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=pattern_pred)
plt.scatter(pattern_center[:, 0], pattern_center[:, 1], marker="*", s=100, color="red")

Out[40]:

<matplotlib.collections.PathCollection at 0x16fd5f9ee10>

4. DBSCAN¶

This time, we apply DBSCAN to the datasets on which K Means showed its limitations.

In [41]:

from sklearn.cluster import DBSCAN

4.1 Clusters of Different Sizes¶

In [42]:

size_dbscan = DBSCAN(eps=1.0)

In [43]:

size_db_pred = size_dbscan.fit_predict(size_data)

In [44]:

plt.scatter(size_data[:, 0], size_data[:, 1], c=size_db_pred)  # purple points are outliers (label -1)

Out[44]:

<matplotlib.collections.PathCollection at 0x16fd6232b90>

4.2 Clusters of Different Densities¶

In [45]:

density_dbscan = DBSCAN()

In [46]:

density_db_pred = density_dbscan.fit_predict(density_data)

In [47]:

plt.scatter(density_data[:, 0], density_data[:, 1], c=density_db_pred)

Out[47]:

<matplotlib.collections.PathCollection at 0x16fd62a4c50>
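Unlike K Means, DBSCAN does not take the number of clusters as input, and it marks points that belong to no dense region with the label -1. The sketch below, which assumes the same density data and the default `DBSCAN()` settings (`eps=0.5`, `min_samples=5`) used above, counts the clusters that were found and the points flagged as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Recreate the data from section 3.2
X, _ = make_blobs(n_samples=1500, cluster_std=[1.0, 2.5, 0.5], random_state=170)

# Default parameters: eps=0.5, min_samples=5
labels = DBSCAN().fit_predict(X)

# Label -1 marks noise; every other label is a cluster id
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Number of points in each cluster (noise excluded)
cluster_sizes = np.bincount(labels[labels >= 0])

print("clusters found:", n_clusters)
print("noise points:  ", n_noise)
print("cluster sizes: ", cluster_sizes)
```

A large noise count or many tiny clusters suggests that `eps` is too small for the sparser regions of the data.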

4.3 Clusters with Local Patterns¶

In [48]:

pattern_db = DBSCAN(eps=.3, min_samples=20)

In [49]:

pattern_db_pred = pattern_db.fit_predict(pattern_data)

In [50]:

plt.scatter(pattern_data[:, 0], pattern_data[:, 1], c=pattern_db_pred)

Out[50]:

<matplotlib.collections.PathCollection at 0x16fd63314d0>
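The values `eps=.3` and `min_samples=20` above were picked by hand. One common heuristic for choosing `eps`, shown here only as a rough sketch and not as the method used in this post, is to sort every point's distance to its `min_samples`-th nearest neighbor and look for the point where the curve bends sharply. The use of `NearestNeighbors` for this is an assumption of the sketch.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Recreate the transformed data from section 3.3
data, _ = make_blobs(n_samples=1500, random_state=170)
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
pattern_data = np.dot(data, transformation)

min_samples = 20

# When the training set itself is queried, each point is returned as its own
# nearest neighbor (distance 0), which matches DBSCAN counting the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(pattern_data)
distances, _ = nn.kneighbors(pattern_data)

# Distance to the farthest of the min_samples neighbors, sorted ascending
k_dist = np.sort(distances[:, -1])

print("median k-distance:         ", np.median(k_dist))
print("90th percentile k-distance:", np.percentile(k_dist, 90))
```

Plotting `k_dist` with `plt.plot(k_dist)` gives the sorted k-distance curve; an `eps` near the bend keeps most points inside clusters while leaving the sparse tail as noise.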
