Predicting Water Potability with KNN¶

In [44]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)

1. Data¶

The dataset used in this exercise is for predicting whether water is potable (safe to drink).

1.1 Data Load¶

In [45]:

water = pd.read_csv("water_potability.csv")

In [46]:

data = water.drop(["Potability"], axis=1)
label = water["Potability"]

1.2 Data EDA¶

Let's look at the variables in the data. The count row differs from column to column, which means some columns contain missing values.

In [47]:

data.describe()

Out[47]:

	ph	Hardness	Solids	Chloramines	Sulfate	Conductivity	Organic_carbon	Trihalomethanes	Turbidity
count	2785.000000	3276.000000	3276.000000	3276.000000	2495.000000	3276.000000	3276.000000	3114.000000	3276.000000
mean	7.080795	196.369496	22014.092526	7.122277	333.775777	426.205111	14.284970	66.396293	3.966786
std	1.594320	32.879761	8768.570828	1.583085	41.416840	80.824064	3.308162	16.175008	0.780382
min	0.000000	47.432000	320.942611	0.352000	129.000000	181.483754	2.200000	0.738000	1.450000
25%	6.093092	176.850538	15666.690297	6.127421	307.699498	365.734414	12.065801	55.844536	3.439711
50%	7.036752	196.967627	20927.833607	7.130299	333.073546	421.884968	14.218338	66.622485	3.955028
75%	8.062066	216.667456	27332.762127	8.114887	359.950170	481.792304	16.557652	77.337473	4.500320
max	14.000000	323.124000	61227.196008	13.127000	481.030642	753.342620	28.300000	124.000000	6.739000

Let's check how many values are missing.

In [48]:

data.isna()

Out[48]:

	ph	Hardness	Solids	Chloramines	Sulfate	Conductivity	Organic_carbon	Trihalomethanes	Turbidity
0	True	False	False	False	False	False	False	False	False
1	False	False	False	False	True	False	False	False	False
2	False	False	False	False	True	False	False	False	False
3	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...
3271	False	False	False	False	False	False	False	False	False
3272	False	False	False	False	True	False	False	True	False
3273	False	False	False	False	True	False	False	False	False
3274	False	False	False	False	True	False	False	False	False
3275	False	False	False	False	True	False	False	False	False

3276 rows × 9 columns

In [49]:

data.isna().sum()

Out[49]:

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
dtype: int64
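With 3276 rows in total, the counts above correspond to roughly 15% missing for ph, 24% for Sulfate, and 5% for Trihalomethanes. A minimal sketch of computing such ratios directly, using a small made-up frame named `toy` (an assumption for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, np.nan],
    "Hardness": [200.0, 180.0, 190.0, 210.0],
    "Sulfate": [np.nan, 330.0, 340.0, 350.0],
})

# isna() gives a boolean frame; its column-wise mean is the fraction missing.
missing_ratio = toy.isna().mean()
print(missing_ratio)
```

On the real data the same call would be `data.isna().mean()`.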

1.3 Data Preprocess¶

Next we preprocess the data to get rid of the missing values. There are two ways to do this: drop the rows that contain missing values, or drop the columns that contain them.
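As a rough sketch, both options can also be expressed with pandas `dropna`, which is a shorter alternative to building the index by hand as the cells that follow do. The frame `toy` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two of three columns (illustrative only).
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5],
    "Hardness": [200.0, 180.0, 190.0],
    "Sulfate": [330.0, 340.0, np.nan],
})

rows_dropped = toy.dropna(axis=0)  # keep only rows with no missing value
cols_dropped = toy.dropna(axis=1)  # keep only columns with no missing value

print(rows_dropped.shape)
print(cols_dropped.shape)
```

If rows are dropped from the features, the label series has to be filtered with the same index so the two stay aligned; dropping columns leaves the rows, and therefore the labels, untouched.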

1.3.1 Dropping rows¶

In [50]:

data.isna().sum(axis=1)

Out[50]:

0       1
1       1
2       1
3       0
4       0
       ..
3271    0
3272    2
3273    1
3274    1
3275    1
Length: 3276, dtype: int64

In [51]:

na_cnt = data.isna().sum(axis=1)
na_cnt

Out[51]:

0       1
1       1
2       1
3       0
4       0
       ..
3271    0
3272    2
3273    1
3274    1
3275    1
Length: 3276, dtype: int64

In [52]:

drop_idx = na_cnt.loc[na_cnt > 0].index

In [54]:

drop_idx

Out[54]:

Int64Index([   0,    1,    2,    8,   11,   13,   14,   16,   18,   20,
            ...
            3247, 3252, 3258, 3259, 3260, 3266, 3272, 3273, 3274, 3275],
           dtype='int64', length=1265)

In [55]:

drop_row = data.drop(drop_idx, axis=0)

In [56]:

drop_row.shape

Out[56]:

(2011, 9)

In [57]:

data.shape

Out[57]:

(3276, 9)

1.3.2 Dropping columns¶

In [58]:

na_cnt = data.isna().sum()
drop_cols = na_cnt.loc[na_cnt > 0].index

In [59]:

drop_cols

Out[59]:

Index(['ph', 'Sulfate', 'Trihalomethanes'], dtype='object')

In [60]:

data = data.drop(drop_cols, axis=1)

1.4 Data Split¶

We split the data into train and test sets.

In [61]:

from sklearn.model_selection import train_test_split

train_data, test_data, train_label, test_label = train_test_split(
    data, label, train_size=0.7, random_state=2021
)

In [62]:

print(f"train_data size: {len(train_label)}, {len(train_label)/len(data):.2f}")
print(f"test_data size: {len(test_label)}, {len(test_label)/len(data):.2f}")

train_data size: 2293, 0.70
test_data size: 983, 0.30

2. KNN¶

In [63]:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

2.1 Best Hyper Parameter¶

The KNeighborsClassifier arguments we need to search over are the following.

n_neighbors
- The number of neighbors used to make a prediction.
p
- How the distance is computed.
- 1: manhattan distance
- 2: euclidean distance
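To make the role of `p` concrete, here is a small sketch of the Minkowski distance, which reduces to the Manhattan distance for p=1 and the Euclidean distance for p=2. The helper `minkowski` and the two points are hypothetical, written only for this illustration:

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

def minkowski(u, v, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```

The same pair of points is 7 apart under p=1 and 5 apart under p=2, so the choice of `p` can change which neighbors count as nearest.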

In [64]:

from sklearn.model_selection import GridSearchCV

2.1.1 Choosing the search range¶

In [65]:

params = {
    "n_neighbors": list(range(1, 12, 2)),
    "p": [1, 2]
}

In [66]:

params

Out[66]:

{'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]}
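A quick way to see how much work this grid implies is to enumerate it. The sketch below uses scikit-learn's `ParameterGrid` and assumes 3-fold cross-validation; the names `n_candidates` and `n_fits` are introduced here for illustration:

```python
from sklearn.model_selection import ParameterGrid

params = {"n_neighbors": list(range(1, 12, 2)), "p": [1, 2]}

# Every combination of n_neighbors and p that a grid search would try.
grid = list(ParameterGrid(params))
n_candidates = len(grid)

# With 3-fold cross-validation each candidate is fitted 3 times.
n_fits = n_candidates * 3

print(n_candidates, n_fits)
```

That is 12 candidates and 36 cross-validation fits, plus one final refit of the best candidate on the full training set when `refit=True` (the default).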

2.1.2 Search¶

In [67]:

grid_cv = GridSearchCV(knn, param_grid=params, cv=3, n_jobs=-1)  # n_jobs=-1 uses all available CPU cores

In [68]:

grid_cv.fit(train_data, train_label)

Out[68]:

GridSearchCV(cv=3, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]})

2.1.3 Results¶

In [69]:

print(f"Best score of parameter search is: {grid_cv.best_score_:.4f}")

Best score of parameter search is: 0.5652

In [70]:

grid_cv.best_params_

Out[70]:

{'n_neighbors': 11, 'p': 1}

In [71]:

print("Best parameters for the best score:")
print(f"\t n_neighbors: {grid_cv.best_params_['n_neighbors']}")
print(f"\t p: {grid_cv.best_params_['p']}")

Best parameters for the best score:
	 n_neighbors: 11
	 p: 1

2.1.4 Prediction¶

In [72]:

train_pred = grid_cv.best_estimator_.predict(train_data)
test_pred = grid_cv.best_estimator_.predict(test_data)

2.1.5 Evaluation¶

In [73]:

from sklearn.metrics import accuracy_score

train_acc = accuracy_score(train_label, train_pred)
test_acc = accuracy_score(test_label, test_pred)

In [74]:

print(f"train accuracy is {train_acc:.4f}")
print(f"test accuracy is {test_acc:.4f}")

train accuracy is 0.6520
test accuracy is 0.5595
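Accuracy is the fraction of predictions that match the true labels, so it is easier to interpret next to a trivial baseline that always predicts the most frequent class. The sketch below uses made-up labels (`y_true`, `y_pred`) rather than the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

acc = accuracy_score(y_true, y_pred)
manual = float(np.mean(y_true == y_pred))  # same quantity computed by hand

# Baseline: always predict the most frequent class in y_true.
majority = np.bincount(y_true).argmax()
baseline = float(np.mean(y_true == majority))

print(acc, manual, baseline)
```

On the real data the baseline could be obtained the same way from `train_label` and `test_label`; a model whose test accuracy is close to that baseline has learned little beyond the class frequencies.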

3. With Scaling¶

3.1 Data Scaling¶

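KNN is a distance-based algorithm, so it is sensitive to the magnitude of each feature: a feature with large values dominates the distance.
We apply scaling to put the features on a comparable scale.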
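A small sketch of why this matters: the two columns below are made-up values loosely on the scales of Solids and Turbidity from the describe() output. Before scaling, the distance is driven almost entirely by the first column. The helper `euclid` is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up points: column 0 is in the tens of thousands, column 1 is single digits.
X = np.array([
    [20000.0, 3.0],
    [20100.0, 6.0],   # very different in column 1, close in column 0
    [25000.0, 3.1],   # very different in column 0, close in column 1
])

def euclid(u, v):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(u - v))

raw_01 = euclid(X[0], X[1])
raw_02 = euclid(X[0], X[2])

# Standardize each column to zero mean and unit variance.
Xs = StandardScaler().fit_transform(X)
scaled_01 = euclid(Xs[0], Xs[1])
scaled_02 = euclid(Xs[0], Xs[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

On the raw values the large difference in the second column barely registers next to the first column; after standardization both differences contribute on a similar footing.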

In [75]:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [76]:

scaler.fit(train_data)

Out[76]:

StandardScaler()

In [77]:

scaled_train_data = scaler.transform(train_data)
scaled_test_data = scaler.transform(test_data)

3.2 Search¶

In [78]:

scaling_knn = KNeighborsClassifier()
scaling_grid_cv = GridSearchCV(scaling_knn, param_grid=params, n_jobs=-1)

In [79]:

scaling_grid_cv.fit(scaled_train_data, train_label)

Out[79]:

GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]})

In [80]:

scaling_grid_cv.best_score_

Out[80]:

0.587011825593896

In [81]:

scaling_grid_cv.best_params_

Out[81]:

{'n_neighbors': 9, 'p': 1}
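Two caveats about comparing this score with the unscaled one. First, this search does not pass `cv`, so it uses the GridSearchCV default of 5 folds, while the earlier search used cv=3. Second, the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. One possible alternative, sketched below on synthetic data from `make_classification` (an assumption, not the water dataset), is to put the scaler and the classifier in a `Pipeline` so the scaler is refitted inside every fold. The step names `scaler` and `knn` are chosen here for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (illustrative only).
X, y = make_classification(n_samples=200, n_features=6, random_state=2021)

# The scaler is part of the estimator, so it is fitted on each CV training fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
pipe_params = {
    "knn__n_neighbors": list(range(1, 12, 2)),
    "knn__p": [1, 2],
}

search = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"{search.best_score_:.4f}")
```

With this layout, `search.predict` applies the fitted scaler automatically, so separate `scaled_train_data` and `scaled_test_data` arrays are not needed.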

3.3 Evaluation¶

In [82]:

scaling_train_pred = scaling_grid_cv.best_estimator_.predict(scaled_train_data)
scaling_test_pred = scaling_grid_cv.best_estimator_.predict(scaled_test_data)

In [83]:

scaling_train_acc = accuracy_score(train_label, scaling_train_pred)
scaling_test_acc = accuracy_score(test_label, scaling_test_pred)

In [84]:

print(f"Scaled data train accuracy is {scaling_train_acc:.4f}")
print(f"Scaled data test accuracy is {scaling_test_acc:.4f}")

Scaled data train accuracy is 0.6829
Scaled data test accuracy is 0.5799

4. Wrap-up¶

In [85]:

print(f"test accuracy is {test_acc:.4f}")
print(f"Scaled data test accuracy is {scaling_test_acc:.4f}")

test accuracy is 0.5595
Scaled data test accuracy is 0.5799
