Predicting Water Potability with KNN¶
In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)
1. Data¶
The dataset used in this exercise is for determining whether water is potable (safe to drink).
1.1 Data Load¶
In [45]:
water = pd.read_csv("water_potability.csv")
In [46]:
data = water.drop(["Potability"], axis=1)
label = water["Potability"]
1.2 Data EDA¶
Let's look at the variables in the data. The count row shows that the counts differ between columns, which means some columns have missing values.
In [47]:
data.describe()
Out[47]:
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| count | 2785.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 2495.000000 | 3276.000000 | 3276.000000 | 3114.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 |
| std | 1.594320 | 32.879761 | 8768.570828 | 1.583085 | 41.416840 | 80.824064 | 3.308162 | 16.175008 | 0.780382 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 |
| 25% | 6.093092 | 176.850538 | 15666.690297 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 |
| 50% | 7.036752 | 196.967627 | 20927.833607 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 |
| 75% | 8.062066 | 216.667456 | 27332.762127 | 8.114887 | 359.950170 | 481.792304 | 16.557652 | 77.337473 | 4.500320 |
| max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 |
Let's count the missing values.
In [48]:
data.isna()
Out[48]:
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | True | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | True | False | False | False | False |
| 2 | False | False | False | False | True | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | False | False | False | False | False | False | False | False | False |
| 3272 | False | False | False | False | True | False | False | True | False |
| 3273 | False | False | False | False | True | False | False | False | False |
| 3274 | False | False | False | False | True | False | False | False | False |
| 3275 | False | False | False | False | True | False | False | False | False |
3276 rows × 9 columns
In [49]:
data.isna().sum()
Out[49]:
ph 491
Hardness 0
Solids 0
Chloramines 0
Sulfate 781
Conductivity 0
Organic_carbon 0
Trihalomethanes 162
Turbidity 0
dtype: int64
1.3 Data Preprocess¶
Next we preprocess the data by removing the missing values. There are two ways to do this: drop the rows that contain missing values, or drop the columns that contain them.
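As a minimal sketch of the two options, pandas' `DataFrame.dropna` can drop either the rows or the columns that contain missing values. The frame `toy` below is made up for illustration and is not the water data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the water data): column "b" has one missing value.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

# Option 1: drop every row that contains at least one missing value.
rows_dropped = toy.dropna(axis=0)

# Option 2: drop every column that contains at least one missing value.
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```

Dropping rows keeps every feature but loses samples, while dropping columns keeps every sample but loses features.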
1.3.1 Dropping rows¶
In [50]:
data.isna().sum(axis=1)
Out[50]:
0 1
1 1
2 1
3 0
4 0
..
3271 0
3272 2
3273 1
3274 1
3275 1
Length: 3276, dtype: int64
In [51]:
na_cnt = data.isna().sum(axis=1)
na_cnt
Out[51]:
0 1
1 1
2 1
3 0
4 0
..
3271 0
3272 2
3273 1
3274 1
3275 1
Length: 3276, dtype: int64
In [52]:
drop_idx = na_cnt.loc[na_cnt > 0].index
In [54]:
drop_idx
Out[54]:
Int64Index([ 0, 1, 2, 8, 11, 13, 14, 16, 18, 20,
...
3247, 3252, 3258, 3259, 3260, 3266, 3272, 3273, 3274, 3275],
dtype='int64', length=1265)
In [55]:
drop_row = data.drop(drop_idx, axis=0)
In [56]:
drop_row.shape
Out[56]:
(2011, 9)
In [57]:
data.shape
Out[57]:
(3276, 9)
1.3.2 Dropping columns¶
In [58]:
na_cnt = data.isna().sum()
drop_cols = na_cnt.loc[na_cnt > 0].index
In [59]:
drop_cols
Out[59]:
Index(['ph', 'Sulfate', 'Trihalomethanes'], dtype='object')
In [60]:
data = data.drop(drop_cols, axis=1)
1.4 Data Split¶
We split the data into train and test sets.
In [61]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_label, test_label = train_test_split(
data, label, train_size=0.7, random_state=2021
)
In [62]:
print(f"train_data size: {len(train_label)}, {len(train_label)/len(data):.2f}")
print(f"test_data size: {len(test_label)}, {len(test_label)/len(data):.2f}")
train_data size: 2293, 0.70
test_data size: 983, 0.30
2. KNN¶
In [63]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
2.1 Best Hyper Parameter¶
The KNeighborsClassifier arguments we need to search over are:
- n_neighbors
  - The number of neighbors used to make a prediction.
- p
  - How the distance is computed.
    - 1: Manhattan distance
    - 2: Euclidean distance
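To make the role of `p` concrete: with scikit-learn's default `metric="minkowski"`, `p` is the order of the Minkowski distance. The sketch below computes it by hand for two made-up points; the helper `minkowski` is written here for illustration and is not part of scikit-learn.

```python
import numpy as np

# Two hypothetical points, chosen only for illustration.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])


def minkowski(x, y, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))


manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```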
In [64]:
from sklearn.model_selection import GridSearchCV
2.1.1 Choosing the search range¶
In [65]:
params = {
"n_neighbors": list(range(1, 12, 2)),
"p": [1, 2]
}
In [66]:
params
Out[66]:
{'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]}
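To get a sense of how much work this grid implies, the sketch below counts the candidate combinations with scikit-learn's `ParameterGrid`. The variable names `n_candidates` and `n_folds` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "n_neighbors": list(range(1, 12, 2)),
    "p": [1, 2],
}

# 6 values of n_neighbors x 2 values of p
n_candidates = len(ParameterGrid(params))
n_folds = 3  # matches the cv=3 passed to GridSearchCV in this section

print(n_candidates)            # 12
print(n_candidates * n_folds)  # 36 fits during the search, plus one final refit
```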
2.1.2 Search¶
In [67]:
grid_cv = GridSearchCV(knn, param_grid=params, cv=3, n_jobs=-1)  # n_jobs=-1 uses all available CPU cores
In [68]:
grid_cv.fit(train_data, train_label)
Out[68]:
GridSearchCV(cv=3, estimator=KNeighborsClassifier(), n_jobs=-1,
param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]})
2.1.3 Results¶
In [69]:
print(f"Best score of parameter search is: {grid_cv.best_score_:.4f}")
Best score of parameter search is: 0.5652
In [70]:
grid_cv.best_params_
Out[70]:
{'n_neighbors': 11, 'p': 1}
In [71]:
print("Best parameter of best score is")
print(f"\t n_neighbors: {grid_cv.best_params_['n_neighbors']}")
print(f"\t p: {grid_cv.best_params_['p']}")
Best parameter of best score is
n_neighbors: 11
p: 1
2.1.4 Prediction¶
In [72]:
train_pred = grid_cv.best_estimator_.predict(train_data)
test_pred = grid_cv.best_estimator_.predict(test_data)
2.1.5 Evaluation¶
In [73]:
from sklearn.metrics import accuracy_score
train_acc = accuracy_score(train_label, train_pred)
test_acc = accuracy_score(test_label, test_pred)
In [74]:
print(f"train accuracy is {train_acc:.4f}")
print(f"test accuracy is {test_acc:.4f}")
train accuracy is 0.6520
test accuracy is 0.5595
3. With Scaling¶
3.1 Data Scaling¶
KNN is a distance-based algorithm, so it is sensitive to the magnitude of each feature.
We apply scaling to bring the features onto a comparable scale.
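The effect can be seen on a small made-up example. In the sketch below, the array `X` and the helper `euclid` are hypothetical: one feature takes values in the tens of thousands and the other in single digits, so the first dominates the raw Euclidean distance until both are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples with two features on very different scales
# (think of a large-valued column next to a small-valued one).
X = np.array([
    [20000.0, 3.0],
    [20100.0, 6.0],
    [25000.0, 3.1],
])


def euclid(u, v):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(u - v))


# Raw distances from sample 0: the first feature dominates,
# so the difference of 3 in the second feature barely registers.
raw_01 = euclid(X[0], X[1])  # about 100
raw_02 = euclid(X[0], X[2])  # about 5000

# After standardization both features contribute on a similar footing.
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = euclid(X_scaled[0], X_scaled[1])
scaled_02 = euclid(X_scaled[0], X_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, sample 1 looks about fifty times closer to sample 0 than sample 2 does. After scaling, the two distances are of similar size.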
In [75]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
In [76]:
scaler.fit(train_data)
Out[76]:
StandardScaler()
In [77]:
scaled_train_data = scaler.transform(train_data)
scaled_test_data = scaler.transform(test_data)
3.2 Search¶
In [78]:
scaling_knn = KNeighborsClassifier()
scaling_grid_cv = GridSearchCV(scaling_knn, param_grid=params, n_jobs=-1)  # cv is not set, so the default 5-fold CV is used (the unscaled search used cv=3)
In [79]:
scaling_grid_cv.fit(scaled_train_data, train_label)
Out[79]:
GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=-1,
param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]})
In [80]:
scaling_grid_cv.best_score_
Out[80]:
0.587011825593896
In [81]:
scaling_grid_cv.best_params_
Out[81]:
{'n_neighbors': 9, 'p': 1}
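One caveat with this search: the scaler was fit on all of `train_data` before cross-validation, so each validation fold influenced the scaling statistics. A common alternative is to put the scaler and the classifier in a `Pipeline` so that scaling is refit inside each fold. The sketch below assumes synthetic data from `make_classification` rather than the water dataset, and the step names `scaler` and `knn` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the sizes are arbitrary.
X, y = make_classification(n_samples=200, n_features=6, random_state=2021)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
pipe_params = {
    "knn__n_neighbors": list(range(1, 12, 2)),
    "knn__p": [1, 2],
}

pipe_grid_cv = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
pipe_grid_cv.fit(X, y)

print(pipe_grid_cv.best_params_)
```

With this setup, `pipe_grid_cv.predict` applies the fitted scaler automatically, so no separate `transform` call is needed on the test data.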
3.3 Evaluation¶
In [82]:
scaling_train_pred = scaling_grid_cv.best_estimator_.predict(scaled_train_data)
scaling_test_pred = scaling_grid_cv.best_estimator_.predict(scaled_test_data)
In [83]:
scaling_train_acc = accuracy_score(train_label, scaling_train_pred)
scaling_test_acc = accuracy_score(test_label, scaling_test_pred)
In [84]:
print(f"Scaled data train accuracy is {scaling_train_acc:.4f}")
print(f"Scaled data test accuracy is {scaling_test_acc:.4f}")
Scaled data train accuracy is 0.6829
Scaled data test accuracy is 0.5799
4. Wrap-up¶
In [85]:
print(f"test accuracy is {test_acc:.4f}")
print(f"Scaled data test accuracy is {scaling_test_acc:.4f}")
test accuracy is 0.5595
Scaled data test accuracy is 0.5799
