Determining Water Potability with KNN

2024. 3. 12. 12:11·Machine Learning/KNN

 

 

 
 


 
In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)
 
 

1. Data

 
 

The data used in this exercise is for determining whether water is potable (safe to drink).

 
 

1.1 Data Load

 
In [45]:
water = pd.read_csv("water_potability.csv")
 
In [46]:
data = water.drop(["Potability"], axis=1)
label = water["Potability"]
 
 

1.2 Data EDA

 
 

Let's look at the variables in the data. The count row differs from column to column, which means some columns contain missing values.

 
In [47]:
data.describe()
 
Out[47]:
  ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity
count 2785.000000 3276.000000 3276.000000 3276.000000 2495.000000 3276.000000 3276.000000 3114.000000 3276.000000
mean 7.080795 196.369496 22014.092526 7.122277 333.775777 426.205111 14.284970 66.396293 3.966786
std 1.594320 32.879761 8768.570828 1.583085 41.416840 80.824064 3.308162 16.175008 0.780382
min 0.000000 47.432000 320.942611 0.352000 129.000000 181.483754 2.200000 0.738000 1.450000
25% 6.093092 176.850538 15666.690297 6.127421 307.699498 365.734414 12.065801 55.844536 3.439711
50% 7.036752 196.967627 20927.833607 7.130299 333.073546 421.884968 14.218338 66.622485 3.955028
75% 8.062066 216.667456 27332.762127 8.114887 359.950170 481.792304 16.557652 77.337473 4.500320
max 14.000000 323.124000 61227.196008 13.127000 481.030642 753.342620 28.300000 124.000000 6.739000
 
 

Let's check how many values are missing.

 
In [48]:
data.isna()
 
Out[48]:
  ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity
0 True False False False False False False False False
1 False False False False True False False False False
2 False False False False True False False False False
3 False False False False False False False False False
4 False False False False False False False False False
... ... ... ... ... ... ... ... ... ...
3271 False False False False False False False False False
3272 False False False False True False False True False
3273 False False False False True False False False False
3274 False False False False True False False False False
3275 False False False False True False False False False

3276 rows × 9 columns

 
In [49]:
data.isna().sum()
 
Out[49]:
ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
dtype: int64
 
 

1.3 Data Preprocessing

 
 

We will preprocess the data by removing the missing values. There are two ways to do this: drop the rows that contain missing values, or drop the columns that contain them.

 
 

1.3.1 Dropping rows

 
In [50]:
data.isna().sum(axis=1)
 
Out[50]:
0       1
1       1
2       1
3       0
4       0
       ..
3271    0
3272    2
3273    1
3274    1
3275    1
Length: 3276, dtype: int64
 
In [51]:
na_cnt = data.isna().sum(axis=1)
na_cnt
 
Out[51]:
0       1
1       1
2       1
3       0
4       0
       ..
3271    0
3272    2
3273    1
3274    1
3275    1
Length: 3276, dtype: int64
 
In [52]:
drop_idx = na_cnt.loc[na_cnt > 0].index
 
In [54]:
drop_idx
 
Out[54]:
Int64Index([   0,    1,    2,    8,   11,   13,   14,   16,   18,   20,
            ...
            3247, 3252, 3258, 3259, 3260, 3266, 3272, 3273, 3274, 3275],
           dtype='int64', length=1265)
 
In [55]:
drop_row = data.drop(drop_idx, axis=0)
 
In [56]:
drop_row.shape
 
Out[56]:
(2011, 9)
 
In [57]:
data.shape
 
Out[57]:
(3276, 9)
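The row filter above can be written more compactly with `DataFrame.dropna`. The sketch below uses a small made-up frame (`toy` and `toy_label` are illustrative names and values, not part of the dataset) and also selects the matching labels, which would be needed if the row-dropped data were used for training.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the water data; the values are made up for illustration.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Hardness": [200.0, 180.0, 190.0, 210.0],
    "Sulfate": [330.0, 340.0, np.nan, 350.0],
})
toy_label = pd.Series([0, 1, 0, 1], name="Potability")

# Manual approach from the text: count NaNs per row, then drop those rows.
na_cnt = toy.isna().sum(axis=1)
drop_idx = na_cnt.loc[na_cnt > 0].index
manual = toy.drop(drop_idx, axis=0)

# Compact equivalent: drop every row that has at least one NaN.
compact = toy.dropna(axis=0)

# Keep the labels aligned with the rows that survived.
compact_label = toy_label.loc[compact.index]

print(manual.equals(compact))
print(compact.shape, compact_label.shape)
```

Note that in this post `drop_row` is only computed for comparison; the model below is trained on the column-dropped data.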
 
 

1.3.2 Dropping columns

 
In [58]:
na_cnt = data.isna().sum()
drop_cols = na_cnt.loc[na_cnt > 0].index
 
In [59]:
drop_cols
 
Out[59]:
Index(['ph', 'Sulfate', 'Trihalomethanes'], dtype='object')
 
In [60]:
data = data.drop(drop_cols, axis=1)
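The column filter can likewise be expressed with `dropna(axis=1)`. This is a minimal sketch on a made-up frame (the `toy` name and its values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the water data; the values are made up for illustration.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Hardness": [200.0, 180.0, 190.0, 210.0],
    "Sulfate": [330.0, 340.0, np.nan, 350.0],
    "Turbidity": [3.9, 4.1, 3.5, 4.4],
})

# Manual approach from the text: count NaNs per column, then drop those columns.
na_cnt = toy.isna().sum()
drop_cols = na_cnt.loc[na_cnt > 0].index
manual = toy.drop(drop_cols, axis=1)

# Compact equivalent: drop every column that has at least one NaN.
compact = toy.dropna(axis=1)

print(list(compact.columns))
print(manual.equals(compact))
```

The trade-off between the two approaches is visible in the shapes above: dropping rows keeps all 9 features but only 2011 of the 3276 samples, while dropping columns keeps every sample but only 6 features.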
 
 

1.4 Data Split

 
 

We split the data into train and test sets.

 
In [61]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_label, test_label = train_test_split(
    data, label, train_size=0.7, random_state=2021
)
 
In [62]:
print(f"train_data size: {len(train_label)}, {len(train_label)/len(data):.2f}")
print(f"test_data size: {len(test_label)}, {len(test_label)/len(data):.2f}")
 
 
train_data size: 2293, 0.70
test_data size: 983, 0.30
 
 

2. KNN

 
In [63]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
 
 

2.1 Best Hyperparameters

 
 

The KNeighborsClassifier arguments to search over are as follows.

  • n_neighbors
    • The number of neighbors used to make a prediction.
  • p
    • How the distance is calculated.
    • 1: manhattan distance
    • 2: euclidean distance
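To make the meaning of `p` concrete, here is a small sketch of the Minkowski distance, of which the Manhattan (`p=1`) and Euclidean (`p=2`) distances are special cases. The helper `minkowski_distance` is a hypothetical name written for this illustration; it is not the function scikit-learn calls internally.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


x = [0.0, 0.0]
y = [3.0, 4.0]

manhattan = minkowski_distance(x, y, p=1)  # |3| + |4|
euclidean = minkowski_distance(x, y, p=2)  # sqrt(3^2 + 4^2)

print(f"p=1 (manhattan): {manhattan}")
print(f"p=2 (euclidean): {euclidean}")
```

For the points (0, 0) and (3, 4) the Manhattan distance is 7 and the Euclidean distance is 5, so the choice of `p` can change which neighbors count as nearest.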
 
In [64]:
from sklearn.model_selection import GridSearchCV
 
 

2.1.1 Choosing the search range

 
In [65]:
params = {
    "n_neighbors": [i for i in range(1, 12, 2)],
    "p": [1, 2]
}
 
In [66]:
params
 
Out[66]:
{'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]}
 
 

2.1.2 Search

 
In [67]:
grid_cv = GridSearchCV(knn, param_grid=params, cv=3, n_jobs=-1)  # n_jobs=-1 uses all available CPU cores
 
In [68]:
grid_cv.fit(train_data, train_label)
 
Out[68]:
GridSearchCV(cv=3, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]})
 
 

2.1.3 Results

 
In [69]:
print(f"Best score of parameter search is: {grid_cv.best_score_:.4f}")
 
 
Best score of parameter search is: 0.5652
 
In [70]:
grid_cv.best_params_
 
Out[70]:
{'n_neighbors': 11, 'p': 1}
 
In [71]:
print("Best parameter of best score is")
print(f"\t n_neighbors: {grid_cv.best_params_['n_neighbors']}")
print(f"\t p: {grid_cv.best_params_['p']}")
 
 
Best parameter of best score is
	 n_neighbors: 11
	 p: 1
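Beyond the single best combination, `cv_results_` holds the cross-validated score of every candidate, which helps show how close the alternatives are. The sketch below runs the same grid on synthetic data from `make_classification` (an assumption made so that the example is self-contained; the names `X`, `y`, `search`, and `summary` are illustrative).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the water dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=2021)

params = {"n_neighbors": list(range(1, 12, 2)), "p": [1, 2]}
search = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=3)
search.fit(X, y)

# One row per parameter combination, sorted from best to worst rank.
results = pd.DataFrame(search.cv_results_)
cols = [
    "param_n_neighbors",
    "param_p",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[cols].sort_values("rank_test_score")
print(summary.head())
```

Comparing `mean_test_score` with `std_test_score` gives a rough sense of whether the gap between the top candidates is larger than the fold-to-fold variation.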
 
 

2.1.4 Prediction

 
In [72]:
train_pred = grid_cv.best_estimator_.predict(train_data)
test_pred = grid_cv.best_estimator_.predict(test_data)
 
 

2.1.5 Evaluation

 
In [73]:
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(train_label, train_pred)
test_acc = accuracy_score(test_label, test_pred)
 
In [74]:
print(f"train accuracy is {train_acc:.4f}")
print(f"test accuracy is {test_acc:.4f}")
 
 
train accuracy is 0.6520
test accuracy is 0.5595
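An accuracy figure is easier to interpret next to a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. The arrays are made up for illustration (they are not the water data), so the printed number says nothing about this dataset; with the real split one would pass `train_data`, `train_label`, `test_data`, and `test_label` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up data: features are random noise, labels are imbalanced on purpose.
rng = np.random.default_rng(2021)
X_train = rng.normal(size=(70, 3))
X_test = rng.normal(size=(30, 3))
y_train = np.array([0] * 45 + [1] * 25)
y_test = np.array([0] * 18 + [1] * 12)

# Always predict the majority class seen during fit (class 0 here).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"baseline test accuracy is {baseline_acc:.4f}")
```

If a model's test accuracy is not clearly above this baseline, the features or the model are adding little beyond the class ratio.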
 
 

3. With Scaling

 
 

3.1 Data Scaling

 
 

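KNN is a distance-based algorithm, so it is affected by the scale of each feature: a feature with a large numeric range dominates the distance.
We apply scaling to put the features on a comparable scale.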
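The effect of scale on distance can be seen with two features of very different magnitude. The sketch below uses made-up values in the rough ranges of `Solids` and `Turbidity` (the array `X` is an assumption for illustration) and measures how much of the squared Euclidean distance between the first two rows comes from the first feature, before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [Solids, Turbidity]; values are made up for illustration.
X = np.array([
    [20000.0, 2.0],
    [20500.0, 6.0],
    [30000.0, 2.1],
    [12000.0, 4.0],
])

# Per-feature contribution to the squared distance between rows 0 and 1.
raw_contrib = (X[0] - X[1]) ** 2
share = raw_contrib[0] / raw_contrib.sum()

# Standardize each column to zero mean and unit variance, then repeat.
X_scaled = StandardScaler().fit_transform(X)
scaled_contrib = (X_scaled[0] - X_scaled[1]) ** 2
scaled_share = scaled_contrib[0] / scaled_contrib.sum()

print(f"Solids share of squared distance (raw):    {share:.4f}")
print(f"Solids share of squared distance (scaled): {scaled_share:.4f}")
```

On the raw values almost the entire distance comes from `Solids`, even though the two rows differ far more in `Turbidity` relative to its spread; after standardization the `Turbidity` difference dominates instead.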

 
In [75]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
 
In [76]:
scaler.fit(train_data)
 
Out[76]:
StandardScaler()
 
In [77]:
scaled_train_data = scaler.transform(train_data)
scaled_test_data = scaler.transform(test_data)
 
 

3.2 Search

 
In [78]:
scaling_knn = KNeighborsClassifier()
scaling_grid_cv = GridSearchCV(scaling_knn, param_grid=params, n_jobs=-1)
 
In [79]:
scaling_grid_cv.fit(scaled_train_data, train_label)
 
Out[79]:
GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11], 'p': [1, 2]})
 
In [80]:
scaling_grid_cv.best_score_
 
Out[80]:
0.587011825593896
 
In [81]:
scaling_grid_cv.best_params_
 
Out[81]:
{'n_neighbors': 9, 'p': 1}
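One possible refinement, not used in this post, is to put the scaler and the classifier into a `Pipeline` and search over that. The scaler is then refit on the training part of each cross-validation fold, so the validation part never influences the scaling statistics. The sketch below runs on synthetic data from `make_classification`; the step names `"scaler"` and `"knn"`, the variable names, and the choice of `cv=3` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the water dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=2021)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2021
)

# Scaling and KNN as one estimator; parameters are addressed as <step>__<name>.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
pipe_params = {
    "knn__n_neighbors": list(range(1, 12, 2)),
    "knn__p": [1, 2],
}

pipe_cv = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
pipe_cv.fit(X_train, y_train)

test_score = pipe_cv.score(X_test, y_test)
print(pipe_cv.best_params_)
print(f"test accuracy is {test_score:.4f}")
```

With this setup the raw `train_data` and `test_data` could be passed directly, since the fitted pipeline applies the scaler itself before predicting.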
 
 

3.3 Evaluation

 
In [82]:
scaling_train_pred = scaling_grid_cv.best_estimator_.predict(scaled_train_data)
scaling_test_pred = scaling_grid_cv.best_estimator_.predict(scaled_test_data)
 
In [83]:
scaling_train_acc = accuracy_score(train_label, scaling_train_pred)
scaling_test_acc = accuracy_score(test_label, scaling_test_pred)
 
In [84]:
print(f"Scaled data train accuracy is {scaling_train_acc:.4f}")
print(f"Scaled data test accuracy is {scaling_test_acc:.4f}")
 
 
Scaled data train accuracy is 0.6829
Scaled data test accuracy is 0.5799
 
 

4. Wrap-up

 
In [85]:
print(f"test accuracy is {test_acc:.4f}")
print(f"Scaled data test accuracy is {scaling_test_acc:.4f}")
 
 
test accuracy is 0.5595
Scaled data test accuracy is 0.5799
 