Boosting Classification Advanced Practice - News Classification

2024. 3. 18. 19:01·Machine Learning/Boosting

 

 

 
 

Classifying News

 
In [1]:
!pip install catboost
 
 
Requirement already satisfied: catboost in /usr/local/lib/python3.7/dist-packages (1.0.0)
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.1.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.19.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (1.3.2)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.7/dist-packages (from plotly->catboost) (1.3.3)
 
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)
 
 

1. Data

 
 

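The data used in this exercise is a news classification dataset: the 20 Newsgroups collection, loaded through scikit-learn's fetch_20newsgroups.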

 
 

1.1 Data Load

 
In [3]:
from sklearn.datasets import fetch_20newsgroups


newsgroup = fetch_20newsgroups()
 
In [4]:
data, target = newsgroup["data"], newsgroup["target"]
 
In [5]:
print(data[0])
 
 
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





 
In [6]:
target[0]
 
Out[6]:
7
 
In [8]:
newsgroup["target_names"]
 
Out[8]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
 
 

1.2 Data Split

 
 

Use only the following newsgroups (target labels 16 through 19):

  • 'talk.politics.guns'
  • 'talk.politics.mideast'
  • 'talk.politics.misc'
  • 'talk.religion.misc'
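
As a quick check, the four group names can be mapped back to their integer labels. The sketch below copies the `target_names` list from the Out[8] listing above; the variable names `wanted` and `wanted_ids` are illustrative and not part of the original notebook.

```python
# target_names copied from the Out[8] listing above
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

# The four groups kept for this exercise
wanted = [
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
]

# Position in target_names is the integer label used in `target`
wanted_ids = [target_names.index(name) for name in wanted]
print(wanted_ids)  # [16, 17, 18, 19]
```

The labels form a contiguous range, which is why a simple range filter on `target` is enough to select these groups.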
 
In [9]:
len(newsgroup["target_names"])
 
Out[9]:
20
 
In [10]:
text = pd.Series(data, name="text")
target = pd.Series(target, name="target")
 
In [11]:
df = pd.concat([text, target], axis=1)  # pass axis by keyword; the positional form is rejected by recent pandas
 
In [12]:
df
 
Out[12]:
  text target
0 From: lerxst@wam.umd.edu (where's my thing)\nS... 7
1 From: guykuo@carson.u.washington.edu (Guy Kuo)... 4
2 From: twillis@ec.ecn.purdue.edu (Thomas E Will... 4
3 From: jgreen@amber (Joe Green)\nSubject: Re: W... 1
4 From: jcm@head-cfa.harvard.edu (Jonathan McDow... 14
... ... ...
11309 From: jim.zisfein@factory.com (Jim Zisfein) \n... 13
11310 From: ebodin@pearl.tufts.edu\nSubject: Screen ... 4
11311 From: westes@netcom.com (Will Estes)\nSubject:... 3
11312 From: steve@hcrlgw (Steven Collins)\nSubject: ... 1
11313 From: gunning@cco.caltech.edu (Kevin J. Gunnin... 8

11314 rows × 2 columns

 
In [13]:
df.target.value_counts().sort_index()
 
Out[13]:
0     480
1     584
2     591
3     590
4     578
5     593
6     585
7     594
8     598
9     597
10    600
11    595
12    591
13    594
14    593
15    599
16    546
17    564
18    465
19    377
Name: target, dtype: int64
 
In [14]:
df.query("16 <= target <= 19")
 
Out[14]:
  text target
5 From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\... 16
11 From: david@terminus.ericsson.se (David Bold)\... 19
33 From: ayr1@cunixa.cc.columbia.edu (Amir Y Rose... 17
34 From: joec@hilbert.cyprs.rain.com ( Joe Cipale... 18
39 From: bressler@iftccu.ca.boeing.com (Rick Bres... 16
... ... ...
11277 From: bob1@cos.com (Bob Blackshaw)\nSubject: R... 17
11280 From: jake@bony1.bony.com (Jake Livni)\nSubjec... 17
11299 From: 2120788@hydra.maths.unsw.EDU.AU ()\nSubj... 17
11304 From: Pegasus@aaa.uoregon.edu (Pegasus)\nSubje... 19
11305 From: shaig@composer.think.com (Shai Guday)\nS... 17

1952 rows × 2 columns

 
In [15]:
df_sample = df.query("16 <= target <= 19")
 
In [16]:
data = df_sample.text
target = df_sample.target
 
In [17]:
np.array(data).shape
 
Out[17]:
(1952,)
 
In [18]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021
)
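
The split above is a plain random split. Since the four classes are not equally sized (377 to 564 documents), one optional variation is a stratified split, which keeps the class proportions similar in both parts. The sketch below is not what the notebook does; it uses synthetic labels with the class counts from Out[13], and `docs` is only a stand-in for the real texts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported in Out[13] for targets 16-19
counts = {16: 546, 17: 564, 18: 465, 19: 377}
labels = np.concatenate([np.full(n, label) for label, n in counts.items()])
docs = np.arange(len(labels))  # stand-in for the 1952 documents

# stratify=labels asks for the same class mix in train and test
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, train_size=0.7, random_state=2021, stratify=labels
)

print(len(train_docs), len(test_docs))

# Share of the smallest class (19) in each split
train_share = (train_labels == 19).mean()
test_share = (test_labels == 19).mean()
print(f"{train_share:.3f} {test_share:.3f}")
```

With `train_size=0.7` on 1952 documents, either variant yields 1366 training rows, which matches the shape of `train_matrix` shown later.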
 
 

1.3 Count Vectorize

 
In [19]:
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')
 
 
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[19]:
True
 
 

Use every word that appears in the news articles (no frequency filter).

 
In [20]:
cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_data)
 
 
/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:507: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"
Out[20]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function word_tokenize at 0x7f0d2604fe60>,
                vocabulary=None)
 
In [21]:
len(cnt_vectorizer.vocabulary_)
 
Out[21]:
32480
 
 

Use only words that appear in at least 10 news articles (min_df=10).

 
In [22]:
cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize, min_df=10)
cnt_vectorizer.fit(train_data)
 
 
/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:507: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"
Out[22]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=10,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function word_tokenize at 0x7f0d2604fe60>,
                vocabulary=None)
 
In [23]:
len(cnt_vectorizer.vocabulary_)
 
Out[23]:
4237
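
To make the effect of `min_df` concrete, here is a minimal sketch on a made-up four-document corpus. It uses the default tokenizer rather than `word_tokenize`, and `min_df=2` instead of 10, purely so the example stays small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (invented for illustration)
docs = [
    "gun control law",
    "gun law debate",
    "peace talks debate",
    "gun rights",
]

# No filter: every distinct word becomes a feature
all_words = CountVectorizer().fit(docs)

# min_df=2: keep only words found in at least 2 documents
frequent_words = CountVectorizer(min_df=2).fit(docs)

print(sorted(all_words.vocabulary_))
print(sorted(frequent_words.vocabulary_))  # ['debate', 'gun', 'law']
```

`min_df` counts documents, not total occurrences: a word repeated many times in a single article is still dropped. In the notebook the same filter shrinks the vocabulary from 32480 to 4237 entries.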
 
In [24]:
train_matrix = cnt_vectorizer.transform(train_data)
test_matrix = cnt_vectorizer.transform(test_data)
 
 

2. XGBoost

 
In [25]:
import xgboost as xgb


xgb_clf = xgb.XGBClassifier()
 
 

2.1 Training

 
In [26]:
xgb_clf.fit(train_matrix, train_target)
 
Out[26]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
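
Note that the targets passed to `fit` here are 16 through 19, not 0 through 3. The XGBoost version used in this notebook accepted them, but more recent XGBoost releases are, to my knowledge, stricter and expect class labels numbered from 0. If you hit such an error, one possible workaround is to re-encode the labels first. The sketch below uses scikit-learn's `LabelEncoder` on a made-up `raw_target` array.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the same 16-19 range as train_target
raw_target = np.array([16, 19, 17, 18, 16, 17])

# Map the sorted unique labels to 0..n_classes-1
encoder = LabelEncoder()
encoded_target = encoder.fit_transform(raw_target)
print(encoded_target)  # [0 3 1 2 0 1]

# After predicting with a model trained on encoded labels,
# inverse_transform restores the original newsgroup ids
decoded = encoder.inverse_transform(encoded_target)
print(decoded)  # [16 19 17 18 16 17]
```

The encoder would be fitted on the training targets only and then reused to transform the test targets and decode predictions.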
 
 

2.2 Prediction

 
In [27]:
xgb_train_pred = xgb_clf.predict(train_matrix)
xgb_test_pred = xgb_clf.predict(test_matrix)
 
 

2.3 Evaluation

 
In [28]:
from sklearn.metrics import accuracy_score

xgb_train_acc = accuracy_score(train_target, xgb_train_pred)
xgb_test_acc = accuracy_score(test_target, xgb_test_pred)
 
In [29]:
print(f"XGBoost Train accuracy is {xgb_train_acc:.4f}")
print(f"XGBoost Test accuracy is {xgb_test_acc:.4f}")
 
 
XGBoost Train accuracy is 0.9663
XGBoost Test accuracy is 0.9078
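
Accuracy is a single number, so it does not show which newsgroups get confused with each other. A confusion matrix gives that per-class view. The following is a small sketch on made-up labels, not on the notebook's predictions; `y_true` and `y_pred` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels in the 16-19 range
y_true = [16, 16, 17, 17, 18, 19, 19, 19]
y_pred = [16, 17, 17, 17, 18, 19, 18, 19]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[16, 17, 18, 19])
print(cm)

# Accuracy equals the diagonal (correct predictions) over the total
acc_from_cm = np.trace(cm) / cm.sum()
print(acc_from_cm, accuracy_score(y_true, y_pred))  # 0.75 0.75
```

On the real data, the same call with `test_target` and `xgb_test_pred` would show, for example, how often 'talk.politics.misc' is mistaken for 'talk.politics.guns'.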
 
 

3. LightGBM

 
In [30]:
import lightgbm as lgb

lgb_clf = lgb.LGBMClassifier()
 
 

3.1 Training

 
In [31]:
train_matrix
 
Out[31]:
<1366x4237 sparse matrix of type '<class 'numpy.int64'>'
	with 236573 stored elements in Compressed Sparse Row format>
 
In [32]:
train_matrix.toarray()
 
Out[32]:
array([[1, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 6, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
 
In [33]:
lgb_clf.fit(train_matrix.toarray(), train_target)
 
Out[33]:
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
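
The notebook converts the sparse matrix to a dense array with `.toarray()` before passing it to LightGBM. The reason is not stated; one plausible explanation (an assumption on my part) is that the LightGBM version in use rejected the integer-valued sparse input. `LGBMClassifier` is documented to accept sparse matrices, so casting the counts to a float dtype while staying sparse may be an alternative worth trying. The sketch below only illustrates the memory difference, using a random placeholder matrix with the same shape and roughly the same density (about 4% non-zeros) as `train_matrix`.

```python
import numpy as np
from scipy import sparse

# Random placeholder with the shape of train_matrix (1366 x 4237)
X_sparse = sparse.random(1366, 4237, density=0.04, format="csr", random_state=0)

# Casting changes only the dtype of stored values; the CSR layout is kept
X_float32 = X_sparse.astype(np.float32)

# CSR memory = stored values + column indices + row pointers
sparse_bytes = (
    X_float32.data.nbytes + X_float32.indices.nbytes + X_float32.indptr.nbytes
)
dense_bytes = X_float32.toarray().nbytes

print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")
```

At this data size the dense copy is still small enough to be harmless, but the gap grows quickly with a larger vocabulary or more documents.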
 
 

3.2 Prediction

 
In [34]:
lgb_train_pred = lgb_clf.predict(train_matrix.toarray())
lgb_test_pred = lgb_clf.predict(test_matrix.toarray())
 
 

3.3 Evaluation

 
In [35]:
lgb_train_acc = accuracy_score(train_target, lgb_train_pred)
lgb_test_acc = accuracy_score(test_target, lgb_test_pred)
 
In [37]:
print(f"LightGBM train accuracy is {lgb_train_acc:.4f}")
print(f"LightGBM test accuracy is {lgb_test_acc:.4f}")
 
 
LightGBM train accuracy is 1.0000
LightGBM test accuracy is 0.9130
 
 

4. CatBoost

 
In [38]:
import catboost as cb


cb_clf = cb.CatBoostClassifier()
 
 

4.1 Training

 
In [39]:
cb_clf.fit(train_matrix, train_target, verbose=False)
 
Out[39]:
<catboost.core.CatBoostClassifier at 0x7f0d24bd1b50>
 
 

4.2 Prediction

 
In [40]:
cb_train_pred = cb_clf.predict(train_matrix)
cb_test_pred = cb_clf.predict(test_matrix)
 
 

4.3 Evaluation

 
In [41]:
cb_train_acc = accuracy_score(train_target, cb_train_pred)
cb_test_acc = accuracy_score(test_target, cb_test_pred)
 
In [42]:
print(f"CatBoost train accuracy is {cb_train_acc:.4f}")
print(f"CatBoost test accuracy is {cb_test_acc:.4f}")
 
 
CatBoost train accuracy is 1.0000
CatBoost test accuracy is 0.9386
 
 

5. Wrap-up

 
In [43]:
print(f"XGBoost test accuracy is {xgb_test_acc:.4f}")
print(f"LightGBM test accuracy is {lgb_test_acc:.4f}")
print(f"CatBoost test accuracy is {cb_test_acc:.4f}")
 
 
XGBoost test accuracy is 0.9078
LightGBM test accuracy is 0.9130
CatBoost test accuracy is 0.9386
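
To compare the three models side by side, the sketch below collects the accuracies printed above into one structure and also reports the train-test gap. The numbers are copied from the outputs in this post; the `results` dictionary itself is just an illustrative way to organize them.

```python
# Accuracies copied from the printed outputs above
results = {
    "XGBoost": {"train": 0.9663, "test": 0.9078},
    "LightGBM": {"train": 1.0000, "test": 0.9130},
    "CatBoost": {"train": 1.0000, "test": 0.9386},
}

# Rank models by test accuracy, best first
ranked = sorted(results.items(), key=lambda item: item[1]["test"], reverse=True)

for name, acc in ranked:
    # A large gap between train and test accuracy hints at overfitting
    gap = acc["train"] - acc["test"]
    print(f"{name:<9} test={acc['test']:.4f}  train-test gap={gap:.4f}")
```

All three models were run with default hyperparameters on a single 70/30 split, and each fits the training set almost perfectly while losing 6 to 9 points on the test set. The ranking is therefore best read as a rough first impression rather than a definitive comparison.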
 