Classifying News¶

In [1]:

!pip install catboost

Requirement already satisfied: catboost in /usr/local/lib/python3.7/dist-packages (1.0.0)
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.1.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.19.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (1.3.2)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.7/dist-packages (from plotly->catboost) (1.3.3)

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(2021)

1. Data¶

The dataset used in this exercise is a news classification dataset.

1.1 Data Load¶

In [3]:

from sklearn.datasets import fetch_20newsgroups

newsgroup = fetch_20newsgroups()

In [4]:

data, target = newsgroup["data"], newsgroup["target"]

In [5]:

print(data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

In [6]:

target[0]

Out[6]:

7

In [8]:

newsgroup["target_names"]

Out[8]:

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

1.2 Data Split¶

Use only the newsgroups below:

'talk.politics.guns'
'talk.politics.mideast'
'talk.politics.misc'
'talk.religion.misc'
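
The four groups above are the last four entries of `newsgroup["target_names"]`, which is why the filter later in this section selects targets 16 through 19. The sketch below checks that mapping using the name list printed in section 1.1; the variable names `target_names`, `wanted`, and `wanted_ids` are illustrative and not part of the original notebook.

```python
# Label names copied from the Out[8] listing above.
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

wanted = [
    "talk.politics.guns", "talk.politics.mideast",
    "talk.politics.misc", "talk.religion.misc",
]

# The integer label of each group is its position in target_names.
wanted_ids = [target_names.index(name) for name in wanted]
print(wanted_ids)  # [16, 17, 18, 19]
```

Looking the indices up by name rather than hard-coding them may be safer if the label order ever changes.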

In [9]:

len(newsgroup["target_names"])

Out[9]:

20

In [10]:

text = pd.Series(data, name="text")
target = pd.Series(target, name="target")

In [11]:

df = pd.concat([text, target], axis=1)  # pass axis by keyword; the positional form is deprecated in newer pandas

In [12]:

df

Out[12]:

	text	target
0	From: lerxst@wam.umd.edu (where's my thing)\nS...	7
1	From: guykuo@carson.u.washington.edu (Guy Kuo)...	4
2	From: twillis@ec.ecn.purdue.edu (Thomas E Will...	4
3	From: jgreen@amber (Joe Green)\nSubject: Re: W...	1
4	From: jcm@head-cfa.harvard.edu (Jonathan McDow...	14
...	...	...
11309	From: jim.zisfein@factory.com (Jim Zisfein) \n...	13
11310	From: ebodin@pearl.tufts.edu\nSubject: Screen ...	4
11311	From: westes@netcom.com (Will Estes)\nSubject:...	3
11312	From: steve@hcrlgw (Steven Collins)\nSubject: ...	1
11313	From: gunning@cco.caltech.edu (Kevin J. Gunnin...	8

11314 rows × 2 columns

In [13]:

df.target.value_counts().sort_index()

Out[13]:

0     480
1     584
2     591
3     590
4     578
5     593
6     585
7     594
8     598
9     597
10    600
11    595
12    591
13    594
14    593
15    599
16    546
17    564
18    465
19    377
Name: target, dtype: int64

In [14]:

df.query("16 <= target <= 19")

Out[14]:

	text	target
5	From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\...	16
11	From: david@terminus.ericsson.se (David Bold)\...	19
33	From: ayr1@cunixa.cc.columbia.edu (Amir Y Rose...	17
34	From: joec@hilbert.cyprs.rain.com ( Joe Cipale...	18
39	From: bressler@iftccu.ca.boeing.com (Rick Bres...	16
...	...	...
11277	From: bob1@cos.com (Bob Blackshaw)\nSubject: R...	17
11280	From: jake@bony1.bony.com (Jake Livni)\nSubjec...	17
11299	From: 2120788@hydra.maths.unsw.EDU.AU ()\nSubj...	17
11304	From: Pegasus@aaa.uoregon.edu (Pegasus)\nSubje...	19
11305	From: shaig@composer.think.com (Shai Guday)\nS...	17

1952 rows × 2 columns
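
The chained comparison inside `query` keeps rows whose target lies between 16 and 19 inclusive. As a sketch on a made-up toy frame (not the notebook's `df`), the same selection can also be written with a boolean mask:

```python
import pandas as pd

# Toy frame standing in for the notebook's `df`; the values are made up.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "target": [7, 16, 19, 4, 17],
})

by_query = toy.query("16 <= target <= 19")
by_mask = toy[toy["target"].between(16, 19)]  # inclusive on both ends by default

print(by_query.index.tolist())  # [1, 2, 4]
```

Both forms keep the original row index, which is why the filtered frame above still shows labels such as 5, 11, and 33.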

In [15]:

df_sample = df.query("16 <= target <= 19")

In [16]:

data = df_sample.text
target = df_sample.target

In [17]:

np.array(data).shape

Out[17]:

(1952,)

In [18]:

from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(
    data, target, train_size=0.7, random_state=2021
)
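
With `train_size=0.7` and 1952 filtered rows, the split should yield 1366 training rows, which matches the row count of the training matrix built later. The sketch below reproduces only the sizes, using dummy arrays (`X` and `y` are placeholders, not the real texts and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same length as the filtered data (1952 rows).
X = np.arange(1952)
y = np.arange(1952) % 4 + 16  # dummy labels in the range 16..19

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2021
)
print(len(X_train), len(X_test))  # 1366 586
```

Because the class counts are somewhat uneven (377 to 564 documents per group), passing `stratify=y` is an option worth considering, although the original notebook does not use it.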

1.3 Count Vectorize¶

In [19]:

import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Out[19]:

True

Use every word that appears in the news articles

In [20]:

cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_data)

/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:507: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"

Out[20]:

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function word_tokenize at 0x7f0d2604fe60>,
                vocabulary=None)
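
The `UserWarning` above appears because `token_pattern` is ignored whenever a custom `tokenizer` is supplied. A minimal sketch of one way to silence it, assuming your scikit-learn version accepts `token_pattern=None`; `str.split` is used here only as a stand-in for `word_tokenize` so the example needs no NLTK data:

```python
import warnings
from sklearn.feature_extraction.text import CountVectorizer

docs = ["WHAT car is this !?", "This car is a Bricklin"]

# str.split is a stand-in for nltk.word_tokenize (assumption for this sketch).
# token_pattern=None states explicitly that the regex pattern is unused.
vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    vectorizer.fit(docs)

# Text is lowercased before tokenizing, so "WHAT" becomes "what".
print(sorted(vectorizer.vocabulary_))
```

Note that whitespace splitting keeps punctuation such as `!?` as a token, whereas `word_tokenize` would split it further.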

In [21]:

len(cnt_vectorizer.vocabulary_)

Out[21]:

Use only words that appear in at least 10 news articles

In [22]:

cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize, min_df=10)
cnt_vectorizer.fit(train_data)

/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:507: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"

Out[22]:

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=10,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function word_tokenize at 0x7f0d2604fe60>,
                vocabulary=None)
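
`min_df` filters by document frequency, that is, by the number of documents containing a term, not by the total number of occurrences. A small sketch on a made-up corpus, with `min_df=2` instead of 10 so the effect is visible on four documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only.
docs = [
    "guns and politics",
    "politics and religion",
    "religion and guns",
    "space",
]

# min_df=2 keeps only terms found in at least two documents.
vectorizer = CountVectorizer(min_df=2).fit(docs)
print(sorted(vectorizer.vocabulary_))  # ['and', 'guns', 'politics', 'religion']
```

The word "space" occurs in a single document, so it is dropped from the vocabulary.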

In [23]:

len(cnt_vectorizer.vocabulary_)

Out[23]:

4237

In [24]:

train_matrix = cnt_vectorizer.transform(train_data)
test_matrix = cnt_vectorizer.transform(test_data)

2. XGBoost¶

In [25]:

import xgboost as xgb

xgb_clf = xgb.XGBClassifier()

2.1 Training¶

In [26]:

xgb_clf.fit(train_matrix, train_target)

Out[26]:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
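
The call above passes the labels 16 to 19 unchanged, which the XGBoost version used in this notebook accepts. Newer XGBoost releases may instead require class labels numbered from 0, so the following is a hedged sketch of remapping the labels with scikit-learn's `LabelEncoder`. The `labels` list is a made-up sample, and this step is not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

labels = [16, 19, 17, 18, 16]  # made-up sample of the notebook's targets

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # maps 16..19 to 0..3
restored = encoder.inverse_transform(encoded)  # maps back to 16..19

print(encoded.tolist())   # [0, 3, 1, 2, 0]
print(restored.tolist())  # [16, 19, 17, 18, 16]
```

In practice one would fit the encoder on `train_target`, train on the encoded labels, and apply `inverse_transform` to the predictions before scoring.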

2.2 Prediction¶

In [27]:

xgb_train_pred = xgb_clf.predict(train_matrix)
xgb_test_pred = xgb_clf.predict(test_matrix)

2.3 Evaluation¶

In [28]:

from sklearn.metrics import accuracy_score

xgb_train_acc = accuracy_score(train_target, xgb_train_pred)
xgb_test_acc = accuracy_score(test_target, xgb_test_pred)

In [29]:

print(f"XGBoost Train accuracy is {xgb_train_acc:.4f}")
print(f"XGBoost Test accuracy is {xgb_test_acc:.4f}")

XGBoost Train accuracy is 0.9663
XGBoost Test accuracy is 0.9078
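
Accuracy here is simply the fraction of documents whose predicted label equals the true label. A plain-Python sketch of that computation (the helper name `accuracy` is illustrative; the notebook itself relies on `accuracy_score`):

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Three of the four made-up predictions are correct.
print(accuracy([16, 17, 18, 19], [16, 17, 18, 18]))  # 0.75
```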

3. LightGBM¶

In [30]:

import lightgbm as lgb

lgb_clf = lgb.LGBMClassifier()

3.1 Training¶

In [31]:

train_matrix

Out[31]:

<1366x4237 sparse matrix of type '<class 'numpy.int64'>'
	with 236573 stored elements in Compressed Sparse Row format>

In [32]:

train_matrix.toarray()

Out[32]:

array([[1, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 6, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
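
Calling `toarray()` materializes every zero in the matrix. A back-of-the-envelope sketch, using only the figures from the sparse matrix summary above, shows how sparse the data is and how much memory the dense copy needs:

```python
# Numbers taken from the sparse matrix summary shown in Out[31].
n_rows, n_cols = 1366, 4237
stored = 236573

density = stored / (n_rows * n_cols)
dense_bytes = n_rows * n_cols * 8  # int64 uses 8 bytes per cell

print(f"density: {density:.2%}")                  # density: 4.09%
print(f"dense size: {dense_bytes / 1e6:.1f} MB")  # dense size: 46.3 MB
```

At this scale the dense copy is harmless, but the cost grows with both the number of documents and the vocabulary size.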

In [33]:

lgb_clf.fit(train_matrix.toarray(), train_target)

Out[33]:

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

3.2 Prediction¶

In [34]:

lgb_train_pred = lgb_clf.predict(train_matrix.toarray())
lgb_test_pred = lgb_clf.predict(test_matrix.toarray())

3.3 Evaluation¶

In [35]:

lgb_train_acc = accuracy_score(train_target, lgb_train_pred)
lgb_test_acc = accuracy_score(test_target, lgb_test_pred)

In [37]:

print(f"LightGBM train accuracy is {lgb_train_acc:.4f}")
print(f"LightGBM test accuracy is {lgb_test_acc:.4f}")

LightGBM train accuracy is 1.0000
LightGBM test accuracy is 0.9130

4. CatBoost¶

In [38]:

import catboost as cb

cb_clf = cb.CatBoostClassifier()

4.1 Training¶

In [39]:

cb_clf.fit(train_matrix, train_target, verbose=False)

Out[39]:

<catboost.core.CatBoostClassifier at 0x7f0d24bd1b50>

4.2 Prediction¶

In [40]:

cb_train_pred = cb_clf.predict(train_matrix)
cb_test_pred = cb_clf.predict(test_matrix)

4.3 Evaluation¶

In [41]:

cb_train_acc = accuracy_score(train_target, cb_train_pred)
cb_test_acc = accuracy_score(test_target, cb_test_pred)

In [42]:

print(f"CatBoost train accuracy is {cb_train_acc:.4f}")
print(f"CatBoost test accuracy is {cb_test_acc:.4f}")

CatBoost train accuracy is 1.0000
CatBoost test accuracy is 0.9386

5. Wrap-up¶

In [43]:

print(f"XGBoost test accuracy is {xgb_test_acc:.4f}")
print(f"LightGBM test accuracy is {lgb_test_acc:.4f}")
print(f"CatBoost test accuracy is {cb_test_acc:.4f}")

XGBoost test accuracy is 0.9078
LightGBM test accuracy is 0.9130
CatBoost test accuracy is 0.9386
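
To compare the three models side by side, the sketch below gathers the accuracies printed in this notebook and reports the best test score along with each model's train/test gap. The dictionary layout and the names `results`, `best`, and `gaps` are illustrative. Since every model ran with default hyperparameters on a single split, the ranking should be read as indicative only.

```python
# Accuracies copied from the printed results in sections 2.3, 3.3, and 4.3.
results = {
    "XGBoost": {"train": 0.9663, "test": 0.9078},
    "LightGBM": {"train": 1.0000, "test": 0.9130},
    "CatBoost": {"train": 1.0000, "test": 0.9386},
}

# Model with the highest test accuracy.
best = max(results, key=lambda name: results[name]["test"])

# Train minus test accuracy; a larger gap hints at more overfitting.
gaps = {name: round(r["train"] - r["test"], 4) for name, r in results.items()}

print(best)  # CatBoost
print(gaps)
```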
