Classifying News¶
In [1]:
!pip install catboost
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(2021)
1. Data¶
The dataset for this exercise is the 20 Newsgroups dataset, which is used to classify news posts by topic.
1.1 Data Load¶
In [3]:
from sklearn.datasets import fetch_20newsgroups
newsgroup = fetch_20newsgroups()
In [4]:
data, target = newsgroup["data"], newsgroup["target"]
In [5]:
print(data[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----
In [6]:
target[0]
Out[6]:
7
In [8]:
newsgroup["target_names"]
Out[8]:
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
1.2 Data Split¶
Use only the following newsgroups:
- 'talk.politics.guns'
- 'talk.politics.mideast'
- 'talk.politics.misc'
- 'talk.religion.misc'
In [9]:
len(newsgroup["target_names"])
Out[9]:
20
In [10]:
text = pd.Series(data, name="text")
target = pd.Series(target, name="target")
In [11]:
df = pd.concat([text, target], axis=1)  # pass axis by keyword; the positional form is rejected by newer pandas
In [12]:
df
Out[12]:
| | text | target |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |
| ... | ... | ... |
| 11309 | From: jim.zisfein@factory.com (Jim Zisfein) \n... | 13 |
| 11310 | From: ebodin@pearl.tufts.edu\nSubject: Screen ... | 4 |
| 11311 | From: westes@netcom.com (Will Estes)\nSubject:... | 3 |
| 11312 | From: steve@hcrlgw (Steven Collins)\nSubject: ... | 1 |
| 11313 | From: gunning@cco.caltech.edu (Kevin J. Gunnin... | 8 |
11314 rows × 2 columns
In [13]:
df.target.value_counts().sort_index()
Out[13]:
0 480
1 584
2 591
3 590
4 578
5 593
6 585
7 594
8 598
9 597
10 600
11 595
12 591
13 594
14 593
15 599
16 546
17 564
18 465
19 377
Name: target, dtype: int64
In [14]:
df.query("16 <= target <= 19")
Out[14]:
| | text | target |
|---|---|---|
| 5 | From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\... | 16 |
| 11 | From: david@terminus.ericsson.se (David Bold)\... | 19 |
| 33 | From: ayr1@cunixa.cc.columbia.edu (Amir Y Rose... | 17 |
| 34 | From: joec@hilbert.cyprs.rain.com ( Joe Cipale... | 18 |
| 39 | From: bressler@iftccu.ca.boeing.com (Rick Bres... | 16 |
| ... | ... | ... |
| 11277 | From: bob1@cos.com (Bob Blackshaw)\nSubject: R... | 17 |
| 11280 | From: jake@bony1.bony.com (Jake Livni)\nSubjec... | 17 |
| 11299 | From: 2120788@hydra.maths.unsw.EDU.AU ()\nSubj... | 17 |
| 11304 | From: Pegasus@aaa.uoregon.edu (Pegasus)\nSubje... | 19 |
| 11305 | From: shaig@composer.think.com (Shai Guday)\nS... | 17 |
1952 rows × 2 columns
In [15]:
df_sample = df.query("16 <= target <= 19")
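The range filter `16 <= target <= 19` relies on the four selected groups sitting next to each other at the end of `target_names`. A minimal sketch, assuming the name list printed earlier, that derives those indices by name instead of hard-coding the range. The names `wanted` and `wanted_ids` are illustrative and not part of the original notebook.

```python
# Category names in the order printed by newsgroup["target_names"] above.
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

# The four groups kept for this exercise (hypothetical helper names).
wanted = {
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
}
wanted_ids = [i for i, name in enumerate(target_names) if name in wanted]
print(wanted_ids)  # [16, 17, 18, 19]

# With the DataFrame built above, an equivalent filter would be:
# df_sample = df[df.target.isin(wanted_ids)]
```

Selecting by name keeps working if the groups of interest are not contiguous, which a range query cannot express.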
In [16]:
data = df_sample.text
target = df_sample.target
In [17]:
np.array(data).shape
Out[17]:
(1952,)
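As a sanity check, the 1952 rows should equal the sum of the four class counts shown in the `value_counts()` output. The sketch below copies those counts by hand and also computes a majority-class baseline, which is a useful floor when reading the accuracy numbers later. The variable names are illustrative.

```python
# Per-class counts copied from the value_counts() output above.
counts = {16: 546, 17: 564, 18: 465, 19: 377}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = counts[majority_label] / total

print(total)                                        # 1952
print(majority_label, round(majority_baseline, 4))  # 17 0.2889
```

The classes are mildly imbalanced, so any model scoring near 0.29 would be doing no better than always guessing class 17.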
In [18]:
from sklearn.model_selection import train_test_split
train_data, test_data, train_target, test_target = train_test_split(
data, target, train_size=0.7, random_state=2021
)
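The split sizes can be worked out by hand. This is a small arithmetic sketch, assuming scikit-learn rounds a fractional `train_size` down and gives the remainder to the test set; the resulting 1366 training rows match the matrix shape shown later in this post.

```python
import math

n_samples = 1952   # rows in df_sample
train_size = 0.7   # fraction passed to train_test_split

# Assumption: the train count is floored and the rest goes to the test set.
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train

print(n_train, n_test)  # 1366 586
```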
1.3 Count Vectorize¶
In [19]:
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Out[19]:
True
Use every word that appears in the news posts.
In [20]:
cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize)
cnt_vectorizer.fit(train_data)
/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:507: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn("The parameter 'token_pattern' will not be used"
Out[20]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=<function word_tokenize at 0x7f0d2604fe60>,
vocabulary=None)
In [21]:
len(cnt_vectorizer.vocabulary_)
Out[21]:
32480
Use only words that appear in at least 10 news posts.
In [22]:
cnt_vectorizer = CountVectorizer(tokenizer=word_tokenize, min_df=10)
cnt_vectorizer.fit(train_data)
/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py:507: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn("The parameter 'token_pattern' will not be used"
Out[22]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=10,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=<function word_tokenize at 0x7f0d2604fe60>,
vocabulary=None)
In [23]:
len(cnt_vectorizer.vocabulary_)
Out[23]:
4237
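The drop from 32480 to 4237 words comes from document-frequency filtering. The sketch below shows the idea in plain Python on a toy corpus. It is a simplification, not scikit-learn's implementation: it tokenizes with `str.split` rather than `word_tokenize`, and `build_vocabulary` is a hypothetical helper.

```python
from collections import Counter


def build_vocabulary(docs, min_df=1):
    """Map each kept token to a column index.

    A token is kept when it occurs in at least `min_df` documents.
    """
    doc_freq = Counter()
    for doc in docs:
        # set() counts a token once per document, not once per occurrence.
        doc_freq.update(set(doc.lower().split()))
    kept = sorted(tok for tok, df in doc_freq.items() if df >= min_df)
    return {tok: idx for idx, tok in enumerate(kept)}


docs = [
    "guns and politics",
    "politics and religion",
    "religion and guns and guns",
]

print(len(build_vocabulary(docs)))       # 4
print(build_vocabulary(docs, min_df=3))  # {'and': 0}
```

Note that "guns" occurs three times in total but only in two documents, so `min_df=3` removes it. The threshold counts documents, not occurrences.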
In [24]:
train_matrix = cnt_vectorizer.transform(train_data)
test_matrix = cnt_vectorizer.transform(test_data)
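Because the vectorizer is fitted on the training data only, words that appear solely in the test data have no column and are ignored at transform time. A minimal pure-Python sketch of that behavior follows; `transform` here is a hypothetical stand-in that returns dense lists, whereas the real vectorizer returns a sparse matrix.

```python
from collections import Counter


def transform(docs, vocabulary):
    """Turn each document into a row of token counts over a fixed vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            idx = vocabulary.get(tok)
            if idx is not None:  # tokens outside the fitted vocabulary are dropped
                row[idx] = n
        rows.append(row)
    return rows


# Vocabulary as it might look after fitting on training data (toy example).
vocabulary = {"and": 0, "guns": 1, "politics": 2}

print(transform(["guns and guns", "space and rockets"], vocabulary))
# [[1, 2, 0], [1, 0, 0]]
```

Train and test rows share the same columns this way, which is what lets a model trained on `train_matrix` score `test_matrix`.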
2. XGBoost¶
In [25]:
import xgboost as xgb
xgb_clf = xgb.XGBClassifier()
2.1 Training¶
In [26]:
xgb_clf.fit(train_matrix, train_target)
Out[26]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
nthread=None, objective='multi:softprob', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=1, verbosity=1)
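The labels here are 16 to 19 rather than 0 to 3. The XGBoost version used in this notebook accepts them, but more recent XGBoost releases, as far as I know, expect class labels to start at 0 and be consecutive. A hedged sketch of remapping the labels and mapping predictions back is shown below. It uses plain Python with illustrative names; `sklearn.preprocessing.LabelEncoder` does the same job.

```python
# Toy labels in the same 16..19 range as train_target.
labels = [16, 19, 17, 18, 16]

classes = sorted(set(labels))                            # [16, 17, 18, 19]
to_index = {label: i for i, label in enumerate(classes)}

encoded = [to_index[label] for label in labels]          # what the model would be trained on
decoded = [classes[i] for i in encoded]                  # map predictions back to original ids

print(encoded)  # [0, 3, 1, 2, 0]
print(decoded)  # [16, 19, 17, 18, 16]
```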
2.2 Prediction¶
In [27]:
xgb_train_pred = xgb_clf.predict(train_matrix)
xgb_test_pred = xgb_clf.predict(test_matrix)
2.3 Evaluation¶
In [28]:
from sklearn.metrics import accuracy_score
xgb_train_acc = accuracy_score(train_target, xgb_train_pred)
xgb_test_acc = accuracy_score(test_target, xgb_test_pred)
In [29]:
print(f"XGBoost Train accuracy is {xgb_train_acc:.4f}")
print(f"XGBoost Test accuracy is {xgb_test_acc:.4f}")
XGBoost Train accuracy is 0.9663
XGBoost Test accuracy is 0.9078
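Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch of the calculation follows, written as a hypothetical `accuracy` helper rather than scikit-learn's `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Four of the five toy predictions are correct.
print(accuracy([16, 17, 18, 19, 16], [16, 17, 18, 16, 16]))  # 0.8
```

Since the four classes are not perfectly balanced, accuracy alone can hide weak performance on the smallest class; a per-class breakdown would show that, but is outside this exercise.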
3. LightGBM¶
In [30]:
import lightgbm as lgb
lgb_clf = lgb.LGBMClassifier()
3.1 Training¶
In [31]:
train_matrix
Out[31]:
<1366x4237 sparse matrix of type '<class 'numpy.int64'>'
with 236573 stored elements in Compressed Sparse Row format>
In [32]:
train_matrix.toarray()
Out[32]:
array([[1, 0, 0, ..., 0, 0, 0],
[2, 0, 0, ..., 0, 6, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 2, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
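Calling `toarray()` materializes every zero. The arithmetic below, using only the shape and stored-element count reported above, gives a rough sense of the cost. The CSR estimate assumes int64 values and 32-bit index arrays, which is an assumption about the storage layout rather than something the notebook shows.

```python
n_rows, n_cols = 1366, 4237   # shape of train_matrix
stored = 236573               # non-zero entries reported by scipy

total_cells = n_rows * n_cols
density = stored / total_cells

# Dense int64 array: 8 bytes per cell, zeros included.
dense_bytes = total_cells * 8
# Rough CSR estimate (assumption): values + column indices + row pointers.
csr_bytes = stored * 8 + stored * 4 + (n_rows + 1) * 4

print(total_cells)       # 5787742
print(f"{density:.2%}")  # 4.09%
print(dense_bytes, csr_bytes)
```

Only about 4% of the cells are non-zero, so the dense copy is roughly 46 MB against under 3 MB for the sparse form. As far as I know, LightGBM's scikit-learn interface also accepts SciPy sparse matrices directly, so the conversion may not be required.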
In [33]:
lgb_clf.fit(train_matrix.toarray(), train_target)
Out[33]:
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
3.2 Prediction¶
In [34]:
lgb_train_pred = lgb_clf.predict(train_matrix.toarray())
lgb_test_pred = lgb_clf.predict(test_matrix.toarray())
3.3 Evaluation¶
In [35]:
lgb_train_acc = accuracy_score(train_target, lgb_train_pred)
lgb_test_acc = accuracy_score(test_target, lgb_test_pred)
In [37]:
print(f"LightGBM train accuracy is {lgb_train_acc:.4f}")
print(f"LightGBM test accuracy is {lgb_test_acc:.4f}")
LightGBM train accuracy is 1.0000
LightGBM test accuracy is 0.9130
4. CatBoost¶
In [38]:
import catboost as cb
cb_clf = cb.CatBoostClassifier()
4.1 Training¶
In [39]:
cb_clf.fit(train_matrix, train_target, verbose=False)
Out[39]:
<catboost.core.CatBoostClassifier at 0x7f0d24bd1b50>
4.2 Prediction¶
In [40]:
cb_train_pred = cb_clf.predict(train_matrix)
cb_test_pred = cb_clf.predict(test_matrix)
4.3 Evaluation¶
In [41]:
cb_train_acc = accuracy_score(train_target, cb_train_pred)
cb_test_acc = accuracy_score(test_target, cb_test_pred)
In [42]:
print(f"CatBoost train accuracy is {cb_train_acc:.4f}")
print(f"CatBoost test accuracy is {cb_test_acc:.4f}")
CatBoost train accuracy is 1.0000
CatBoost test accuracy is 0.9386
5. Wrap-up¶
In [43]:
print(f"XGBoost test accuracy is {xgb_test_acc:.4f}")
print(f"LightGBM test accuracy is {lgb_test_acc:.4f}")
print(f"CatBoost test accuracy is {cb_test_acc:.4f}")
XGBoost test accuracy is 0.9078
LightGBM test accuracy is 0.9130
CatBoost test accuracy is 0.9386
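To read the three results side by side, the sketch below ranks the models by test accuracy and reports the train/test gap for each. The numbers are copied by hand from the outputs in this post, and the `results` dictionary is an illustrative structure, not something the notebook builds.

```python
# (train accuracy, test accuracy) as printed in the sections above.
results = {
    "XGBoost": (0.9663, 0.9078),
    "LightGBM": (1.0000, 0.9130),
    "CatBoost": (1.0000, 0.9386),
}

# Sort by test accuracy, best first.
ranked = sorted(results.items(), key=lambda kv: kv[1][1], reverse=True)

for name, (train_acc, test_acc) in ranked:
    gap = train_acc - test_acc  # a larger gap suggests more overfitting
    print(f"{name:<9} test={test_acc:.4f} gap={gap:.4f}")

best = ranked[0][0]
print(best)  # CatBoost
```

All three models were run with default settings on a single 70/30 split, so differences of a point or two may not hold up under cross-validation or tuning.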