In previous posts I have explained how to chain filters and classifiers in WEKA, in order to avoid incorrect results when evaluating text classifiers with cross validation, and how to integrate feature selection into the text classification process. For this purpose, I used the FilteredClassifier and the MultiFilter in the Explorer GUI provided by WEKA. Now it is time to do the same from the command line.
WEKA essentially provides three usage modes:
- Using the Explorer, and other GUIs like the Experimenter, which allow you to set up experiments and examine the results graphically.
- Using the command line functions, which allow you to set up filters, classifiers and clusterers with plenty of configuration options.
- Using the classes programmatically, that is, in your own Java programs.
I will deal with the usage of WEKA in your own programs in a future post; in this post, I focus on the command line. Before trying the following examples, please ensure that weka.jar is added to your CLASSPATH.
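On a Unix-like system, that might look like the line below; the path is just a placeholder to adjust to your installation (on Windows, use set and a semicolon separator instead):

$> export CLASSPATH=$CLASSPATH:/path/to/weka.jar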
The first thing to know is that WEKA filters and classifiers can be called from the command line, and that calling them without arguments prints their configuration options. For instance, when you call a rule learner like PART (which I used in my previous posts), you get the following:

$> java weka.classifiers.rules.PART

Weka exception: No training file and no object input file given.

General options:

-h or -help
    Output help information.
-synopsis or -info
    Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
    Sets training file.
-T <name of test file>
    Sets test file. If missing, a cross-validation will be performed
    on the training data.
...

Options specific to weka.classifiers.rules.PART:

-C <pruning confidence>
    Set confidence threshold for pruning.
    (default 0.25)
...

I omit the full list of options. Options are divided into two groups: those that are accepted by any classifier, and those specific to the PART classifier. The general options include three usage modes (illustrated with example commands after the list):
- Evaluating the classifier on the training collection itself, possibly using cross validation, or on a test collection.
- Training a classifier and storing the model in a file for further use.
- Training a classifier and getting its output (classification of instances) on a test collection.
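As a rough sketch of how these modes map onto the command line, and assuming a hypothetical pair of ARFF files (train.arff and test.arff) with attributes that PART can handle directly, the relevant options are -t for the training file, -d and -l for storing and loading a serialized model, -T for the test file, and -p for printing per-instance predictions:

# mode 1: evaluate by cross-validation on the training data
$> java weka.classifiers.rules.PART -t train.arff
# mode 2: train and store the model in part.model for further use
$> java weka.classifiers.rules.PART -t train.arff -d part.model
# mode 3: train and print the classification of each test instance
$> java weka.classifiers.rules.PART -t train.arff -T test.arff -p 0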
In my previous post on chaining filters and classifiers, I performed an experiment running a PART classifier on an ARFF-formatted subset of the SMS Spam Collection, namely the smsspam.small.arff file. As every instance is of the form [spam|ham],"message text", we have to transform the text of the message into a term weight vector by using the StringToWordVector filter.
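To make the input format concrete, here is a minimal sketch of what such an ARFF file looks like; the relation and attribute names, as well as the example messages, are made up for illustration and need not match the actual file:

@relation sms
@attribute class {spam,ham}
@attribute text string
@data
spam,'You have won a prize, call now'
ham,'See you at lunch tomorrow'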
You can combine the filter and the classifier evaluation into a single command by using the FilteredClassifier class:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART

This produces the following output:
=== Stratified cross-validation ===
Correctly Classified Instances 173 86.5 %
Incorrectly Classified Instances 27 13.5 %
Kappa statistic 0.4181
Mean absolute error 0.1625
Root mean squared error 0.3523
Relative absolute error 58.2872 %
Root relative squared error 94.9031 %
Total Number of Instances 200
=== Confusion Matrix ===
a b <-- classified as
13 20 | a = spam
7 160 | b = ham

This is exactly the output I showed in my previous post. I have used the following general options:
- -t smsspam.small.arff to specify the dataset to train on (and, by default, to evaluate on using cross-validation).
- -c 1 to specify the first attribute as the class.
- -x 3 to specify that the number of folds to be used in the cross-validation evaluation is 3.
- -v and -o to suppress the output of the statistics on the training data and of the classifier model, respectively.
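As a side note, the FilteredClassifier is also convenient when combined with the model-storing usage mode shown above: since the filter is serialized together with the classifier, the stored model will transform future test messages consistently. A minimal sketch, where smsspam.model and smsspam.test.arff are hypothetical file names:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -d smsspam.model -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART
$> java weka.classifiers.meta.FilteredClassifier -l smsspam.model -T smsspam.test.arff -c 1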
In my subsequent post on chaining filters, I proposed making use of attribute selection to improve the representation of the learning problem. This can be done by issuing the following command:
$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART

This produces the following output:
=== Stratified cross-validation ===
Correctly Classified Instances 167 83.5 %
Incorrectly Classified Instances 33 16.5 %
Kappa statistic 0.1959
Mean absolute error 0.1967
Root mean squared error 0.38
Relative absolute error 70.53 %
Root relative squared error 102.3794 %
Total Number of Instances 200
=== Confusion Matrix ===
a b <-- classified as
6 27 | a = spam
6 161 | b = ham

This, in turn, is the same output I got in that post. If we replace PART by the SMO implementation of Support Vector Machines included in WEKA (by changing weka.classifiers.rules.PART to weka.classifiers.functions.SMO), we get an accuracy figure of 91%, as described in that post.
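For reference, the SMO variant is exactly the same command with only the -W argument changed:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO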
While most of the options are the same as in the previous command, two things in the attribute selection command deserve special attention:
- We chain the StringToWordVector and the AttributeSelection filters by using the MultiFilter described in the previous post. The order of the calls is obviously relevant, as we first need to tokenize the messages into words, and then select the most informative words. Moreover, while we apply StringToWordVector with its default options, the AttributeSelection filter makes use of the InfoGainAttributeEval function as the quality metric, and the Ranker class as the search method. The Ranker class is applied with the option -T 0.0 in order to rank the attributes (words or tokens) according to the quality metric, but keep only those whose score is above the threshold defined by -T, that is, 0.0.
- As the order of the options is not relevant, we have to bind each option to the appropriate class by using the quotation mark symbol ("). Unfortunately, this leaves us with three nested expressions:
- The whole MultiFilter specification, enclosed in plain quotation marks (").
- The AttributeSelection filter, enclosed in escaped quotation marks (\").
- The Ranker search method, enclosed in doubly escaped quotation marks (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.
- So many escape symbols make the command a bit dirty, but it is still functional; see the multi-line layout below for a more readable version.
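If the one-liner is hard to read, in Unix-like shells you can spread it over several lines with backslash continuations, keeping the quoted filter specification intact on a single line. This is only a layout change; the command itself is identical:

$> java weka.classifiers.meta.FilteredClassifier \
     -t smsspam.small.arff -c 1 -x 3 -v -o \
     -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" \
     -W weka.classifiers.rules.PART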