Command Line Functions for Text Mining in WEKA

WEKA essentially provides three usage modes:

  • Using the Explorer, and other GUIs like the Experimenter, which allow you to set up experiments and examine the results graphically.
  • Using the command line functions, which allow you to set up filters, classifiers and clusterers with plenty of configuration options.
  • Using the classes programmatically, that is, in your own Java programs.


One major difference between modes 1 and 2 is that in the first mode you spend some of the memory on the GUI, while in the second one you do not. That can be a significant difference when you load big datasets. In both cases you can control the memory assigned to WEKA using Java command line options like -Xms, -Xmx and so on, but it may be worthwhile to save the memory used by the graphic elements in order to be able to deal with bigger datasets.
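For instance, a minimal sketch of setting the heap explicitly (the heap sizes and dataset name are illustrative, not taken from the post):

> java -Xms512m -Xmx2048m weka.classifiers.rules.PART -t train.arff

Here -Xms sets the initial heap size and -Xmx the maximum heap size of the Java Virtual Machine that runs WEKA.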
Before trying the following examples, please ensure weka.jar is added to your CLASSPATH. The first thing we must know is that WEKA filters and classifiers can be called in the command line, and that a call without arguments will show their configuration options. For instance, when you call a rule learner like PART, you get the following options:

> java weka.classifiers.rules.PART
Weka exception: No training file and no object input file given.
General options:

-h or -help
  Output help information.
-synopsis or -info
  Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
  Sets training file.
-T <name of test file>
  Sets test file. If missing, a cross-validation will be performed
  on the training data.
...

Options specific to weka.classifiers.rules.PART:

-C <pruning confidence>
  Set confidence threshold for pruning.
  (default 0.25)
...

I omit the full list of options. Options are divided into two groups: those accepted by any classifier, and those specific to the PART classifier. The general options cover three usage modes (see the example commands after this list):
  • Evaluating the classifier on the training collection itself, possibly using cross-validation, or on a test collection.
  • Training a classifier and storing the model in a file for later use.
  • Training a classifier and getting its output (the classification of instances) on a test collection.
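For instance, the last two modes map onto the general options -d (store the trained model) and -l (load a previously stored model). A minimal sketch; the file names are illustrative, and the dataset is assumed to contain attributes PART can handle directly:

> java weka.classifiers.rules.PART -t train.arff -d part.model
> java weka.classifiers.rules.PART -l part.model -T test.arff -p 0

The first command trains on train.arff and serializes the resulting model to part.model; the second loads that model and prints its prediction for every instance in test.arff (-p 0 means no additional attributes are listed alongside the predictions).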

However, when calling a filter in the command line, the input file (the dataset) is read from the standard input, so you have to redirect the input from your file using the appropriate operator (<), or use the option -h to get the options of the filter.
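For instance, a sketch of applying a filter this way, following the standard-input behaviour just described (the output file name is illustrative):

> java weka.filters.unsupervised.attribute.StringToWordVector < smsspam.small.arff > smsspam.vector.arff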

As every instance is of the form [spam|ham],"message text", we have to transform the text of the message into a term weight vector by using the StringToWordVector filter. You can combine the filter and the classifier evaluation into one command by using the FilteredClassifier class as in the following command:

> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART

To get the following output:

=== Stratified cross-validation ===

Correctly Classified Instances         173               86.5    %
Incorrectly Classified Instances        27               13.5    %
Kappa statistic                          0.4181
Mean absolute error                      0.1625
Root mean squared error                  0.3523
Relative absolute error                 58.2872 %
Root relative squared error             94.9031 %
Total Number of Instances              200

=== Confusion Matrix ===

  a   b   <-- classified as
 13  20 |  a = spam
  7 160 |  b = ham
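As a quick sanity check on the Kappa statistic (my own arithmetic, not in the original post): the observed agreement is po = (13 + 160)/200 = 0.865, and the chance agreement, computed from the row and column totals of the confusion matrix, is pe = (33/200)(20/200) + (167/200)(180/200) = 0.0165 + 0.7515 = 0.768, so Kappa = (0.865 - 0.768)/(1 - 0.768) ≈ 0.4181, matching the reported value.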

This is exactly the output I showed in my previous post. I have used the following general options:
  • -t smsspam.small.arff to specify the dataset to train on (and, by default, to evaluate on using cross-validation).
  • -c 1 to specify the first attribute as the class.
  • -x 3 to set the number of folds used in the cross-validation evaluation to 3.
  • -v and -o to avoid outputting the classifier and the statistics on the training collection, respectively.


Plus the options specific to the FilteredClassifier: -F to define the filter, and -W to define the classifier. In my subsequent post on chaining filters, I proposed making use of attribute selection to improve the representation of our learning problem. This can be done by issuing the following command:

> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART


To get the following output:

=== Stratified cross-validation ===

Correctly Classified Instances         167               83.5    %
Incorrectly Classified Instances        33               16.5    %
Kappa statistic                          0.1959
Mean absolute error                      0.1967
Root mean squared error                  0.38
Relative absolute error                 70.53   %
Root relative squared error            102.3794 %
Total Number of Instances              200

=== Confusion Matrix ===

  a   b   <-- classified as
  6  27 |  a = spam
  6 161 |  b = ham

This in turn is the same result I got in that post. If we replace PART by the SMO implementation of Support Vector Machines included in WEKA (by changing weka.classifiers.rules.PART to weka.classifiers.functions.SMO), we get the accuracy figure of 91%, as described in the post.
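Spelled out, that is the same command with only the -W argument changed:

> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO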

While most of the options are the same as in the previous command, two things deserve special attention here.

First, we chain the StringToWordVector and the AttributeSelection filters by using the MultiFilter described in the previous post. The order of the calls is obviously relevant, as we first need to tokenize the messages into words, and then select the most informative words. Moreover, while we apply StringToWordVector with its default options, the AttributeSelection filter makes use of the InfoGainAttributeEval function as the quality metric and the Ranker class as the search method. The Ranker class is applied with the option -T 0.0 in order to specify that the filter has to rank the attributes (words or tokens) according to the quality metric, but keep only those whose score is above the threshold defined by -T, that is, 0.0.

Second, as the order of the options is not relevant, it is required to link the options to the appropriate class by using quotation marks ("). Unfortunately, we have three nested expressions:
  • The whole MultiFilter filter, enclosed by plain quotation marks (").
  • The AttributeSelection filter, enclosed by escaped quotation marks (\").
  • The Ranker search method, enclosed by double-escaped quotation marks (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.
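For readers who find the nested quoting hard to follow, the same configuration can also be built in Java, where each nesting level simply becomes an object. This is a minimal sketch of my own (assuming weka.jar on the classpath), not the code from the post, which covers the Java API in later installments:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.PART;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ChainedFiltersSketch {
    public static void main(String[] args) throws Exception {
        // Rank attributes by Information Gain, keeping those scoring above 0.0
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.0);
        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval());
        selection.setSearch(ranker);

        // Chain the filters; order matters: tokenize first, then select
        MultiFilter multi = new MultiFilter();
        multi.setFilters(new Filter[] { new StringToWordVector(), selection });

        // Same setup as the command line: -F <MultiFilter ...> -W PART
        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(multi);
        classifier.setClassifier(new PART());
        // classifier is now ready to be trained and evaluated on the dataset
    }
}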
So I have shown how we can chain filters and classifiers, and apply several chained filters as well, in the command line. In upcoming posts I will explain how to train, store and then evaluate a classifier using the command line, and how to make use of WEKA filters and classifiers in your own Java programs.

Reply #1
oliyiyi, posted 2014-12-10 07:54:14
Thanks for sharing.
