Command Line Functions for Text Mining in WEKA

WEKA essentially provides three usage modes:

  • Using the Explorer, and other GUIs like the Experimenter, which allow you to set up experiments and examine the results graphically.
  • Using the command line functions, which allow you to set up filters, classifiers and clusterers with plenty of configuration options.
  • Using the classes programmatically, that is, in your own Java programs.


One major difference between modes 1 and 2 is that in the first mode you spend some of the memory on the GUI, while in the second one you do not. That can be a significant difference when you load big datasets. In both cases you can control the memory assigned to WEKA using Java command line options like -Xms, -Xmx and so on, but it may be worthwhile to save the memory used by the graphic elements in order to be able to deal with bigger datasets.
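For instance, a minimal sketch of setting the heap explicitly (the heap sizes and dataset name are illustrative, not taken from the post):

> java -Xms512m -Xmx2048m weka.classifiers.rules.PART -t train.arff

Here -Xms sets the initial heap size and -Xmx the maximum heap size of the Java Virtual Machine that runs WEKA.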
Before trying the following examples, please ensure weka.jar is added to your CLASSPATH. The first thing we must know is that WEKA filters and classifiers can be called in the command line, and that a call without arguments will show their configuration options. For instance, when you call a rule learner like PART, you get the following options:

> java weka.classifiers.rules.PART
Weka exception: No training file and no object input file given.
General options:

-h or -help
  Output help information.
-synopsis or -info
  Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
  Sets training file.
-T <name of test file>
  Sets test file. If missing, a cross-validation will be performed
  on the training data.
...

Options specific to weka.classifiers.rules.PART:

-C <pruning confidence>
  Set confidence threshold for pruning.
  (default 0.25)
...

I omit the full list of options. Options are divided into two groups: those accepted by any classifier, and those specific to the PART classifier. The general options cover three usage modes (see the example commands after this list):
  • Evaluating the classifier on the training collection itself, possibly using cross-validation, or on a test collection.
  • Training a classifier and storing the model in a file for later use.
  • Training a classifier and getting its output (the classification of instances) on a test collection.
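For instance, the last two modes map onto the general options -d (store the trained model) and -l (load a previously stored model). A minimal sketch; the file names are illustrative, and the dataset is assumed to contain attributes PART can handle directly:

> java weka.classifiers.rules.PART -t train.arff -d part.model
> java weka.classifiers.rules.PART -l part.model -T test.arff -p 0

The first command trains on train.arff and serializes the resulting model to part.model; the second loads that model and prints its prediction for every instance in test.arff (-p 0 means no additional attributes are listed alongside the predictions).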

However, when calling a filter in the command line, the input file (the dataset) is read from the standard input, so you have to redirect the input from your file using the appropriate operator (<), or use the option -h to get the options of the filter.
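For instance, a sketch of applying a filter this way, following the standard-input behaviour just described (the output file name is illustrative):

> java weka.filters.unsupervised.attribute.StringToWordVector < smsspam.small.arff > smsspam.vector.arff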

As every instance is of the form [spam|ham],"message text", we have to transform the text of the message into a term weight vector by using the StringToWordVector filter. You can combine the filter and the classifier evaluation into one command by using the FilteredClassifier class as in the following command:

> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART

To get the following output:

=== Stratified cross-validation ===

Correctly Classified Instances         173               86.5    %
Incorrectly Classified Instances        27               13.5    %
Kappa statistic                          0.4181
Mean absolute error                      0.1625
Root mean squared error                  0.3523
Relative absolute error                 58.2872 %
Root relative squared error             94.9031 %
Total Number of Instances              200

=== Confusion Matrix ===

  a   b   <-- classified as
 13  20 |  a = spam
  7 160 |  b = ham
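As a quick sanity check on the Kappa statistic (my own arithmetic, not in the original post): the observed agreement is po = (13 + 160)/200 = 0.865, and the chance agreement, computed from the row and column totals of the confusion matrix, is pe = (33/200)(20/200) + (167/200)(180/200) = 0.0165 + 0.7515 = 0.768, so Kappa = (0.865 - 0.768)/(1 - 0.768) ≈ 0.4181, matching the reported value.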

This is exactly the output I showed in my previous post. I have used the following general options:
  • -t smsspam.small.arff to specify the dataset to train on (and, by default, to evaluate on using cross-validation).
  • -c 1 to specify the first attribute as the class.
  • -x 3 to set the number of folds used in the cross-validation evaluation to 3.
  • -v and -o to avoid outputting the classifier and the statistics on the training collection, respectively.


Plus the options specific to the FilteredClassifier: -F to define the filter, and -W to define the classifier. In my subsequent post on chaining filters, I proposed making use of attribute selection to improve the representation of our learning problem. This can be done by issuing the following command:

> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART


To get the following output:

=== Stratified cross-validation ===

Correctly Classified Instances         167               83.5    %
Incorrectly Classified Instances        33               16.5    %
Kappa statistic                          0.1959
Mean absolute error                      0.1967
Root mean squared error                  0.38
Relative absolute error                 70.53   %
Root relative squared error            102.3794 %
Total Number of Instances              200

=== Confusion Matrix ===

  a   b   <-- classified as
  6  27 |  a = spam
  6 161 |  b = ham

This in turn is the same result I got in that post. If we replace PART by the SMO implementation of Support Vector Machines included in WEKA (by changing weka.classifiers.rules.PART to weka.classifiers.functions.SMO), we get the accuracy figure of 91%, as described in the post.
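Spelled out, that is the same command with only the -W argument changed:

> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO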

While most of the options are the same as in the previous command, two things deserve special attention here.

First, we chain the StringToWordVector and the AttributeSelection filters by using the MultiFilter described in the previous post. The order of the calls is obviously relevant, as we first need to tokenize the messages into words, and then select the most informative words. Moreover, while we apply StringToWordVector with its default options, the AttributeSelection filter makes use of the InfoGainAttributeEval function as the quality metric and the Ranker class as the search method. The Ranker class is applied with the option -T 0.0 in order to specify that the filter has to rank the attributes (words or tokens) according to the quality metric, but keep only those whose score is above the threshold defined by -T, that is, 0.0.

Second, as the order of the options is not relevant, it is required to link the options to the appropriate class by using quotation marks ("). Unfortunately, we have three nested expressions:
  • The whole MultiFilter filter, enclosed by plain quotation marks (").
  • The AttributeSelection filter, enclosed by escaped quotation marks (\").
  • The Ranker search method, enclosed by double-escaped quotation marks (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.
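For readers who find the nested quoting hard to follow, the same configuration can also be built in Java, where each nesting level simply becomes an object. This is a minimal sketch of my own (assuming weka.jar on the classpath), not the code from the post, which covers the Java API in later installments:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.PART;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ChainedFiltersSketch {
    public static void main(String[] args) throws Exception {
        // Rank attributes by Information Gain, keeping those scoring above 0.0
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.0);
        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval());
        selection.setSearch(ranker);

        // Chain the filters; order matters: tokenize first, then select
        MultiFilter multi = new MultiFilter();
        multi.setFilters(new Filter[] { new StringToWordVector(), selection });

        // Same setup as the command line: -F <MultiFilter ...> -W PART
        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(multi);
        classifier.setClassifier(new PART());
        // classifier is now ready to be trained and evaluated on the dataset
    }
}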
So I have shown how we can chain filters and classifiers, and apply several chained filters as well, in the command line. In upcoming posts I will explain how to train, store and then evaluate a classifier using the command line, and how to make use of WEKA filters and classifiers in your own Java programs.

Reply #1
oliyiyi, posted 2014-12-10 07:54:14
Thanks for sharing.
