In previous posts I have explained how to chain filters and classifiers in WEKA, in order to avoid incorrect results when evaluating text classifiers with cross validation, and how to integrate feature selection into the text classification process. For this purpose, I used the FilteredClassifier and the MultiFilter in the Explorer GUI provided by WEKA. Now it is time to do the same from the command line.
WEKA essentially provides three usage modes:
- Using the Explorer, and other GUIs like the Experimenter, which allow you to set up experiments and examine the results graphically.
- Using the command line functions, which allow you to set up filters, classifiers and clusterers with plenty of configuration options.
- Using the classes programmatically, that is, in your own Java programs.
I will deal with the usage of WEKA in your own programs in a future post; in this post, I focus on the command line. Before trying the following examples, please ensure that weka.jar is added to your CLASSPATH.
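On a Unix-like system, that might look like the line below; the path is just a placeholder to adjust to your installation (on Windows, use set and a semicolon separator instead):

$> export CLASSPATH=$CLASSPATH:/path/to/weka.jar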
The first thing to know is that WEKA filters and classifiers can be called from the command line, and that calling them without arguments prints their configuration options. For instance, when you call a rule learner like PART (which I used in my previous posts), you get the following:

$> java weka.classifiers.rules.PART

Weka exception: No training file and no object input file given.

General options:

-h or -help
    Output help information.
-synopsis or -info
    Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
    Sets training file.
-T <name of test file>
    Sets test file. If missing, a cross-validation will be performed
    on the training data.
...

Options specific to weka.classifiers.rules.PART:

-C <pruning confidence>
    Set confidence threshold for pruning.
    (default 0.25)
...

I omit the full list of options. Options are divided into two groups: those that are accepted by any classifier, and those specific to the PART classifier. The general options include three usage modes (illustrated with example commands after the list):
- Evaluating the classifier on the training collection itself, possibly using cross validation, or on a test collection.
- Training a classifier and storing the model in a file for further use.
- Training a classifier and getting its output (classification of instances) on a test collection.
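As a rough sketch of how these modes map onto the command line, and assuming a hypothetical pair of ARFF files (train.arff and test.arff) with attributes that PART can handle directly, the relevant options are -t for the training file, -d and -l for storing and loading a serialized model, -T for the test file, and -p for printing per-instance predictions:

# mode 1: evaluate by cross-validation on the training data
$> java weka.classifiers.rules.PART -t train.arff
# mode 2: train and store the model in part.model for further use
$> java weka.classifiers.rules.PART -t train.arff -d part.model
# mode 3: train and print the classification of each test instance
$> java weka.classifiers.rules.PART -t train.arff -T test.arff -p 0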
In my previous post on chaining filters and classifiers, I performed an experiment running a PART classifier on an ARFF-formatted subset of the SMS Spam Collection, namely the smsspam.small.arff file. As every instance is of the form [spam|ham],"message text", we have to transform the text of the message into a term weight vector by using the StringToWordVector filter.
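To make the input format concrete, here is a minimal sketch of what such an ARFF file looks like; the relation and attribute names, as well as the example messages, are made up for illustration and need not match the actual file:

@relation sms
@attribute class {spam,ham}
@attribute text string
@data
spam,'You have won a prize, call now'
ham,'See you at lunch tomorrow'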
You can combine the filter and the classifier evaluation into a single command by using the FilteredClassifier class:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART

This produces the following output:
=== Stratified cross-validation ===
Correctly Classified Instances 173 86.5 %
Incorrectly Classified Instances 27 13.5 %
Kappa statistic 0.4181
Mean absolute error 0.1625
Root mean squared error 0.3523
Relative absolute error 58.2872 %
Root relative squared error 94.9031 %
Total Number of Instances 200
=== Confusion Matrix ===
a b <-- classified as
13 20 | a = spam
7 160 | b = ham

This is exactly the output I showed in my previous post. I have used the following general options:
- -t smsspam.small.arff to specify the dataset to train on (and, by default, to evaluate on using cross-validation).
- -c 1 to specify the first attribute as the class.
- -x 3 to specify that the number of folds to be used in the cross-validation evaluation is 3.
- -v and -o to suppress the output of the statistics on the training data and of the classifier model, respectively.
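As a side note, the FilteredClassifier is also convenient when combined with the model-storing usage mode shown above: since the filter is serialized together with the classifier, the stored model will transform future test messages consistently. A minimal sketch, where smsspam.model and smsspam.test.arff are hypothetical file names:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -d smsspam.model -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART
$> java weka.classifiers.meta.FilteredClassifier -l smsspam.model -T smsspam.test.arff -c 1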
In my subsequent post on chaining filters, I proposed making use of attribute selection to improve the representation of the learning problem. This can be done by issuing the following command:
$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART

This produces the following output:
=== Stratified cross-validation ===
Correctly Classified Instances 167 83.5 %
Incorrectly Classified Instances 33 16.5 %
Kappa statistic 0.1959
Mean absolute error 0.1967
Root mean squared error 0.38
Relative absolute error 70.53 %
Root relative squared error 102.3794 %
Total Number of Instances 200
=== Confusion Matrix ===
a b <-- classified as
6 27 | a = spam
6 161 | b = ham

This, in turn, is the same output I got in that post. If we replace PART by the SMO implementation of Support Vector Machines included in WEKA (by changing weka.classifiers.rules.PART to weka.classifiers.functions.SMO), we get an accuracy figure of 91%, as described in that post.
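For reference, the SMO variant is exactly the same command with only the -W argument changed:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO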
While most of the options are the same as in the previous command, two things in the attribute selection command deserve special attention:
- We chain the StringToWordVector and the AttributeSelection filters by using the MultiFilter described in the previous post. The order of the calls is obviously relevant, as we first need to tokenize the messages into words, and then select the most informative words. Moreover, while we apply StringToWordVector with its default options, the AttributeSelection filter makes use of the InfoGainAttributeEval function as the quality metric, and the Ranker class as the search method. The Ranker class is applied with the option -T 0.0 in order to rank the attributes (words or tokens) according to the quality metric, but keep only those whose score is above the threshold defined by -T, that is, 0.0.
- As the order of the options is not relevant, we have to bind each option to the appropriate class by using the quotation mark symbol ("). Unfortunately, this leaves us with three nested expressions:
- The whole MultiFilter specification, enclosed in plain quotation marks (").
- The AttributeSelection filter, enclosed in escaped quotation marks (\").
- The Ranker search method, enclosed in doubly escaped quotation marks (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.
- So many escape symbols make the command a bit dirty, but it is still functional; see the multi-line layout below for a more readable version.
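If the one-liner is hard to read, in Unix-like shells you can spread it over several lines with backslash continuations, keeping the quoted filter specification intact on a single line. This is only a layout change; the command itself is identical:

$> java weka.classifiers.meta.FilteredClassifier \
     -t smsspam.small.arff -c 1 -x 3 -v -o \
     -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" \
     -W weka.classifiers.rules.PART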