Posted by 农村固定观察点 on 2014-12-10 07:42:07

A Simple Text Classifier in Java with WEKA

In previous posts [1, 2, 3], I have shown how to make use of the WEKA classes FilteredClassifier and MultiFilter in order to properly build and evaluate a text classifier with WEKA. For this purpose, I made use of the Explorer GUI provided by WEKA, and of its command-line interface.
In my opinion, it is a good idea to get familiar with both the Explorer and the command-line interface if you want to get a feel for the amazing power of this data mining library. However, it is in your own Java programs that you can take full advantage of its power. Now it is time to deal with that.
Following Salton, and Belkin and Croft, the process of text classification involves two main steps:
  • Representing your text database in order to enable learning, and to train a classifier on it.
  • Using the classifier to predict text labels of new, unseen documents.
The first step is a batch process, in the sense that you can perform it periodically (as your labelled data set improves over time -- bigger size, new labels or categories, predictions corrected via user feedback). The second step is the moment in which you actually take advantage of the knowledge distilled by the learning process, and it is online in the sense that it is done on demand (when new documents arrive). This distinction is conceptual: modern text classifiers may retrain on newly added documents as soon as they get them, in order to maintain or improve accuracy over time.
In consequence, what we need to demonstrate the text classification process is two programs: one to learn from the text dataset, and another to use the learnt model to classify new documents. Let us start by showing a very simple text learner in Java, using WEKA. The class is named MyFilteredLearner.java, and its main() method demonstrates its usage, which involves:
  • Loading the text dataset.
  • Evaluating the classifier.
  • Training the classifier.
  • Storing the classifier.
The most interesting parts of the process are:
  • We read the dataset by simply using the method getData() of an ArffReader object that wraps a BufferedReader.
  • We programmatically create the classifier by combining a StringToWordVector filter (in order to represent the texts as feature vectors) and a NaiveBayes classifier (for learning), using the FilteredClassifier class discussed in previous posts.
The process of creating the classifier is demonstrated in the next code snippet:
trainData.setClassIndex(0);
filter = new StringToWordVector();
filter.setAttributeIndices("last");
classifier = new FilteredClassifier();
classifier.setFilter(filter);
classifier.setClassifier(new NaiveBayes());
So we set the class of the dataset to be the first attribute, then we create the filter and set the attribute to be transformed from text into a feature vector (the last one), and then we create the FilteredClassifier object and attach the previous filter and a new NaiveBayes classifier to it. Given the settings above, the dataset must have the class as its first attribute and the text as its second (and last) one, as in my typical example of the SMS spam subset (smsspam.small.arff).
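For reference, a minimal ARFF file matching this schema (class first, text last) might look like the following; the two data rows are purely illustrative, not taken from the actual smsspam.small.arff file:

```
@relation sms_example

@attribute class {spam,ham}
@attribute text string

@data
ham,'Ok, see you at the station at five'
spam,'WINNER!! Claim your free prize now'
```

Any dataset laid out this way will work with the filter/classifier configuration shown above without further changes.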
You can execute this class with the following commands to get the following output:
$>javac MyFilteredLearner.java
$>java MyFilteredLearner smsspam.small.arff myClassifier.dat
===== Loaded dataset: smsspam.small.arff =====

Correctly Classified Instances 187 93.5 %
Incorrectly Classified Instances 13 6.5 %
Kappa statistic 0.7277
Mean absolute error 0.0721
Root mean squared error 0.2568
Relative absolute error 25.8792 %
Root relative squared error 69.1763 %
Coverage of cases (0.95 level) 94 %
Mean rel. region size (0.95 level) 51.75 %
Total Number of Instances 200

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,636 0,006 0,955 0,636 0,764 0,748 0,943 0,858 spam
0,994 0,364 0,933 0,994 0,962 0,748 0,943 0,986 ham
Weighted Avg. 0,935 0,305 0,936 0,935 0,930 0,748 0,943 0,965
===== Evaluating on filtered (training) dataset done =====
===== Training on filtered (training) dataset done =====
===== Saved model: myClassifier.dat =====
The evaluation has been performed with default values, except for the number of folds, which has been set to 4 as shown in the next code snippet:
Evaluation eval = new Evaluation(trainData);
eval.crossValidateModel(classifier, trainData, 4, new Random(1));
System.out.println(eval.toSummaryString());
In case you do not want to evaluate the classifier on the training data, you can omit the call to the evaluate() method.
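As for storing the classifier, WEKA classifiers implement java.io.Serializable, so saving and reloading a trained model is plain Java object serialization. Here is a stdlib-only sketch of that round trip; a String stands in for the FilteredClassifier object, and the file name mirrors the myClassifier.dat used in this post:

```java
import java.io.*;

public class ModelStore {
    public static void main(String[] args) throws Exception {
        // Stand-in for the trained FilteredClassifier object.
        String model = "trained-classifier-stand-in";
        File f = File.createTempFile("myClassifier", ".dat");

        // Saving: what a learner does after training.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(model);
        }

        // Loading: what a classifier program does before predicting.
        Object restored = null;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            restored = in.readObject();
        }
        System.out.println("Restored model: " + restored);
        f.delete();
    }
}
```

The loaded object just needs a cast back to its original type (e.g. FilteredClassifier) before use.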
Now let us deal with the classification program, which is more complex, but only in the part that creates an instance. The class is named MyFilteredClassifier.java, and its main() method demonstrates its usage, which involves:
  • Reading the text to be classified from a file.
  • Reading the model or classifier from a file.
  • Creating the instance.
  • Classifying it.
Creating the instance is performed in the makeInstance() method, and its code is the following one:
// Create the attributes, class and text
FastVector fvNominalVal = new FastVector(2);
fvNominalVal.addElement("spam");
fvNominalVal.addElement("ham");
Attribute attribute1 = new Attribute("class", fvNominalVal);
Attribute attribute2 = new Attribute("text",(FastVector) null);
// Create list of instances with one element
FastVector fvWekaAttributes = new FastVector(2);
fvWekaAttributes.addElement(attribute1);
fvWekaAttributes.addElement(attribute2);
instances = new Instances("Test relation", fvWekaAttributes, 1);
// Set class index
instances.setClassIndex(0);
// Create and add the instance
DenseInstance instance = new DenseInstance(2);
instance.setValue(attribute2, text);
// instance.setValue((Attribute)fvWekaAttributes.elementAt(1), text);
instances.add(instance);
The classifier learnt with MyFilteredLearner.java expects that an instance has two attributes: the first one is the class, a nominal one with values "spam" or "ham"; the second one is a String, which is the text to be classified. Instead of creating a single instance, we create a whole new dataset whose first instance is the one we want to classify. This is required in order to let the classifier know the schema of the dataset, which is stored in the Instances object (and not in each individual instance).
So first we create the attributes by using the FastVector class provided by WEKA. The case of the nominal attribute ("class") is relatively simple, but the case of the String one is a bit more complex because it requires the second argument of the constructor to be null, cast to FastVector. Then we create an Instances object, using a FastVector to store the two previous attributes, and set the class index to 0 (which means that the first attribute will be the class). As a note, the FastVector class is deprecated in the WEKA development version.
The last step is to create an actual instance. I am using the WEKA development version in this code (as of the date of this post), so we have to use a DenseInstance object. However, if you make use of the stable version, then you can use Instance (see the stable version docs), and must change this code to:
Instance instance = new Instance(2);
As a note, I have commented in the code a different way of setting the value of the second attribute. I must note that we do not set the value of the first attribute, as it is unknown.
The rest of the methods are (more or less) straightforward if you follow the documentation (weka - Programmatic Use, and weka - Use WEKA in your Java code). You get the class prediction on your text with the following lines:
double pred = classifier.classifyInstance(instances.instance(0));
System.out.println("Class predicted: " + instances.classAttribute().value((int) pred));
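The double returned by classifyInstance() is the index of the predicted label within the class attribute's list of values, which is what the value() call above resolves. A stdlib-only sketch of that lookup (the array mirrors the {spam, ham} declaration used in this post; the pred value is a hypothetical return):

```java
public class PredictionLookup {
    public static void main(String[] args) {
        // Class attribute values, in the order declared in the ARFF header.
        String[] classValues = {"spam", "ham"};
        // Hypothetical value returned by classifyInstance(): index 1 -> "ham".
        double pred = 1.0;
        System.out.println("Class predicted: " + classValues[(int) pred]);
    }
}
```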
And if you feed this classifier with a file (smstest.txt) that stores the text "this is spam or not, who knows?", and the model learnt with MyFilteredLearner.java (stored in myClassifier.dat), then you get the following result:
$>javac MyFilteredClassifier.java
$>java MyFilteredClassifier smstest.txt myClassifier.dat
===== Loaded text data: smstest.txt =====
this is spam or not, who knows?
===== Loaded model: myClassifier.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'

@attribute class {spam,ham}
@attribute text string

@data
?,' this is spam or not, who knows?'
===== Classified instance =====
Class predicted: ham
It is interesting to see that the class assigned to the instance before classifying it is "?", which means undefined or unknown.


Reply #1
oliyiyi posted on 2014-12-10 07:53:55
Thanks for sharing.
