Train a Weka Classifier in Java

农村固定观察点, posted on 2014-12-10 11:13:34

This tutorial presents the steps for training one of the classifiers available in Weka using Java. Weka provides implementations of a wide range of machine learning classifiers. A trained classifier can be used to classify data in a particular domain, and that domain is determined by the training set, so the first thing we need is a training set. Once the classifier is trained, it can be stored in a file and loaded for later use (classifier serialization and deserialization).

In this article, we use Weka to classify a tweet as either positive or negative (sentiment analysis). The following points explain the steps involved in performing sentiment analysis using Weka.

1. Download Weka 3.7.x and install it.
2. In the Weka installation folder you will find weka.jar. Add it to the Java classpath.
3. The following sections contain code snippets for training and testing the Naive Bayes classifier.

3.1 Reading Input Dataset from CSV File

The training tweets, annotated with their sentiment values, are stored in CSV (comma-separated values) format. The following code reads the input data from the CSV file:

import java.io.File;
import java.io.IOException;
import weka.core.converters.CSVLoader;

private void getTrainingDataset(final String INPUT_FILENAME)
{
    try {
        // Read the dataset from the CSV file into the inputDataset field
        CSVLoader trainingLoader = new CSVLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    } catch (IOException ex) {
        System.out.println("Exception in getTrainingDataset Method: " + ex.getMessage());
    }
}

3.2 Tweet Preprocessing and Feature extraction

Once the input is read from the file, the data (tweets in this case) is preprocessed and then features are extracted. For tweet preprocessing, a modified CMU POS Tagger is used. featureWords is an ArrayList that holds the feature words. If a feature word is present in the tweet, the index corresponding to that feature word is given a value of 1. All feature words that are not in the tweet have a value of 0. There are 6800 feature words in total. This means that the feature vector is a point in a space in which each axis represents the presence or absence of the feature word corresponding to that axis. Since a tweet contains only a few of the feature words, the feature vector is sparse (far more zero entries than nonzero entries). A SparseInstance is therefore used to represent each feature vector.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import weka.core.Instance;
import weka.core.SparseInstance;

private Instance extractFeature(Instance inputInstance)
{
    // TreeMap keeps the attribute indices in ascending order
    Map<Integer,Double> featureMap = new TreeMap<>();
    // After preprocessing, the tweet is tokenized into words tagged with their parts of speech
    List<Token> tokens = posTagger.runPOSTagger(inputInstance.stringValue(0));
    for (Token token : tokens)
    {
        switch (token.getPOS())
        {
            // All four tags fall through to the same handling
            case "A":
            case "V":
            case "R":
            case "#":
                String word = token.getWord().replaceAll("#", "");
                if (featureWords.contains(word))
                {
                    // A value of 1.0 marks the feature word as present in the input
                    featureMap.put(featureWords.indexOf(word), 1.0);
                }
        }
    }
    // One extra slot at the end holds the class (sentiment) value
    int[] indices = new int[featureMap.size() + 1];
    double[] values = new double[featureMap.size() + 1];
    int i = 0;
    for (Map.Entry<Integer,Double> entry : featureMap.entrySet())
    {
        indices[i] = entry.getKey();
        values[i] = entry.getValue();
        i++;
    }
    // The class attribute is the last attribute, at index featureWords.size()
    indices[i] = featureWords.size();
    values[i] = (double) sentimentClassList.indexOf(inputInstance.stringValue(1));
    // Total attribute count: one per feature word plus the class attribute
    return new SparseInstance(1.0, values, indices, featureWords.size() + 1);
}

Once feature extraction is done, training and testing can be performed.
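
To make the encoding concrete, here is a minimal sketch of the same presence/absence scheme in plain Java, without Weka. It assumes a tiny made-up feature list and an already tokenized tweet. The class name SparseFeatureSketch, the methods encodeIndices and encodeValues, and the sample words are hypothetical and not part of the original project. It produces the two parallel arrays (indices and values) that a sparse representation stores, with the class label in the final slot, as extractFeature does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class SparseFeatureSketch {

    // Hypothetical helper: sorted indices of the feature words found in the
    // tokens, followed by the index reserved for the class attribute.
    static int[] encodeIndices(List<String> featureWords, List<String> tokens) {
        // Build a word -> position lookup once, instead of scanning the list per token
        Map<String, Integer> position = new HashMap<>();
        for (int i = 0; i < featureWords.size(); i++) {
            position.put(featureWords.get(i), i);
        }
        // TreeSet keeps the indices sorted and ignores repeated words
        TreeSet<Integer> present = new TreeSet<>();
        for (String token : tokens) {
            Integer index = position.get(token);
            if (index != null) {
                present.add(index);
            }
        }
        int[] indices = new int[present.size() + 1];
        int i = 0;
        for (int index : present) {
            indices[i++] = index;
        }
        // The class attribute sits after all feature words
        indices[i] = featureWords.size();
        return indices;
    }

    // Hypothetical helper: 1.0 for every present feature word, class label last.
    static double[] encodeValues(int[] indices, int classLabel) {
        double[] values = new double[indices.length];
        Arrays.fill(values, 1.0);
        values[values.length - 1] = classLabel;
        return values;
    }

    public static void main(String[] args) {
        List<String> featureWords = Arrays.asList("good", "bad", "happy", "sad");
        List<String> tokens = Arrays.asList("so", "happy", "good", "happy");
        int[] indices = encodeIndices(featureWords, tokens);
        double[] values = encodeValues(indices, 1); // 1 = assumed index of the "positive" class
        System.out.println(Arrays.toString(indices)); // prints [0, 2, 4]
        System.out.println(Arrays.toString(values));  // prints [1.0, 1.0, 1.0]
    }
}
```

Only three numbers are stored for this tweet, however long the feature list is. The lookup map is a design choice of this sketch: it avoids a linear scan of the feature list for every token, which matters more as the list grows to thousands of words.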

3.3 Training the Classifier

This tutorial uses the NaiveBayes classifier. The following code snippet shows the steps involved in training it.

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instance;
import weka.core.Instances;

public void trainClassifier(final String INPUT_FILENAME)
{
    getTrainingDataset(INPUT_FILENAME);
    // trainingInstances holds the feature vector of every input
    Instances trainingInstances = createInstances("TRAINING_INSTANCES");
    for (Instance currentInstance : inputDataset)
    {
        // extractFeature returns the feature vector for the current input
        Instance currentFeatureVector = extractFeature(currentInstance);
        // Associate the feature vector with trainingInstances before adding it
        currentFeatureVector.setDataset(trainingInstances);
        trainingInstances.add(currentFeatureVector);
    }
    // You can create whichever classifier you want,
    // for instance classifier = new SMO();
    // In this tutorial we use the NaiveBayes classifier.
    classifier = new NaiveBayes();
    try {
        // Train the classifier
        classifier.buildClassifier(trainingInstances);
        // Store the trained classifier in a file for future use
        weka.core.SerializationHelper.write("NaiveBayes.model", classifier);
    } catch (Exception ex) {
        System.out.println("Exception in training the classifier: " + ex.getMessage());
    }
}

3.4 Testing the Classifier

The following code snippet shows the steps involved in testing the classifier. During testing, the classifier uses what it learned in training to predict the class (sentiment) of each tweet.

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

public void testClassifier(final String INPUT_FILENAME)
{
    // Despite its name, getTrainingDataset simply loads the given CSV file (here, the test set)
    getTrainingDataset(INPUT_FILENAME);
    // testingInstances holds the feature vector of every input
    Instances testingInstances = createInstances("TESTING_INSTANCES");
    for (Instance currentInstance : inputDataset)
    {
        // extractFeature returns the feature vector for the current input
        Instance currentFeatureVector = extractFeature(currentInstance);
        // Associate the feature vector with testingInstances before adding it
        currentFeatureVector.setDataset(testingInstances);
        testingInstances.add(currentFeatureVector);
    }
    try {
        // Classifier deserialization
        classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");
        // Classify each test instance and print the predicted sentiment label
        for (Instance testInstance : testingInstances)
        {
            // classifyInstance returns the index of the predicted class value
            double score = classifier.classifyInstance(testInstance);
            System.out.println(testingInstances.attribute("Sentiment").value((int) score));
        }
    } catch (Exception ex) {
        System.out.println("Exception in testing the classifier: " + ex.getMessage());
    }
}

private Instances createInstances(final String INSTANCES_NAME)
{
    // Create an Instances object with an initial capacity of zero
    Instances instances = new Instances(INSTANCES_NAME, attributeList, 0);
    // Set the last attribute (positive or negative) as the class attribute
    instances.setClassIndex(instances.numAttributes() - 1);
    return instances;
}

Download the project files here.

The downloaded archive is a NetBeans project and can be opened with the NetBeans IDE. If you are not using NetBeans, look in the src directory for the Java code. FeatureWordsList.dat is the list of feature words (an ArrayList serialized to a file). training.csv and testing.csv contain the datasets for training and testing.
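
As a rough illustration of how a file like FeatureWordsList.dat could be produced and read back, the sketch below writes a small ArrayList of words to a temporary file and loads it again. It assumes the file was written with standard Java object serialization and holds an ArrayList of String, which the project description suggests but does not spell out. The class FeatureWordsIo and its methods are hypothetical names, not taken from the project.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FeatureWordsIo {

    // Serialize the word list to the given file
    static void save(List<String> words, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(words));
        }
    }

    // Deserialize the word list; assumes the file contains an ArrayList<String>
    @SuppressWarnings("unchecked")
    static List<String> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (ArrayList<String>) in.readObject();
        }
    }

    // Write the words to a temporary file and read them back
    static List<String> roundTrip(List<String> words) {
        try {
            File file = File.createTempFile("FeatureWordsList", ".dat");
            file.deleteOnExit();
            save(words, file);
            return load(file);
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException("Round trip failed", ex);
        }
    }

    public static void main(String[] args) {
        List<String> words = roundTrip(Arrays.asList("good", "bad", "happy"));
        System.out.println(words.size() + " feature words, first: " + words.get(0));
        // prints "3 feature words, first: good"
    }
}
```

To read the real file, load would be called with the path to FeatureWordsList.dat from the project. If the file was written in some other way, this cast will fail at runtime.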