[Leisure & Other] [Exclusive Release] Train a Weka Classifier in Java

Original poster: 农村固定观察点, posted 2014-12-10 11:13:34

This tutorial walks through training one of the classifiers available in Weka from Java. Weka provides implementations of a wide range of machine learning classifiers. A trained classifier can classify data in a particular domain, and that domain is determined by the training set, so the first requirement is a labeled training set. Once trained, the classifier can be stored in a file and loaded later (classifier serialization and deserialization).


In this article, we use Weka to classify a tweet as either positive or negative (sentiment analysis). The steps involved are as follows.

  1. Download Weka 3.7.x and install it.
  2. In the Weka installation folder you will find weka.jar. Add it to the Java class path.
  3. The following sections contain code snippets for training and testing the Naive Bayes classifier.
3.1 Reading Input Dataset from CSV File

The training tweets, annotated with their sentiment values, are stored in a CSV (comma-separated values) file. The following code reads the input data from the CSV file:

    // Requires: java.io.File, java.io.IOException, weka.core.converters.CSVLoader
    // inputDataset is a class field that receives the loaded dataset.
    private void getTrainingDataset(final String INPUT_FILENAME)
    {
        try {
            // read the dataset from the CSV file
            CSVLoader trainingLoader = new CSVLoader();
            trainingLoader.setSource(new File(INPUT_FILENAME));
            inputDataset = trainingLoader.getDataSet();
        } catch (IOException ex) {
            // report the cause instead of hiding it
            System.out.println("Exception in getTrainingDataset Method: " + ex.getMessage());
        }
    }
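
To make the expected input layout concrete, here is a minimal stand-alone sketch of the two-column file the later snippets appear to assume: tweet text at index 0 and sentiment at index 1, with a header row. The class name `LabeledTweetCsv`, the header names, and the "split at the last comma" rule are assumptions for illustration only. In the tutorial itself the file is read by Weka's CSVLoader, which does its own parsing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka or the tutorial's project.
// Assumed row layout: <tweet text>,<sentiment>  with a header row first.
final class LabeledTweetCsv {

    // Returns one {text, sentiment} pair per data row.
    static List<String[]> parse(String csv) {
        List<String[]> rows = new ArrayList<>();
        String[] lines = csv.split("\\R");
        for (int i = 1; i < lines.length; i++) { // line 0 is the header row
            String line = lines[i];
            // split at the LAST comma, because the tweet text may contain commas
            int lastComma = line.lastIndexOf(',');
            if (lastComma < 0) {
                continue; // no label column: skip the malformed row
            }
            String text = line.substring(0, lastComma).trim();
            String sentiment = line.substring(lastComma + 1).trim();
            rows.add(new String[] {text, sentiment});
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "Tweet,Sentiment\n"
                + "great phone, love it,positive\n"
                + "battery died again,negative\n";
        for (String[] row : parse(csv)) {
            System.out.println(row[1] + " <- " + row[0]);
        }
    }
}
```

This sketch ignores quoting and escaping, so it is only suitable for inspecting simple files by hand.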



3.2 Tweet Preprocessing and Feature Extraction

Once the input has been read from the file, the data (tweets in this case) must be preprocessed before feature extraction is performed. A modified CMU POS Tagger is used for tweet preprocessing. featureWords is an ArrayList containing the feature words, 6800 in total. If a feature word is present in the tweet, the index corresponding to that word is given the value 1; every feature word that is not in the tweet has the value 0. The feature vector is therefore a point in a space where each axis represents the presence or absence of one feature word. Such a vector is sparse (far more zero entries than non-zero ones), so a SparseInstance is used to represent it.

    // Requires: java.util.List, java.util.Map, java.util.TreeMap,
    // weka.core.Instance, weka.core.SparseInstance
    private Instance extractFeature(Instance inputInstance)
    {
        // TreeMap keeps the attribute indices sorted in ascending order
        Map<Integer,Double> featureMap = new TreeMap<>();

        // after preprocessing, the tweet is tokenized into words tagged with their parts of speech
        List<Token> tokens = posTagger.runPOSTagger(inputInstance.stringValue(0));

        for (Token token : tokens)
        {
            switch (token.getPOS())
            {
                // only tokens tagged "A", "V", "R" or "#" are feature candidates
                case "A":
                case "V":
                case "R":
                case "#":
                    String word = token.getWord().replaceAll("#", "");
                    int wordIndex = featureWords.indexOf(word);
                    if (wordIndex >= 0)
                    {
                        // 1.0 marks the feature word as present in the input
                        featureMap.put(wordIndex, 1.0);
                    }
                    break;
            }
        }

        // one extra slot holds the class (sentiment) value
        int[] indices = new int[featureMap.size() + 1];
        double[] values = new double[featureMap.size() + 1];
        int i = 0;
        for (Map.Entry<Integer,Double> entry : featureMap.entrySet())
        {
            indices[i] = entry.getKey();
            values[i] = entry.getValue();
            i++;
        }
        // the class attribute is the last attribute, at index featureWords.size()
        indices[i] = featureWords.size();
        values[i] = (double) sentimentClassList.indexOf(inputInstance.stringValue(1));
        // total attribute count: featureWords.size() feature attributes plus one class attribute
        return new SparseInstance(1.0, values, indices, featureWords.size() + 1);
    }
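
The presence/absence encoding can also be sketched without Weka, which may help when checking the indices by hand. The class `SparseBinaryFeatures` and its method names are hypothetical. The sketch also swaps the list lookup (`contains` / `indexOf`) for a word-to-index map built once, on the assumption that a linear scan over 6800 words per token is worth avoiding.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; the tutorial stores the result in a Weka SparseInstance.
final class SparseBinaryFeatures {

    private final Map<String, Integer> indexOfWord = new HashMap<>();

    SparseBinaryFeatures(List<String> featureWords) {
        for (int i = 0; i < featureWords.size(); i++) {
            // keep the first position of a repeated word, as List.indexOf would
            indexOfWord.putIfAbsent(featureWords.get(i), i);
        }
    }

    // Returns the sorted indices of the feature words present in the tokens.
    // Every index that is not listed is implicitly 0 (word absent).
    int[] presentIndices(List<String> tokens) {
        TreeSet<Integer> present = new TreeSet<>();
        for (String token : tokens) {
            Integer index = indexOfWord.get(token);
            if (index != null) {
                present.add(index);
            }
        }
        int[] indices = new int[present.size()];
        int i = 0;
        for (int index : present) {
            indices[i++] = index;
        }
        return indices;
    }

    public static void main(String[] args) {
        SparseBinaryFeatures features =
                new SparseBinaryFeatures(Arrays.asList("good", "bad", "love", "hate"));
        int[] indices = features.presentIndices(
                Arrays.asList("i", "love", "this", "good", "phone"));
        System.out.println(Arrays.toString(indices)); // prints [0, 2]
    }
}
```

Only the positions of the non-zero entries are stored, which is the reason a sparse representation suits a 6800-word vocabulary where a tweet matches only a handful of words.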

Once feature extraction is done, training and testing can be performed.


3.3 Training the Classifier

In this tutorial, the NaiveBayes classifier is used. The following code snippet shows the steps involved in training it.

    // Requires: weka.classifiers.bayes.NaiveBayes, weka.core.Instance, weka.core.Instances
    public void trainClassifier(final String INPUT_FILENAME)
    {
        getTrainingDataset(INPUT_FILENAME);

        // trainingInstances holds the feature vector of every input
        Instances trainingInstances = createInstances("TRAINING_INSTANCES");

        for (Instance currentInstance : inputDataset)
        {
            // extractFeature returns the feature vector for the current input
            Instance currentFeatureVector = extractFeature(currentInstance);

            // attach the feature vector to the dataset before adding it
            currentFeatureVector.setDataset(trainingInstances);
            trainingInstances.add(currentFeatureVector);
        }

        // Any Weka classifier can be created here,
        // for instance classifier = new SMO();
        // This tutorial uses the NaiveBayes classifier.
        classifier = new NaiveBayes();

        try {
            // train the classifier
            classifier.buildClassifier(trainingInstances);

            // store the trained classifier in a file for future use
            weka.core.SerializationHelper.write("NaiveBayes.model", classifier);
        } catch (Exception ex) {
            System.out.println("Exception in training the classifier: " + ex.getMessage());
        }
    }
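
Storing and reloading a model relies on standard Java object serialization, which works because Weka classifiers implement `java.io.Serializable`. The round trip can be sketched with the JDK alone. `ToyModel` and `ModelRoundTrip` are hypothetical names, and the sketch serializes to an in-memory byte array rather than to a file such as NaiveBayes.model so that it has no side effects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a trained model; not a Weka class.
final class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final String[] classLabels;

    ToyModel(String... classLabels) {
        this.classLabels = classLabels;
    }
}

final class ModelRoundTrip {

    // Serializes the model into a byte array.
    static byte[] toBytes(Serializable model) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Rebuilds the object; the caller casts it back to the expected type.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } catch (ClassNotFoundException ex) {
            throw new IllegalStateException("model class missing from the class path", ex);
        }
    }

    public static void main(String[] args) {
        ToyModel restored =
                (ToyModel) fromBytes(toBytes(new ToyModel("negative", "positive")));
        System.out.println(restored.classLabels[1]); // prints positive
    }
}
```

As with any serialized Java object, the model's class must be on the class path when it is read back, so a stored model is best loaded with the same Weka version that wrote it.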



3.4 Testing the Classifier

The following code snippet shows the steps involved in testing the classifier. During testing, the classifier uses what it learned in training to predict the class (sentiment) of each tweet.

    // Requires: weka.classifiers.Classifier, weka.core.Instance, weka.core.Instances
    public void testClassifier(final String INPUT_FILENAME)
    {
        // the same CSV reader is reused to load the testing dataset
        getTrainingDataset(INPUT_FILENAME);

        // testingInstances holds the feature vector of every input
        Instances testingInstances = createInstances("TESTING_INSTANCES");

        for (Instance currentInstance : inputDataset)
        {
            // extractFeature returns the feature vector for the current input
            Instance currentFeatureVector = extractFeature(currentInstance);

            // attach the feature vector to the dataset before adding it
            currentFeatureVector.setDataset(testingInstances);
            testingInstances.add(currentFeatureVector);
        }

        try {
            // classifier deserialization
            classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");

            for (Instance testInstance : testingInstances)
            {
                // classifyInstance returns the index of the predicted class value
                double score = classifier.classifyInstance(testInstance);
                // look up the label; this assumes the class attribute is named "Sentiment"
                System.out.println(testingInstances.attribute("Sentiment").value((int) score));
            }
        } catch (Exception ex) {
            System.out.println("Exception in testing the classifier: " + ex.getMessage());
        }
    }

    private Instances createInstances(final String INSTANCES_NAME)
    {
        // create an Instances object with an initial capacity of zero;
        // attributeList is a class field holding the attribute definitions
        Instances instances = new Instances(INSTANCES_NAME, attributeList, 0);

        // set the class index to the last attribute (positive or negative)
        instances.setClassIndex(instances.numAttributes() - 1);

        return instances;
    }
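
The testing code prints one predicted label per tweet. Because `classifyInstance` returns the index of the predicted class as a double, turning predictions into labels and then into an accuracy figure takes only a few lines. The following is a minimal sketch under stated assumptions: `PredictionReport`, its method names, and the sample labels are hypothetical, and the expected labels would come from the sentiment column of testing.csv.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Weka or the tutorial's project.
final class PredictionReport {

    // Maps a predicted class index (returned as a double) to its label.
    static String labelOf(double predictedIndex, List<String> classLabels) {
        return classLabels.get((int) predictedIndex);
    }

    // Fraction of predictions whose label equals the expected label.
    static double accuracy(double[] predictedIndices,
                           List<String> expectedLabels,
                           List<String> classLabels) {
        if (predictedIndices.length != expectedLabels.size()) {
            throw new IllegalArgumentException("one expected label per prediction is required");
        }
        if (predictedIndices.length == 0) {
            return 0.0; // nothing to score
        }
        int correct = 0;
        for (int i = 0; i < predictedIndices.length; i++) {
            if (labelOf(predictedIndices[i], classLabels).equals(expectedLabels.get(i))) {
                correct++;
            }
        }
        return (double) correct / predictedIndices.length;
    }

    public static void main(String[] args) {
        List<String> classLabels = Arrays.asList("negative", "positive");
        double[] predicted = {1.0, 0.0, 1.0, 1.0};
        List<String> expected = Arrays.asList("positive", "negative", "negative", "positive");
        System.out.println(accuracy(predicted, expected, classLabels)); // prints 0.75
    }
}
```

For fuller metrics such as precision, recall, or a confusion matrix, Weka's own `weka.classifiers.Evaluation` class is probably a better fit than a hand-written helper.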



Download the project files here.

The downloaded archive is a NetBeans project and can be opened with the NetBeans IDE. If you are not using NetBeans, the Java code is in the src directory. FeatureWordsList.dat holds the list of feature words (an ArrayList serialized to a file). training.csv and testing.csv contain the datasets for training and testing.

