This tutorial presents the steps for training one of the classifiers available in WEKA using java. Weka provides implementation of wide range of machine learning based classifiers. A trained classifier can be used for the classification of data in a particular domain which depends on the training set.To train a classifier we need a training set.Once the classifier is trained it can be stored in a file and can be loaded for later use (classifier serialization and deserialization).
In this article, we will provide you a weka tutorial on classsifing a tweet as either positive tweet or negative tweet(Sentiment analysis).The following points explain the steps involved in performing sentiment analysis using weka.
- Download weka 3.7.x and install it.
- In the weka installation folder you will find weka.jar. Add it to the java class path.
- The following section consists of code snippets for training and testing the Naive bayes classifier.
Tweets for training, annotated with their sentiment values are in CSV(Comma Separated Values) file format. Following code reads the input data from the CSV file
- private void getTrainingDataset(final String INPUT_FILENAME)
- {
- try{
- //reading the training dataset from CSV file
- CSVLoader trainingLoader =new CSVLoader();
- trainingLoader.setSource(new File(INPUT_FILENAME));
- inputDataset = trainingLoader.getDataSet();
- }catch(IOException ex)
- {
- System.out.println("Exception in getTrainingDataset Method");
- }
- }
3.2 Tweet Preprocessing and Feature extraction
Once input is read from the file, the data (tweets in this case) needs to be preprocessed and then feature extraction is performed. For tweet preprocessing, a modified CMU POS Tagger is used.featureWords is an arrayList that consists of the feature words. If the feature word is present in the tweet, the index corresponding to that feature word is given a value of 1. All other feature words that are not in the tweet have a value 0. There are totally 6800 featute words. This means that the feature vector is a point in which each axis represents the presence or the absence of the feature word corresponding to that axis. Hence Feature vector takes a sparse form (more zeroes than non zero entries). Thus a SparseInstance is used to represent a feature vector.
- private Instance extractFeature(Instance inputInstance)
- {
- Map<Integer,Double> featureMap = new TreeMap<>();
-
- //after tweet preprocessing, tweet is tokenized to individual words along with their parts of speech.
- List<Token> tokens = posTagger.runPOSTagger(inputInstance.stringValue(0));
-
- for(Token token : tokens)
- {
- switch(token.getPOS())
- {
- case "A":
- case "V":
- case "R":
- case "#":
- String word = token.getWord().replaceAll("#","");
- if(featureWords.contains(word))
- {
- //adding 1.0 to the featureMap represents that the feature word is present in the input data
- featureMap.put(featureWords.indexOf(word),1.0);
- }
- }
- }
- int indices[] = new int[featureMap.size()+1];
- double values[] = new double[featureMap.size()+1];
- int i=0;
- for(Map.Entry<Integer,Double> entry : featureMap.entrySet())
- {
- indices[i] = entry.getKey();
- values[i] = entry.getValue();
- i++;
- }
- indices[i] = featureWords.size();
- values[i] = (double)sentimentClassList.indexOf(inputInstance.stringValue(1));
- return new SparseInstance(1.0,values,indices,featureWords.size());
- }
Once feature extraction is done, training and testing can be performed.
3.3 Training the Classifier
In this tutorial, NaiveBayes classifier is used. The following code snippet consists of steps involved in training in the NaiveBayes Classifier.
- public void trainClassifier(final String INPUT_FILENAME)
- {
- getTrainingDataset(INPUT_FILENAME);
-
- //trainingInstances consists of feature vector of every input
- Instances trainingInstances = createInstances("TRAINING_INSTANCES");
-
- for(Instance currentInstance : inputDataset)
- {
- //extractFeature method returns the feature vector for the current input
- Instance currentFeatureVector = extractFeature(currentInstance);
-
- //Make the currentFeatureVector to be added to the trainingInstances
- currentFeatureVector.setDataset(trainingInstances);
- trainingInstances.add(currentFeatureVector);
- }
-
- //You can create the classifier that you want.
- //For instance classifier = new SMO;
- //In this tutorial we use NaiveBayes Classifier.
- classifier = new NaiveBayes();
-
- try {
- //classifier training code
- classifier.buildClassifier(trainingInstances);
-
- //storing the trained classifier to a file for future use
- weka.core.SerializationHelper.write("NaiveBayes.model",classifier);
- } catch (Exception ex) {
- System.out.println("Exception in training the classifier.");
- }
- }
3.4 Testing the Classifier
The following code snippet consists of steps involved in testing the classifier. In testing, with the training knowledge, the classifier tries to predict the class (sentiment) of the tweet.
- public void testClassifier(final String INPUT_FILENAME)
- {
- getTrainingDataset(INPUT_FILENAME);
-
- //trainingInstances consists of feature vector of every input
- Instances testingInstances = createInstances("TESTING_INSTANCES");
-
- for(Instance currentInstance : inputDataset)
- {
- //extractFeature method returns the feature vector for the current input
- Instance currentFeatureVector = extractFeature(currentInstance);
-
- //Make the currentFeatureVector to be added to the trainingInstances
- currentFeatureVector.setDataset(testingInstances);
- testingInstances.add(currentFeatureVector);
- }
-
-
- try {
- //Classifier deserialization
- classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");
-
- //classifier testing code
- for(Instance testInstance : testingInstances)
- {
- double score = classifier.classifyInstance(testInstance);
- System.out.println(testingInstances.attribute("Sentiment").value((int)score));
- }
- } catch (Exception ex) {
- System.out.println("Exception in testing the classifier.");
- }
- }
-
-
- private Instances createInstances(final String INSTANCES_NAME)
- {
-
- //create an Instances object with initial capacity as zero
- Instances instances = new Instances(INSTANCES_NAME,attributeList,0);
-
- //sets the class index as the last attribute (positive or negative)
- instances.setClassIndex(instances.numAttributes()-1);
-
- return instances;
- }
Download the project files here.
The downloaded archive is a NetBeans project. It can be opened with NetBeans IDE. If you are not using NetBeans IDE, go to the src directory for the java code. FeatureWordsList.dat is an arraylist of feature words(ArrayList serialized to file). training.csv and testing.csv contains the datasets for training and testing.


雷达卡


京公网安备 11010802022788号







