On this page I provide useful hints and tips (recipes) for working with text data in WEKA. The information is organized as a list of blog posts and references, plus additional material such as code and text collections.
I suggest reading the following posts on text classification with WEKA in publication order:
Text Mining in WEKA: Chaining Filters and Classifiers explains how and why you should chain filters and classifiers when evaluating your text classifiers with cross-validation. The explanation uses the Explorer tools, and it serves as a quick introduction to the process of building a text classifier in WEKA, along with the FilteredClassifier class (a minimal code sketch of this setup follows this list of posts).
Command Line Functions for Text Mining in WEKA presents how to perform the previous experiments with the FilteredClassifier and MultiFilter classes, but in the command-line interface instead of WEKA's Explorer.
URL Text Classification with WEKA, Part 1: Data Analysis shows an application of text classification to processing URL text as a complement to URL database-based filtering in Web filters. This first post just explains how I built the dataset, while an upcoming post will explain my ongoing experiments.
Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers discusses three ways of mapping the set of terms used in the representations of the training and test sets of a text dataset in order to enable learning, namely using batch filters, the FilteredClassifier class, and the InputMappedClassifier class (the batch-filter option is sketched in code after this list).
Language Identification as Text Classification with WEKA explains how to build an automated language guesser for texts, as a complete example of a text mining process with WEKA, and demonstrates a more advanced usage of the StringToWordVector class.
Baseline Sentiment Analysis with WEKA shows how to configure and run an experiment on sentiment analysis and opinion mining using WEKA, and especially the TextDirectoryLoader and NGramTokenizer classes (a sketch combining both classes follows this list).
Performance Analysis of N-Gram Tokenizer in WEKA analyzes the WEKA NGramTokenizer class in terms of performance, which depends on the complexity of the regular expression used during the tokenization step.
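As a taste of the filter chaining approach described in the first post, here is a minimal sketch of a FilteredClassifier that wraps StringToWordVector around NaiveBayes and cross-validates it. The dataset file name and the class index are placeholders, and any other classifier would do:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ChainingDemo {
    public static void main(String[] args) throws Exception {
        // "spam.arff" is a placeholder for any dataset with a string
        // attribute holding the text and a nominal class attribute.
        Instances data = DataSource.read("spam.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Chaining the filter and the classifier ensures that the word
        // vector representation is rebuilt on the training folds only.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new NaiveBayes());

        // 10-fold cross-validation; the test folds never leak into the filter.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}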
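For the vocabulary mapping post, this is a sketch of the batch filtering option: a single StringToWordVector instance is initialized on the training set and then applied, unchanged, to the test set, so both end up with the same attribute space (the file names are placeholders):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilteringDemo {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // The filter learns its vocabulary from the training set only...
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, filter);
        // ...and maps the test texts onto that same vocabulary.
        Instances testVec = Filter.useFilter(test, filter);

        System.out.println(trainVec.numAttributes() + " attributes shared by both sets");
    }
}

This is the programmatic equivalent of the -b batch mode of WEKA filters in the command-line interface.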
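And this sketch combines the TextDirectoryLoader and NGramTokenizer classes covered in the last two posts; the directory name, the n-gram sizes, and the delimiter regular expression are just illustrative choices (the delimiters are precisely the regular expression whose complexity the performance post analyzes):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NGramDemo {
    public static void main(String[] args) throws Exception {
        // Each subdirectory of "reviews" (e.g. pos/, neg/) becomes a class value.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("reviews"));
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1); // the loader puts the class last

        // Tokenize into word uni-, bi- and tri-grams; the delimiter
        // expression drives both the tokens produced and the runtime cost.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);
        tokenizer.setDelimiters("\\W");

        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println(vectors.numAttributes() + " n-gram attributes");
    }
}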
Do you want me to deal with some specific topic? Just let me know.
I have some other posts on WEKA as well; all my posts related to WEKA can be found using the label WEKA.
Interesting references for working with WEKA include:
Use WEKA in your Java code provides an excellent introduction to using the Instances, Filter, Classifier, Clusterer, Evaluation, and AttributeSelection classes in your own code.
WEKA programmatic use describes the learning process life-cycle and, more importantly, explains how to deal with attributes in your Java code (a short sketch of this kind of programmatic use follows).
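As a minimal illustration of that kind of programmatic use, the following sketch trains a J48 tree and maps the numeric prediction for an instance back to its nominal class label through the class attribute; the dataset file name is a placeholder:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeDemo {
    public static void main(String[] args) throws Exception {
        // "iris.arff" is a placeholder; any dataset with a nominal class works.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // classifyInstance() returns the index of the predicted class value,
        // which the class attribute turns back into a readable label.
        double pred = tree.classifyInstance(data.instance(0));
        System.out.println("Predicted: " + data.classAttribute().value((int) pred));
    }
}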
As additional material, I provide the train/test split of the mini spam SMS collection used in the post in which I explain how to map the vocabulary from the training to the test collection in WEKA. It features a training subset and a test subset.
I also provide the URL list dataset for my posts on analyzing URL content for Web filtering (part 1, part 2). This dataset is over 40 MB. The list is based on a Squidblacklist.org list licensed under the Creative Commons Attribution 3.0 Unported License: Blacklists (Squidblacklist.org) / CC BY 3.0.