On this page I provide useful hints and tips (recipes) for working with text data in WEKA. The information is organized as a list of blog posts and references, plus additional material such as code and text collections.
I suggest reading the following posts on text classification with WEKA in publication order:
Text Mining in WEKA: Chaining Filters and Classifiers explains how and why you should chain filters and classifiers when evaluating your text classifiers with cross-validation. The explanation uses the Explorer tools, and it serves as a quick introduction to the process of building a text classifier in WEKA, along with the FilteredClassifier class (a minimal code sketch of this setup follows this list of posts).
Command Line Functions for Text Mining in WEKA presents how to perform the previous experiments with the FilteredClassifier and MultiFilter classes, but in the command-line interface instead of WEKA's Explorer.
URL Text Classification with WEKA, Part 1: Data Analysis shows an application of text classification to processing URL text as a complement to URL database-based filtering in Web filters. This first post just explains how I built the dataset, while an upcoming post will explain my ongoing experiments.
Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers discusses three ways of mapping the set of terms used in the representations of the training and test sets of a text dataset in order to enable learning, namely using batch filters, the FilteredClassifier class, and the InputMappedClassifier class (the batch-filter option is sketched in code after this list).
Language Identification as Text Classification with WEKA explains how to build an automated language guesser for texts, as a complete example of a text mining process with WEKA, and demonstrates a more advanced usage of the StringToWordVector class.
Baseline Sentiment Analysis with WEKA shows how to configure and run an experiment on sentiment analysis and opinion mining using WEKA, and especially the TextDirectoryLoader and NGramTokenizer classes (a sketch combining both classes follows this list).
Performance Analysis of N-Gram Tokenizer in WEKA analyzes the WEKA NGramTokenizer class in terms of performance, which depends on the complexity of the regular expression used during the tokenization step.
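As a taste of the filter chaining approach described in the first post, here is a minimal sketch of a FilteredClassifier that wraps StringToWordVector around NaiveBayes and cross-validates it. The dataset file name and the class index are placeholders, and any other classifier would do:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ChainingDemo {
    public static void main(String[] args) throws Exception {
        // "spam.arff" is a placeholder for any dataset with a string
        // attribute holding the text and a nominal class attribute.
        Instances data = DataSource.read("spam.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Chaining the filter and the classifier ensures that the word
        // vector representation is rebuilt on the training folds only.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new NaiveBayes());

        // 10-fold cross-validation; the test folds never leak into the filter.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}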
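For the vocabulary mapping post, this is a sketch of the batch filtering option: a single StringToWordVector instance is initialized on the training set and then applied, unchanged, to the test set, so both end up with the same attribute space (the file names are placeholders):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilteringDemo {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // The filter learns its vocabulary from the training set only...
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(train);
        Instances trainVec = Filter.useFilter(train, filter);
        // ...and maps the test texts onto that same vocabulary.
        Instances testVec = Filter.useFilter(test, filter);

        System.out.println(trainVec.numAttributes() + " attributes shared by both sets");
    }
}

This is the programmatic equivalent of the -b batch mode of WEKA filters in the command-line interface.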
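And this sketch combines the TextDirectoryLoader and NGramTokenizer classes covered in the last two posts; the directory name, the n-gram sizes, and the delimiter regular expression are just illustrative choices (the delimiters are precisely the regular expression whose complexity the performance post analyzes):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NGramDemo {
    public static void main(String[] args) throws Exception {
        // Each subdirectory of "reviews" (e.g. pos/, neg/) becomes a class value.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("reviews"));
        Instances raw = loader.getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1); // the loader puts the class last

        // Tokenize into word uni-, bi- and tri-grams; the delimiter
        // expression drives both the tokens produced and the runtime cost.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);
        tokenizer.setDelimiters("\\W");

        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println(vectors.numAttributes() + " n-gram attributes");
    }
}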
Do you want me to deal with some specific topic? Just let me know.
I have some other posts on WEKA as well; all my posts related to WEKA can be found using the label WEKA.
Interesting references for working with WEKA include:
Use WEKA in your Java code provides an excellent introduction to using the Instances, Filter, Classifier, Clusterer, Evaluation, and AttributeSelection classes in your own code.
WEKA programmatic use describes the learning process life-cycle and, more importantly, explains how to deal with attributes in your Java code (a short sketch of this kind of programmatic use follows).
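As a minimal illustration of that kind of programmatic use, the following sketch trains a J48 tree and maps the numeric prediction for an instance back to its nominal class label through the class attribute; the dataset file name is a placeholder:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeDemo {
    public static void main(String[] args) throws Exception {
        // "iris.arff" is a placeholder; any dataset with a nominal class works.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // classifyInstance() returns the index of the predicted class value,
        // which the class attribute turns back into a readable label.
        double pred = tree.classifyInstance(data.instance(0));
        System.out.println("Predicted: " + data.classAttribute().value((int) pred));
    }
}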
As additional material, I provide the train/test split of the mini spam SMS collection used in the post in which I explain how to map the vocabulary from the training to the test collection in WEKA. It features a training subset and a test subset.
I also provide the URL list dataset for my posts on analyzing URL content for Web filtering (part 1, part 2). This dataset is over 40 MB. The list is based on a Squidblacklist.org list licensed under the Creative Commons Attribution 3.0 Unported License: Blacklists (Squidblacklist.org) / CC BY 3.0.