Text Analysis with LingPipe 4

910200822 posted on 2018-12-6 00:50:14

Text Analysis with LingPipe 4

Bob Carpenter

Breck Baldwin

Contents
1 Getting Started
1.1 Tools of the Trade
1.2 Hello World Example
1.3 Introduction to Ant
2 Handlers, Parsers, and Corpora
2.1 Handlers and Object Handlers
2.2 Parsers
2.3 Corpora
2.4 Cross Validation
3 Tokenization
3.1 Tokenizers and Tokenizer Factories
3.2 LingPipe’s Base Tokenizer Factories
3.3 LingPipe’s Filtered Tokenizers
3.4 Morphology, Stemming, and Lemmatization
3.5 Soundex: Pronunciation-Based Tokens
3.6 Character Normalizing Tokenizer Filters
3.7 Penn Treebank Tokenization
3.8 Adapting to and From Lucene Analyzers
3.9 Tokenizations as Objects
4 Suffix Arrays
4.1 What is a Suffix Array?
4.2 Character Suffix Arrays
4.3 Token Suffix Arrays
4.4 Document Collections as Suffix Arrays
4.5 Implementation Details
5 Symbol Tables
5.1 The SymbolTable Interface
5.2 The MapSymbolTable Class
5.3 The SymbolTableCompiler Class
6 Character Language Models
6.1 Applications of Language Models
6.2 The Basics of N-Gram Language Models
6.3 Character-Level Language Models and Unicode
6.4 Language Model Interfaces
6.5 Process Character Language Models
6.6 Sequence Character Language Models
6.7 Tuning Language Model Smoothing
6.8 Underlying Sequence Counter
6.9 Learning Curve Evaluation
6.10 Pruning Counts
6.11 Compiling and Serializing Character LMs
6.12 Thread Safety
6.13 The Mathematical Model
7 Tokenized Language Models
7.1 Applications of Tokenized Language Models
7.2 Token Language Model Interface
8 Spelling Correction
9 Classifiers and Evaluation
9.1 What is a Classifier?
9.2 Kinds of Classifiers
9.3 Gold Standards, Annotation, and Reference Data
9.4 Confusion Matrices
9.5 Precision-Recall Evaluation
9.6 Micro- and Macro-Averaged Statistics
9.7 Scored Precision-Recall Evaluations
9.8 Contingency Tables and Derived Statistics
9.9 Bias Correction
9.10 Post-Stratification
10 Naive Bayes Classifiers
10.1 Introduction to Naive Bayes
10.2 Getting Started with Naive Bayes
10.3 Independence, Overdispersion and Probability Attenuation
10.4 Tokens, Counts and Sufficient Statistics
10.5 Unbalanced Category Probabilities
10.6 Maximum Likelihood Estimation and Smoothing
10.7 Item-Weighted Training
10.8 Document Length Normalization
10.9 Serialization and Compilation
10.10 Training and Testing with a Corpus
10.11 Cross-Validating a Classifier
10.12 Formalizing Naive Bayes
11 Tagging
11.1 Taggings
11.2 Tag Lattices
11.3 Taggers
11.4 Tagger Evaluators
12 Tagging with Hidden Markov Models
13 Conditional Random Fields
14 Latent Dirichlet Allocation
14.1 Corpora, Documents, and Tokens
14.2 LDA Parameter Estimation
14.3 Interpreting LDA Output
14.4 LDA’s Gibbs Samples
14.5 Handling Gibbs Samples
14.6 Scalability of LDA
14.7 Understanding the LDA Model Parameters
14.8 LDA Instances for Multi-Topic Classification
14.9 Comparing Documents with LDA
14.10 Stability of Samples
14.11 The LDA Model
15 Singular Value Decomposition
16 Sentence Boundary Detection
A Mathematics
A.1 Basic Notation
A.2 Useful Functions
B Statistics
B.1 Discrete Probability Distributions
B.2 Continuous Probability Distributions
B.3 Maximum Likelihood Estimation
B.4 Maximum a Posteriori Estimation
B.5 Information Theory
C Java Basics
C.1 Generating Random Numbers
D Corpora
D.1 Canterbury Corpus
D.2 20 Newsgroups
D.3 MedTag
D.4 WormBase MEDLINE Citations
E Further Reading
E.1 Algorithms
E.2 Probability and Statistics
E.3 Machine Learning
E.4 Linguistics
E.5 Natural Language Processing
F Licenses
F.1 LingPipe License
F.2 Java Licenses
F.3 Apache License 2.0
F.4 Common Public License 1.0
F.5 X License
F.6 Creative Commons Attribution-Sharealike 3.0 Unported License

