Original poster: 飞天玄舞6

[Data Mining Theory and Cases] Tika in Action

Tika in Action
PART 1 GETTING STARTED
1 The case for the digital Babel fish
1.1 Understanding digital documents
A taxonomy of file formats ■ Parser libraries ■ Structured text as the universal language ■ Universal metadata ■ The program that understands everything
1.2 What is Apache Tika?
A bit of history ■ Key design goals ■ When and where to use Tika
1.3 Summary
2 Getting started with Tika
2.1 Working with Tika source code
Getting the source code ■ The Maven build ■ Including Tika in Ant projects
2.2 The Tika application
Drag-and-drop text extraction: the Tika GUI ■ Tika on the command line
2.3 Tika as an embedded library
Using the Tika facade ■ Managing dependencies
2.4 Summary
3 The information landscape
3.1 Measuring information overload
Scale and growth ■ Complexity
3.2 I’m feeling lucky—searching the information landscape
Just click it: the modern search engine ■ Tika’s role in search
3.3 Beyond lucky: machine learning
Your likes and dislikes ■ Real-world machine learning
3.4 Summary
PART 2 TIKA IN DETAIL
4 Document type detection
4.1 Internet media types
The parlance of media type names ■ Categories of media types ■ IANA and other type registries
4.2 Media types in Tika
The shared MIME-info database ■ The MediaType class ■ The MediaTypeRegistry class ■ Type hierarchies
4.3 File format diagnostics
Filename globs ■ Content type hints ■ Magic bytes ■ Character encodings ■ Other mechanisms
4.4 Tika, the type inspector
4.5 Summary
5 Content extraction
5.1 Full-text extraction
Abstracting the parsing process ■ Full-text indexing ■ Incremental parsing
5.2 The Parser interface
Who knew parsing could be so easy? ■ The parse() method ■ Parser implementations ■ Parser selection
5.3 Document input stream
Standardizing input to Tika ■ The TikaInputStream class
5.4 Structured XHTML output
Semantic structure of text ■ Structured output via SAX events ■ Marking up structure with XHTML
5.5 Context-sensitive parsing
Environment settings ■ Custom document handling
5.6 Summary
6 Understanding metadata
6.1 The standards of metadata
Metadata models ■ General metadata standards ■ Content-specific metadata standards
6.2 Metadata quality
Challenges/Problems ■ Unifying heterogeneous standards
6.3 Metadata in Tika
Keys and multiple values ■ Transformations and views
6.4 Practical uses of metadata
Common metadata for the Lucene indexer ■ Give me my metadata in my schema!
6.5 Summary
7 Language detection
7.1 The most translated document in the world
7.2 Sounds Greek to me—theory of language detection
Language profiles ■ Profiling algorithms ■ The N-gram algorithm ■ Advanced profiling algorithms
7.3 Language detection in Tika
Incremental language detection ■ Putting it all together
7.4 Summary
8 What’s in a file?
8.1 Types of content
HDF: a format for scientific data ■ Really Simple Syndication: a format for rapidly changing content
8.2 How Tika extracts content
Organization of content ■ File header and naming conventions ■ Storage affects extraction
8.3 Summary
PART 3 INTEGRATION AND ADVANCED USE
9 The big picture
9.1 Tika in search engines
The search use case ■ The anatomy of a search index
9.2 Managing and mining information
Document management systems ■ Text mining
9.3 Buzzword compliance
Modularity, Spring, and OSGi ■ Large-scale computing
9.4 Summary
10 Tika and the Lucene search stack
10.1 Load-bearing walls
ManifoldCF ■ Open Relevance
10.2 The steel frame
Lucene Core ■ Solr
10.3 The finishing touches
Nutch ■ Droids ■ Mahout
10.4 Summary
11 Extending Tika
11.1 Adding type information
Custom media type configuration
11.2 Custom type detection
The Detector interface ■ Building a custom type detector ■ Plugging in new detectors
11.3 Customized parsing
Customizing existing parsers ■ Writing a new parser ■ Plugging in new parsers ■ Overriding existing parsers
11.4 Summary
PART 4 CASE STUDIES
12 Powering NASA science data systems
12.1 NASA’s Planetary Data System
PDS data model ■ The PDS search redesign
12.2 NASA’s Earth Science Enterprise
Leveraging Tika in NASA Earth Science SIPS ■ Using Tika within the ground data systems
12.3 Summary
13 Content management with Apache Jackrabbit
13.1 Introducing Apache Jackrabbit
13.2 The text extraction pool
13.3 Content-aware WebDAV
13.4 Summary
14 Curating cancer research data with Tika
14.1 The NCI Early Detection Research Network
The EDRN data model ■ Scientific data curation
14.2 Integrating Tika
Metadata extraction ■ MIME type identification and classification
14.3 Summary
15 The classic search engine example
15.1 The Public Terabyte Dataset Project
15.2 The Bixo web crawler
Parsing fetched documents ■ Validating Tika’s charset detection
15.3 Summary
Appendix A Tika quick reference
Appendix B Supported metadata keys
Index



Tika in Action.pdf

15.52 MB

Price: 5 forum coins

strive for the best, prepare for the worst.
First reply
franky_sas posted on 2017-1-13 00:12:43
Thanks.
