Tokenization is the process of splitting a string into a list of pieces, or tokens. We'll start by splitting a paragraph into a list of sentences.
Getting ready
Installation instructions for NLTK are available at http://www.nltk.org/download and the latest version as of this writing is 2.0b9. NLTK requires Python 2.4 or higher, but is not compatible with Python 3.0. The recommended Python version is 2.6.
Once you've installed NLTK, you'll also need to install the data by following the instructions at http://www.nltk.org/data. We recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, and on Windows is C:\nltk_data. Make sure that tokenizers/punkt.zip is in the data directory and has been unpacked so that there's a file at tokenizers/punkt/english.pickle.
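If you prefer, you can also launch the data installer from a Python console using NLTK's built-in downloader; this is just one way to get the data and assumes a working internet connection:
>>> import nltk
>>> nltk.download()
This opens the NLTK downloader, where you can select the "all" collection to install everything, or pick individual packages such as the punkt tokenizer models and the webtext corpus.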
Finally, to run the code examples, you'll need to start a Python console. Instructions on how to do so are available at http://www.nltk.org/getting-started. If you're using Mac or Linux/Unix, you can open a terminal and type python.
How to do it...
Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:
>>> para = "Hello World. It's good to see you. Thanks for buying this book."
Now we want to split para into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument.
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
So now we have a list of sentences that we can use for further processing.
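The sent_tokenize() function relies on the pickled Punkt tokenizer mentioned in the Getting ready section. If you'll be tokenizing a lot of text, it can be slightly more efficient to load that tokenizer once and reuse it; a minimal sketch, assuming tokenizers/punkt/english.pickle is installed as described earlier:
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(para)
This should produce the same list of sentences as calling sent_tokenize(para).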
Collocations are two or more words that tend to appear frequently together, such as "United States". Of course, there are many other words that can come after "United", for example "United Kingdom", "United Airlines", and so on. As with many aspects of natural language processing, context is very important, and for collocations, context is everything!
In the case of collocations, the context will be a document in the form of a list of words. Discovering collocations in this list of words means that we'll find common phrases that occur frequently throughout the text. For fun, we'll start with the script for Monty Python and the Holy Grail.
Getting ready
The script for Monty Python and the Holy Grail is found in the webtext corpus, so be sure that it's unzipped in nltk_data/corpora/webtext/.
How to do it...
We're going to create a list of all lowercased words in the text, and then produce a BigramCollocationFinder, which we can use to find bigrams (pairs of words). These bigrams are found using the association measure functions in the nltk.metrics package.
>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('grail.txt')]
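The excerpt skips the steps that actually find the collocations. A plausible reconstruction, using the standard BigramCollocationFinder API (from_words(), nbest(), apply_word_filter()) together with the stopwords corpus, and a length-and-stopword filter chosen here for illustration:
>>> bcf = BigramCollocationFinder.from_words(words)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
The top results at this point are likely to be dominated by punctuation and very common words, so we filter those out and ask again:
>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> filter_stops = lambda w: len(w) < 3 or w in stopset
>>> bcf.apply_word_filter(filter_stops)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)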
Much better: we can clearly see four of the most common bigrams in Monty Python and the Holy Grail. If you'd like to see more than four, simply increase the number to whatever you want, and the collocation finder will do its best.