Original poster: Lisrelchen

Python Text Processing with NLTK 2.0 Cookbook


OP: Lisrelchen, posted 2014-10-10 01:13:58


Python Text Processing with NLTK 2.0 Cookbook


Jacob Perkins (Author)


Hidden content in this post:

Python Text Processing with NLTK 2.0 Cookbook.pdf (1.7 MB, requires: 20 forum coins)



Product Details
  • Paperback: 1 page
  • Publisher: Packt Publishing (Nov. 12 2010)
  • Language: English
  • ISBN-10: 1849513600
  • ISBN-13: 978-1849513609







Reply #1: crystal8832 (student verified), posted 2014-10-10 01:18:23
Tokenizing text into sentences

Tokenization is the process of splitting a string into a list of pieces, or tokens. We'll start by splitting a paragraph into a list of sentences.

Getting ready

Installation instructions for NLTK are available at http://www.nltk.org/download and the latest version as of this writing is 2.0b9. NLTK requires Python 2.4 or higher, but is not compatible with Python 3.0. The recommended Python version is 2.6.

Once you've installed NLTK, you'll also need to install the data by following the instructions at http://www.nltk.org/data. We recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, or on Windows is C:\nltk_data. Make sure that tokenizers/punkt.zip is in the data directory and has been unpacked so that there's a file at tokenizers/punkt/english.pickle; one way to check this from Python is sketched below.
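A minimal sketch of that check, assuming a standard NLTK install (nltk.data.find and nltk.download are the library's own data utilities; the LookupError handling is illustrative):

>>> import nltk
>>> # Raises LookupError if the punkt model is missing, otherwise returns a path pointer.
>>> nltk.data.find('tokenizers/punkt/english.pickle')
>>> # If it raised, fetch just the punkt package through the downloader.
>>> nltk.download('punkt')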

Finally, to run the code examples, you'll need to start a Python console. Instructions on how to do so are available at http://www.nltk.org/getting-started. For Mac and Linux/Unix users, you can open a terminal and type python.

How to do it...

Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:

>>> para = "Hello World. It's good to see you. Thanks for buying this book."

Now we want to split para into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument.

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

So now we have a list of sentences that we can use for further processing.
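sent_tokenize relies on the pickled punkt tokenizer mentioned in the Getting ready section, so here is a minimal sketch of loading that tokenizer once and reusing it, assuming the data is installed as described above:

>>> import nltk.data
>>> # Load the pickled PunktSentenceTokenizer directly from the data directory.
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

Keeping one tokenizer instance around avoids repeatedly looking up the pickled model when you tokenize many paragraphs.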



Reply #7: Nicolle (student verified), posted 2015-9-5 21:48:29

Looking up lemmas and synonyms in WordNet

[Notice: the author has been banned or deleted; the post content was automatically hidden.]
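The surviving title points at NLTK's WordNet corpus reader, so here is a minimal sketch of the idea (the word 'cookbook' is an illustrative choice; the attribute-style access matches the NLTK 2.x API, while NLTK 3 turned name and lemmas into methods):

>>> from nltk.corpus import wordnet
>>> syn = wordnet.synsets('cookbook')[0]  # first synset for the word
>>> syn.name
'cookbook.n.01'
>>> # Synonyms are the lemmas grouped under the same synset.
>>> [lemma.name for lemma in syn.lemmas]
['cookbook', 'cookery_book']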


Reply #10: Lisrelchen, posted 2015-9-5 21:55:36
Discovering word collocations

Collocations are two or more words that tend to appear frequently together, such as "United States". Of course, there are many other words that can come after "United", for example "United Kingdom", "United Airlines", and so on. As with many aspects of natural language processing, context is very important, and for collocations, context is everything!

In the case of collocations, the context will be a document in the form of a list of words. Discovering collocations in this list of words means that we'll find common phrases that occur frequently throughout the text. For fun, we'll start with the script for Monty Python and the Holy Grail.

Getting ready

The script for Monty Python and the Holy Grail is found in the webtext corpus, so be sure that it's unzipped in nltk_data/corpora/webtext/.

How to do it...

We're going to create a list of all lowercased words in the text, and then produce a BigramCollocationFinder, which we can use to find bigrams, which are pairs of words. These bigrams are found using association measurement functions in the nltk.metrics package.

>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('grail.txt')]
>>> bcf = BigramCollocationFinder.from_words(words)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't')]

Well, that's not very useful! Let's refine it a bit by adding a word filter to remove punctuation and stopwords.

>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> filter_stops = lambda w: len(w) < 3 or w in stopset
>>> bcf.apply_word_filter(filter_stops)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
[('black', 'knight'), ('clop', 'clop'), ('head', 'knight'), ('mumble', 'mumble')]

Much better: we can clearly see four of the most common bigrams in Monty Python and the Holy Grail. If you'd like to see more than four, simply increase the number to whatever you want, and the collocation finder will do its best.
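The same pattern extends beyond pairs; a minimal sketch with TrigramCollocationFinder, reusing words and filter_stops from above (the minimum frequency of 3 passed to apply_freq_filter is an arbitrary illustrative threshold):

>>> from nltk.collocations import TrigramCollocationFinder
>>> from nltk.metrics import TrigramAssocMeasures
>>> tcf = TrigramCollocationFinder.from_words(words)
>>> tcf.apply_word_filter(filter_stops)   # same punctuation/stopword filter as before
>>> tcf.apply_freq_filter(3)              # ignore trigrams seen fewer than 3 times
>>> tcf.nbest(TrigramAssocMeasures.likelihood_ratio, 4)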
