Thread starter: ReneeBK

[GitHub] Python 3 Text Processing with NLTK 3 Cookbook


#1 (OP) — ReneeBK, posted 2017-7-7 00:22:47

Hidden content in this post:

nltk3-cookbook-master.zip (37.06 KB)

nltk3-cookbook

Hidden content in this post:

https://github.com/karanmilan/Automatic-Answer-Evaluation/blob/master/Python%203%20Text%20Processing%20with%20NLTK%203%20Cookbook.pdf

Python 3 code and corpus examples for the Python 3 Text-Processing with NLTK 3 Cookbook.






#2 — ReneeBK, posted 2017-7-7 00:23:57
==========================
Setting up a Custom Corpus
==========================

>>> import os, os.path
>>> path = os.path.expanduser('~/nltk_data')
>>> if not os.path.exists(path):
...     os.mkdir(path)
>>> os.path.exists(path)
True
>>> import nltk.data
>>> path in nltk.data.path
True

>>> nltk.data.load('corpora/cookbook/mywords.txt', format='raw')
b'nltk\n'

>>> nltk.data.load('corpora/cookbook/synonyms.yaml')
{'bday': 'birthday'}
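The directory setup above can be sketched without NLTK at all. A minimal stdlib-only version, writing into a local demo directory rather than the real `~/nltk_data` so it has no side effects outside the working directory (the `nltk_data_demo` name is an assumption for illustration):

```python
import os

# Recreate the custom-corpus layout the recipe expects: <base>/corpora/cookbook.
base = os.path.join(os.getcwd(), 'nltk_data_demo')
corpus_dir = os.path.join(base, 'corpora', 'cookbook')
# makedirs with exist_ok=True creates parents and tolerates reruns,
# unlike the bare os.mkdir in the recipe.
os.makedirs(corpus_dir, exist_ok=True)

# Write the mywords.txt file whose raw load the recipe shows as b'nltk\n'.
with open(os.path.join(corpus_dir, 'mywords.txt'), 'wb') as f:
    f.write(b'nltk\n')

# Reading it back in binary mode mirrors nltk.data.load(..., format='raw').
with open(os.path.join(corpus_dir, 'mywords.txt'), 'rb') as f:
    raw = f.read()
print(raw)  # b'nltk\n'
```

With the real `~/nltk_data` path, the resulting file would then be loadable as `corpora/cookbook/mywords.txt`, since NLTK searches the directories in `nltk.data.path`.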

#3 — ReneeBK, posted 2017-7-7 00:24:52
===========================
Creating a Word List Corpus
===========================

>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['wordlist'])
>>> reader.words()
['nltk', 'corpus', 'corpora', 'wordnet']
>>> reader.fileids()
['wordlist']

>>> reader.raw()
'nltk\ncorpus\ncorpora\nwordnet\n'
>>> from nltk.tokenize import line_tokenize
>>> line_tokenize(reader.raw())
['nltk', 'corpus', 'corpora', 'wordnet']

>>> from nltk.corpus import names
>>> names.fileids()
['female.txt', 'male.txt']
>>> len(names.words('female.txt'))
5001
>>> len(names.words('male.txt'))
2943

>>> from nltk.corpus import words
>>> words.fileids()
['en', 'en-basic']
>>> len(words.words('en-basic'))
850
>>> len(words.words('en'))
234936
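Conceptually, a `WordListCorpusReader` just treats each non-blank line of a file as one word, which is also what `line_tokenize` does to the raw text. A minimal stdlib re-implementation of that behaviour, for illustration only (not NLTK's actual code):

```python
# The raw() string from the wordlist file in the recipe above.
raw = 'nltk\ncorpus\ncorpora\nwordnet\n'

def line_words(raw_text):
    """Split raw text into words, one per line, skipping blank lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]

words = line_words(raw)
print(words)  # ['nltk', 'corpus', 'corpora', 'wordnet']
```

This matches the `line_tokenize(reader.raw())` result shown above, which is why a plain newline-separated text file is all a word list corpus needs.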

#4 — ReneeBK, posted 2017-7-7 00:25:45
============================================
Creating a Part-of-Speech Tagged Word Corpus
============================================

>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]

>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]

>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

>>> reader = TaggedCorpusReader('.', r'.*\.pos', tagset='en-brown')
>>> reader.tagged_words(tagset='universal')
[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...]

>>> from nltk.corpus import treebank
>>> treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> treebank.tagged_words(tagset='universal')
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]
>>> treebank.tagged_words(tagset='brown')
[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]
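The `.pos` files a `TaggedCorpusReader` consumes use the Brown-style `word/TAG` format, one token per slash pair. A minimal stdlib parser for one such tagged sentence, as a sketch of what the reader does internally (not NLTK's actual implementation):

```python
# One sentence in the word/TAG format the recipe's .pos file would contain.
line = 'The/AT-TL expense/NN and/CC time/NN involved/VBN are/BER astronomical/JJ ./.'

def parse_tagged(sent):
    """Split 'word/TAG' tokens into (word, tag) tuples.

    rpartition splits on the LAST slash, so a token like './.' still
    parses correctly into word '.' with tag '.'.
    """
    pairs = []
    for token in sent.split():
        word, _, tag = token.rpartition('/')
        pairs.append((word, tag))
    return pairs

tagged = parse_tagged(line)
print(tagged[:2])  # [('The', 'AT-TL'), ('expense', 'NN')]
```

This reproduces the `tagged_words()` output shown above; `words()` is then just the first element of each pair.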

#5 — ReneeBK, posted 2017-7-7 00:26:47
================================
Creating a Chunked Phrase Corpus
================================

>>> from nltk.corpus.reader import ChunkedCorpusReader
>>> reader = ChunkedCorpusReader('.', r'.*\.chunk')
>>> reader.chunked_words()
[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ...]
>>> reader.chunked_sents()
[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','), Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]
>>> reader.chunked_paras()
[[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','), Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]]

>>> from nltk.corpus.reader import ConllChunkCorpusReader
>>> conllreader = ConllChunkCorpusReader('.', r'.*\.iob', ('NP', 'VP', 'PP'))
>>> conllreader.chunked_words()
[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), ...]
>>> conllreader.chunked_sents()
[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]), Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]
>>> conllreader.iob_words()
[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]
>>> conllreader.iob_sents()
[[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), ('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O')]]

>>> reader.chunked_words()[0].leaves()
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]
>>> reader.chunked_sents()[0].leaves()
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
>>> reader.chunked_paras()[0][0].leaves()
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
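The IOB scheme that `ConllChunkCorpusReader` parses is simple to state: `B-X` begins a chunk of type X, `I-X` continues the current chunk, and `O` marks a token outside any chunk. A stdlib sketch of that grouping, using a shortened version of the triples shown above (illustrative only, not NLTK's parser):

```python
# IOB triples as produced by iob_words() above (abbreviated).
iob = [('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'),
       ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'),
       ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O')]

def iob_to_chunks(triples):
    """Collapse IOB triples into (chunk_type, [(word, tag), ...]) groups."""
    chunks, current = [], None
    for word, tag, iob_tag in triples:
        if iob_tag.startswith('B-'):
            # B- starts a new chunk of the named type.
            current = (iob_tag[2:], [(word, tag)])
            chunks.append(current)
        elif iob_tag.startswith('I-') and current:
            # I- extends the chunk that is currently open.
            current[1].append((word, tag))
        else:
            # 'O': token is outside any chunk; close the open chunk.
            current = None
    return chunks

chunks = iob_to_chunks(iob)
print(chunks[0])  # ('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')])
```

Each resulting group corresponds to one of the `Tree('NP', ...)` subtrees in `chunked_words()`, with `O` tokens left as bare leaves.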

#6 — ReneeBK, posted 2017-7-7 00:27:29
==================================
Creating a Categorized Text Corpus
==================================

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')
>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['movie_neg.txt']
>>> reader.fileids(categories=['pos'])
['movie_pos.txt']

>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_map={'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']})
>>> reader.categories()
['neg', 'pos']
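What `cat_pattern` does is derive each file's category from its filename: the first capture group of the regex becomes the category. A stdlib sketch of that mapping with the movie-review filenames used above (not NLTK's internals):

```python
import re

# The fileids and cat_pattern from the recipe above.
fileids = ['movie_neg.txt', 'movie_pos.txt']
cat_pattern = re.compile(r'movie_(\w+)\.txt')

def categories_for(fids, pattern):
    """Map each fileid to the category captured by the pattern's first group."""
    return {fid: pattern.match(fid).group(1) for fid in fids}

cat_map = categories_for(fileids, cat_pattern)
print(cat_map)  # {'movie_neg.txt': 'neg', 'movie_pos.txt': 'pos'}
```

The explicit `cat_map` argument in the last recipe line is just this dictionary written out by hand (with list-valued categories, since one file may belong to several).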

#7 — ReneeBK, posted 2017-7-7 00:27:51
===================================
Creating a Categorized Chunk Corpus
===================================

>>> import nltk.data
>>> from catchunked import CategorizedChunkedCorpusReader
>>> path = nltk.data.find('corpora/treebank/tagged')
>>> reader = CategorizedChunkedCorpusReader(path, r'wsj_.*\.pos', cat_pattern=r'wsj_(.*)\.pos')
>>> len(reader.categories()) == len(reader.fileids())
True
>>> len(reader.chunked_sents(categories=['0001']))
16

>>> import nltk.data
>>> from catchunked import CategorizedConllChunkCorpusReader
>>> path = nltk.data.find('corpora/conll2000')
>>> reader = CategorizedConllChunkCorpusReader(path, r'.*\.txt', ('NP', 'VP', 'PP'), cat_pattern=r'(.*)\.txt')
>>> reader.categories()
['test', 'train']
>>> reader.fileids()
['test.txt', 'train.txt']
>>> len(reader.chunked_sents(categories=['test']))
2012
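The `len(reader.categories()) == len(reader.fileids())` check above holds because the pattern `r'wsj_(.*)\.pos'` captures the file number itself, making every treebank file its own category. A short stdlib sketch of that effect, using hypothetical filenames for illustration:

```python
import re

# Hypothetical treebank-style fileids; the real corpus has many more.
fileids = ['wsj_0001.pos', 'wsj_0002.pos', 'wsj_0003.pos']
pattern = re.compile(r'wsj_(.*)\.pos')

# Each filename yields a distinct capture, so categories and fileids
# are in one-to-one correspondence.
categories = sorted({pattern.match(f).group(1) for f in fileids})
print(categories)                       # ['0001', '0002', '0003']
print(len(categories) == len(fileids))  # True
```

With the conll2000 pattern `r'(.*)\.txt'`, by contrast, the captures are just `test` and `train`, giving two broad categories instead of one per file.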

#8 — auirzxp, posted 2017-7-7 00:29:27
Note: the author was banned or deleted; this content was automatically hidden.

#9 — 钱学森64, posted 2017-7-7 01:08:02
Thanks for sharing.

#10 — cszcszcsz, posted 2017-7-7 06:05:44
Thanks for sharing!
