Tokenization is the process of splitting a string into a list of pieces, or tokens. We'll start by splitting a paragraph into a list of sentences.
Getting ready
Installation instructions for NLTK are available at http://www.nltk.org/download and the latest version as of this writing is 2.0b9. NLTK requires Python 2.4 or higher, but is not compatible with Python 3.0. The recommended Python version is 2.6.
Once you've installed NLTK, you'll also need to install the data by following the instructions at http://www.nltk.org/data. We recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, and on Windows is C:\nltk_data. Make sure that tokenizers/punkt.zip is in the data directory and has been unpacked so that there's a file at tokenizers/punkt/english.pickle.
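If you prefer, you can also launch the data installer from a Python console using NLTK's built-in downloader; this is just one way to get the data and assumes a working internet connection:
>>> import nltk
>>> nltk.download()
This opens the NLTK downloader, where you can select the "all" collection to install everything, or pick individual packages such as the punkt tokenizer models and the webtext corpus.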
Finally, to run the code examples, you'll need to start a Python console. Instructions on how to do so are available at http://www.nltk.org/getting-started. If you're using Mac or Linux/Unix, you can open a terminal and type python.
How to do it...
Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:
>>> para = "Hello World. It's good to see you. Thanks for buying this book."
Now we want to split para into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument.
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
So now we have a list of sentences that we can use for further processing.
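The sent_tokenize() function relies on the pickled Punkt tokenizer mentioned in the Getting ready section. If you'll be tokenizing a lot of text, it can be slightly more efficient to load that tokenizer once and reuse it; a minimal sketch, assuming tokenizers/punkt/english.pickle is installed as described earlier:
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(para)
This should produce the same list of sentences as calling sent_tokenize(para).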
Collocations are two or more words that tend to appear frequently together, such as "United States". Of course, there are many other words that can come after "United", for example "United Kingdom", "United Airlines", and so on. As with many aspects of natural language processing, context is very important, and for collocations, context is everything!
In the case of collocations, the context will be a document in the form of a list of words. Discovering collocations in this list of words means that we'll find common phrases that occur frequently throughout the text. For fun, we'll start with the script for Monty Python and the Holy Grail.
Getting ready
The script for Monty Python and the Holy Grail is found in the webtext corpus, so be sure that it's unzipped in nltk_data/corpora/webtext/.
How to do it...
We're going to create a list of all lowercased words in the text, and then produce a BigramCollocationFinder, which we can use to find bigrams (pairs of words). These bigrams are found using the association measure functions in the nltk.metrics package.
>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> words = [w.lower() for w in webtext.words('grail.txt')]
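The excerpt skips the steps that actually find the collocations. A plausible reconstruction, using the standard BigramCollocationFinder API (from_words(), nbest(), apply_word_filter()) together with the stopwords corpus, and a length-and-stopword filter chosen here for illustration:
>>> bcf = BigramCollocationFinder.from_words(words)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)
The top results at this point are likely to be dominated by punctuation and very common words, so we filter those out and ask again:
>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> filter_stops = lambda w: len(w) < 3 or w in stopset
>>> bcf.apply_word_filter(filter_stops)
>>> bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)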
Much better: we can clearly see four of the most common bigrams in Monty Python and the Holy Grail. If you'd like to see more than four, simply increase the number to whatever you want, and the collocation finder will do its best.