Natural Language Toolkit – Tutorial 11

Text Classification

The next step on from looking at words and how they relate to each other is to broaden out to classifying sections of text. Classification can focus on identifying what a piece of text is about (e.g. politics, the military, etc.) or be as simple as identifying whether some text is spam or not, as in comment and email filters.

import nltk
import random
from nltk.corpus import movie_reviews # needs nltk.download('movie_reviews') to have been run once

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

print(documents[1])

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower()) # normalise everything to lower case and append

all_words = nltk.FreqDist(all_words) # converts to an NLTK frequency distribution
print(all_words.most_common(15)) # top 15 most common
print(all_words["stupid"]) # shows the frequency of a specific word

The first key piece of code is defining ‘documents’.

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

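This is a list comprehension. It loops over every category and every file ID within that category, and builds a list of tuples. Each tuple pairs the words of one review (turned into a list with the list() constructor) with that review's category.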

However, this is easier to see when broken down into multiple lines:-

documents = [] # declare the list
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category)) # append takes a single argument, so pass one tuple
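
To check that the two forms build the same list, here is a minimal sketch that runs without downloading the corpus. The toy_corpus dictionary and its file names are made up for illustration and only stand in for movie_reviews.

```python
# Hypothetical stand-in for movie_reviews: category -> {fileid: words}
toy_corpus = {
    "neg": {"neg/001.txt": ["a", "stupid", "plot"]},
    "pos": {"pos/001.txt": ["a", "great", "film"],
            "pos/002.txt": ["loved", "it"]},
}

# List comprehension: one (words, category) tuple per file
docs_comprehension = [(list(words), category)
                      for category in toy_corpus
                      for words in toy_corpus[category].values()]

# The same thing written as nested loops
docs_loop = []
for category in toy_corpus:
    for words in toy_corpus[category].values():
        # Note the extra brackets: append() takes one argument, here a tuple
        docs_loop.append((list(words), category))

print(docs_comprehension == docs_loop)  # True
print(docs_loop[0])  # (['a', 'stupid', 'plot'], 'neg')
```

The outer for in the comprehension corresponds to the outer loop, and the inner for to the inner loop, so the items come out in the same order.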

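Next we shuffle the documents. They are collected one category at a time, so shuffling mixes the categories before the data is later split into separate training and test sets (we don’t want to train and test on the same data).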
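
As a rough sketch of why the shuffle matters: the documents list is built one category at a time, so slicing it without shuffling would put only one category at the end. The toy data, the fixed seed and the 7/3 split below are assumptions for illustration, not values from this tutorial.

```python
import random

# Hypothetical stand-in for documents: 5 negative reviews, then 5 positive ones
toy_documents = [(["word"], "neg")] * 5 + [(["word"], "pos")] * 5

# Without shuffling, the last 3 items all come from one category
print([category for _, category in toy_documents[-3:]])  # ['pos', 'pos', 'pos']

random.seed(0)  # fixed seed only so the sketch is repeatable
shuffled = toy_documents[:]  # copy, so the original order is kept for comparison
random.shuffle(shuffled)     # shuffles the list in place

# Split into separate training and test sets (the 7/3 split is arbitrary)
train_set, test_set = shuffled[:7], shuffled[7:]
print(len(train_set), len(test_set))  # 7 3
```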

So the principle being used is to take all of the words from all of the reviews and compile them to find the most common words and whether they appear in positive or negative reviews. Then, when we look at a new review, we can test which words appear and determine whether they are more positive or negative. On to the next tutorial.
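
To make the principle concrete, here is a deliberately naive sketch using only the standard library. It counts words per category and lets a new review "vote" with its words. The toy reviews and the guess_category helper are hypothetical, and this simple vote is not a real classifier; it only illustrates the idea.

```python
from collections import Counter

# Hypothetical labelled reviews, in the same (words, category) shape as documents
toy_documents = [
    (["a", "stupid", "boring", "film"], "neg"),
    (["stupid", "plot"], "neg"),
    (["a", "great", "film"], "pos"),
]

# Count how often each word appears in each category
counts = {"neg": Counter(), "pos": Counter()}
for words, category in toy_documents:
    counts[category].update(w.lower() for w in words)

def guess_category(words):
    """Guess the category whose reviews used these words more often."""
    score = {cat: sum(counter[w.lower()] for w in words)
             for cat, counter in counts.items()}
    return max(score, key=score.get)

print(counts["neg"]["stupid"])            # 2
print(guess_category(["great", "film"]))  # pos
```

A word such as "film" appears in both categories, which is why looking at many words together matters more than any single one.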

Useful links:-

Lists and Tuples – https://realpython.com/python-lists-tuples/
