Text Classification
The next step on from looking at words and how they relate to each other is to broaden out to classifying sections of text. Classification can focus on identifying what a piece of text is about (e.g. politics, the military, etc.) or be as simple as deciding whether some text is spam or not, as in comment and email filters.
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)
print(documents[1])

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())  # normalise everything to lower case and append

all_words = nltk.FreqDist(all_words)  # converts to a nltk frequency distribution
print(all_words.most_common(15))  # top 15 most common
print(all_words["stupid"])  # shows the frequency of a specific word
The first key piece of code is defining ‘documents’.
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
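This is a list comprehension. For every file in every category it builds a tuple of (words, category), where list() turns that file's words into a plain Python list. The result, documents, is a list of these tuples, one per review.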
However, this is easier to see when broken down into multiple lines:-
documents = []  # declare the list
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        # append() takes a single argument, so wrap the pair in a tuple
        documents.append((list(movie_reviews.words(fileid)), category))
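To see the shape of the result without downloading the corpus, here is a minimal sketch that runs both forms on a made-up stand-in. The toy_corpus dictionary and its file names are hypothetical and only mimic the category/file/words layout of movie_reviews.

```python
# Hypothetical stand-in for movie_reviews: category -> {fileid: words}
toy_corpus = {
    "neg": {"neg/1.txt": ["dull", "plot"], "neg/2.txt": ["stupid", "film"]},
    "pos": {"pos/1.txt": ["great", "cast"]},
}

# Comprehension form, mirroring the one-liner in the tutorial
docs_comprehension = [
    (list(words), category)
    for category in toy_corpus
    for words in toy_corpus[category].values()
]

# Loop form: the inner parentheses make a single tuple for append()
docs_loop = []
for category in toy_corpus:
    for words in toy_corpus[category].values():
        docs_loop.append((list(words), category))

print(docs_comprehension == docs_loop)  # True
print(docs_loop[0])                     # (['dull', 'plot'], 'neg')
```

Both versions produce the same list of (words, category) tuples, so which one to use is a matter of readability.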
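Next we shuffle the documents. The list is built one category at a time, so the reviews start out grouped by label. Shuffling mixes them so that, when the data is later split into separate training and testing sets, both sets contain a mix of labels. Keeping the two sets separate matters because we don't want to train and test on the same data.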
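The effect of shuffling before a split can be sketched on toy data. The 8/2 split and the fixed seed below are assumptions made only for illustration; the tutorial code itself calls random.shuffle(documents) directly and does not split the data yet.

```python
import random

# Toy (words, label) pairs in corpus order: all "neg" first, then all "pos"
documents = [(["bad"], "neg")] * 5 + [(["good"], "pos")] * 5

# Without shuffling, a tail slice used as a test set holds a single label
unshuffled_test = documents[8:]
print({label for _, label in unshuffled_test})  # {'pos'}

# Shuffle a copy; the seed is arbitrary and only makes the run repeatable
shuffled = documents[:]
random.Random(0).shuffle(shuffled)

# Hypothetical split: first 8 items for training, last 2 for testing
train_set, test_set = shuffled[:8], shuffled[8:]
print(len(train_set), len(test_set))  # 8 2
```

Because the two slices do not overlap, no document ends up in both the training and the testing set.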
So the principle being used is to take all of the words from all of the reviews and compile them to find the most common words and whether they appear in positive or negative reviews. Then, when we look at a new review, we can check which words appear and determine whether they are more positive or negative. Onto the next tutorial.
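As a rough preview of that idea, the sketch below counts words per category on a few made-up reviews and labels new text by whichever category has seen its words more often. This is a deliberately crude stand-in, not the classifier a later tutorial would use; collections.Counter takes the place of nltk.FreqDist so it runs without the corpus, and naive_label is a hypothetical helper.

```python
from collections import Counter

# Toy labelled documents in the same (words, category) shape as `documents`
documents = [
    (["a", "great", "fun", "film"], "pos"),
    (["great", "cast", "great", "story"], "pos"),
    (["a", "stupid", "dull", "film"], "neg"),
    (["dull", "plot", "stupid", "ending"], "neg"),
]

# Count how often each (lower-cased) word appears in each category
counts = {"pos": Counter(), "neg": Counter()}
for words, category in documents:
    counts[category].update(w.lower() for w in words)


def naive_label(words):
    """Pick the category whose reviews contained these words more often."""
    pos_score = sum(counts["pos"][w.lower()] for w in words)
    neg_score = sum(counts["neg"][w.lower()] for w in words)
    return "pos" if pos_score >= neg_score else "neg"


print(counts["neg"]["stupid"])                       # 2
print(naive_label(["what", "a", "stupid", "plot"]))  # neg
```

Words that appear equally in both categories, such as "a" here, contribute nothing to the decision, which hints at why choosing informative words matters.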
Useful links:-
Lists and Tuples – https://realpython.com/python-lists-tuples/