Natural Language Tool Kit – Tutorial 13

Naive Bayes

This section builds on the last two tutorials to choose an algorithm, split the data into training and testing sets, and set it running.

The algorithm in this example is the Naive Bayes classifier, which applies Bayes' theorem with the simplifying ("naive") assumption that each feature is independent of the others given the category.

But first the data needs to be split into training and testing sets for some supervised machine learning. In essence we show the machine data and tell it "this data is positive" or "this data is negative." Then, once training is done, we show the machine some new data and ask it which category it thinks the new data belongs to.
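As a minimal sketch of that idea (a toy example, not part of the tutorial script), NLTK's NaiveBayesClassifier is trained on a list of (featureset, label) pairs and can then be asked to classify an unseen featureset:

import nltk

# Toy labelled data: each item is (dict of features, category label)
toy_training = [
    ({'great': True, 'awful': False}, 'pos'),
    ({'great': False, 'awful': True}, 'neg'),
]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_training)

# ask the trained classifier to label new, unseen data
print(toy_classifier.classify({'great': True, 'awful': False})) # expect 'pos'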

import nltk
import random
from nltk.corpus import movie_reviews

# build a list of (word list, category) pairs - one for each of the 2000 reviews in the corpus
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents) # shuffle so positive and negative reviews are mixed before splitting into train/test

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower()) # normalise everything to lower case and append

all_words = nltk.FreqDist(all_words) # convert to an NLTK frequency distribution (word -> count)

word_features = list(all_words.keys())[:3000] # from the distribution take just the words (keys) and keep 3000 of them as candidate features - see the note below
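A caveat worth flagging (a side note, not from the original tutorial): in Python 3.7+ FreqDist.keys() yields words in the order they were first seen, not sorted by count, so the slice above actually takes the first 3000 distinct words rather than the 3000 most frequent. To select the genuinely most common words, FreqDist.most_common() could be used instead, although the accuracy and informative features shown in the output below would then differ:

word_features = [w for (w, count) in all_words.most_common(3000)] # the 3000 highest-count words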

def find_features(document):
    words = set(document) # the set of unique words in the passed review - removes duplicates
    features = {} # declare an empty dictionary
    for w in word_features:
        features[w] = (w in words) # check each of the 3000 candidate words and record True/False for whether it appears in 'document'
    return features

# print(find_features(movie_reviews.words('neg/cv000_29416.txt'))) # would print a dict mapping each of the 3000 words to True/False for this one review

featuresets = [(find_features(rev), category) for (rev, category) in documents] # a (featureset, label) pair for every review

training_set = featuresets[:1900] # split the 2000 featuresets into two separate groups - 1900 to train and 100 to test
testing_set = featuresets[1900:]

## Naive Bayes Algorithm

classifier = nltk.NaiveBayesClassifier.train(training_set) # training the NaiveBayesClassifier on training data
print("Naive Bayes Algo accuracy:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15) # lists the words whose presence most strongly predicts one category over the other, with the pos:neg (or neg:pos) likelihood ratio
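As a quick extra check (not in the original tutorial), the trained classifier can also label an individual review. Continuing from the script above:

# classify a single review from the test split and compare the prediction with its true label
features, actual = testing_set[0]
print("predicted:", classifier.classify(features), "actual:", actual)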

Output:-

galiquis@raspberrypi: $ python3 ./nltk_tutorial13.py
Naive Bayes Algo accuracy: 80.0
Most Informative Features
           annual = True    pos : neg = 9.6 : 1.0
            sucks = True    neg : pos = 9.1 : 1.0
         bothered = True    neg : pos = 9.1 : 1.0
          frances = True    pos : neg = 8.9 : 1.0
          idiotic = True    neg : pos = 8.8 : 1.0
    unimaginative = True    neg : pos = 8.4 : 1.0
      silverstone = True    neg : pos = 7.7 : 1.0
           shoddy = True    neg : pos = 7.1 : 1.0
           suvari = True    neg : pos = 7.1 : 1.0
             mena = True    neg : pos = 7.1 : 1.0
           sexist = True    neg : pos = 7.1 : 1.0
           regard = True    pos : neg = 6.9 : 1.0
       schumacher = True    neg : pos = 6.7 : 1.0
       uninspired = True    neg : pos = 6.6 : 1.0
          kidding = True    neg : pos = 6.4 : 1.0
