Natural Language Toolkit (NLTK) – Tutorial 4

Part of Speech Tagging

Part-of-speech (POS) tagging labels every word in a text with its grammatical category, such as noun, verb or adjective.

Below is a list of the POS tags and their meaning…
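
For the tags that show up in this tutorial's output, the usual Penn Treebank meanings can be kept in a small lookup table. This is only a partial sketch: it covers the tags seen in the output of this tutorial, not the full tag set, and the names TAG_MEANINGS and describe() are illustrative rather than part of NLTK.

```python
# Partial glossary: only the Penn Treebank tags seen in this tutorial's output.
# TAG_MEANINGS and describe() are illustrative names, not part of NLTK.
TAG_MEANINGS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "POS": "possessive ending",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "TO": "the word 'to'",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WP": "wh-pronoun",
}

def describe(tag):
    """Return a readable meaning for a tag, with a fallback for anything else."""
    return TAG_MEANINGS.get(tag, "unknown or punctuation tag")

# Look up a few (word, tag) pairs taken from the output in this tutorial.
for word, tag in [("members", "NNS"), ("of", "IN"), ("Congress", "NNP")]:
    print(f"{word:10} {tag:5} {describe(tag)}")
```

Punctuation tokens such as ',' and '.' are tagged with themselves, which is why they fall through to the fallback here. NLTK can also print the full tag descriptions with nltk.help.upenn_tagset(), provided the tag set data has been downloaded.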

This time a different tokenizer is used: PunktSentenceTokenizer.
It is an unsupervised machine-learning sentence tokenizer that can first be trained on sample text and then applied to other text.

The example uses two of George W. Bush's State of the Union speeches: the 2005 speech trains the tokenizer, which is then applied to the 2006 speech.

The tokenizer's output is a list of sentences. Each sentence is split into words and passed to the POS tagging function (nltk.pos_tag), which returns a list of (word, tag) tuples.

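import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Requires the NLTK data packages for the state_union corpus,
# word_tokenize and pos_tag (see nltk.download()).

# The 2005 speech trains the sentence tokenizer; the 2006 speech is tagged.
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# Train Punkt on the 2005 speech, then split the 2006 speech into sentences.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        # Only the first five sentences, to keep the output short.
        for sentence in tokenized[:5]:
            words = nltk.word_tokenize(sentence)  # split the sentence into words
            tagged = nltk.pos_tag(words)          # list of (word, tag) tuples
            print(tagged)

    except Exception as e:
        print(str(e))

process_content()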

Note: tokenized[:5] limits the example to the first five sentences of the speech.

This gives the following output:

galiquis@raspberrypi: $ python3 ./nltk_tutorial4.py
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]
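
Because each sentence comes back as a plain list of (word, tag) tuples, ordinary Python is enough to summarise it. The following is a minimal sketch that uses a fragment copied from the third sentence of the output instead of calling NLTK again. The variable names are illustrative only.

```python
from collections import Counter

# Fragment of the third tagged sentence, copied from the tutorial output.
tagged = [
    ("we", "PRP"), ("are", "VBP"), ("grateful", "JJ"), ("for", "IN"),
    ("the", "DT"), ("good", "JJ"), ("life", "NN"), ("of", "IN"),
    ("Coretta", "NNP"), ("Scott", "NNP"), ("King", "NNP"), (".", "."),
]

# Count how often each tag occurs in the fragment.
tag_counts = Counter(tag for _, tag in tagged)

# Collect the words tagged as proper nouns, in their original order.
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(tag_counts["NNP"])       # number of proper-noun tokens
print(" ".join(proper_nouns))  # the proper nouns joined into one string
```

The same two lines work unchanged on the tagged list produced inside process_content(), since nltk.pos_tag returns the same list-of-tuples structure.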
