Natural Language Tool Kit – Tutorial 8

Lemmatizing

Lemmatizing is very similar to stemming, with the key difference being that lemmatizing ends up at a real word.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

Gives the output:-

galiquis@raspberrypi:$ python3 ./nltk_tutorial8.py
cat
cactus
goose
rock
python
good
best
run
run

Some points to note:-

  • lemmatize() takes an optional POS parameter/tag so:-
    • pos="a" or 'a' will find the closest adjective
    • pos="v" or 'v' will find the closest verb
    • the default (no option) finds the closest noun
  • More powerful than stemming – see the quick comparison below
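
To see how it differs from stemming in practice, here is a quick side-by-side sketch (my own example words, reusing the PorterStemmer from Tutorial 3 and assuming the wordnet data has been downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stem is often not a real word; the lemma should always be one
for word in ["geese", "cacti", "rocks"]:
    print(word, "-> stem:", ps.stem(word), "/ lemma:", lemmatizer.lemmatize(word))

# For verbs the POS tag matters to the lemmatizer
print(lemmatizer.lemmatize("running", pos="v"))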

Natural Language Tool Kit – Tutorial 7

Named Entity Recognition

A key method for chunking in natural language processing is called “Named Entity Recognition.” The concept is to have the code identify and pull out “entities” like people, places, things, locations, monetary figures, and more.

NE Type and Examples
ORGANIZATION – Georgia-Pacific Corp., WHO
PERSON – Eddy Bonte, President Obama
LOCATION – Murray River, Mount Everest
DATE – June, 2008-06-29
TIME – two fifty a m, 1:30 p.m.
MONEY – 175 million Canadian Dollars, GBP 10.40
PERCENT – twenty pct, 18.75 %
FACILITY – Washington Monument, Stonehenge
GPE – South East Asia, Midlothian

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()

The key line that applies the entity recognition is:-

namedEnt = nltk.ne_chunk(tagged, binary=True)

Which has two options…

  • binary=True – this means either something is a named entity, or not. There will be no further detail and hits are labelled ‘NE’
  • binary=False – this provides detail on the named entity, e.g. Facility, Organisation, Person, etc.

In the example, when binary is False it picked up the same tagged items but split terms like White House into “White” and “House” as if they were separate entities, whereas with binary=True the named entity recognition correctly treated White House as part of the same named entity.
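
Rather than drawing each tree, the entities can also be pulled out of the ne_chunk() result programmatically. A minimal sketch (the helper function is my own, not from the tutorial), reusing the tagged sentences from process_content():

def extract_entities(tagged_sentence):
    # Return (label, text) pairs for each named-entity subtree
    entities = []
    tree = nltk.ne_chunk(tagged_sentence, binary=False)
    for subtree in tree.subtrees():
        if subtree.label() != 'S':  # skip the root sentence node
            text = " ".join(word for word, tag in subtree.leaves())
            entities.append((subtree.label(), text))
    return entities

# e.g. inside the loop above:
# print(extract_entities(tagged))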

NLTK’s Named Entity Recognition can be a bit hit and miss, leading to a lot of false positives.

Natural Language Tool Kit – Tutorial 6

Chinking

Chinking is the process of excluding/removing things from Chunks. So there might be items in the Chunk that need to be removed.

This is done by expanding the Chunk code to include }{ brackets that contain the items to be excluded.

Example:-

chunkGram = r"""Chunk: {<.*>+}
                        }<VB.?|IN|DT|TO>+{"""
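
Pulling that together, a sketch of how the Tutorial 5 chunking script can be adapted to use this chink rule (same GW Bush speeches and tokenizer set-up as before):

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for x in tokenized[:5]:
            words = nltk.word_tokenize(x)
            tagged = nltk.pos_tag(words)

            # {...} chunks everything, then }...{ chinks out verbs,
            # prepositions, determiners and "to"
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)

    except Exception as e:
        print(str(e))

process_content()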

Natural Language Tool Kit – Tutorial 5

Chunking

The next step on from knowing the parts of speech (with tags) is to group them into meaningful chunks. The chunks are based around a subject (usually a noun – a naming word) with the relevant verbs and adverbs associated with it.

To enable chunking, regular expressions are employed, mainly:-

  • + = match 1 or more
  • ? = match 0 or 1 repetition
  • * = match 0 or more repetitions
  • . = any character except a new line

Regex cheat-sheet (full) – http://www.rexegg.com/regex-quickstart.html#ref

The main focus of Chunking is developing the right regular expression to pull together each proper noun (tag: NNP) along with the various types of verb and adverb.

The final expression was:-

r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

So breaking this down:-

  • r at the start tells Python to treat the text as a raw string
  • """ """ triple quotes allow the expression to be split over multiple lines for ease of reading
  • Chunk: is the label that the parser will apply to each identified chunk
  • {} contains the detail of what needs to be found – the qualifier
  • <> contains the items being searched for
  • <RB.?>* – zero or more adverbs: RB optionally followed by one extra character (.?), so RB, RBR or RBS
  • <VB.?>* – zero or more verbs: VB optionally followed by one extra character, so VB, VBD, VBG, VBN, VBP or VBZ
  • <NNP>+ – one or more iterations of a proper noun (NNP)
  • <NN>? – zero or one iteration of a singular noun (NN)

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for x in tokenized[:5]:
            words = nltk.word_tokenize(x)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)

            print(chunked)

    except Exception as e:
        print(str(e))

process_content()

Output:-

galiquis@raspberrypi: $ python3 ./nltk_tutorial5.py
(S
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
‘S/POS
(Chunk ADDRESS/NNP)
BEFORE/IN
(Chunk A/NNP JOINT/NNP SESSION/NNP)
OF/IN
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
OF/IN
(Chunk THE/NNP UNION/NNP January/NNP)
31/CD
,/,
2006/CD
(Chunk THE/NNP PRESIDENT/NNP)
:/:
(Chunk Thank/NNP)
you/PRP
all/DT
./.)
(S
(Chunk Mr./NNP Speaker/NNP)
,/,
(Chunk Vice/NNP President/NNP Cheney/NNP)
,/,
members/NNS
of/IN
(Chunk Congress/NNP)
,/,
members/NNS
of/IN
the/DT
(Chunk Supreme/NNP Court/NNP)
and/CC
diplomatic/JJ
corps/NN
,/,
distinguished/JJ
guests/NNS
,/,
and/CC
fellow/JJ
citizens/NNS
:/:
Today/VB
our/PRP$
nation/NN
lost/VBD
a/DT
beloved/VBN
,/,
graceful/JJ
,/,
courageous/JJ
woman/NN
who/WP
(Chunk called/VBD America/NNP)
to/TO
its/PRP$
founding/NN
ideals/NNS
and/CC
carried/VBD
on/IN
a/DT
noble/JJ
dream/NN
./.)
(S
Tonight/NN
we/PRP
are/VBP
comforted/VBN
by/IN
the/DT
hope/NN
of/IN
a/DT
glad/JJ
reunion/NN
with/IN
the/DT
husband/NN
who/WP
was/VBD
taken/VBN
so/RB
long/RB
ago/RB
,/,
and/CC
we/PRP
are/VBP
grateful/JJ
for/IN
the/DT
good/JJ
life/NN
of/IN
(Chunk Coretta/NNP Scott/NNP King/NNP)
./.)
(S (/( (Chunk Applause/NNP) ./. )/))
(S
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
reacts/VBZ
to/TO
applause/VB
during/IN
his/PRP$
(Chunk State/NNP)
of/IN
the/DT
(Chunk Union/NNP Address/NNP)
at/IN
the/DT
(Chunk Capitol/NNP)
,/,
(Chunk Tuesday/NNP)
,/,
(Chunk Jan/NNP)
./.)

Natural Language Tool Kit – Tutorial 4

Part of Speech Tagging

Part of Speech Tagging is the process of adding a label to every single word in a text, identifying the type of word it is.

The tags used are the standard Penn Treebank POS tags – NLTK can print the full list of tags and their meanings, as shown below.
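
A quick sketch of looking the tags up from NLTK itself (this assumes the 'tagsets' data package has been downloaded via nltk.download('tagsets')):

import nltk

nltk.help.upenn_tagset()        # print every Penn Treebank tag with its meaning
nltk.help.upenn_tagset('NNP')   # or look up a single tag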

This time round a different Tokenizer is used – PunktSentenceTokenizer.
This is a machine learning tokenizer which can be trained first on some training data and then applied.

In the example two of GW Bush’s speeches are used: the 2005 speech to train the tokenizer, which is then applied to the 2006 speech.

The output of the tokenizer is then passed to the POS tagging function to create tuples of each word in the speech and its corresponding POS tag.

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for x in tokenized[:5]:
            words = nltk.word_tokenize(x)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))

process_content()

Note: tokenized[:5] limits the example to the first few sentences of the speech

Giving the output….

galiquis@raspberrypi: $ python3 ./nltk_tutorial4.py
[(‘PRESIDENT’, ‘NNP’), (‘GEORGE’, ‘NNP’), (‘W.’, ‘NNP’), (‘BUSH’, ‘NNP’), (“‘S”, ‘POS’), (‘ADDRESS’, ‘NNP’), (‘BEFORE’, ‘IN’), (‘A’, ‘NNP’), (‘JOINT’, ‘NNP’), (‘SESSION’, ‘NNP’),
(‘OF’, ‘IN’), (‘THE’, ‘NNP’), (‘CONGRESS’, ‘NNP’), (‘ON’, ‘NNP’), (‘THE’, ‘NNP’), (‘STATE’, ‘NNP’), (‘OF’, ‘IN’), (‘THE’, ‘NNP’), (‘UNION’, ‘NNP’), (‘January’, ‘NNP’), (’31’, ‘CD’), (‘,’, ‘,’), (‘2006’, ‘CD’), (‘THE’, ‘NNP’), (‘PRESIDENT’, ‘NNP’), (‘:’, ‘:’), (‘Thank’, ‘NNP’), (‘you’, ‘PRP’), (‘all’, ‘DT’), (‘.’, ‘.’)][(‘Mr.’, ‘NNP’), (‘Speaker’, ‘NNP’), (‘,’, ‘,’), (‘Vice’, ‘NNP’), (‘President’, ‘NNP’), (‘Cheney’, ‘NNP’), (‘,’, ‘,’), (‘members’, ‘NNS’), (‘of’, ‘IN’), (‘Congress’, ‘NNP’), (‘,’, ‘,’), (‘members’, ‘NNS’), (‘of’, ‘IN’), (‘the’, ‘DT’), (‘Supreme’, ‘NNP’), (‘Court’, ‘NNP’), (‘and’, ‘CC’), (‘diplomatic’, ‘JJ’), (‘corps’, ‘NN’), (‘,’, ‘,’), (‘distinguished’, ‘JJ’), (‘guests’, ‘NNS’), (‘,’, ‘,’), (‘and’, ‘CC’), (‘fellow’, ‘JJ’), (‘citizens’, ‘NNS’), (‘:’, ‘:’), (‘Today’, ‘VB’), (‘our’, ‘PRP$’), (‘nation’, ‘NN’), (‘lost’, ‘VBD’), (‘a’, ‘DT’), (‘beloved’, ‘VBN’), (‘,’, ‘,’), (‘graceful’, ‘JJ’), (‘,’, ‘,’), (‘courageous’, ‘JJ’), (‘woman’, ‘NN’), (‘who’, ‘WP’), (‘called’, ‘VBD’), (‘America’, ‘NNP’), (‘to’, ‘TO’), (‘its’, ‘PRP$’), (‘founding’, ‘NN’), (‘ideals’, ‘NNS’), (‘and’, ‘CC’), (‘carried’, ‘VBD’), (‘on’, ‘IN’), (‘a’, ‘DT’), (‘noble’, ‘JJ’), (‘dream’, ‘NN’), (‘.’, ‘.’)][(‘Tonight’, ‘NN’), (‘we’, ‘PRP’), (‘are’, ‘VBP’), (‘comforted’, ‘VBN’), (‘by’, ‘IN’), (‘the’, ‘DT’), (‘hope’, ‘NN’), (‘of’, ‘IN’), (‘a’, ‘DT’), (‘glad’, ‘JJ’), (‘reunion’, ‘NN’), (‘with’, ‘IN’), (‘the’, ‘DT’), (‘husband’, ‘NN’), (‘who’, ‘WP’), (‘was’, ‘VBD’), (‘taken’, ‘VBN’), (‘so’, ‘RB’), (‘long’, ‘RB’), (‘ago’, ‘RB’), (‘,’, ‘,’), (‘and’, ‘CC’), (‘we’,
‘PRP’), (‘are’, ‘VBP’), (‘grateful’, ‘JJ’), (‘for’, ‘IN’), (‘the’, ‘DT’), (‘good’, ‘JJ’), (‘life’, ‘NN’), (‘of’, ‘IN’), (‘Coretta’, ‘NNP’), (‘Scott’, ‘NNP’), (‘King’, ‘NNP’), (‘.’, ‘.’)][(‘(‘, ‘(‘), (‘Applause’, ‘NNP’), (‘.’, ‘.’), (‘)’, ‘)’)][(‘President’, ‘NNP’), (‘George’, ‘NNP’), (‘W.’, ‘NNP’), (‘Bush’, ‘NNP’), (‘reacts’, ‘VBZ’), (‘to’, ‘TO’), (‘applause’, ‘VB’), (‘during’, ‘IN’), (‘his’, ‘PRP$’), (‘State’, ‘NNP’), (‘of’, ‘IN’), (‘the’, ‘DT’), (‘Union’, ‘NNP’), (‘Address’, ‘NNP’), (‘at’, ‘IN’), (‘the’, ‘DT’), (‘Capitol’, ‘NNP’), (‘,’, ‘,’), (‘Tuesday’, ‘NNP’), (‘,’, ‘,’), (‘Jan’, ‘NNP’), (‘.’, ‘.’)]

Natural Language Tool Kit – Tutorial 3

Stemming

Stemming is the process of reducing words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
Example:-

  • riding -> rid
  • ridden -> rid
  • ride -> rid
  • rides -> rid

The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de- and mis-.
Stemmers remove these morphological affixes from words, leaving only the word stem – which may result in words that are not actual words. 

There are multiple stemming algorithms, examples being Porter, Porter2 (Snowball), Paice-Husk (Lancaster), and Lovins; NLTK includes implementations of several of these.

In the tutorial Porter is used, first to demonstrate the basic function on a list of words.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

example_words = ["pyhton","pythoner","pyhtoning","pythoned","pythonly"]

for x in example_words:
    print (ps.stem(x))

Giving an output of:-

galiquis@raspberrypi: $ python3 ./nltk_tutorial3.py
pyhton
python
pyhton
python
pythonli

It can also be used to stem words in a sentence, using tokenize to pull the sentence into individual words.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once"
words = word_tokenize(new_text)
for x in words:
    print(ps.stem(x))

Giving an output of:-

galiquis@raspberrypi: $ python3 ./nltk_tutorial3.py
It
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc

Natural Language Tool Kit – Tutorial 2

Stop words

Part 2 focuses on stop words, those little structural words that humans rely on to make sense of a sentence but which just get in the way of algorithmic analysis. Words such as: a, is, the, it, etc.

First we import a list of predefined stop words from the corpus, which printing shows as:-

{‘whom’, ‘through’, ‘y’, “hadn’t”, ‘while’, ‘they’, ‘some’, ‘into’, ‘you’, ‘how’, ‘too’, ‘until’, ‘ourselves’, “should’ve”, ‘me’, ‘a’, ‘wouldn’, ‘or’, ‘yours’, ‘ve’, ‘themselves’, “you’ve”, ‘nor’, ‘so’, ‘not’, ‘haven’, ‘those’, ‘needn’, ‘didn’, ‘was’, ‘she’, ‘is’, ‘because’, ‘once’, ‘did’, ‘from’, ‘don’, ‘mustn’, ‘own’, ‘myself’, ‘doing’, ‘have’, “won’t”,
‘wasn’, ‘few’, ‘during’, ‘aren’, ‘out’, ‘having’, ‘both’, ‘who’, ‘all’, ‘d’, ‘which’, ‘for’, ‘if’, ‘her’, ‘any’, “don’t”, ‘won’, ‘between’, ‘your’, ‘ain’, ‘mightn’, “mustn’t”, “you’ll”, ‘hers’, ‘am’, ‘this’, ‘does’, ‘are’, ‘before’, ‘most’, ‘what’, ‘after’, “wouldn’t”, ‘we’, ‘re’, ‘isn’, ‘yourselves’, ‘down’, ‘it’, ‘our’, ‘he’, “shouldn’t”, ‘o’, ‘were’, ‘been’, ‘there’, “isn’t”, ‘but’, ‘yourself’, ‘other’, “couldn’t”, ‘again’, ‘herself’, “mightn’t”, ‘to’, ‘their’, ‘i’, ‘when’, ‘hasn’, “doesn’t”, “needn’t”, ‘same’, ‘m’, ‘its’, “haven’t”, “weren’t”, ‘an’, ‘had’, ‘weren’, ‘shan’, ‘against’, “aren’t”, ‘will’, “you’re”, ‘the’, ‘my’, ‘him’, ‘himself’, ‘s’, ‘ll’, ‘of’, ‘ours’, ‘in’, ‘itself’, ‘about’, ‘as’, ‘than’, ‘couldn’, “shan’t”, “hasn’t”, ‘theirs’, ‘just’, ‘where’, ‘be’, ‘with’, ‘why’, ‘below’, ‘now’, ‘off’, ‘up’, ‘each’, ‘only’, ‘here’, ‘further’, ‘shouldn’, “wasn’t”, ‘on’, “didn’t”, “you’d”, ‘do’, ‘no’, ‘more’, ‘over’, ‘can’, ‘that’, ‘being’, ‘such’, ‘by’, ‘at’, “that’ll”, ‘above’, ‘ma’, “it’s”, ‘should’, ‘these’, ‘has’, “she’s”, ‘very’, ‘t’, ‘under’, ‘them’, ‘doesn’, ‘then’, ‘his’, ‘and’, ‘hadn’}

Then using word_tokenize and a for loop we remove the stop words.

Example code:-

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sentence = "This is an example showing off stop word filtration"
stop_words = set(stopwords.words("english"))

words = word_tokenize(example_sentence)
filtered_sentence = []

for x in words:
    if x not in stop_words:
        filtered_sentence.append(x)

print(filtered_sentence)

The for loop can be combined into one line of code but it’s not as easy to follow:-

filtered_sentence = [w for w in words if not w in stop_words]
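
Either version should give an output along the lines of:-

['This', 'example', 'showing', 'stop', 'word', 'filtration']

Note that ‘This’ survives because the check is case-sensitive and the stop-word list is all lower case.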

A straightforward way of removing noise from the word lists.

Natural Language Tool Kit – Tutorial 1

Installation & Tokenizing

So to learn about sentiment analysis I’m initially going to be working through a series of tutorials by Sentdex on YouTube.

The main focus of this was installing the Natural Language Tool Kit, which unlike a lot of Python libraries also requires you to download extensions. (…if that’s the right term?)

In the vid it recommends running:-
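
The usual way to fetch NLTK’s extra data packages – which I assume is what the video recommends – is the downloader:

import nltk

nltk.download()            # opens the interactive downloader (GUI or text menu)
# or grab specific packages non-interactively, e.g.:
nltk.download('punkt')
nltk.download('stopwords')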

Numpy install woes

Recently I’ve started a project to look at tracking stocks combined with  company sentiment – utilising Python running on a little Raspberry Pi. I’ll link to the detailed project posts at a later point.

Anyway the first hurdle I’ve hit was getting NumPy to install correctly with the usual pip install giving the following error when NumPy is called…

ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you’re working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.

Original error was: libf77blas.so.3: cannot open shared object file: No such file or directory

—————————————-
ERROR: Command errored out with exit status 1: /usr/bin/python3 /usr/local/lib/python3.7/dist-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpumibpogm Check the logs for full command output.

I tried a couple of things to fix this…

  1. Repeating the install and specifying the upgrade flag.
    sudo pip3 install -U numpy
    Note: -U = --upgrade
  2. Uninstalling and re-installing NumPy and Setuptools.
    sudo pip3 uninstall -y numpy
    sudo pip3 uninstall -y setuptools
    sudo pip3 install -U setuptools
    sudo pip3 install -U numpy
    Note: -y = --yes, don’t ask for confirmation of uninstall deletions.
  3. The uninstall of NumPy didn’t work so I moved to apt-get:-
    sudo apt-get remove python-numpy
    Which did the trick…

But to no avail.
So after a little light browsing I came up with an answer – not all of the NumPy dependencies had been installed. I needed to run:-

sudo apt-get install python-dev libatlas-base-dev

Which then allowed NumPy to be reinstalled and worked a treat.
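
A quick sanity check that the install is now healthy (my own suggestion, not part of the original fix):

python3 -c "import numpy; print(numpy.__version__)"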