Natural Language Toolkit (NLTK) – Tutorial 5

Chunking

The next step on from knowing the parts of speech (with tags) is to group the tagged words into meaningful chunks. Each chunk is built around a subject (usually a noun, i.e. a naming word) together with the relevant verbs and adverbs associated with it.

Chunking relies on regular expressions, mainly these operators:-

  • + = match 1 or more repetitions
  • ? = match 0 or 1 repetitions
  • * = match 0 or more repetitions
  • . = any single character except a newline
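
As a quick illustration of these operators, here is a minimal sketch using Python's built-in re module on a handful of part-of-speech tag names. The list of tags and the variable names are chosen purely for illustration and are not part of the tutorial script.

```python
import re

# A few Penn Treebank tag names of the kind nltk.pos_tag produces
tags = ["VB", "VBD", "VBZ", "NN", "NNP", "RB", "RBR"]

# "VB.?" : "VB" followed by zero or one further character
verb_tags = [t for t in tags if re.fullmatch(r"VB.?", t)]

# "NN.*" : "NN" followed by zero or more further characters
noun_tags = [t for t in tags if re.fullmatch(r"NN.*", t)]

# "R.+" : "R" followed by one or more further characters
adverb_tags = [t for t in tags if re.fullmatch(r"R.+", t)]

print(verb_tags)    # ['VB', 'VBD', 'VBZ']
print(noun_tags)    # ['NN', 'NNP']
print(adverb_tags)  # ['RB', 'RBR']
```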

Regex cheat-sheet (full) – http://www.rexegg.com/regex-quickstart.html#ref

The main focus of Chunking is developing the right regular expression to pull together each proper noun (tag: NNP) along with the various types of verb and adverb.

The final expression was:-

r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

So breaking this down:-

  • r at the start tells Python to treat the text as a raw string, so backslashes are not interpreted as escapes
  • """ """ triple quotes allow the expression to be split over multiple lines for ease of reading
  • Chunk: is the label that the parser will apply to each identified chunk
  • {} contains the pattern that needs to be found (the chunking rule)
  • <> encloses each tag being searched for
  • <RB.?>* zero or more adverb tags: RB optionally followed by one other character (.?), i.e. RB, RBR or RBS
  • <VB.?>* zero or more verb tags: VB optionally followed by one other character (.?), e.g. VBD or VBZ
  • <NNP>+ one or more iterations of a proper noun NNP
  • <NN>? zero or one iteration of singular noun NN
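
To see the pieces of the expression at work in isolation, the sketch below runs the same grammar over a short, hand-tagged list of tokens, so no corpus or tagger data is needed. The token list is an assumption made for illustration (its tags mirror the "called America" sentence in the output further down), and the names chunk_parser, tree and chunks are not part of the tutorial script.

```python
import nltk

# Hand-tagged tokens (Penn Treebank tags); illustrative input only
tagged = [
    ("who", "WP"), ("called", "VBD"), ("America", "NNP"),
    ("to", "TO"), ("its", "PRP$"), ("founding", "NN"), ("ideals", "NNS"),
]

chunk_parser = nltk.RegexpParser(r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""")
tree = chunk_parser.parse(tagged)

# Collect the (word, tag) pairs of every subtree labelled "Chunk"
chunks = [subtree.leaves()
          for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")]
print(chunks)  # [[('called', 'VBD'), ('America', 'NNP')]]
```

Here <VB.?>* picks up called/VBD and <NNP>+ picks up America/NNP. The optional parts <RB.?>* and <NN>? match nothing, because no adverb precedes the verb and to/TO follows the proper noun.
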
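
The full script:-

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Requires the state_union corpus plus the tokenizer and tagger data
# (fetched beforehand with nltk.download()).
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# Train the sentence tokenizer on the 2005 speech, then split the 2006 speech
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

# Build the chunk parser once rather than on every sentence
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)

def process_content():
    try:
        # Only the first five sentences, to keep the output short
        for x in tokenized[:5]:
            words = nltk.word_tokenize(x)
            tagged = nltk.pos_tag(words)
            chunked = chunkParser.parse(tagged)
            print(chunked)

    except Exception as e:
        print(str(e))

process_content()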

Output:-

galiquis@raspberrypi: $ python3 ./nltk_tutorial5.py
(S
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
'S/POS
(Chunk ADDRESS/NNP)
BEFORE/IN
(Chunk A/NNP JOINT/NNP SESSION/NNP)
OF/IN
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
OF/IN
(Chunk THE/NNP UNION/NNP January/NNP)
31/CD
,/,
2006/CD
(Chunk THE/NNP PRESIDENT/NNP)
:/:
(Chunk Thank/NNP)
you/PRP
all/DT
./.)
(S
(Chunk Mr./NNP Speaker/NNP)
,/,
(Chunk Vice/NNP President/NNP Cheney/NNP)
,/,
members/NNS
of/IN
(Chunk Congress/NNP)
,/,
members/NNS
of/IN
the/DT
(Chunk Supreme/NNP Court/NNP)
and/CC
diplomatic/JJ
corps/NN
,/,
distinguished/JJ
guests/NNS
,/,
and/CC
fellow/JJ
citizens/NNS
:/:
Today/VB
our/PRP$
nation/NN
lost/VBD
a/DT
beloved/VBN
,/,
graceful/JJ
,/,
courageous/JJ
woman/NN
who/WP
(Chunk called/VBD America/NNP)
to/TO
its/PRP$
founding/NN
ideals/NNS
and/CC
carried/VBD
on/IN
a/DT
noble/JJ
dream/NN
./.)
(S
Tonight/NN
we/PRP
are/VBP
comforted/VBN
by/IN
the/DT
hope/NN
of/IN
a/DT
glad/JJ
reunion/NN
with/IN
the/DT
husband/NN
who/WP
was/VBD
taken/VBN
so/RB
long/RB
ago/RB
,/,
and/CC
we/PRP
are/VBP
grateful/JJ
for/IN
the/DT
good/JJ
life/NN
of/IN
(Chunk Coretta/NNP Scott/NNP King/NNP)
./.)
(S (/( (Chunk Applause/NNP) ./. )/))
(S
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
reacts/VBZ
to/TO
applause/VB
during/IN
his/PRP$
(Chunk State/NNP)
of/IN
the/DT
(Chunk Union/NNP Address/NNP)
at/IN
the/DT
(Chunk Capitol/NNP)
,/,
(Chunk Tuesday/NNP)
,/,
(Chunk Jan/NNP)
./.)
