Chunking
Once each word has been tagged with its part of speech, the next step is to group the tagged words into meaningful chunks. Each chunk is built around a subject (usually a noun, i.e. a naming word) together with the verbs and adverbs associated with it.
Chunking relies on regular expressions, mainly:-
- + = match 1 or more repetitions
- ? = match 0 or 1 repetition
- * = match 0 or more repetitions
- . = any character except a new line
Regex cheat-sheet (full) – http://www.rexegg.com/regex-quickstart.html#ref
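As a quick illustration of the four symbols listed above, here is a minimal sketch using Python's built-in re module. The sample strings and patterns are made up for this example and are not part of the tutorial script.

```python
import re

# Made-up sample strings, used only to show how each symbol behaves
samples = ["ct", "cat", "caat", "cot"]

patterns = {
    "ca+t": "one or more 'a'",
    "ca?t": "zero or one 'a'",
    "ca*t": "zero or more 'a'",
    "c.t": "any single character between 'c' and 't'",
}

results = {}
for pattern, meaning in patterns.items():
    # fullmatch only succeeds when the whole string fits the pattern
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(f"{pattern:5} ({meaning}): {results[pattern]}")
```

Only the patterns using ? and * accept "ct", because they are the only ones that allow the letter to be missing altogether.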
The main focus of chunking is developing the right regular expression to pull together each proper noun (tag: NNP) along with any adverbs and verbs that come directly before it.
The final expression was:-
r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
So breaking this down:-
- r at the start tells Python to treat the text as a raw string, so backslashes are not interpreted as escapes
- """ """ triple quotes allow the expression to be split over multiple lines for ease of reading
- Chunk: is the label that the parser will apply to each identified chunk
- {} contains the detail of what needs to be found (the chunk pattern)
- <> encloses each part-of-speech tag being searched for
- <RB.?>* zero or more adverbs: RB optionally followed by one other character (.?), so RB, RBR and RBS all match
- <VB.?>* zero or more verbs: VB optionally followed by one other character (.?), so VB, VBD, VBG, VBN, VBP and VBZ all match
- <NNP>+ one or more iterations of a proper noun NNP
- <NN>? zero or one iteration of singular noun NN
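To see every part of the expression at work, the sketch below runs the same grammar over a single hand-tagged sentence. The sentence and its tags are invented for illustration; supplying the tags by hand is an assumption made here so that no corpus or tagger model is needed.

```python
import nltk

chunk_grammar = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunk_parser = nltk.RegexpParser(chunk_grammar)

# Hypothetical sentence, tagged by hand rather than with nltk.pos_tag
tagged = [
    ("She", "PRP"),
    ("quickly", "RB"),     # matched by <RB.?>*
    ("called", "VBD"),     # matched by <VB.?>*
    ("President", "NNP"),  # matched by <NNP>+
    ("Bush", "NNP"),       # matched by <NNP>+
    ("today", "NN"),       # matched by <NN>?
    (".", "."),
]

tree = chunk_parser.parse(tagged)
print(tree)

# The second child of the sentence tree is the Chunk subtree
chunk = tree[1]
print(chunk.label(), chunk.leaves())
```

The pronoun and the full stop stay outside the chunk because neither tag fits any part of the pattern.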
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for x in tokenized[:5]:
            words = nltk.word_tokenize(x)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
    except Exception as e:
        print(str(e))

process_content()
Output:-
galiquis@raspberrypi: $ python3 ./nltk_tutorial5.py
(S
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
'S/POS
(Chunk ADDRESS/NNP)
BEFORE/IN
(Chunk A/NNP JOINT/NNP SESSION/NNP)
OF/IN
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
OF/IN
(Chunk THE/NNP UNION/NNP January/NNP)
31/CD
,/,
2006/CD
(Chunk THE/NNP PRESIDENT/NNP)
:/:
(Chunk Thank/NNP)
you/PRP
all/DT
./.)
(S
(Chunk Mr./NNP Speaker/NNP)
,/,
(Chunk Vice/NNP President/NNP Cheney/NNP)
,/,
members/NNS
of/IN
(Chunk Congress/NNP)
,/,
members/NNS
of/IN
the/DT
(Chunk Supreme/NNP Court/NNP)
and/CC
diplomatic/JJ
corps/NN
,/,
distinguished/JJ
guests/NNS
,/,
and/CC
fellow/JJ
citizens/NNS
:/:
Today/VB
our/PRP$
nation/NN
lost/VBD
a/DT
beloved/VBN
,/,
graceful/JJ
,/,
courageous/JJ
woman/NN
who/WP
(Chunk called/VBD America/NNP)
to/TO
its/PRP$
founding/NN
ideals/NNS
and/CC
carried/VBD
on/IN
a/DT
noble/JJ
dream/NN
./.)
(S
Tonight/NN
we/PRP
are/VBP
comforted/VBN
by/IN
the/DT
hope/NN
of/IN
a/DT
glad/JJ
reunion/NN
with/IN
the/DT
husband/NN
who/WP
was/VBD
taken/VBN
so/RB
long/RB
ago/RB
,/,
and/CC
we/PRP
are/VBP
grateful/JJ
for/IN
the/DT
good/JJ
life/NN
of/IN
(Chunk Coretta/NNP Scott/NNP King/NNP)
./.)
(S (/( (Chunk Applause/NNP) ./. )/))
(S
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
reacts/VBZ
to/TO
applause/VB
during/IN
his/PRP$
(Chunk State/NNP)
of/IN
the/DT
(Chunk Union/NNP Address/NNP)
at/IN
the/DT
(Chunk Capitol/NNP)
,/,
(Chunk Tuesday/NNP)
,/,
(Chunk Jan/NNP)
./.)
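In the output above the chunks are mixed in with all the unchunked tokens. One possible way to pull out just the chunked phrases is to walk the tree and keep only the subtrees labelled Chunk. This is a sketch only: the helper name chunk_phrases is hypothetical, and the tokens are copied by hand from the second sentence of the output rather than produced by the tagger.

```python
import nltk

def chunk_phrases(tree, label="Chunk"):
    """Return the words of each subtree carrying the given label, joined into phrases."""
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]

chunk_parser = nltk.RegexpParser(r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""")

# Hand-copied from the start of the second sentence in the output
tagged = [
    ("Mr.", "NNP"), ("Speaker", "NNP"), (",", ","),
    ("Vice", "NNP"), ("President", "NNP"), ("Cheney", "NNP"), (",", ","),
    ("members", "NNS"), ("of", "IN"), ("Congress", "NNP"),
]

phrases = chunk_phrases(chunk_parser.parse(tagged))
print(phrases)
```

In the tutorial script the same helper could be called on chunked inside the loop in place of print(chunked).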