Natural Language Toolkit – Tutorial 3

Stemming

Stemming is the process of reducing words to a common root form (the stem). A group of related words maps to the same stem, even if the stem itself is not a valid word in the language.
Example (illustrative; the exact stems depend on the stemmer used):

  • riding -> rid
  • ridden -> rid
  • ride -> rid
  • rides -> rid

The stem (root) is the part of the word to which affixes are added: inflectional suffixes such as -ed and -s, or derivational affixes such as -ize, de- and mis-.
Stemmers remove such affixes (the Porter stemmer used below strips suffixes only), leaving just the word stem, which may not be an actual word.
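To make the idea of affix stripping concrete, here is a minimal sketch of a naive suffix stripper. It is not the Porter algorithm: the suffix list, the minimum stem length and the function name naive_stem are all assumptions chosen purely for illustration.

```python
# Toy suffix stripper: illustrative only, NOT the Porter algorithm.
# SUFFIXES and MIN_STEM_LEN are assumptions made for this sketch.
SUFFIXES = ("ing", "ed", "es", "s", "e")
MIN_STEM_LEN = 3


def naive_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


for w in ["riding", "rides", "ride", "walked", "walking", "walks"]:
    print(w, "->", naive_stem(w))
```

With these rules riding, rides and ride all collapse to rid, which is not an English word. Real stemmers apply far more careful rules, but the principle of mapping related forms to one stem is the same.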

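NLTK ships with several stemmers, including Porter (PorterStemmer), Porter2 (available through SnowballStemmer) and Paice-Husk (LancasterStemmer). Lovins is another well-known stemming algorithm, although it does not appear to be included in NLTK.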
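As a rough sketch of how stemmers can differ, the snippet below runs the same words through three stemmer classes from nltk.stem: PorterStemmer, SnowballStemmer (its English stemmer implements Porter2) and LancasterStemmer (Paice-Husk). The word list and the helper name compare are assumptions made for this illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three stemmers that ship with NLTK; none of them needs extra data downloads.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare(word):
    """Return the stem each stemmer produces for a single word."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


# Arbitrary sample words, chosen only for this sketch.
for word in ["running", "generously", "maximum"]:
    print(word, compare(word))
```

The stemmers will often agree on simple inflections and disagree on longer derived words, so it is worth trying a sample of your own text before picking one.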

This tutorial uses the Porter stemmer, first to demonstrate the basic function on a list of words.

from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

for x in example_words:
    print(ps.stem(x))

This gives the following output:

galiquis@raspberrypi: $ python3 ./nltk_tutorial3.py
python
python
python
python
pythonli

It can also be used to stem the words in a sentence, using word_tokenize to split the sentence into individual tokens.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once"
words = word_tokenize(new_text)
for x in words:
    print(ps.stem(x))

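This gives the following output. The stemmer lowercases tokens by default (All becomes all); how very short words such as It are cased can depend on the NLTK version: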

galiquis@raspberrypi: $ python3 ./nltk_tutorial3.py
It
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
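The output above shows several different words collapsing to the stem python. One way to make that grouping explicit is sketched below. It uses a simple regular expression instead of word_tokenize so that it runs without downloading tokenizer data; that swap, and the helper name group_by_stem, are assumptions made for this sketch rather than part of the tutorial.

```python
import re
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(text):
    """Map each Porter stem to the sorted list of lowercased words that produced it."""
    ps = PorterStemmer()
    groups = defaultdict(set)
    # Crude tokenizer (letters only), standing in for word_tokenize.
    for token in re.findall(r"[A-Za-z]+", text):
        groups[ps.stem(token)].add(token.lower())
    return {stem: sorted(words) for stem, words in groups.items()}


new_text = ("It is very important to be pythonly while you are pythoning "
            "with python. All pythoners have pythoned poorly at least once")
groups = group_by_stem(new_text)

# Show only the stems shared by more than one distinct word.
for stem, words in groups.items():
    if len(words) > 1:
        print(stem, "<-", words)
```

Note that pythonly is not part of the python group: Porter turns it into pythonli, which is a reminder that stemming is a heuristic and does not always merge every related form.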
