Stemming
Stemming is the process of reducing words to their root forms, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
Example:
- riding -> rid
- ridden -> rid
- ride -> rid
- rides -> rid
The stem (root) is the part of the word to which affixes are added. These can be inflectional suffixes (such as -ed, -s) or derivational affixes (such as -ize, de-, mis-).
Stemmers remove these morphological affixes from words, leaving only the word stem, which may not be an actual word.
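To make the idea of "a group of words mapping to the same stem" concrete, here is a minimal sketch that groups words by their stem. It assumes NLTK's PorterStemmer; the helper name group_by_stem and the word list are made up for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words):
    """Map each stem to the list of input words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


# Illustrative word family; all five forms should share one stem.
words = ["connect", "connected", "connecting", "connection", "connections"]
for stem, members in group_by_stem(words).items():
    print(stem, "<-", members)
```

Grouping like this is the usual reason to stem at all: the different surface forms can then be treated as one term when searching or counting.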
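There are multiple stemming algorithms, examples being Porter, Porter2 (Snowball), Paice-Husk (Lancaster), and Lovins. NLTK ships several of them, including PorterStemmer, SnowballStemmer, and LancasterStemmer; it does not include a Lovins stemmer.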
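The stemmers differ in how aggressively they strip affixes, so it can help to run them side by side. The sketch below assumes the three NLTK classes PorterStemmer, SnowballStemmer("english") (Porter2) and LancasterStemmer (Paice-Husk); the helper compare_stemmers and the sample words are hypothetical and only for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each algorithm, keyed by a short label.
STEMMERS = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare_stemmers(words):
    """Return {word: {stemmer_label: stem}} for every word and stemmer."""
    return {
        word: {label: stemmer.stem(word) for label, stemmer in STEMMERS.items()}
        for word in words
    }


for word, stems in compare_stemmers(["running", "generously", "maximum"]).items():
    print(word, stems)
```

The exact stems depend on the algorithm and on the installed NLTK version, so no output is listed here; running it locally shows where the three disagree.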
The tutorial uses Porter, first demonstrating the basic function on a list of words.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
# Stem each word in turn and print the result
for x in example_words:
    print(ps.stem(x))
This gives the output:
galiquis@raspberrypi: $ python3 ./nltk_tutorial3.py
python
python
python
python
pythonli
It can also be used to stem the words in a sentence, using word_tokenize to split the sentence into individual tokens first.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once"
# Split the sentence into tokens, then stem each token
words = word_tokenize(new_text)
for x in words:
    print(ps.stem(x))
This gives the output:
galiquis@raspberrypi: $ python3 ./nltk_tutorial3.py
It
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
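Because several word forms collapse to one stem, the stemmed tokens can be counted together. The following is a minimal sketch of that idea, not part of the tutorial: it assumes a simple regular expression can stand in for word_tokenize (so no NLTK tokenizer data is required), and the names TOKEN_PATTERN and stem_tokens are hypothetical.

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer

# Assumption: a rough stand-in for word_tokenize that yields runs of word
# characters and single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def stem_tokens(text):
    """Split text into rough tokens and return the Porter stem of each."""
    ps = PorterStemmer()
    return [ps.stem(token) for token in TOKEN_PATTERN.findall(text)]


text = "All pythoners have pythoned poorly at least once"
counts = Counter(stem_tokens(text))
# "pythoners" and "pythoned" are counted under the same stem.
print(counts.most_common(3))
```

For real text, word_tokenize handles contractions and punctuation more carefully than this pattern does; the regular expression is only there to keep the sketch self-contained.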