Natural Language Tool Kit – Tutorial 1

Installation & Tokenizing

So to learn about sentiment analysis I’m initially going to be working through a series of tutorials by Sentdex on YouTube.

The main focus of this one was installing the Natural Language Tool Kit, which, unlike a lot of Python libraries, also requires you to download extensions. (…if that’s the right term?)

In the vid it recommends running:-

import nltk
nltk.download()

Once this code is run it opens a downloader dialogue where you can search for the packages you want – ‘all’ being the recommendation.

So for a slight shortcut I just ran:- 

import nltk
nltk.download('all')

Simple enough…

The next section of the tutorial covered some terminology that’s worth summarising:-

  • tokenizing – the process of breaking a large quantity of text into smaller parts called tokens. Two main forms in this tutorial…
    • word tokenizers
    • sentence tokenizers
  • corpus (plural corpora) – a body of text – eg medical journals, speeches
  • lexicon – a dictionary of words and their meanings

To illustrate this, the final bit of the tutorial demo’d how sentence and word tokenization work in nltk.

from nltk.tokenize import word_tokenize, sent_tokenize

example_text = "Natural Language Processing is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language."

print(sent_tokenize(example_text))
for x in word_tokenize(example_text):
    print(x)

Which when run gives the output:-

['Natural Language Processing is the task we give computers to read and understand (process) written text (natural language).', 'By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language.']
Natural
Language
Processing
is
the
task
we
give
computers
to
read
and
understand
(
process
)
….etc etc…
