jmhobbs

Looking up words in a Dictionary using Python

First off, I do not mean dictionary in the Python sense of the word. I mean dictionary in the glossary sense, like Merriam-Webster. This collision of terminology makes Googling for this functionality particularly difficult and frustrating.

I came across three useful Python solutions, and I'm going to detail usage of two of them in this post.

Option 1: NLTK + WordNet

First up is accessing WordNet.

"WordNet is a large lexical database of English..."
The only way of accessing it from Python (that I came across) is NLTK, a set of
"Open source Python modules, linguistic data and documentation for research and development in natural language processing..."

Getting NLTK Installed

For various reasons, NLTK is not packaged by Debian, so I had to install it by hand. Even if your distro does package NLTK, you might want to read this bit anyway. Installing was a cinch with easy_install nltk. However, this does not install the corpora (where WordNet is stored), as shown below:

>>> from nltk.corpus import wordnet
>>> wordnet.synsets( 'cake' )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 68, in __getattr__
    self.__load()
  File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 56, in __load
    except LookupError: raise e
LookupError:
**********************************************************************
  Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource: >>> nltk.download().
  Searched in:
    - '/home/jmhobbs/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

So what we need to do is run the NLTK downloader, as shown here:

>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> wordnet
    Downloading package 'wordnet' to /home/jmhobbs/nltk_data...
      Unzipping corpora/wordnet.zip.

---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> q
True
>>> 
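
If you would rather skip the interactive prompt (say, in a setup script), here is a minimal sketch that downloads the corpus only when NLTK cannot find it. It relies on nltk.data.find() raising LookupError for a missing resource and on nltk.download() accepting a package identifier. The function name ensure_wordnet and its nltk_module parameter are my own, added only so the logic can be exercised without NLTK installed.

```python
def ensure_wordnet(nltk_module=None):
    """Download the WordNet corpus only if NLTK cannot already find it.

    nltk_module is a hypothetical hook for passing in a stand-in object;
    by default the real nltk package is imported.
    Returns True if a download was attempted, False otherwise.
    """
    if nltk_module is None:
        import nltk as nltk_module  # requires NLTK to be installed

    try:
        # Raises LookupError when the resource is not on NLTK's search path
        nltk_module.data.find('corpora/wordnet')
        return False
    except LookupError:
        # Same identifier as typed at the "Identifier>" prompt above
        nltk_module.download('wordnet')
        return True
```

Calling ensure_wordnet() once at startup should be enough; after that, from nltk.corpus import wordnet works as usual.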

Using NLTK + WordNet

Now that we have everything installed, using WordNet from Python is straightforward.

# Load the wordnet corpus
from nltk.corpus import wordnet

# Get a collection of synsets (synonym sets) for a word
synsets = wordnet.synsets( 'cake' )

# Print the information
for synset in synsets:
  print "-" * 10
  print "Name:", synset.name
  print "Lexical Type:", synset.lexname
  print "Lemmas:", synset.lemma_names
  print "Definition:", synset.definition
  for example in synset.examples:
    print "Example:", example

The output of that is:

----------
Name: cake.n.01
Lexical Type: noun.artifact
Lemmas: ['cake', 'bar']
Definition: a block of solid substance (such as soap or wax)
Example: a bar of chocolate
----------
Name: patty.n.01
Lexical Type: noun.food
Lemmas: ['patty', 'cake']
Definition: small flat mass of chopped food
----------
Name: cake.n.03
Lexical Type: noun.food
Lemmas: ['cake']
Definition: baked goods made from or based on a mixture of flour, sugar, eggs, and fat
----------
Name: coat.v.03
Lexical Type: verb.contact
Lemmas: ['coat', 'cake']
Definition: form a coat over
Example: Dirt had coated her face

Perfect!
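
One note if you are on a newer setup: the listing above was written against NLTK 2.0b8 on Python 2.5 (see the traceback earlier). In NLTK 3 and later, name, lexname, lemma_names, definition and examples are methods rather than attributes, and Python 3 needs print() as a function. Here is a rough equivalent under those assumptions; the helper names describe_synset and print_synsets are mine, not part of NLTK.

```python
def describe_synset(synset):
    """Collect the fields printed above into a plain dict.

    Assumes the NLTK 3+ API, where these are methods, not attributes.
    """
    return {
        'name': synset.name(),
        'lexical_type': synset.lexname(),
        'lemmas': list(synset.lemma_names()),
        'definition': synset.definition(),
        'examples': list(synset.examples()),
    }


def print_synsets(word):
    """Print every synset for a word in the same layout as the listing above."""
    # Imported here so describe_synset can be used without NLTK installed
    from nltk.corpus import wordnet

    for synset in wordnet.synsets(word):
        info = describe_synset(synset)
        print("-" * 10)
        print("Name: %s" % info['name'])
        print("Lexical Type: %s" % info['lexical_type'])
        print("Lemmas: %s" % info['lemmas'])
        print("Definition: %s" % info['definition'])
        for example in info['examples']:
            print("Example: %s" % example)
```

With the corpus downloaded, print_synsets('cake') should print the same fields as before.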

Caveats

There are some caveats to using WordNet with NLTK. First is that the definitions aren't always ordered in the way you would expect. For instance, look at the "cake" results above. Cake, as in the confection, is the third definition, which feels wrong. You can of course order and filter on the synset name to correct this to some degree.
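
To make that concrete, here is a small sketch that works on synset names such as 'cake.n.03'. It moves the names headed by the word you searched for to the front and can filter on the part-of-speech letter. The function name prefer_word is hypothetical, and it assumes names always end in the '.pos.number' pattern seen in the output above.

```python
def prefer_word(synset_names, word, pos=None):
    """Reorder names like 'cake.n.03' so those headed by `word` come first.

    pos is an optional part-of-speech letter ('n', 'v', ...) used as a filter.
    """
    parsed = []
    for name in synset_names:
        # Split from the right, in case the lemma itself contains dots
        lemma, tag, number = name.rsplit('.', 2)
        if pos is None or tag == pos:
            # False sorts before True, so exact matches end up first
            parsed.append((lemma != word, name))
    # sorted() is stable, so WordNet's own order is kept within each group
    return [name for _, name in sorted(parsed, key=lambda pair: pair[0])]


names = ['cake.n.01', 'patty.n.01', 'cake.n.03', 'coat.v.03']
print(prefer_word(names, 'cake'))
# ['cake.n.01', 'cake.n.03', 'patty.n.01', 'coat.v.03']
print(prefer_word(names, 'cake', pos='n'))
# ['cake.n.01', 'cake.n.03', 'patty.n.01']
```

This still leaves the soap-and-wax sense ahead of the confection, so for that case you would probably also want to filter on the lexical type (for example noun.food).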

Second, there is a major load time for getting WordNet ready to use. Your first call to wordnet.synsets will take considerably longer than the next ones. On my machine the difference was 3.5 seconds versus 0.0003 seconds.
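
If you want to measure this on your own machine, or pay the load cost up front, something like the following sketch should do. The helpers timed and warm_up are my own names; the only assumption is that the lookup you pass in (for example wordnet.synsets) takes a single word.

```python
from timeit import default_timer


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = default_timer()
    result = func(*args)
    return result, default_timer() - start


def warm_up(lookup, word='cake'):
    """Make one throwaway call so later lookups skip the load cost.

    Returns how long that first call took, in seconds.
    """
    _, elapsed = timed(lookup, word)
    return elapsed
```

For example, calling warm_up(wordnet.synsets) at startup moves the slow first call out of the way, and timed(wordnet.synsets, 'cake') afterwards shows the cost of a warm lookup.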

Last, you are constrained to the English language, as analyzed by Princeton. I'll address this issue in the next section.

Option 2: SDict Viewer

As I said above, using WordNet is simple, but restrictive. What if I want to use a foreign language dictionary or something? WordNet is only in English. This is where the SDict format comes in. It has lots of free resource files available at http://sdict.com/en/. The best existing parser I found was SDict Viewer, which is a dead project but remarkably complete.

SDict Viewer is an application

SDict Viewer is an application, so it's not an easy-to-install library. However, it is very well written and extracting what you need is simple. You can get my "library" version from http://github.com/jmhobbs/sdictviewer-lib.

Here is an example of using it once everything is in place:

import sys

import sdictviewer.formats.dct.sdict as sdict
import sdictviewer.dictutil

dictionary = sdict.SDictionary( 'webster_1913.dct' )
dictionary.load()

start_word = sys.argv[1]

found = False

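# Walk the candidate entries returned for the requested word
for item in dictionary.get_word_list_iter( start_word ):
  try:
    if start_word == str( item ):
      # Take the first (instance, definition) pair for this entry
      instance, definition = item.read_articles()[0]
      print "%s: %s" % ( item, definition )
      found = True
      break
  except Exception:
    # Skip entries that fail to convert or read
    continue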

if not found:
  print "No definition for '%s'." % start_word

dictionary.close()

Here is a sample run:

jmhobbs@katya:~$ python okay.py Cat
Cat: (n.) An animal of various species of the genera Felis and Lynx. The domestic cat is Felis domestica. The European wild cat (Felis catus) is much larger than the domestic cat. In the United States the name wild cat is commonly applied to the bay lynx (Lynx rufus) See Wild cat, and Tiger cat.
wrote /home/jmhobbs/.sdictviewer/index_cache/webster_1913.dct-1.0.index

As you can see, it gives a nice definition (thank you, Webster 1913) followed by a little junk on the end. That last line reports that the index cache, a lookup table for finding words faster, was written to disk. You can avoid saving it by calling dictionary.close(False) instead.
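
If you need the definition as a value rather than printed output, the lookup loop above can be wrapped up roughly like this. It is only a sketch: first_definition and define_and_close are names I made up, and the dictionary argument is assumed to be an already-loaded object that behaves like the SDictionary used above (get_word_list_iter, read_articles and close are the only calls it relies on).

```python
def first_definition(dictionary, word):
    """Return the first definition of `word`, or None if there is no exact match."""
    for item in dictionary.get_word_list_iter(word):
        if str(item) == word:
            # Each article is an (instance, definition) pair; keep the text
            instance, definition = item.read_articles()[0]
            return definition
    return None


def define_and_close(dictionary, word):
    """Look up one word, then close the dictionary without saving the index cache."""
    try:
        return first_definition(dictionary, word)
    finally:
        # close(False) skips writing the index cache, as described above
        dictionary.close(False)
```

Unlike the script above, this version does not swallow errors from individual entries, so a broken entry will surface as an exception. For repeated lookups you would call first_definition several times and close the dictionary yourself at the end.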

Option 3: Aard Format

In option 2 I said that SDict Viewer was a dead project. This is because development has moved to the Aard Dictionary project. I chose not to pursue this format, as most of the existing resources are stored in HTML formats and I needed plain text. This might be ideal for you though, as they also provide access to Wikipedia archives.

All Done

So there you have it: two viable ways of extracting a plain-text definition for a word in Python. Best of luck to you!