Looking up words in a Dictionary using Python

March 1, 2010

First off, I do not mean dictionary in the Python sense of the word. I mean dictionary in the glossary sense, like Merriam-Webster. This collision of terminology makes Googling for this functionality particularly difficult and frustrating.

I came across three useful Python solutions, and I’m going to detail usage of two of them in this post.

Option 1: NLTK + Wordnet

First up is accessing Wordnet.

“Wordnet is a large lexical database of English…”

The only Python way of accessing this (that I came across) is NLTK, a set of

“Open source Python modules, linguistic data and documentation for research and development in natural language processing…”

Getting NLTK Installed

For various reasons, NLTK is not packaged by Debian, so I had to install it by hand. Even if your distro does package NLTK, you might want to read this bit anyway. Installing was a cinch with easy_install nltk. However, this does not install the corpus (where WordNet is stored), as the following session shows:

>>> from nltk.corpus import wordnet
>>> wordnet.synsets( 'cake' )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 68, in __getattr__
    self.__load()
  File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 56, in __load
    except LookupError: raise e
LookupError:
**********************************************************************
  Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource: >>> nltk.download().
  Searched in:
    - '/home/jmhobbs/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

So what we need to do is run the NLTK downloader, as shown here:

>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> wordnet
    Downloading package 'wordnet' to /home/jmhobbs/nltk_data...
      Unzipping corpora/wordnet.zip.

---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> q
True
>>> 

Using NLTK + Wordnet

Now that we have everything installed, using WordNet from Python is straightforward.

# Load the wordnet corpus
from nltk.corpus import wordnet

# Get a collection of synsets (synonym sets) for a word
synsets = wordnet.synsets( 'cake' )

# Print the information
for synset in synsets:
  print "-" * 10
  print "Name:", synset.name
  print "Lexical Type:", synset.lexname
  print "Lemmas:", synset.lemma_names
  print "Definition:", synset.definition
  for example in synset.examples:
    print "Example:", example

The output of that is:

----------
Name: cake.n.01
Lexical Type: noun.artifact
Lemmas: ['cake', 'bar']
Definition: a block of solid substance (such as soap or wax)
Example: a bar of chocolate
----------
Name: patty.n.01
Lexical Type: noun.food
Lemmas: ['patty', 'cake']
Definition: small flat mass of chopped food
----------
Name: cake.n.03
Lexical Type: noun.food
Lemmas: ['cake']
Definition: baked goods made from or based on a mixture of flour, sugar, eggs, and fat
----------
Name: coat.v.03
Lexical Type: verb.contact
Lemmas: ['coat', 'cake']
Definition: form a coat over
Example: Dirt had coated her face

Perfect!
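
One thing to watch for: in the NLTK 2.0 beta I used, name, lexname, lemma_names, definition and examples are plain attributes. Later NLTK releases expose them as methods instead, in which case the loop above prints bound methods or fails with a TypeError like "'method' object is not iterable". Here is a minimal sketch that copes with either form. The helper names value and describe are my own and not part of NLTK, and FakeSynset is a hypothetical stand-in so the snippet runs without the corpus. It is written for Python 3.

```python
def value(member):
    """Return member() if it is callable, otherwise member itself."""
    return member() if callable(member) else member


def describe(synset):
    """Collect the printable fields of a synset into a plain dict."""
    return {
        "name": value(synset.name),
        "lexname": value(synset.lexname),
        "lemmas": list(value(synset.lemma_names)),
        "definition": value(synset.definition),
        "examples": list(value(synset.examples)),
    }


class FakeSynset(object):
    """Hypothetical stand-in that mimics the method-style API."""

    def name(self):
        return "cake.n.03"

    def lexname(self):
        return "noun.food"

    def lemma_names(self):
        return ["cake"]

    def definition(self):
        return "baked goods made from or based on a mixture of flour, sugar, eggs, and fat"

    def examples(self):
        return []


info = describe(FakeSynset())
print("Name:", info["name"])
print("Lexical Type:", info["lexname"])
print("Lemmas:", info["lemmas"])
print("Definition:", info["definition"])
for example in info["examples"]:
    print("Example:", example)
```

With NLTK installed, you would pass each item from wordnet.synsets( 'cake' ) to describe instead of the stand-in.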

Caveats

There are some caveats to using WordNet with NLTK. First is that the definitions aren’t always ordered in the way you would expect. For instance, look at the “cake” results above. Cake, as in the confection, is the third definition, which feels wrong. You can of course order and filter on the synset name to correct this to some degree.
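
As a rough sketch of that ordering and filtering: synset names like cake.n.03 are made of a head word, a part of speech and a sense number, so you can split them apart and sort on the pieces. The functions parse_name and rank_names are my own illustration, not part of NLTK, and they work on plain name strings so the snippet runs without the corpus (Python 3).

```python
def parse_name(synset_name):
    """Split 'cake.n.03' into ('cake', 'n', 3)."""
    lemma, pos, number = synset_name.rsplit(".", 2)
    return lemma, pos, int(number)


def rank_names(synset_names, word, pos=None):
    """Put synsets headed by `word` first, optionally keeping one part of speech."""
    parsed = [(name,) + parse_name(name) for name in synset_names]
    if pos is not None:
        parsed = [entry for entry in parsed if entry[2] == pos]
    # list.sort() is stable, so ties keep WordNet's original order.
    parsed.sort(key=lambda entry: entry[1] != word)
    return [entry[0] for entry in parsed]


names = ["cake.n.01", "patty.n.01", "cake.n.03", "coat.v.03"]
print(rank_names(names, "cake"))
# ['cake.n.01', 'cake.n.03', 'patty.n.01', 'coat.v.03']
print(rank_names(names, "cake", pos="n"))
# ['cake.n.01', 'cake.n.03', 'patty.n.01']
```

This only pushes the synsets named after your word to the front. It does not know that the confection is the sense you probably wanted, so it corrects the ordering only to some degree.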

Second, there is a major load time for getting WordNet ready to use. Your first call to wordnet.synsets will take considerably longer than the next ones. On my machine the difference was 3.5 seconds versus 0.0003 seconds.
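
If you want to measure the load time on your own machine, a small timing helper along these lines should do. The name time_calls is my own, and SlowFirstLookup is a hypothetical stand-in that fakes a slow first call so the snippet runs without NLTK (Python 3).

```python
import time


def time_calls(func, arg, repeats=3):
    """Call func(arg) several times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - start)
    return durations


class SlowFirstLookup(object):
    """Hypothetical stand-in: slow on first use, fast afterwards."""

    def __init__(self):
        self.loaded = False

    def __call__(self, word):
        if not self.loaded:
            time.sleep(0.05)  # pretend to load the corpus
            self.loaded = True
        return [word]


for number, seconds in enumerate(time_calls(SlowFirstLookup(), "cake"), 1):
    print("call %d: %.4f seconds" % (number, seconds))
```

With NLTK and the corpus installed, passing wordnet.synsets in place of the stand-in should show the same pattern: one slow call, then fast ones.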

Last, you are constrained to the English language, as analyzed by Princeton. I’ll address this issue in the next section.

Option 2: SDict Viewer

As I said above, using WordNet is simple, but restrictive. What if I want to use a foreign language dictionary or something? WordNet is only in English. This is where the SDict format comes in. It has lots of free resource files available at http://sdict.com/en/. The best existing parser I found was SDict Viewer, which is a dead project but remarkably complete.

SDict Viewer is an application, so it’s not an easy-to-install library. However, it is very well written, and extracting what you need is simple. You can get my “library” version from http://github.com/jmhobbs/sdictviewer-lib.

Here is an example when it’s all finished:

import sys

import sdictviewer.formats.dct.sdict as sdict
import sdictviewer.dictutil

dictionary = sdict.SDictionary( 'webster_1913.dct' )
dictionary.load()

start_word = sys.argv[1]

found = False

for item in dictionary.get_word_list_iter( start_word ):
  try:
    if start_word == str( item ):
      instance, definition = item.read_articles()[0]
      print "%s: %s" % ( item, definition )
      found = True
      break
  except:
    continue

if not found:
  print "No definition for '%s'." % start_word

dictionary.close()

Here is a sample run:

$ python okay.py Cat
Cat: (n.) An animal of various species of the genera Felis and Lynx. The domestic cat is Felis domestica. The European wild cat (Felis catus) is much larger than the domestic cat. In the United States the name wild cat is commonly applied to the bay lynx (Lynx rufus) See Wild cat, and Tiger cat.
wrote /home/jmhobbs/.sdictviewer/index_cache/webster_1913.dct-1.0.index

As you can see, it gives a nice definition (thank you Webster 1913) and then it has a little junk on the end. This is the index cache, a lookup table for finding words faster. You can avoid saving it by calling dictionary.close(False) instead.
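
One gotcha with the example above: SDict files are case sensitive, so the Webster 1913 dictionary has “Cat” but not “cat”. A minimal sketch of a workaround is to try a few case variants of the word before giving up. The functions case_variants and lookup_any_case are my own, and the entries dict is a hypothetical stand-in for a real SDict lookup so the snippet runs on its own (Python 3).

```python
def case_variants(word):
    """Return the word as given, capitalized, lower and upper, without duplicates."""
    variants = []
    for candidate in (word, word.capitalize(), word.lower(), word.upper()):
        if candidate not in variants:
            variants.append(candidate)
    return variants


def lookup_any_case(word, lookup):
    """Try each variant with `lookup`, a callable returning a definition or None."""
    for candidate in case_variants(word):
        definition = lookup(candidate)
        if definition is not None:
            return candidate, definition
    return None


# Hypothetical stand-in for a real SDict lookup.
entries = {"Cat": "(n.) An animal of various species of the genera Felis and Lynx."}
print(lookup_any_case("cat", entries.get))
# ('Cat', '(n.) An animal of various species of the genera Felis and Lynx.')
```

To use it with the script above, you would wrap the get_word_list_iter loop in a function that returns the definition or None, and pass that function as lookup.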

Option 3: Aard Format

In option 2 I said that SDict Viewer was a dead project. This is because development has moved to the Aard Dictionary project. I chose not to pursue this format, as most of the existing resources are stored in HTML formats and I needed plain text. This might be ideal for you though, as they also provide access to Wikipedia archives.

All Done

So there you have it. Two viable ways of extracting a plain text definition for a word in Python. Best of luck to you!

Categories: Consume, Geek

Comments

  1. Randy Ford says:

    Dear Mr. Hobbs,

    I would love to use your sdictviewer-lib but I haven’t got the slightest idea about how to install it. I think you are assuming a shared level of expertise with your users that we fall far short of.

    Sorry for my inadequacies,

    Randy

  2. john says:

    No problem! Here’s the quick version. I’ll try to get an install script into the repo soon, but this should work for now.

    1. Download the library as a zip file: https://github.com/jmhobbs/sdictviewer-lib/zipball/master

    2. Unzip that and rename the resulting directory as sdictviewer

    3. Now, in the folder that you put that sdictviewer directory in, write your app. You can use the code above as a starting point.

    4. Download a dictionary you want to use (like this one)

    5. Run the app!

  3. praveenkumar says:

    sir, i have used the above sdictviewer ,it has got a small problem that it showing no definition for the word ‘cat’ and showing wrote to c:/……/.sdictviewrt/index_Cache/webster_1913

    what to do sir to eliminate the problem…waiting for your reply

  4. john says:

    SDict files are case sensitive, so the Webster 1913 dictionary has “Cat” but not “cat”.

    As for the cache message, you can comment out line 175 of sdictviewer/formats/dct/sdict.py to make that go away.

  5. praveenkumar says:

    yaa…working now…fine…thanks a lot…

  6. praveenkumar says:

    i want to know technically…hw this sdict viewer working and what’s the role of these database files you have provided?? can explain me please

  7. john says:

    Sure. The “database” files are the dictionary files, the list of words and definitions.

    SDict is a packed plain text format: http://sdict.com/en/format.php

    You can get a free “compiler” from here: http://swaj.net/sdict/index.html#download-ptksdict

    The SDict Viewer code is simply accessing the index in the file (hence the cache), and then doing lookups on the file accordingly.

    I did not write the core code (as I state in the README). This is pulled from the http://sdictviewer.sourceforge.net/ project, so my understanding of the format is not very deep.

  8. Martini says:

    Thanks for the awesome tutorial.
    I’m getting this error. In which directory should the
    ‘webster_1913.dct’ file be in?

    IOError: [Errno 2] No such file or directory: ‘webster_1913.dct’

    Thanks

  9. john says:

    @Martini – in the code I provided above, it is expecting webster_1913.dct to be in the same directory as the example script.

  11. emay says:

    i want to know that if i want to access wordnet through file handling ? is this possible and how ?

  12. john says:

    Hey emay,

    I do not know if direct access is possible, but I would encourage you to try the various libraries that handle that access for you.

    You can download wordnet directly here: http://wordnet.princeton.edu/wordnet/download/

    Good luck!

  13. emay says:

    thank you @john
    i just want to access the wordnet in python through filehandling or using array
    thats what i want to discuss
    and thanks alot for link and intrest

  14. om says:

    can u plz tell me how to rank the synsets and about the object of synset

  15. John Hobbs says:

    @om I’m not an expert with WordNet, but in simple terms a synset is all the synonyms/meanings for a word. As for ranking them, that’s beyond my expertise. You can learn more about WordNet at http://wordnet.princeton.edu/

  16. Goom says:

    Hi Mr. Hobbs,
    I was wondering if there is something out there that will provide pronunciation information.

  17. John Hobbs says:

    Sorry, I don’t know of a resource for that.

  18. Cesar says:

    Hey John, Can you please tell me how can I use your code snippet to print all definitions of a word rather than just one. I actually need to count how many definitions any given word has in the dictionary.

    Thanks a lot in advance,
    Cesar

  19. John Hobbs says:

    Hi Cesar,

    I’ve not touched this code in a while, so I could be off, but it seems you should be able to just not break in the loop, something like this:

    found = []
    for item in dictionary.get_word_list_iter( start_word ):
      try:
        if start_word == str( item ):
          found.append(item.read_articles()[0])
      except:
        continue
    
    for item in found:
        print "%s: %s" % item
    
  20. Cesar says:

    Thanks a lot for getting back to me. I had already tried doing that, and the code still returned only the first definition. For instance, the word ‘Record’ has 3 entries in the 1913 dict (2 verbs and 1 noun) http://machaut.uchicago.edu/?resource=Webster%27s&word=record&use1913=on&use1828=on

    But, the code only returns the definition from the first verb entry for record.

    Thanks again!

  21. AJ says:

    Is there a way to know if a word is inflectional or derivational through any module of nltk? Basically i want to find the root form of all the inflectional words of brown corpus. I have thought of giving word by word input to the stemmer. Please suggest any better way to do the same. Thanks in advance!

  22. John Hobbs says:

    Not sure! I think you would need a special corpus for that. Maybe PropBank?

  23. huzaira says:

    HI John, would you know how to expand a query using wordnet in python?
    say we have query q: “I am having issues with my phone reception”

    how can i get the synonyms for reception and add them to the original query before searching for an answer in a db?

  24. John Hobbs says:

    @Huzaira I honestly don’t know. I’m not a Wordnet expert, I just used it because it was available :)

  25. Kirti says:

    Works fine thanks

  26. alonelion says:

    hello, is this software can work on either dictionary, like cambridge or oxford dictionary,
    and I could pick meaning from either of them. thank you.

  27. John Hobbs says:

    Hi alonelion,

    It can work on any dictionary that you have a compatible dictionary file for. If you wanted to choose between two, you would initialize it for both, have it lookup on both, then present the results together.

  28. alonelion says:

    thank you join, best wish to you.

  30. MQ says:

    Any idea how I would be able to define words listed in a .txt file line by line using your second option? What I’ve done is open the file and then save each word using readline. I was thinking a for loop to put each word in a list and then put it in for start_word, delete the word from the list (start_word = list[:-1], del list[:-1]), and then break when done with len(list). Do I have the right idea and how would I go about implementing that?

  31. John Hobbs says:

    @MQ

    That’s the right idea! If you don’t need to store the results and just want to print them, you can loop directly over the open file handle:

    import sdictviewer.formats.dct.sdict as sdict
    
    dictionary = sdict.SDictionary('webster_1913.dct')
    dictionary.load()
    
    with open("word_list.txt", "r") as handle:
        for word in handle:
            start_word = word.strip()
            # Skip blank lines in the word list
            if not start_word:
                continue
            # Reset the flag for every word
            found = False
            for item in dictionary.get_word_list_iter(start_word):
                try:
                    if start_word == str(item):
                        instance, definition = item.read_articles()[0]
                        print "%s: %s" % (item, definition)
                        found = True
                        break
                except Exception:
                    continue
    
            # Report a miss once per word, after the search has finished
            if not found:
                print "No definition for '%s'." % start_word
    
    dictionary.close()

  32. E says:

    Where is the webster_1913.dct file available? Following the links above, it looks like the site is no longer available.
    Thanks

  33. John Hobbs says:

    @E Unsure. I can no longer find any good SDict files; it looks like the project broke apart. I’ll dig around and see if I can find a copy in my files somewhere.

  34. lion says:

    Hello, right now more useful dictionary I think is mobi dictionary, because you could get easily from website, so do you have some method to read from mobi dictionary pick up words.
    Before I try use calibre to get htmlz and then unzip to get html, and then use beautifulsoup to get some part of the translation, but hard is using calibre to get the result some code is not block, so not so easily to get what you want.
    So, do you have some good advise thank you.

  35. John Hobbs says:

    Sorry, I don’t know much about the mobi format, and cursory Google searching didn’t find anything promising.

    Good luck!

  36. ajith says:

    Hi, Im getting error like this. bound method. Wht to do . please do suggest.

    C:\Users\User 1\Desktop>python meaning.py
    ----------
    Name:
    Lexical Type:
    Lemmas:
    Definition:
    Traceback (most recent call last):
      File "meaning.py", line 20, in <module>
        for example in synset.examples:
    TypeError: 'method' object is not iterable

    C:\Users\User 1\Desktop>

  37. ash says:

    how to create a new corpus??
