First off, I do not mean dictionary in the Python sense of the word. I mean dictionary in the glossary sense, like Merriam-Webster. This collision of terminology makes Googling for this functionality particularly difficult and frustrating.
I came across three useful Python solutions, and I’m going to detail usage of two of them in this post.
First up is accessing WordNet.
“Wordnet is a large lexical database of English…”
The only Python way of accessing this (that I came across) is NLTK, a set of
“Open source Python modules, linguistic data and documentation for research and development in natural language processing…”
For various reasons, NLTK is not packaged by Debian, so I had to install it by hand. Even if your distro does package NLTK, you might want to read this bit anyway. Installing was a cinch with easy_install nltk. However, this does not install the corpus (where WordNet is stored), as shown below:

    >>> from nltk.corpus import wordnet
    >>> wordnet.synsets( 'cake' )
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 68, in __getattr__
        self.__load()
      File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 56, in __load
        except LookupError: raise e
    LookupError:
    **********************************************************************
      Resource 'corpora/wordnet' not found.  Please use the NLTK
      Downloader to obtain the resource: >>> nltk.download().
      Searched in:
        - '/home/jmhobbs/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
    **********************************************************************
So what we need to do is run the NLTK downloader, as shown here:
    >>> import nltk
    >>> nltk.download()
    NLTK Downloader
    ---------------------------------------------------------------------------
        d) Download   l) List   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> d

    Download which package (l=list; x=cancel)?
      Identifier> wordnet
        Downloading package 'wordnet' to /home/jmhobbs/nltk_data...
          Unzipping corpora/wordnet.zip.

    ---------------------------------------------------------------------------
        d) Download   l) List   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> q
    True
    >>>
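If you would rather not click through the interactive prompt, the same "try it, download on LookupError, try again" flow can be sketched in code. This is only a rough sketch: ensure_resource and the fake_* helpers are names I made up for illustration, and the fakes stand in for NLTK so the snippet runs on its own. The one real NLTK call assumed here is nltk.download, which also accepts a package identifier such as 'wordnet'.

```python
def ensure_resource(probe, download, package):
    """Call probe(); on LookupError, download the package and retry once."""
    try:
        return probe()
    except LookupError:
        download(package)
        return probe()


# Hypothetical stand-ins so the sketch runs without NLTK installed.
installed = set()


def fake_probe():
    # Mimics the LookupError raised when the corpus is missing.
    if "wordnet" not in installed:
        raise LookupError("Resource 'corpora/wordnet' not found.")
    return ["cake.n.01"]


def fake_download(package):
    installed.add(package)
    return True


result = ensure_resource(fake_probe, fake_download, "wordnet")
print(result)

# With NLTK installed, the call would look something like:
#   import nltk
#   from nltk.corpus import wordnet
#   ensure_resource(lambda: wordnet.synsets('cake'), nltk.download, 'wordnet')
```

Passing the probe and the downloader in as arguments keeps the retry logic separate from NLTK itself, which also makes it easy to try out.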
Now that we have everything installed, using WordNet from Python is straightforward.

    # Load the wordnet corpus
    from nltk.corpus import wordnet

    # Get a collection of synsets (synonym sets) for a word
    synsets = wordnet.synsets( 'cake' )

    # Print the information
    for synset in synsets:
        print "-" * 10
        print "Name:", synset.name
        print "Lexical Type:", synset.lexname
        print "Lemmas:", synset.lemma_names
        print "Definition:", synset.definition
        for example in synset.examples:
            print "Example:", example
The output of that is:
    ----------
    Name: cake.n.01
    Lexical Type: noun.artifact
    Lemmas: ['cake', 'bar']
    Definition: a block of solid substance (such as soap or wax)
    Example: a bar of chocolate
    ----------
    Name: patty.n.01
    Lexical Type: noun.food
    Lemmas: ['patty', 'cake']
    Definition: small flat mass of chopped food
    ----------
    Name: cake.n.03
    Lexical Type: noun.food
    Lemmas: ['cake']
    Definition: baked goods made from or based on a mixture of flour, sugar, eggs, and fat
    ----------
    Name: coat.v.03
    Lexical Type: verb.contact
    Lemmas: ['coat', 'cake']
    Definition: form a coat over
    Example: Dirt had coated her face
Perfect!
There are some caveats to using WordNet with NLTK. First is that the definitions aren’t always ordered in the way you would expect. For instance, look at the “cake” results above. Cake, as in the confection, is the third definition, which feels wrong. You can of course order and filter on the synset name to correct this to some degree.
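Here is one possible way to do that ordering and filtering, working only on the synset name strings from the output above (the word.pos.number pattern). The function names are mine, not part of NLTK, and the ranking rule (synsets named after the search word first, then by sense number) is just one reasonable choice.

```python
def parse_synset_name(name):
    """Split a name like 'cake.n.03' into ('cake', 'n', 3)."""
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


def rank_synset_names(names, word, pos=None):
    """Put synsets named after `word` first, optionally keeping one part of speech."""
    parsed = [(name,) + parse_synset_name(name) for name in names]
    if pos is not None:
        parsed = [p for p in parsed if p[2] == pos]
    # Sort key: names matching the word come first, then lower sense numbers.
    parsed.sort(key=lambda p: (p[1] != word, p[3]))
    return [p[0] for p in parsed]


names = ["cake.n.01", "patty.n.01", "cake.n.03", "coat.v.03"]
ranked = rank_synset_names(names, "cake")
nouns_only = rank_synset_names(names, "cake", pos="n")
print(ranked)
print(nouns_only)
```

This still leaves the soap-and-wax sense ahead of the confection, so it really is only a partial fix.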
Second, there is a major load time for getting WordNet ready to use. Your first call to wordnet.synsets will take considerably longer than the next ones. On my machine the difference was 3.5 seconds versus 0.0003 seconds.
Last, you are constrained to the English language, as analyzed by Princeton. I’ll address this issue in the next section.
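If you want to measure that gap yourself, a small timing helper is enough. In the sketch below, LazyLookup is a made-up stand-in that pretends to load data on first use (NLTK is not imported here, and the 0.05 second delay is arbitrary); with NLTK installed you would pass wordnet.synsets to timed instead.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


class LazyLookup:
    """Hypothetical stand-in: slow on the first call, fast afterwards."""

    def __init__(self, load_seconds):
        self._load_seconds = load_seconds
        self._data = None

    def __call__(self, word):
        if self._data is None:
            # Simulate the one-time cost of loading the corpus.
            time.sleep(self._load_seconds)
            self._data = {"cake": ["cake.n.01", "patty.n.01", "cake.n.03", "coat.v.03"]}
        return self._data.get(word, [])


lookup = LazyLookup(load_seconds=0.05)
first_result, first_elapsed = timed(lookup, "cake")
second_result, second_elapsed = timed(lookup, "cake")
print("first call:  %.4f s" % first_elapsed)
print("second call: %.4f s" % second_elapsed)
```

In a long-running program, making one throwaway call at startup moves the delay out of the way of the first real lookup.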
As I said above, using WordNet is simple, but restrictive. What if I want to use a foreign language dictionary or something? WordNet is only in English. This is where the SDict format comes in. It has lots of free resource files available at http://sdict.com/en/. The best existing parser I found was SDict Viewer, which is a dead project but remarkably complete.
SDict Viewer is an application, so it’s not an easy-to-install library. However, it is very well written and extracting what you need is simple. You can get my “library” version from http://github.com/jmhobbs/sdictviewer-lib.
Here is an example when it’s all finished:
    import sys

    import sdictviewer.formats.dct.sdict as sdict
    import sdictviewer.dictutil

    dictionary = sdict.SDictionary( 'webster_1913.dct' )
    dictionary.load()

    start_word = sys.argv[1]

    found = False

    for item in dictionary.get_word_list_iter( start_word ):
        try:
            if start_word == str( item ):
                instance, definition = item.read_articles()[0]
                print "%s: %s" % ( item, definition )
                found = True
                break
        except:
            continue

    if not found:
        print "No definition for '%s'." % start_word

    dictionary.close()
Here is a sample run:
    jmhobbs@katya:~$ python okay.py Cat
    Cat: (n.) An animal of various species of the genera Felis and Lynx. The domestic cat is Felis domestica. The European wild cat (Felis catus) is much larger than the domestic cat. In the United States the name wild cat is commonly applied to the bay lynx (Lynx rufus) See Wild cat, and Tiger cat.
    wrote /home/jmhobbs/.sdictviewer/index_cache/webster_1913.dct-1.0.index
As you can see, it gives a nice definition (thank you Webster 1913) and then it has a little junk on the end. This is the index cache, a lookup table for finding words faster. You can avoid saving it by calling dictionary.close(False) instead.
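If you need this in more than one place, the lookup loop can be wrapped in a small function. The sketch below is duck-typed and uses only the calls that appear in this post (get_word_list_iter, read_articles, and close(False)). FakeItem and FakeDictionary are hypothetical stand-ins so it runs without SDict Viewer, the save_index parameter name is a guess, and the sketch is written for a current Python while the library itself dates from the Python 2 days.

```python
def define(dictionary, word):
    """Return the first article for an exact match of `word`, or None."""
    for item in dictionary.get_word_list_iter(word):
        if str(item) == word:
            articles = item.read_articles()
            if articles:
                _instance, definition = articles[0]
                return definition
    return None


# Hypothetical stand-ins so the sketch runs without SDict Viewer installed.
class FakeItem:
    def __init__(self, word, text):
        self._word, self._text = word, text

    def __str__(self):
        return self._word

    def read_articles(self):
        return [(None, self._text)]


class FakeDictionary:
    def __init__(self, entries):
        self._entries = entries
        self.close_calls = []

    def get_word_list_iter(self, start_word):
        # Yield entries beginning with start_word, in alphabetical order.
        for word in sorted(self._entries):
            if word.startswith(start_word):
                yield FakeItem(word, self._entries[word])

    def close(self, save_index=True):  # parameter name is a guess
        self.close_calls.append(save_index)


dictionary = FakeDictionary({
    "Cat": "(n.) An animal of various species of the genera Felis and Lynx.",
    "Catalog": "(n.) A list.",
})
try:
    found = define(dictionary, "Cat")
    missing = define(dictionary, "Dog")
finally:
    # Passing False skips writing the index cache, as described above.
    dictionary.close(False)

print(found)
print(missing)
```

Returning None for a miss leaves it up to the caller to decide what to print.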
Earlier I said that SDict Viewer was a dead project. That is because development has moved to the Aard Dictionary project. I chose not to pursue this format, as most of the existing resources are stored in HTML formats and I needed plain text. This might be ideal for you though, as they also provide access to Wikipedia archives.
So there you have it. Two viable ways of extracting a plain text definition for a word in Python. Best of luck to you!
Posted March 1st, 2010