First off, I do not mean dictionary in the Python sense of the word. I mean dictionary in the glossary sense, like Merriam-Webster. This collision of terminology makes Googling for this functionality particularly difficult and frustrating.
I came across three useful Python solutions, and I’m going to detail usage of two of them in this post.
Option 1: NLTK + Wordnet
First up is accessing Wordnet.
“Wordnet is a large lexical database of English…”
The only Python way of accessing this (that I came across) is NLTK, a set of
“Open source Python modules, linguistic data and documentation for research and development in natural language processing…”
Getting NLTK Installed
For various reasons, NLTK is not packaged by Debian, so I had to install it by hand. Even if your distro does package NLTK, you might want to read this bit anyway. Installing was a cinch with easy_install nltk. However, this does not install the corpus data (where WordNet is stored), as shown below:
>>> from nltk.corpus import wordnet
>>> wordnet.synsets( 'cake' )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 68, in __getattr__
    self.__load()
  File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 56, in __load
    except LookupError: raise e
LookupError:
**********************************************************************
Resource 'corpora/wordnet' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download().
Searched in:
- '/home/jmhobbs/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
So what we need to do is run the NLTK downloader, as shown here:
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> wordnet
Downloading package 'wordnet' to /home/jmhobbs/nltk_data...
Unzipping corpora/wordnet.zip.
---------------------------------------------------------------------------
d) Download l) List c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> q
True
>>>
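If the interactive menu is inconvenient (in a setup script, say), the same download can probably be scripted. What follows is a minimal sketch and not part of the original session. It is written for Python 3, unlike the Python 2 listings in this post. `ensure_resource` and `ensure_wordnet` are hypothetical helper names, and the sketch assumes that `nltk.data.find` raises `LookupError` for a missing resource and that `nltk.download` accepts a package identifier such as 'wordnet'.

```python
def ensure_resource(resource_path, package, find, download):
    """Download `package` only if `resource_path` cannot be found.

    `find` must raise LookupError when the resource is missing;
    `download` is called with the package identifier.
    Returns True if a download was attempted, False otherwise.
    """
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package)
        return True


def ensure_wordnet():
    """Assumed wiring for NLTK; needs nltk installed and network access."""
    import nltk  # imported lazily so the helper above has no dependencies
    return ensure_resource("corpora/wordnet", "wordnet",
                           nltk.data.find, nltk.download)
```

Passing `find` and `download` in as arguments keeps the check itself free of NLTK, so it can be exercised without a network connection.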
Using NLTK + Wordnet
Now that we have everything installed, using WordNet from Python is straightforward.
# Load the wordnet corpus
from nltk.corpus import wordnet
# Get a collection of synsets (synonym sets) for a word
synsets = wordnet.synsets( 'cake' )
# Print the information
for synset in synsets:
    print "-" * 10
    print "Name:", synset.name
    print "Lexical Type:", synset.lexname
    print "Lemmas:", synset.lemma_names
    print "Definition:", synset.definition
    for example in synset.examples:
        print "Example:", example
The output of that is:
----------
Name: cake.n.01
Lexical Type: noun.artifact
Lemmas: ['cake', 'bar']
Definition: a block of solid substance (such as soap or wax)
Example: a bar of chocolate
----------
Name: patty.n.01
Lexical Type: noun.food
Lemmas: ['patty', 'cake']
Definition: small flat mass of chopped food
----------
Name: cake.n.03
Lexical Type: noun.food
Lemmas: ['cake']
Definition: baked goods made from or based on a mixture of flour, sugar, eggs, and fat
----------
Name: coat.v.03
Lexical Type: verb.contact
Lemmas: ['coat', 'cake']
Definition: form a coat over
Example: Dirt had coated her face
Perfect!
Caveats
There are some caveats to using WordNet with NLTK. First is that the definitions aren’t always ordered in the way you would expect. For instance, look at the “cake” results above. Cake, as in the confection, is the third definition, which feels wrong. You can of course order and filter on the synset name to correct this to some degree.
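One way to filter on the synset name is sketched below in Python 3. It works on the name strings alone, so it does not need NLTK. `parse_synset_name` and `prefer_headword` are hypothetical helpers, and the sketch assumes names always follow the lemma.pos.number pattern seen in the output above.

```python
def parse_synset_name(name):
    """Split 'cake.n.03' into ('cake', 'n', 3)."""
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)


def prefer_headword(names, word, pos=None):
    """Keep synsets named after `word`, optionally for one part of speech."""
    kept = []
    for name in names:
        lemma, name_pos, number = parse_synset_name(name)
        if lemma == word and (pos is None or name_pos == pos):
            kept.append((number, name))
    # Sort by sense number so the result order is predictable.
    return [name for number, name in sorted(kept)]


names = ["cake.n.01", "patty.n.01", "cake.n.03", "coat.v.03"]
print(prefer_headword(names, "cake", pos="n"))  # ['cake.n.01', 'cake.n.03']
```

This drops the "patty" and "coat" senses, which only list cake as a secondary lemma. Choosing between the remaining senses still takes a human or another heuristic.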
Second, there is a major load time for getting WordNet ready to use. Your first call to wordnet.synsets will take considerably longer than the next ones. On my machine the difference was 3.5 seconds versus 0.0003 seconds.
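To measure that difference on your own machine, a small timing helper is enough. This is a Python 3 sketch. `time_calls` is a hypothetical name, and the commented usage assumes NLTK and the WordNet corpus are installed.

```python
import time


def time_calls(func, argument, repeats=3):
    """Call func(argument) `repeats` times; return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(argument)
        durations.append(time.perf_counter() - start)
    return durations


# Assumed usage with NLTK installed (not executed here):
#   from nltk.corpus import wordnet
#   print(time_calls(wordnet.synsets, "cake"))
# If the first duration dominates, one throwaway lookup at program
# start moves that cost out of the first real request.
```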
Last, you are constrained to the English language, as analyzed by Princeton. I’ll address this issue in the next section.
Option 2: SDict Viewer
As I said above, using WordNet is simple, but restrictive. What if I want to use a foreign language dictionary or something? WordNet is only in English. This is where the SDict format comes in. It has lots of free resource files available at http://sdict.com/en/. The best existing parser I found was SDict Viewer, which is a dead project but remarkably complete.
SDict Viewer is an application
SDict Viewer is an application, so it’s not an easy-to-install library. However, it is very well written and extracting what you need is simple. You can get my “library” version from http://github.com/jmhobbs/sdictviewer-lib.
Here is an example when it’s all finished:
import sys
import sdictviewer.formats.dct.sdict as sdict
import sdictviewer.dictutil
dictionary = sdict.SDictionary( 'webster_1913.dct' )
dictionary.load()
start_word = sys.argv[1]
found = False
for item in dictionary.get_word_list_iter( start_word ):
    try:
        if start_word == str( item ):
            instance, definition = item.read_articles()[0]
            print "%s: %s" % ( item, definition )
            found = True
            break
    except:
        continue
if not found:
    print "No definition for '%s'." % start_word
dictionary.close()
Here is a sample run:
$ python okay.py Cat
Cat: (n.) An animal of various species of the genera Felis and Lynx. The domestic cat is Felis domestica. The European wild cat (Felis catus) is much larger than the domestic cat. In the United States the name wild cat is commonly applied to the bay lynx (Lynx rufus) See Wild cat, and Tiger cat.
wrote /home/jmhobbs/.sdictviewer/index_cache/webster_1913.dct-1.0.index
As you can see, it gives a nice definition (thank you Webster 1913) and then it has a little junk on the end. This is the index cache, a lookup table for finding words faster. You can avoid saving it by calling dictionary.close(False) instead.
Option 3: Aard Format
In option 2 I said that SDict Viewer was a dead project. That is because development has moved to the Aard Dictionary project. I chose not to pursue this format, as most of the existing resources are stored in HTML formats and I needed plain text. This might be ideal for you though, as they also provide access to Wikipedia archives.
All Done
So there you have it. Two viable ways of extracting a plain text definition for a word in Python. Best of luck to you!
Comments
Dear Mr. Hobbs,
I would love to use your sdictviewer-lib but I haven’t got the slightest idea how to install it. I think you are assuming a shared level of expertise with your users that we fall far short of.
Sorry for my inadequacies,
Randy
No problem! Here’s the quick version. I’ll try to get an install script into the repo soon, but this should work for now.
1. Download the library as a zip file: https://github.com/jmhobbs/sdictviewer-lib/zipball/master
2. Unzip that and rename the resulting directory as sdictviewer
3. Now, in the folder that you put that sdictviewer directory in, write your app. You can use the code above as a starting point.
4. Download a dictionary you want to use (like this one)
5. Run the app!
sir, i have used the above sdictviewer ,it has got a small problem that it showing no definition for the word ‘cat’ and showing wrote to c:/……/.sdictviewrt/index_Cache/webster_1913
what to do sir to eliminate the problem…waiting for your reply
SDict files are case sensitive, so the Webster 1913 dictionary has “Cat” but not “cat”.
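If you would rather not guess the capitalization by hand, you can try a few spellings in turn. This is a Python 3 sketch. `case_variants` and `lookup_any_case` are hypothetical helpers, and `lookup` stands for whatever function returns a definition (or None) from your dictionary. A plain dict plays that role here.

```python
def case_variants(word):
    """Return distinct spellings to try, original first."""
    candidates = [word, word.capitalize(), word.lower(), word.upper()]
    seen = []
    for candidate in candidates:
        if candidate not in seen:
            seen.append(candidate)
    return seen


def lookup_any_case(word, lookup):
    """Return (spelling, definition) for the first known variant, else None."""
    for candidate in case_variants(word):
        definition = lookup(candidate)
        if definition is not None:
            return candidate, definition
    return None


# A plain dict stands in for a real dictionary file here.
entries = {"Cat": "placeholder definition of Cat"}
print(lookup_any_case("cat", entries.get))  # ('Cat', 'placeholder definition of Cat')
```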
As for the cache message, you can comment out line 175 of sdictviewer/formats/dct/sdict.py to make that go away.
yaa…working now…fine…thanks a lot…
i want to know technically…hw this sdict viewer working and what’s the role of these database files you have provided?? can explain me please
Sure. The “database” files are the dictionary files, the list of words and definitions.
SDict is a packed plain text format: http://sdict.com/en/format.php
You can get a free “compiler” from here: http://swaj.net/sdict/index.html#download-ptksdict
The SDict Viewer code is simply accessing the index in the file (hence the cache), and then doing lookups on the file accordingly.
I did not write the core code (as I state in the README). This is pulled from the http://sdictviewer.sourceforge.net/ project, so my understanding of the format is not very deep.
Thanks for the awesome tutorial.
I’m getting this error. In which directory should the
‘webster_1913.dct’ file be in?
IOError: [Errno 2] No such file or directory: ‘webster_1913.dct’
Thanks
@Martini – in the code I provided above, it is expecting webster_1913.dct to be in the same directory as the example script.
i want to know that if i want to access wordnet through file handling ? is this possible and how ?
Hey emay,
I do not know if direct access is possible, but I would encourage you to try the various libraries that handle that access for you.
You can download wordnet directly here: http://wordnet.princeton.edu/wordnet/download/
Good luck!
thank you @john
i just want to access the wordnet in python through filehandling or using array
thats what i want to discuss
and thanks alot for link and intrest
can u plz tell me how to rank the synsets and about the object of synset
@om I’m not an expert with WordNet, but in simple terms a synset is all the synonyms/meanings for a word. As for ranking them, that’s beyond my expertise. You can learn more about WordNet at http://wordnet.princeton.edu/
Hi Mr. Hobbs,
I was wondering if there is something out there that will provide pronunciation information.
Sorry, I don’t know of a resource for that.
Hey John, Can you please tell me how can I use your code snippet to print all definitions of a word rather than just one. I actually need to count how many definitions any given word has in the dictionary.
Thanks a lot in advance,
Cesar
Hi Cesar,
I’ve not touched this code in a while, so I could be off, but it seems you should be able to just not break in the loop, something like this;
Thanks a lot for getting back to me. I had already tried doing that, and the code still returned only the first definition. For instance, the word ‘Record’ has 3 entries in the 1913 dict (2 verbs and 1 noun) http://machaut.uchicago.edu/?resource=Webster%27s&word=record&use1913=on&use1828=on
But, the code only returns the definition from the first verb entry for record.
Thanks again!
Is there a way to know if a word is inflectional or derivational through any module of nltk? Basically i want to find the root form of all the inflectional words of brown corpus. I have thought of giving word by word input to the stemmer. Please suggest any better way to do the same. Thanks in advance!
Not sure! I think you would need a special corpus for that. Maybe PropBank?
HI John, would you know how to expand a query using wordnet in python?
say we have query q: “I am having issues with my phone reception”
how can i get the synonyms for reception and add them to the original query before searching for an answer in a db?
@Huzaira I honestly don’t know. I’m not a Wordnet expert, I just used it because it was available :)
Works fine thanks
hello, is this software can work on either dictionary, like cambridge or oxford dictionary,
and I could pick meaning from either of them. thank you.
Hi alonelion,
It can work on any dictionary that you have a compatible dictionary file for. If you wanted to choose between two, you would initialize it for both, have it lookup on both, then present the results together.
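The "look up on both, then present the results together" idea could look roughly like this in Python 3. `lookup_all` is a hypothetical helper, and the two plain dicts with placeholder text stand in for two loaded dictionary files.

```python
def lookup_all(word, sources):
    """Look `word` up in every source and collect the hits.

    `sources` maps a display name to a callable that returns a
    definition string, or None when the word is missing.
    """
    results = {}
    for name, lookup in sources.items():
        definition = lookup(word)
        if definition is not None:
            results[name] = definition
    return results


# Plain dicts with placeholder text stand in for two dictionary files.
first = {"Cat": "definition of Cat from dictionary A"}
second = {"Cat": "definition of Cat from dictionary B",
          "Dog": "definition of Dog from dictionary B"}
sources = {"A": first.get, "B": second.get}

for name, definition in lookup_all("Cat", sources).items():
    print("%s: %s" % (name, definition))
```

A word found in only one source simply produces one entry, so the caller can choose whichever result it prefers.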
thank you join, best wish to you.
Any idea how I would be able to define words listen in a .txt file line by line using your second option? What I’ve done is open the file and then save each word using readline. I was thinking a for loop to put each word in a list and then put it in for start_word, delete the word from the list (start_word = list[:-1], del list[:-1]), and then break when done with len(list). Do I have the right idea and how would I go implementing that?
@MQ
That’s the right idea! If you aren’t needing to store it, just print it, you can just loop off the open file handle;
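Looping over the open file handle could be sketched as follows in Python 3. `define_lines` is a hypothetical helper, `lookup` stands for your own definition function, and `io.StringIO` stands in for a real `open('words.txt')` so the example runs without any files.

```python
import io


def define_lines(handle, lookup):
    """Yield (word, definition) for each non-blank line of an open file.

    `lookup` is a hypothetical callable returning a definition or None.
    """
    for line in handle:  # iterating a file yields one line at a time
        word = line.strip()
        if word:
            yield word, lookup(word)


# io.StringIO stands in for an open text file; a dict for the dictionary.
fake_file = io.StringIO("Cat\n\nDog\n")
entries = {"Cat": "placeholder definition of Cat"}
for word, definition in define_lines(fake_file, entries.get):
    print("%s: %s" % (word, definition or "No definition"))
```

Because the loop reads one line at a time, there is no need to build a list first and delete from it.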
Where is the webster_1913.dct file available? Following the links above, it looks like the site is no longer available.
Thanks
@E Unsure. I can no longer find any good sdict files anymore, looks like the project broke apart. I’ll dig around and see if I can find a copy in my files somewhere.
Hello, right now more useful dictionary I think is mobi dictionary, because you could get easily from website, so do you have some method to read from mobi dictionary pick up words.
Before I try use calibre to get htmlz and then unzip to get html, and then use beautifulsoup to get some part of the translation, but hard is using calibre to get the result some code is not block, so not so easily to get what you want.
So, do you have some good advise thank you.
Sorry, I don’t know much about the mobi format, and cursory Google searching didn’t find anything promising.
Good luck!
Hi, Im getting error like this. bound method. Wht to do . please do suggest.
C:\Users\User 1\Desktop>python meaning.py
———-
Name:
Lexical Type:
Lemmas:
Definition:
Traceback (most recent call last):
  File "meaning.py", line 20, in <module>
    for example in synset.examples:
TypeError: 'method' object is not iterable
C:\Users\User 1\Desktop>
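This error most likely means the installed NLTK is newer than the one used in the post. In NLTK 3, `name`, `lexname`, `lemma_names`, `definition` and `examples` are methods, so they need parentheses, for example `synset.examples()`. The Python 3 sketch below shows one way to write code that tolerates both styles. `field` is a hypothetical helper, and the two small classes are stand-ins for real synsets so the example runs without NLTK.

```python
def field(synset, name):
    """Read a synset field whether it is a plain attribute or a method."""
    value = getattr(synset, name)
    return value() if callable(value) else value


class OldStyle(object):
    """Stand-in for a synset from the NLTK version used in the post."""
    name = "cake.n.03"
    examples = []


class NewStyle(object):
    """Stand-in for a synset from a newer NLTK, where fields are methods."""
    def name(self):
        return "cake.n.03"

    def examples(self):
        return []


for synset in (OldStyle(), NewStyle()):
    print(field(synset, "name"), field(synset, "examples"))
```

If you only target a current NLTK, adding the parentheses directly is simpler than using a helper like this.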
how to create a new corpus??