“Schrödinger’s cat walks into a bar. And doesn’t.”
- Brian Malow
Science Comedian
First off, I do not mean dictionary in the Python sense of the word. I mean dictionary in the glossary sense, like Merriam-Webster. This collision of terminology makes Googling for this functionality particularly difficult and frustrating.
I came across three useful Python solutions, and I’m going to detail usage of two of them in this post.
First up is accessing WordNet.
“Wordnet is a large lexical database of English…”
The only Python way of accessing this (that I came across) is NLTK, a set of
“Open source Python modules, linguistic data and documentation for research and development in natural language processing…”
For various reasons, NLTK is not packaged by Debian, so I had to install it by hand. Even if your distro does package NLTK, you might want to read this bit anyway. Installing was a cinch with easy_install nltk. However, this does not install the corpus (where WordNet is stored), as shown below:
    >>> from nltk.corpus import wordnet
    >>> wordnet.synsets( 'cake' )
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 68, in __getattr__
        self.__load()
      File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 56, in __load
        except LookupError: raise e
    LookupError:
    **********************************************************************
      Resource 'corpora/wordnet' not found.  Please use the NLTK
      Downloader to obtain the resource: >>> nltk.download().
      Searched in:
        - '/home/jmhobbs/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
    **********************************************************************
So what we need to do is run the NLTK downloader, as shown here:
    >>> import nltk
    >>> nltk.download()
    NLTK Downloader
    ---------------------------------------------------------------------------
        d) Download   l) List   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> d
    Download which package (l=list; x=cancel)?
      Identifier> wordnet
        Downloading package 'wordnet' to /home/jmhobbs/nltk_data...
          Unzipping corpora/wordnet.zip.
    ---------------------------------------------------------------------------
        d) Download   l) List   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> q
    True
    >>>
Now that we have everything installed, using WordNet from Python is straightforward.
    # Load the wordnet corpus
    from nltk.corpus import wordnet

    # Get a collection of synsets (synonym sets) for a word
    synsets = wordnet.synsets( 'cake' )

    # Print the information
    for synset in synsets:
        print "-" * 10
        print "Name:", synset.name
        print "Lexical Type:", synset.lexname
        print "Lemmas:", synset.lemma_names
        print "Definition:", synset.definition
        for example in synset.examples:
            print "Example:", example
The output of that is:
    ----------
    Name: cake.n.01
    Lexical Type: noun.artifact
    Lemmas: ['cake', 'bar']
    Definition: a block of solid substance (such as soap or wax)
    Example: a bar of chocolate
    ----------
    Name: patty.n.01
    Lexical Type: noun.food
    Lemmas: ['patty', 'cake']
    Definition: small flat mass of chopped food
    ----------
    Name: cake.n.03
    Lexical Type: noun.food
    Lemmas: ['cake']
    Definition: baked goods made from or based on a mixture of flour, sugar, eggs, and fat
    ----------
    Name: coat.v.03
    Lexical Type: verb.contact
    Lemmas: ['coat', 'cake']
    Definition: form a coat over
    Example: Dirt had coated her face
Perfect!
There are some caveats to using WordNet with NLTK. First is that the definitions aren’t always ordered in the way you would expect. For instance, look at the “cake” results above. Cake, as in the confection, is the third definition, which feels wrong. You can of course order and filter on the synset name to correct this to some degree.
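One way to do that ordering is to work on the synset names themselves, which follow the word.pos.number pattern seen in the output above. Here is a minimal sketch, assuming you have already collected the names into a list of strings. The function names split_synset_name and prefer_headword are made up for illustration and are not part of NLTK.

```python
def split_synset_name(name):
    """Split a name such as 'cake.n.01' into ('cake', 'n', 1)."""
    lemma, pos, number = name.rsplit('.', 2)
    return lemma, pos, int(number)


def prefer_headword(names, word, pos=None):
    """Order synset names so the ones named after `word` come first.

    Optionally keep only one part of speech (for example 'n' or 'v').
    The original relative order is kept inside each group.
    """
    parsed = [(name, split_synset_name(name)) for name in names]
    if pos is not None:
        parsed = [(name, parts) for name, parts in parsed if parts[1] == pos]
    # sorted() is stable, so ties keep the order they arrived in.
    ordered = sorted(parsed, key=lambda pair: pair[1][0] != word)
    return [name for name, parts in ordered]


# The four names from the 'cake' output above.
names = ['cake.n.01', 'patty.n.01', 'cake.n.03', 'coat.v.03']
cake_first = prefer_headword(names, 'cake')
nouns_only = prefer_headword(names, 'cake', pos='n')
```

With real results you would build the list from the name attribute of each synset, as in the script above. This only pushes the "cake" entries to the front; it does not decide that the confection should beat the bar of soap.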
Second, there is a major load time for getting WordNet ready to use. Your first call to wordnet.synsets will take considerably longer than the next ones. On my machine the difference was 3.5 seconds versus 0.0003 seconds.
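If that first-call delay matters, for example in an interactive tool, you can pay it up front at startup. Below is a rough sketch of measuring a slow first call against a fast second one. To keep it runnable without NLTK installed, slow_first_lookup is a made-up stand-in for wordnet.synsets, and its 0.05 second sleep is an arbitrary placeholder for the real load time. The timed helper uses only the standard library.

```python
import time

_loaded = {}


def slow_first_lookup(word):
    """Hypothetical stand-in for wordnet.synsets: loads data on first use."""
    if 'data' not in _loaded:
        time.sleep(0.05)  # pretend to read the corpus from disk
        _loaded['data'] = {'cake': ['cake.n.01', 'cake.n.03']}
    return _loaded['data'].get(word, [])


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start


# The first call pays the loading cost; later calls do not.
first_result, first_seconds = timed(slow_first_lookup, 'cake')
second_result, second_seconds = timed(slow_first_lookup, 'cake')
```

In a real program, a single throwaway call to wordnet.synsets at startup plays the same warm-up role.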
Last, you are constrained to the English language, as analyzed by Princeton. I’ll address this issue in the next section.
As I said above, using WordNet is simple, but restrictive. What if I want to use a foreign language dictionary or something? WordNet is only in English. This is where the SDict format comes in. It has lots of free resource files available at http://sdict.com/en/. The best existing parser I found was SDict Viewer, which is a dead project, but remarkably complete.
SDict Viewer is an application, so it’s not an easy-to-install library. However, it is very well written and extracting what you need is simple. You can get my “library” version from http://github.com/jmhobbs/sdictviewer-lib.
Here is an example when it’s all finished:
    import sys

    import sdictviewer.formats.dct.sdict as sdict
    import sdictviewer.dictutil

    dictionary = sdict.SDictionary( 'webster_1913.dct' )
    dictionary.load()

    start_word = sys.argv[1]

    found = False

    for item in dictionary.get_word_list_iter( start_word ):
        try:
            if start_word == str( item ):
                instance, definition = item.read_articles()[0]
                print "%s: %s" % ( item, definition )
                found = True
                break
        except:
            continue

    if not found:
        print "No definition for '%s'." % start_word

    dictionary.close()
Here is a sample run:
    jmhobbs@katya:~$ python okay.py Cat
    Cat: (n.) An animal of various species of the genera Felis and Lynx. The domestic cat is Felis domestica. The European wild cat (Felis catus) is much larger than the domestic cat. In the United States the name wild cat is commonly applied to the bay lynx (Lynx rufus) See Wild cat, and Tiger cat.
    wrote /home/jmhobbs/.sdictviewer/index_cache/webster_1913.dct-1.0.index
As you can see, it gives a nice definition (thank you Webster 1913) and then it has a little junk on the end. This is the index cache, a lookup table for finding words faster. You can avoid saving it by calling dictionary.close(False) instead.
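If you need lookups in more than one place, the loop from the script can be wrapped in a function. This is only a sketch: it relies on nothing beyond the calls already used above (get_word_list_iter, str() on an item, and read_articles), and it drops the blanket try/except from the script. FakeItem and FakeDictionary are hypothetical stand-ins so the example runs without a .dct file, and the prefix matching in the fake iterator is my assumption about how the real one behaves.

```python
def lookup(dictionary, word):
    """Return the first definition whose headword is exactly `word`, or None.

    `dictionary` is expected to behave like the loaded SDictionary above:
    get_word_list_iter(word) yields items, and each item has read_articles().
    """
    for item in dictionary.get_word_list_iter(word):
        if str(item) != word:
            continue
        articles = item.read_articles()
        if articles:
            instance, definition = articles[0]
            return definition
    return None


class FakeItem(object):
    """Hypothetical stand-in for an SDict word list item."""

    def __init__(self, word, definition):
        self.word = word
        self.definition = definition

    def __str__(self):
        return self.word

    def read_articles(self):
        return [(self, self.definition)]


class FakeDictionary(object):
    """Hypothetical stand-in for a loaded SDictionary."""

    def __init__(self, entries):
        self.entries = entries

    def get_word_list_iter(self, start_word):
        # Assumption: the real iterator yields entries starting with start_word.
        for word, definition in self.entries:
            if word.startswith(start_word):
                yield FakeItem(word, definition)


fake = FakeDictionary([('Catalog', '(n.) A list.'), ('Cat', '(n.) An animal.')])
found = lookup(fake, 'Cat')
missing = lookup(fake, 'Dog')
```

Returning None for a miss leaves it to the caller to decide whether to print a "No definition" message, as the script does.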
Earlier I said that SDict Viewer was a dead project. This is because development has moved to the Aard Dictionary project. I chose not to pursue this format, as most of the existing resources are stored in HTML formats and I needed plain text. This might be ideal for you though, as they also provide access to Wikipedia archives.
So there you have it. Two viable ways of extracting a plain text definition for a word in Python. Best of luck to you!
Posted March 1st, 2010 - Permalink
“May your organs fail before your dreams fail you.”
- The Matches
Little Maggots
on Decomposer
I’ve been learning the Kohana framework for a project at work, and I have to say I really like it. It has a lot of the things I liked about Rails, and it stays out of my way, unlike CakePHP.
I thought I’d highlight my authentication solution, which uses the built-in Auth module and a base controller that I call Site_Controller. Keep in mind that all of my controllers derive from this one.
So, what does it boil down to? Essentially, you set up Auth and my base controller, then in your child controllers you set $access_control to an array of the methods you want protected. Each key is a method name and each value is an access level. A value of “*” means anyone logged in can use the method, a string names a specific role that is required, and an array of role names lets in anyone holding at least one of them. Methods that are not listed are open to everyone. Take a look at the controller, then I’ll show you an example usage.
    <?php

    class Site_Controller extends Template_Controller {

        public $template = 'layout';

        protected $access_control = array();
        protected $access_denied = "/user/login";

        //public $auto_render = false;

        function __construct () {
            parent::__construct();
            $this->session = Session::instance();

            // Check permissions
            if( array_key_exists( router::$method, $this->access_control ) ) {
                if( '*' == $this->access_control[router::$method] ) {
                    if( ! Auth::instance()->logged_in() )
                        url::redirect( $this->access_denied );
                }
                else if( is_array( $this->access_control[router::$method] ) ) {
                    $can_proceed = false;
                    foreach( $this->access_control[router::$method] as $role )
                        if( Auth::instance()->logged_in( $role ) )
                            $can_proceed = true;
                    if( ! $can_proceed )
                        url::redirect( $this->access_denied );
                }
                else {
                    if( ! Auth::instance()->logged_in( $this->access_control[router::$method] ) )
                        url::redirect( $this->access_denied );
                }
            }
        }

        public function __call( $method, $arguments ) {
            $this->template->title = "404";
            $this->template->content = new View( 'errors/404');
        }

    }
Here’s an example controller. In this case anyone can access login, anyone logged in can access index and only logged in admins can access adminsonly.
application/controllers/user.php
    <?php

    class User_Controller extends Site_Controller {

        protected $access_control = array(
            "index" => "*",
            "adminsonly" => "admin"
        );

        function index () {
            $this->template->content = "index";
        }

        function login () {
            $this->template->content = "login";
        }

        function adminsonly () {
            $this->template->content = "admins only";
        }

    }
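For anyone who wants to reason about the rules without a Kohana install, the decision made in the Site_Controller constructor can be restated as a small standalone function. This sketch is in Python rather than PHP and is not part of the Kohana code. The logged_in and roles arguments are hypothetical inputs standing in for what Auth::instance()->logged_in() reports about the current user.

```python
def is_allowed(access_control, method, logged_in, roles=()):
    """Mirror the three cases handled in the Site_Controller constructor.

    access_control maps a method name to '*', a role name, or a list of roles.
    logged_in and roles describe the current user (hypothetical inputs).
    """
    if method not in access_control:
        return True  # unlisted methods are open to everyone
    rule = access_control[method]
    if rule == '*':
        return logged_in  # any logged-in user
    if isinstance(rule, (list, tuple)):
        # any one of the listed roles is enough
        return logged_in and any(role in roles for role in rule)
    return logged_in and rule in roles  # a single required role


# The same rules as User_Controller above.
rules = {'index': '*', 'adminsonly': 'admin'}

anonymous_login = is_allowed(rules, 'login', False)
anonymous_index = is_allowed(rules, 'index', False)
member_admin_page = is_allowed(rules, 'adminsonly', True, ['login'])
admin_admin_page = is_allowed(rules, 'adminsonly', True, ['login', 'admin'])
```

Where this function returns False, the PHP controller redirects to $access_denied instead.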
I haven’t done a ton of testing and it’s not the most robust solution, but I like it and it was easy to write.
Posted February 24th, 2010 - Permalink
“May your coming year be filled with magic and dreams and good madness. I hope you read some fine books and kiss someone who thinks you’re wonderful, and don’t forget to make some art — write or draw or build or sing or live as only you can. And I hope, somewhere in the next year, you surprise yourself.”
- Neil Gaiman
Posted February 18th, 2010 - Permalink