Tag: Python

Premature optimization is the root of all evil. But don’t be stupid.

August 2, 2010 » Geek

There is a relatively prevalent quote in the programming world, bandied about by programmers of all creeds.

“Premature optimization is the root of all evil.”

– Donald Knuth

I agree. What we intuit about optimization is usually wrong. What’s more, until you hit a bottleneck, it’s often a waste of time. Moore has been good to us, and CPU-bound problems aren’t as common as they once were.

That said, I think people should not cling to this. It’s dumb.

Today I was looking for a better way to compute sha1 file hashes in PHP. I know about sha1() obviously, but that’s not how it should be done for files.

Python provides the excellent hashlib module, which lets you update a hash with blocks of data. That makes it well suited to file hashing: you can read data in chunks and update the hash as you go, so the whole file never has to sit in memory at once. Here, have a sample program that reads in 512-byte chunks.
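A minimal sketch of that chunked approach, using only hashlib from the standard library. The function name sha1_of_file and its chunk_size parameter are hypothetical names chosen for this illustration.

```python
import hashlib


def sha1_of_file(path, chunk_size=512):
    """Return the hex SHA-1 digest of a file, reading chunk_size bytes at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays bounded by chunk_size no matter how large the file is, which is the whole point of hashing this way.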

So I quickly found the sha1_file() function. Perfect, I’m sure this reads in chunks, otherwise they would not have bothered to make the function.

Then I scroll down to check the user contributed notes for anything interesting. The top two notes are examples of using this snippet to get a sha1 hash of a file:

I was stunned. The definition of file_get_contents() is as follows:

“Reads entire file into a string.”

Surely they are not suggesting that I read the entire file into memory, as a PHP variable, and then pass it on to the hashing function? Who would think that is a good solution, when there is a perfectly good built-in function to do this for you, in chunks, in a neater fashion?

My only explanation is ignorance, or utter lack of consideration. I gave it a test with these two PHP scripts.

The first version weighed in as using slightly more memory than the second one. About 89k more, actually. And test.jpg is 88k.

Imagine if that picture was a few megs bigger. Imagine if I could somehow trigger this process on your website, over and over again with ab or something. It’s a DoS in the making.

Develop however you want to, just please, please don’t be stupid.

Streaming Tweets With Tweepy

July 5, 2010 » Consume, Geek

I’ve been meaning to check out Tweepy for a while and got around to it today. It’s a Python library for interacting with Twitter. The feature I’m most interested in is the streaming API support, which isn’t advertised much by Tweepy but seems pretty solid.

Tweepy has pretty good documentation, and the code is terse and readable, but what I found most useful was the examples repository, which had the only example of streaming with Tweepy that I could find in the official documentation.

It’s really straightforward. Implement a tweepy.streaming.StreamListener to consume data, set up a tweepy.streaming.Stream with that listener, then pull the trigger on the streaming function you want to use.

Here’s a quick example I set up to track the filter keyword “omaha”.

Auto-Generated Github User Page With py-github

June 29, 2010 » Geek

Update (2010-06-30)

So I got antsy about this and I upgraded to using pystache instead of my homebrew templating system. This was my first run-in with mustache, and I have to say I like it, even though I used the bare minimum feature set.

New code is at http://github.com/jmhobbs/jmhobbs.github.com

Github has a cool feature called “Github Pages” that lets you host static content on a subdomain of github, e.g. http://jmhobbs.github.com/.

They also provide an auto-generator for project pages that has a nice clean format which I really like. So I decided to make my user page match the look and feel of the project pages. To boot, I wanted it to auto-generate so that it is “hands free”; otherwise I’ll forget to update it.

To make this happen I whipped up my template and then grabbed the excellent py-github from Dustin Sallings, which I have used before.

Without further ado I’ll just show you the source. It’s not complicated, just some API calls and then a search and replace on a template file. If you want to use it, be sure to get the most recent version from http://github.com/jmhobbs/jmhobbs.github.com.

Throw in a cron job and you are set. Beware of lots of “page build” notices from Github though.


except AttributeError:
repo_string = repo_string + ‘


pass

repo_string = repo_string + “

\n”

    template = template.replace( '<% repos %>', repo_string )

    ga = """

    """

    if False != settings['google_analytics']:
        template = template.replace( '<% google_analytics %>', ga )
        template = template.replace( '<% ga_code %>', settings['google_analytics'] )
    else:
        template = template.replace( '<% google_analytics %>', '' )

    print( "Writing file..." )
    f = open( 'index.html', 'w' )
    f.write( template )
    f.close()

    print( "Done!" )

if __name__ == "__main__":
    main()

Wow. You actually scrolled through all of that. Amazing.
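If you skipped to the end, the heart of the script is plain placeholder substitution. Here is a minimal sketch of that idea. The `<% name %>` placeholder style comes from the source above, while the render helper, the `user` placeholder and the sample values are hypothetical and only there for illustration.

```python
def render(template, values):
    """Replace every <% key %> placeholder in template with its value."""
    for key, value in values.items():
        template = template.replace("<% " + key + " %>", value)
    return template


# Hypothetical template and values, just to show the substitution.
page = render(
    "<h1><% user %></h1>\n<% repos %>",
    {"user": "jmhobbs", "repos": "<p>example-repo</p>"},
)
```

Placeholders that have no matching key are left in the output untouched, which makes a forgotten value easy to spot in the generated page.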

Python UNIX Sockets

June 14, 2010 » Geek

I’ve been tinkering with using UNIX sockets for IPC from Python and I thought I would share my most basic experiment.

This is a super simple example of client/server usage of a socket. Essentially the server is a blocking command socket that echoes whatever is passed through it.
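A minimal sketch of that kind of echo exchange, squeezed into one file so it can run on its own. It assumes a platform with AF_UNIX support. The function names, the socket file name echo.sock and the use of a background thread for the server are choices made for this illustration, not the layout of the listings that follow.

```python
import os
import socket
import tempfile
import threading


def serve_once(server_sock):
    """Accept one client and echo each chunk back until it disconnects."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:  # an empty read means the client closed its end
                break
            conn.sendall(data)


def echo_roundtrip(message):
    """Start a throwaway echo server, send one message, return the reply."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "echo.sock")  # hypothetical socket path
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)
        server.listen(1)
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()

        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(message)
        reply = client.recv(1024)  # enough for a short message
        client.close()

        worker.join()
        server.close()
    return reply
```

Calling echo_roundtrip(b"hello") should hand back the same bytes. A real server would loop on accept() and clean up a stale socket file before binding.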

Listing: server.py

Listing: client.py

Here is the transcript of me running the client.

And here is the server transcript from that session.

Now all you need is a protocol and you’ll be set for basic IPC.

Looking up words in a Dictionary using Python

March 1, 2010 » Consume, Geek

First off, I do not mean dictionary in the Python sense of the word. I mean dictionary in the glossary sense, like Merriam-Webster. This collision of terminology makes Googling for this functionality particularly difficult and frustrating.

I came across three useful Python solutions, and I’m going to detail usage of two of them in this post.

Option 1: NLTK + Wordnet

First up is accessing Wordnet.

“Wordnet is a large lexical database of English…”

The only Python way of accessing this (that I came across) is NLTK, a set of

“Open source Python modules, linguistic data and documentation for research and development in natural language processing…”

Getting NLTK Installed

For various reasons, NLTK is not packaged by Debian, so I had to install it by hand. Even if your distro does package NLTK, you might want to read this bit anyway. Installing was a cinch with easy_install nltk. However, this does not install the corpus (where wordnet is stored). As shown below:

So what we need to do is run the NLTK installer, as shown here:

Using NLTK + Wordnet

Now that we have everything installed, using wordnet from Python is straightforward.

The output of that is:

Perfect!

Caveats

There are some caveats to using WordNet with NLTK. First is that the definitions aren’t always ordered in the way you would expect. For instance, look at the “cake” results above. Cake, as in the confection, is the third definition, which feels wrong. You can of course order and filter on the synset name to correct this to some degree.
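As a rough sketch of that reordering, the function below sorts synset names so that noun senses named after the word you looked up come first. It works on plain name strings in WordNet's lemma.pos.number form. The function name prefer_exact_nouns and the sample list are hypothetical stand-ins for the names you would collect from your own results.

```python
def prefer_exact_nouns(names, word):
    """Put noun synsets named after word first, keeping the rest in order."""
    def rank(name):
        # Synset names look like "cake.n.03": lemma, part of speech, sense number.
        lemma, pos, _ = name.rsplit(".", 2)
        if lemma == word and pos == "n":
            return 0
        if lemma == word:
            return 1
        return 2
    # sorted() is stable, so ties keep their original relative order.
    return sorted(names, key=rank)


# Illustrative synset names for a lookup of "cake".
names = ["bar.n.03", "patty.n.01", "cake.n.03", "coat.v.03"]
ordered = prefer_exact_nouns(names, "cake")
```

With this sample list, "cake.n.03" moves to the front and the other entries keep their original order.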

Second, there is a major load time for getting WordNet ready to use. Your first call to wordnet.synsets will take considerably longer than the next ones. On my machine the difference was 3.5 seconds versus 0.0003 seconds.

Last, you are constrained to the English language, as analyzed by Princeton. I’ll address this issue in the next section.

Option 2: SDict Viewer

As I said above, using WordNet is simple, but restrictive. What if I want to use a foreign language dictionary or something? WordNet is only in English. This is where the SDict format comes in. It has lots of free resource files available at http://sdict.com/en/. The best existing parser I found was SDict Viewer which is a dead project, but remarkably complete.

SDict Viewer is an application

SDict Viewer is an application, so it’s not an easy-to-install library. However, it is very well written and extracting what you need is simple. You can get my “library” version from http://github.com/jmhobbs/sdictviewer-lib.

Here is an example when it’s all finished:

Here is a sample run:

As you can see, it gives a nice definition (thank you Webster 1913) and then it has a little junk on the end. This is the index cache, a lookup table for finding words faster. You can avoid saving it by calling dictionary.close(False) instead.

Option 3: Aard Format

In option 2 I said that SDict Viewer was a dead project. That is because development has moved to the Aard Dictionary project. I chose not to pursue this format, as most of the existing resources are stored in HTML formats and I needed plain text. It might be ideal for you though, as they also provide access to Wikipedia archives.

All Done

So there you have it. Two viable ways of extracting a plain text definition for a word in Python. Best of luck to you!