
Premature optimization is the root of all evil. But don't be stupid.

There is a relatively prevalent quote in the programming world, bandied about by programmers of all creeds.

"Premature optimization is the root of all evil."

- Donald Knuth

I agree. What we intuit about optimization is usually wrong. And what's more, until you hit a bottleneck, it's often a waste of time. Moore has been good to us, and CPU bound problems aren't as common as they once were.

That said, I don't think people should cling to this quote as an excuse to stop thinking. That's dumb.

Today I was looking for a better way to compute sha1 file hashes in PHP. I know about sha1() obviously, but that's not how it should be done for files.

Python provides the excellent hashlib, which lets you update a hash with blocks of data. It's great for file hashing, because you can read the data in chunks and update the hash as you go, avoiding reading the whole file into memory at once. Here, have a sample program that reads in 512-byte chunks.

>>> import hashlib
>>> hash = hashlib.sha1()
>>> with open( 'test.jpg', 'rb' ) as handle:
...     while True:
...             data = handle.read(512)
...             if not data:
...                     break
...             hash.update( data )
... 
>>> hash.hexdigest()
'9e8d5ee361c6988baf7f75999f2c854a765f3eca'
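
For what it's worth, PHP can do the same incremental dance through its hash extension. Here's a minimal sketch along those lines, assuming the same test.jpg and the same 512-byte chunk size:

<?php
// Build the SHA-1 incrementally, 512 bytes at a time,
// so the whole file never sits in memory as one string.
$context = hash_init( 'sha1' );
$handle = fopen( 'test.jpg', 'rb' );
while( ! feof( $handle ) ) {
    hash_update( $context, fread( $handle, 512 ) );
}
fclose( $handle );
echo hash_final( $context );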

So I quickly found the sha1_file() function. Perfect. I'm sure this reads in chunks; otherwise they would not have bothered to make the function.
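
Which makes the whole job a one-liner; something like this, with my test image:

$hash = sha1_file( 'test.jpg' );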

Then I scrolled down to check the user-contributed notes for anything interesting. The top two notes were examples of using this snippet to get a sha1 hash of a file:

$hash = sha1( file_get_contents( $file ) );

I was stunned. The definition of file_get_contents() is as follows:

"Reads entire file into a string."

Surely they are not suggesting that I read the entire file into memory, as a PHP variable, and then pass it on to the hashing function? Who would think that is a good solution when there is a perfectly good built-in function to do it for you, in chunks, in a neater fashion?

My only explanation is ignorance, or an utter lack of consideration. I gave it a test with two quick PHP scripts, roughly the pair sketched below.
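
Something along these lines, with memory_get_peak_usage() standing in for the exact measurement I printed:

<?php
// Version one: slurp the entire file into a string, then hash it.
$hash = sha1( file_get_contents( 'test.jpg' ) );
echo memory_get_peak_usage() . "\n";

<?php
// Version two: let sha1_file() handle the reading.
$hash = sha1_file( 'test.jpg' );
echo memory_get_peak_usage() . "\n";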

The first version weighed in using slightly more memory than the second one. About 89k more, actually. And test.jpg is 88k.

Imagine if that picture were a few megs bigger. Imagine if I could somehow trigger this process on your website, over and over again with ab or something. It's a DoS in the making.

Develop however you want to, just please, please don't be stupid.