Fundamentals: Hashing

Nov 20, 2011

What the hash?

This post is about a fundamental programming concept, hashing.

"A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys."
Hash function
Wikipedia - 2011/11/15

So, a hashing function can turn this:

Hello, my name is John Hobbs. I like hashes.

into this:

8ced81aa42cd91930024054f26ed92ea

Doesn't seem super useful, does it? The content looses its semantic value. 8ced81aa42cd91930024054f26ed92ea is not the same as a personal introduction.

That's okay though, we can find a use for a hash that won't be seen by a human.

But first, let's try our hand at writing a hash function.

Mapping to keys

So, the principal function of a hashing algorithm is to reduce a large block of data into a smaller block of data.

Let's set up some arbitrary limits for developing our function.

For simplicity, let's work at hashing values into a one byte integer, so eight bits.

Let's also focus on only hashing ASCII strings, since ASCII characters are conveniently one byte long.

Now that that is established, let me propose a hashing function in pseudo-code:

let hash = 0
for each character in string:
  let hash = hash XOR character

For those of you not familiar with XOR, I mean exclusive or. This means comparing the bytes bit by bit. If the bits are the same, the result is zero. If they are different, the result is one. The common notation for XOR is ^. Here's an example of 5 ^ 117:

  00000101 =   5
^ 01110101 = 117
----------------
  01110000 = 112

Working at a byte by byte level like this ensures that our hash will not exceed 8 bits in length, or an integer value of 128.

Buckets

With our shiny little hash function in mind, let's use find a use.

One great use is lookup tables. These are sometimes called dictionaries, associated arrays, hash tables, etc.

Basically, they map a key to a value. So you tell a table that "my name" is "john", and then later, if you ask it for "my name" is tells you "john". Great tool for amnesiacs.

Hashing is perfect here for faster lookups, and for giving bound memory performance.

Think of it as a finite line of buckets. When you want to store something you hand it to the walrus, and say "Please keep this fish in a bucket called 'Salmon'."

Now, the walrus only has so many buckets, and none of them are labeled "Salmon".

So, he uses a special walrus method (a hashing function) that deterministically says "Salmon == 5", and puts the fish in the fifth bucket.

Later, when you come back and ask for the fish from the "Salmon" bucket, he knows to get the fish out of the fifth bucket because he can derive that "Salmon == 5" using the same method as before.

Unfortunately, he ate the fish already, so there is nothing to give you

Walrus are selfish that way.

Implementation

Now that we have a use case, let's implement our hashing function.

I chose JavaScript here.

function hash ( string ) {
  var c = 0;
  for( var i = 0; i < string.length; i++ ) {
    c = c ^ string[i].charCodeAt();
  }
  return c;
};

Let's give it a try:

console.log( hash( "hello" ) );
// prints "98"
console.log( hash( "goodbye" ) );
// prints "125"

Cool! Now there is just a little more glue to make this complete. We have to have a get and a set for values, and an array to store them in.

var SimpleHashTable = function () {

  // these are the buckets
  this.table = new Array(128);

  // hand it to the walrus
  this.set = function ( key, value ) {
    this.table[this.hash( key )] = value;
  };

  // get it from the walrus
  this.get = function ( key ) {
    return this.table[this.hash( key )];
  };

  // walrus math
  this.hash = function ( string ) {
    var c = 0;
    for( var i = 0; i < string.length; i++ ) {
      c = c ^ string[i].charCodeAt();
    }
    return c;
  };

};

The get and set functions are pretty similar. You hash the key, and use that integer as the index to either set, or retrieve a value in the table array.

Let's give it a try!

var ht = new SimpleHashTable();
ht.set( "salutations", "Hello, good sir." );
console.log( ht.get( "salutations" ) );
// prints "Hello, good sir."

Perfect!

Collision course

Now let's use our SimpleHashTable to store some stuff!

ht.set( "salutations", "Hello, good sir." );
console.log( ht.get( "salutations" ) );
// prints "Hello, good sir."
ht.set( "q", "BARF!" )
console.log( ht.get( "salutations" ) );
// prints "BARF!"

Wait, what the heck just happened?!

It shouldn't have printed the value for q, we asked for salutations!

This is known as a collision, and it happens.

What happened is that the hash value we compute for q is the same value as the hash for salutations. It's bound to happen, we only have 128 possible buckets and an infinite amount of possible keys.

So we have several ways to improve this situation.

One option is more buckets.

We can also improve the hashing function, but this only buys us fewer, more predictable collision rates.

Another common fix is a second level to each bucket keeping the exact key, value tuple, and doing a scan on collisions.

That would be like the walrus putting a sticky note that says "Salmon" on the fish before tossing it into the bucket. Then, when you ask for the "Salmon", he rifles through the bucket, checking each fish until he finds one labeled "Salmon".

I'll leave these as an exercise for the reader, because it's time we got moving on.

Real hashes

Okay, so now you have an example of a simple hash you could use for table lookups. That's cool and everything, but in general we should leave details like that to the neckbeards. It's good to know roughly how it works, but we don't need to do it ourselves.

What else are hashes good for? Well, there is a specie of hash called a cryptographic hash. These hashes are cryptographic primitives, with strong promises of distribution and collision avoidance.

Examples you may have come across are MD5, or the SHA family of algorithms.

Implementation

These algorithms are considerably more complex than the hash function we created earlier, so I would urge you not to write your own except as an exercise.

Implementations are available in a variety of languages, so Google wisely.

Uses

This class of hash has several every day uses that apply to developer-types.

Data Integrity

Since hashes are deterministic functions, the output can be used as a fingerprint for the input. Passing data and the hash of that data allows for others to quickly check and make sure that the data has not become corrupted or tampered with in transit.

This assumes a secure channel for the hash, so use it wisely.

If you are on a Linux machine, you can quickly get the fingerprint for a file. In this example I'll use the sha1 algorithm to verify the Redis 2.4.2 source code.

jmhobbs@Cordelia:~/Downloads$ sha1sum redis-2.4.2.tar.gz 
d2c9288dcfe16b4718e39a0f8ad7b21f4ffc6de0  redis-2.4.2.tar.gz
jmhobbs@Cordelia:~/Downloads$

Comparing that hash to the Redis website, it looks like my download completed without errors. Huzzah!

Secret Verification

Another common use of hashing is to verify secrets, without storing the secret in the clear.

Say I was writing a system login, and I needed to use usernames and passwords.

I know it's bad to store passwords in plaintext on my server. What if my server gets hacked and the user table is stolen?

I can't store my passwords in the clear, but I need to reliably verify them.

Hashes make for a great solution to this problem.

You hash the password and store it in your database. Then, when a user wants to login, you hash the password they provide and compare it to the one in the database.

If they match, you log them in.

Doing this properly is a bit more complicated than that, so if you need to work with passwords like this, I recommend you read up.

Data Authenticity

We can also use hashes to confirm the authenticity of a message.

Say we both know a secret phrase, and no one else does.

If you want to send a message to me, and want it to be provably from you, you can send me the message along with a hash of the message concatenated to the secret.

Then, when I get the message, I also take the hash of the message plus the secret and compare it to your provided hash.

If they match, then I know it's from you.

This is encapsulated in something called a Hashed Message Authentication Code, or HMAC.

File De-duplication

One final case that I regularly use is file de-duplication.

If you host uploads for a website, it's a nice trick to store upload meta data (user, original name, upload time, etc.) separate from the file itself.

By de-coupling these you can hash files on upload, and if you already have a copy of that content, discard it and point the meta data at the existing file.

Here is an example in pseudo-code:

let hash = sha1_file( uploaded_file.tmp_path )
file = uploads.find_by_hash( hash )
if not file.loaded():
  file.hash = hash
  file.save()
return file.id

Conclusion

Hashes are useful tools for programmers, so learn the concepts and the applications.

I hope you learned something from this post! How do you use hashes?