Homoglyph Substitution for URLs

Aug 29, 2014

At Pack we use ASCII-based unique identifiers in URLs a lot. We call them slugs. Dogs have them, users have them, breeds have them, and so on.

I made the decision early on to keep the slugs plain old ASCII. No Unicode. These are primarily for URLs, and I wanted them easy to type. Most slugs in the system are automatically generated: they are derived from the name when a dog or user is created. This is a problem, because there are a lot of people in the world who use characters outside of the ASCII set.

Usually, the solution is just to drop non-ASCII characters. This is the simplest option, and it works. For example, Designer News uses this technique. In the case of John Henry Müller, they simply drop the ü because of the umlaut, giving him the user URL of https://news.layervault.com/u/11655/john-henry-mller/. Müller becomes mller. I find this less than optimal.
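A minimal sketch of the drop-everything approach, written for Python 3. The name drop_non_ascii is hypothetical, and this is not necessarily how Designer News does it. It also assumes the ü arrives as a single precomposed character; a decomposed "u" plus combining diaeresis would keep its "u".

```python
def drop_non_ascii(text):
    """Discard every character that falls outside the ASCII range."""
    # errors='ignore' silently skips anything that cannot be encoded as ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')


print(drop_non_ascii('Müller').lower())  # mller
```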

A second technique is to use homoglyph substitution. A homoglyph is a character which is visually similar to another, to the point that they are difficult to quickly distinguish with just a glance. I'm familiar with them from the world of phishing, where people register domains that look very similar to other domains by using homoglyphs.
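To see why homoglyphs are so effective for phishing, here is a small Python 3 illustration. The domain is a made-up example: the two strings render almost identically, but one contains a Cyrillic "а" where the other has a Latin "a".

```python
import unicodedata

latin_a = 'a'
cyrillic_a = '\u0430'  # looks like a Latin "a" in most fonts

# Hypothetical lookalike domain built by swapping in the Cyrillic letter.
real_domain = 'example.com'
fake_domain = 'ex' + cyrillic_a + 'mple.com'

print(real_domain == fake_domain)    # False
print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A
```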

Once you build a list of homoglyphs, it's easy to create ASCII-only slugs through substitution. We expanded the definition of homoglyph for our list to include anything you could squint at funny and think looked similar. The method is a bit brute force, but it only ever runs once per string, and I think the outcome is worth it.

# -*- coding: utf-8 -*-

# Each entry pairs an ASCII letter with the characters that collapse to it.
UNICODE_ASCII_HOMOGLYPHS = (
    ('a', u'AaÀÁÂÃÄÅàáâãäåɑΑαаᎪＡａĄĀāĂăąÀÁÂÃÄÅàáâãäå'),
    # .... (one entry per remaining letter)
    ('z', u'ZzΖᏃＺｚŹźŻżŽž'),
)


def replace_homoglyphs(string):
    '''If a string is unicode, replace all of the unicode homoglyphs with ASCII equivalents.'''
    # type(u'') is `unicode` on Python 2 and `str` on Python 3, so byte
    # strings pass through untouched on either version.
    if isinstance(string, type(u'')):
        for ascii_char, homoglyphs in UNICODE_ASCII_HOMOGLYPHS:
            for homoglyph in homoglyphs:
                string = string.replace(homoglyph, ascii_char)
    return string

This works well for us: we get reasonable URLs for dogs like "Hólmfríður frá Ólafsfjordur". holmfriour-fra-olafsfjordur is not the same, but it's close enough for a URL that you don't mind, and it's better than using hlmfrur-fr-lafsfjordur.

Hólmfríður frá Ólafsfjordur
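Putting the pieces together, slug generation might look something like the Python 3 sketch below. The three-entry table is a cut-down stand-in for the full list, slugify is a hypothetical name, and the lowercasing and hyphenation steps are assumptions about the rest of the pipeline rather than Pack's actual code.

```python
import re

# Hypothetical, heavily abbreviated table: just enough for this one name.
HOMOGLYPHS = (
    ('a', 'AÁá'),
    ('i', 'IÍí'),
    ('o', 'OÓóÐð'),
)


def replace_homoglyphs(text):
    """Swap each lookalike character for its ASCII stand-in."""
    for ascii_char, lookalikes in HOMOGLYPHS:
        for lookalike in lookalikes:
            text = text.replace(lookalike, ascii_char)
    return text


def slugify(name):
    """Build an ASCII-only, hyphen-separated slug from a display name."""
    text = replace_homoglyphs(name).lower()
    # Anything the table did not cover is dropped, as in the simple approach.
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Collapse every run of non-alphanumerics into a single hyphen.
    return re.sub(r'[^a-z0-9]+', '-', text).strip('-')


print(slugify('Hólmfríður frá Ólafsfjordur'))  # holmfriour-fra-olafsfjordur
```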

Unfortunately, this doesn't work well for languages that aren't written in the Latin alphabet, notably Asian languages, with names such as "クッキー". In this case, the system breaks down and we end up with no usable slug, so we build from a default. I'm still seeking a solution for that. Maybe I should use automatic translation on it.
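The fallback can be sketched roughly like this in Python 3. The function name slug_or_default and the default value 'dog' are placeholders, not what Pack actually uses, and a real system would still need to make the resulting slug unique.

```python
import re


def slug_or_default(name, default='dog'):
    """Return an ASCII slug for name, or default when nothing usable is left."""
    ascii_only = name.encode('ascii', 'ignore').decode('ascii').lower()
    slug = re.sub(r'[^a-z0-9]+', '-', ascii_only).strip('-')
    # An empty string is falsy, so a name with no ASCII at all gets the default.
    return slug or default


print(slug_or_default('クッキー'))  # dog
print(slug_or_default('Rex'))       # rex
```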