“Extracting Data From Nasty HTML” or “How To Be Frickin Awesome”

January 23, 2007

Whoa! Post 100 for this little blog. Although at times it’s been a little weak on content, I think there have been enough good posts to outweigh the weak ones. Besides, this site is more for me than for anyone else.

As this is post 100, I’m required by a dubious interpretation of a little-known Norwegian law to list my favorite posts so far. So here they are, slightly categorized.

Mildly Useful

Wordy And Thoughtful

Now to the meat of this post!
When I was first hired at UNO, I was given the transfer articulation site as a project. What the site basically does is keep track, from year to year, of what each class at a number of other schools is equivalent to here at UNO. I wrote it pretty quickly, and they’ve been slowly adding data by hand for a few months now.

It’s a lot of data to enter too. So far they only have one year of one school done. The old system was a series of static HTML pages, so they didn’t think they could load it into the new system. I didn’t agree fully, because although the pages were poorly written and differed from year to year, they had a standard table layout on each one. I got to work on the idea of extracting the old data and loading it into the new database.

The first thing to do was create a syntactically correct file. Here’s a sample of part of one of the files:

Nasty, all-caps and they didn’t even close the tags. Ugly, ugly.

Luckily I knew of a secret weapon: HTML Tidy!
When I ran the file through it with the appropriate flags, I got this lovely version of the code:

Running the tidy command on every file one at a time would be crazy, so I wrote up a short batch file to hit every single .html file with the tidy love. Please excuse the nasty one-lined-ness of it.
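
For anyone who wants the same loop outside of a batch file, here is a rough Python sketch of the idea. It is not the original one-liner, and the flags are my assumption about what "appropriate" might look like: -q keeps tidy quiet, -asxhtml converts to well-formed XHTML, and -m rewrites the file in place. The function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed flags, not necessarily the ones used on the original files:
#   -q        suppress non-essential output
#   -asxhtml  convert to well-formed XHTML (lowercase, closed tags)
#   -m        modify the input file in place
TIDY_FLAGS = ["-q", "-asxhtml", "-m"]


def tidy_command(path, flags=TIDY_FLAGS):
    """Build the argument list for a single tidy invocation."""
    return ["tidy", *flags, str(path)]


def tidy_all(directory):
    """Run tidy over every .html file in a directory and return the cleaned paths."""
    if shutil.which("tidy") is None:
        raise RuntimeError("the tidy executable was not found on PATH")
    cleaned = []
    for path in sorted(Path(directory).glob("*.html")):
        # tidy exits with 1 for warnings and 2 for errors, so a non-zero
        # exit code alone does not mean the file was left untouched.
        result = subprocess.run(tidy_command(path))
        if result.returncode < 2:
            cleaned.append(path)
    return cleaned
```

Because -m overwrites the files, it is probably wise to run something like this against a copy of the old pages rather than the originals.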

Okay, so now that we’ve got that beautified file, we need to parse it. To make life easier, I used PHP’s strip_tags() to strip out every tag except the table tags. I then used a regex to throw away anything that wasn’t inside the <table> tags and created my SimpleXML object from what was left over.
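
The three steps above can be sketched roughly like this. This is a Python stand-in rather than the PHP I described: a regex plays the part of strip_tags(), and xml.etree.ElementTree plays the part of SimpleXML. It assumes the tidied page holds one table that matters, has no nested tables, and contains no named entities such as &nbsp; (ElementTree would choke on those). The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Tags to keep. Everything else is dropped, roughly like PHP's
# strip_tags($html, '<table><tr><th><td>').
KEEP = {"table", "tr", "th", "td"}
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def keep_table_tags(html):
    """Remove every tag except the table ones; attributes are discarded too."""
    def replace(match):
        closing, name = match.group(1), match.group(2).lower()
        return f"<{closing}{name}>" if name in KEEP else ""
    return TAG_RE.sub(replace, html)


def first_table(html):
    """Return the first <table>...</table> span, or None if there is none."""
    match = re.search(r"<table>.*?</table>", html, re.DOTALL)
    return match.group(0) if match else None


def table_rows(tidied_html):
    """Return the table as a list of rows, each a list of cell strings."""
    table = first_table(keep_table_tags(tidied_html))
    if table is None:
        return []
    root = ET.fromstring(table)
    return [[(cell.text or "").strip() for cell in row]
            for row in root.iter("tr")]
```

Stripping the inline tags first is what makes the last step easy: once a cell contains nothing but text, each cell's text attribute holds the whole value and there is no nested markup to walk.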

A few loops and a whole lot of boring processing later and it’s all in the DB. But that’s the gist of the system and it’s frickin tight.
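
For the curious, the boring part looks something like the sketch below. It is only an illustration in Python with SQLite: the equivalency table, its column names, and the rule that a usable row has exactly two non-empty cells are all assumptions, not the real schema.

```python
import sqlite3


def load_equivalencies(conn, school, year, rows):
    """Insert two-cell rows into a hypothetical equivalency table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS equivalency (
               school TEXT,
               year INTEGER,
               transfer_course TEXT,
               local_course TEXT)"""
    )
    # Skip header, spacer, and malformed rows: keep only rows with
    # exactly two non-empty cells (an assumed layout).
    clean = [(school, year, row[0], row[1])
             for row in rows
             if len(row) == 2 and all(row)]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO equivalency VALUES (?, ?, ?, ?)", clean
        )
    return len(clean)
```

Using parameter placeholders instead of pasting the cell text into the SQL string matters here, since hand-written pages can contain quotes and other surprises.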

Categories: Geek