
"Extracting Data From Nasty HTML" or "How To Be Frickin Awesome"

Whoa! Post 100 for this little blog. Although at times it's been a little weak on content, I think there have been enough good posts to outweigh the weak ones. Besides, this site is more for me than for anyone else.

As this is post 100, I'm required by a dubious interpretation of a little known Norwegian law to list my favorite posts so far. So here they are, slightly categorized.

Mildly Useful
http://www.velvetcache.org/2006/09/28/usb-apps/
http://www.velvetcache.org/2006/09/17/firefox-plugins/
http://www.velvetcache.org/2006/10/02/simple-php-caching/
http://www.velvetcache.org/2007/01/19/cleaning-up-e-books/
http://www.velvetcache.org/2007/01/08/renaming-your-the-folders/
http://www.velvetcache.org/2007/01/08/pop-can-bookmarkers/

Wordy And Thoughtful
http://www.velvetcache.org/2006/09/15/i-cant-be-your-john-cusack/
http://www.velvetcache.org/2006/11/01/rediscovery/
http://www.velvetcache.org/2006/09/19/complexity-vs-redundancy/
http://www.velvetcache.org/2006/09/20/linux-on-the-desktop/
http://www.velvetcache.org/2006/09/07/facebook-apis/
http://www.velvetcache.org/2006/10/03/moniker-junkie/

Now to the meat of this post! When I was first hired at UNO, I was given the transfer articulation site as a project. What they basically do is keep track, from year to year, of what each class at a number of other schools is equivalent to here at UNO. I wrote it pretty quickly, and they've been slowly adding data by hand for a few months now.

It's a lot of data to enter, too. So far they only have one year of one school done. The old system was a series of static HTML pages, so they didn't think they could load it into the new system. I didn't fully agree: although the pages were poorly written and differed from year to year, each one had a standard table layout. So I got to work on extracting the old data and loading it into the new database.

The first thing to do was create a syntactically correct file. Here's a sample of part of one of the files:

...
<TR>
<TD>Max.
Transfer Hours
Allowed
<TD>Comments
<TR>
<TD>ARCH1300
<TD>Architectural Desktop I ...
Nasty, all-caps and they didn't even close the tags. Ugly, ugly.

Luckily I knew of a secret weapon: HTML Tidy! When I ran the file through with the appropriate flags, I got this lovely version of the code:

...
<tr>
  <td>Comments</td>
</tr>
<tr>
  <td>ARCH1300</td>
  <td>Architectural Desktop I</td>
</tr>
...

Okay, running the tidy command on every file one at a time would be crazy, so I wrote up a short batch file to hit every single .html file with the tidy love. Please excuse the nasty one-lined-ness of it.

FOR %%f IN (*.html) DO tidy --char-encoding utf8 --clean yes --doctype strict --escape-cdata yes --indent auto --indent-attributes no --join-classes yes --output-xhtml yes --show-errors 99 --tidy-mark no --wrap 0 -m "%%f"

Okay, now that we've got that beautified file, we need to parse it. To make life easier I stripped out all the tags except the table tags with PHP's strip_tags(). Then I used a regex to throw away everything outside the <table> tags and created my SimpleXML object with what was left over.

$filename = $_REQUEST['dir'].'/'.$_REQUEST['file'];
$handle = fopen($filename, "r");
$contents = fread($handle, filesize($filename));
fclose($handle);

// Keep only the table tags, everything else gets stripped.
$contents = strip_tags($contents, '<table><tr><td>');

// Grab just the <table> ... </table> chunk.
if(preg_match('/<table(.*)<\/table>/si', $contents, $matches)) {
  print 'Will attempt to load. ';
  $contents = $matches[0];
  $xml = new SimpleXMLElement($contents);
  if(!$xml) {
    print 'Error! XML Didn\'t read right.';
    exit();
  }
  print 'Load complete. Checking for expected structure. ';
  if(trim($xml->tr[0]->td[0]) != 'Course') {
    print 'Error! Structure not as expected. Dumping. ';
    var_dump($xml);
    exit();
  }
}

A few loops and a whole lot of boring processing later, it's all in the DB. But that's the gist of the system, and it's frickin tight.
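
For the curious, those loops boil down to something like this minimal sketch. The PDO connection, the equivalencies table, and the column order are all stand-ins for illustration; the real schema and processing are a bit more involved:

// Hypothetical connection and table name, just to show the shape of the loop.
$db = new PDO('mysql:host=localhost;dbname=articulation', 'user', 'pass');
$insert = $db->prepare('INSERT INTO equivalencies (course, title, max_hours, comments) VALUES (?, ?, ?, ?)');

foreach($xml->tr as $row) {
  // Skip the header row we sanity-checked above.
  if(trim($row->td[0]) == 'Course') {
    continue;
  }
  $insert->execute(array(
    trim($row->td[0]), // e.g. ARCH1300
    trim($row->td[1]), // e.g. Architectural Desktop I
    trim($row->td[2]), // max transfer hours allowed
    trim($row->td[3])  // comments
  ));
}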