"Extracting Data From Nasty HTML" or "How To Be Frickin Awesome"

Jan 23, 2007

Whoa! Post 100 for this little blog. Although at times it's been a little weak on content, I think there have been enough good ones to outweigh them. Besides, this site is more for me than for anyone else.

As this is post 100, I'm required by a dubious interpretation of a little known Norwegian law to list my favorite posts so far. So here they are, slightly categorized.

Mildly Useful http://www.velvetcache.org/2006/09/28/usb-apps/ http://www.velvetcache.org/2006/09/17/firefox-plugins/ http://www.velvetcache.org/2006/10/02/simple-php-caching/ http://www.velvetcache.org/2007/01/19/cleaning-up-e-books/ http://www.velvetcache.org/2007/01/08/renaming-your-the-folders/ http://www.velvetcache.org/2007/01/08/pop-can-bookmarkers/

Wordy And Thoughtful http://www.velvetcache.org/2006/09/15/i-cant-be-your-john-cusack/ http://www.velvetcache.org/2006/11/01/rediscovery/ http://www.velvetcache.org/2006/09/19/complexity-vs-redundancy/ http://www.velvetcache.org/2006/09/20/linux-on-the-desktop/ http://www.velvetcache.org/2006/09/07/facebook-apis/ http://www.velvetcache.org/2006/10/03/moniker-junkie/

Now to the meat of this post! When I was first hired at UNO I was given the transfer articulation site as a project. What they basically do is keep track from year to year what each class at a number of schools is equivalent to here at UNO. I wrote it pretty quick, and they've been slowly adding data by hand for a few months now.

It's a lot of data to enter too. So far they only have one year of one school done. The old system was a series of static HTML pages, so they didn't think they could load it into the new system. I didn't agree fully, because although the pages were poorly written and differed from year to year, they had a standard table layout on each one. I got to work on the idea of extracting the old data and loading it into the new database.

The first thing to do was create a syntactically correct file, here's a sample of part of one of the files:

... Max.
Transfer Hours
Allowed

Comments

ARCH1300

Architectural Desktop I ...

Nasty, all-caps and they didn't even close the tags. Ugly, ugly.

Luckily I knew of a secret weapon, HTML Tidy! When run through with the appropriate flags I got this lovely version of the code:

...

Comments

ARCH1300

Architectural Desktop I

....

Okay, so running the tidy command on every file one at a time would be crazy, so I wrote up a short batch file to hit every single .html file with the tidy love. Please excuse the nasty one-lined-ness of it.

FOR %%f IN ("*.html") DO tidy %%f --char-encoding utf8 --clean yes --doctype strict --escape-cdata yes --indent auto --indent-attributes no --join-classes yes --output-xhtml yes --show-errors 99 --tidy-mark no --wrap 0 -m "%%f"

Okay, so now that we've got that beautified file, we need to parse it. To make life easier I stripped all the other tags out except for the table tags with PHP's strip_tags(). I then regexed out anything that wasn't inside the <table> tags and created my SimpleXML object with what was left over.

$filename = $_REQUEST['dir'].'/'.$_REQUEST['file'];
// Remove the space in fopen, it's a Wordpress thing. :(
$handle = f open($filename, "r");
$contents = fread($handle, filesize($filename));
fclose($handle);
$contents = strip_tags($contents,'');

if(preg_match('/([.\r\n]*)<\/table>/si', $contents, $matches)) {
	print 'Will attempt to load. ';
	$contents = $matches[0];
	
	$xml = new SimpleXMLElement($contents);
	
	if(!$xml) {
		print 'Error! XML Didn\'t read right.';
		exit();
	}
		
	print 'Load complete.  Checking for expected structure. ';
	
	if(trim($xml->tr[0]->td[0]) != 'Course') {
		print 'Error! Structure not as expected. Dumping.
';
		var_dump($xml);
		exit();
	}
}
A few loops and a whole lot of boring processing later and it's all in the DB.  But that's the gist of the system and it's frickin tight.
PHPProgrammingWorkGeek

    
    
      
        0xF79C72E6EDC70E38 -
        About
        
          Color:
          
          
          
          
          
          
          Mode: