Cleaning Up E-Books

January 19, 2007

I have a large number of ebooks in Microsoft’s .lit format. My Nokia 770 doesn’t have any software to read a .lit format book. In fact, I can’t say I’ve ever seen a .lit reader other than Microsoft’s own.

What I have seen is the nifty and very useful ConvertLIT, which I use to down-convert the files into plain HTML. I don’t even bother with the images. The problem is that the output tends to be formatted in a hideous fashion. To fix that, I came up with a nice combo of HTML Tidy and a Perl script.

Here’s my command line for tidy. Beware: this will modify your original copy!

Here is my Perl script. It just runs the file through some regexes and writes the result to the same filename with “NEW” appended. I also made a nice little progress bar because I was bored.
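For readers who just want the general shape of such a script, here is a minimal sketch in Python rather than the original Perl. The rules in `RULES` and the names `clean_line`, `progress_bar`, and `clean_file` are made up for illustration; the actual script's patterns are not reproduced here, so treat this as one possible approach under those assumptions.

```python
import re
import sys

# Hypothetical cleanup rules; the original script's patterns are not shown.
RULES = [
    (re.compile(r"&nbsp;", re.IGNORECASE), " "),  # non-breaking space entities
    (re.compile(r"[ \t]{2,}"), " "),              # runs of spaces or tabs
]

def clean_line(line):
    """Run one line through every rule in order."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line

def progress_bar(done, total, width=40):
    """Return a text bar such as [##------] 25%."""
    filled = width * done // total if total else width
    percent = 100 * done // total if total else 100
    return "[" + "#" * filled + "-" * (width - filled) + "] " + str(percent) + "%"

def clean_file(path):
    """Write a cleaned copy of path to path + 'NEW' and return the new name."""
    with open(path, encoding="utf-8", errors="replace") as src:
        lines = src.readlines()
    out_path = path + "NEW"
    with open(out_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(lines, start=1):
            dst.write(clean_line(line))
            # "\r" redraws the bar on the same terminal line
            sys.stderr.write("\r" + progress_bar(i, len(lines)))
    sys.stderr.write("\n")
    return out_path
```

Calling `clean_file("book.html")` would leave the input untouched and write the result to `book.htmlNEW`, which keeps the original around in case a rule does more damage than good.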

You can download it here, but be careful with it.

Update (01/21/07)
That Perl script has a line $_ =~ s/ //mi; which doesn’t really make much sense looking at it now. I’m thinking $_ =~ s/\s\s+/ /mi; as a replacement. Also, for some reason the server throws a 500 error when you try to get that file. I’m working on it.
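To see what the \s\s+ pattern does, here is a small Python sketch (again a stand-in for the Perl, with a made-up sample line). A Perl substitution without the /g modifier only replaces the first match on a line, which `count=1` imitates below; the default `re.sub` behaviour corresponds to adding /g. Note also that \s matches newlines, so trailing spaces followed by a line break get folded into a single space.

```python
import re

line = "It  was   a dark\t\tnight."

# count=1 mimics a Perl substitution without /g: only the first run changes
first_only = re.sub(r"\s\s+", " ", line, count=1)

# Default re.sub replaces every run, like adding /g in Perl
all_runs = re.sub(r"\s\s+", " ", line)

# \s also matches "\n", so trailing spaces plus the newline become one space
joined = re.sub(r"\s\s+", " ", "end of line  \n")

print(repr(first_only))
print(repr(all_runs))
print(repr(joined))
```

If the goal is to squeeze every run of whitespace on a line, the /g form is probably what is wanted; restricting the pattern to spaces and tabs would keep the line breaks intact.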

