I have a large number of ebooks in Microsoft’s .lit format. My Nokia 770 doesn’t have any software to read a .lit format book. In fact, I can’t say I’ve ever seen a .lit reader other than Microsoft’s own.
What I have seen is the nifty and very useful ConvertLIT, which I use to down-convert the files into plain HTML. I don’t even bother with the images. The problem is that the files tend to come out formatted in a hideous fashion. To clean them up, I came up with a nice combo of HTML Tidy and a Perl script.
Here’s my command line for tidy. Beware: because of --write-back yes, this will modify your original copy!
tidy --bare yes --clean yes --drop-font-tags yes --drop-proprietary-attributes yes --enclose-text yes --output-xhtml yes --word-2000 yes --tidy-mark no --write-back yes TARGETFILENAME.htm
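Since that command overwrites its input, one way to stay safe is to copy each file before tidy touches it. The following is only a rough Perl sketch of that idea, assuming tidy is on your PATH. The helper name backup_file and the .orig suffix are made up for illustration, and the option list is simply the one from the command above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: copy FILE to FILE.orig and return the backup's name.
# Refuses to clobber a backup left over from an earlier run.
sub backup_file
{
    my ($file) = @_;
    my $backup = "$file.orig";
    die "$backup already exists, not overwriting it\n" if -e $backup;
    copy($file, $backup) or die "Can't copy $file to $backup: $!\n";
    return $backup;
}

# The same options as the tidy command line above
my @tidy_options = qw(
    --bare yes --clean yes --drop-font-tags yes
    --drop-proprietary-attributes yes --enclose-text yes
    --output-xhtml yes --word-2000 yes --tidy-mark no --write-back yes
);

# Back up, then tidy, every file named on the command line
foreach my $file (@ARGV)
{
    backup_file($file);
    system('tidy', @tidy_options, $file);
}
```

Run it with one or more .htm files as arguments. Each file gets a .orig copy next to it before tidy rewrites it in place.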
Here is my Perl script. It just runs the file through some regexes and writes the result to a file with the same name prefixed with “NEW”. I also made a nice little progress bar because I was bored.
```perl
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # Unbuffer STDOUT so the progress bar redraws as it goes

@ARGV or die "Usage: $0 FILENAME\n";
my $file = $ARGV[0];                                        # Name the file

open(my $in, '<', $file) or die "Can't read $file: $!\n";   # Open the file
my @lines = <$in>;                                          # Read it into an array
close($in);                                                 # Close the file

my $counter = 0;
my $size    = @lines / 50;    # Lines per progress bar segment

open(my $out, '>', "NEW$file") or die "Can't write NEW$file: $!\n";

foreach (@lines)
{
    $counter++;

    # Redraw the progress bar every 50 lines and on the last line
    if (0 == ($counter % 50) || $counter == @lines)
    {
        print "\rProcessing: [";
        for (my $i = 0; $i < ($counter / $size); $i++)
        {
            print "+";
        }
        for (my $i = 0; $i < (49 - ($counter / $size)); $i++)
        {
            print "-";
        }
        print "]";
    }

    # Empty paragraph removal
    $_ =~ s/<p>\s*<\/p>//mi;

    if ($_ =~ m/^\s*\n$/)
    {
        # If the line is just a newline or newline and spaces, scrap it.
        $_ = '';
    }
    else
    {
        # Remove excess spaces
        $_ =~ s/ //mi;
        # I get these a lot... (the character between the slashes is a soft hyphen)
        $_ =~ s/­//mi;
    }

    print $out $_;
}

close($out);
print "\n";
```
You can download it here, but be careful with it.
cleaner.pl.txt
Update (01/21/07)
That Perl script has a line $_ =~ s/ //mi; which doesn’t really make much sense looking at it now. I’m thinking $_ =~ s/\s\s+/ /mi; as a replacement. Also, for some reason the server throws a 500 error when you try to get that file. I’m working on it.
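A quick sketch of how that replacement would behave, assuming the goal is to collapse every run of extra whitespace on a line. Two things here go beyond the pattern quoted above and are assumptions: the /g flag (without it only the first run on each line is replaced) and the [ \t] character class (since \s also matches the newline, a line ending in spaces would lose its line break). The sub names and sample strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of two or more spaces or tabs
# into a single space. Newlines are left alone.
sub collapse_spaces
{
    my ($line) = @_;
    $line =~ s/[ \t]{2,}/ /g;
    return $line;
}

# The pattern from the update, for comparison: no /g, and \s matches "\n".
sub collapse_first_run
{
    my ($line) = @_;
    $line =~ s/\s\s+/ /;
    return $line;
}

print collapse_spaces("<p>Some   text  with    gaps</p>\n");
# prints "<p>Some text with gaps</p>" followed by a newline

print collapse_first_run("<p>Some   text  with    gaps</p>\n");
# prints "<p>Some text  with    gaps</p>": only the first run is collapsed
```

Note that with the \s version, a line such as "end  \n" comes back as "end " with its newline swallowed, which would glue it to the following line in the output file.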