Cleaning Up E-Books

I have a large number of ebooks in Microsoft’s .lit format. My Nokia 770 doesn’t have any software to read a .lit format book. In fact, I can’t say I’ve ever seen a .lit reader other than Microsoft’s own.

What I have seen is the nifty and very useful ConvertLIT, which I use to down-convert the files into plain HTML. I don’t even bother with the images. The problem is that the files tend to come out hideously formatted. To clean them up, I came up with a nice combo of HTML Tidy and a Perl script.

Here’s my command line for tidy. Beware: this will modify your original copy!

tidy --bare yes --clean yes --drop-font-tags yes --drop-proprietary-attributes yes --enclose-text yes --output-xhtml yes --word-2000 yes --tidy-mark no --write-back yes TARGETFILENAME.htm
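
Since that command overwrites the file in place, it may be worth keeping an untouched copy first. Here is a minimal sketch of one way to do that. The function name tidy_with_backup and the .orig suffix are my own placeholders, not part of tidy; the tidy options are the same ones used above.

```shell
# Hypothetical helper: copy the file aside, then let tidy rewrite it in place.
tidy_with_backup() {
  target="$1"
  # Refuse to clobber a backup from an earlier run.
  if [ -e "$target.orig" ]; then
    echo "backup $target.orig already exists" >&2
    return 1
  fi
  # Keep an untouched copy next to the original (the .orig suffix is an assumption).
  cp -p -- "$target" "$target.orig" || return 1
  # Same options as the command line above; --write-back modifies $target itself.
  tidy --bare yes --clean yes --drop-font-tags yes \
       --drop-proprietary-attributes yes --enclose-text yes \
       --output-xhtml yes --word-2000 yes --tidy-mark no \
       --write-back yes "$target"
}
```

Calling it as tidy_with_backup TARGETFILENAME.htm should leave TARGETFILENAME.htm.orig untouched, so a bad cleanup can be undone by copying it back.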

Here is my Perl script. It just runs the file through some regexes and writes the result to a new file with the same name prefixed with “NEW”. I also made a nice little progress bar because I was bored.

#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0] or die "Usage: $0 FILE\n";   # Name the file
open(my $in, '<', $file) or die "Cannot read $file: $!\n";   # Open the file
my @lines = <$in>;    # Read it into an array
close($in);           # Close the file

my $counter = 0;
my $size = @lines / 50;   # Lines represented by each of the 50 bar segments

open(my $out, '>', "NEW$file") or die "Cannot write NEW$file: $!\n";
$| = 1;   # Unbuffer STDOUT so the progress bar redraws immediately
foreach (@lines) {
  $counter++;
  # Redraw the bar every 50 lines and once more on the last line
  if (0 == ($counter % 50) || $counter == @lines) {
    print "\rProcessing: [";
    for (my $i = 0; $i < ($counter / $size); $i++) {
      print "+";
    }
    for (my $i = 0; $i < (49 - ($counter / $size)); $i++) {
      print "-";
    }
    print "]";
  }

  # Empty paragraph removal
  $_ =~ s/<p>\s*<\/p>//mi;
  if ($_ =~ m/^\s*\n$/) {
    # If the line is just a newline or newline and spaces, scrap it.
    $_ = '';
  }
  else {
    # Remove excess spaces
    $_ =~ s/  //mi;
    # I get these a lot, so strip every soft hyphen on the line
    $_ =~ s/&shy;//gmi;
  }
  print {$out} $_;
}
close($out);
print "\n";

You can download it here, but be careful with it.
cleaner.pl.txt

Update (01/21/07)
That Perl script has a line, $_ =~ s/  //mi;, which doesn’t make much sense looking at it now: it just deletes the first pair of spaces on each line. I’m thinking $_ =~ s/\s\s+/ /mi; for a replacement. Also, for some reason the server throws a 500 error when you try to get that file. I’m working on it.

Posted January 19th, 2007 - Permalink
Categories: Geek
Copyright © 2006 - 2010 John Hobbs