OCR of scanned pages on Ubuntu 10.04

Just for future reference:

  1. Scan images at 300 dpi (a lower resolution might also work, but 300 dpi is fine). For one sample page, this produced a 2348×3129 pixel image in which the baseline-to-baseline line height was around 50 pixels and capital letters were around 30 pixels tall.
  2. Install ocropus 0.3.1-2 from the Ubuntu mirror. Other Ubuntu releases may ship other ocropus versions.
  3. Run
    ocroscript recognize image.png > image.html
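
For more than one scanned page, the command in step 3 can be wrapped in a loop. This is only a minimal sketch, assuming the scans are PNG files sitting in one directory; `ocr_all` and the `OCR_CMD` override are made-up names for illustration and are not part of ocropus.

```shell
#!/bin/sh
# Sketch: run "ocroscript recognize" (step 3) over every PNG in a directory.
# OCR_CMD is a hypothetical override, not an ocropus feature; it only lets
# the loop be tried out on a machine without ocropus installed.
ocr_all() {
    dir="${1:-.}"
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no PNG files.
        [ -e "$img" ] || continue
        # Same invocation as in step 3, writing one HTML file per page.
        "${OCR_CMD:-ocroscript}" recognize "$img" > "${img%.png}.html"
    done
}
```

Calling `ocr_all scans` would then write `scans/page1.html` next to `scans/page1.png`, and so on for each page.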

