1. August 21, 2012

      OCR a scanned PDF with Tesseract

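      Simple really: I wanted to OCR a scanned PDF, then embed the output text back into the PDF so that I can search it. Surprisingly, I couldn’t find an existing application for this, so here’s my script. It takes the path to the PDF as its only argument, keeps a backup copy with a .bak suffix, and needs pdftk, ImageMagick (convert) and Tesseract installed: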

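      # Keep a backup, since the PDF is rewritten in place
      cp "$1" "$1.bak"
      # Read the page count from pdftk's metadata dump ([0-9], because sed -E has no \d)
      pages=$(pdftk "$1" dump_data | grep NumberOfPages | sed -E 's/.*: ([0-9]+)/\1/')
      for i in $(seq 1 "$pages"); do
              # ImageMagick numbers pages from zero
              convert -monochrome -density 600 "$1[$((i - 1))]" "page$i.tif"
              # tesseract appends .txt to the output base name
              tesseract "page$i.tif" output -l eng
              # Attach the recognized text to the matching page
              pdftk "$1" attach_files output.txt to_page "$i" output "$1.new"
              mv "$1.new" "$1"
              rm output.txt
      done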
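
The page count is the fiddliest part of the script, so here is a minimal sketch of that step on its own. It assumes pdftk's dump_data report contains a line of the form `NumberOfPages: N`. The function name `page_count` is hypothetical, and awk stands in for the grep and sed pair purely for readability. The input is canned, so no PDF or pdftk is needed to try it.

```shell
#!/bin/sh
# page_count: hypothetical helper, not part of pdftk.
# Reads dump_data-style text on stdin and prints the value that follows
# "NumberOfPages: " (assumed format of pdftk's report).
page_count() {
    awk -F': ' '$1 == "NumberOfPages" { print $2; exit }'
}

# Canned sample of dump_data-style output; a real run would pipe in
# the output of: pdftk "$1" dump_data
printf 'InfoKey: Title\nInfoValue: Scan\nNumberOfPages: 12\n' | page_count
```

In the script it would be used as `pages=$(pdftk "$1" dump_data | page_count)`. If no matching line is found the function prints nothing, so checking that `$pages` is non-empty before looping is a cheap safeguard.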