1. August 21, 2012

      OCR a scanned PDF with Tesseract

      Simple, really: I wanted to OCR a scanned PDF and then embed the recognised text back into the PDF so that I could search it. Surprisingly, I couldn’t find an existing application for this, so here’s my script:


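      # Keep a backup, since the original file is overwritten below.
      cp "$1" "$1.bak"

      # Read the page count from pdftk's metadata dump.
      pages=$(pdftk "$1" dump_data | grep NumberOfPages | sed -E 's/.*: ([0-9]+)/\1/')

      for i in $(seq 1 "$pages"); do
        # ImageMagick numbers pages from 0, so page i is index i-1.
        convert -monochrome -density 600 "$1[$((i - 1))]" "page$i.tif"
        # Tesseract writes the recognised text to output.txt.
        tesseract "page$i.tif" output -l eng
        # Attach the text to the matching page of the PDF.
        pdftk "$1" attach_files output.txt to_page "$i" output "$1.new"
        mv "$1.new" "$1"
        rm output.txt
      done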
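
The page-count step is the easiest part to get wrong, so here is a minimal sketch of it on its own. The `page_count` helper name and the sample text are made up for illustration; the sample only imitates the `NumberOfPages: N` line that `pdftk dump_data` prints. It matches digits with `[0-9]` rather than `\d`, which sed's extended regular expressions do not treat as a digit class.

```shell
# Hypothetical helper: reads pdftk dump_data-style text on stdin
# and prints only the number from the NumberOfPages line.
page_count() {
  sed -n -E 's/^NumberOfPages: ([0-9]+).*/\1/p'
}

# Made-up sample standing in for real `pdftk file.pdf dump_data` output.
sample='InfoBegin
InfoKey: Creator
NumberOfPages: 12'

printf '%s\n' "$sample" | page_count   # prints 12
```

With a real file this would presumably be used as `pages=$(pdftk "$1" dump_data | page_count)`.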
