21 August, 2012

OCR a scanned PDF with Tesseract

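Simple really: I wanted to OCR a scanned PDF, then embed the output text back into the PDF so that I can search it. Surprisingly, I couldn't find an existing application for this, so here's my script. It takes the PDF as its only argument and attaches each page's recognized text to that page as a file attachment: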

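#!/bin/sh

# Keep a backup, since the PDF is modified in place.
cp "$1" "$1.bak"

# Page count from pdftk's metadata dump (written to stdout when no output file is named).
pages=$(pdftk "$1" dump_data | sed -n 's/^NumberOfPages: *\([0-9][0-9]*\).*/\1/p')

for i in $(seq 1 "$pages")
do
        # ImageMagick page indices start at 0, so page $i is index $i - 1.
        convert -monochrome -density 600 "$1[$((i - 1))]" "page$i.tif"
        # Tesseract writes the recognized text to output.txt.
        tesseract "page$i.tif" output -l eng
        # Attach the text to page $i, then swap the new PDF into place.
        pdftk "$1" attach_files output.txt to_page "$i" output "$1.new"
        mv "$1.new" "$1"
        rm output.txt
done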
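
The page-count step is the easiest part to get subtly wrong, and it can be checked without a PDF to hand. Here is a minimal sketch, assuming the metadata dump contains a line of the form "NumberOfPages: N". The helper name page_count is my own, not part of pdftk, and the sample input is made up for illustration:

```shell
#!/bin/sh

# Hypothetical helper: read dump_data-style text on stdin and print
# the number from the "NumberOfPages: N" line (nothing if it is absent).
page_count() {
        sed -n 's/^NumberOfPages: *\([0-9][0-9]*\).*/\1/p'
}

# Made-up sample input; a real metadata dump has many more fields.
printf 'InfoKey: Title\nNumberOfPages: 12\n' | page_count
```

In the script proper this would be used as pages=$(pdftk "$1" dump_data | page_count). Using [0-9] rather than \d keeps the pattern within what POSIX sed understands.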