Bulk OCRing mixed content and exporting as PDF
This is written more as an aide-memoire for myself than anything else. It’s the process I’m currently using to bulk-process a set of documents in various formats (MS Word, PPT, PDF, LibreOffice etc.): convert them all to PDF, run OCR on any embedded images, then load the end result into Elasticsearch via Tika (that final step isn’t covered here; there’s plenty of documentation for it elsewhere).
At some point I’ll write up a script to do this in one click but for now, someone else might find this useful…
-
start libreoffice document converter
unoconv --listener
-
convert any non-PDF documents to PDF
find input/ -type f | xargs file | \
  grep -v PDF | awk -F ':' '{print $1}' | \
  xargs unoconv -f pdf -o output/
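One caveat with this step: `grep -v PDF` looks at the whole line that `file` prints, file name included, so a Word document with "PDF" in its name would be skipped. Below is a minimal sketch of an alternative check, assuming it is good enough to look for the `%PDF-` marker at the very start of each file. That is a swapped-in technique, not what `file` does internally, and `is_pdf` and the demo files are made-up names.

```shell
# Hypothetical helper: treat a file as a PDF if it starts with the "%PDF-" marker.
is_pdf() {
    [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Demo on throwaway files so nothing real is touched.
demo_pdf=$(mktemp -d)
printf '%%PDF-1.4\n' > "$demo_pdf/report PDF.bin"
printf 'plain text\n' > "$demo_pdf/notes PDF.txt"

# List only the files that would still need converting.
find "$demo_pdf" -type f | while IFS= read -r f; do
    is_pdf "$f" || printf 'needs converting: %s\n' "$f"
done
```

The `while read` loop also copes with spaces in file names, which plain `xargs` does not.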
-
document converter can now be killed
-
pull in all the rest of the documents
find input/ -type f | xargs file | grep PDF | \
  awk -F ':' '{print $1}' | \
  xargs -I{} sh -c 'cp {} $(echo {} | \
  sed -re "s/bin/pdf/g" | sed -re "s/input/output/g")'
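The two `sed` substitutions in the copy step replace every occurrence of `bin` and `input` in the path, not just the extension and the top-level directory. The sketch below shows what that does to a couple of made-up paths, next to a stricter variant. It assumes the `bin` substitution is there to turn a `.bin` extension into `.pdf`; `as_written` and `anchored` are hypothetical helper names.

```shell
# The substitutions from the copy step, applied to a single path.
as_written() {
    printf '%s\n' "$1" | sed -e 's/bin/pdf/g' -e 's/input/output/g'
}

# Hypothetical stricter variant: only the leading directory and the extension change.
anchored() {
    printf '%s\n' "$1" | sed -e 's/\.bin$/.pdf/' -e 's/^input\//output\//'
}

as_written "input/abc123.bin"    # output/abc123.pdf
as_written "input/cabinet.bin"   # output/capdfet.pdf
anchored   "input/cabinet.bin"   # output/cabinet.pdf
```

If the files are all named after hashes this never bites, since a hex string cannot contain `bin` or `input`.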
-
ocrmypdf via docker is required … native install tested & doesn’t behave on CentOS 7
docker pull jbarlow83/ocrmypdf
-
generate a script that runs ocrmypdf on each PDF
find pdf/ -type f | xargs -I{} sh -c \
  'echo docker run --rm -i jbarlow83/ocrmypdf --redo-ocr - - \<{} \>$(echo {} | sed -re "s/pdf/ocr/g") >> ./run-ocr.sh'
chmod +x run-ocr.sh
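To see what ends up in `run-ocr.sh` without touching Docker, here is a rough sketch that generates the script from a throwaway directory holding one made-up file. It differs from the command above in two ways, both assumptions on my part: paths are quoted so spaces survive, and the script is written with `>` so a rerun replaces it. Note that `s/pdf/ocr/g` rewrites both the directory and the extension, so `pdf/a.pdf` becomes `ocr/a.ocr`.

```shell
demo_ocr=$(mktemp -d)
(
    cd "$demo_ocr" || exit 1
    mkdir -p pdf ocr            # ocr/ has to exist before the generated script runs
    : > "pdf/a.pdf"             # made-up placeholder input

    find pdf/ -type f | while IFS= read -r src; do
        dst=$(printf '%s\n' "$src" | sed -e 's/pdf/ocr/g')
        printf 'docker run --rm -i jbarlow83/ocrmypdf --redo-ocr - - <"%s" >"%s"\n' "$src" "$dst"
    done > run-ocr.sh           # ">" rather than ">>": a rerun replaces the script

    chmod +x run-ocr.sh
    cat run-ocr.sh
)
```

Nothing here runs OCR; it only prints the generated lines so they can be checked first.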
-
CentOS docker permissions need fixed. Then run the script and make a coffee while you wait …
newgrp dockerroot
./run-ocr.sh
-
remove all the zero-length PDFs: these are typically PDFs that contain interactive elements / forms. We’re not interested in these.
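There is no command noted for this step, so here is a minimal sketch using `find`. It assumes the OCR output sits in an `ocr/` directory (as the generated script implies) and uses a throwaway directory with made-up files for the demo. `-delete` is a GNU/BSD `find` extension, not POSIX.

```shell
# Throwaway demo directory standing in for the real ocr/ output.
demo_empty=$(mktemp -d)
mkdir -p "$demo_empty/ocr"
: > "$demo_empty/ocr/form.ocr"                    # zero-length: OCR produced nothing
printf 'not empty\n' > "$demo_empty/ocr/ok.ocr"   # stands in for a real result

# -size 0 matches empty files only; -print lists each file as it is removed.
find "$demo_empty/ocr" -type f -size 0 -print -delete
```

On the real data that would be `find ocr/ -type f -size 0 -print -delete`; dropping `-delete` first gives a dry run.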
Files generated can now be processed via ES document pipeline
Couple more steps I’m using myself.
-
My ‘upstream step’ has a DB of all file hashes and stuff. Never mind me, I’m just renaming files to include a date/time. Anyone else can skip this step.
SELECT sha1 AS old,
       CONCAT(DATE_FORMAT(dt_start, '%Y%m%d%H%i%s'), '_', sha1) AS new
FROM resources
INNER JOIN events ON resources.event_id = events.id
WHERE resource_type = 1
  AND sha1 IS NOT NULL;
-
run query and create script
mysql -u root -p gem < query.sql > output.txt
cat output.txt | awk '{print "mv", $1 ".ocr", $2 ".pdf"}' > rename.sh
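One thing to watch for: when `mysql` reads from a redirected file it prints a header row with the column names, so the first line of `rename.sh` comes out as `mv old.ocr new.pdf`. Here is a small sketch of skipping that row in `awk`, run against made-up sample data instead of the real `output.txt`; `make_rename_script` is a hypothetical name. Passing `-N` (`--skip-column-names`) to `mysql` should have the same effect.

```shell
# Hypothetical helper: turn tab-separated "old new" rows into mv commands,
# skipping the header row that mysql prints in batch mode.
make_rename_script() {
    awk 'NR > 1 {print "mv", $1 ".ocr", $2 ".pdf"}'
}

# Made-up stand-in for output.txt: a header row plus one data row.
printf 'old\tnew\nabc123\t20200101120000_abc123\n' | make_rename_script
# prints: mv abc123.ocr 20200101120000_abc123.pdf
```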
-
check file before running rename, then concatenate