Bulk OCRing mixed content and exporting as PDF

This is written more as an aide-memoire to myself than anything else. It’s the process I’m currently using to bulk-process a set of documents in various formats (MS Word, PPT, PDF, LibreOffice etc.): convert them all to PDF, run OCR on any embedded images, then load the end result into Elasticsearch via Tika (that final step isn’t covered here; there’s plenty of documentation for it elsewhere).

At some point I’ll write up a script to do this in one click but for now, someone else might find this useful…

  1. start libreoffice document converter

    unoconv --listener
  2. convert any non-PDF documents to PDF

    find input/ -type f | xargs file | \
      grep -v PDF | awk -F ':' '{print $1}' | \
      xargs unoconv -f pdf -o output/
  3. document converter can now be killed

  4. pull in all the rest of the documents

    find input/ -type f | xargs file | grep PDF | \
      awk -F ':' '{print $1}' | \
      xargs -I{} sh -c 'cp {} $(echo {} | \
      sed -re "s/bin/pdf/g" | sed -re "s/input/output/g")'
  5. ocrmypdf needs to be run via Docker … I tested a native install and it doesn’t behave on CentOS 7

    docker pull jbarlow83/ocrmypdf
  6. make a script for ocrmypdf

    find pdf/ -type f | xargs -I{} sh -c \
      'echo docker run --rm -i jbarlow83/ocrmypdf --redo-ocr - - \<{} \>$(echo {} | sed -re "s/pdf/ocr/g") >> ./run-ocr.sh'
    chmod +x run-ocr.sh
  7. CentOS Docker permissions need fixed. Then run the script and make a coffee while you wait …

    newgrp dockerroot
    ./run-ocr.sh
  8. remove all the zero-length PDFs: these are typically PDFs that contain interactive elements / forms. We’re not interested in these.
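
Step 8 has no command attached, so here is a minimal sketch of one way to do it. It assumes the OCR output landed under `ocr/`, which is what the `sed` rewrite in step 6 suggests; that directory and the helper name `remove_empty_ocr` are assumptions, not part of the original process.

```shell
# Hypothetical helper: delete zero-length files under a directory.
# The default directory "ocr" is an assumption based on step 6's sed rewrite.
remove_empty_ocr() {
  dir="${1:-ocr}"
  # -size 0 matches only empty files; -print lists each one as it is removed.
  find "$dir" -type f -size 0 -print -delete
}
```

Calling `remove_empty_ocr` with no argument cleans `ocr/`; pass another directory if the output went elsewhere. Swap `-delete` for nothing (keeping `-print`) to preview what would go first.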

The generated files can now be processed via the ES document pipeline.

Couple more steps I’m using myself.

  1. My ‘upstream step’ has a DB of all file hashes and stuff. Never mind me, I’m just renaming files to include a date/time. Anyone else can skip this step.

    SELECT sha1 AS old,
      CONCAT(DATE_FORMAT(dt_start, '%Y%m%d%H%i%s'), '_', sha1) AS new
      FROM resources
      INNER JOIN events
        ON resources.event_id = events.id
      WHERE resource_type = 1
        AND sha1 IS NOT NULL;
  2. run query and create script

    # -N drops the column-name header so it doesn't turn into a bogus mv line
    mysql -N -u root -p gem < query.sql > output.txt
    awk '{print "mv", $1 ".ocr", $2 ".pdf"}' output.txt > rename.sh
  3. check file before running rename, then concatenate
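
For the "check file" step, one possible check is to confirm that every source file named in `output.txt` actually exists before running `rename.sh`. This is only a rough sketch: the helper name `check_renames` and its second argument (the directory holding the `.ocr` files) are illustrative assumptions, and it expects two whitespace-separated columns as produced by the query above.

```shell
# Hypothetical helper: report rows of the list whose source .ocr file is missing.
# $1 = list file (old hash, new name per line), $2 = directory with the .ocr files.
check_renames() {
  list="$1"
  dir="${2:-.}"
  missing=0
  while read -r old new; do
    # Skip blank lines and the "old new" header row, if one is present.
    [ -z "$old" ] && continue
    [ "$old" = "old" ] && continue
    if [ ! -f "$dir/$old.ocr" ]; then
      echo "missing: $dir/$old.ocr"
      missing=$((missing + 1))
    fi
  done < "$list"
  echo "$missing missing"
  # Succeed only when every source file was found.
  [ "$missing" -eq 0 ]
}
```

Something like `check_renames output.txt . && sh rename.sh` would then only run the renames when nothing is missing.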