OCR processor for PDF web filings from Companies House

Getting copies of annual returns and company information from Companies House is easy. Searching the data in those returns isn’t quite so easy.

Companies House (CH) uses a PDF format (PDF/A, akin to a fax) that ensures maximum compatibility. However, the text itself is not searchable.

The following script, when run against a PDF downloaded from Companies House, will therefore:

  1. Split the PDF into one TIFF image per page (using Ghostscript)
  2. OCR each page (using tesseract)
  3. Merge the OCR'd pages back into a single searchable PDF (using Ghostscript)
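
Both steps rely on external tools being installed, so it can help to check for them before starting. The following is a minimal sketch of such a check, not part of the script itself; missing_tools is a hypothetical helper name, and it assumes the tools are invoked as gs and tesseract, as they are in the script.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): print the name of
# every listed command that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call for the two tools the script uses:
#   missing_tools gs tesseract
```

If the call prints nothing, both tools were found; otherwise it lists whatever still needs installing.
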
#!/bin/bash

# Adapted from https://apple.stackexchange.com/a/171594
set -e
set -f

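# Accept both absolute and relative paths to the input PDF
case "$1" in
    /*) fullpath="$1" ;;
    *)  fullpath="$(pwd)/$1" ;;
esac
directory=$(basename "$fullpath")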

name="${directory%.*}"
output="${name}_searchable.pdf"

if [ -d "$name" ] ; then
    echo "Error, directory $name already exists, exiting"
    exit 1
fi
if [ -f "$output" ] ; then
    echo "Error, file $output already exists, exiting"
    exit 1
fi

mkdir "$name"
cd "$name"

echo "Splitting..."
gs -dSAFER -dNOPAUSE -r400x391 -sDEVICE=tiffg4 -o out_%04d.tiff -dBATCH -f "$fullpath"

# Note: -s (sorted traversal) is a BSD/macOS find option
find -s . -type f -name '*.tiff' -print0 | while IFS= read -r -d '' f; do
    echo "Processing ${f}"
    tesseract "$f" "${f%.*}" -l eng --psm 3 pdf
    rm "$f"
done

echo "Merging..."
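# The page list is left unquoted on purpose so it splits into one argument
# per file; this is safe because the generated names contain no spaces.
gs -dCompatibilityLevel=1.4 \
   -dNOPAUSE -dBATCH -q \
   -sDEVICE=pdfwrite -sOutputFile="../${output}" \
   $(find -s . -type f -name '*.pdf' -print0 | tr '\0' ' ')
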
cd ..
rm -rf "${name}"
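
The script handles one filing at a time and exits with an error if its output file already exists. To process a whole folder of downloads, a small wrapper along the following lines may be useful. This is only a sketch: ch_ocr.sh is an assumed file name for the script above, and searchable_name and ocr_all are hypothetical helpers.

```shell
#!/bin/sh
# Sketch only: ch_ocr.sh is an assumed file name for the script above.

# Mirror the script's naming rule: drop the directory and the last
# extension, then append _searchable.pdf.
searchable_name() {
    base=$(basename "$1")
    echo "${base%.*}_searchable.pdf"
}

# Run the script on each argument, skipping earlier outputs and any
# filing that already has a searchable copy in the current directory.
ocr_all() {
    for pdf in "$@"; do
        case "$pdf" in
            *_searchable.pdf) continue ;;
        esac
        if [ -f "$(searchable_name "$pdf")" ]; then
            echo "Skipping $pdf (already processed)"
            continue
        fi
        ./ch_ocr.sh "$pdf"
    done
}
```

Calling ocr_all *.pdf from the directory holding the downloads would then process each filing once, and the run can be repeated safely after an interruption.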