OCR processor for PDF web filings from Companies House
Getting copies of annual returns and company information from Companies House is easy. Searching the data in those returns isn’t quite so easy.
CH use a PDF format (PDF/A, akin to fax) that ensures maximum compatibility. However, the text itself is not searchable.
When run against a PDF downloaded from Companies House, the script below therefore does the following:
- Split the PDF into one TIFF image per page (using Ghostscript)
- OCR each page image (using Tesseract)
- Merge the OCRed pages back into a single PDF (using Ghostscript)
#!/bin/bash
# Adapted from https://apple.stackexchange.com/a/171594
set -e
set -f
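# Accept both absolute and relative paths to the input PDF
case "$1" in
    /*) fullpath="$1" ;;
    *) fullpath="$(pwd)/$1" ;;
esac
filename=$(basename "$fullpath")
name="${filename%.*}"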
output="${name}_searchable.pdf"
if [ -d "$name" ] ; then
    echo "Error, directory $name already exists, exiting" >&2
    exit 1
fi
if [ -f "$output" ] ; then
    echo "Error, file $output already exists, exiting" >&2
    exit 1
fi
mkdir "$name"
cd "$name"
echo "Splitting..."
# Render each page to its own Group 4 (fax-style) TIFF: out_0001.tiff, out_0002.tiff, ...
gs -dSAFER -dNOPAUSE -r400x391 -sDEVICE=tiffg4 -o out_%04d.tiff -dBATCH -f "$fullpath"
# OCR each page image into a single-page PDF, then delete the image
find . -type f -name '*.tiff' -print0 | sort -z | while IFS= read -r -d '' f; do
    echo "Processing ${f}"
    tesseract -l eng --psm 3 "$f" "${f%.*}" pdf
    rm "$f"
done
echo "Merging..."
# Concatenate the single-page PDFs in page order (the zero-padded names sort correctly)
gs -dCompatibilityLevel=1.4 \
    -dNOPAUSE -dQUIET -dBATCH \
    -sDEVICE=pdfwrite -sOutputFile="../${output}" \
    $(find . -type f -name '*.pdf' | sort)
cd ..
rm -rf "${name}"
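To run this over a whole folder of filings, a small wrapper along the following lines may help. It is only a sketch: it assumes the script above has been saved as `./ocr-ch.sh` and made executable (that file name is hypothetical), and the helper names `searchable_name`, `missing_tools` and `ocr_all` are invented here for illustration.

```shell
#!/bin/bash
# Sketch of a batch wrapper. OCR_SCRIPT is a hypothetical path: point it at
# wherever the script above was actually saved.
OCR_SCRIPT="${OCR_SCRIPT:-./ocr-ch.sh}"

# Print the output file name the script above uses for a given PDF:
# the base name without its extension, plus "_searchable.pdf".
searchable_name() {
    local base
    base=$(basename "$1")
    printf '%s_searchable.pdf\n' "${base%.*}"
}

# Print the name of each listed command that is not on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# OCR every PDF given as an argument, skipping those already processed.
ocr_all() {
    local missing pdf
    missing=$(missing_tools gs tesseract)
    if [ -n "$missing" ]; then
        # $missing is left unquoted on purpose so the names print on one line
        echo "Missing required tools:" $missing >&2
        return 1
    fi
    for pdf in "$@"; do
        if [ -f "$(searchable_name "$pdf")" ]; then
            echo "Skipping $pdf, already processed"
            continue
        fi
        "$OCR_SCRIPT" "$pdf"
    done
}

# Do nothing unless at least one PDF was named on the command line.
if [ "$#" -gt 0 ]; then
    ocr_all "$@"
fi
```

Saved as, say, `ocr-all.sh` (again a made-up name), it would be run from the folder containing the downloads as `./ocr-all.sh *.pdf`. Checking for `gs` and `tesseract` up front avoids the script stopping halfway and leaving a working directory behind.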