OCR

Getting copies of annual returns and company information from Companies House is easy. Searching the data in those returns isn’t quite so easy. CH use a PDF format (PDF/A, akin to fax) that ensures maximum compatibility.

This is written more as an aide-memoire for myself than anything else. It’s a process I’m currently using to bulk-process a set of documents in various formats (MS Word, PPT, PDF, LibreOffice etc.): convert them all to PDF, run OCR on any embedded images, and then load the end result into Elasticsearch via Tika (that final step isn’t documented here, as there is plenty of documentation for it elsewhere).
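As a rough sketch of the convert-then-OCR part of that process, the Python below plans the commands for one file at a time. The tools are assumptions, since the excerpt does not name any: LibreOffice's `soffice` in headless mode for the conversion and OCRmyPDF for the OCR pass. The extension list, directory names and function names are likewise made up for illustration. The Elasticsearch/Tika step is left out.

```python
import subprocess
from pathlib import Path

# Assumed list of office formats that need converting to PDF first.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".odt", ".odp"}


def convert_command(source: Path, out_dir: Path) -> list[str]:
    """Build a LibreOffice headless command that converts one file to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def ocr_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build an OCRmyPDF command; --skip-text leaves pages that already have text alone."""
    return ["ocrmypdf", "--skip-text", str(pdf), str(out_dir / pdf.name)]


def plan(source: Path, pdf_dir: Path, ocr_dir: Path) -> list[list[str]]:
    """Return the commands needed to turn one source file into a searchable PDF."""
    suffix = source.suffix.lower()
    if suffix == ".pdf":
        # Already a PDF: only the OCR pass is needed.
        return [ocr_command(source, ocr_dir)]
    if suffix in OFFICE_EXTENSIONS:
        # LibreOffice writes <stem>.pdf into the output directory.
        converted = pdf_dir / (source.stem + ".pdf")
        return [convert_command(source, pdf_dir), ocr_command(converted, ocr_dir)]
    return []  # unknown type: skip rather than guess


def run_all(in_dir: Path, pdf_dir: Path, ocr_dir: Path) -> None:
    """Run the planned commands for every file in in_dir, stopping on the first failure."""
    pdf_dir.mkdir(parents=True, exist_ok=True)
    ocr_dir.mkdir(parents=True, exist_ok=True)
    for source in sorted(in_dir.iterdir()):
        for command in plan(source, pdf_dir, ocr_dir):
            subprocess.run(command, check=True)
```

Calling `run_all(Path("incoming"), Path("converted"), Path("searchable"))` would process a whole folder, provided both tools are installed and on the `PATH`. Keeping the planning separate from the running makes it easy to print the commands and check them before committing to a long batch.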


OCR processor for PDF web filings from Companies House

Bulk OCRing mixed content and exporting as PDF