Getting copies of annual returns and company information from Companies House is easy. Searching the data in those returns isn’t quite so easy. CH use a PDF format (PDF/A, akin to a fax) that ensures maximum compatibility.
This is written more as an aide-memoire to myself than anything else. It’s the process I’m currently using to bulk-process a set of documents in various formats (MS Word, PPT, PDF, LibreOffice etc.): convert them all to PDF, run OCR on any embedded images, and then load the end result into Elasticsearch via Tika. That final step isn’t documented here; there is plenty of documentation elsewhere for it.
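As a rough illustration of the convert-then-OCR part of the process above, here is a minimal Python sketch. The tool choices are my assumptions, not something stated above: LibreOffice's `soffice` command line for the conversion to PDF and `ocrmypdf` for the OCR pass. The function name `plan_commands`, the suffix list and the directory layout are hypothetical too. It only builds the commands for one file, so it can be inspected before anything is actually run.

```python
from pathlib import Path

# Assumption: these are the office formats worth sending through LibreOffice.
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx", ".odt", ".odp"}


def plan_commands(src: Path, work_dir: Path, out_dir: Path) -> list[list[str]]:
    """Return the shell commands (as argument lists) needed to turn one
    source document into a searchable PDF in out_dir."""
    suffix = src.suffix.lower()
    commands: list[list[str]] = []

    if suffix == ".pdf":
        # Already a PDF, so it goes straight to the OCR step.
        pdf = src
    elif suffix in OFFICE_SUFFIXES:
        # Assumed tool: LibreOffice in headless mode writes <stem>.pdf to --outdir.
        commands.append(
            ["soffice", "--headless", "--convert-to", "pdf",
             "--outdir", str(work_dir), str(src)]
        )
        pdf = work_dir / (src.stem + ".pdf")
    else:
        # Unknown format: nothing to do in this sketch.
        return []

    # Assumed tool: ocrmypdf; --skip-text leaves pages that already have text alone.
    commands.append(["ocrmypdf", "--skip-text", str(pdf), str(out_dir / pdf.name)])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` in order. Keeping the planning separate from the execution makes it easy to do a dry run over a whole directory first and see which files would be skipped.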