Getting copies of annual returns and company information from Companies House is easy. Searching the data in those returns isn't quite so easy. CH use a PDF format (PDF/A, akin to fax) that ensures maximum compatibility.
I had a small project to display some simple stats for some static content sitting in an AWS S3 bucket. I could have forwarded everything to Elastic+Kibana and shown some fancy graphs and charts, but I was only being asked for what I could easily produce via AWStats.
For S3 logging, AWStats needs its LogFormat set up in the following manner:

%other %extra1 %time1 %host %logname %other %method %url %methodurl %code %other %extra2 %bytesd %other %extra3 %refererquot %uaquot %other %other %other %other %other %virtualname %other

Amazon's documentation is available here.
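To sanity-check how the fields line up before pointing AWStats at real logs, a few lines of Python can split an S3 access log line the same way. This is only a rough sketch: the field names are my own labels for the leading columns described in Amazon's documentation, the sample line is made up for illustration, and the tokenizer does not handle escaped quotes inside a quoted field.

```python
import re

# Illustrative labels for the leading S3 access log columns
# (hypothetical names, not AWStats identifiers).
FIELDS = [
    "bucket_owner", "bucket", "time", "remote_ip", "requester",
    "request_id", "operation", "key", "request_uri", "http_status",
    "error_code", "bytes_sent", "object_size", "total_time",
    "turnaround_time", "referer", "user_agent", "version_id",
]

# A token is a [bracketed] timestamp, a "quoted" string, or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

# Made-up sample line, for illustration only.
SAMPLE = (
    'ownerid examplebucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
    'requesterid 3E57427F3EXAMPLE REST.GET.OBJECT index.html '
    '"GET /index.html HTTP/1.1" 200 - 1024 1024 12 11 "-" "curl/7.15.1" -'
)


def parse_s3_log_line(line):
    """Split one log line into tokens and pair them with FIELDS."""
    tokens = [t.strip('[]"') for t in TOKEN.findall(line)]
    # zip stops at the shorter sequence, so any trailing columns are ignored.
    return dict(zip(FIELDS, tokens))


if __name__ == "__main__":
    for name, value in parse_s3_log_line(SAMPLE).items():
        print(f"{name}: {value}")
```

Newer log lines carry extra trailing columns, which is what the run of %other entries and %virtualname at the end of the LogFormat absorbs; the sketch simply drops them.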
This is written more as an aide-memoire to myself than anything. It's a process I'm currently using for bulk-processing a set of documents of various forms (MS Word, PPT, PDF, LibreOffice etc), converting them all to PDF, running OCR on any embedded images and then sticking the end result into Elasticsearch via Tika (this final step isn't documented here; there's plenty of documentation elsewhere).
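As a rough illustration of the first two steps, the sketch below builds the command lines for one document. The tool choices are assumptions, since none are named above: LibreOffice's `soffice` in headless mode for the PDF conversion and `ocrmypdf` for the OCR pass. The helper name, the suffix list and the `.ocr.pdf` naming are hypothetical too.

```python
from pathlib import Path

# Suffixes treated as office documents needing conversion (assumed list).
OFFICE_SUFFIXES = {".doc", ".docx", ".ppt", ".pptx", ".odt", ".odp"}


def build_commands(source, out_dir):
    """Return the commands to turn one document into a searchable PDF."""
    source, out_dir = Path(source), Path(out_dir)
    commands = []
    if source.suffix.lower() in OFFICE_SUFFIXES:
        # Assumed tool: LibreOffice headless conversion to PDF.
        commands.append([
            "soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source),
        ])
        ocr_input = out_dir / (source.stem + ".pdf")
    else:
        # Already a PDF: go straight to OCR.
        ocr_input = source
    searchable = out_dir / (source.stem + ".ocr.pdf")
    # Assumed tool: ocrmypdf; --skip-text leaves pages that already have text alone.
    commands.append(["ocrmypdf", "--skip-text", str(ocr_input), str(searchable)])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("docs/report.docx", "out"):
        print(" ".join(cmd))
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` in order; building the commands separately from running them makes a dry run over a large batch cheap.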
See the UI here. Following on from here and here, this is just putting together a couple of blocks from bl.ocks.org to plot data from the PhysioNet site.