Converting PDF to Word using OCR text recognition – is there any point?

We recently undertook a piece of work for a journalist who was converting old newspaper articles into Word documents. She had managed to find the articles online, but they were essentially photographs of newspaper pages rather than machine-readable text.

Photographs as PDF

Our client had been able to print them to PDF documents, but was struggling to edit or reuse any of the material, because each page was an image rather than selectable text. She got in touch with us to ask if there was something we could do to help.

The Power of OCR (Optical Character Recognition)

We are very much aware of the power of OCR text recognition, and the way it can be used to convert difficult images of text into editable documents. OCR software processes a digital image by locating and recognising characters, such as letters, numbers and symbols. Advanced OCR software can also export the formatting of the text and the layout of the page.

Free OCR Services Online

Using OCR takes a bit of patience, but there are plenty of free options available online. We have recently used https://online2pdf.com/ and it works well (to reassure you if you click the link, we do not take commission payments for publicising services).

Adobe PDF to Word Service – Any Good?

Our company has been using the Adobe PDF to Word service for about five years now. In most circumstances, when a PDF comes in it is very straightforward to convert it into a Word document for editing, because most PDF documents started life as Word documents that were then converted into PDF.

In these circumstances the Adobe tool does a great job of converting the text, because the characters are already stored in the file and little or no recognition is needed.

Unfortunately this tool is of very little use when it comes to a photograph of text where the text is smudged, blurred or otherwise difficult to read. We usually estimate that the Adobe tool will get about 30-40% of such text into a usable format and the rest will need typing.
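As a rough illustration of the difference between the two kinds of PDF described above, the Python sketch below decides whether a file is likely to need OCR at all. It assumes the text of each page has already been pulled out by a PDF library of your choice. The function name `needs_ocr` and the 50-character threshold are our own illustrative choices, not part of any Adobe or online tool.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is a scan (image only) rather than a text PDF.

    page_texts holds whatever embedded text was extracted from each page.
    A PDF made from a Word document normally yields plenty of characters;
    a photographed page normally yields none, or only a few stray ones.
    """
    if not page_texts:
        # Nothing could be extracted at all, so treat the file as a scan.
        return True
    # Count non-whitespace characters across all pages.
    total_chars = sum(len("".join(text.split())) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


# Hypothetical example: two pages with no embedded text, as with a photographed newspaper.
print(needs_ocr(["", "  \n"]))
```

A check like this only tells you which route to take. It says nothing about how well OCR will cope with the images once you get there.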

Free Alternatives?

There are also free versions online that do exactly the same thing, and the quality can be just as good as, if not better than, the paid service from Adobe. Try https://pdf2doc.com for standard PDF to Word conversions. Again, though, pdf2doc will not recognise photographed text very well, and our recommendation to use online2pdf.com for OCR work stands.

PDF to Word Conversion Service

We converted the PDF document for the journalist in question without charge to start with, so she could see the quality she would get if she did this herself. It took about 30 seconds, and we managed to get about 30-40% of the text straight into a Word document, with the rest either appearing as gibberish or missing altogether.
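For anyone wanting to put a number on their own results, here is a minimal Python sketch of one way to estimate a usable share like the 30-40% figure above. It compares the OCR output word by word against a hand-typed reference sample, using the standard library `difflib` module. The function name `usable_share` and the sample strings are hypothetical, and this is not how our own estimate was produced. It is simply one reasonable way to measure it.

```python
from difflib import SequenceMatcher


def usable_share(reference: str, ocr_output: str) -> float:
    """Fraction of the reference words that survive, in order, in the OCR output."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        return 0.0
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    # Each matching block is a run of identical words found in both texts.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Hypothetical sample: a hand-typed sentence and a garbled OCR attempt at it.
reference = "the quick brown fox jumps over the lazy dog today"
ocr_attempt = "the qu1ck brown f0x jumqs ovcr the 1azy dog"
print(f"{usable_share(reference, ocr_attempt):.0%} of the words came through intact")
```

Typing out a short sample by hand and running this comparison gives a quick sense of how much correction a full document would need.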

OCR = 30-40% Accuracy

OCR will struggle with photographed blocks of text if they are not crystal clear. This is the problem with any AI (artificial intelligence) tool for text or speech recognition. Whilst these services are very good at transcribing text and speech that is simple, straightforward and easy to recognise, no service available yet copes reliably with material that is more than a little complicated.

In the case of our client, we ended up transcribing all of the documents using the traditional method of copy typing, because the quality available using OCR was so poor. It was actually more cost effective for the client to have the whole thing copy typed than to sit down and correct the OCR output herself.
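The comparison above comes down to simple arithmetic, which can be sketched in Python as follows. Every figure here is an illustrative placeholder rather than a measurement from this job: the word count, the proofreading and typing speeds, and the assumption that patching garbled text is slower per word than typing from scratch are all hypothetical.

```python
def minutes_to_correct_ocr(words: int, usable_share: float,
                           proofread_wpm: float, retype_wpm: float) -> float:
    """Proofread every word, then retype the words OCR got wrong."""
    unusable_words = words * (1.0 - usable_share)
    return words / proofread_wpm + unusable_words / retype_wpm


def minutes_to_copy_type(words: int, typing_wpm: float) -> float:
    """Type the whole document from the original image."""
    return words / typing_wpm


# Placeholder figures only; substitute your own.
words = 1200
correcting = minutes_to_correct_ocr(words, usable_share=0.35,
                                    proofread_wpm=150, retype_wpm=40)
typing = minutes_to_copy_type(words, typing_wpm=60)
print(f"Correcting OCR output: {correcting:.1f} minutes")
print(f"Copy typing from scratch: {typing:.1f} minutes")
```

With different speeds or a higher usable share, the balance can tip the other way, which is why it is worth trying OCR on a sample first.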

Alternative for 100% Accuracy – Copy Typing

Copy typing is cheap and effective, and gives you accurate copies of your work without your needing to go through and amend large chunks of it.

If you would like to use our copy typing service, please get in touch for a quote. Similarly, if you would like us to demonstrate OCR text recognition on any PDF document you have, please email us your PDF or text image and we will send you back an OCR version of it. There is no charge for this service.

Our Accreditations

We are Cyber Essentials Plus audited annually and we hold the Cyber Essentials and Cyber Essentials Plus certificates. We are UKAS ISO 27001:2022 audited and accredited, and an ISO 9001 and ISO 14001 accredited company. We are members of the American Translators Association and we are assessed for GDPR compliance annually by IASME (Cyber Assurance Level 1).

10% Profits to Charity

10% of our profits are donated to the Ten Percent Foundation, a charitable trust registered in the UK. Since 2000 over £150,000 has been donated to projects in Africa and the UK.