Human transcription – why do automated transcription companies recommend it?

This may surprise you, but some of the automated transcription services that are available online actually recommend that you use professional human transcribers for your work in certain situations.

Surely this is a step backwards if the future of transcription is automation and artificial intelligence? As automated systems get increasingly clever at both transcription and translation, presumably there will no longer be any need for human translators and transcribers?

This may be the case, but we have been in business since 2003, and throughout that time we have been told continually that there is no future for human transcribers because automated transcription will take over.

Dictation vs Interviews

In the early days, slightly more of our work was dictation rather than one-to-one interviews and focus groups (larger groups of speakers). Over time we have evolved into specialist academic and professional business transcribers, providing our services specifically where high-quality transcription is needed. This can involve quite complicated recordings, including extensive academic studies and projects.

This has not changed in many years. We offer transcription services to academics and business professionals, and the overwhelming majority of our work is transcribing fairly complex recordings. They often involve hard-to-hear audio, strong accents, other languages, and multiple speakers on audio and video.

AI Transcription Software

A good number of the automated systems that are out there appear to be built on just a couple of underlying speech recognition engines. Amazon has developed one such service, Amazon Transcribe, which appears to be used by a number of the online automation companies.

Similarly, Zoom and Teams have automated systems that provide subtitles and transcripts for their audio and video recordings, and these are becoming increasingly popular among researchers for transcribing research interviews.

However, none of these has yet developed a reliable way of transcribing complicated or multi-speaker recordings, and this is why automated transcription companies will recommend the use of humans in a number of situations, rather than relying on their own automation.

If you submit a hard-to-hear or complicated recording to some of the larger companies, they will either return it uncompleted or send back a transcript that is completely useless, because the system could not pick up enough of the audio to make it worth working from.

We have written extensively about how poor some automated transcripts can be. When a client sends us an automated transcript along with the recording and asks us for an accurate transcription, we very often have to start from scratch.

Here are a few examples. We used the Amazon Transcribe system to complete this exercise.

The original recording of two English speakers reading questions and answers from a book

Q: In 1004 the people of Norwich made a deal with the Vikings to stop the raids. What did they agree?
A: The people of Norwich paid ‘peace money’ – a bribe. Of course the Vikings took the money and then robbed and destroyed Norwich anyway! Wouldn’t you?

Q: In 1006 King Ethelred of England was worried that his nobles were becoming too powerful. He had one noble, Aelfhelm of York, murdered. What did he do to make sure Aelfhelm’s sons didn’t rise up in revenge?
A: Ethelred had them blinded. Nasty.

This is the Amazon Transcribe version of the recording

Q: Too intense for the people of Norwich made to deal with the Vikings to stop the rate. What did they agree?
A: The people of no rich paid piece money, a bribe? Of course. The Vikings took the money and then robs and destroyed Norwich anyway, wouldn’t you?

Q: In 10. Oh, six. King Ethelred of England was worried that is, nobles are becoming too powerful. He had one Noble Alfama of York murdered. What did you do to make sure Al from sons did Right boys up in revenge?
A: Every word had been blinded, Nasty.

This shows the level of editing that is needed when dealing with automated transcripts. We appreciate of course that no AI transcription service is going to be able to spell Aelfhelm, and it is really impressive that the AI bot managed to transcribe Ethelred! Note that we added the Q: and A: labels to the automated transcript ourselves.
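One rough way to put a number on the amount of editing needed is the word error rate: the number of word substitutions, insertions and deletions required to turn the automated transcript into the original, divided by the number of words in the original. The Python sketch below is only an illustration and is not how the comparison above was produced. The function names are our own, and it assumes that case and punctuation can be ignored, which flatters the automated transcript.

```python
import re


def normalize(text):
    """Lowercase the text and keep only words, ignoring punctuation."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as plain ones
    return re.findall(r"[a-z0-9']+", text)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = cur[j - 1] + 1  # extra word in the hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    original = "Ethelred had them blinded. Nasty."
    automated = "Every word had been blinded, Nasty."
    # 3 word edits against 5 reference words
    print(word_error_rate(original, automated))
```

On the last answer in the example above, three word edits are needed against five original words, so well over half of that sentence has to be corrected by hand.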


The other issue is speed. AI tends to struggle with speakers who talk quickly, which is a problem when the recording is not dictation.

Here is an example – this is one speaker, reading an extract from a book.

The original recording of one English speaker reading from the book at a slow speed.

I respect a good checklist, but I’m beginning to think my mother went overboard.

“Sorry, what page?” I ask, flipping through the handout at our kitchen table while Mom watches me expectantly via Skype. The heading reads Sterling-Shepard 20th anniversary trip: instructions for Ivy and Daniel, and it’s eleven pages total. Double sided.

My mother planned the first time she and Dad ever left me and my brother alone – for four days – with the same thoroughness and military precision she brings to everything. Between the checklist and the frequent calls over Skype and Facetime, it’s like they never left.

“Nine”, Mom says. Her blond hair is pulled back in her signature French twist, and her makeup is perfect, even though it’s barely five AM in San Francisco.

My parents’ flight home doesn’t take off for another three and a half hours, but Mom is never anything but prepared.

“Right after the lighting section.”

Amazon Transcribe version (slower speaker version)

I respect Ah.
Good checklist.
But I’m beginning to think my mother went overboard.
Sorry. What page I ask, flipping through the handout at our kitchen table.
While Mom watches me expectantly via Skype.
The heading reads Sterling Shepard 20th anniversary trip.
Instructions for Ivy on Daniel.
On its 11 pages total.
Double sided.
My mother won and the
First time.
She on D dad.
Have, uh, that
Maybe on do my brother flowed.
For four days.
With the same over Nets.
Ondo Millet tree precision.
She brings to everything.
Between the checklist on the frequent calls.
Thank the sky.
Andre face time.
It’s like they never left.
Nine moment, says.
Her blonde hair is
Pulled back in her signature French twist.
On her make up is perfect. Even though it’s bad. The five a.m.
In San Francisco.
My parents flight home doesn’t take off for another three on by a half.
But Mom is never anything but prepared.
Right after the lighting section.

Not too bad, although it could be a lot better and the accuracy level is not high. However, things go rapidly downhill when the speaker gets faster.

This is the Amazon Transcribe version of the same text spoken more quickly.

I respect a good checklist, but I’m beginning to think my mother went overboard. Sorry. What page I ask with him through the hand on our kitchen table while Mom What should me expectantly via Skype heading, reads stealing chapter 20th anniversary trip.
Instructions the Ivy and Daniel in its 11 pages. Total double sided. My mother planned the first time she ever left me and my brother alone full days with the same thoroughness and military precision she brings to everything.
Between the checklist and the frequent calls over Scotland based time. It’s like they never left nine months has her blond hair is pulled back in a symmetry French twist and her makeup was perfect, even though his body five AM in San Francisco. My parents fight home doesn’t take off for another 3.5 hours, But Rome is never anything go right after the lighting section.

Multi-Speaker Accuracy Levels

The other reason to use human transcribers is where a high level of accuracy is needed from the recording.

Dictation is often absolutely fine for automated transcription services, because it is usually a straightforward matter of the automated system detecting individual words and producing a transcript of them.

Unfortunately, when there are multiple speakers it suddenly gets very difficult for the automated system to tell the individual speakers apart, particularly where they talk over the top of each other.

Professional human transcribers spend a lot of their time dealing with multi-speaker recordings. Picking out individual words when speakers are talking over the top of each other tends to be a specialism of our transcribers, who have been doing this work for years. We take pride in completing transcription projects even when the recording is very difficult to hear, and we don’t charge extra for the service!

