AI transcription is most effective when processing dictation (i.e. one speaker), where the context is not particularly important and the person receiving the transcription has time to go through and correct all the mistakes. In fact I have used it to prepare this article (Apple iPhone voice memo transcription), but I’ve probably spent just as long amending errors (the Apple version struggles with commas and full stops in particular) and I may as well have asked a human transcriber to complete it or type it from scratch!
AI transcription also struggles to detect certain words and accents even when the recording is clear. It is fairly useful for one-to-one interview recordings, provided they are conducted in reasonably quiet surroundings, without too much background noise or speakers talking over the top of each other.
It is utterly useless when it comes to handling multi-speaker recordings or focus groups. At this point in time, no one has invented a way of producing a transcription that accurately differentiates between the different speakers, no matter what the AI transcription companies would have you believe. Try it out: put a recording through one of the AI transcription services and see what happens. Our company specialises in multi-speaker transcription and focus groups, and we are often sent transcriptions completed by the likes of rev.com or otter.ai, or using the Zoom or Teams transcription options.
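If you are comfortable with a little code, the sketch below is one way of running that test yourself. It uses the open-source Whisper speech recognition model rather than any particular commercial service, and the file name is just a placeholder for whatever multi-speaker recording you have to hand; treat it as an illustration of the kind of output these tools produce, not a recommendation of any one tool.

import whisper

# Load the open-source Whisper speech-to-text model ("base" is the small,
# quick-to-download size; larger sizes are more accurate but behave the
# same way on the point being made here).
model = whisper.load_model("base")

# "focus_group.mp3" is a placeholder - substitute any multi-speaker recording.
result = model.transcribe("focus_group.mp3")

# Each segment comes back with start/end timestamps and text, but with no
# speaker label of any kind: the model does not know who said what.
for segment in result["segments"]:
    print(f'[{segment["start"]:6.1f}s - {segment["end"]:6.1f}s] {segment["text"]}')

Run that against a focus group and every word from every participant comes back as a single undifferentiated stream, which is exactly the mash-up described below.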
We are asked by our clients to tidy them up, but our transcribers will very often simply start from scratch because the AI-produced transcription is so bad. It’s actually faster and more cost-effective for us to do the work from scratch than it is for a transcriber to go through and correct all the mistakes in the transcript before we send it back to the client.

Multi-speaker recordings are a particular problem because there will very often be passages where the speakers talk over the top of each other, and AI transcription cannot differentiate between the parties talking. It therefore produces one big mash-up of words, set out in indecipherable paragraphs and utterly inaccurate. This is the difference between AI transcription and human transcription. AI transcription is out of its depth when it comes to multiple speakers or accents, because telling voices apart, and spotting where one sentence ends and the next begins, or where one speaker stops talking and another starts, still requires a human ear.
So, whilst AI transcription can be good in certain circumstances, for certain types of recording, it is definitely not good for multiple speakers and focus groups, because at this time it simply cannot tell the different speakers apart.
There will obviously be further technological advances, but the simple fact is that AI transcription has not moved the industry on much at all since the first speech recognition software arrived back in the 1990s. Yes, it is more accurate and capable when it comes to dictation (although I have just spent the last 30 minutes correcting this article!), but no, it cannot cope with background noise, strong accents, or multiple speakers.