The Word Error Heard Round The World
What does this scene from Friends have to do with transcription? Phoebe misheard "Tiny Dancer" as "Tony Danza," and it changed how an entire generation heard this Elton John classic. Her mistake is an example of a Word Error in audio processing, specifically a substitution error. It turns out that Word Errors in transcription can be agent coaching opportunities.
A Brief History of Transcription
Transcription, the process of converting speech to text, has existed far longer than you might imagine. How long, you ask? For most of history, transcription was done using a method called shorthand. We all learn to write in longhand in school, but shorthand is a specialized, abbreviated, symbolic writing system designed for much faster writing. The earliest known reference to shorthand comes from a marble slab dating to roughly 350 BCE found at the Parthenon in Greece. If you're good at math, you've already figured out that the tablet is nearly 2,400 years old. If you had guessed roughly 2,400 years at the beginning of this section, you are probably the life of the party. Aside from the specifics of the particular shorthand system used, transcription technology didn't change much until the 1970s.
Transcription from Recorded Audio
The '70s were famous for more than polyester leisure suits, disco, CB radios, and bad mustaches. The decade also gave rise to the cassette tape. Portable cassette recorders meant that the transcriptionist no longer had to be physically present, and the ability to pause and rewind made shorthand unnecessary.
From Analog to ASR
Digital recording formats, particularly the Compact Disc (CD), eventually replaced analog cassette tapes in the music industry. But it was the digital file formats, particularly the MP3, that changed transcription. Portable recorders became incredibly small, very inexpensive, and could simply become a feature of other devices, including mobile phones. Contact Centers could record and store phone calls without cumbersome magnetic tape reels.
Today's digital recording methods can allow voice recordings to be ingested by computer systems instantly. Rather than relying on humans to manually transcribe a recording, Automatic Speech Recognition (ASR) combines decades of research in computer science, linguistics, and computer engineering to generate text from speech almost instantly.
Error Rates and Coaching Opportunities
ASR performance is typically measured by Word Error Rate (WER) or Word Accuracy (WAcc). By 2016, Microsoft had achieved human parity in ASR with a WER of 5.9%, the same as a professional stenographer. As Contact Centers began evaluating transcription and analysis platforms, these measurements became critical to understanding how automated systems would perform. Could a system transcribing 100% of calls replace a Quality Assurance team? Could contact center leaders get call-driver data without manual agent dispositions? As someone who vetted commercially available solutions at the time, I can tell you with certainty that the WERs were still far too high to be relied upon for these functions. While WERs were impressively low in "perfect" settings like a lab, Contact Center call recordings are anything but perfect. Background noise, varied accents, industry jargon, speakers talking over each other, non-native speakers, compressed audio recordings: basically everything you would avoid in a lab setting is present in nearly every Contact Center's vault of call recordings.
Advances in the technology, however, have brought typical WERs in contact center settings under 17%. Is that good enough? Consider that the average sentence is 15 to 20 words, and the formula is WER = (insertions + deletions + substitutions) / total words spoken. That means a single error (a missed, added, or incorrectly transcribed word) in a sentence equates to a WER of 5 to 7%. When you consider the professional stenographer rate reached in 2016 was 5.9%, it lines up almost exactly with one error per sentence. In that context, 2 to 3 errors per sentence under less-than-ideal conditions is pretty amazing.
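The formula above can be sketched in a few lines of Python. This is an illustrative implementation, not any vendor's production code: it aligns the reference and hypothesis transcripts word-by-word using the standard Levenshtein (edit distance) algorithm, then divides the error count by the reference length. The example sentence is hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Phoebe's two substitution errors out of five words:
print(wer("hold me closer tiny dancer",
          "hold me closer tony danza"))  # 2 / 5 = 0.4
```

Note the asymmetry: the denominator is the reference (what was actually said), so a hypothesis full of inserted words can push WER above 100%.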
No matter how low WERs get, one major problem remains with this measure: is the error caused by the person speaking or by the transcription? Contact center agents, more than most speakers, say the same phrases over and over, every day.
"Thank you for calling Happitu, how can I help you today?"
If you've read this far, you know exactly what happens to that phrase: it gets compressed. It's said faster and faster, with less and less attention to enunciation, over time. Why? We begin using our brain's auto-pilot feature. While this feature is incredibly useful, it can make the things we're saying more difficult to understand. Consider this actual transcription produced by Happitu:
This is where I tell you that Comcast is not one of our partners. But if you were to listen to the audio, it's easy to hear why the ASR missed the mark. The brand the agent actually said was spoken very quickly, on auto-pilot, and sounded very much like "Comcast" even though they were in the business of pest control and not internet and cable services.
So where was the failure? The ASR, the phone connection, the agent's auto-pilot, the audio recording compression, the agent's accent, or even the mic and its placement can all contribute to a mistranscription. In this particular case, the agent's pace during the branding statement contributed significantly to the error, as evidenced by the rest of the call not suffering from the same issues.
This is where Word Errors become potential coaching opportunities for agents. When you observe transcription errors, listen to the call and ask yourself what might have caused this particular error. Yes, it could be an ASR failure. It could be unavoidable background noise in your contact center. But it could also be simple agent behavior that can be corrected.