"This Call May Be Monitored or Recorded for Training and Quality Assurance Purposes"
The Truth About Call Monitoring
We've all heard it. It's almost like a jingle you can't get out of your head. It's used because, at least in the US, about a dozen states require two-party consent to record calls. Because most businesses really don't want to track which states those are, they just use that blanket disclosure for every call. And yes, the call is almost certainly being recorded.
But what you may not have known is that, mostly, the recordings are not used for training or quality assurance purposes. More simply put, they're mostly not used for anything. Yes, they are recorded and some are used for quality or training purposes, but the likelihood of your particular call being used in that way is roughly the same as the odds of you being ambidextrous. But why is that?
Quality Criteria
To understand that, you need to understand how traditional Quality Assurance functions in a Contact Center. It begins with a rubric or document that explains how a call should be conducted. While there are some industry standards, most businesses create their own custom standards for what constitutes quality. These quality criteria are documented, typically as a scoresheet. Common criteria include:
- Determining Customer Needs
- Issue Resolution
- Sales Process
- Documentation
- Various Soft Skills
These criteria are often broken down into specific behaviors and are sometimes weighted to emphasize their relative importance.
Quality Scoring
In many contact centers, each criterion, or "line item," is scored using the same methodology, but some businesses mix methodologies to provide a more granular representation of performance. Common scoring methodologies include:
- Binary (Met/Not Met, Yes/No, or Thumbs Up/Thumbs Down)
- Gradient Point Scales (3, 4, 5, 7, and 11-point scales are the most common)
- Auto-fail*
Once each criterion is evaluated, a single score representing overall performance is calculated. Depending on the methodologies and weights applied, the scoring can be represented as a total out of "points possible" or, more often, converted to a 100-point scale.
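To make the mechanics concrete, here's a minimal sketch of how a weighted, mixed-methodology scoresheet might roll up into a 100-point score. The line items, weights, and scales below are hypothetical examples, not an industry standard.

```python
# Hypothetical scoresheet: each line item has a weight, a maximum score on its
# scale, and the score the evaluator assigned. An auto-fail zeroes the whole call.
line_items = [
    {"name": "Determining Customer Needs", "weight": 0.25, "max": 5, "score": 4},  # 5-point scale
    {"name": "Issue Resolution",           "weight": 0.30, "max": 1, "score": 1},  # binary Met/Not Met
    {"name": "Documentation",              "weight": 0.20, "max": 5, "score": 3},  # 5-point scale
    {"name": "Soft Skills",                "weight": 0.25, "max": 7, "score": 6},  # 7-point scale
]

def overall_score(items, auto_fail=False):
    if auto_fail:
        return 0.0  # e.g., a compliance violation wipes out the rest of the form
    # Normalize each line item to 0-1, apply its weight, then convert to a 100-point scale.
    earned = sum(item["weight"] * item["score"] / item["max"] for item in items)
    possible = sum(item["weight"] for item in items)
    return round(100 * earned / possible, 1)

print(overall_score(line_items))  # -> 83.4
```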
Quality Evaluation - Who and How Often?
Some contact centers have an entire team dedicated to evaluating calls. Often known as Quality Analysts, members of this team listen to calls while scoring them against the rubric. This can be done on paper, in a spreadsheet, or in dedicated Quality Evaluation software. Other contact centers rely on Supervisors or Team Leaders to perform the evaluations. There are even 3rd party companies that provide these services by calling in as a customer or potential customer. These are sometimes referred to as "Mystery Shoppers."
No matter who is charged with the task of evaluating calls, call centers have limited budgets and resources. A single dedicated Quality Analyst needs to listen to an entire call, sometimes replaying certain sections more than once, review the documentation, and potentially review process and policy information just to score one call. Additionally, they may be asked to provide the agent feedback on the call. This process naturally takes quite a bit longer than the call itself. Depending on the rubric, ease of access to additional information, and expectations for feedback, the evaluation process can last up to 5 times longer than the call handle time. It's impossible for one person to evaluate every call for even a single agent, assuming that agent stays relatively busy, and Quality Analysts are often responsible for evaluating up to 20 agents. When all is said and done, it's not uncommon for roughly 1% of all calls to be evaluated.
Statistically Insignificant
No one wants to be thought of as insignificant, certainly not the people who've been tasked with evaluating the performance of agents the world over. But in the world of statistics, significance is important. Significance is a determination that a result is not due to chance. This is a common agent complaint when they receive a bad quality evaluation - the one rare call where they didn't perform well was the one that got evaluated.
I'm sure you're not reading this for a tutorial on confidence levels, margin of error, or any of the other things that go into determining a statistically valid sample size, so I won't give it to you. I will point you to Calculator.net's Sample Size Calculator and just tell you that if an agent takes about 1,000 calls a month, a good sample size for Quality Assurance is about 278** calls. Most companies are doing 8-10 evaluations per agent per month. I'm sure you can now see the issue. The margin of error with that kind of sample size is over 30%.
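If you'd like to check the math yourself, here's a minimal sketch of the standard sample-size formula with a finite population correction, using the same assumptions as the calculator (95% confidence, 5% margin of error, worst-case proportion of 0.5).

```python
import math

Z = 1.96  # z-score for a 95% confidence level
P = 0.5   # worst-case proportion (maximizes the required sample)

def sample_size(population, margin_of_error):
    # Infinite-population sample size, then the finite population correction
    n0 = (Z ** 2 * P * (1 - P)) / margin_of_error ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def margin_of_error(population, n):
    # Margin of error you're left with for a given sample size
    return Z * math.sqrt(P * (1 - P) / n) * math.sqrt((population - n) / (population - 1))

print(sample_size(1_000, 0.05))              # -> 278 calls out of 1,000
print(round(margin_of_error(1_000, 10), 2))  # -> 0.31, i.e. over a 30% margin of error
```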
Call Quality - An Incredible Use-Case for Artificial Intelligence
Artificial Intelligence (AI) has made incredible strides over the past few years, particularly in the areas of Natural Language Processing (NLP), Natural Language Understanding (NLU), Automatic Speech Recognition (ASR), and, most recently, Generative AI using Large Language Models (LLM).
AI can now create accurate transcripts from call recordings almost instantaneously and with better accuracy than human transcription. Remember Agent Smith in The Matrix? Imagine him, but instead of dozens of copies of him hunting Keanu Reeves, dozens of copies of him are evaluating conversations. We'll call him QA Smith. Oh, and he doesn't have a disdain for humans or view us as a virus. He's encouraging yet honest.
And, our QA Smith identifies more than traditional Quality Analysts. QA Smith does sentiment analysis for both agent and customer. QA Smith provides feedback for every line item. QA Smith can speak multiple languages. And QA Smith can evaluate every single call. QA Smith is significant. Oh, and QA Smith, he’s also incredibly inexpensive.
The Cost of Human Quality Evaluations
Let’s talk numbers. If we assume your human Quality Analyst is not based in the US but is more expensive than an agent, you’re probably paying at least $10 per hour for them. You can argue about the rate, but let’s just start there. If you pay more or less, you can plug in that amount and do similar math. Let’s also assume this analyst works 40 hours a week, though this too may change based on the country they work in. At an 8-hour day, that’s an average of 21 working days per month. 21*8*10 = $1,680 per calendar month for their labor without any consideration of benefits, software seat licensing costs, or any other associated costs of employing them. If they can evaluate 4 calls per hour (which is almost impossibly aggressive, likely includes minimal feedback for the agents, and excludes all long calls), that’s 21*8*4 = 672 evaluations per month. $1,680 / 672 = $2.50 per evaluation. In my experience, you’ll get less than half that production, which quickly doubles the cost. If you’re using a QA based in the US, double it again.
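Here’s that same math as a quick sketch so you can plug in your own numbers (the hourly rate and evaluation pace are the assumptions from above; swap in your own).

```python
def human_qa_cost(hourly_rate=10.0, working_days=21, hours_per_day=8, evals_per_hour=4.0):
    # Labor only: no benefits, software seats, or other overhead included
    monthly_labor = working_days * hours_per_day * hourly_rate     # 21 * 8 * $10 = $1,680
    monthly_evals = working_days * hours_per_day * evals_per_hour  # 21 * 8 * 4  = 672
    return monthly_labor, monthly_evals, round(monthly_labor / monthly_evals, 2)

print(human_qa_cost())                    # -> (1680.0, 672.0, 2.5), i.e. $2.50 per evaluation
print(human_qa_cost(evals_per_hour=2.0))  # -> half the production, $5.00 per evaluation
```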
The Cost of AI Quality Evaluations
WARNING - much math ahead! (It’s okay, we’ve done the math for you. Feel free to check our work.) If you’d like to skip the details, here’s the TL;DR:
$57.79 for Stereo Transcription
$17.20 for ChatGPT-4o Evaluation
$74.99 total for 672 evaluations
22x AI Productivity vs. Human Evaluators ($2.50 vs. roughly $0.11 per evaluation)
Before we can get to the cost of the evaluation, we first have to address transcription if we’re evaluating a voice interaction. Your CCaaS platform may already provide transcription, so if that’s the case, you’re one step ahead. If not, transcription is typically a “per minute” cost. There are a number of services available to handle this. Deepgram will cost you $0.0043 per minute on pay-as-you-go pricing.
Once you have a transcription, you can use one of many Large Language Models (LLMs) to evaluate it based on your criteria. This is where pricing becomes a little less transparent, as most models price based on token usage. What’s a token? OpenAI says, “You can think of tokens as pieces of words, where 1,000 tokens is about 750 words.” A call in the four-to-six-minute range typically comes out to about 540 words.
Let’s assume a typical call length of 5 minutes and 540 words. Our Human QA got us 672 evaluations (maybe) so let’s price that out. It’s 3,360 minutes and 362,880 words.
3,360 minutes at $0.0043 = $14.45 for transcription at Deepgram. GPT-4o costs $5.00 per 1M input tokens, and our 362,880 words equal roughly 483,840 tokens. That’s about $2.42. If we pair Deepgram’s transcription with GPT-4o, our costs have dropped to under $20.
But wait - there’s a catch. To get Deepgram’s quoted pricing, you are opting into training their models via their “Model Improvement Program.” Oh, you don’t want to let them use your data to train their models? Your price just doubled. That’s okay, we’re still under $35. Right? Right?? Oh, you have stereo recordings? That’s great because they make transcription even better. Your price also just doubled again. Okay, so now you’re at $57.79 for transcription and $2.42 for your tokens, which is still just $60.21.
But you haven’t priced in the prompts… or the output. Say what?
ChatGPT pricing isn’t just based on the input - you have to pay for the output and tell it what you want it to do. The telling part is called a prompt. If you want to evaluate a call on various dimensions, you need to include all of those things in your prompt. If you want feedback and scoring, that too will use tokens. How many? Well, that depends. A single line item might take as many as 200 words just to prompt for the result you want, and its output may only be 20 words or so. But let’s say your QA form evaluates 15 dimensions of call handling. That’s (200 + 20) * 15 = 3,300 words per call, and 3,300 * 672 = 2,217,600 words, or roughly 2,956,800 tokens. That’s another $14.78. So we’re up to $74.99.
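If you’d like to check our work, here’s the whole estimate in one sketch, using the same assumptions as above (5-minute calls, 540 words per call, 1,000 tokens per 750 words, and the prompt/feedback words priced at the input-token rate, as in the figures above).

```python
WORDS_PER_TOKEN = 0.75  # OpenAI's rule of thumb: 1,000 tokens ~ 750 words

def ai_qa_cost(calls=672, minutes_per_call=5, words_per_call=540,
               deepgram_per_minute=0.0043, opt_out_multiplier=2, stereo_multiplier=2,
               gpt4o_per_1m_tokens=5.00,
               prompt_words_per_item=200, output_words_per_item=20, line_items=15):
    # Transcription: base pay-as-you-go rate, doubled to opt out of the
    # Model Improvement Program, doubled again for stereo recordings
    transcription = (calls * minutes_per_call * deepgram_per_minute
                     * opt_out_multiplier * stereo_multiplier)                 # -> $57.79
    # Evaluation input: the call transcripts themselves
    transcript_tokens = calls * words_per_call / WORDS_PER_TOKEN
    transcript_cost = transcript_tokens / 1_000_000 * gpt4o_per_1m_tokens      # -> $2.42
    # Prompts and feedback for every line item, priced at the input rate as above
    prompt_tokens = (calls * line_items
                     * (prompt_words_per_item + output_words_per_item) / WORDS_PER_TOKEN)
    prompt_cost = prompt_tokens / 1_000_000 * gpt4o_per_1m_tokens              # -> $14.78
    return round(transcription, 2), round(transcript_cost, 2), round(prompt_cost, 2)

print(ai_qa_cost())  # -> (57.79, 2.42, 14.78), $74.99 total for 672 evaluations
```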
Wait, output tokens cost THREE times as much as input tokens?? It’s fine. Really. The output is fairly small as long as you prompt the output to be brief. Or concise. Or short. Or to the point. Or succinct. Maybe compact? Seriously, you’ll have to play with the prompts to get what you want. You’ll actually need to spend quite a bit of time building your prompts. But you’re still well under $100 for 672 evaluations and that’s pretty awesome!
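To give a sense of what that prompting looks like, here’s a minimal sketch of a single line-item evaluation using OpenAI’s Python SDK. The criterion wording, scale, and output format are hypothetical examples, and you’ll still need to iterate on the instructions to keep the feedback short.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transcript = "..."  # the call transcript from your transcription step

# Hypothetical line item: "Determining Customer Needs" on a 5-point scale
prompt = (
    "You are a contact center quality analyst. Evaluate the agent in the transcript "
    "below on one criterion: did the agent ask questions to determine the customer's "
    "needs before offering a solution? Respond with a score from 1 to 5 and one "
    "sentence of feedback. Be concise.\n\n"
    f"Transcript:\n{transcript}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=60,  # keep the output (and the output-token bill) small
)
print(response.choices[0].message.content)
```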
Of course, you’ll want some reporting which, *checks notes*, is not included with any of these services.
The Cost of Happitu Quality Evaluations
Forget tokens. Forget using multiple services. Forget adding in a reporting solution. Oh, and we’ll also summarize every call, provide call drivers, anomaly detection, 100% searchability, 6 months of data retention, custom highlights, and alerts, all for a simple $1 per hour of analysis. No contract, no minimum. That’s $56 for the same 672 evaluations, or 30x Happitu Productivity vs. Human Evaluators if we’re being conservative. No, we’re not kidding. Let us show you how Happitu can evaluate the other 99% of your customer interactions for a fraction of the cost you’re currently paying.
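For the skeptics, here’s that last figure as a sketch, using the same 672 five-minute calls from the earlier examples (the rate is the $1 per hour of analysis quoted above; everything else follows from the numbers we already used).

```python
def happitu_cost(calls=672, minutes_per_call=5, rate_per_hour=1.00):
    # $1 per hour of analyzed audio: 672 calls * 5 minutes = 3,360 minutes = 56 hours
    hours_analyzed = calls * minutes_per_call / 60
    return round(hours_analyzed * rate_per_hour, 2)

print(happitu_cost())  # -> 56.0, vs. $1,680 for a human evaluator (30x)
```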
*The Auto-fail methodology is typically only applied to a few high-value criteria, often related to compliance and/or customer abuse. When an auto-fail is triggered, the score for the entire call is 0 regardless of performance on the other criteria. Use of auto-fail is controversial, and some contact centers prefer to address egregious errors and omissions through other means.
**95% Confidence Level, 5% Margin of Error