
5 Proven Lessons from Using AI for Quality Assurance in Contact Centers

By Rob Dwyer

Customer-facing AI, Agentic AI, Hyper-Personalization… There's a lot of AI hype in the world of customer support, customer service, and sales centers these days. But beyond the hype, there are proven implementations of AI in contact centers that are actually delivering results. One of those implementations is using AI for Quality Assurance (QA).

What is Quality Assurance in Contact Centers?

Quality is about "setting the standards" for customer service interactions, according to Brad Cleveland. Those standards define efficient and effective service that will help you build relationships with customers, and they are typically documented on a scorecard.

Quality Assurance, or Quality Management (QM), is the practice of assessing how well support center or contact center agents meet those standards. For decades, this practice involved Managers, Team Leaders, Supervisors, and/or Quality Analysts listening to calls and scoring them using the scorecard.

The biggest challenge with Quality Management has always been the cost of scaling to get an accurate assessment. But Artificial Intelligence (AI) has changed that.

Yet, AI isn't turnkey. After years of supporting the Quality Management efforts of businesses of all shapes and sizes, we've learned some valuable lessons.

Lessons We Learned through QA Automation

Lesson 1 - AI Needs Good Instructions

Can you imagine asking your cousin Devin, who knows nothing about your business, to QA your customer interactions? AI is a lot like that.

I began leading a Quality Assurance department in 2017. Because we were a BPO with dozens of partners in different verticals, we had over 100 scorecards, many with their own unique criteria.

Some scorecards included important regulatory compliance components, including PCI and GDPR adherence. Some included industry-specific requirements, such as the verbiage standards laid out by the Forbes Travel Guide. Some scorecards were established by our partners, others by us. Different channels also had different scorecards.

One thing many of the criteria had in common was that they ASSUMED an understanding of what was meant. Let me give you some examples:

  • "Warm Greeting"
  • "Were we INTERESTED in the caller's issue?"
  • "Demonstrate formal courtesies and manner"
  • "Effective Listening"
  • "Effective Confident Communication"
  • "Identify and appropriately handle warranty issues"
  • "Mandatory Info"
  • "MAC / IP / Correct Authentication Process"
  • "Membership / Upsell / Balance Collection"
  • "Ownership"
  • "Validation/Empathy Skills"

A big part of my job in subsequent years was to flesh out definitions and refine criteria so we all clearly understood "what good sounds like." This meant documenting with clarity, specificity, and intent for each and every standard.

Clarity, Specificity, Intent

When we created QA scorecards in Happitu, we knew from our experience that they had to be flexible and fully customizable. But for AI to produce good results, it also needed definitions with clarity, specificity, and intent.

One sentence may have sufficed for experienced Quality Analysts who knew the business, but AI needs more (there's a sketch after this list of what that looks like in practice). It needs:

  • clear, unambiguous instructions that might be broken down into bullets or steps
  • specific scenarios when the standard is not applicable
  • both why and when we're looking for a specific standard so it can provide relevant feedback
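
To make that concrete, here's a minimal sketch of how a one-line criterion like "Warm Greeting" might be expanded with clarity, specificity, and intent. The structure and field names are illustrative only, not Happitu's actual schema:

```python
# A minimal sketch of one scorecard criterion, expanded from a one-line label
# into the clarity, specificity, and intent an AI evaluator needs.
# The structure and field names here are illustrative, not Happitu's schema.

warm_greeting = {
    "name": "Warm Greeting",
    "instructions": [
        "The agent states the company name and their own name within the first two utterances.",
        "The agent offers to help, e.g. 'How can I help you today?'",
        "The tone is welcoming; a flat or rushed read of a script does not qualify.",
    ],
    "not_applicable_when": [
        "The interaction is an outbound callback where the customer speaks first.",
        "The call drops before the agent finishes their first utterance.",
    ],
    "intent": (
        "A warm greeting sets the tone for the interaction and signals to the "
        "customer that the agent is ready and willing to help."
    ),
}

def render_criterion(criterion: dict) -> str:
    """Turn a structured criterion into prompt text an AI evaluator can follow."""
    lines = [f"Criterion: {criterion['name']}", "Instructions:"]
    lines += [f"- {step}" for step in criterion["instructions"]]
    lines.append("Mark as N/A when:")
    lines += [f"- {case}" for case in criterion["not_applicable_when"]]
    lines.append(f"Why this matters: {criterion['intent']}")
    return "\n".join(lines)

print(render_criterion(warm_greeting))
```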

Lesson 2 - Calibration Is Still Required

Traditional calibrations are about alignment. The process usually involves multiple people reviewing and scoring the same interaction. Once everyone has scored it, the group identifies scoring differences and comes to a mutual agreement about how the interaction should be scored and why.

An important part of refining your scorecard is using a calibration-style process to spot issues with those definitions. If AI is consistently mis-scoring criteria, it's time to revisit the definitions with clarity, specificity, and intent in mind. This is why we created the ability to "re-queue" interactions for analysis on the fly so you can see how those updated definitions perform.
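
One way to picture that check: compare AI scores against human calibration scores on the same interactions and flag the criteria where they disagree too often. The sketch below uses hypothetical data and an arbitrary threshold; it isn't Happitu's internal tooling:

```python
# A minimal sketch of a calibration-style check: compare AI scores against
# human calibration scores on the same interactions and flag criteria where
# agreement is low, so those definitions can be rewritten and re-queued.
# The data and threshold are hypothetical.

from collections import defaultdict

# (interaction_id, criterion, human_score, ai_score) -- pass/fail scores here
calibration_set = [
    ("call-001", "Warm Greeting", 1, 1),
    ("call-001", "Effective Listening", 1, 0),
    ("call-002", "Warm Greeting", 1, 1),
    ("call-002", "Effective Listening", 0, 1),
    ("call-003", "Effective Listening", 1, 0),
]

agreement = defaultdict(lambda: [0, 0])  # criterion -> [matches, total]
for _, criterion, human, ai in calibration_set:
    agreement[criterion][0] += int(human == ai)
    agreement[criterion][1] += 1

THRESHOLD = 0.8  # below this, revisit the definition's clarity, specificity, and intent
for criterion, (matches, total) in agreement.items():
    rate = matches / total
    flag = "revisit definition" if rate < THRESHOLD else "ok"
    print(f"{criterion}: {rate:.0%} agreement ({flag})")
```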

Lesson 3 - Use AI to Create Your Scorecard

Don’t have a scorecard? Maybe you do, but think you should start over? You can use Claude, ChatGPT, Gemini, or your favorite flavor of AI to help you create a scorecard. Yes, like cousin Devin, it needs good instructions, but we’re here to help. Use these prompting tips to get great results, then see the sketch after the list for how they string together into a single prompt:

  1. Tell the AI who to be: “You are a Quality Assurance Manager in a Contact Center”
  2. Give the AI business context: “Your contact center supports customers of [your brand here] using calls, chats, and email as channels of support.”
  3. Give the AI the high-level task: “You will develop a Quality Scorecard that identifies agent strengths and opportunities when interacting with customers.”
  4. Give the AI company / department context: “When developing criteria, account for the following KPIs: [your KPIs here] and our company culture / voice which is [your company culture/voice here].”
  5. If you want to start with your existing scorecard, ask the AI for suggestions: “You should provide suggestions for Scorecard Criteria that are not currently present.”
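
If you'd rather script it than paste prompts into a chat window, here's a minimal sketch that strings those five tips into one request. It assumes the OpenAI Python client and a placeholder model name; the bracketed values are yours to fill in, just as in the tips above:

```python
# A minimal sketch of steps 1-5 assembled into a single prompt.
# The OpenAI client and model name are assumptions -- swap in whichever
# provider and model you actually use; the prompt structure is the point.

from openai import OpenAI

BRAND = "[your brand here]"
KPIS = "[your KPIs here]"
CULTURE = "[your company culture/voice here]"
EXISTING_SCORECARD = "[paste your current scorecard here, or leave blank]"

system_prompt = (
    "You are a Quality Assurance Manager in a Contact Center. "           # 1. who to be
    f"Your contact center supports customers of {BRAND} using calls, "
    "chats, and email as channels of support."                            # 2. business context
)

user_prompt = (
    "You will develop a Quality Scorecard that identifies agent strengths "
    "and opportunities when interacting with customers.\n"                # 3. high-level task
    f"When developing criteria, account for the following KPIs: {KPIS} "
    f"and our company culture / voice which is {CULTURE}.\n"              # 4. company context
    "You should provide suggestions for Scorecard Criteria that are not "
    f"currently present in this scorecard:\n{EXISTING_SCORECARD}"         # 5. build on what exists
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder -- use your preferred model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(response.choices[0].message.content)
```

Keeping the role and business context in the system message and the task and company specifics in the user message mirrors the way these tips layer on each other.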

Lesson 4 - Keep AI Benchmarks “on the Bench”

Technologists love a good benchmark because it assesses relative performance. These standardized tests are critical for making informed decisions about hardware and software because they compare performance across a variety of objective measures.

But benchmarking the performance of a Large Language Model is significantly more complicated because there’s inherent subjectivity. The most popular method pits two models head-to-head and lets human voters pick the winner, which is basically a rap battle. While entertaining, it’s not a great way to identify the best model to provide feedback on the quality of customer interactions.

We’ve tested dozens of models. We had high expectations for some, based on leaderboards and claims about their performance, only to see them fall flat on their virtual faces in our real-world application. We continue to evaluate models, and we’ve seen consistently that there’s very little correlation between benchmarks and actual performance.

Lesson 5 - AI Provides More Accurate Assessments of Agent Performance

"These scores are closer to the performance we see from the agents. These 95% that we're scoring hasn't been right for years."

That's a quote from a Director who was looking at Happitu QA scores in the high 70s for their organization. We like to think our internal QA processes give us an objective assessment, but they suffer from a variety of bias issues. Who performs the evaluations, incentives tied to performance, and evaluator workload are among the reasons that small-sample human scoring is often not entirely indicative of performance.
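
To illustrate the sample-size piece of that argument with made-up numbers (not client data), here's a minimal sketch comparing a four-call monthly sample to scoring every interaction for one agent:

```python
# A minimal sketch (with made-up numbers) of why a small monthly sample can
# drift far from an agent's true quality level, while scoring every
# interaction converges on it.

import random

random.seed(7)
TRUE_PASS_RATE = 0.78          # the agent's "real" performance, hypothetically
MONTHLY_INTERACTIONS = 400     # interactions the agent actually handles
HUMAN_SAMPLE_SIZE = 4          # calls a human evaluator scores per month

month = [random.random() < TRUE_PASS_RATE for _ in range(MONTHLY_INTERACTIONS)]

human_sample = random.sample(month, HUMAN_SAMPLE_SIZE)
human_estimate = sum(human_sample) / HUMAN_SAMPLE_SIZE
full_coverage = sum(month) / MONTHLY_INTERACTIONS

print(f"4-call human sample estimate: {human_estimate:.0%}")
print(f"Every-interaction estimate:   {full_coverage:.0%}")
```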

AI isn't flawless, but when every interaction is evaluated without human subjectivity, you see performance across long interactions, short interactions, and edge cases that might otherwise go unnoticed.

Applying These Lessons

We believe AI amplifies rather than replaces the human element. The right implementation turns your QA from a box-checking chore into something that gives your team clarity, your QA process credibility, and your leaders more time to actually lead.

These lessons were hard-earned. We made mistakes along the way and we want to share what actually worked. If it helps you avoid a few stumbles, then mission accomplished.

Are you ready to supercharge your QA without starting from scratch? Let’s chat about how Happitu can help you scale, score, and coach better. Because better conversations start with better quality. And better quality starts with Happitu.
