Observe.AI cuts AI inference costs by over 40% to scale contact center QA

Discover how Observe.AI, a conversation intelligence company, optimized AI inference on Amazon EKS to scale QA automation.

Benefits

100 seconds LLM spin-up time achieved

40%+ lower cost per million tokens

40%+ reduction in overall infrastructure cost

40% increase in automated QA moments

Overview

Observe.AI faced high infrastructure costs and slow model startup times that limited how far it could scale Gen AI Moments, its key feature for automating contact center quality assurance (QA). In response, the team reworked how its large language models (LLMs) were deployed and scaled on Amazon Web Services (AWS), focusing on reducing initialization delays, streamlining model loading, and scaling capacity more responsively during demand spikes. By optimizing its LLM deployment architecture on Amazon Elastic Kubernetes Service (Amazon EKS), Observe.AI cut model spin-up time to around 100 seconds, lowered overall infrastructure cost by over 40 percent, and enabled customers to automate a broader range of QA evaluations with high accuracy at scale.

About Observe.AI

Observe.AI is an AI agent platform for customer experience, helping enterprises automate customer interactions with natural conversations and predictable outcomes. It combines speech understanding, workflow automation, and governance to support AI agents, copilots, and quality insights at scale.

Opportunity | Scaling Gen AI Moments to meet rising demand for automated QA

Contact centers depend on QA evaluations to assess agent performance, guide coaching, and determine compensation. Traditionally, however, QA teams could review only a small sample of calls, manually listening to recordings and completing evaluation forms. This limited coverage made it difficult for enterprises to consistently assess performance or identify issues early.

To address this, Observe.AI introduced Gen AI Moments, a feature that automates QA evaluations at scale. Gen AI Moments analyzes full agent–customer conversations and answers QA form questions directly from call transcripts, providing clear yes-or-no responses with supporting evidence. This helps enterprises move beyond manual sampling and apply QA more consistently across many more interactions. “This is one of the primary features customers use heavily because that’s the main outcome they’re looking for: they want to automate their QA processes,” says Anup Pattnaik, staff machine learning engineer, Observe.AI.

As adoption of Gen AI Moments grew quickly among Observe.AI’s enterprise customers, the volume of data processed rose significantly, with hundreds of billions of tokens handled each month to support these evaluations. However, scaling Gen AI Moments to support higher traffic would drive up operational costs and limit how broadly the feature could be rolled out. Observe.AI needed a more efficient way to sustain growing demand while keeping the feature reliable at enterprise scale.

Solution | Optimizing Gen AI Moments on Amazon EKS for scalable inference

To support the growing adoption of Gen AI Moments, Observe.AI focused on improving how its custom LLMs were deployed and scaled on Amazon EKS. The company worked with AWS specialists to analyze how inference workloads behaved on Amazon EKS, examining each stage involved in bringing new graphics processing unit (GPU)-backed nodes online. This review helped the team pinpoint where initialization delays occurred and determine which changes would most effectively improve readiness during traffic spikes. “We knew that AWS would have access to a lot of techniques that we could use to improve the setup, so we started working closely with the AWS team to understand how we could optimize it,” says Pattnaik.

One area of focus was model initialization. Previously, model weights were downloaded from Amazon Simple Storage Service (Amazon S3) to local disk before being loaded into GPU memory, adding unnecessary steps during startup. To streamline this process, Observe.AI adopted Run:AI Streamer to load model weights directly from Amazon S3 into GPU memory, simplifying initialization on new Amazon EKS nodes.
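The difference between the two loading paths can be illustrated with a minimal, hypothetical Python sketch. It does not use the Run:AI Streamer API, Amazon S3, or a real GPU: `fetch_chunks` stands in for ranged reads of a weights object, and a `bytearray` stands in for GPU memory. All function names here are assumptions made for illustration only.

```python
import os
import tempfile
from typing import Iterator

CHUNK_SIZE = 4  # tiny chunk size so the example runs instantly


def fetch_chunks(blob: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Hypothetical stand-in for ranged reads of a weights object from object storage."""
    for start in range(0, len(blob), chunk_size):
        yield blob[start:start + chunk_size]


def download_then_load(blob: bytes) -> bytearray:
    """Two-step path: write every chunk to local disk, then read the file back."""
    device_memory = bytearray()  # stand-in for GPU memory
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for chunk in fetch_chunks(blob):
            tmp.write(chunk)
        path = tmp.name
    try:
        with open(path, "rb") as f:
            device_memory.extend(f.read())
    finally:
        os.remove(path)
    return device_memory


def stream_load(blob: bytes) -> bytearray:
    """Streaming path: copy each chunk into the destination as it arrives, with no disk hop."""
    device_memory = bytearray()  # stand-in for GPU memory
    for chunk in fetch_chunks(blob):
        device_memory.extend(chunk)
    return device_memory


weights = bytes(range(32))  # toy "model weights"
# Both paths end with identical contents; the streaming path skips the intermediate file.
assert download_then_load(weights) == stream_load(weights)
```

Both functions produce the same bytes in the destination buffer. The streaming version removes the write-to-disk and read-from-disk steps, which is the kind of startup work the paragraph above describes eliminating.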

The team also addressed delays caused by the large container images used for inference. These images were downloaded each time a new node was spun up, slowing node readiness. Observe.AI moved to Bottlerocket Amazon Machine Images (AMIs) and preloaded inference engine images using Amazon Elastic Block Store (Amazon EBS) snapshots. By enabling Amazon EBS fast snapshot restore (FSR), new nodes could access these images more quickly during scale-out events.
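As a rough sketch of the last step, fast snapshot restore is enabled per snapshot and per Availability Zone, and one way to do that programmatically is the boto3 EC2 client's `enable_fast_snapshot_restores` call. The snapshot ID and Availability Zones below are placeholders, and the helper names are hypothetical; the source does not describe how Observe.AI configured FSR, so this is an assumption-laden illustration rather than their setup.

```python
from typing import Any, Dict, List


def build_fsr_request(snapshot_ids: List[str], availability_zones: List[str]) -> Dict[str, Any]:
    """Build the keyword arguments for the EC2 EnableFastSnapshotRestores request."""
    if not snapshot_ids or not availability_zones:
        raise ValueError("at least one snapshot ID and one Availability Zone are required")
    return {
        "AvailabilityZones": list(availability_zones),
        "SourceSnapshotIds": list(snapshot_ids),
    }


def enable_fsr(snapshot_ids: List[str], availability_zones: List[str]) -> Dict[str, Any]:
    """Send the request to EC2. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3 installed

    ec2 = boto3.client("ec2")
    return ec2.enable_fast_snapshot_restores(
        **build_fsr_request(snapshot_ids, availability_zones)
    )


# Placeholder values for illustration only.
request = build_fsr_request(["snap-0123456789abcdef0"], ["us-east-1a", "us-east-1b"])
print(request)
```

Only `build_fsr_request` runs in this sketch; `enable_fsr` would be invoked from a pipeline that publishes a new snapshot containing the preloaded images.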

Finally, Observe.AI refined how Amazon EKS detected and responded to increases in workload demand. Instead of relying on Amazon CloudWatch metrics, the platform began reading demand signals directly from Amazon Simple Queue Service (Amazon SQS), which served as the system’s source of truth for incoming workloads. This allowed Amazon EKS to scale GPU resources in line with incoming traffic patterns.
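A queue-driven scaling rule of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Observe.AI's actual autoscaler: the function name, the per-replica target, and the replica bounds are all hypothetical. In practice, the backlog figures could come from the Amazon SQS `GetQueueAttributes` API, using the `ApproximateNumberOfMessages` and `ApproximateNumberOfMessagesNotVisible` attributes.

```python
import math


def desired_replicas(
    queue_depth: int,
    in_flight: int,
    msgs_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Return how many GPU-backed replicas to run for the current backlog.

    queue_depth      -- messages waiting in the queue
    in_flight        -- messages received by workers but not yet deleted
    msgs_per_replica -- hypothetical backlog one replica is expected to absorb
    """
    if msgs_per_replica <= 0:
        raise ValueError("msgs_per_replica must be positive")
    backlog = queue_depth + in_flight
    wanted = math.ceil(backlog / msgs_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))


# Illustrative values only: an idle queue, a moderate backlog, and a spike.
print(desired_replicas(0, 0, 100, 1, 20))
print(desired_replicas(950, 50, 100, 1, 20))
print(desired_replicas(10_000, 0, 100, 1, 20))
```

Because the replica count is derived from the queue itself, the scaler reacts to the same backlog the workers will actually consume, which is the intent described above.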

Outcome | Lowering inference costs to expand QA automation

By optimizing how Gen AI Moments scaled on Amazon EKS, Observe.AI reduced LLM spin-up time by nearly 90 percent, from 12–15 minutes to around 100 seconds, allowing inference workloads to start processing sooner as demand increased.
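The "nearly 90 percent" figure can be checked from the numbers quoted above. The short calculation below uses only the stated 12–15 minute baseline and the roughly 100-second result; the function name is a hypothetical helper.

```python
def reduction_percent(before_seconds: float, after_seconds: float) -> float:
    """Percentage reduction going from before_seconds to after_seconds."""
    return (before_seconds - after_seconds) / before_seconds * 100


low = reduction_percent(12 * 60, 100)   # baseline of 12 minutes
high = reduction_percent(15 * 60, 100)  # baseline of 15 minutes
print(f"{low:.1f}% to {high:.1f}%")     # prints "86.1% to 88.9%"
```

Depending on which end of the original range is used, the reduction works out to roughly 86 to 89 percent.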

These improvements resulted in a 40–50 percent reduction in cost per million tokens and a 40–50 percent reduction in overall infrastructure costs for Gen AI Moments. Lower operating costs removed previous constraints on running high-volume inference workloads, making the feature more sustainable at scale.

With costs reduced, Observe.AI was able to expand usage of Gen AI Moments across its customer base. The platform now supports approximately 40 percent more Gen AI Moments, empowering customers to automate a greater number of QA evaluations per agent and apply automation across more QA forms. “By cutting infrastructure costs, we were able to scale Gen AI Moments to more customers and automate more QA workflows while maintaining accuracy,” says Pattnaik.

Today, more than 40 enterprise customers, such as DoorDash, Affordable Care, Signify Health, and Verida, use Gen AI Moments to automate complex QA evaluations across large volumes of agent–customer interactions, improving service speed and operational efficiency. This adoption reinforces the feature as a core capability of Observe.AI’s contact center intelligence platform.


AWS Services Used

Amazon Elastic Kubernetes Service

Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service that enables you to run Kubernetes seamlessly in both AWS Cloud and on-premises data centers.


Amazon Elastic Block Store (EBS)

High-performance, easy-to-use block storage at any scale.


Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.


Amazon Simple Queue Service

Fully managed message queuing for microservices, distributed systems, and serverless applications.


Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.
