Securing the Voice Channel with Real‑Time Audio‑Native AI

June 2, 2026

This article is sponsored by Modulate and was written, edited, and published in alignment with our Emerj sponsored content guidelines. Learn more about our thought leadership and content creation services on our Emerj Media Services page.

Live voice interactions in contact centers have become a critical operational blind spot, where fraud, identity risk, and agent attrition emerge in real time without corresponding visibility from enterprise systems.

Financial services contact centers are losing money from two directions at once, and most enterprises are measuring only one of them. The FBI’s Internet Crime Complaint Center reported that AI-driven fraud, including voice cloning and deepfake impersonation, generated nearly $893 million in verified losses in 2025, the first year the FBI formally tracked it as a crime category. That figure covers only the attacks that victims actually reported.

The consequences compound on the operational side. The Society for Human Resource Management found that the average cost to recruit and hire a single employee is nearly $4,700 — before training, ramp-up, or lost productivity are factored in. In contact centers, where the Quality Assurance & Training Connection benchmarks annual agent turnover at 30 to 45%, that cost repeats at scale, every year, across every seat on the floor. A 500-agent center turning over at the industry average is not an HR problem. It is a capital problem.
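
To put rough numbers on that claim, the sketch below multiplies the figures cited above. It assumes the SHRM per-hire average applies unchanged to contact center agents and counts only recruiting and hiring cost; the function name is illustrative.

```python
def annual_replacement_cost(seats: int, turnover_rate: float, cost_per_hire: float) -> float:
    """Recruiting-and-hiring cost of replacing the agents who leave in one year."""
    return seats * turnover_rate * cost_per_hire


COST_PER_HIRE = 4_700  # SHRM average cited above; excludes training and ramp-up

# 30% and 45% are the ends of the turnover range cited above.
low = annual_replacement_cost(500, 0.30, COST_PER_HIRE)
high = annual_replacement_cost(500, 0.45, COST_PER_HIRE)

print(f"${low:,.0f} to ${high:,.0f} per year")
```

Under those assumptions, a 500-seat center spends roughly $705,000 to $1,057,500 a year on replacement hiring alone, before training, ramp-up, or lost productivity.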

The underlying issue is that contact centers are running real-time, voice-based operations with no real-time intelligence about what is actually happening on those calls — whether a synthetic voice is bypassing identity verification or an abusive caller is pushing a trained agent toward the exit. Both losses are measurable. Neither is inevitable.

Emerj recently hosted a three‑part series on securing the voice channel against real‑time risk, featuring Mike Pappas, CEO and Co‑Founder at Modulate; Ken Morino, Director of Market and Behavioral Research at Modulate; and Jon‑Rav Shende, Global CTO for Data and AI at Thales Group. The series examined how enterprises can detect fraud in‑call, deploy voice‑intelligence architectures that support high‑stakes decisions, and build workflow‑level governance that stands up to regulators and insurers.

This article examines three critical insights on how enterprises can secure the voice channel as it becomes a frontline surface for fraud and high‑stakes decisions:

Voice channel as a real‑time risk surface: Detecting fraud and manipulation during the call prevents financial loss, regulatory exposure, and agent churn before they escalate.
Specialized voice‑intelligence architecture for high‑stakes decisions: Models built for live audio provide the accuracy and speed required for authentication, account changes, and payment approvals that generic AI cannot support.
Workflow‑level governance and shared ownership for voice‑AI outcomes: Clear escalation paths and audit‑ready evidence enable Security, Operations, and CX to act on risk signals in ways regulators and insurers can trust.

Voice Channel as a Real‑Time Risk Surface

Episode: Why Ensemble Architectures Win Against Real-Time Voice Risk – with Mike Pappas of Modulate

Guest: Mike Pappas, CEO & Co-Founder at Modulate

Expertise: AI, Conversational AI, AI Safety & Trust, Systems Architecture

Brief Recognition: Mike Pappas co-founded Modulate, where he has led the development and deployment of AI-powered conversational analytics used by Fortune 500 companies and major gaming studios to address harassment, fraud, and user safety at scale. His prior experience includes technical and infrastructure roles at Lola and Bridgewater Associates, spanning machine learning, cloud systems, and software architecture. He also serves as a board member of the Family Online Safety Institute and holds a degree in Physics and Applied Mathematics from MIT.

Mike Pappas describes a shift in how organizations need to understand the voice channel. What was once treated as a routine service interaction has become a setting where fraud, impersonation, and manipulation occur in real time, often faster than existing controls can detect.

The operational gap, in his view, is not in detection capability, but in timing — what happens during the call versus what systems can observe afterward.

Pappas explains the gap directly:

“The biggest harms don’t show up in the logs — they happen while the call is still unfolding. By the time anyone reviews a transcript, the attacker has already succeeded. The real risk is the gap between what’s happening live and what the organization can actually see.”

— Mike Pappas, CEO & Co‑Founder, Modulate

Fraud attempts increasingly rely on urgency, emotional pressure, and impersonation, which surface in the live interaction itself. Because humans respond to emotion before policy, these signals influence decisions before traditional controls can intervene.

Pappas’ position is that detection must operate on those behavioral cues as they occur — requiring models built to interpret the audio stream itself rather than the transcript.

Agents are not trained to recognize adversarial conversational patterns, especially when those patterns are scripted to bypass verification steps. Pappas argues that expecting agents to identify these signals on their own is unrealistic; the solution is to give them real‑time visibility into risk indicators so they are not relying on instinct in high‑pressure moments.

In his framing, AI’s role is to consistently surface those indicators, even under time pressure or when dealing with a convincing impersonation.

In his episode, Ken Morino notes that behavioral and emotional cues disappear when reduced to text, limiting the usefulness of transcript‑based systems for detecting manipulation. The signals that indicate something is off — hesitation, tonal mismatch, conversational steering — are lost once the interaction is flattened into words.

Morino’s view is that AI systems built for real‑time audio can recover those signals and present them in a form that fits into existing workflows without requiring agents to interpret raw audio patterns themselves.

High‑stakes workflows such as authentication, account changes, and payment approvals are exposed because decisions must be made quickly, and attackers exploit that time pressure.

Jon‑Rav Shende adds that deepfake fraud often succeeds by exploiting workflow gaps and that most security teams have limited visibility into the live interaction where the compromise actually occurs. His emphasis is on using AI to surface in‑call signals tied to identity risk, giving security teams a view into the interaction while it is still happening rather than after the fact.

Across the three conversations, several solution patterns emerge:

Surface risk signals during the call, giving agents real‑time context rather than relying on instinct or memory.
Use audio‑native models that capture tone, hesitation, and emotional mismatch — signals that do not survive transcription.
Expose workflow‑level vulnerabilities in identity and approval processes where attackers exploit speed and ambiguity.
Provide agents with structured prompts or cues when risk indicators appear, thereby reducing cognitive load during high‑pressure interactions.
Integrate security visibility into live interactions so teams don’t discover compromises after the fact.
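
As a rough illustration of surfacing a cue during the call rather than after it, the sketch below watches a stream of per-frame risk scores and emits one structured prompt when the rolling average stays high. Everything here is hypothetical: the scores, window size, threshold, and cue text are assumptions, and the upstream audio model that would produce the scores is out of scope.

```python
from collections import deque
from typing import Iterable, Iterator


def live_cues(
    scores: Iterable[float], window: int = 3, threshold: float = 0.7
) -> Iterator[tuple[int, str]]:
    """Yield (frame_index, cue) each time the rolling mean rises to the threshold."""
    recent: deque[float] = deque(maxlen=window)
    alerted = False
    for i, score in enumerate(scores):
        recent.append(score)
        mean = sum(recent) / len(recent)
        if len(recent) == window and mean >= threshold and not alerted:
            alerted = True  # avoid repeating the same cue on every frame
            yield i, "Elevated risk: re-verify identity before continuing"
        elif mean < threshold:
            alerted = False  # re-arm once the risk subsides


# Hypothetical per-frame risk scores from an upstream audio model.
frame_scores = [0.1, 0.2, 0.8, 0.9, 0.9, 0.2, 0.1]
cues = list(live_cues(frame_scores))
print(cues)
```

Averaging over a short window is one simple way to keep a single noisy frame from interrupting the agent; the trade-off is a delay of a few frames before the cue appears.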

Specialized Voice‑Intelligence Architecture for High‑Stakes Decisions

Episode: Operationalizing Real-Time Voice Intelligence for FinServ and CX – with Ken Morino of Modulate

Guest: Ken Morino, Director of Market and Behavioral Research at Modulate

Expertise: Product Management, Behavioral Research, User Experience Design, Enterprise Software & Integrations

Brief Recognition: Ken Morino has led product and market research initiatives at Modulate, helping shape AI-driven conversational technology and user-focused product strategy. Prior to Modulate, he spent nearly a decade at LiveShopper Sassie leading enterprise product management, API integrations, and large-scale client implementations, working with major corporate clients and cross-functional technical teams. Earlier in his career, he held product, technical sales, and security solutions leadership roles at Demarc Security, and he holds both a BS in Computer Science and an MA in Economics from UC Santa Barbara.

Ken Morino argues that most organizations are attempting to solve identity‑critical problems with systems that were never designed for identity.

The dominant tools in the market — ASR pipelines, transcript analytics, and generic LLMs — were built for summarization, sentiment scoring, and compliance review. They operate on text, not audio, and they assume that accuracy requirements are flexible. In authentication and account‑change workflows, those assumptions break immediately.

The technical constraints are non‑negotiable:

Identity workflows have fixed latency budgets. A model that takes 1.5 seconds to respond is unusable in a system that must approve or deny an action in under 300 milliseconds.
Transcript‑based systems discard the acoustic features — pitch, timbre, micro‑pauses, harmonic structure — that identity systems depend on.
Generic LLMs cannot meet identity‑grade accuracy thresholds. A 95% accurate model is catastrophic when the remaining 5% is fraud.
Single‑model approaches fail because no individual signal (voiceprint, phrasing, metadata) is reliable enough to detect synthetic audio.
CX analytics systems lack multi‑signal fusion, which is required to combine acoustic, behavioral, and contextual indicators into a defensible identity decision.
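
Two of the constraints above come down to simple arithmetic. The sketch below uses the latency figures from the text and a hypothetical volume of 10,000 identity decisions per day, an assumption chosen only for illustration, to show what a 5% error rate means in absolute terms.

```python
LATENCY_BUDGET_MS = 300   # approve-or-deny deadline from the text
MODEL_LATENCY_MS = 1_500  # the 1.5-second model from the text


def within_budget(latency_ms: float, budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True if a model responds inside the workflow's latency budget."""
    return latency_ms <= budget_ms


def expected_errors(decisions: int, accuracy: float) -> float:
    """Expected number of wrong decisions at a given accuracy."""
    return decisions * (1.0 - accuracy)


DAILY_DECISIONS = 10_000  # hypothetical volume, for illustration only

overrun = MODEL_LATENCY_MS / LATENCY_BUDGET_MS
wrong = expected_errors(DAILY_DECISIONS, 0.95)

print(f"Within budget: {within_budget(MODEL_LATENCY_MS)}")
print(f"Latency is {overrun:.0f}x the budget")
print(f"Expected wrong decisions per day: {wrong:.0f}")
```

At that assumed volume, a 95% accurate model produces about 500 wrong identity decisions a day. Not all of them would be approved fraud, but each one is either a blocked legitimate customer or an attacker let through.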

Morino summarizes the core limitation:

“Once you flatten a conversation into text, you lose the hesitation, the tone, the emotional mismatch — all the things that tell you something isn’t right.”

— Ken Morino, Director of Market and Behavioral Research, Modulate

Mike Pappas adds that identity‑critical decisions require ensemble architectures — multiple specialized models operating on different parts of the audio signal and converging on a single risk assessment.

Jon‑Rav Shende notes that insurers and regulators increasingly expect audit‑ready evidence that shows how each signal contributed to the decision. Together, they view authentication, account changes, and payment approvals as requiring a purpose‑built architecture, not a repurposed analytics stack.
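
One way to picture an ensemble that converges on a single risk assessment, while keeping the per-signal breakdown that auditors would want, is a weighted combination. This is a minimal sketch and not a description of Modulate's architecture: the signal names, weights, and threshold are all hypothetical, and a production system would likely use a learned fusion model instead of fixed weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalScore:
    name: str      # which specialized model produced the score (hypothetical names)
    score: float   # 0.0 (benign) to 1.0 (high risk)
    weight: float  # relative trust placed in this signal


def fuse(signals: list[SignalScore]) -> tuple[float, dict[str, float]]:
    """Return the weighted-average risk and each signal's contribution to it."""
    total_weight = sum(s.weight for s in signals)
    if total_weight <= 0:
        raise ValueError("at least one signal must have positive weight")
    contributions = {s.name: s.score * s.weight / total_weight for s in signals}
    return sum(contributions.values()), contributions


signals = [
    SignalScore("synthetic_voice", 0.9, 2.0),
    SignalScore("behavioral_pressure", 0.6, 1.0),
    SignalScore("context_mismatch", 0.2, 1.0),
]
risk, contributions = fuse(signals)

RISK_THRESHOLD = 0.6  # hypothetical cut-off for escalation
decision = "escalate" if risk >= RISK_THRESHOLD else "proceed"

print(f"risk={risk:.2f} decision={decision}")
for name, share in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"  {name}: {share:.2f}")
```

Because the contributions sum to the final score, the same structure that drives the decision also records how much each signal mattered.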

Workflow‑Level Governance and Shared Ownership for Voice‑AI Outcomes

Episode: Why Deepfake Fraud Beats Your Workflows, Not Your Technology – with Jon-Rav Shende of Thales Group

Guest: Jon-Rav Shende, Global CTO for Data and AI at Thales Group

Expertise: AI Security, Cloud & Enterprise Transformation, Cybersecurity & Risk Management, Data Governance & Trusted AI

Brief Recognition: Jon G. Shende has held senior technology and security leadership roles spanning CTO, CISO, and executive advisory positions focused on AI, cybersecurity, and enterprise transformation. His experience includes leadership roles at Thales, Sutherland, and ForenSec Global, where he led large-scale cloud, security, and AI modernization initiatives for global enterprises, including Fortune 500 organizations and multi-billion-dollar transformation programs. He also brings experience with major technology and consulting ecosystems, including Ernst & Young and Cognizant, as well as with cloud platforms such as AWS, Azure, and Google, alongside active involvement with InfraGard and extensive work in AI governance, cyber resilience, and trusted AI adoption.

Jon‑Rav Shende’s contribution across the conversations is that the technical capability to detect risk is only half the problem. The other half is organizational: once a system can surface identity‑relevant signals, the enterprise must decide who owns the response, how evidence is captured, and how decisions become defensible to regulators, auditors, and insurers.

In his view, the failure mode is not just missed detection; it is unclear ownership, inconsistent escalation, and the absence of audit‑ready records that explain why an action was taken:

“Organizations don’t fail because the signal wasn’t there. They fail because no one knows who is supposed to act on it. If Security sees something but Operations owns the workflow, the alert dies in the middle. And when something goes wrong, there’s no record that shows what was known, when it was known, and who made the decision. That’s what regulators look for, and that’s what insurers look for.”

— Jon‑Rav Shende, Global CTO for Data and AI at Thales Group

Ken Morino adds that governance also depends on interpretability. A model can detect a signal, but if the output is ambiguous or requires a specialist to decode, the organization has not solved the problem.

Morino’s view is that the system must present signals in a form that fits into existing workflows. The moment an agent or analyst has to work out what the model meant, accountability becomes unclear and decisions become inconsistent.

Mike Pappas reinforces this from a defensibility perspective. High‑stakes decisions — authentication approvals, account changes, payment authorizations — must be explainable to regulators and insurers. That requires a shared operational model: Security, Operations, and CX must agree on what constitutes risk, who owns the moment when a signal appears, and how the evidence is captured. Without that alignment, organizations end up with fragmented visibility and no unified record of what happened.

Across the episodes, three governance patterns emerge:

Clear escalation paths that specify who owns the decision when a risk signal appears, and what authority they have to pause, deny, or verify an action.
Audit‑ready evidence trails that capture the signals, the decision, and the rationale in a form regulators and insurers can evaluate.
Cross‑functional alignment between Security, Operations, and CX so that risk signals do not get trapped inside a single team’s workflow.
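
The first two patterns can be sketched together: a lookup that names the owner for each workflow, and a record that captures the signals, the decision, and the rationale at the moment the decision is made. The ownership map, field names, threshold, and action labels below are hypothetical placeholders, not a regulatory schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Hypothetical ownership map: which team owns the response for each workflow.
ESCALATION_OWNERS = {
    "authentication": "Security",
    "account_change": "Operations",
    "payment_approval": "Security",
}


@dataclass
class AuditRecord:
    call_id: str
    workflow: str
    signals: dict[str, float]  # what was known
    owner: str                 # who was responsible for acting
    action: str                # what was decided
    rationale: str             # why
    recorded_at: str = field(  # when it was known (UTC, ISO 8601)
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def escalate(call_id: str, workflow: str, signals: dict[str, float],
             threshold: float = 0.6) -> AuditRecord:
    """Route a risk signal to its owner and record the decision with its rationale."""
    owner = ESCALATION_OWNERS[workflow]
    top_signal, top_score = max(signals.items(), key=lambda kv: kv[1])
    if top_score >= threshold:
        action = "pause_and_verify"
        rationale = f"{top_signal} scored {top_score:.2f}, at or above threshold {threshold:.2f}"
    else:
        action = "proceed"
        rationale = f"no signal reached threshold {threshold:.2f}"
    return AuditRecord(call_id, workflow, dict(signals), owner, action, rationale)


record = escalate("call-001", "payment_approval",
                  {"synthetic_voice": 0.82, "urgency": 0.40})
print(json.dumps(asdict(record), indent=2))
```

Writing the record inside the same function that makes the decision means there is no path where an action is taken without an owner and a rationale attached to it.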

Shende’s view is that once AI begins influencing identity‑critical decisions, the organization must treat those decisions as shared assets rather than departmental tasks. The governance model becomes as important as the model architecture, because without it, even the most accurate system cannot produce outcomes that stand up to scrutiny.