
CEO Spotlight
Nicholas Leonard, CEO of VoiceRun, on Why Code Ownership Is the Moat in Enterprise Voice AI
The question I get most from CTOs isn't about which LLM to use or which STT provider has the best benchmark. It's some version of "We tried a no-code voice builder, the demo looked great, and now we can't ship it." The root cause is almost always that the team doesn't own the code.
When you build on a visual, no-code interface, you're constrained to whatever the builder anticipated you'd need. If you want your agent to handle a specific dialect, integrate with a proprietary internal API, or execute a branching workflow that's three layers deep, you're either blocked or you're waiting on a vendor roadmap. In code, those things are trivial. There's a long tail of millions of small tasks that an enterprise voice agent needs to handle, and no visual interface can anticipate them all. That's why Derek and I landed on a code-first architecture: what we think of as AWS Lambda for voice agents. Serverless, event-driven, arbitrary code execution. Speech comes in, your code runs, speech goes out. It's your code, not ours.
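To make that concrete, here's a toy sketch of the handler shape I mean. The names are illustrative, not our actual API; the point is the pattern: an event-driven handler gets a transcribed turn, runs whatever code you want (including calls into your own systems), and returns speech to say.

```python
# Hypothetical "Lambda for voice agents" handler; names are illustrative.
ACCOUNTS = {"+15550100": {"balance": 1204.56}}  # stand-in for your internal API

def handle_turn(event: dict) -> dict:
    """Invoked once per caller utterance, like a Lambda invocation."""
    utterance = event["transcript"].lower()   # speech in (already transcribed)
    account = ACCOUNTS.get(event["caller_id"], {})

    # Arbitrary code: hit your own proprietary systems, no vendor roadmap.
    if "balance" in utterance and account:
        reply = f"Your balance is {account['balance']:.2f} dollars."
    else:
        reply = "I can help with balances, transfers, or fraud reports."

    return {"say": reply}  # speech out: the platform handles TTS and telephony

print(handle_turn({"caller_id": "+15550100", "transcript": "What's my balance?"}))
```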
The data problem is equally serious and less discussed. Say you're a platform company with third parties building voice agents on top of your APIs. Every reservation, every support call, every transaction is data you've always owned. You know every click on your web interface. You know every recommendation you served. But the moment a third-party voice agent handles that interaction, you're blind. You don't know what happened, what the agent said, what logic it executed, or why the call ended the way it did. That's a strategic problem. And the only real fix is owning the code that runs the agent.
For regulated industries such as banking, insurance, and financial services, this becomes more acute. If a bank wants to deploy a voice agent, it needs to be able to tell its security team exactly where the data lives and what code is running on it. You can't do that with a black-box third-party solution.
The rollout strategy I recommend for any enterprise, especially in financial services, is incremental by complexity. Start by ripping out the IVR menu. You don't need to integrate into core banking software to do that. You just need natural language routing that understands intent and transfers appropriately. That's high-value, low-risk, and you can ship it fast. Then go case by case, in increasing order of complexity (credit card fraud reporting, account inquiries, loan status). Each one gets you deeper into the core systems, but you're building on a known foundation and generating real production data to inform the next step. The teams that try to boil the ocean in one deployment are the ones that stall.
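Here's roughly what that first step looks like in code. The routes, intent labels, and keyword classifier are stand-ins; in production you'd classify with an LLM or NLU model, but note that nothing here touches core banking systems.

```python
# Hypothetical sketch of IVR replacement via natural-language intent routing.
# Phone numbers and labels are illustrative.
ROUTES = {
    "fraud": "+1-555-0100",     # fraud desk
    "loans": "+1-555-0101",     # loan servicing
    "general": "+1-555-0199",   # catch-all human queue
}

def classify(utterance: str) -> str:
    # Keyword stand-in; in production this would be an LLM or NLU call.
    text = utterance.lower()
    if "fraud" in text or "stolen" in text:
        return "fraud"
    if "loan" in text or "mortgage" in text:
        return "loans"
    return "general"

def route_call(utterance: str) -> str:
    """Map a caller's free-form request to a transfer target."""
    return ROUTES.get(classify(utterance), ROUTES["general"])

print(route_call("someone stole my card and I see charges I didn't make"))
```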
The other thing I tell CTOs is to think about extensibility now. You don't know what your AI stack is going to look like in two years. Maybe you're building a custom credit scoring model that your voice agent will eventually need to call. Maybe you're integrating a new CRM. If your voice agent is built on a no-code platform, adding that integration is someone else's problem on someone else's timeline. If it's built in code you own, it's an API call. That's the compounding advantage of a code-first approach: not just what you can do today, but what you can do as your systems evolve, without having to rebuild from scratch.
Market Pulse
VoiceRun integrated Fish Audio S2 as a new TTS provider. VoiceRun users can now select any voice on Fish Audio directly in their agent code using "provider": "fish_audio" and "identifier": "<model_ID>". Streaming, caching, and the rest of the infrastructure are already handled. The integration was driven by a specific production gap: mainstream TTS models, including Qwen3 and MiniMax, consistently underperform on Thai and other Southeast Asian languages at the fluency level required for enterprise deployments. Fish Audio S2 was trained on 10M+ hours of audio across 80+ languages using a Dual-Autoregressive architecture (a 4B-parameter Slow AR for semantic prediction and a 400M-parameter Fast AR for acoustic detail) and was evaluated in a blind preference test across 71,000+ paired comparisons. S2 Pro scored 8.11 in Chinese versus ElevenLabs V3 at 2.36, with the largest quality gap concentrated in CJK and Southeast Asian languages.
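In agent code, that selection is just a config dict. The two keys are quoted from the release; the surrounding variable is illustrative, and <model_ID> stays a placeholder for a real Fish Audio voice ID.

```python
# TTS provider selection per the release; the tts_config wrapper is
# illustrative, and <model_ID> is a placeholder for a real Fish Audio ID.
tts_config = {
    "provider": "fish_audio",
    "identifier": "<model_ID>",  # any voice published on Fish Audio
}
```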
ElevenLabs' audio data retention defaults vary by plan tier, creating real compliance gaps for B2B2B companies. On Free and Growth plans, audio is used to improve ElevenLabs' models unless you proactively toggle it off, and that opt-out applies only prospectively, meaning data already submitted may already be in training sets. Zero Retention Mode, the primary technical control for data minimization, is available only to Enterprise-tier customers, and voice-cloning workflows are excluded from it entirely, creating a compliance gap for biometric data.
Deepgram shipped JavaScript SDK v5 and Python SDK v6 to general availability. The SDKs are now auto-generated directly from Deepgram's API specs using Fern, so TypeScript types reflect what the API actually returns rather than what a developer thought it returned when they last manually updated the definitions. In Python v6, all WebSocket clients (Listen, Speak, and Agent) are now generated from the AsyncAPI spec, replacing the hand-rolled WebSocket code that produced inconsistencies across v5. Control messages now use named methods (send_keep_alive(), send_finalize(), send_flush()) instead of the generic send_control() pattern, which required looking up the correct message structure each time.
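For a sense of what that looks like in practice, a hedged sketch: the named control methods come from the release itself, but the import path and client construction below are assumptions that may differ in the Fern-generated v6 package.

```python
# Hedged sketch of the v6 control-message surface. The named methods are
# from Deepgram's release notes; the import path and client construction
# are assumptions and may differ in the generated v6 package.
import os
from deepgram import DeepgramClient  # assumed top-level entry point

client = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

# With a live Listen WebSocket connection, v5's generic send_control()
# calls become named methods in v6:
#   connection.send_keep_alive()  # keep the socket open between utterances
#   connection.send_finalize()    # force-finalize the in-flight transcript
#   connection.send_flush()       # flush buffered audio through the pipeline
```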
Mistral released Voxtral TTS, completing its end-to-end speech stack. The model has 4B parameters, runs on a single GPU with 16GB+ VRAM, supports 9 languages, and delivers ~90ms time-to-first-audio in production, putting it in direct latency competition with ElevenLabs Flash. Zero-shot voice cloning works from as little as 3 seconds of reference audio and carries across languages, so a French-accented voice prompt can generate natural French-accented English without separate training. In human evaluations of zero-shot voice cloning, Voxtral TTS showed a 68.4% preference rate over ElevenLabs Flash v2.5.
Speechmatics and Cekura announced an integration that embeds Speechmatics' STT engine directly into Cekura's automated QA platform for voice-agent pipelines. The integration lets teams run head-to-head STT comparisons across providers, including Azure, Gemini, and Deepgram, within a consistent testing environment using their own audio conditions rather than published benchmarks. Critically, testing can happen at every stage of development and deployment, not just pre-launch.
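The underlying technique is easy to reproduce in miniature: score each provider's transcript of the same recording against a trusted reference. A minimal sketch using the open-source jiwer library, with made-up provider names and transcripts:

```python
# Head-to-head STT comparison on your own audio conditions rather than
# published benchmarks. Provider names and transcripts are illustrative.
from jiwer import wer

reference = "i'd like to move my reservation to thursday at seven"
candidates = {
    "provider_a": "i'd like to move my reservation to thursday at seven",
    "provider_b": "i'd like to move my reservation to thursday at eleven",
}

for name, hypothesis in candidates.items():
    # Word error rate against the human-verified reference transcript.
    print(f"{name}: WER = {wer(reference, hypothesis):.2%}")
```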
Microsoft released MAI-Transcribe-1 and MAI-Voice-1 through Microsoft Foundry. MAI-Transcribe-1 achieves the lowest WER across 25 languages on the FLEURS benchmark, processes audio 2.5x faster than Azure's previous Fast offering, and is priced at $0.36/hour. MAI-Voice-1 generates 60 seconds of natural-sounding audio in under one second on a single GPU. Any team building on Azure Foundry now has access to a fully integrated first-party STT-TTS stack without stitching together external providers, which changes the cost, integration, and vendor lock-in calculus for teams currently running multi-vendor audio pipelines.
Cohere released cohere-transcribe-03-2026, an open-weight ASR model under Apache 2.0 that claims the top spot on the HuggingFace Open ASR Leaderboard with a 5.42% average WER, below Whisper Large v3 (7.44%), ElevenLabs Scribe v2 (5.83%), and Qwen3-ASR-1.7B (5.76%). The model has 2 billion parameters, runs on consumer-grade GPUs, and was trained on 500K hours of curated audio-transcript pairs using a Fast-Conformer encoder architecture optimized for throughput alongside accuracy. For enterprise teams, self-hosting a production-quality ASR layer no longer requires a research-grade GPU fleet or a proprietary API dependency. In compliance-heavy environments, including HIPAA, GDPR, and financial conduct regulation, keeping audio data entirely within your own infrastructure is now a viable option.
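Assuming the weights ship in a standard Hugging Face-compatible format (the hub ID below is a guess, not a confirmed path), self-hosting could look as simple as:

```python
# Hedged sketch of self-hosting an open-weight ASR model. Whether
# cohere-transcribe-03-2026 ships in a transformers-compatible format is
# an assumption; the pipeline API itself is standard Hugging Face usage.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="CohereLabs/cohere-transcribe-03-2026",  # hypothetical hub ID
    device_map="auto",  # fits on a consumer-grade GPU per the release
)

result = asr("call_recording.wav")  # audio never leaves your infrastructure
print(result["text"])
```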
A detailed WebRTC architecture guide outlines a three-tier model for regulated voice AI deployments. The media tier manages real-time audio, SIP/WebRTC session state, streaming, barge-in, and call-level evidence generation. The agent tier handles orchestration, LLM reasoning, tool execution, and escalation logic, with the LLM operating inside a defined execution framework. The governance tier enforces identity, access control, data residency, retention policies, and audit trails.
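A toy sketch of how that separation of concerns might fall out in code; the function names and the example policy are illustrative, not interfaces from the guide:

```python
# Illustrative three-tier separation: governance gates what the agent does.
def governance_allows(action: str, region: str) -> bool:
    """Governance tier: residency and retention policy enforced centrally."""
    return region in {"eu-west-1"} and action != "export_raw_audio"

def agent_decide(transcript: str) -> str:
    """Agent tier: LLM reasoning, tool execution, and escalation live here."""
    return "escalate" if "complaint" in transcript else "answer"

def handle_call(transcript: str, region: str) -> str:
    """Media tier hands off a finalized transcript; governance gates the rest."""
    if not governance_allows("process_transcript", region):
        return "reject"
    return agent_decide(transcript)

print(handle_call("i have a complaint about my bill", "eu-west-1"))
```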
Despite accelerating adoption, most organizations are running voice AI in only 5-10% of actual interactions at scale. The use cases with genuine traction are transactional and tightly scoped, including call containment for simple queries, reservations, and FAQ deflection. Broader multi-turn, tool-augmented automation, particularly for high-risk flows, remains constrained by workflow redesign complexity, backend data access, and governance requirements.
Resources & Events
📅 AI-Powered Outbound Dialing in Healthcare (Virtual - April 16, 2026)
Engineers from Deepgram and AWS walk through the architecture for replacing nurse-staffed outbound calling in healthcare with voice AI agents, using clinical trial recruitment as the live example. CROs screening thousands of patients for medication changes, lifestyle status, and eligibility criteria can't do it fast enough with humans, and IVR trees can't handle the clinical nuance. The session shows two working reference architectures (SageMaker + Bedrock, and Amazon Connect + Lex), both running Nova-3 Medical for STT and Aura-2 for TTS, each at sub-250ms latency. Details →
📅 MIT Technology Review EmTech AI (Cambridge, MA - April 21-23, 2026)
EmTech AI is MIT Technology Review's flagship technical conference, designed specifically for decision-makers and senior engineers evaluating emerging AI systems for enterprise deployment. The 2026 agenda covers AI infrastructure, agentic systems, and real-world deployment challenges, with a deliberate focus on technology through the lens of business and operations. Details →
📅 Ai4 Summit (Las Vegas, NV - August 4-6, 2026)
Ai4 is one of the largest applied AI conferences in the US, with 12,000+ attendees, 1,000+ speakers, and dedicated industry tracks across healthcare, financial services, retail, and aerospace. The format prioritizes real deployment case studies, with sector-specific sessions focused on how teams are operationalizing AI within existing enterprise constraints: integration complexity, compliance, workflow redesign, and performance under load. Details →
📊 Report Spotlight: Voice AI in 2026 (Speechmatics)
Based on Speechmatics' own production data combined with deployment case studies across healthcare and financial services, this report highlights that real-time agent usage grew 4x year-on-year in 2025 as demand shifted from batch analytics to in-the-moment response. On latency, teams are pushing toward ~250ms for transcript finalization, with new architectures decoupling turn detection from silence buffering to eliminate the 700–1000ms waiting tax imposed by traditional engines. Domain-tuned models (medical, legal, financial) show up to 70% fewer keyword errors versus general-purpose systems, and that gap is now driving regulated-industry procurement decisions. Nearly one in four companies in Y Combinator's most recent batch is building voice-first products, up 70% from early 2024. Read →
For the Commute
Taming Voice Complexity with Dynamic Ensembles at Modulate (AI Engineering Podcast)
Carter Huffman, CTO of Modulate, makes the case for ensemble architectures over monolithic STT pipelines in production voice systems. The episode covers how Modulate's Ensemble Listening Model routes audio dynamically across specialized models rather than running everything through a single general-purpose stack, and why that tradeoff matters for latency, accuracy under noisy conditions, and silent failure detection. Useful for any team hitting quality ceilings with a single-provider ASR setup and unsure where the degradation is actually coming from.
Poll
How does your team currently manage your STT layer in production?
- Single provider, benchmark-selected: One vendor chosen on public WER metrics, monitored on standard uptime
- Multi-vendor with manual fallback: Two providers configured with manual switching on known failure patterns
- Parallel streams with automated scoring: Multiple providers live simultaneously, automated failover based on real-time WER monitoring
- Self-hosted open-source ASR: Running your own model on-premise for data sovereignty and cost control
