
Big Story
How Real-Time Voice Agents Are Built in Production
Real-time voice systems are built as streaming pipelines, where audio is captured continuously from the client, chunked into small frames, and sent over persistent connections such as WebSockets. The system processes partial audio as it arrives, allowing transcription to begin before the user finishes speaking. This enables downstream components to start inference early and reduces perceived latency.
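The chunking step can be sketched in a few lines. This is a minimal illustration, assuming 16 kHz, 16-bit mono PCM and a 20 ms frame size; the constants and the `frames` helper are assumptions for the sketch, not any particular SDK's API.

```python
# Sketch: split a 16 kHz, 16-bit mono PCM buffer into 20 ms frames,
# each of which would be sent over a persistent connection (e.g. a WebSocket).
SAMPLE_RATE = 16_000      # samples per second (assumed)
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_MS = 20             # frame duration in milliseconds (assumed)
FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 640 bytes

def frames(pcm: bytes):
    """Yield fixed-size frames; a trailing partial frame is held back for the next read."""
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[off:off + FRAME_BYTES]

one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)  # 1 s of silence
chunks = list(frames(one_second))
print(len(chunks), len(chunks[0]))  # 50 640
```

Because each frame is emitted as soon as it fills, the server can begin transcribing partial audio while the user is still speaking.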
The core architecture consists of three components: speech-to-text, a language model, and text-to-speech, which operate independently. Teams optimize latency by co-locating services and minimizing network round-trip times; otherwise, latency compounds across the pipeline and response times degrade quickly.
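The compounding effect is easiest to see as a per-turn latency budget. Every number below is an illustrative assumption, not a measurement from any specific system; the point is that stage latencies add, so the pipeline is judged by its sum.

```python
# Sketch: a per-turn latency budget across the STT -> LLM -> TTS pipeline
# (all numbers are illustrative assumptions).
BUDGET_MS = {
    "network_uplink": 40,    # client -> server audio transport
    "stt_partial": 150,      # time to a usable partial transcript
    "llm_ttft": 300,         # LLM time-to-first-token
    "tts_first_audio": 120,  # time to first synthesized audio chunk
    "network_downlink": 40,  # server -> client audio transport
}

def over_budget(budget: dict[str, int], target_ms: int) -> bool:
    """Stage latencies compound, so the pipeline is checked as a sum, not per stage."""
    return sum(budget.values()) > target_ms

total = sum(BUDGET_MS.values())
print(total, over_budget(BUDGET_MS, 800))  # 650 False
```

Shaving 100 ms off any single stage here improves the whole turn, which is why co-location and round-trip reduction pay off across every component at once.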
Audio chunking acts as a primary control lever: smaller chunks reduce the delay between speech and transcription. Browser-based systems typically send larger chunks, which increases latency, while telephony and server-side systems operate on smaller intervals, often in the range of tens of milliseconds, improving responsiveness at the cost of higher compute load.
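The tradeoff can be made concrete: a chunk cannot be sent until it has filled, so chunk duration is a floor on added buffering delay, while smaller chunks mean more messages (and more per-message processing) per second. The 250 ms and 20 ms figures below are illustrative stand-ins for browser-style and telephony-style settings.

```python
# Sketch: buffering delay vs. message rate for a given chunk size (numbers illustrative).
def chunk_tradeoff(chunk_ms: int) -> tuple[int, float]:
    """Return (minimum buffering delay in ms, messages per second)."""
    return chunk_ms, 1000 / chunk_ms

print(chunk_tradeoff(250))  # browser-style chunk: (250, 4.0)
print(chunk_tradeoff(20))   # telephony-style frame: (20, 50.0)
```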
Endpointing determines when the system responds, and silence-based detection alone is insufficient in production. Systems instead incorporate semantic endpointing to detect whether a user has completed a thought, avoiding premature responses during natural pauses and improving turn-taking behavior. As a result, endpointing directly affects both latency and user experience.
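A minimal sketch of combining the two signals follows. The punctuation check is a deliberately naive stand-in for a semantic completeness model, and the 400 ms / 1500 ms thresholds are assumptions; production systems use trained classifiers and tuned timeouts.

```python
# Sketch: silence-based + semantic endpointing (heuristics and thresholds are assumptions).
COMPLETE_ENDINGS = (".", "?", "!")  # naive stand-in for a semantic completeness model

def should_respond(silence_ms: int, partial_transcript: str) -> bool:
    """Respond only after a short silence AND when the utterance looks complete."""
    looks_complete = partial_transcript.rstrip().endswith(COMPLETE_ENDINGS)
    # A long silence overrides the semantic check so the agent never stalls forever.
    return (silence_ms >= 400 and looks_complete) or silence_ms >= 1500

print(should_respond(500, "I want to order a"))  # False: mid-thought pause
print(should_respond(500, "Cancel my order."))   # True: silence + complete thought
```

The key behavior is the first case: a natural pause mid-sentence does not trigger a response, which is exactly the premature turn-taking that silence-only detection gets wrong.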
Function execution is handled through the LLM, which is provided with a set of callable tools and decides when to invoke them. Rather than parsing intent or routing requests, the application layer receives structured outputs from the model and executes corresponding APIs, simplifying client logic while centralizing decision-making in the orchestration layer.
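A sketch of that dispatch pattern, assuming the model emits a tool call as JSON with `name` and `arguments` fields (the tool name, payload shape, and backend function here are illustrative, not any specific provider's schema):

```python
# Sketch: the application layer executing a structured tool call emitted by the LLM.
# No intent parsing or routing happens in the client; it only dispatches.
import json

def get_order_status(order_id: str) -> dict:
    """Stand-in for a real backend API call."""
    return {"order_id": order_id, "status": "shipped"}

TOOLS = {"get_order_status": get_order_status}

def dispatch(tool_call_json: str) -> dict:
    """Execute whichever tool the model selected, with the model's arguments."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch('{"name": "get_order_status", "arguments": {"order_id": "A42"}}')
print(result)  # {'order_id': 'A42', 'status': 'shipped'}
```

Keeping the tool registry in one place is what centralizes decision-making in the orchestration layer: adding a capability means registering a function, not writing new routing logic.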
State is managed outside the model, as the LLM does not maintain an authoritative system state. Instead, it interacts with backend services via APIs, for example, to retrieve or update order data stored in external systems, while the client independently fetches and renders this state. This prevents context bloat and ensures consistency across interactions.
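As a rough illustration of that separation, the sketch below keeps order state in a stand-in backend store; the agent mutates it only through an API-shaped function, and the client reads the same store independently. The store, function names, and order fields are all assumptions for the example.

```python
# Sketch: authoritative state lives outside the model, in a backend store.
ORDERS = {"A42": {"items": ["coffee"], "status": "open"}}  # stand-in for an external DB

def update_order(order_id: str, item: str) -> dict:
    """API the agent calls via a tool; the LLM never holds the order in its context."""
    ORDERS[order_id]["items"].append(item)
    return ORDERS[order_id]

def fetch_order(order_id: str) -> dict:
    """The client fetches the same state independently to render its UI."""
    return ORDERS[order_id]

update_order("A42", "bagel")
print(fetch_order("A42")["items"])  # ['coffee', 'bagel']
```

Because both paths read the same store, the transcript and the UI cannot drift apart, and the prompt stays small no matter how large the order history grows.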
Interrupt handling is implemented at the client and streaming layers, where audio playback is stopped immediately upon detecting new speech. Systems rely on voice activity detection to identify speech onset within tens of milliseconds, enabling users to interrupt responses without waiting for playback to complete.
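A minimal barge-in sketch: the VAD here is a crude energy threshold purely for illustration (production systems use trained models), and the `Player` class is an assumed stand-in for the client's audio output.

```python
# Sketch: client-side barge-in -- stop playback as soon as VAD flags speech onset.
def vad_is_speech(frame: list[int], threshold: int = 500) -> bool:
    """Crude energy-based voice activity check on PCM samples (threshold assumed)."""
    energy = sum(abs(s) for s in frame) / len(frame)
    return energy > threshold

class Player:
    """Stand-in for the client's TTS playback."""
    def __init__(self):
        self.playing = True
    def stop(self):
        self.playing = False

player = Player()
incoming_frame = [800, -900, 1000, -750]  # loud frame: the user started talking
if player.playing and vad_is_speech(incoming_frame):
    player.stop()  # cut TTS playback immediately rather than finishing the response
print(player.playing)  # False
```

Running this check on every incoming frame is what keeps interruption latency in the tens of milliseconds: the decision happens per frame, not per utterance.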
Latency is managed across the full stack, with time-to-first-token from the language model serving as a key metric. Larger prompts increase this latency, so systems limit prompt size and offload complexity into APIs and specialized agents, reducing response time while improving determinism.
Production systems often decompose workflows into multiple agents, each responsible for a specific task such as routing, payments, or support. A routing layer selects the appropriate agent based on intent, thereby reducing prompt complexity, improving reliability, and enabling independent scaling and optimization of each component.
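The routing layer can be as simple as the sketch below. Real routers typically classify intent with a small model; the keyword table, agent names, and fallback here are assumptions for illustration.

```python
# Sketch: a routing layer selecting a specialized agent by intent (keywords assumed).
ROUTES = {
    "payments": ("pay", "refund", "charge"),
    "support": ("broken", "help", "issue"),
}

def route(utterance: str) -> str:
    """Pick the agent whose keywords match; fall back to a general-purpose agent."""
    text = utterance.lower()
    for agent, keywords in ROUTES.items():
        if any(k in text for k in keywords):
            return agent
    return "general"

print(route("I need a refund for my last charge"))  # payments
print(route("What are your opening hours?"))        # general
```

Each agent behind the router carries only its own prompt and tools, which is what makes them independently scalable and testable.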
Infrastructure ultimately determines system performance at scale. Voice systems require regional deployment to minimize network latency and must scale horizontally to handle burst traffic, as peak scenarios can increase concurrency by orders of magnitude within minutes. The architecture therefore needs to support rapid scaling while maintaining low latency.
These systems are engineered around constraints, where model capability is only one factor. Latency, control, cost, and infrastructure collectively determine how voice agents are built and deployed in production.
Market Pulse
Deepgram is currently valued at $1.3 billion and powers over 2,000 third-party voice products. CEO Scott Stephenson recently spoke about internal development practices, the company's IBM partnership, and what's next for voice infrastructure.
Cloudflare shipped a real-time voice pipeline for its Agents SDK this week. With @cloudflare/voice, developers can add voice to the same Agent architecture (Durable Object, tools, persistence, WebSocket connection model) they already use. Continuous STT and TTS now take approximately 30 lines of server-side code. The pipeline is provider-agnostic, supporting Cartesia, PlayHT, AssemblyAI, Speechmatics, and others.
Deepgram's Voice Agent API now supports reusable agent configurations stored and referenced by UUID. Instead of sending a full configuration with every WebSocket session, teams can define it once and reference it at runtime. The feature supports per-customer personas, regional compliance configurations, A/B testing across voices or prompts, and multi-agent architectures without a code deploy. Template variables using the DG_<VARIABLE_NAME> format are interpolated automatically at session start.
Miravoice built an AI voice agent that conducts long-form quantitative research calls (120+ questions, 40+ minutes per session, no human interviewers). The agent handles open-ended responses, Likert scales, and matrix questions within a single call.
Resources & Events
📅 MLSys 2026 (Bellevue, Washington - May 18-22, 2026)
MLSys is a systems-first conference focused on the intersection of machine learning and infrastructure, bringing together researchers and practitioners working on inference, distributed systems, hardware acceleration, and agent systems. The 2026 program includes research and industry tracks covering topics such as LLM serving, compound AI systems, observability, and performance optimization. For voice systems, this is one of the few venues that directly addresses latency, throughput, and system design tradeoffs at scale. Details →
📅 Low Latency Club Voice AI Meetup (San Francisco, California - April 20, 2026)
This meetup, co-hosted by Telnyx and Deepgram, is focused on how voice AI systems are actually built and operated in production. The session centers on the full real-time pipeline (capturing audio, streaming transcription, LLM reasoning, and generating responses), showing how these components connect into a working system. Live demos walk through a production voice agent pipeline and real-time speech understanding, giving a grounded view of latency, orchestration, and system design decisions. The event is designed for engineers and technical teams working on voice agents, transcription systems, and real-time infrastructure, with an emphasis on practical implementation. Details →
📅 SIGNAL 2026 by Twilio (San Francisco, California - May 6-7, 2026)
SIGNAL is Twilio’s flagship developer conference focused on building and scaling real-time communication systems across voice, messaging, and customer engagement platforms. The event brings together engineers and infrastructure teams working on APIs, telephony, and AI-driven workflows, with sessions covering voice pipelines, event-driven architectures, and global system reliability. For teams building voice agents, it offers a view into how real-time audio systems are integrated with backend services and deployed at scale under production constraints. Details →
📊 Report Spotlight: Security & Governance of Voice Agents (arXiv)
This paper presents a framework for securing voice agents operating in real-time environments, focusing on risks that emerge from probabilistic model behavior. It analyzes how voice agents can be manipulated through prompt injection, adversarial audio inputs, and tool misuse, particularly in systems that combine STT, LLM orchestration, and external API calls. The report outlines a layered defense approach, including input validation, policy enforcement at the tool layer, runtime monitoring, and post-response filtering to prevent unsafe or unauthorized actions. Read →
For the Commute
How Voice Agents Are Becoming The New Retail Frontline (Retail OCD)
This episode features Ed Crowley, Founder of URWay Holdings, and Nick Leonard, CEO of VoiceRun, discussing how voice agents are becoming the backbone of modern retail operations. The conversation explores how online and in-store voice agents are being used to fundamentally transform the customer journey and streamline retail efficiency. The discussion breaks down the core technology driving this rapid shift and identifies the specific adoption challenges retailers currently face.
Poll
In production voice agents, what is the biggest constraint to improving end-to-end system performance?
- Time-to-First-Token: Reducing initial LLM response latency without degrading output quality or increasing compute cost
- Real-Time Orchestration: Managing tool calls, API chaining, and response generation within strict latency budgets
- Observability and Debugging: Tracing failures across STT, LLM, and TTS layers with limited visibility into intermediate states
- Cost at Scale: Maintaining low per-interaction cost while supporting high concurrency and real-time performance
