Voice AI Agent
A voice AI agent is a software system that holds spoken conversations in real time by combining speech recognition, a large language model, and speech synthesis. It is typically used for customer-facing phone calls.
Also known as: conversational voice AI · voice agent · AI phone agent
The technical stack
A voice AI agent stitches together four components in real time:
- ASR (Automatic Speech Recognition) — converts the caller's speech to text. Industry leaders in 2026: Deepgram Nova-3, OpenAI Whisper, Google Speech-to-Text. Streaming ASR is mandatory — non-streaming adds 500-1500ms of perceptible delay.
- LLM (Large Language Model) — decides what to say back. Production deployments typically use GPT-5, Claude Opus 4.7, or Claude Sonnet 4.6 depending on latency-vs-quality tradeoffs.
- TTS (Text-to-Speech) — converts the response into spoken audio. ElevenLabs, Cartesia, and OpenAI's voice models are the 2026 incumbents.
- Orchestration — manages the call lifecycle, interruption handling, function calls (book appointment, transfer to human), and CRM integrations.
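As a rough illustration of how the four components fit together, the sketch below wires stand-in functions into a single caller turn. Everything here is hypothetical: `fake_asr`, `fake_llm`, `fake_tts`, `handle_turn`, and the tool names are placeholders for whatever vendor SDKs a real deployment would use, and a production system would stream audio rather than pass whole byte strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class AgentReply:
    text: str                   # what the agent will say
    tool: Optional[str] = None  # name of a function call, if the LLM requested one


def fake_asr(audio: bytes) -> str:
    """Stand-in for streaming ASR: treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> AgentReply:
    """Stand-in for the LLM: a keyword rule instead of a real model."""
    lowered = transcript.lower()
    if "appointment" in lowered:
        return AgentReply("Sure, booking that now.", tool="book_appointment")
    if "human" in lowered:
        return AgentReply("Transferring you now.", tool="transfer_to_human")
    return AgentReply("How can I help you today?")


def fake_tts(text: str) -> bytes:
    """Stand-in for TTS: returns the text as bytes instead of audio."""
    return text.encode("utf-8")


def handle_turn(audio: bytes,
                tools: Dict[str, Callable[[], str]],
                log: List[str]) -> bytes:
    """Orchestration for one caller turn: ASR -> LLM -> optional tool -> TTS."""
    transcript = fake_asr(audio)
    reply = fake_llm(transcript)
    if reply.tool is not None and reply.tool in tools:
        # Run the requested function call and record its result.
        log.append(tools[reply.tool]())
    return fake_tts(reply.text)


call_log: List[str] = []
tools = {
    "book_appointment": lambda: "booked",
    "transfer_to_human": lambda: "transferred",
}
audio_out = handle_turn(b"I need an appointment", tools, call_log)
print(audio_out, call_log)
```

The orchestration layer is the only part that knows about all three models, which is why interruption handling and function calls usually live there.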
How voice AI agents differ from chatbots
Chatbots are text-first and turn-based: you type, it answers. Voice AI agents are speech-first and continuous: both sides speak, often overlapping ("barge-in"). The engineering is harder because the system has to detect when the caller has finished speaking, decide whether to interrupt, and recover gracefully when the conversation goes off the rails.
The hardest technical problem in voice AI is end-of-turn detection: knowing whether the caller has finished a sentence or has only paused mid-thought. Most production systems combine voice activity detection (VAD), semantic completion estimation, and silence thresholds (typically 600-1200ms).
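One way to combine those three signals is sketched below. This is a minimal illustration, not any vendor's algorithm: `TurnState`, `end_of_turn`, the `completion_prob` score, and the 0.7 cutoff are all assumptions, and only the 600-1200ms silence window comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    silence_ms: int         # time since the VAD last reported speech
    vad_speech: bool        # True while the VAD still hears the caller
    completion_prob: float  # hypothetical 0..1 score that the utterance is complete


def end_of_turn(state: TurnState,
                min_silence_ms: int = 600,
                max_silence_ms: int = 1200,
                completion_cutoff: float = 0.7) -> bool:
    """Decide whether the caller has finished their turn."""
    if state.vad_speech:
        # The caller is still talking, so the turn cannot be over.
        return False
    if state.silence_ms >= max_silence_ms:
        # Long silence ends the turn regardless of the semantic score.
        return True
    if state.silence_ms >= min_silence_ms:
        # Medium silence ends the turn only if the sentence looks complete.
        return state.completion_prob >= completion_cutoff
    # Short pauses are treated as mid-thought hesitation.
    return False


print(end_of_turn(TurnState(silence_ms=700, vad_speech=False, completion_prob=0.9)))
```

The two thresholds trade responsiveness against interruptions: a lower minimum makes the agent reply faster but cuts callers off more often.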
Latency budget
End-to-end latency, measured from the moment the caller stops speaking to the moment the agent starts speaking, is the single most-measured metric in voice AI. The industry target is under 800ms. Above 1500ms, callers tend to assume the agent has hung up.
Typical 2026 breakdown:
- ASR streaming partial: 100-200ms
- LLM time-to-first-token: 250-450ms
- TTS time-to-first-byte: 100-200ms
- Network round-trips: 100-300ms
- Total budget: 550-1150ms
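The arithmetic behind that total can be checked with a few lines. The stage ranges and the 800ms target are taken from the text above; the dictionary keys and variable names are only illustrative.

```python
# Per-stage latency ranges in milliseconds: (best case, worst case).
BUDGET_MS = {
    "asr_partial": (100, 200),
    "llm_ttft": (250, 450),
    "tts_ttfb": (100, 200),
    "network": (100, 300),
}
TARGET_MS = 800

best = sum(lo for lo, _ in BUDGET_MS.values())
worst = sum(hi for _, hi in BUDGET_MS.values())

headroom = TARGET_MS - best    # slack when every stage hits its best case
overshoot = worst - TARGET_MS  # excess when every stage hits its worst case

print(f"total: {best}-{worst}ms, headroom {headroom}ms, overshoot {overshoot}ms")
```

With these figures the best case leaves 250ms of headroom, while the worst case runs 350ms over the target, so a deployment cannot sit at the slow end of every stage and still meet 800ms.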
Common deployments
- Inbound receptionist — answer + qualify + book or dispatch. Largest commercial deployment in 2026.
- Outbound sales / appointment-setting — high-volume cold outbound (regulatory friction in most US states; check TCPA).
- In-call assist — coach a human agent in real time with prompts during the call.
- IVR replacement — replace touch-tone menus with natural language ("press 1 for…" becomes "what can I help you with?").
Limitations in 2026
- Multi-party conferences — most agents handle 1:1 calls only.
- Code-switching — switching languages mid-conversation works but accents reduce ASR accuracy.
- Emotional escalation — agents detect frustration but don't always defuse it. Routing to a human is still the standard pattern.
- Complex authentication — voice biometric KYC is improving but not yet trusted for high-stakes verification.
Vendors by category (June 2026)
- Vertical receptionists — Hi Agent (home services), Smith.ai (pro services), Goodcall (general SMB).
- Dev platforms — Vapi, Bland.ai, Retell, Synthflow.
- Enterprise CX — Sierra, Avoca AI, ServiceTitan Voice.
- Voice infrastructure — ElevenLabs, Cartesia, Deepgram, OpenAI Realtime.