How real-time speech translation works at production scale. The architectural fork every project hits: cascaded (ASR → MT → TTS, three vendors, three logs) versus end-to-end speech-to-speech (Meta SeamlessM4T v2, DeepL Voice, Google Translatotron). The six trade-offs that define every translation system. The latency budget that decides whether the experience feels natural or broken. Written from the platforms we have shipped: Translinguist (16+ language pairs across telehealth, legal, live events), VOLO.live (Black Hat USA 2025, 22,000 participants, six languages), Rafiky (conference interpretation).
Real-time speech translation is the process of converting spoken audio in one language into spoken audio or text in another language with low enough latency that two-way conversation can flow naturally. Two architectures dominate in 2026: cascaded ASR → MT → TTS and end-to-end speech-to-speech using a single multilingual model.
A live video translation system captures audio from a WebRTC session, runs streaming automatic speech recognition to produce a transcript, translates the transcript with a machine-translation model, generates target-language speech with text-to-speech, and publishes the translated audio back into the session as a parallel audio track. Glass-to-glass latency typically lands at 1.2 to 3.0 seconds.
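As a rough illustration, the cascaded path can be treated as a chain of stages whose latencies add up to the glass-to-glass figure. The stage names and the millisecond values below are placeholders invented for this sketch, not measurements from any vendor or from the platforms named in this guide. Only the 1.2 to 3.0 second range comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: int  # assumed, illustrative value


# Hypothetical per-stage budget for one speech segment.
PIPELINE = [
    Stage("webrtc_capture", 150),
    Stage("streaming_asr", 500),
    Stage("machine_translation", 250),
    Stage("text_to_speech", 400),
    Stage("publish_audio_track", 200),
]


def total_latency_ms(stages):
    """Sum the per-stage latencies of a strictly sequential pipeline."""
    return sum(stage.latency_ms for stage in stages)


def in_typical_range(stages, low_ms=1200, high_ms=3000):
    """Check the total against the 1.2 to 3.0 second range quoted above."""
    return low_ms <= total_latency_ms(stages) <= high_ms


def slowest_stage(stages):
    """Name the stage that consumes the largest share of the budget."""
    return max(stages, key=lambda stage: stage.latency_ms).name
```

With these placeholder values, `total_latency_ms(PIPELINE)` is 1500 ms and the slowest stage is streaming ASR. Swapping in measured numbers per vendor shows quickly which stage to optimize first.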
A cascaded pipeline chains three models, often from three vendors. Each stage emits inspectable text or audio, which makes per-stage observability and PII redaction simple. End-to-end speech-to-speech takes audio in and emits audio out from a single model. End-to-end wins on latency and prosody preservation. Cascaded wins on accuracy for technical vocabulary, vendor flexibility, and audit-trail clarity. Most production systems still ship cascaded in 2026. Not sure yet whether you need a translator, an interpreter, or AI at all? Read our 2026 decision tree first.
Four shapes of real-time speech translation dominate the 2026 landscape. Each one fits a different architecture, vendor stack, and latency ceiling.
Multi-party live translation for conferences, summits, online events. Cascaded ASR → MT → TTS, optional human-interpreter fallback. Scale 100 to 22,000 participants across 4 to 8 simultaneous languages. Reference: VOLO.live at Black Hat USA 2025, 22,000 participants, six languages, sub-three-second end-to-end latency.
Two-way translated conversations for medical, legal, customer-support workflows. HIPAA BAA chain across ASR / MT / TTS vendors. Voice cloning preserves practitioner identity. Reference: Translinguist for telehealth interpretation across 16+ language pairs.
Display-only translation. Captions over the original audio. Lower latency target (sub-1.5 seconds end-to-end, no TTS stage). Used for broadcast, accessibility compliance, hearing-impaired audiences. KUDO, Wordly, Interprefy lead this segment.
Developer-focused SDK / API. Deepgram, AssemblyAI, AWS Transcribe + Translate, Azure Speech Translation, Google Cloud Speech-to-Speech. Used inside chat apps, meeting platforms, customer-service tools.
Three production builds running today across very different shapes: Translinguist (telehealth interpretation across 16+ language pairs), VOLO.live (live multilingual translation at Black Hat USA 2025), and Rafiky (a conference interpretation platform mixing AI and human interpreters).
Translinguist. Architecture. Cascaded ASR → MT → TTS with voice cloning at the TTS layer.
Outcome. 16+ language pairs in production. Sub-two-second cascaded end-to-end latency. Voice cloning preserves speaker identity. The HIPAA-compliant BAA chain runs from the cloud provider through every model vendor. PHI redaction between ASR and MT stages prevents identifiers from being persisted in translation logs.
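The redaction step between ASR and MT can be sketched as a placeholder swap: identifiers are replaced with tokens before the transcript reaches the translation stage and its logs, then restored on the translated text. This is a minimal sketch under stated assumptions. The three regex patterns are illustrative and cover only a small part of what counts as PHI, the function names are hypothetical, and the sketch assumes the MT model passes bracketed tokens through unchanged.

```python
import re

# Illustrative patterns only. Real PHI detection needs far broader coverage
# (names, addresses, record numbers) and usually a dedicated de-identification tool.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(transcript):
    """Replace matched identifiers with tokens; return the text and the token map."""
    mapping = {}

    def make_sub(label):
        def sub(match):
            token = f"[{label}_{len(mapping)}]"
            mapping[token] = match.group(0)
            return token
        return sub

    for label, pattern in PATTERNS.items():
        transcript = pattern.sub(make_sub(label), transcript)
    return transcript, mapping


def restore(translated, mapping):
    """Put the original identifiers back after translation (assumes tokens survive MT)."""
    for token, original in mapping.items():
        translated = translated.replace(token, original)
    return translated
```

Only the redacted transcript would be sent to the MT vendor and written to translation logs. The token map stays in memory for the life of the segment.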
VOLO.live. Architecture. Hybrid cloud plus self-hosted. Cascaded translation with conference-scale autoscaling on the public broadcast tier. Self-hosted speaker tracks for control over voice cloning.
Outcome. 22,000+ participants at Black Hat USA 2025. Six-language live translation. Sub-three-second end-to-end latency at peak load. Scale shape: 22K listeners × 6 languages = 132K simultaneous translation streams at keynote peaks. Per-language audio tracks generated server-side and distributed via CDN edge cache.
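The scale arithmetic, and the reason per-language server-side tracks matter, can be written out directly. The function names are hypothetical. The listener and language counts come from the text, and the single-speaker keynote is an assumption made for the example.

```python
def peak_streams(listeners: int, languages: int) -> int:
    """Upper-bound stream count as stated in the text: listeners x languages."""
    return listeners * languages


def translation_pipelines(speakers: int, languages: int) -> int:
    """With one server-side track per language, translation work follows
    speakers x target languages and is independent of audience size."""
    return speakers * languages


# Figures from the text: 22,000 listeners, six languages.
assert peak_streams(22_000, 6) == 132_000

# Assumed single keynote speaker: six translated tracks are generated once,
# and the CDN edge handles the fan-out to listeners.
assert translation_pipelines(1, 6) == 6
```

The design choice this illustrates: the expensive ASR, MT, and TTS work is paid per language track, while delivery to the audience is a distribution problem handled at the edge.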
Rafiky. Architecture. Cascaded ASR → MT → TTS with human-interpreter fallback option for high-stakes sessions.
Outcome. Production conference interpretation platform serving multi-language events at scale. Architecture mirrors KUDO and Interprefy in shape: cascaded AI for routine multilingual broadcast, human interpreters bookable on demand for high-stakes sessions. The mixed AI / human model is the dominant 2026 pattern for premium conference work.
Three architectural paths for shipping a real-time speech translation product. None is universally correct. The right choice is a function of usage volume, customization depth, compliance scope, and whether translation is the product or a feature of another product. Still deciding whether you need AI at all, or a human translator, or a human interpreter? Read the 2026 decision tree first. The framework below picks up after that decision.
Custom build. Wins when: Translation is the product. Custom voice cloning required. Regulated industry (HIPAA, EU AI Act high-risk). Branded experience embedded in your own product. Multi-tenant SaaS plays. Low-resource language pairs not covered by major platforms. Custom domain glossaries at the heart of the offering.
Cost shape: $8K-$40K build over 1-3 months. $500-$2K monthly operations. Per-minute cost $0.03-$0.17 depending on stack tier.
Archetypes: Translinguist, VOLO.live, Rafiky.
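To turn the custom-build cost shape into a monthly figure, a rough estimate can add fixed operations to usage-based cost. This is a sketch under assumptions: it treats the per-minute cost as additive to monthly operations, which the text does not state, and the 10,000 translated minutes in the usage example are hypothetical. Rates are held in cents to keep the arithmetic exact.

```python
def monthly_cost_usd(minutes: int, rate_cents_per_min: int, ops_usd: int) -> float:
    """Fixed monthly operations plus usage, computed in cents then converted."""
    return (ops_usd * 100 + minutes * rate_cents_per_min) / 100


def monthly_cost_range(minutes: int) -> tuple:
    """Low and high ends using the ranges quoted above:
    $0.03-$0.17 per minute and $500-$2K monthly operations."""
    low = monthly_cost_usd(minutes, rate_cents_per_min=3, ops_usd=500)
    high = monthly_cost_usd(minutes, rate_cents_per_min=17, ops_usd=2000)
    return low, high
```

For example, `monthly_cost_range(10_000)` gives a range of $800 to $3,700 per month, before amortizing the one-time build cost.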
Off-the-shelf platform. Wins when: Conference interpretation use case. Standard language pairs (top 30). No custom voice cloning. No custom domain glossary. No in-house engineering capacity. Willing to live within the vendor's UX and feature set.
Cost shape: $5K-$50K per major event for conference platforms. Per-meeting and subscription pricing varies. Operationally simpler than building. See the four-vendor public-data comparison.
Hybrid AI plus human. Wins when: Conference and event work spans routine sessions (AI) and high-stakes keynotes (human interpreters). The dominant 2026 pattern for premium events. KUDO, Interprefy, and Translinguist all support this routing model.
Pattern: AI cascaded translation for routine sessions and high-volume broadcast tracks. Certified human interpreters for keynotes, regulated proceedings, and multi-language Q&A panels. Event organizers route each session according to its stakes.
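The per-session routing rule can be sketched as a small lookup. The session-type labels, the `Route` enum, and the `force_human` override are hypothetical names chosen for illustration. The high-stakes categories mirror the ones listed in the pattern above.

```python
from enum import Enum


class Route(Enum):
    AI = "ai_cascaded"
    HUMAN = "human_interpreter"


# Session types treated as high-stakes, following the pattern described above.
HIGH_STAKES_TYPES = {"keynote", "regulated_proceeding", "multilingual_qa_panel"}


def route_session(session_type: str, force_human: bool = False) -> Route:
    """Send high-stakes sessions to human interpreters and the rest to AI.

    force_human is an assumed organizer override for one-off cases.
    """
    if force_human or session_type in HIGH_STAKES_TYPES:
        return Route.HUMAN
    return Route.AI
```

A keynote routes to `Route.HUMAN`, a routine breakout routes to `Route.AI`, and an organizer can still override any single session.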
Cost ranges are indicative for 2026. Implementation specifics (concurrency target, language pair count, compliance scope, voice cloning vs synthetic voices, custom glossary depth) dominate the spread within each tier.
A custom real-time speech translation system costs $8K–$40K to build over 1–3 months. KUDO and Interprefy ship event-ready in days. Hybrid (AI for scale, human for stakes) is the dominant 2026 pattern for premium conference and event work.
If you are scoping a real-time speech translation system and want a second opinion on cascaded-versus-end-to-end, the vendor stack, language-pair complexity, the voice-cloning consent shape, or the EU AI Act compliance approach, write to us. A senior engineer with shipped translation platforms in production replies within 24 hours.