We build custom speech-to-text systems on Whisper Large-v3, Deepgram Nova-3, AssemblyAI Universal, and NVIDIA Parakeet: real-time captions, searchable transcripts, and voice-agent input that hold 90%+ accuracy on your audio, deployed in your cloud or fully on-prem. First working build in 1–3 weeks, from $8K.
Who we build for
The build decision
Most teams start with a managed API because it ships in an afternoon. That works until your audio is noisy, your vocabulary is niche, your data can't leave your network, or your per-minute bill outgrows a fixed-cost build. The point of this page isn't to tell you custom always wins — it's to show you exactly where each option stops being the right one, so you build the version that fits your audio and your constraints. Here's the honest trade-off across the three real options.
The pipeline
Every production STT system is the same five stages. The accuracy and the latency live in how each one is tuned. Here's the path a spoken word takes from a microphone to a finalized, searchable transcript — with the millisecond budget that decides whether captions feel live or laggy.
Figure 1: Real-time speech-to-text pipeline — capture to delivery, with per-stage latency budget.
Audio comes off WebRTC (Opus), a phone line (SIP), or a file. We resample to 16 kHz mono PCM and chunk it into 20–40 ms frames for streaming. Latency budget: ~20 ms.
Silero VAD or WebRTC VAD drops silence and detects endpoints, so the ASR only burns compute on speech and knows when an utterance ends. Latency budget: ~10 ms.
A streaming model (Whisper Large-v3, Deepgram Nova-3, or NVIDIA Parakeet) emits partial transcripts within a few hundred ms and finalizes on endpoint. This is where engine choice and your accuracy bar collide. Latency budget: 150–400 ms.
Punctuation and casing, speaker diarization (pyannote, for who said what), custom-vocabulary boosting for names and jargon, and PII redaction where compliance needs it. Latency budget: ~30 ms.
Live captions stream back over WebSocket; the finalized transcript lands in storage with a word-level timestamp index so every spoken word is searchable later. Latency budget: streamed.
A well-tuned streaming pipeline shows the first partial caption in under 500 ms and finalizes a sentence in under 1 second, fast enough that captions track a live speaker. For the deep audio-pipeline engineering (echo cancellation, jitter buffer, packet loss), see our real-time audio knowledge base.
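As a rough illustration of the capture stage and the budget arithmetic, here is a minimal Python sketch. It assumes 16-bit samples (the page only says 16 kHz mono PCM), reads each latency figure as belonging to the stage described just before it, and treats the per-stage figures as simply additive, which is a simplification. The stage names and function names are ours, chosen for illustration.

```python
SAMPLE_RATE_HZ = 16_000   # from the text: 16 kHz mono PCM
BYTES_PER_SAMPLE = 2      # assumption: 16-bit samples (not stated in the text)


def frame_pcm(pcm: bytes, frame_ms: int = 20) -> list[bytes]:
    """Split raw mono PCM into fixed-size frames, dropping a trailing partial frame."""
    if not 20 <= frame_ms <= 40:
        raise ValueError("frame_ms should be in the 20-40 ms range used for streaming")
    frame_bytes = SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE
    last_start = len(pcm) - frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, last_start + 1, frame_bytes)]


# Per-stage budgets in ms as (best case, worst case). Stage names are illustrative.
STAGE_BUDGET_MS = {
    "capture": (20, 20),
    "vad": (10, 10),
    "asr": (150, 400),
    "post_processing": (30, 30),
}


def total_budget_ms(stages: dict[str, tuple[int, int]] = STAGE_BUDGET_MS) -> tuple[int, int]:
    """Sum the best-case and worst-case budgets across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    print(len(frame_pcm(one_second_of_silence, frame_ms=20)), "frames")
    print("budget (best, worst) ms:", total_budget_ms())
```

Under these assumptions the listed stages sum to 210–460 ms, which is consistent with a first partial caption in under 500 ms. A real pipeline overlaps stages, so measured latency on your feed is the number that counts.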
Engine selection
There is no single "best" speech-to-text model — there's the right one for your audio, latency, languages, and deployment. We benchmark candidates on your real recordings before committing. The figures below are the published 2025–2026 baselines we start from; the numbers that matter are the ones we measure on your feed.
The gap between a vendor's headline WER and the number you'll actually get is the whole game. Benchmark scores come from clean, read-aloud audio like LibriSpeech; real product audio is people interrupting each other on a bad connection, using acronyms a model has never seen. That's why AssemblyAI can post ~2.1% on a clean set and ~14.5% on hard mixed audio — same model, very different reality.
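For reference, word error rate is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. A minimal sketch follows; it lowercases and splits on whitespace only, whereas real benchmarks usually apply heavier text normalization, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the patient takes metoprolol daily"
    hypothesis = "the patient takes metro pull daily"  # made-up engine output
    print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

In the made-up example, one drug name split into two wrong words costs one substitution and one insertion against five reference words, which is how a single piece of unfamiliar jargon can move the score far more than it seems it should.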
We measure on your conditions first, because a decision made on a leaderboard number is a decision you re-make in production. Cost works the same way: a managed API at a fraction of a cent per minute looks cheap until you're transcribing millions of minutes a month, at which point a self-hosted Parakeet or Whisper deployment on your own GPUs often costs less per hour and keeps the data in-house.
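The cost crossover can be estimated with simple arithmetic. The sketch below is illustrative only: every price in it is a hypothetical placeholder, not a quote from any vendor or from us, and it ignores engineering and maintenance time.

```python
def breakeven_minutes_per_month(
    api_price_per_min: float,
    selfhost_fixed_per_month: float,
    selfhost_cost_per_min: float = 0.0,
) -> float:
    """Monthly volume at which a self-hosted deployment costs the same as a managed API.

    Managed API:  cost = api_price_per_min * minutes
    Self-hosted:  cost = selfhost_fixed_per_month + selfhost_cost_per_min * minutes
    """
    margin = api_price_per_min - selfhost_cost_per_min
    if margin <= 0:
        raise ValueError("self-hosting never breaks even if its per-minute cost is not lower")
    return selfhost_fixed_per_month / margin


if __name__ == "__main__":
    # All three inputs are hypothetical placeholders for illustration.
    minutes = breakeven_minutes_per_month(
        api_price_per_min=0.005,          # hypothetical API price, USD per minute
        selfhost_fixed_per_month=3000.0,  # hypothetical GPU and ops cost, USD per month
        selfhost_cost_per_min=0.001,      # hypothetical marginal cost, USD per minute
    )
    print(f"break-even at about {minutes:,.0f} minutes per month")
```

Below the break-even volume the managed API is cheaper; above it, the fixed-cost deployment wins. Plug in your own quotes and GPU costs before drawing any conclusion.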
What we build
Real-time captions for events, broadcasts, and meetings that meet WCAG and ADA requirements. Translinguist runs live closed captioning in 22 languages on the STT layer we built.
Searchable recordings where every spoken word is indexed and jumpable. We built V.A.L.T.'s spoken-word search on Amazon Transcribe — 770+ organizations, 50,000+ users, exportable as PDF reports.
Real-time agent assist, automated QA scoring, and call summaries. Streaming transcripts feed sentiment and topic models the moment a call starts, so supervisors see a flagged call while it's still live instead of reviewing it a day later.
Ambient scribe and dictation, HIPAA-compliant and on-prem if needed, with custom medical vocabulary so drug names and procedures transcribe correctly. The audio never leaves your network, which is the constraint most off-the-shelf APIs can't meet for clinical work.
Subtitle generation from raw footage with timecodes and speaker labels. Need the captions in another language too? That's real-time speech translation.
STT as the input layer for LLM voice agents, where partial transcripts and fast endpointing decide whether the conversation feels natural or stilted. Every 100 ms the ASR shaves off the front of the pipeline is 100 ms the agent answers sooner. vBoard pairs Whisper with GPT-4 to turn raw dictation into publish-ready text.
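For the searchable-recordings case above, the core idea of a word-level timestamp index can be sketched in a few lines of Python. This is a toy in-memory version with hypothetical names (`Word`, `build_index`, `find`), not the design of any product mentioned here; a production system would typically sit on a search engine or database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start_s: float  # when the word starts in the recording, in seconds
    end_s: float    # when the word ends, in seconds


def build_index(words: list[Word]) -> dict[str, list[float]]:
    """Map each normalized word to the start times at which it was spoken."""
    index: dict[str, list[float]] = defaultdict(list)
    for word in words:
        key = word.text.lower().strip(".,!?;:")
        if key:
            index[key].append(word.start_s)
    return dict(index)


def find(index: dict[str, list[float]], term: str) -> list[float]:
    """Return every timestamp at which the term occurs, or an empty list."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    transcript = [
        Word("Dosage", 1.2, 1.6),
        Word("of", 1.6, 1.7),
        Word("metoprolol,", 1.7, 2.4),
        Word("dosage", 9.0, 9.4),
    ]
    index = build_index(transcript)
    print(find(index, "dosage"))  # start times a player could jump to
```

Each hit is a start time the player can seek to, which is what makes a transcript jumpable rather than just readable.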
When custom wins
A managed API is the right call when your audio is clean, English, and your data can leave your network. Custom wins the moment one of those stops being true — and it wins at any volume, from your first 100 hours of audio to 50 million minutes a month. The axes that decide it aren't cost and scale; they're how accurate the system has to be on your audio and how much control you need over it.
Figure 2: Build vs Buy — accuracy on your domain audio × data control and ownership. Custom wins the top-right at any volume.
How we work
You have audio and an accuracy target, no system yet. We map requirements, benchmark engines on your data, and ship a working real-time or batch pipeline.
You have STT live but it's wrong too often, too slow, or too expensive. We tune vocabulary, swap or fine-tune the engine, and cut latency or per-minute cost.
You inherited a half-built or unmaintained transcription system. We stabilize it, document it, and extend it, the way we took over and rebuilt Rafiky's real-time pipeline.
Pricing
Fixed-scope starting points. Final scope depends on languages, real-time vs batch, deployment, and accuracy targets — run the calculator for an instant estimate.
An independent assessment of your streaming build, written by engineers who would actually ship it. Pick the one that fits where you are now: planning the MVP, mid-build, or stabilizing what's already in production. NDA before any code, footage, or system access changes hands.
Competitor analysis, core feature definition, monetization modeling, and a full launch blueprint — delivered within a week. Written by engineers who'll build what they plan.
An independent review of your system's technology choices, structural components, and workload fit — with a plain verdict on what's working, what's a liability, and exactly what to change to reach your goal. Delivered within a week.
A full audit of your code with every issue documented, evidenced, and located — exact file, exact line. Plus a system architecture review and a prioritized fix roadmap. Not a consultant's opinion. A case file. Delivered within a week.
No commitment. NDA before any code, footage, or system access is shared.
Why Fora Soft
Since 2005, the harder half of STT — capture, streaming, latency, the audio pipeline — has been our core. Speech-to-text sits right on top of it.
We've shipped Whisper, Deepgram, Speechmatics, Google Cloud Speech-to-Text, and Amazon Transcribe in real products — not slideware. Translinguist, V.A.L.T., VocalViews, vBoard.
No offshore handoffs. The team that benchmarks your audio is the team that ships and maintains it. We think like product owners, not ticket-takers.
250+ products since 2005 and a 100% job-success score on Upwork. We finish what we start and hand it over clean.
FAQ
On clean English, modern engines clear ~95% accuracy (around 5% word error rate). The real number depends on your audio — accents, background noise, overlapping speakers, and domain vocabulary all move it. Custom tuning (vocabulary boosting and fine-tuning) typically recovers 5–8 accuracy points on hard audio, which is usually the difference between a system that frustrates users and one that ships.
Use a managed API (Deepgram Nova-3, AssemblyAI) when your audio can leave your network and you want the lowest-latency streaming with no infrastructure to run. Use self-hosted Whisper Large-v3 or NVIDIA Parakeet when you need on-prem privacy, 90+ languages, or to control cost per hour at volume. We benchmark both on your audio before recommending one.
Yes. A tuned streaming pipeline shows the first partial caption in under 500 ms and finalizes a sentence in under 1 second — fast enough to caption a live speaker or feed a voice agent. Real-time is harder than batch; it's also exactly what we've done for 20 years.
Whisper covers 90+ languages out of the box; Deepgram and Google cover 30–100+ depending on the model. We've shipped transcription in 30+ languages (VocalViews) and captioning in 22 (Translinguist). We confirm coverage and accuracy per language on your audio.
Yes — self-hosted Whisper or Parakeet runs entirely inside your network or VPC, so audio never leaves. This is how we build for HIPAA, GDPR, and contractual data-residency requirements, with PII redaction in the pipeline where needed.
Two levers: custom-vocabulary boosting (drug names, product names, acronyms) and fine-tuning the acoustic and language models on a sample of your labeled audio. Together they recover the 5–8 points that off-the-shelf models leave on the table for specialized domains.
Starter integrations begin at $8K, custom-tuned hybrid builds around $16K, and on-prem or fine-tuned enterprise systems from $32K. Final scope depends on languages, real-time vs batch, and deployment — the calculator gives an instant estimate.
A managed-API integration ships in 1–3 weeks; a custom-tuned hybrid in 3–6; an on-prem fine-tuned system in 6–8. You see a working build early and iterate from there.
Yes. Speaker diarization (who said what) uses pyannote; word-level timestamps let us build search that jumps to the exact moment a word was spoken — the way V.A.L.T.'s spoken-word search works across thousands of recordings.
Speech-to-text transcribes audio into text in the same language. Speech translation adds machine translation and text-to-speech to deliver it in another language. If you need cross-language captions or interpretation, see our real-time speech translation work.
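To make the custom-vocabulary answer above more concrete: true vocabulary boosting happens inside the engine and is engine-specific, so it is not shown here. A simpler, related technique is a post-processing pass that snaps near-miss words to a list of known terms. The sketch below uses Python's standard `difflib` for that; the function name, the cutoff, and the sample terms are assumptions for illustration, and it ignores punctuation and multi-word terms.

```python
import difflib


def correct_terms(transcript: str, vocabulary: list[str], cutoff: float = 0.85) -> str:
    """Replace words that closely resemble a known domain term with that term.

    This is post-hoc fuzzy correction, not decoder-level boosting.
    """
    canonical = {term.lower(): term for term in vocabulary}
    candidates = list(canonical)
    corrected = []
    for token in transcript.split():
        # get_close_matches returns the best candidates at or above the cutoff.
        match = difflib.get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(canonical[match[0]] if match else token)
    return " ".join(corrected)


if __name__ == "__main__":
    vocabulary = ["metoprolol", "atorvastatin"]  # sample domain terms
    print(correct_terms("patient takes metoprolal daily", vocabulary))
```

A high cutoff keeps ordinary words from being rewritten into jargon; the right value depends on your vocabulary and should be checked against labeled audio rather than guessed.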
Keep reading
Real-time speech translation architecture
When you need cross-language, not just transcription →
Blog: Multilingual translation in video calls
Our most-read piece on live multilingual communication →
Tool: Estimate your build
Instant ballpark on scope and cost →
Tell us about your audio, your accuracy bar, and your deadline. We'll come back with an engine recommendation and a realistic plan, usually within a day.