ASR & STT development · Custom-built since 2005

Speech to Text Software Development

We build custom speech-to-text systems on Whisper Large-v3, Deepgram Nova-3, AssemblyAI Universal, and NVIDIA Parakeet: real-time captions, searchable transcripts, and voice-agent input that hold 90%+ accuracy on your audio, deployed in your cloud or fully on-prem. First working build in 1–3 weeks, from $8K.

20+ yrs: Building real-time audio & video since 2005
250+: Products shipped
75+: Languages live in production (Translinguist)
50K+: Daily users on transcription search we built (V.A.L.T.)

Who we build for

Video & meeting platforms · Contact centers · Telemedicine & healthcare · Legal & compliance · Media & broadcast captioning · Voice AI agents · EdTech & accessibility

The build decision

Three ways to add speech-to-text — and where each one breaks

Most teams start with a managed API because it ships in an afternoon. That works until your audio is noisy, your vocabulary is niche, your data can't leave your network, or your per-minute bill outgrows a fixed-cost build. The point of this page isn't to tell you custom always wins — it's to show you exactly where each option stops being the right one, so you build the version that fits your audio and your constraints. Here's the honest trade-off across the three real options.

| | Managed API (Deepgram, AssemblyAI, Google) | Open-source DIY (self-hosted Whisper) | Fora custom build |
| --- | --- | --- | --- |
| Accuracy on your audio | Good on clean speech; drops on jargon & accents | Strong baseline, no tuning out of the box | Tuned to your audio: custom vocabulary & fine-tuning, the 5–8 points that decide "works" vs "ships" |
| Real-time latency | 150–300 ms streaming, you don't control it | Whatever your hardware delivers; needs engineering | Budgeted end-to-end: sub-500 ms first partial, tuned per use case |
| Data control / privacy | Audio leaves your network; BAA varies by vendor | Full control, you run it | Your cloud or fully on-prem; nothing leaves; HIPAA/GDPR by design |
| Custom vocabulary & tuning | Limited boosting; no real fine-tuning | Possible, but you build the pipeline | Domain vocab, fine-tuned acoustic + language models, PII redaction |
| Cost model | Per-minute forever; scales with usage | Infra + your engineering time | Fixed build + predictable infra; pays back at volume |
| Who owns it | The vendor; you rent | You, if you can staff it | You. We hand over source, models, and docs |

No single answer is right for everyone. We start most engagements by mapping your audio, accuracy bar, latency target, and privacy constraints — then recommend managed, hybrid, or fully custom. Sometimes the honest answer is "use Deepgram and call us when you outgrow it."

The pipeline

How a real-time speech-to-text system actually works

Every production STT system is the same five stages. The accuracy and the latency live in how each one is tuned. Here's the path a spoken word takes from a microphone to a finalized, searchable transcript — with the millisecond budget that decides whether captions feel live or laggy.

01 Capture (WebRTC / SIP, 16 kHz PCM, ~20 ms) → 02 VAD & segment (Silero VAD, endpointing, ~10 ms) → 03 ASR engine (Whisper / Deepgram / Parakeet, partials, 150–400 ms) → 04 Post-process (punctuation, diarize, vocab, PII redact, ~30 ms) → 05 Delivery (live captions + search index, stream)

Figure 1: Real-time speech-to-text pipeline — capture to delivery, with per-stage latency budget.

01

Capture

Audio comes off WebRTC (Opus), a phone line (SIP), or a file. We resample to 16 kHz mono PCM and chunk it into 20–40 ms frames for streaming.

~20 ms
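As a rough illustration of the chunking step, here is a minimal sketch, not our production code. It assumes the audio has already been resampled to 16 kHz mono and is stored as 16-bit samples; the function name `frame_pcm` and the 20 ms default are hypothetical choices for the example.

```python
SAMPLE_RATE_HZ = 16_000   # target rate named in the text
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM


def frame_pcm(pcm: bytes, frame_ms: int = 20) -> list[bytes]:
    """Split 16 kHz mono 16-bit PCM into fixed-size frames.

    A trailing partial frame is dropped here; a real streamer would
    buffer it until more audio arrives.
    """
    if not 20 <= frame_ms <= 40:
        raise ValueError("frame_ms should be in the 20-40 ms range")
    frame_bytes = SAMPLE_RATE_HZ * frame_ms // 1000 * BYTES_PER_SAMPLE
    usable = len(pcm) - len(pcm) % frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]
```

At 20 ms a frame is 320 samples (640 bytes), so one second of audio yields 50 frames.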
02

Voice activity & segmentation

Silero VAD or WebRTC VAD drops silence and detects endpoints, so the ASR only burns compute on speech and knows when an utterance ends.

~10 ms
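To show what endpointing means in practice, the sketch below uses a plain energy threshold as a stand-in. It is not Silero VAD or WebRTC VAD, which use far better detectors. The threshold, the hangover length, and both function names are assumptions made for the example.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def segment_utterances(frames, threshold=500.0, hangover_frames=10):
    """Group frames into utterances with a simple energy gate.

    A frame counts as speech when its RMS exceeds `threshold`. An utterance
    ends (the endpoint) after `hangover_frames` consecutive non-speech frames.
    Returns (start_frame, end_frame_exclusive) index pairs.
    """
    utterances = []
    start = None
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) > threshold:
            if start is None:
                start = i          # first speech frame opens an utterance
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover_frames:
                # Endpoint reached: close at the last speech frame.
                utterances.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended mid-utterance; trim any trailing silence.
        utterances.append((start, len(frames) - silent_run))
    return utterances
```

Only the frames inside the returned ranges would be sent to the ASR engine; the hangover keeps short pauses inside a sentence from splitting it.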
03

ASR engine

A streaming model (Whisper Large-v3, Deepgram Nova-3, or NVIDIA Parakeet) emits partial transcripts within a few hundred ms and finalizes on endpoint. This is where engine choice and your accuracy bar collide.

150–400 ms
04

Post-processing

Punctuation and casing, speaker diarization (pyannote — who said what), custom-vocabulary boosting for names and jargon, and PII redaction where compliance needs it.

~30 ms
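As one example of a post-processing step, here is a minimal sketch of pattern-based PII redaction. The patterns and labels are illustrative assumptions; they cover only a few US-style formats and assume the transcript already renders numbers as digits. Compliance-grade redaction typically adds locale-aware rules and an entity-recognition model.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each matched span with a bracketed label such as [PHONE]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running the redaction before the transcript is stored means the search index never contains the raw values.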
05

Delivery

Live captions stream back over WebSocket; the finalized transcript lands in storage with a word-level timestamp index so every spoken word is searchable later.

stream
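A word-level timestamp index can be as simple as an inverted index from word to start times. The sketch below assumes the engine returns `(word, start_seconds)` pairs; that shape and both function names are hypothetical, and a production system would more likely use a search engine than an in-memory dictionary.

```python
from collections import defaultdict


def build_word_index(words):
    """Map each lowercased word to the start times (seconds) where it was spoken.

    `words` is an iterable of (word, start_seconds) pairs.
    """
    index = defaultdict(list)
    for word, start in words:
        key = word.lower().strip(".,!?;:\"'")   # drop attached punctuation
        if key:
            index[key].append(start)
    return dict(index)


def find_word(index, query):
    """Return every timestamp at which `query` was spoken, or an empty list."""
    return index.get(query.lower(), [])
```

A player can then seek straight to any returned timestamp, which is what makes a recording "jumpable".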

A well-tuned streaming pipeline shows the first partial caption in under 500 ms and finalizes a sentence in under 1 second — fast enough that captions track a live speaker. For the deep audio-pipeline engineering (echo cancellation, jitter buffer, packet loss), see our real-time audio knowledge base.
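The per-stage figures from Figure 1 can be added up to check them against the 500 ms target. This is plain arithmetic on the numbers above; network delivery time is not included because the figure lists that stage only as "stream".

```python
# Per-stage budget in milliseconds as (best case, worst case), from Figure 1.
STAGE_BUDGET_MS = {
    "capture": (20, 20),
    "vad_segment": (10, 10),
    "asr_first_partial": (150, 400),
    "post_process": (30, 30),
}

FIRST_PARTIAL_TARGET_MS = 500


def total_budget_ms(budget):
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


best, worst = total_budget_ms(STAGE_BUDGET_MS)
headroom = FIRST_PARTIAL_TARGET_MS - worst
print(f"first partial: {best}-{worst} ms, headroom {headroom} ms")
```

The stages sum to 210–460 ms, which leaves about 40 ms of headroom in the worst case before the first partial misses the 500 ms mark.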

Engine selection

Whisper, Deepgram, AssemblyAI, or Parakeet — how we pick

There is no single "best" speech-to-text model — there's the right one for your audio, latency, languages, and deployment. We benchmark candidates on your real recordings before committing. The figures below are the published 2025–2026 baselines we start from; the numbers that matter are the ones we measure on your feed.

| Engine | WER (clean / real-world) | Streaming | Throughput (RTFx) | Deployment | Best for |
| --- | --- | --- | --- | --- | --- |
| Whisper Large-v3 (OpenAI, open) | ~2.8% / ~6.4–10.6% | via faster-whisper / WhisperX | 68.6 | Self-host, on-prem, on-device (WhisperKit) | Privacy, 90+ languages, full control |
| Deepgram Nova-3 | ~5.3–6.8% (production median) | Native, lowest latency | Very high (cloud) | Managed API | Real-time at scale, contact centers |
| AssemblyAI Universal-2 | ~2.1% / ~14.5% (hard mixed) | Native | High (cloud) | Managed API | Rich features (sentiment, topics, PII) |
| NVIDIA Parakeet CTC 1.1B | ~6.7% | Yes (CTC/TDT) | 2,793 (~40× Whisper L-v3) | Self-host on NVIDIA GPUs | Long-form throughput, cost per hour |

WER alone never decides it. A model that's one point more accurate but 40× slower loses for live captions; a model that's free but ships your audio to a vendor loses for a hospital. We weigh accuracy on your audio, streaming latency, language coverage, deployment, and cost-per-hour together — then run a head-to-head on a five-condition sample of your recordings: clean, noisy, accented, overlapping speakers, and domain jargon.

Baselines: Open ASR Leaderboard (Nov 2025), vendor documentation.
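RTFx is the inverse real-time factor: seconds of audio transcribed per second of compute. A small sketch shows what the table's two published figures mean for one hour of audio. The function name is ours, and actual throughput depends on hardware and batch size.

```python
# Published RTFx baselines from the table above.
RTFX = {
    "Whisper Large-v3": 68.6,
    "Parakeet CTC 1.1B": 2793.0,
}


def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute time needed to transcribe `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx


for engine, rtfx in RTFX.items():
    print(f"{engine}: {processing_seconds(3600, rtfx):.1f} s per hour of audio")

speedup = RTFX["Parakeet CTC 1.1B"] / RTFX["Whisper Large-v3"]
print(f"speedup: {speedup:.1f}x")
```

At these baselines an hour of audio takes roughly 52 seconds on Whisper Large-v3 and a little over 1 second on Parakeet, which is where the "~40×" in the table comes from.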

The gap between a vendor's headline WER and the number you'll actually get is the whole game. Benchmark scores come from clean, read-aloud audio like LibriSpeech; real product audio is people interrupting each other on a bad connection, using acronyms a model has never seen. That's why AssemblyAI can post ~2.1% on a clean set and ~14.5% on hard mixed audio — same model, very different reality.
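For reference, WER is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. Below is a minimal sketch of that calculation. It lowercases and splits on whitespace only; real benchmarks apply fuller text normalization first, so treat the normalization here as an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, if the reference is "the patient takes metformin twice daily" and the engine hears "met forming" for the drug name, that is one substitution plus one insertion against six reference words, a WER of about 33% on that sentence. A single piece of unfamiliar jargon is enough to do that.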

We measure on your conditions first, because a decision made on a leaderboard number is a decision you re-make in production. Cost works the same way: a managed API at a fraction of a cent per minute looks cheap until you're transcribing millions of minutes a month, at which point a self-hosted Parakeet or Whisper deployment on your own GPUs often costs less per hour and keeps the data in-house.
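The cost crossover is simple to estimate. In the sketch below every number is a made-up placeholder, not a vendor quote or a real infrastructure bill: we assume $0.005 per minute for a managed API and $3,000 per month for self-hosted GPU infrastructure, and we leave engineering time out of the comparison.

```python
def monthly_managed_cost(minutes: float, price_per_minute: float) -> float:
    """Managed API bill: scales linearly with transcribed minutes."""
    return minutes * price_per_minute


def breakeven_minutes(self_hosted_monthly_cost: float, price_per_minute: float) -> float:
    """Monthly volume above which fixed-cost self-hosting is cheaper."""
    return self_hosted_monthly_cost / price_per_minute


# Hypothetical inputs for illustration only.
PRICE_PER_MINUTE = 0.005       # assumed managed-API price, USD
SELF_HOSTED_MONTHLY = 3000.0   # assumed GPU + ops cost, USD

crossover = breakeven_minutes(SELF_HOSTED_MONTHLY, PRICE_PER_MINUTE)
print(f"break-even at about {crossover:,.0f} minutes per month")
```

With these placeholder inputs the lines cross at about 600,000 minutes a month. Below that the managed API is cheaper; above it the fixed-cost deployment is. Plugging in your own quotes moves the crossover but not the shape of the curve.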

What we build

Speech-to-text systems we've shipped

Accessibility

Live captioning

Real-time captions for events, broadcasts, and meetings that meet WCAG and ADA requirements. Translinguist runs live closed captioning in 22 languages on the STT layer we built.

Search

Transcription & word search

Searchable recordings where every spoken word is indexed and jumpable. We built V.A.L.T.'s spoken-word search on Amazon Transcribe — 770+ organizations, 50,000+ users, exportable as PDF reports.

Contact center

Speech analytics

Real-time agent assist, automated QA scoring, and call summaries. Streaming transcripts feed sentiment and topic models the moment a call starts, so supervisors see a flagged call while it's still live instead of reviewing it a day later.

Healthcare

Clinical documentation

Ambient scribe and dictation, HIPAA-compliant and on-prem if needed, with custom medical vocabulary so drug names and procedures transcribe correctly. The audio never leaves your network, which is the constraint most off-the-shelf APIs can't meet for clinical work.

Media

Subtitling & localization

Subtitle generation from raw footage with timecodes and speaker labels. Need the captions in another language too? That's real-time speech translation.

Voice AI

Voice-agent front-end

STT as the input layer for LLM voice agents, where partial transcripts and fast endpointing decide whether the conversation feels natural or stilted. Every 100 ms the ASR shaves off the front of the pipeline is 100 ms the agent answers sooner. vBoard pairs Whisper with GPT-4 to turn raw dictation into publish-ready text.

When custom wins

When a custom speech-to-text build pays off

A managed API is the right call when your audio is clean and in English and your data can leave your network. Custom wins the moment one of those stops being true, and it wins at any volume, from your first 100 hours of audio to 50 million minutes a month. The axes that decide it aren't cost and scale; they're how accurate the system has to be on your audio and how much control you need over it.

Axes: accuracy on your audio; data control & ownership. Regions: Managed API (rent, generic accuracy) · Open-source DIY (you run & staff it) · Fora custom build (tuned to your audio, you own it)

Figure 2: Build vs Buy — accuracy on your domain audio × data control and ownership. Custom wins the top-right at any volume.

Buy a managed API when
  • Your audio is clean, conversational English
  • Off-the-shelf accuracy clears your bar
  • Audio can leave your network (no on-prem requirement)
  • You're validating an idea and want it live this week

Build custom when
  • Domain jargon, accents, or noise tank off-the-shelf accuracy
  • Data can't leave your network (HIPAA, GDPR, or contractual)
  • Per-minute costs are outgrowing a fixed-cost build
  • You want to own the models, the source, and the roadmap

Right when: accuracy on your own audio matters more than time-to-first-demo, at any volume.

How we work

Three ways to start

From scratch

New build

You have audio and an accuracy target, no system yet. We map requirements, benchmark engines on your data, and ship a working real-time or batch pipeline.

Upgrades

Accuracy tuning

You have STT live but it's wrong too often, too slow, or too expensive. We tune vocabulary, swap or fine-tune the engine, and cut latency or per-minute cost.

Takeovers

Rescue & extend

You inherited a half-built or unmaintained transcription system. We stabilize it, document it, and extend it — the way we took over and rebuilt Rafiky's real-time pipeline.


Pricing

What a speech-to-text build costs

Fixed-scope starting points. Final scope depends on languages, real-time vs batch, deployment, and accuracy targets — run the calculator for an instant estimate.

Starter: from $8K, live in 1–3 weeks
  • Managed-API integration (Deepgram or AssemblyAI)
  • Real-time captions or batch transcription
  • One language, your app
Growth (most chosen): from $16K, 3–6 weeks
  • Custom-tuned hybrid — engine benchmarked on your audio
  • Custom vocabulary + speaker diarization
  • Word-level search, multi-language
Enterprise: from $32K, 6–8 weeks
  • On-prem or fine-tuned self-hosted Whisper / Parakeet
  • HIPAA/GDPR, PII redaction, SLA
  • Handover of models + source
Free for qualified projects

Three deliverables. Yours within a week.

An independent assessment of your streaming build, written by engineers who would actually ship it. Pick the one that fits where you are now: planning the MVP, mid-build, or stabilizing what's already in production. NDA before any code, footage, or system access changes hands.

No commitment.

Why Fora Soft

Why teams pick us for speech-to-text

20 years in real-time audio & video

Since 2005, the harder half of STT — capture, streaming, latency, the audio pipeline — has been our core. Speech-to-text sits right on top of it.

The STT stack, in production

We've shipped Whisper, Deepgram, Speechmatics, Google Cloud Speech-to-Text, and Amazon Transcribe in real products — not slideware. Translinguist, V.A.L.T., VocalViews, vBoard.

All in-house, all senior

No offshore handoffs. The team that benchmarks your audio is the team that ships and maintains it. We think like product owners, not ticket-takers.

250+ products, 100% track record

250+ products since 2005 and a 100% job-success score on Upwork. We finish what we start and hand it over clean.

FAQ

Speech-to-text development, answered

How accurate is custom speech-to-text?

On clean English, modern engines clear ~95% accuracy (around 5% word error rate). The real number depends on your audio — accents, background noise, overlapping speakers, and domain vocabulary all move it. Custom tuning (vocabulary boosting and fine-tuning) typically recovers 5–8 accuracy points on hard audio, which is usually the difference between a system that frustrates users and one that ships.

Whisper or a managed API like Deepgram: which should I use?

Use a managed API (Deepgram Nova-3, AssemblyAI) when your audio can leave your network and you want the lowest-latency streaming with no infrastructure to run. Use self-hosted Whisper Large-v3 or NVIDIA Parakeet when you need on-prem privacy, 90+ languages, or to control cost per hour at volume. We benchmark both on your audio before recommending one.

Can it run in real time?

Yes. A tuned streaming pipeline shows the first partial caption in under 500 ms and finalizes a sentence in under 1 second — fast enough to caption a live speaker or feed a voice agent. Real-time is harder than batch; it's also exactly what we've done for 20 years.

How many languages can it handle?

Whisper covers 90+ languages out of the box; Deepgram and Google cover 30–100+ depending on the model. We've shipped transcription in 30+ languages (VocalViews) and captioning in 22 (Translinguist). We confirm coverage and accuracy per language on your audio.

Can it run on-prem or privately?

Yes — self-hosted Whisper or Parakeet runs entirely inside your network or VPC, so audio never leaves. This is how we build for HIPAA, GDPR, and contractual data-residency requirements, with PII redaction in the pipeline where needed.

How do you improve accuracy on our jargon and names?

Two levers: custom-vocabulary boosting (drug names, product names, acronyms) and fine-tuning the acoustic and language models on a sample of your labeled audio. Together they recover the 5–8 points that off-the-shelf models leave on the table for specialized domains.

How much does a build cost?

Starter integrations begin at $8K, custom-tuned hybrid builds around $16K, and on-prem or fine-tuned enterprise systems from $32K. Final scope depends on languages, real-time vs batch, and deployment — the calculator gives an instant estimate.

How long does it take?

A managed-API integration ships in 1–3 weeks; a custom-tuned hybrid in 3–6; an on-prem fine-tuned system in 6–8. You see a working build early and iterate from there.

Can you add diarization and search?

Yes. Speaker diarization (who said what) uses pyannote; word-level timestamps let us build search that jumps to the exact moment a word was spoken — the way V.A.L.T.'s spoken-word search works across thousands of recordings.

What's the difference between speech-to-text and speech translation?

Speech-to-text transcribes audio into text in the same language. Speech translation adds machine translation and text-to-speech to deliver it in another language. If you need cross-language captions or interpretation, see our real-time speech translation work.

Have an idea?

Let's scope your speech-to-text build.

Tell us about your audio, your accuracy bar, and your deadline. We'll come back with an engine recommendation and a realistic plan — usually within a day.

Specialist software house for video, real-time and AI products. Founded 2005. 50 in-house engineers.

+1 (914) 775-5855
New York · USA
© Fora Soft, 2005–2026