
Key takeaways
• A video conferencing platform is six engineering decisions, not a feature list. Transport (SFU vs MCU), build vs buy, codec stack, recording and composition, scale architecture, and compliance. Get those right and the rest follows.
• The 2026 default is a hybrid SFU + WebRTC stack. LiveKit, Daily, Twilio, Vonage and Agora cover most use cases; mediasoup, Janus and Jitsi cover the self-host shortlist. Below 150 k participant-minutes / month, managed wins; above it, custom starts paying back.
• Latency targets in 2026 are tighter than most teams plan. Sub-300 ms one-way is the bar for symmetric conferencing; sub-500 ms for telehealth and education; sub-200 ms for AI-agent products that need turn-taking parity with humans.
• Recording, composition and AI features dominate the cost line. Plain calls are cheap; recordings, server-side composition, transcription and AI summarisation triple the per-minute bill. Plan them as their own workstream.
• Fora Soft has shipped video conferencing platforms since 2010. Telehealth, education, sales, courtroom recording and dating products on Twilio, Vonage, Agora, LiveKit, mediasoup and Janus. Book a 30-min call.
Why Fora Soft wrote this video conference development guide
Fora Soft has shipped real-time video products since 2005. We have built telehealth (CirrusMED, MyOnCallDoc, Cloud Doctors), education (BrainCert, Instaclass), sales (Meetric), courtroom recording (V.A.L.T) and live shopping (Sprii).
This guide is the conversation we have with founders and CTOs scoping a video conferencing platform. It is opinionated, vendor-neutral, and grounded in the production code we ship every week against Twilio, Vonage, Agora, LiveKit, Daily, AWS Chime SDK, mediasoup and Janus.
We use Agent Engineering internally, which is why our delivery on a video-conferencing build is typically 30–50 % faster than agencies still doing this by hand. Visit our video conference services to see the projects.
Scoping a video conferencing platform?
We will run the six engineering decisions below against your real workload and tell you which transport, codec and stack to pick within 5 working days.
The six engineering decisions that decide a video conferencing platform
Every conferencing product we have shipped reduces to the same six choices. Get them right early; everything else iterates cheaply.
1. Transport architecture. SFU (selective forwarding), MCU (mixing) or P2P. SFU is the 2026 default for >3 participants; P2P only survives for 1:1; MCU lives mostly in PSTN bridges and broadcast.
2. Build vs buy. Twilio, Vonage, Daily, LiveKit Cloud, Agora, Chime SDK ship a managed SFU; mediasoup, Janus, Jitsi and self-hosted LiveKit are open-source. Below ~150 k participant-minutes / month managed wins; above, custom can pay back.
3. Codec and bitrate stack. Opus 48 kHz mono is the universal audio default. Video: H.264 for compatibility, VP9 for quality, AV1 for bandwidth efficiency on capable endpoints. Simulcast or SVC is mandatory for multi-party.
4. Recording and composition. Per-track recording (one file per participant) versus composed mix (single file with layout). Composition roughly doubles the per-minute cost; only do it when the product needs it.
5. Scale architecture. Single-region SFU (cheapest, simplest), multi-region with cascading (mid scale), full mesh of relays (10k+ rooms in parallel). Pick the simplest model that fits your peak.
6. Compliance and data residency. HIPAA, GDPR, SOC 2, sovereign cloud. Closed APIs offer BAAs and DPAs; self-hosted gives you full control, but the certification work falls to you. Decide early, before architecture.
SFU is the 2026 default. Modern WebRTC SFUs scale to 1,000+ participants per room and tens of thousands of concurrent rooms per cluster. MCU only survives in legacy PSTN bridges.
SFU vs MCU vs P2P — the architecture decision
| Topology | Latency | Cost shape | Best fit |
|---|---|---|---|
| P2P (mesh) | Lowest, <100 ms | Free server, expensive client | 1:1 only |
| SFU (forwarding) | Sub-300 ms one-way | Per-track bandwidth on server | 3–1,000 participants |
| MCU (mixing) | 300–500 ms one-way | Heavy CPU on server | PSTN bridges, broadcast |
| SFU + MoQ fan-out | Sub-500 ms to viewers | CDN egress on long tail | Hybrid broadcast (live shopping, sports) |
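The cost-shape column follows from simple stream counting. Here is a minimal sketch, assuming every participant publishes one video stream at a flat 1,500 kbps (an illustrative figure, not a measurement) and ignoring simulcast overhead; the function names are ours, not from any SDK.

```python
# Illustrative assumption: one HD stream costs about 1,500 kbps.
ASSUMED_STREAM_KBPS = 1_500


def per_client_streams(participants: int, topology: str) -> tuple[int, int]:
    """Return (streams sent, streams received) by one client in a room."""
    if participants < 2:
        return (0, 0)
    others = participants - 1
    if topology == "p2p":
        # Full mesh: a separate copy goes to every other peer.
        return (others, others)
    if topology == "sfu":
        # One upload; the server forwards each other participant's stream.
        return (1, others)
    if topology == "mcu":
        # One upload, one mixed stream back; the server pays in CPU instead.
        return (1, 1)
    raise ValueError(f"unknown topology: {topology}")


def uplink_kbps(participants: int, topology: str) -> int:
    """Client upload bandwidth under the flat-bitrate assumption above."""
    sent, _ = per_client_streams(participants, topology)
    return sent * ASSUMED_STREAM_KBPS


for topology in ("p2p", "sfu", "mcu"):
    print(topology, per_client_streams(6, topology), uplink_kbps(6, topology))
```

With six participants, a mesh client uploads five streams (7,500 kbps under this assumption) while an SFU client uploads one, which is why P2P stops being practical beyond 1:1.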
Build vs buy — managed SFU, hybrid, self-hosted
The three viable patterns in 2026:
Managed SFU. Twilio, Vonage, Daily, LiveKit Cloud, Agora, Amazon Chime SDK. Per-participant-minute pricing runs from roughly $0.0005 to $0.0041, depending on vendor. Fastest to ship; thinnest margin at scale.
Hybrid. Managed SFU plus a thin internal abstraction so you can swap providers or self-host individual capabilities. The right default until volume crosses ~150 k PM / mo.
Self-hosted. mediasoup (Node SFU), Janus (C-based gateway), Jitsi Videobridge (Java SFU), LiveKit OSS (Go SFU). Apache 2.0 / ISC / GPL licences; pay infra and ops only. Our Agora.io alternative goes deep on the trade-offs.
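The thin internal abstraction in the hybrid pattern can be sketched as follows. Everything here is hypothetical: `VideoProvider`, `InMemoryProvider` and `start_consult` are illustrative names, and the in-memory adapter stands in for a real vendor SDK rather than reproducing any vendor's API.

```python
import secrets
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Room:
    room_id: str
    provider: str


class VideoProvider(Protocol):
    """The only surface application code is allowed to depend on."""

    name: str

    def create_room(self, room_id: str) -> Room: ...

    def issue_token(self, room: Room, user_id: str) -> str: ...


class InMemoryProvider:
    """Stand-in adapter; a real one would wrap a managed or self-hosted SFU."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rooms: dict[str, Room] = {}

    def create_room(self, room_id: str) -> Room:
        room = Room(room_id=room_id, provider=self.name)
        self.rooms[room_id] = room
        return room

    def issue_token(self, room: Room, user_id: str) -> str:
        if room.room_id not in self.rooms:
            raise KeyError(f"unknown room: {room.room_id}")
        # Placeholder token format, not a real vendor JWT.
        return f"{self.name}:{room.room_id}:{user_id}:{secrets.token_hex(4)}"


def start_consult(provider: VideoProvider, room_id: str, user_ids: list[str]) -> dict[str, str]:
    """Application code: creates a room and hands each user a join token."""
    room = provider.create_room(room_id)
    return {user_id: provider.issue_token(room, user_id) for user_id in user_ids}


tokens = start_consult(InMemoryProvider("managed"), "consult-1", ["doctor", "patient"])
print(sorted(tokens))
```

Swapping providers then means writing one new adapter and changing the object passed to `start_consult`, not touching the call sites.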
Codec stack — Opus, H.264, VP9, AV1 in 2026
Audio: Opus 48 kHz mono. Universal default. WebRTC mandates support. In-band FEC recovers gracefully from packet loss, and DTX cuts bitrate during silence.
Video baseline: H.264. Hardware-decoded almost everywhere; the safest choice for compatibility with Safari and most embedded clients.
Video quality: VP9. Better quality-to-bitrate than H.264; ubiquitous in Chrome, Firefox, Edge. Always pair with simulcast layers.
Video efficiency: AV1. 30–50 % bandwidth saving on capable endpoints; encoder cost is now manageable on M-class Apple Silicon and modern Chromebooks. Default for high-bitrate use cases (sports, live shopping) on capable clients.
Simulcast / SVC. Mandatory above 3 participants: the SFU forwards different layers to different recipients based on their bandwidth. Simulcast sends several independently encoded streams; SVC (single stream, multiple layers) is the modern alternative, and Google Meet has been on SVC for years.
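Per-recipient layer selection can be sketched like this. It is a simplified illustration, not how any particular SFU implements it: the three layers, their bitrates and the 80 % headroom factor are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    rid: str      # layer label, illustrative naming
    height: int   # vertical resolution in pixels
    kbps: int     # assumed target bitrate


# Assumed simulcast ladder; real ladders depend on codec and content.
LAYERS = [
    Layer("q", 180, 150),
    Layer("h", 360, 500),
    Layer("f", 720, 1_500),
]


def pick_layer(available_kbps: float, layers: list[Layer] = LAYERS, headroom: float = 0.8) -> Layer:
    """Pick the highest-bitrate layer that fits the recipient's bandwidth.

    Only `headroom` of the estimate is used, leaving room for audio and
    estimation error. Falls back to the lowest layer when nothing fits.
    """
    budget = available_kbps * headroom
    ordered = sorted(layers, key=lambda layer: layer.kbps)
    chosen = ordered[0]
    for layer in ordered:
        if layer.kbps <= budget:
            chosen = layer
    return chosen


for estimate in (2_500, 800, 100):
    print(estimate, pick_layer(estimate).rid)
```

A recipient on fibre gets the full layer while a recipient on a weak cellular link gets the lowest one, without the sender re-encoding anything.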
Need help picking SFU vendor or codec stack?
We benchmark Twilio, Vonage, LiveKit, Daily, Agora and self-hosted mediasoup against your real audience profile in 2–4 weeks.
Recording, composition and storage architecture
Recording is where most teams under-budget. Three pieces to plan:
Per-track recording. One file per participant track (audio MKA, video MKV). Cheapest to record, most flexible to re-render later. Default unless your product surfaces a single composed file.
Composition. Server-side rendering of multiple tracks into a single MP4 with a layout (gallery, speaker, custom). Doubles per-minute cost; only do it when needed.
Storage and retention. External S3 from day one (every major SDK supports this). Set lifecycle rules to move old recordings to cheaper storage classes such as S3 Glacier, or archive to a cheaper provider such as Backblaze B2. 90-day vs 365-day retention can double your storage bill.
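The effect of retention and lifecycle rules on the bill is easy to model. The sketch below assumes a constant recording volume per day and two storage tiers; the prices (0.023 and 0.004 USD per GB-month) and the 100 GB/day volume are placeholders for illustration, so substitute your provider's current rates.

```python
def steady_state_monthly_cost(
    gb_per_day: float,
    retention_days: int,
    hot_days: int,
    hot_price: float,
    cold_price: float,
) -> float:
    """Monthly storage cost once the retention window is full.

    Recordings stay in the hot tier for `hot_days`, then move to the cold
    tier until `retention_days`. Prices are USD per GB-month.
    """
    days_hot = min(retention_days, hot_days)
    days_cold = max(retention_days - hot_days, 0)
    return gb_per_day * (days_hot * hot_price + days_cold * cold_price)


# Placeholder inputs, not vendor quotes.
HOT, COLD, VOLUME = 0.023, 0.004, 100.0

tiered_90 = steady_state_monthly_cost(VOLUME, 90, 30, HOT, COLD)
tiered_365 = steady_state_monthly_cost(VOLUME, 365, 30, HOT, COLD)
flat_90 = steady_state_monthly_cost(VOLUME, 90, 90, HOT, COLD)
flat_365 = steady_state_monthly_cost(VOLUME, 365, 365, HOT, COLD)

print(round(tiered_90, 2), round(tiered_365, 2), round(tiered_365 / tiered_90, 2))
print(round(flat_90, 2), round(flat_365, 2), round(flat_365 / flat_90, 2))
```

Under these assumptions, moving from 90 to 365 days roughly doubles the bill when old recordings are tiered after 30 days, and roughly quadruples it when everything stays in the hot tier.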
Scale architecture — single region, multi-region, mesh
Single region. One SFU cluster in one region. Up to ~10k concurrent participants on commodity hardware. Cheapest, simplest, latency-bound by region distance.
Multi-region with cascading. Each region serves its local participants; SFUs cascade to mirror tracks across regions for cross-region rooms. Default architecture above 25k concurrent participants.
Mesh of relays. Tens of regional SFUs plus an L7 load balancer; participants picked up by the nearest relay. Closest pattern to what Twitch and Zoom run; necessary for 100k+ concurrent.
Compliance — HIPAA, GDPR, sovereign cloud
Compliance is the feature that most narrows the platform shortlist. Closed-API positions:
HIPAA BAAs available. Twilio, Vonage, Daily, LiveKit (Scale tier), Agora, AWS Chime SDK, Zoom Video SDK (limited).
GDPR / EU residency. All major vendors offer EU regions; check sub-processor lists and international transfer mechanisms (SCCs, EU-US Data Privacy Framework).
SOC 2 Type II. Standard across all major vendors; LiveKit, Daily, Twilio, Vonage, Agora.
Sovereign cloud / on-prem. Self-hosting is the only path: mediasoup, Janus, Jitsi or self-hosted LiveKit on AWS GovCloud, Azure Government, Bharat / regional clouds.
Decide compliance before architecture. Sovereign cloud requirements rule out every closed API; HIPAA narrows the list to BAA-backed vendors; SOC 2 is table stakes.
Cost model — per-participant-minute economics in 2026
| Vendor / stack | Per-min (HD) | Notes |
|---|---|---|
| Twilio Video | $0.0040 | Baseline market rate |
| Vonage Video API | $0.0041 | Closest API parity to Twilio |
| Daily.co | $0.0040 | 10k free min/mo |
| LiveKit Cloud | ~$0.0005 | Cheapest managed; self-host free |
| AWS Chime SDK | $0.0017 | Best for AWS-native stacks |
| Agora HD video | $0.0040 | Best for global broadcast |
| Self-hosted (Hetzner) | infra-only | ~$400–700/mo flat for 200 concurrent |
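To turn the table into a monthly figure, multiply out the rates. The sketch below uses the per-minute rates from the table; the helper names are ours, and modelling composition as a flat doubling of the rate for composed minutes is a simplification of the "composition doubles the cost" rule of thumb. Self-hosted ops staffing is deliberately left out, so the real breakeven sits higher than the number shown.

```python
# Per-participant-minute rates from the table above (USD, HD video).
RATES = {
    "Twilio Video": 0.0040,
    "Vonage Video API": 0.0041,
    "Daily.co": 0.0040,
    "LiveKit Cloud": 0.0005,
    "AWS Chime SDK": 0.0017,
    "Agora HD video": 0.0040,
}


def monthly_bill(
    rate: float,
    participant_minutes: int,
    free_minutes: int = 0,
    composed_share: float = 0.0,
) -> float:
    """Monthly managed-SFU bill in USD.

    composed_share is the fraction of minutes that also use server-side
    composition, modelled here as paying the per-minute rate twice.
    """
    billable = max(participant_minutes - free_minutes, 0)
    return billable * rate * (1.0 + composed_share)


def breakeven_minutes(rate: float, flat_monthly_cost: float) -> float:
    """Volume at which a flat self-hosted bill equals the managed bill."""
    return flat_monthly_cost / rate


for vendor, rate in RATES.items():
    print(f"{vendor}: ${monthly_bill(rate, 150_000):,.2f}")

print(round(breakeven_minutes(0.0040, 600)))
```

At $0.0040 per minute, 150 k participant-minutes costs about $600 a month, which lands inside the $400–700 flat range quoted for the Hetzner setup. That is where the 150 k threshold comes from, before ops time is counted.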
AI features that move retention in 2026
Real-time transcription and captions. Whisper Large v3 or Deepgram Nova-3, <500 ms latency. Required for accessibility (WCAG 2.2 AA, EAA in EU).
Post-call summaries and action items. LLM over the transcript. Default sales feature in 2026; our video AI agents guide covers the architecture.
Real-time translation. Sub-2 s loop end to end. Cascade or SeamlessM4T. Our translator integration guide goes deep.
Background noise suppression. Krisp, NVIDIA RTX Voice, RNNoise. Open-source good enough; SaaS adds polish.
Background blur and replacement. MediaPipe Selfie Segmentation runs on-device; on-server alternatives exist for low-end clients.
Five use cases that drive video conferencing platform builds
1. Telehealth. 1:1 or small-group consults with HIPAA, recordings, transcription and clinician summaries. CirrusMED, MyOnCallDoc and Cloud Doctors are typical.
2. Online education. One-to-many lectures, breakout rooms, screen-share, recording, captions. BrainCert, Instaclass and dozens of EdTech products fall here.
3. Sales enablement. Real-time call summaries, action items, CRM write-back, AI agents. Meetric is a clean reference.
4. Court / legal recording. Long retention, secure storage, redaction, transcription. V.A.L.T in the Kazakhstan courtroom is the canonical case.
5. Live commerce and dating. Sub-second latency, large rooms, AI hosts, monetisation overlays. Sprii and Mindwibe sit here.
Pick your stack by use case — telehealth and legal demand HIPAA / sovereign cloud; education needs cheap recording at scale; live commerce needs sub-second latency and CDN fan-out.
Mini case — HIPAA telehealth platform on Vonage Video
Situation. CirrusMED needed a HIPAA-eligible telehealth video platform with consult recordings, real-time transcription, and clinician-side notes generation across iOS, Android and web.
Plan. Vonage Video API for the SFU (BAA), Whisper Large v3 self-hosted on EU AWS for transcription, Llama 3.3 70B on vLLM for summarisation, S3-Encrypted for recording storage with 7-year retention. Eval set graded by clinician partners.
Outcome. P95 one-way latency 280 ms, recording reliability 99.97 %, post-call summary “publish-ready” rate 88 %, full HIPAA audit clean. Want a similar deployment? Book a scoping call.
A decision framework — pick your platform in five questions
Q1. Already deep on AWS? Yes → Chime SDK is your default ($0.0017/min, IAM / S3 / KMS integration).
Q2. HIPAA / sovereign cloud / on-prem? Yes → self-hosted LiveKit, mediasoup or Janus in your VPC.
Q3. Need closest API parity to Twilio Video for an existing migration? Yes → Vonage Video API.
Q4. Adding AI agents (transcription, translation, summaries)? Yes → LiveKit (Cloud or self-hosted) plus the AI stack from our video AI agents guide.
Q5. Above 150k participant-min/mo and have ops capacity? Yes → self-host with LiveKit, mediasoup or Janus to claw back margin.
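The five questions can be encoded as a small checklist function. This is a sketch only: the `Workload` fields are hypothetical names, and returning every matching recommendation in question order (rather than stopping at the first yes) is our assumption, since the questions do not state a priority.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    aws_native: bool = False
    needs_hipaa_sovereign_or_onprem: bool = False
    migrating_from_twilio: bool = False
    needs_ai_agents: bool = False
    participant_minutes_per_month: int = 0
    has_ops_capacity: bool = False


def recommend(w: Workload) -> list[str]:
    """Return the recommendations whose question is answered 'yes'."""
    picks: list[str] = []
    if w.aws_native:  # Q1
        picks.append("AWS Chime SDK")
    if w.needs_hipaa_sovereign_or_onprem:  # Q2
        picks.append("Self-hosted LiveKit, mediasoup or Janus in your VPC")
    if w.migrating_from_twilio:  # Q3
        picks.append("Vonage Video API")
    if w.needs_ai_agents:  # Q4
        picks.append("LiveKit (Cloud or self-hosted) plus an AI stack")
    if w.participant_minutes_per_month > 150_000 and w.has_ops_capacity:  # Q5
        picks.append("Self-host (mediasoup or Janus) to claw back margin")
    return picks  # empty list: none of the five questions applied


print(recommend(Workload(aws_native=True, needs_ai_agents=True)))
```

When several answers come back, the conflict itself is useful: it shows which constraints (for example AWS-native versus on-prem) need to be resolved before picking a stack.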
Five pitfalls that derail video conferencing builds
1. Optimising for the demo, not the bad network. Your demo runs on fibre; 30 % of users join from cellular with 2–5 % packet loss. Tune simulcast layers and Opus FEC for the real audience.
2. Ignoring composition cost. Server-side composition doubles the per-minute bill. Only enable it where the product needs the composed file.
3. Skipping the abstraction layer. Hard-coding Twilio or Daily into your application code makes a future swap a multi-month rewrite. Wrap the SFU early.
4. Forgetting recording storage. 90 days of recordings on AWS S3 standard for a busy telehealth product is six figures a year. Lifecycle rules to Glacier are not optional.
5. Compliance late in procurement. Teams build first, then discover the BAA does not cover the analytics provider. Lock compliance down on day one.
KPIs to track once you ship
Quality KPIs. P50 / P95 one-way latency, video MOS, audio MOS, freeze ratio, simulcast switch rate, dominant-speaker handoff time.
Business KPIs. Cost per participant-minute, completion rate, no-show rate, retention among users who participate in >3 calls.
Reliability KPIs. Successful join rate (target >99 %), reconnect success rate, recording start success, webhook delivery completeness.
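Two of these KPIs, latency percentiles and join rate, can be computed from raw samples as sketched below. The nearest-rank percentile method and the sample values are illustrative choices; in production these numbers usually come from your metrics backend rather than hand-rolled code.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct % at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def join_rate(successful_joins: int, join_attempts: int) -> float:
    """Share of join attempts that ended in a connected session."""
    if join_attempts == 0:
        raise ValueError("no join attempts")
    return successful_joins / join_attempts


# Made-up one-way latency samples in milliseconds.
latency_ms = [120, 140, 150, 160, 180, 200, 220, 250, 280, 400]

print("P50", percentile(latency_ms, 50))
print("P95", percentile(latency_ms, 95))
print("join rate ok:", join_rate(9_930, 10_000) > 0.99)
```

Tracking P95 alongside P50 matters because the median can look healthy while the tail (here a single 400 ms sample) is where users on bad networks actually sit.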
When you should not build a custom video conferencing platform
Skip the custom build if (a) a generic Zoom / Google Meet / Teams embed is enough; (b) your product is one feature inside a larger app and conferencing is plumbing, not the product; (c) your monthly volume is below 5,000 participant-minutes and per-call quality is not differentiating.
Conversely, do build when the conferencing UX is part of the product, when compliance forces it, or when AI features (translation, summaries, agents) are the differentiation.
Ready to scope a video conferencing platform?
A 30-minute call, an architecture and unit-economics plan within 5 working days, and a fixed-scope build quote.
If you remember nothing else: SFU + WebRTC, simulcast or SVC mandatory above 3 participants, recordings to S3 day one, compliance day zero, AI features as separate workstreams.
The 2026 tooling ecosystem at a glance
Managed SFUs. Twilio, Vonage, Daily, LiveKit Cloud, Agora, AWS Chime SDK, Zoom Video SDK.
Open-source SFUs. mediasoup, Janus, Jitsi Videobridge, LiveKit OSS, Pion (Go).
Codecs. Opus (audio), H.264, VP8, VP9, AV1 (video). Simulcast / SVC libraries built in.
Real-time AI. Whisper Large v3 (ASR), Deepgram Nova-3, AssemblyAI, ElevenLabs / Cartesia (TTS), DeepL / NLLB / SeamlessM4T (translation).
Observability. LangSmith / Langfuse for AI; OpenTelemetry + Grafana for transport metrics; Sentry for client-side errors.
Frequently asked questions
SFU or MCU for video conferencing?
SFU. Modern WebRTC SFUs scale to 1,000+ participants per room, support simulcast / SVC and avoid the CPU cost of mixing on the server. MCU only survives in legacy PSTN bridges.
Build vs buy on the SFU?
Buy below 150 k participant-minutes / month. Build above, especially when sovereignty or HIPAA forces it. The hybrid pattern — managed SFU behind a thin abstraction — is the right default for most teams in between.
Which codec should I use?
Opus for audio. H.264 as a fallback for compatibility. VP9 for default quality. AV1 for high-bitrate use cases on capable endpoints. Always with simulcast or SVC above 3 participants.
Can I make a video conferencing platform HIPAA compliant?
Yes. Twilio, Vonage, Daily, LiveKit (Scale), Agora and AWS Chime SDK all offer BAAs. Self-hosted LiveKit, mediasoup or Janus in a HIPAA-eligible cloud account is the alternative for full data control.
How much does a custom build cost?
A managed-SFU MVP is 4–8 weeks of work. A self-hosted custom platform with recording, composition, AI features and full compliance is 4–6 months. Infrastructure for a Hetzner-based self-hosted SFU runs roughly $400–700/month flat for 200 concurrent participants.
What latency should I target?
Sub-300 ms one-way for symmetric conferencing. Sub-500 ms for telehealth and education where latency tolerance is higher. Sub-200 ms for AI-agent products that need turn-taking parity with humans.
Should I add AI summaries from day one?
Yes if your product audience expects them (sales, education). Treat them as a separate workstream from the SFU build — ASR + LLM behind your own service, swappable across vendors. Our video AI agents guide covers the architecture.
Does Fora Soft build video conferencing platforms?
Yes. We have shipped video conferencing platforms across CirrusMED, BrainCert, Meetric, V.A.L.T and 50+ other live products. Book a call.
What to read next
Vendor comparison
Agora.io alternative in 2026: custom WebRTC with LiveKit, mediasoup & Janus
The full vendor and self-hosted shortlist for video conferencing.
Cost analysis
LiveKit vs Agora: a 2026 cost analysis with real workload numbers
Granular per-minute math when LiveKit and Agora are both on the shortlist.
AI agents
Video AI agents in 2026: architecture, latency budget, cost
When the AI layer is part of the conferencing build from day one.
Translation
Video call with translator: WebRTC integration guide 2026
Adding real-time translation to a conferencing platform.
Ready to ship a video conferencing platform?
A video conferencing platform in 2026 is six engineering decisions: SFU transport, build vs buy, codec stack, recording / composition, scale architecture, compliance. Get those right and the rest iterates cheaply.
Start on a managed SFU with a clean abstraction, plan recording and storage from day one, lock compliance before architecture, and treat AI features (transcription, translation, summaries, agents) as separate workstreams. Our video conference engineering team ships exactly this loop.

