Learning course · Updated June 2026

Audio for video, end to end: codecs, loudness, sync, WebRTC

How audio actually works inside a video product — sample rate and loudness, the AAC and Opus codecs, LUFS targets per platform, the WebRTC audio pipeline, lip-sync, and Dolby Atmos. A practical course from Fora Soft engineers, from the microphone to the viewer’s ears.

Every chapter starts with a question and ends with a production decision. Specs cited by document number — ITU-R, EBU, RFC. No marketing slides.

6 chapters · 68 articles · 100+ glossary terms · ~23 hrs total reading

Outcomes

What you'll be able to ship.

Six blocks that take you from the sound wave to the viewer’s ears. By the end, you can choose codecs, hit loudness compliance, build and debug the real-time audio pipeline, keep audio locked to the picture, and deliver immersive sound — for live, VOD, and conferencing.

01

Choose the right audio codec for any use case

AAC for streaming reach, Opus for real-time, AC-4 and MPEG-H for immersive broadcast, LC3 for Bluetooth. Know which codec wins, and why.

02

Hit loudness compliance on every platform

LUFS targets, true peak, and dialnorm to EBU R128, ITU-R BS.1770, and ATSC A/85 — so audio passes Spotify, YouTube, Netflix, and broadcast checks.

03

Build the WebRTC audio pipeline end to end

Acoustic echo cancellation, noise suppression, AGC, the NetEQ jitter buffer, PLC, and FEC — a real-time path that stays clear under packet loss.

04

Keep audio and video in sync

PTS/DTS, PCR, RTP/RTCP, and the lip-sync tolerance window (ITU-R BT.1359). Diagnose drift and correct it across WebRTC, HLS, and DASH.

05

Deliver immersive audio

Dolby Atmos and MPEG-H from master to stream, plus ambisonics, HRTF, and binaural rendering for VR, AR, and conferencing.

06

Measure audio quality objectively

PESQ, POLQA, ViSQOL, and subjective MUSHRA/MOS testing — so you can prove a codec or pipeline change improved quality, not just claim it.

Ship production-grade audio in your video product

Talk to the engineers who built it. Fora Soft helps teams choose audio codecs, hit loudness compliance, build and debug the WebRTC audio pipeline, fix lip-sync, and deliver immersive sound for telemedicine, conferencing, e-learning, live, and OTT.

Reference

The vocabulary of audio for video

100+ terms with crisp definitions, aliases, and links to deep dives. From LUFS and Opus to NetEQ and Dolby Atmos — the full A–Z is one click away.

LUFS

Loudness Units relative to Full Scale (ITU-R BS.1770). The standard unit for perceived loudness; every streaming platform normalizes to a LUFS target.

Opus

The open, royalty-free codec (RFC 6716) that dominates WebRTC. Switches between SILK (speech) and CELT (music) and scales from 6 to 510 kbps.

AAC

Advanced Audio Coding — the default codec for MP4, HLS, and DASH playback, and the standard on Apple devices (AAC-LC, HE-AAC, xHE-AAC).

AEC

Acoustic Echo Cancellation — the WebRTC stage that removes far-end echo from a microphone signal (WebRTC AEC3).

Lip-sync

Audio-to-video timing alignment. ITU-R BT.1359 defines the tolerance window before viewers notice the drift.

Dolby Atmos

Object-based immersive audio that places sounds in 3D space, used across film, streaming, and music.

Written and maintained by

The author.

Nikolay Sapunov

CEO at Fora Soft

Leads a software studio specialising in video- and audio-centric products: streaming platforms, WebRTC apps, video conferencing, and AI-driven video and audio tools. Writes this course so product and engineering teams can reason clearly about the codecs, protocols, audio pipelines, and advanced audio features behind modern video and audio software.

FAQ

Frequently asked questions.

What is audio for video?

Audio for video is the engineering discipline of capturing, encoding, delivering, and synchronizing sound alongside a video stream. It spans digital fundamentals (sample rate, bit depth, loudness), the codecs that compress sound (AAC, Opus, AC-4, MPEG-H), loudness normalization for streaming, the real-time WebRTC pipeline, lip-sync, and immersive formats like Dolby Atmos. Unlike music production, it is judged by intelligibility, loudness compliance, and staying in sync with the picture.

AAC vs Opus — which audio codec should I use?

Use AAC for on-demand and broadcast streaming: it is the default in MP4, HLS, and DASH, and it is universally supported on Apple devices and smart TVs. Use Opus for real-time and interactive audio: it is the de-facto WebRTC codec (RFC 6716), royalty-free, and scales from 6 to 510 kbps while switching between speech (SILK) and music (CELT) modes. Many products ship both — AAC for playback, Opus for live.

What LUFS should I target for streaming platforms?

Most platforms normalize to a fixed integrated-loudness target measured in LUFS (ITU-R BS.1770). Common 2026 targets: Spotify and YouTube around −14 LUFS, Apple Music around −16 LUFS, podcasts around −16 to −19 LUFS, and broadcast (ATSC A/85 / EBU R128) at −24 LKFS / −23 LUFS. Keep true peak at or below −1 dBTP to avoid clipping after encoding. Master to the platform's target, not to one universal number.
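As a rough illustration of mastering to a platform target, the sketch below computes the gain to apply to a finished mix. It assumes integrated loudness and true peak have already been measured with a BS.1770-compliant meter. The target table simply copies the figures quoted above, and the names `PLATFORM_TARGETS_LUFS` and `normalization_gain_db` are hypothetical, not part of any library.

```python
# Illustrative targets copied from the answer above; platforms revise them over time.
PLATFORM_TARGETS_LUFS = {
    "spotify": -14.0,
    "youtube": -14.0,
    "apple_music": -16.0,
    "ebu_r128": -23.0,   # broadcast, LUFS
    "atsc_a85": -24.0,   # broadcast, LKFS
}

TRUE_PEAK_CEILING_DBTP = -1.0  # ceiling recommended in the answer above


def normalization_gain_db(measured_lufs, measured_true_peak_dbtp, platform):
    """Return the static gain (dB) that moves a mix toward the platform target.

    A plain gain change shifts integrated loudness and true peak by the same
    number of dB, so the gain is capped by the headroom left under the ceiling.
    """
    target = PLATFORM_TARGETS_LUFS[platform]
    wanted = target - measured_lufs
    headroom = TRUE_PEAK_CEILING_DBTP - measured_true_peak_dbtp
    return min(wanted, headroom)


if __name__ == "__main__":
    # A mix at -20 LUFS with a -3 dBTP peak, aimed at Spotify.
    print(normalization_gain_db(-20.0, -3.0, "spotify"))
```

When the returned gain is smaller than the gap to the target, the mix cannot reach the target with gain alone; it would need limiting or a remaster, which this sketch does not attempt.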

How does the WebRTC audio pipeline work?

On capture, WebRTC runs the microphone signal through acoustic echo cancellation (AEC), noise suppression, and automatic gain control, then voice activity detection (VAD) with discontinuous transmission (DTX) to save bandwidth in silence. The audio is encoded with Opus, protected by in-band FEC, and packetized over RTP. At the receiver, the NetEQ jitter buffer absorbs network jitter, packet loss concealment (PLC) hides lost frames, and the decoder feeds the renderer, all inside a one-way, mouth-to-ear budget of under 150 ms.
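To make the receiver side more concrete, here is a deliberately simplified toy of what a jitter buffer and packet loss concealment do together: reorder packets by sequence number and fill gaps. It is not how NetEQ works internally (NetEQ adapts its delay and time-stretches audio); the `playout` function, the `SILENCE` placeholder, and the repeat-last-frame concealment are assumptions chosen for illustration.

```python
SILENCE = b""  # hypothetical placeholder used when nothing has arrived yet


def playout(packets, first_seq, count):
    """Return `count` frames in sequence order, starting at `first_seq`.

    `packets` is an iterable of (sequence_number, frame) in arrival order,
    which may be reordered, duplicated, or missing entries.
    Each output item is (frame, concealed), where `concealed` is True when
    the frame was lost and the previous good frame was repeated instead.
    """
    received = {}
    for seq, frame in packets:
        received.setdefault(seq, frame)  # keep the first copy, drop duplicates

    out = []
    last_good = SILENCE
    for seq in range(first_seq, first_seq + count):
        if seq in received:
            last_good = received[seq]
            out.append((last_good, False))
        else:
            # Naive concealment: repeat the last frame that did arrive.
            out.append((last_good, True))
    return out


if __name__ == "__main__":
    arrived = [(1, b"a"), (3, b"c"), (2, b"b"), (5, b"e")]  # 4 is lost, 2 is late
    for frame, concealed in playout(arrived, first_seq=1, count=5):
        print(frame, "concealed" if concealed else "ok")
```

A real buffer also has to decide how long to wait before declaring a packet lost, which is exactly the latency-versus-robustness trade-off the budget above forces.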

What is acceptable audio-to-video (lip-sync) latency?

Per ITU-R BT.1359, a lip-sync error stays below the detectability threshold when audio leads the video by no more than about 45 ms or lags by no more than about 125 ms; viewers tolerate sound arriving late better than sound arriving early. Broadcast recommendations are tighter: EBU R37 allows roughly 40 ms of audio lead and 60 ms of lag, and ATSC IS-191 roughly 15 ms of lead and 45 ms of lag. Past these windows viewers start to notice the mismatch, so live, conferencing, and OTT pipelines budget sync explicitly.
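The BT.1359 window quoted above can be turned into a simple pass/fail check. This is a minimal sketch, assuming you have already measured when the same event (a clap, a test flash-and-beep) is heard and seen at the output; the function names and the sign convention (positive means audio leads) are choices made here, not taken from the recommendation.

```python
# Detectability limits from ITU-R BT.1359 as quoted above (approximate values).
MAX_AUDIO_LEAD_MS = 45.0
MAX_AUDIO_LAG_MS = 125.0


def audio_lead_ms(audio_time_ms, video_time_ms):
    """Positive when the sound is heard before the picture shows the event."""
    return video_time_ms - audio_time_ms


def within_lipsync_window(lead_ms):
    """True when the offset falls inside the asymmetric tolerance window."""
    return -MAX_AUDIO_LAG_MS <= lead_ms <= MAX_AUDIO_LEAD_MS


if __name__ == "__main__":
    # The clap is heard at 1000 ms and seen at 1030 ms: audio leads by 30 ms.
    lead = audio_lead_ms(audio_time_ms=1000.0, video_time_ms=1030.0)
    print(lead, within_lipsync_window(lead))
```

The asymmetry is the point of the check: a 100 ms lag passes while a 100 ms lead fails, so a single symmetric threshold would misjudge one side or the other.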

What's the difference between Dolby Atmos and MPEG-H?

Both are immersive, object-based audio formats: instead of fixed channels, they carry sound objects plus positional metadata that a renderer maps to any speaker layout or to headphones. Dolby Atmos is proprietary and dominant in cinema and streaming (Netflix, Disney+, Apple Music). MPEG-H 3D Audio is the open ISO standard (used in ATSC 3.0 broadcast and by some music services) and adds listener interactivity such as dialogue-level control. Atmos has wider device reach; MPEG-H gives the listener more control over the mix.

Need to ship audio in video, not just understand it?

Fora Soft has built real-time video, audio, and AI products since 2005 — WebRTC, LiveKit, generative pipelines, and AI agents at scale. Tell us what you’re building and we’ll send a real engineer your way.

Specialist software house for video, real-time and AI products. Founded 2005. 50 in-house engineers.

+1 (914) 775-5855
New York · USA
© Fora Soft, 2005–2026