Learning course · Updated June 2026
How audio actually works inside a video product — sample rate and loudness, the AAC and Opus codecs, LUFS targets per platform, the WebRTC audio pipeline, lip-sync, and Dolby Atmos. A practical course from Fora Soft engineers, from the microphone to the viewer’s ears.
Every chapter starts with a question and ends with a production decision. Specs cited by document number — ITU-R, EBU, RFC. No marketing slides.
Outcomes
Six blocks that take you from the sound wave to the viewer’s ears. By the end, you can choose codecs, hit loudness compliance, build and debug the real-time audio pipeline, keep audio locked to the picture, and deliver immersive sound — for live, VOD, and conferencing.
Pick a path
The same 57 articles, ordered for what you actually need to do this quarter.
From "what is digital audio" to reading any audio spec without a glossary. Sample rate, bit depth, channels, lossless formats, and how loudness is measured.
Ship audio that holds up. The codec landscape, audio in HLS/DASH/CMAF, loudness compliance, and the full WebRTC audio pipeline.
The hard parts. Lip-sync and timestamps, drift correction, Dolby Atmos and MPEG-H, spatial audio, objective quality metrics, and where AI is taking audio next.
Syllabus
Every chapter is self-contained. Read in order, or jump straight to the block you need — from digital fundamentals to immersive audio.
Talk to the engineers who built it. Fora Soft helps teams choose audio codecs, hit loudness compliance, build and debug the WebRTC audio pipeline, fix lip-sync, and deliver immersive sound for telemedicine, conferencing, e-learning, live, and OTT.
Featured
Hand-picked deep dives across codecs, loudness, the real-time pipeline, and sync — the highest-impact reads first, before you commit to a learning path.
Reference
100+ terms with crisp definitions, aliases, and links to deep dives. From LUFS and Opus to NetEQ and Dolby Atmos — the full A–Z is one click away.
LUFS
Loudness Units relative to Full Scale (ITU-R BS.1770). The standard unit for perceived loudness; every streaming platform normalizes to a LUFS target.
Opus
The open, royalty-free codec (RFC 6716) that dominates WebRTC. Switches between SILK (speech) and CELT (music) and scales from 6 to 510 kbps.
AAC
Advanced Audio Coding — the default codec for MP4, HLS, and DASH playback, and the standard on Apple devices (AAC-LC, HE-AAC, xHE-AAC).
AEC
Acoustic Echo Cancellation — the WebRTC stage that removes far-end echo from a microphone signal (WebRTC AEC3).
Lip-sync
Audio-to-video timing alignment. ITU-R BT.1359 gives the offsets at which viewers start to notice that sound and picture are out of step.
Dolby Atmos
Object-based immersive audio that places sounds in 3D space, used across cinema, streaming, and music.
Written and maintained by
FAQ
Audio for video is the engineering discipline of capturing, encoding, delivering, and synchronizing sound alongside a video stream. It spans digital fundamentals (sample rate, bit depth, loudness), the codecs that compress sound (AAC, Opus, AC-4, MPEG-H), loudness normalization for streaming, the real-time WebRTC pipeline, lip-sync, and immersive formats like Dolby Atmos. Unlike music production, it is judged by intelligibility, loudness compliance, and staying in sync with the picture.
Use AAC for on-demand and broadcast streaming: it is the default in MP4, HLS, and DASH, and it is universally supported on Apple devices and smart TVs. Use Opus for real-time and interactive audio: it is the de-facto WebRTC codec (RFC 6716), royalty-free, and scales from 6 to 510 kbps while switching between speech (SILK) and music (CELT) modes. Many products ship both — AAC for playback, Opus for live.
Most platforms normalize to a fixed integrated-loudness target measured in LUFS (ITU-R BS.1770). Common 2026 targets: Spotify and YouTube around −14 LUFS, Apple Music around −16 LUFS, podcasts around −16 to −19 LUFS, and broadcast (ATSC A/85 / EBU R128) at −24 LKFS / −23 LUFS. Keep true peak at or below −1 dBTP to avoid clipping after encoding. Master to the platform's target, not to one universal number.
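The arithmetic behind "master to the platform's target" is a subtraction in decibels. The sketch below is a minimal illustration, assuming you already have an integrated-loudness reading (LUFS) and a true-peak reading (dBTP) from a BS.1770-compliant meter; the function names and the returned dictionary are hypothetical, not part of any library.

```python
def normalization_gain_db(measured_lufs: float, target_lufs: float) -> float:
    """Static gain (dB) that moves integrated loudness from measured to target."""
    return target_lufs - measured_lufs


def plan_master(measured_lufs: float, true_peak_dbtp: float,
                target_lufs: float, ceiling_dbtp: float = -1.0) -> dict:
    """Report the gain to apply and whether the result would exceed the peak ceiling.

    A static gain shifts loudness and true peak by the same number of dB,
    so the post-gain peak is simply the measured peak plus the gain.
    """
    gain_db = normalization_gain_db(measured_lufs, target_lufs)
    peak_after = true_peak_dbtp + gain_db
    return {
        "gain_db": gain_db,
        "peak_dbtp": peak_after,
        # True means plain gain is not enough: limit the peaks or accept a lower loudness.
        "needs_limiting": peak_after > ceiling_dbtp,
    }


# A mix measured at -20 LUFS with a -6 dBTP true peak:
streaming = plan_master(-20.0, -6.0, target_lufs=-14.0)  # needs +6 dB, peak lands at 0 dBTP
broadcast = plan_master(-20.0, -6.0, target_lufs=-23.0)  # needs -3 dB, peak lands at -9 dBTP
```

The same mix needs limiting for a −14 LUFS streaming target but only attenuation for a −23 LUFS broadcast target, which is why one master rarely fits every platform.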
On capture, WebRTC runs the microphone signal through acoustic echo cancellation (AEC), noise suppression, and automatic gain control, then voice activity detection (VAD) with discontinuous transmission (DTX) to save bandwidth in silence. The audio is encoded with Opus, protected by in-band FEC, and packetized over RTP. At the receiver, the NetEQ jitter buffer absorbs network jitter, packet loss concealment (PLC) hides lost frames, and the decoded audio feeds the renderer, all within a one-way latency budget of under 150 ms.
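The "encoded with Opus and packetized over RTP" step has a simple cost model worth keeping in mind. The sketch below assumes 20 ms frames at a 48 kHz clock and plain RTP over UDP over IPv4 with fixed headers only; SRTP authentication tags and RTP header extensions are not counted, and the function names are hypothetical.

```python
RTP_HEADER_BYTES = 12   # fixed RTP header, no extensions
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20  # no IP options


def samples_per_frame(sample_rate_hz: int, frame_ms: float) -> int:
    """Number of audio samples per channel carried by one frame."""
    return int(sample_rate_hz * frame_ms / 1000)


def avg_payload_bytes(bitrate_bps: float, frame_ms: float) -> float:
    """Average encoded payload per packet (Opus is usually VBR, so this is a mean)."""
    return bitrate_bps * frame_ms / 1000 / 8


def wire_bitrate_bps(bitrate_bps: float, frame_ms: float) -> float:
    """Codec bitrate plus per-packet header overhead, in bits per second."""
    packets_per_second = 1000 / frame_ms
    overhead_bytes = RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
    return bitrate_bps + overhead_bytes * 8 * packets_per_second


frame = samples_per_frame(48_000, 20)      # 960 samples
payload = avg_payload_bytes(32_000, 20)    # 80.0 bytes
on_wire = wire_bitrate_bps(32_000, 20)     # 48000.0 bits per second
```

At 32 kbps with 20 ms frames, headers add half again to the codec bitrate. Longer frames reduce that overhead but add latency and make each lost packet cost more audio.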
Per ITU-R BT.1359, a lip-sync error goes unnoticed while audio leads the video by no more than about 45 ms or lags it by no more than about 125 ms; viewers tolerate sound arriving late better than sound arriving early. Broadcast recommendations are tighter within the broadcast chain: EBU R37 allows audio to lead by at most 40 ms and lag by at most 60 ms, and ATSC IS-191 allows a lead of at most 15 ms and a lag of at most 45 ms. Past these windows the mismatch becomes objectionable, so live, conferencing, and OTT pipelines budget sync explicitly.
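The asymmetric window can be written as a small check. This is a sketch under stated assumptions: the two render times refer to the same captured event, the sign convention (positive means audio is late) is a choice made here, and the function names are hypothetical. The 45 ms and 125 ms limits are the BT.1359 figures quoted above.

```python
LEAD_LIMIT_MS = 45.0   # audio may be up to this far ahead of the picture
LAG_LIMIT_MS = 125.0   # audio may be up to this far behind the picture


def av_offset_ms(audio_render_ms: float, video_render_ms: float) -> float:
    """Offset between audio and video for the same event.

    Positive: audio is rendered after the picture (audio lags).
    Negative: audio is rendered before the picture (audio leads).
    """
    return audio_render_ms - video_render_ms


def classify_sync(offset_ms: float) -> str:
    """Place an A/V offset inside or outside the detectability window."""
    if offset_ms < -LEAD_LIMIT_MS:
        return "audio too early"
    if offset_ms > LAG_LIMIT_MS:
        return "audio too late"
    return "within window"


# The same 60 ms error is acceptable as a lag but not as a lead.
late = classify_sync(av_offset_ms(1060.0, 1000.0))   # "within window"
early = classify_sync(av_offset_ms(1000.0, 1060.0))  # "audio too early"
```

A monitoring job could feed this measured offsets from a test clip and alert only when the classification leaves the window, rather than on any non-zero offset.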
Dolby Atmos and MPEG-H 3D Audio are both immersive, object-based formats: instead of fixed channels, they carry sound objects plus positional metadata that a renderer maps to any speaker layout or to headphones. Dolby Atmos is proprietary and dominant in cinema and streaming (Netflix, Disney+, Apple Music). MPEG-H 3D Audio is the open ISO standard (used in ATSC 3.0 broadcast and by some music services) and adds listener interactivity such as dialogue-level control. Atmos has wider device reach; MPEG-H is more flexible.
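To make "objects plus positional metadata" concrete, here is a deliberately simplified sketch: one object with an azimuth, rendered to a two-speaker layout with a constant-power (sine/cosine) pan law. This is a textbook panning technique used for illustration only; it is not the Dolby Atmos or MPEG-H renderer, and the function name and the ±45° speaker convention are assumptions.

```python
import math


def stereo_gains(azimuth_deg: float) -> tuple[float, float]:
    """Constant-power pan for a stereo pair.

    azimuth_deg: -45 is the left speaker, 0 is center, +45 is the right speaker.
    Returns (left_gain, right_gain) with left^2 + right^2 == 1, so perceived
    power stays roughly constant as the object moves.
    """
    clamped = max(-45.0, min(45.0, azimuth_deg))
    theta = math.radians(clamped + 45.0)  # map -45..+45 degrees onto 0..90 degrees
    return math.cos(theta), math.sin(theta)


# The object carries position, not channel assignments; the renderer derives gains.
center_left, center_right = stereo_gains(0.0)   # both about 0.707
hard_left = stereo_gains(-45.0)                 # (1.0, 0.0)
```

A renderer for a larger layout would compute one gain per speaker from the same metadata, which is what lets a single object-based mix play on different speaker setups.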
Fora Soft has built real-time video, audio, and AI products since 2005 — WebRTC, LiveKit, generative pipelines, and AI agents at scale. Tell us what you’re building and we’ll send a real engineer your way.