
In our previous article, we broke down how video is structured and just how massive uncompressed files can get: a simple five-minute clip can take up a staggering 6.59 gigabytes. Storing and transmitting that much data isn't exactly practical, especially for streaming or sharing online.
That’s where video encoding comes in. It compresses video files, reducing their size while keeping the quality intact, making them much easier to store, stream, and send.
In this article, we’ll dive into how video encoding works, covering key concepts and exploring how different encoding methods balance file size, quality, and efficiency to optimize video for various use cases.
Understanding Video Encoding
But what exactly is video encoding, and why is it so important? Video encoding is the process of converting video into a format that is not only optimized for storage but also for transmission over a network. This process reduces the file size by eliminating unnecessary data, which is critical for efficient storage and faster transmission speeds.
Video encoding involves two main components: the encoder and the decoder. The encoder compresses the video for storage and transmission, while the decoder reverses the process to restore the video to a viewable format. The combination of encoder and decoder is referred to as a codec.
There are many different codecs, such as H.264, H.265, VP8, VP9, and AV1, each using distinct algorithms and approaches.
Compression is achieved by removing redundant data without drastically affecting the video's quality. With lossless encoding, only statistical redundancy is removed, so the decoded video is bit-for-bit identical to the original. However, this kind of encoding typically achieves only a modest reduction in size, around three to four times. To achieve more significant compression, lossy encoding is used, which discards data that viewers are unlikely to miss. While this reduces video quality, the trade-off is usually worthwhile for the sake of smaller files and lower bandwidth usage.
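To get an intuitive feel for the difference, here is a tiny, hypothetical Python experiment (not a real video codec): it compresses a synthetic grayscale frame losslessly with zlib, then quantizes the same frame first, a crude stand-in for what lossy encoders do, and compresses it again.

import zlib
import numpy as np

# A synthetic 8-bit grayscale "frame": a smooth gradient with a little noise,
# standing in for real picture data (purely illustrative).
rng = np.random.default_rng(0)
frame = (np.linspace(0, 255, 640 * 480) + rng.normal(0, 2, 640 * 480)).clip(0, 255).astype(np.uint8)
raw = frame.tobytes()

# Lossless: the original bytes can be reconstructed exactly.
lossless = zlib.compress(raw, level=9)

# "Lossy" stand-in: throw away the low-order bits (quantization) before compressing.
# The original can no longer be restored exactly, but the data shrinks much more.
quantized = (frame // 16 * 16).tobytes()
lossy = zlib.compress(quantized, level=9)

print(f"raw:      {len(raw):>8} bytes")
print(f"lossless: {len(lossless):>8} bytes ({len(raw) / len(lossless):.1f}x smaller)")
print(f"lossy:    {len(lossy):>8} bytes ({len(raw) / len(lossy):.1f}x smaller)")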
Video Encoding Approaches: How It Works
At its core, video encoding involves two main strategies for data compression:
- Encoding Information Within Individual Frames: Here, each frame of the video is treated as an individual image. The encoder works to compress the image by optimizing the data it contains, reducing the amount needed for storage or transmission.
- Encoding Information Between Adjacent Frames: Rather than encoding each frame independently, this method examines neighboring frames to identify common elements. For example, if there is minimal change from one frame to the next, the encoder avoids re-transmitting identical information, only sending the differences between frames.
Encoding Individual Frames: Understanding Color Encoding
In video encoding, color plays a crucial role in compression. As we discussed in our previous article, color information in videos is typically represented using the YCbCr color model. This model differs from other color models, like RGB, because it allows for greater compression while maintaining quality.
The YCbCr color model uses three components:
- Y: Represents the brightness or luminance of the image.
- Cb: Represents the blue chrominance.
- Cr: Represents the red chrominance.
Combining these three components reconstructs the full-color image.

While both RGB and YCbCr use three components to represent each pixel’s color, the latter allows for some data reduction without sacrificing much visual quality. This is possible because our eyes are more sensitive to brightness than to color variations, which is why the chrominance components (Cb and Cr) can be encoded at a lower resolution than the luminance component (Y). This technique is called chroma subsampling and significantly reduces the data required for color information.
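For reference, converting between RGB and YCbCr is just a weighted sum per pixel. Below is a minimal Python sketch using the common BT.601 / JPEG-style full-range coefficients; the function name and the use of numpy are our own illustration rather than part of any particular codec.

import numpy as np

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Convert an 8-bit RGB image (H x W x 3) to full-range YCbCr (BT.601 coefficients)."""
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)

    y  = 0.299 * r + 0.587 * g + 0.114 * b   # luminance: a weighted mix of R, G, B
    cb = 128 + 0.564 * (b - y)               # blue-difference chroma, centered at 128
    cr = 128 + 0.713 * (r - y)               # red-difference chroma, centered at 128

    return np.stack([y, cb, cr], axis=-1).clip(0, 255).astype(np.uint8)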
Why Chroma Subsampling Works
Our visual perception is naturally attuned to brightness over color. This is because the human eye relies more on rod cells, which detect light and shadow, than on cone cells, which perceive color. This biological trait allows video compression algorithms to reduce the amount of color data without significantly impacting visual quality.
You can see this in optical illusions that trick your brain into seeing different shades when the colors are actually the same. A good example is the checker shadow illusion, where the same colors look different because of how bright or dark the surrounding area is.

This principle underpins chroma subsampling: by prioritizing luminance (brightness) over chrominance (color information), we can remove redundant data while preserving the image’s clarity.
In chroma subsampling, the YCbCr color model plays a vital role. The Y component carries the luminance, while the Cb and Cr components store color details. Since our eyes are much more sensitive to brightness changes than to shifts in color, we can encode chroma at a lower resolution without a noticeable drop in perceived quality. This technique significantly reduces the data required for color representation, optimizing video compression.
Color Subsampling Systems: Balancing Quality and Efficiency
Imagine dividing an image into small blocks, where each section shares color data among a group of pixels. The degree of subsampling determines how much color information is discarded, directly influencing file size and compression efficiency. Modern video codecs employ various chroma subsampling schemes, including 4:4:4, 4:2:2, 4:2:0, and more. Each method balances visual fidelity with storage and bandwidth constraints by altering how chroma channels are sampled.
The notation system used to describe chroma subsampling follows the X:a:b format:
- X is the horizontal sampling reference: the number of luminance (Y) samples per row of the block, almost always 4.
- a indicates the number of chroma samples (Cr, Cb) in the first row.
- b specifies the number of chroma samples in the second row.
So 4:4:4 means that for every block of two rows of 4 luminance samples, there are 4 chroma samples in the first row and another 4 in the second row, i.e., no color information is discarded.

4:2:2 means that for the same two rows of 4 pixels, there are only 2 chroma samples in the top row and 2 in the bottom row, so horizontal color resolution is halved.

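To make the notation concrete, here is a small illustrative Python helper (our own sketch, following the simplified description above): for a block of two rows of X luminance samples, it counts the remaining chroma sample positions and works out the average bits per pixel, assuming 8 bits per sample.

def bits_per_pixel(x: int, a: int, b: int, bit_depth: int = 8) -> float:
    """Average bits per pixel for an X:a:b chroma subsampling scheme."""
    luma_samples = 2 * x                 # two rows of X brightness samples
    chroma_samples = 2 * (a + b)         # each kept position stores both Cb and Cr
    return (luma_samples + chroma_samples) * bit_depth / (2 * x)

for x, a, b in [(4, 4, 4), (4, 2, 2), (4, 1, 1), (4, 2, 0), (4, 1, 0)]:
    bpp = bits_per_pixel(x, a, b)
    frame_mb = 1920 * 1080 * bpp / 8 / 1e6   # one uncompressed Full HD frame
    print(f"{x}:{a}:{b}  {bpp:4.1f} bits/pixel  ~{frame_mb:.1f} MB per frame")

For a 1920x1080 frame this works out to roughly 6.2 MB at 4:4:4, 4.1 MB at 4:2:2, and 3.1 MB at 4:2:0, which is the halving relative to full sampling mentioned below.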
Common Subsampling Schemes Used in Modern Codecs
In video encoding, subsampling is a common technique used to reduce the amount of color data in a video file without noticeably affecting its quality. There are several subsampling schemes used in modern codecs, each balancing file size and visual quality differently.
4:4:4
This is the scheme where no subsampling takes place — each of the three YCbCr components (brightness and two color channels) is sampled at the same frequency. This scheme is often used in high-end applications like cinematic production and scanning, where preserving every bit of color detail is essential.
4:2:2
This scheme is used in professional systems, including scientific research and formats like MPEG-2. The brightness signal (Y) is fully preserved, while the color signals (Cb, Cr) are sampled at half the horizontal resolution, which reduces color detail but still maintains a good balance between quality and file size.
4:2:1
This is a rarer option, supported by only a limited range of encoders. It reduces the horizontal chroma resolution further than 4:2:2, but it has never been widely adopted in mainstream codecs.
4:1:1
This scheme reduces the horizontal chroma resolution to a quarter of the brightness signal’s resolution. This was used in the DV format, which was primarily aimed at low-budget or consumer video recording. Today, though, it still finds a place in professional settings like news production or video servers, where the trade-off between quality and compression is acceptable.
4:2:0
This is a very common scheme in consumer video formats, especially for systems like PAL and SECAM. It discards every second sample both horizontally and vertically for the chroma components, reducing bandwidth by half. This makes it particularly suitable for platforms that prioritize efficient streaming and storage, though it does slightly compromise the quality of color detail.
4:1:0
This is a subsampling ratio that is rarely used. In this case, both vertical and horizontal color resolution are greatly reduced, leading to a significant decrease in the bandwidth required. While it offers substantial compression, it’s typically only supported by specific codecs and is not commonly used for high-quality video production.
To give you a sense of how these schemes affect image quality, imagine the same image encoded with each of the subsampling types. The brightness (Y) stays the same, while the chroma resolution decreases as you move down the list. Despite the compression, the loss of quality is often barely noticeable to the naked eye, even in the more aggressive modes like 4:2:0 and 4:1:0.

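If you would like to see the effect on an actual picture, a rough way to approximate 4:2:0 is to average the Cb and Cr planes over 2x2 pixel blocks and stretch them back to full size, leaving Y untouched. The numpy sketch below is a simplified illustration (it assumes you already have the image in YCbCr, for example via the conversion function shown earlier), not the resampling filter a real codec would use.

import numpy as np

def simulate_420(ycbcr: np.ndarray) -> np.ndarray:
    """Crudely simulate 4:2:0: average Cb/Cr over 2x2 blocks, then repeat each value over the block."""
    out = ycbcr.astype(np.float64).copy()
    h, w = ycbcr.shape[:2]
    h2, w2 = h - h % 2, w - w % 2                     # ignore odd edge rows/columns
    for c in (1, 2):                                  # the Cb and Cr planes
        plane = out[:h2, :w2, c]
        blocks = plane.reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))
        out[:h2, :w2, c] = blocks.repeat(2, axis=0).repeat(2, axis=1)
    return out.clip(0, 255).astype(np.uint8)          # Y (channel 0) is left at full resolution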
Data Encoding Between Frames: Reducing Redundancy
One of the key principles in video compression is the identification and removal of temporal redundancy. Since video is essentially a sequence of images (or frames), adjacent frames often look quite similar to one another. For instance, in a news broadcast, the background remains static while only the anchor moves. Encoding techniques recognize this stability and focus on capturing the movement rather than redundantly storing unchanged pixels.
This predictive encoding method is a key reason modern video formats achieve such efficient compression. By minimizing unnecessary data storage while maintaining smooth playback, it allows high-quality videos to be streamed and stored with significantly reduced bandwidth and file sizes.
There are different types of frames in video encoding, each serving a specific purpose:
- I-frames (intra-coded frames) are encoded independently of all other frames, essentially as standalone images. They serve as keyframes, which allow random access into the video stream.
- P-frames (predicted frames) reference previous I- or P-frames and store only the differences from those frames (a toy sketch of this idea follows the list).
- B-frames (bidirectional frames) reference both previous and subsequent frames, offering even higher compression by encoding differences in both directions.
- D-frames (DC-coded frames from the MPEG-1 era) are heavily compressed, low-quality frames that are never referenced by other frame types. They were meant only for quick previews, such as fast-forwarding to find a specific scene; they're obsolete, so we won't discuss them further.

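As a toy illustration of the P-frame idea from the list above (this is not how any real codec stores data), you could keep only the pixel differences from the previous frame and add them back when decoding:

import numpy as np

def encode_p_frame(prev: np.ndarray, curr: np.ndarray, threshold: int = 2) -> np.ndarray:
    """Toy 'P-frame': keep only the differences from the previous frame; tiny changes are dropped."""
    diff = curr.astype(np.int16) - prev.astype(np.int16)
    diff[np.abs(diff) < threshold] = 0   # near-identical pixels cost (almost) nothing
    return diff                          # a real codec would entropy-code this sparse data

def decode_p_frame(prev: np.ndarray, diff: np.ndarray) -> np.ndarray:
    """Rebuild the current frame from the previous frame plus the stored differences."""
    return (prev.astype(np.int16) + diff).clip(0, 255).astype(np.uint8)

For a news broadcast with a static background, most of the diff array is zeros, which is exactly the redundancy the encoder exploits.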
The image below shows how I, P, and B frames are linked. The sequence of frames from one I-frame to the next I-frame (including any B- and P-frames in between) is called a Group of Pictures. This structure allows for efficient compression and lets you jump to any point in the video. Let’s talk about that in the next section.

The Impact of GOP Structure

As we've seen, the Group of Pictures (GOP) is key to efficient video encoding. It is a repeating pattern of I-, P-, and B-frames, arranged to strike a balance between how small the file gets and how well the video plays back and seeks. Two key values define the GOP structure:
- N: the number of frames from one I-frame to the next, i.e., the GOP length (size).
- M: the distance in frames between anchor frames (I- or P-frames).
For example, a GOP structure with N = 12 and M = 3 follows the pattern IBBPBBPBBPBB, with the next GOP starting on a fresh I-frame. In this setup, every twelfth frame is a keyframe (I-frame), while predicted (P) and bidirectional (B) frames fill the gaps, optimizing storage.
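As a quick sanity check, here is a tiny Python helper (our own illustration) that spells out the frame-type pattern for given N and M:

def gop_pattern(n: int, m: int) -> str:
    """Frame-type pattern of one GOP: n frames per GOP, an anchor (I or P) every m frames."""
    return "".join("I" if i == 0 else ("P" if i % m == 0 else "B") for i in range(n))

print(gop_pattern(12, 3))   # IBBPBBPBBPBB, and the next GOP then opens with a new I-frame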
The spacing of I-frames is chosen based on how quickly you need random access into the stream and how resilient it must be to errors. The share of P- and B-frames depends on how much compression you need and on the decoder's capabilities: B-frames demand more processing power but compress better. Varying how often each frame type appears is what makes the scheme flexible and scalable.
Shorter GOP lengths reduce latency and improve error recovery, making them suitable for real-time applications like live streaming. However, they require more storage because I-frames — being fully encoded images — consume more data. Conversely, longer GOP lengths offer better compression, making them ideal for pre-recorded content, though they may introduce decoding challenges when seeking through the video.

For instance, when encoding with ffmpeg, you can specify GOP settings with:
$ ffmpeg -i SAMPLE_MOVIE.mp4 -c:v libx264 -b:v 4M -x264-params keyint=24:bframes=2 OUTPUT.mp4
This command sets the maximum GOP length to 24 frames and allows up to two consecutive B-frames between anchor frames, a sensible trade-off between quality and compression.
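If you want to confirm what the encoder actually produced, you can list each frame's picture type with ffprobe. The snippet below simply wraps a standard ffprobe call in Python; treat it as a sketch, and note that the file name is the output of the command above.

import subprocess

# Ask ffprobe for the picture type (I/P/B) of every video frame in the encoded file.
result = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "frame=pict_type", "-of", "csv=p=0", "OUTPUT.mp4"],
    capture_output=True, text=True, check=True,
)
frame_types = [line.strip() for line in result.stdout.splitlines() if line.strip()]
print("".join(frame_types[:25]))   # note: ffprobe lists frames in decoding order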
Ultimately, the choice of GOP structure depends on the specific requirements of a video application, balancing file size, compression efficiency, and playback flexibility.
To Sum Up
Video encoding is key to efficiently storing and transmitting video content. By using color models like YCbCr and techniques like chroma subsampling and temporal redundancy reduction, encoding helps shrink file sizes without sacrificing too much quality. Understanding the different encoding strategies, frame types, and subsampling options is essential for choosing the best approach for everything from professional video production to online streaming.
Whether you're looking to improve performance, quality, or both, our team can help you get the best results for your system. Contact us or book a quick call — let’s figure out the perfect solution for your needs.
Take a look at our other articles too:
How to Know the Server Cost for a Video Platform? - Estimation Guide
AI Video Quality Enhancement: 6 Breakthrough Features for Perfect Streaming
Seamless App Updates: Ensuring a Disruption-Free User Experience