Digital video is a time series of pictures. Each picture is an array of pixels, and each pixel is a triplet of numbers representing how brightly the red, green, and blue LCD dots (or CRT phosphors, if you’re old school) glow. The representation in memory, however, is not of RGB values but of YCbCr values, which are calculated by multiplying the RGB values by a 3×3 matrix and then adding/subtracting some offsets. This converts the components into a gray value (Y, or luma) and two color-difference values, Cb and Cr (chroma blue and chroma red). The reason for doing this is that the human visual system is more sensitive to variations in luma than to variations in chroma (er, actually luminance and chrominance, see below). For the same reason, typically half or three-quarters of the chroma values are dropped and never stored; the missing ones are interpolated when converting back to RGB for display.
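To make that concrete, here’s a rough per-pixel sketch of the math in Python. The coefficients and offsets are the usual HD (BT.709) ones, picked as an example; real code of course works on whole planes in fixed point, with the chroma subsampled, rather than one float pixel at a time.

```python
# Rough per-pixel sketch of the RGB <-> YCbCr conversion described above.
# kr/kb default to the HD (BT.709) luma coefficients; other standards use
# other values. Real video code works on whole planes in fixed point and
# subsamples the chroma (4:2:0 drops 3/4 of the chroma samples, 4:2:2 half).

def rgb_to_ycbcr(r, g, b, kr=0.2126, kb=0.0722):
    """Gamma-corrected R'G'B' in [0,1] -> 8-bit studio-range Y'CbCr."""
    kg = 1.0 - kr - kb
    y  = kr * r + kg * g + kb * b         # luma: weighted sum of R'G'B'
    cb = (b - y) / (2.0 * (1.0 - kb))     # chroma blue, in [-0.5, 0.5]
    cr = (r - y) / (2.0 * (1.0 - kr))     # chroma red,  in [-0.5, 0.5]
    # Scale and offset into the usual 8-bit studio ranges.
    return 16 + 219 * y, 128 + 224 * cb, 128 + 224 * cr

def ycbcr_to_rgb(y, cb, cr, kr=0.2126, kb=0.0722):
    """8-bit studio-range Y'CbCr -> gamma-corrected R'G'B' (unclipped)."""
    kg = 1.0 - kr - kb
    y, cb, cr = (y - 16) / 219.0, (cb - 128) / 224.0, (cr - 128) / 224.0
    r = y + 2.0 * (1.0 - kr) * cr
    b = y + 2.0 * (1.0 - kb) * cb
    g = (y - kr * r - kb * b) / kg
    return r, g, b
```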
There are various theoretical reasons for choosing a particular matrix, and I’ve recently become interested in whether those reasons are actually valid. For historical reasons, early digital video copied analog precedent and used a matrix that is theoretically suboptimal. That matrix is used in standard-definition (SD) video, but was replaced with the theoretically correct one for high-definition (HD) video. There are other technical differences between SD and HD video, but this is the most significant one for color accuracy.
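For reference (assuming, as is conventional, that the SD matrix means ITU-R BT.601 and the HD matrix BT.709), the luma coefficients that define the two matrices are:

```python
# Luma coefficients (Kr, Kg, Kb), where Y' = Kr*R' + Kg*G' + Kb*B'.
BT601_SD = (0.299,  0.587,  0.114)    # standard definition
BT709_HD = (0.2126, 0.7152, 0.0722)   # high definition
```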
For some time, I’ve been curious how much of a visual difference there is between the two matrices. Here are two stills from Big Buck Bunny: the first is the original, correct image; the second is the same picture converted to YCbCr with the HDTV matrix and then back to RGB with the SDTV matrix. (To best see the differences, open the images in separate browser tabs and flip between them.)
If you are like me, you probably have trouble seeing the difference side by side, but flipping between them makes it fairly obvious. I chose this image because it has relatively saturated green and greenish-yellow, which shows off some of the largest differences.
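In code, the mismatch amounts to encoding with one set of coefficients and decoding with the other. A sketch, reusing the hypothetical helpers above:

```python
# Encode assuming the HD coefficients, then decode assuming the SD ones,
# as a decoder that guesses the wrong matrix would.
def hd_encode_sd_decode(r, g, b):
    y, cb, cr = rgb_to_ycbcr(r, g, b, kr=0.2126, kb=0.0722)   # HD (BT.709)
    return ycbcr_to_rgb(y, cb, cr, kr=0.299, kb=0.114)        # SD (BT.601)

# A saturated green picks up a little red and blue and its green channel
# overshoots 1.0 (the sketch doesn't clip), so both hue and brightness shift.
print(hd_encode_sd_decode(0.0, 1.0, 0.0))   # ~(0.08, 1.17, 0.03)
```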
The RGB values used in computation are not proportional to the amount of light a monitor actually emits. This is known as gamma correction, and it is a clever byproduct of the fact that the response curve of television phosphors (the amount of light output for a given voltage) is approximately the inverse of the response curve of the eye (the perceived brightness for a given amount of light). Thus voltage became synonymous with perceived brightness, televisions had fewer vacuum tubes, and we’re left with that legacy. But it’s not a bad legacy, because just like dropping chroma values, it makes it easier to compress images.
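A rough sketch of what that means in practice, using a pure power law (real transfer functions such as sRGB’s are piecewise, but the idea is the same):

```python
# Gamma compression/expansion as a simple power law. Code values end up
# roughly proportional to perceived brightness rather than to emitted light,
# which is what you want when quantizing to 8 bits.

GAMMA = 2.4   # illustrative; real displays are typically around 2.2-2.4

def gamma_encode(linear_light):
    """Linear light in [0,1] -> gamma-compressed code value in [0,1]."""
    return linear_light ** (1.0 / GAMMA)

def gamma_decode(code_value):
    """Gamma-compressed code value -> linear light emitted by the display."""
    return code_value ** GAMMA

# A code value of 0.5 corresponds to only ~19% of maximum light output,
# yet it looks roughly half as bright to the eye.
print(gamma_decode(0.5))   # ~0.19
```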
However, color comes along and messes with that simplicity a bit. Luminance, in color theory, describes how the brain perceives the brightness of a particular pixel; it is a weighted sum of the RGB values in linear-light space, i.e., of the amount of light emanating from the display. Luma, on the other hand, is a weighted sum of the RGB values in gamma-corrected (actually, gamma-compressed) space. This means that luma doesn’t depend on luminance alone; it contains some variation due to color. That messes with our idea that matrixing RGB values will cleanly separate variations in brightness from variations in color. How visible is it? I took the above picture and squashed the luma to one value, leaving the chroma values the same (HD matrix):
What you see here is that saturated areas appear brighter than the grey areas. This is chroma (i.e., the color values we use in calculations) feeding into luminance (i.e., the perception of brightness).
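Per pixel, the squashing experiment looks roughly like this (same math as the conversion sketch above, with the luma forced to a constant before converting back):

```python
# Luma-squash sketch for one gamma-corrected R'G'B' pixel in [0,1]:
# convert to Y'CbCr, force luma to a constant, convert back, clip.
def squash_luma(r, g, b, kr=0.2126, kb=0.0722, y_const=0.5):
    kg = 1.0 - kr - kb
    y  = kr * r + kg * g + kb * b
    cb = (b - y) / (2.0 * (1.0 - kb))
    cr = (r - y) / (2.0 * (1.0 - kr))

    y = y_const                           # every pixel gets the same luma
    r2 = y + 2.0 * (1.0 - kr) * cr
    b2 = y + 2.0 * (1.0 - kb) * cb
    g2 = (y - kr * r2 - kb * b2) / kg
    clip = lambda v: min(max(v, 0.0), 1.0)
    return clip(r2), clip(g2), clip(b2)

# A neutral gray comes back as exactly the constant, but a saturated green
# keeps most of its green channel, so it still emits much more light:
print(squash_luma(0.5, 0.5, 0.5))   # (0.5, 0.5, 0.5)
print(squash_luma(0.0, 1.0, 0.0))   # ~(0.0, 0.78, 0.0) with the HD matrix
```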
How much does this matter for image and video compression efficiency? It’s a minor inefficiency and a subtle visual difference. In other words, not very much.
Earlier I mentioned that the HD matrix was theoretically more correct than the SD matrix. What about in practice? Below is the same luma-squashed image with the SD matrix; notice that there’s a lot more leakage from chroma into luminance, especially in the green leaves.
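The same single-pixel check backs that up: with the SD coefficients, the squashed green comes back even brighter relative to the gray.

```python
# Same saturated green, luma squashed to 0.5 with each matrix
# (reusing the hypothetical squash_luma sketch above).
print(squash_luma(0.0, 1.0, 0.0, kr=0.2126, kb=0.0722))  # HD: ~(0.0, 0.78, 0.0)
print(squash_luma(0.0, 1.0, 0.0, kr=0.299,  kb=0.114))   # SD: ~(0.0, 0.91, 0.0)
```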