JPEG – The Gory Details – Part 1 – Introduction

To understand the background layer in the PDF, you need to understand how JPEG encodes its data. JPEG is a lossy format with a quality factor, typically between 0 and 100, that trades file size against image quality: a lower setting gives a smaller file and a poorer image. JPEG compresses its information in several ways, each trying to remove ‘non-visible’ information.

Color space transformation

The first step in converting an image to JPEG is to transform the color space from RGB (Red, Green, Blue) into YCbCr, with Y being the luminance channel and Cb and Cr the chrominance channels, split into blue and red. This is done because our eyes are more sensitive to brightness than to color, and within brightness most sensitive to green.

The transformation is

Y = 0.299R + 0.587G + 0.114B
Cb = 128 - 0.168736R - 0.331264G + 0.5B
Cr = 128 + 0.5R - 0.418688G - 0.081312B

Note how most of the luminance bandwidth is assigned to the green channel (almost 60%), while the Cr and Cb channels carry most of the red and blue.
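The transformation above can be sketched in a few lines of Python. This is only an illustration for a single 8-bit pixel; the function name rgb_to_ycbcr and the round-and-clamp step are my own assumptions, not something taken from a particular encoder.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr using the formulas above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Assumption: round to the nearest integer and clamp to 0..255.
    return tuple(max(0, min(255, round(v))) for v in (y, cb, cr))


if __name__ == "__main__":
    for name, pixel in [("white", (255, 255, 255)), ("green", (0, 255, 0))]:
        print(name, rgb_to_ycbcr(*pixel))
```

A grey or white pixel lands on Cb = Cr = 128, the neutral value of both chroma channels, which is why only colored areas are affected by what happens to Cb and Cr later on.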

As we have seen, sub-sampling the Cr and Cb channels will mostly affect red and blue channels but not the luminance channel, which contains the most visible information.





By sub-sampling the chroma channels, you can reduce the number of data points that need to be stored. Typically the luma channel is not sub-sampled, and in most devices the chroma channels are sub-sampled by a factor of 2 both vertically and horizontally, also known as 4:2:0, so each chroma channel is reduced to a quarter of its original samples. The exact sub-sampling scheme determines the side effects, especially in areas of high contrast, where colors are smeared out in the direction of the sub-sampling.
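As a rough sketch of 4:2:0 sub-sampling, the function below replaces each 2×2 group of chroma samples with its average. Averaging is just one possible choice (an encoder may filter or pick samples differently), and the name subsample_420 and the even-dimensions requirement are assumptions made for the example.

```python
def subsample_420(channel):
    """Halve a chroma channel in both directions by averaging 2x2 groups.

    `channel` is a list of rows; width and height are assumed to be even.
    """
    out = []
    for y in range(0, len(channel), 2):
        row = []
        for x in range(0, len(channel[0]), 2):
            total = (channel[y][x] + channel[y][x + 1]
                     + channel[y + 1][x] + channel[y + 1][x + 1])
            row.append(round(total / 4))
        out.append(row)
    return out


if __name__ == "__main__":
    cb = [
        [100, 100, 200, 200],
        [100, 100, 200, 200],
        [0, 0, 50, 50],
        [0, 0, 50, 50],
    ]
    print(subsample_420(cb))  # 16 samples in, 4 samples out
```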


The effect on the image is noticeable


Block Splitting

The image is divided into 8×8 blocks. At the bottom and the right edge, data may be added to align the image to an 8×8 boundary. Different filling algorithms will lead to different artifacts.
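One simple filling strategy is to repeat the last column and the last row until both dimensions are multiples of 8. The sketch below assumes that strategy; other encoders may fill differently, and the helper name pad_to_multiple_of_8 is hypothetical.

```python
def pad_to_multiple_of_8(channel):
    """Pad a channel (list of rows) to 8x8 alignment by repeating edge samples."""
    height, width = len(channel), len(channel[0])
    padded_w = -(-width // 8) * 8   # round up to the next multiple of 8
    padded_h = -(-height // 8) * 8
    # Extend every row to the right with its last sample.
    rows = [row + [row[-1]] * (padded_w - width) for row in channel]
    # Extend downwards by repeating the last (already widened) row.
    rows += [list(rows[-1]) for _ in range(padded_h - height)]
    return rows


if __name__ == "__main__":
    image = [[x + 10 * y for x in range(10)] for y in range(3)]  # 10 wide, 3 high
    padded = pad_to_multiple_of_8(image)
    print(len(padded[0]), "x", len(padded))
```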

Discrete Cosine Transform

This is where the ‘magic’ happens. Each 8×8 block is transformed into frequency space, giving an 8×8 matrix of ‘frequency’ information. The top left entry contains the DC component; the others are called AC components. Low frequencies are in the top left, high frequencies in the bottom right.
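To make the transform concrete, here is a naive, unoptimized 2-D DCT-II for a single 8×8 block. It assumes the samples have already been level-shifted (128 subtracted from each 8-bit value) as JPEG does before the transform; the function name dct_8x8 is my own, and real encoders use much faster factorizations.

```python
import math


def dct_8x8(block):
    """Naive 2-D DCT-II of one 8x8 block of level-shifted samples."""
    def scale(k):
        # The zero-frequency basis function gets an extra 1/sqrt(2) factor.
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):          # frequency index along the rows
        for v in range(8):      # frequency index along the columns
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * scale(u) * scale(v) * total
    return out


if __name__ == "__main__":
    flat = [[10] * 8 for _ in range(8)]   # a perfectly uniform block
    coeffs = dct_8x8(flat)
    print(round(coeffs[0][0], 6))         # all the energy ends up in the DC term
```

A uniform block has no variation, so only the top left (DC) entry is non-zero. The more detail a block contains, the more energy shows up towards the bottom right.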


Our eyes are more sensitive to lower frequency components, which allows us to compress further by removing the higher frequency components. Here the quality factor determines how aggressively we remove frequencies.

As Wikipedia explains:

This is done by simply dividing each component in the frequency domain by a constant for that component, and then rounding to the nearest integer. This rounding operation is the only lossy operation in the whole process (other than chroma subsampling) if the DCT computation is performed with sufficiently high precision. As a result of this, it is typically the case that many of the higher frequency components are rounded to zero, and many of the rest become small positive or negative numbers, which take many fewer bits to represent.

Many of the DCT frequency components will now be rounded to zero.
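The divide-and-round step can be sketched as follows. The quantization table here is a made-up example whose divisors simply grow with frequency; it is not one of the tables an actual encoder would use, and the way a quality factor scales a real table is not shown.

```python
def quantize(coeffs, table):
    """Divide each DCT coefficient by its table entry and round to an integer."""
    return [[round(coeffs[u][v] / table[u][v]) for v in range(8)]
            for u in range(8)]


def dequantize(quantized, table):
    """Approximate inverse used when decoding: multiply back by the table."""
    return [[quantized[u][v] * table[u][v] for v in range(8)]
            for u in range(8)]


# Hypothetical table: larger divisors for higher frequencies.
EXAMPLE_TABLE = [[8 + 4 * (u + v) for v in range(8)] for u in range(8)]

if __name__ == "__main__":
    coeffs = [[0.0] * 8 for _ in range(8)]
    coeffs[0][0] = 80.0    # DC component
    coeffs[0][1] = -31.0   # low-frequency AC component
    coeffs[7][7] = 5.0     # high-frequency AC component
    q = quantize(coeffs, EXAMPLE_TABLE)
    print(q[0][0], q[0][1], q[7][7])
    print(dequantize(q, EXAMPLE_TABLE)[0][1])  # no longer exactly -31
```

The small high-frequency coefficient disappears entirely, and the low-frequency one comes back slightly changed after decoding. That difference is the information that has been lost.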

Entropy coding

By reading the matrix out in zig-zag order, you collect the many zeros into long consecutive runs. Applying what is called run-length encoding, you can then represent such a run by something like 12 00 (hexadecimal 12, meaning 18 values of zero). Finally, Huffman encoding is used to compress the data further.

zig-zag order
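The zig-zag scan and a simplified run-length step might look like the sketch below. The (zero run, value) pairs and the "EOB" end-of-block marker are a simplification of my own for illustration; the actual JPEG symbols also carry a bit size and are then Huffman coded, which is not shown here.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n matrix in zig-zag order."""
    def key(pos):
        row, col = pos
        diagonal = row + col
        # Odd diagonals run top-right to bottom-left, even ones the other way.
        return (diagonal, row if diagonal % 2 else col)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def run_length_encode(values):
    """Encode values as (zero_run, value) pairs; "EOB" replaces trailing zeros."""
    pairs, run = [], 0
    for value in values:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    if run:
        pairs.append("EOB")
    return pairs


if __name__ == "__main__":
    block = [[0] * 8 for _ in range(8)]
    block[0][0], block[0][1], block[2][0] = 10, -3, 2
    scanned = [block[r][c] for r, c in zigzag_order()]
    # The DC value (first entry) is handled separately; encode the 63 AC values.
    print(run_length_encode(scanned[1:]))
```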

Compression Artifacts

JPEG compression can lead to well-known artifacts such as blocky images (the 8×8 blocks become visible) and noise around sharp boundaries. The latter particularly affects text on a background, since text by definition has sharp boundaries.