Lesson6

JPEG is the real start of the image compression.

REF

JFIF
ITU.81

Here we only talk about the classic JPEG, so progressive JPEG or JPEG2000 will not be covered.

Markers

The jpeg file consists of a set of markers. List some here, see more in itu-t81 B.1.1.3 Marker assignments

Marker	value	playoad	Meaning	Comments
SOI	0xFF, 0xD8	none	Start Of Image
SOF0	0xFF, 0xC0	variable	Start Of Frame	(baseline DCT) Indicates that this is a baseline DCT-based JPEG, and specifies the width, height, number of components, and component subsampling (e.g., 4:2:0).
SOF2	0xFF, 0xC2	variable	Start Of Frame	(progressive DCT) Indicates that this is a progressive DCT-based JPEG, and specifies the width, height, number of components, and component subsampling (e.g., 4:2:0).
DHT	0xFF, 0xC4	variable	Define Huffman Table(s)	Specifies one or more Huffman tables.
DQT	0xFF, 0xDB	variable	Define Quantization Table(s)	Specifies one or more quantization tables.
DRI	0xFF, 0xDD	4 bytes	Define Restart Interval	Specifies the interval between RSTn markers, in Minimum Coded Units (MCUs). This marker is followed by two bytes indicating the fixed size so it can be treated like any other variable size segment.
SOS	0xFF, 0xDA	variable	Start Of Scan	Begins a top-to-bottom scan of the image. In baseline DCT JPEG images, there is generally a single scan. Progressive DCT JPEG images usually contain multiple scans. This marker specifies which slice of data it will contain, and is immediately followed by entropy-coded data.
RSTn	0xFF, 0xDn (n=0..7)	none	Restart	Inserted every r macroblocks, where r is the restart interval set by a DRI marker. Not used if there was no DRI marker. The low three bits of the marker code cycle in value from 0 to 7.
APPn	0xFF, 0xEn	variable	Application-specific	For example, an Exif JPEG file uses an APP1 marker to store metadata, laid out in a structure based closely on TIFF.
COM	0xFF, 0xFE	variable	Comment	Contains a text comment.
EOI	0xFF, 0xD9	none	End Of Image

The markers follow the TLV (tag - length - value) structure. the Tag and Length themselves are two byte long. Except for SOI, EOI, RST, TEM, they will only contain the two tag bytes.

You may notice that the Tag start with 0xFF and the second the byte should not be 0x00 or 0xFF, see requirements in B.1.1.2.

If we do see a 0xFF00 or 0xFFFF, then it is not a valid marker, we can skip all those as null marker.

itu-t81 give the payload of these markers from B.2.2 to B.2.5

List some of them below here.

B.2.2 SOFn

2 bytes	2 bytes	1 byte	2 bytes	2 bytes	1 byte	n * 3 bytes
SOFn	Length	depth	height	width	number of components	componets

componets have the struct

1 bytes	4 bits	4 bits	1 bytes
component identifier	Horizontal sampling factor	Vertical sampling factor	Quantization table destination selector

B.2.3 SOS

2 bytes	2 bytes	1 byte	2 * num_of_componets bytes	1 bytes	1 byte	4 bits	4 bits
SOS	length	num of componets	componets	spectral start	spectral end	Successive approximation bit position high	Successive approximation bit position low

B.2.4.1 DQT

2 bytes	2 bytes	4 bits	4 bits	n * 2 bytes or n bytes
DQT	Length	DQT element precision(0 or 1)	DQT id (0 - 3)	componets

B.2.4.2 DHT

2 bytes	2 bytes	4 bits	4 bits	1* 16 bytes	symbols accumulated bytes
DHT	Length	table class(0/DC or 1/AC)	DHT id (0 - 3)	num of codecs for each codec length	codecs symbols for each codec length

B.2.4.6 APPn

2 bytes	2 bytes	Length bytes
APPn	Length	for APP0,see jfif3 page 5,6

So APP0 could have the following in the payload

5 bytes	2 bytes	1 byte	2 bytes	2 bytes	1 byte	1 byte	3n bytes
identifier("JFIF")	version(0x0102)	units	Xdensity	Y density	Xthumbnail	Ythumbnail	RGBn

And immediately following the JFIF APP0 marker segment may be a JFIF extension APP0 marker.

5 bytes	1 byte	varible byte
identifier("JFXX")	extension_code	extension_data

Now we know the structures of all the markers. The code for parsing should be very easy, just read and handle the markers in the order you met them. But we should take care in case we have null bytes 0xFFFF/0xFF00.

See B.2.1 Figure B.2 we will know that all the compressed/encoded data are put just after the SOS. Accoding to the spec, we may have several DNL segments dividing the frame to several scans. However in baseline DCT JPEG images, there is generally a single scan.

So for a baseline JPEG image, the markers would be like

SOI(0xFFD8)
APP0(0xFFE0)
[APPn(0xFFEn)]optional
DQT(0xFFDB)
SOF0(0xFFC0)
DHT(0xFFC4)
SOS(0xFFDA)
compressed data
EOI(0xFFD9)

Decode/Decompress

Just like what we do for png, here we have already got the compressed data into one contiguous memory buffer. Now the decoder should take the job from now.

As a start, we should know the process of a DCT-basded decoder, see itu-t81 Figure6, it is a inverted process of a encoder process.

entropy decoding -> dequantation -> IDCT -> data merging/ up sample -> YUV to RGB

So the first step is that we have to do huffman decoding for the data. Itu 81 F.2.2 Baseline Huffman Decoding procedures give us some procedures. However it is not that easy to understand if you did not know the basis of huffman coding.

What is huffman coding? In simple words, sort the symbols by the order of frequency and then encode them with variable length codecs. The more frequent the symbol shows, the shorter the codec should be. The encoding process is like build a tree. Left child code should be 0 and right child code should be 1. Then parent frequency should be the sum of their children. And do it recursively you can get a tree with all symbols. Read all the bits from root to leaf is a codec for the leaf.

See DHT structure and we can get the the table class/id, and symbol number and symbols for each component. It has 16 bytes storing the numbers for each length of symbol codec starting from length 1 not 0.

For example, YUV each has one AC huffman table and one DC huffman table. And U V will share the same huffman table. Then we will get four DHT.

One of the DHT is like FF C4 00 1D 00 00 02 02 03 01 01 01 00 00 00 00 00 00 00 00 00 03 04 02 05 01 06 07 00 08 09, put it in a readable format:

DC DHT 0: 00 02 02 03 01 01 01 00 00 00 00 00 00 00 00 00 
1: 
2: 3 4 
3: 2 5 
4: 1 6 7 
5: 0 
6: 8 
7: 9 
8: 
9: 
10: 
11: 
12: 
13: 
14: 
15: 
16:

Jpeg is using Canonical Huffman Code. It will follow three rules:

first symbol must be 0, For example consider the length 3 then the codec is 000.
codecs should be continous with the same length of codecs
first codec with length j+1 should be calculated from last codec with length j. C(j+1) = (Cj + 1) << 1. For example, last codec in 2 bits is 01 and the next codec is 3 bits, then it is 100. If we did not have codec in 3 bits. The next codec is 4 bits, then the codec should be C(j+k) = (Cj + 1) << k, which is 1000. Then we can build a codec table for lookup. Basicly we can build a table for like

symbols. 3,  4,   2,   5,   1,    6,    7,     0,      8,      9
codec.   00, 01, 100, 101, 1100, 1101, 1110,  11110,  111110, 1111110

It would require a bitstream operator that could control the read bits from stream bytes. When we see a byte 01100101 in stream, check the table, we can know it is 4, 2, 5 three symbols there. We can do the lookup like

uint8_t symbols[]; // hold all the symbols
uint8_t codecs[];  // hold all the codec

let c = 0
while bitlen of c < max bitlen
    read 1 bit and append it to c, so c = c << 1 | read_bit()
    for each of the codecs, compare the symbols:
        if c == codecs[i]:
            return symbols[i]
    bitlen of c ++

A trick for speed up the lookup, we will make it like below. With this kind of table, we can always read 16bits or 8bits from stream and know the codec and bits it should cosume. Then put the unused bits back to the stream. If we consider save some memory for decoding, like we can read 8 bits and build the fast table in 256 entries. If we do have symbol codec with length greater than 8, we can still use the slow way like above.

fast lookup table:
       codec.  symbol.   bits.
        0       3       2
        1       3       2
        2       3       2
        3       3       2
        4       3       2
        5       3       2
        6       3       2
        7       3       2
        8       3       2
        9       3       2
        10      3       2
        11      3       2
        12      3       2
        13      3       2
        14      3       2
        15      3       2
        16      3       2
        17      3       2
        18      3       2
        19      3       2
        20      3       2
        21      3       2
        22      3       2
        23      3       2
        24      3       2
        25      3       2
        26      3       2
        27      3       2
        28      3       2
        29      3       2
        30      3       2
        31      3       2
        32      3       2
        33      3       2
        34      3       2
        35      3       2
        36      3       2
        37      3       2
        38      3       2
        39      3       2
        40      3       2
        41      3       2
        42      3       2
        43      3       2
        44      3       2
        45      3       2
        46      3       2
        47      3       2
        48      3       2
        49      3       2
        50      3       2
        51      3       2
        52      3       2
        53      3       2
        54      3       2
        55      3       2
        56      3       2
        57      3       2
        58      3       2
        59      3       2
        60      3       2
        61      3       2
        62      3       2
        63      3       2
        64      4       2
        65      4       2
        66      4       2
        67      4       2
        68      4       2
        69      4       2
        ......

Why can we do the table like this? As the codec for symbols 3 is 00, so any byte start with two 00 bits could consider to be a symbol 3. From 64 which is 01000000 in binary format, we can know that the begining two bits is 01 and it should be codec for symbol 4. In this way, we can fill the sparse table. It will be a full index table then. So just read 8 bits and than we know the real symbol and bits it should use, the bitstream operator should make a mark that only 2 bits of the the byte is used, when we read next byte, concat them to the rest 6 bits of last read. Look it up in the table again and get the next symbol and bits it uses.

The presuedo code for fast decoding is simple

    build lookup tree for DHT
    index = read_n_bits
    codec, bits = lookup_table[index]
    step back bits
    return the codec

Now prepare for the real decoding is ready.

After the huffman entropy decoding, what we got is 1 DC and 63 AC coefficients for each MCU component.

For 1 DC coefficient and 63 AC coefficients, they will use different DHTs. And what's more, the DC coefficient will be encoded with VLC again for the delta value between previous DC and current DC value.

See F.1.1.5.1 / F.1.2.1.1, DC DIFF = ZZ(0) - PRED. ZZ(0) means the zigzag sequence index 0 coefficient which is the real DC value. When we see DRI, the PRED is initialize to 0.

Claim no rights on posts. CC @junka
Have fun with low level details.
Fill an issue if you like.

Home
- Lesson1
Legacy formats
- Lesson2
- Lesson3
- Lesson4
- Lesson5
JPEG
- Lesson6
- Lesson7
- Lesson8
Modern formats
- Lesson9
- Lesson10
- Lesson11
- Lesson12
- Lesson13
- Lesson14
- Lesson15
Acceleration
- Lesson16
- Lesson17
Further Topics
- Lesson18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lesson6

JPEG is the real start of the image compression.

Markers

B.2.2 SOFn

B.2.3 SOS

B.2.4.1 DQT

B.2.4.2 DHT

B.2.4.6 APPn

Decode/Decompress

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally