Proposal: Deprecate padding in favor of variable-sized trailing characters

The current scheme has five different modes of encoding (the notation below is the regular expression for KS X 1001 Hangul syllables numbered from 0 to 2,349):
- `[0-1023]{4}*` for every data which byte size is 0 (mod 5);
- `[0-1023]{4}* [0 4 ... 1020] 2324{3}` for every data which byte size is 1 (mod 5);
- `[0-1023]{4}* [0-1023] [0 16 ... 1008] 2324{2}` for every data which byte size is 2 (mod 5);
- `[0-1023]{4}* [0-1023]{2} [0 64 ... 960] 2324` for every data which byte size is 3 (mod 5);
- `[0-1023]{4}* [0-1023]{3} [1024-1027]` for every data which byte size is 4 (mod 5).

This has redundant padding characters and is non-trivial to verify, especially when we have lots of spare characters available. I hereby propose an alternative design that allows for no excess padding and can be extended to data with the non-integral number of bytes:
- `[0-1023]*` for every data which _bit_ size is 0 (mod 10);
- `[0-1023]* [1024-1535]` for every data which bit size is 9 (mod 10);
- `[0-1023]* [1536-1791]` for every data which bit size is 8 (mod 10);
- `[0-1023]* [1792-1919]` for every data which bit size is 7 (mod 10);
- ...
- `[0-1023]* [2040-2043]` for every data which bit size is 2 (mod 10);
- `[0-1023]* [2044-2045]` for every data which bit size is 1 (mod 10).
- (In this scheme `[2046-2047]` is reserved. They should not be used for any other purpose than signaling some out-of-band information that doesn't translate to the non-zero bits of data.)

Therefore, quoting the original examples, the byte sequence `31` (in hex) is encoded to `1585` (`1_10_00110001` in bin) and `31 32 33 64` is encoded to `196 803 217 2040` (`0_00110001_00 0_110010_0011 0_0011_011001 1_11111110_00` in bin). Note that the number of leftmost `1` bits in each encoded Hangul syllable is 10 minus the number of bits, i.e. each data is `10-nbits` ones followed by a single zero and the subsequent `nbits`-bit data.

This scheme is flexible, and when used only in the byte-sized data, fully backward-compatible to the current scheme (`[1024-1027]` is only used when the bit size is 9 (mod 10); a small change to the current spec---using `[2040-2043]` instead---will make it compatible even with the bit-sized data). And it still leaves some hundreds of special characters at the end (including the old padding character `2343`) which can be used for other purposes later. For example, `[2048-2303]` can be used for the single byte repeating 5 times (analogous to [Ascii85](http://en.wikipedia.org/wiki/Ascii85)'s `y` or `z`); the byte sequence `20 20 20 20 20` can be encoded to `2080` (instead of the longer `128 514 8 32`).

One downside is that counting the number of leftmost `1` bits is not trivial. It is not possible with a simple expression, as _Hackers' Delight_ [Section 2-1](http://www.hackersdelight.org/basics2.pdf) has proved. I consider this a minor issue however, since it is only required only once per decoding and can be slightly optimized out (e.g. using the 32-entry lookup table twice).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Deprecate padding in favor of variable-sized trailing characters #3

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Proposal: Deprecate padding in favor of variable-sized trailing characters #3

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions