-
Notifications
You must be signed in to change notification settings - Fork 6
Description
The current scheme has five different modes of encoding (the notation below is the regular expression for KS X 1001 Hangul syllables numbered from 0 to 2,349):
[0-1023]{4}*for every data which byte size is 0 (mod 5);[0-1023]{4}* [0 4 ... 1020] 2324{3}for every data which byte size is 1 (mod 5);[0-1023]{4}* [0-1023] [0 16 ... 1008] 2324{2}for every data which byte size is 2 (mod 5);[0-1023]{4}* [0-1023]{2} [0 64 ... 960] 2324for every data which byte size is 3 (mod 5);[0-1023]{4}* [0-1023]{3} [1024-1027]for every data which byte size is 4 (mod 5).
This has redundant padding characters and is non-trivial to verify, especially when we have lots of spare characters available. I hereby propose an alternative design that allows for no excess padding and can be extended to data with the non-integral number of bytes:
[0-1023]*for every data which bit size is 0 (mod 10);[0-1023]* [1024-1535]for every data which bit size is 9 (mod 10);[0-1023]* [1536-1791]for every data which bit size is 8 (mod 10);[0-1023]* [1792-1919]for every data which bit size is 7 (mod 10);- ...
[0-1023]* [2040-2043]for every data which bit size is 2 (mod 10);[0-1023]* [2044-2045]for every data which bit size is 1 (mod 10).- (In this scheme
[2046-2047]is reserved. They should not be used for any other purpose than signaling some out-of-band information that doesn't translate to the non-zero bits of data.)
Therefore, quoting the original examples, the byte sequence 31 (in hex) is encoded to 1585 (1_10_00110001 in bin) and 31 32 33 64 is encoded to 196 803 217 2040 (0_00110001_00 0_110010_0011 0_0011_011001 1_11111110_00 in bin). Note that the number of leftmost 1 bits in each encoded Hangul syllable is 10 minus the number of bits, i.e. each data is 10-nbits ones followed by a single zero and the subsequent nbits-bit data.
This scheme is flexible, and when used only in the byte-sized data, fully backward-compatible to the current scheme ([1024-1027] is only used when the bit size is 9 (mod 10); a small change to the current spec---using [2040-2043] instead---will make it compatible even with the bit-sized data). And it still leaves some hundreds of special characters at the end (including the old padding character 2343) which can be used for other purposes later. For example, [2048-2303] can be used for the single byte repeating 5 times (analogous to Ascii85's y or z); the byte sequence 20 20 20 20 20 can be encoded to 2080 (instead of the longer 128 514 8 32).
One downside is that counting the number of leftmost 1 bits is not trivial. It is not possible with a simple expression, as Hackers' Delight Section 2-1 has proved. I consider this a minor issue however, since it is only required only once per decoding and can be slightly optimized out (e.g. using the 32-entry lookup table twice).