Skip to content

Proposal: Deprecate padding in favor of variable-sized trailing characters #3

@lifthrasiir

Description

@lifthrasiir

The current scheme has five different modes of encoding (the notation below is the regular expression for KS X 1001 Hangul syllables numbered from 0 to 2,349):

  • [0-1023]{4}* for every data which byte size is 0 (mod 5);
  • [0-1023]{4}* [0 4 ... 1020] 2324{3} for every data which byte size is 1 (mod 5);
  • [0-1023]{4}* [0-1023] [0 16 ... 1008] 2324{2} for every data which byte size is 2 (mod 5);
  • [0-1023]{4}* [0-1023]{2} [0 64 ... 960] 2324 for every data which byte size is 3 (mod 5);
  • [0-1023]{4}* [0-1023]{3} [1024-1027] for every data which byte size is 4 (mod 5).

This has redundant padding characters and is non-trivial to verify, especially when we have lots of spare characters available. I hereby propose an alternative design that allows for no excess padding and can be extended to data with the non-integral number of bytes:

  • [0-1023]* for every data which bit size is 0 (mod 10);
  • [0-1023]* [1024-1535] for every data which bit size is 9 (mod 10);
  • [0-1023]* [1536-1791] for every data which bit size is 8 (mod 10);
  • [0-1023]* [1792-1919] for every data which bit size is 7 (mod 10);
  • ...
  • [0-1023]* [2040-2043] for every data which bit size is 2 (mod 10);
  • [0-1023]* [2044-2045] for every data which bit size is 1 (mod 10).
  • (In this scheme [2046-2047] is reserved. They should not be used for any other purpose than signaling some out-of-band information that doesn't translate to the non-zero bits of data.)

Therefore, quoting the original examples, the byte sequence 31 (in hex) is encoded to 1585 (1_10_00110001 in bin) and 31 32 33 64 is encoded to 196 803 217 2040 (0_00110001_00 0_110010_0011 0_0011_011001 1_11111110_00 in bin). Note that the number of leftmost 1 bits in each encoded Hangul syllable is 10 minus the number of bits, i.e. each data is 10-nbits ones followed by a single zero and the subsequent nbits-bit data.

This scheme is flexible, and when used only in the byte-sized data, fully backward-compatible to the current scheme ([1024-1027] is only used when the bit size is 9 (mod 10); a small change to the current spec---using [2040-2043] instead---will make it compatible even with the bit-sized data). And it still leaves some hundreds of special characters at the end (including the old padding character 2343) which can be used for other purposes later. For example, [2048-2303] can be used for the single byte repeating 5 times (analogous to Ascii85's y or z); the byte sequence 20 20 20 20 20 can be encoded to 2080 (instead of the longer 128 514 8 32).

One downside is that counting the number of leftmost 1 bits is not trivial. It is not possible with a simple expression, as Hackers' Delight Section 2-1 has proved. I consider this a minor issue however, since it is only required only once per decoding and can be slightly optimized out (e.g. using the 32-entry lookup table twice).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions