Implement approximate length and other length routines for proper broken character processing #26

@lopex

Description

MRI has several character-length routines with different semantics, and they are used quite inconsistently. See the wiki: https://github.com/jruby/jruby/wiki/Encodings-in-JRuby.

For now we only have two semantics:

  • Return -1 for a broken character, or (-1 - n) when n bytes are missing from the stream (in jcodings itself).
  • StringSupport.preciseLength in JRuby core.
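
To make the first convention concrete, here is a minimal sketch of a strict UTF-8 length routine that returns -1 for a broken character and (-1 - n) when n bytes are missing. The class and method names (`LengthSketch`, `strictLength`, `missingBytes`) are hypothetical and are not the jcodings or JRuby API. The validation is also simplified: it does not reject overlong forms or surrogates.

```java
// Hypothetical sketch of the "-1 / (-1 - n)" convention; not the jcodings API.
final class LengthSketch {
    /** Expected UTF-8 sequence length for a lead byte, or -1 if it cannot start a character. */
    static int expectedLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if (b >= 0xC2 && b <= 0xDF) return 2;
        if (b >= 0xE0 && b <= 0xEF) return 3;
        if (b >= 0xF0 && b <= 0xF4) return 4;
        return -1;
    }

    /**
     * Returns the byte length of the character at p when it is complete,
     * -1 when it is broken, or (-1 - n) when n more bytes are needed before end.
     * Simplified: overlong forms and surrogates are not rejected.
     */
    static int strictLength(byte[] bytes, int p, int end) {
        int len = expectedLength(bytes[p]);
        if (len < 0) return -1;
        int available = Math.min(len, end - p);
        for (int i = 1; i < available; i++) {
            if ((bytes[p + i] & 0xC0) != 0x80) return -1; // not a continuation byte
        }
        if (available < len) return -1 - (len - available);
        return len;
    }

    /** True when the result encodes "n bytes missing" (n >= 1, so the value is <= -2). */
    static boolean isMissing(int result) {
        return result < -1;
    }

    /** Decodes n from a (-1 - n) result. */
    static int missingBytes(int result) {
        return -1 - result;
    }
}
```

Because n is at least 1, a "missing bytes" result is always -2 or lower, so it cannot be confused with the plain -1 that marks a broken character.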

There are several issues:
#25
jruby/joni#38
jruby/joni#17
jruby/joni#46

All of those are related to semantics where the length routine returns 1 for an invalid character, so scans can keep advancing while consuming arrays (whereas we return -1 and fall into infinite loops or an AIOOBE, i.e. ArrayIndexOutOfBoundsException).
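
The difference can be illustrated with a small scan loop. This is a sketch under assumptions, not the MRI or jcodings implementation: `strictLength` is a simplified, hypothetical UTF-8 validator using the negative-return convention described earlier, and `approximateLength` is a hypothetical wrapper that maps every negative result to 1 so the scan always makes progress.

```java
// Hypothetical sketch: an "approximate" length that never returns less than 1.
final class ApproximateScanSketch {
    /** Simplified strict routine: > 0 for a complete character, negative otherwise. */
    static int strictLength(byte[] bytes, int p, int end) {
        int b = bytes[p] & 0xFF;
        int len;
        if (b < 0x80) len = 1;
        else if (b >= 0xC2 && b <= 0xDF) len = 2;
        else if (b >= 0xE0 && b <= 0xEF) len = 3;
        else if (b >= 0xF0 && b <= 0xF4) len = 4;
        else return -1; // byte cannot start a character
        int available = Math.min(len, end - p);
        for (int i = 1; i < available; i++) {
            if ((bytes[p + i] & 0xC0) != 0x80) return -1; // broken sequence
        }
        return available < len ? -1 - (len - available) : len;
    }

    /** Treats a broken or truncated character as a single byte. */
    static int approximateLength(byte[] bytes, int p, int end) {
        int len = strictLength(bytes, p, end);
        return len > 0 ? len : 1;
    }

    /** Counts characters; terminates because every step advances by at least 1. */
    static int countChars(byte[] bytes, int p, int end) {
        int count = 0;
        while (p < end) {
            p += approximateLength(bytes, p, end);
            count++;
        }
        return count;
    }
}
```

A loop written as `p += strictLength(...)` would instead move backwards on a negative result, which is the kind of behaviour that leads to the infinite loops and out-of-bounds accesses mentioned above. The trade-off is that the wrapper still runs the validating routine on every character.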

Presto mitigated some of that by using our NonStrictUtf8Encoding here:
prestodb/presto#8711

Ultimately, we need to decide whether to scatter our code with more costly validating length routines (which would be wasteful for already validated Strings), or try a less wasteful approach by expanding on https://github.com/jruby/jcodings/tree/unsafe-encoding
