MRI has several character-length routines with differing semantics, and they are used quite inconsistently; see the wiki: https://github.com/jruby/jruby/wiki/Encodings-in-JRuby.
For now we only have two semantics:
- In jcodings itself: return -1 for a broken character, or (-1 - n) when n bytes are missing from the stream.
- In JRuby core: StringSupport.preciseLength.
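
To make the first convention concrete, here is a minimal sketch of a UTF-8 length routine that follows it. The class and method names are hypothetical and this is not the jcodings implementation; it also skips overlong and surrogate checks to stay short.

```java
public final class LengthConventionSketch {
    private LengthConventionSketch() {}

    /**
     * Hypothetical helper (not the jcodings API), simplified UTF-8 only.
     * Returns n > 0 for a complete character of n bytes,
     * -1 for a broken character,
     * (-1 - n) when n more bytes are needed before 'end'.
     * Assumes p < end.
     */
    public static int length(byte[] bytes, int p, int end) {
        int lead = bytes[p] & 0xFF;
        int need;
        if (lead < 0x80) {
            return 1;                       // ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            need = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            need = 3;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            need = 4;
        } else {
            return -1;                      // stray continuation or illegal lead byte
        }
        for (int i = 1; i < need; i++) {
            if (p + i >= end) {
                return -1 - (need - i);     // truncated: (need - i) bytes missing
            }
            if ((bytes[p + i] & 0xC0) != 0x80) {
                return -1;                  // expected a continuation byte
            }
        }
        return need;
    }
}
```

Under this convention a caller has to check the sign of the result before using it as a step size.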
There are several issues:
- #25
- jruby/joni#38
- jruby/joni#17
- jruby/joni#46
All of these stem from call sites that expect length to return 1 for an invalid character so that scans keep advancing while consuming arrays; where we return -1 instead, they fall into infinite loops or throw ArrayIndexOutOfBoundsException (AIOOBE).
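
The failure mode can be sketched as follows. This is an illustration only: `strictLength`, `tolerantLength` and `countChars` are hypothetical names, and the strict routine is a simplified UTF-8 check rather than the real jcodings code. The point is that a scan doing `p += length(...)` is only safe when the length is always positive.

```java
public final class ScanSketch {
    private ScanSketch() {}

    // Simplified strict UTF-8 length (hypothetical, not the jcodings code):
    // n > 0 if valid, -1 if broken, (-1 - n) if n bytes are missing.
    public static int strictLength(byte[] bytes, int p, int end) {
        int lead = bytes[p] & 0xFF;
        int need;
        if (lead < 0x80) return 1;
        else if (lead >= 0xC2 && lead <= 0xDF) need = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) need = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) need = 4;
        else return -1;
        for (int i = 1; i < need; i++) {
            if (p + i >= end) return -1 - (need - i);
            if ((bytes[p + i] & 0xC0) != 0x80) return -1;
        }
        return need;
    }

    // Lenient semantics: treat any broken or truncated character as one byte,
    // so the result is always a usable, positive step size.
    public static int tolerantLength(byte[] bytes, int p, int end) {
        int n = strictLength(bytes, p, end);
        return n > 0 ? n : 1;
    }

    // A scan that advances by the character length. If strictLength were used
    // here directly, a negative result would move p backwards instead of forwards.
    public static int countChars(byte[] bytes) {
        int p = 0;
        int count = 0;
        while (p < bytes.length) {
            p += tolerantLength(bytes, p, bytes.length);
            count++;
        }
        return count;
    }
}
```

With the tolerant wrapper the loop always terminates, at the cost of running the validating check on every step, which is the trade-off discussed below.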
Presto mitigated some of that by using our NonStrictUtf8Encoding here:
prestodb/presto#8711
Ultimately, we need to decide whether to litter our code with costlier validating length routines (which would be wasteful for already validated Strings) or to try a less wasteful approach by expanding on https://github.com/jruby/jcodings/tree/unsafe-encoding