`Feature`: implemented all `utf8` functions in stdlib #110

reloginn · 2025-05-02T19:15:02Z

All functions are based on and fully comply with the Lua 5.3 manual (https://www.lua.org/manual/5.3/).

Features

Implemented functions utf8.char, utf8.charpattern (string), utf8.codes, utf8.codepoint, utf8.len, utf8.offset.
Added tests in utf8.lua for all new functions.

… `utf8.len`, `utf8.codepoint` and `utf8.offset`

Aeledfyr · 2025-05-25T10:40:19Z

Note that Lua 5.4 has different behavior from Lua 5.3 for the utf8 module. In Lua 5.3, all of the functions can work with over-long utf8 code points (code points above 10FFFF, with encodings longer than 4 bytes), while Lua 5.4 disallows them by default and adds a lax parameter to conditionally allow them. The main issue is that Rust's String/str/char types only support fully valid UTF-8, so the lax mode has to be custom.

For more info, see the Lua 5.4 docs on the utf8 module.

The main question is whether piccolo should even support the lax mode, since if we don't, it simplifies the implementation a lot; most parts could be implemented with byte slices and utf_chunks()/Utf8Chunks.

Aeledfyr · 2025-05-25T10:41:19Z

For the utf8.codes iterator, the implementation can be simplified a bit by using an iterator function rather than interior mutability

Here's a description of the iterator/generic for loop approach, from the Lua 5.3 docs (but it's roughly the same in 5.4).

A for statement like

for var_1, ···, var_n in explist do block end

is equivalent to the code:

do
 local f, s, var = explist
 while true do
   local var_1, ···, var_n = f(s, var)
   if var_1 == nil then break end
   var = var_1
   block
 end
end

The standard implementation of utf8.codes returns a step function, the string, and an initial index of 0; then the for loop with call the step function with the string and the index as arguments. This lets you avoid the interior mutability / sequence stuff and instead have a single callback function.

reloginn · 2025-06-02T18:29:27Z

I've simplified utf8.codes, but it still has some bugs, which I'll fix as soon as possible. Regarding lax mode, I think it's better not to add it. However, we should mention somewhere (for example, in COMPATIBILITY.md, under a general heading for utf8, rather than for each specific function) that it's not supported. I'll simplify the other functions as much as I can.

Aeledfyr · 2025-06-02T23:01:51Z

Thank you! I agree with not supporting lax mode, since it makes everything simpler.

If it helps with the indexing math, I'm pretty sure adjust_index has to be different depending on whether it's a starting index or an ending index of a range, due to differences in the wrapping behavior.

fn convert_index(i: i64, len: usize) -> Option<usize> {
    let val = match i {
        0 => 0,
        v @ 1.. => v - 1,
        v @ ..=-1 => (len as i64 + v).max(0),
    };
    usize::try_from(val).ok()
}

fn convert_index_end(i: i64, len: usize) -> Option<usize> {
    let val = match i {
        v @ 0.. => v,
        v @ ..=-1 => (len as i64 + v + 1).max(0),
    };
    usize::try_from(val).ok()
}

fn sub(string: &[u8], i: i64, j: i64) -> &[u8] {
    let len = string.len();
    let i = convert_index(i, len).unwrap_or(usize::MAX);
    let j = convert_index_end(j, len).unwrap_or(usize::MAX).min(len);
    let slice = if i > j { &[] } else { &string[i..j] };
    slice
}

IIRC the above code worked when I tested it, and should simplify utf8.len and utf8.codepoint (call sub(i.unwrap_or(1), j.unwrap_or(len)) for len and roughly sub(i.unwrap_or(1), j.unwrap_or(i)) for codepoint).

Here are a few other ideas for ways of simplifying it:

len could use std::str::from_utf8, and if it's invalid utf8, you can get the index of the invalid byte from std::str::Utf8Error. When it's valid, you can count the characters with str.chars().count(). It's simpler and should optimize to basically the same inner loop as the current version.
codepoint should also be able to use from_utf8 as well; if there's an error, return the index, but otherwise you can do stack.extend(str.chars().map(|c| Value::Integer(c as i64))

Also, the s.to_str() call in the codes callback can be removed, since the chunks part already correctly handles invalid utf8.

reloginn · 2025-06-03T17:10:43Z

I've simplified utf8.codepoint and I think we need to simplify utf8.offset because it doesn't look good. What do you think about it?

Aeledfyr · 2025-06-03T17:40:59Z

For len and codepoint, the from_utf8 should be done after the subslice to have it error instead of panicking when the indices are in the middle of code points.

I'm not really sure if there's a clean way to improve utf8.offset. There are three cases there:

if n = 0, find the start of the character containing the index; I don't know of a clean way to do this with the rust stdlib
if n > 0, find the start of the nth char; this could be done with str.char_indices().nth(n - 1)
if n < 0, find the start of the nth char from the end; this should be possible with str.char_indices().rev().nth(n - 1)? The issue is that this requires checking that the entire string is valid utf8, which would be slow with large strings.

Given that the docs say

This function assumes that s is a valid UTF-8 string.

I think the current approach of scanning for the continuation bytes is fine.

Aeledfyr · 2025-06-03T17:41:11Z

src/stdlib/utf8.rs

+                            if c.is_ascii() {
+                                stack.replace(ctx, (n as i64 + 1, c as i64));
+                            } else {


This check isn't needed; len_utf8 will return 1 if c is ascii.

reloginn · 2025-08-10T15:52:35Z

Hi! I have implemented all of your recommendations. May we proceed with the integration?

reloginn added 3 commits May 2, 2025 23:49

feature: implemented utf8.char, utf8.charpattern, utf8.codes,…

5f895f9

… `utf8.len`, `utf8.codepoint` and `utf8.offset`

cargo fmt

b5643db

feature: added tests for all functions, fix some bugs

d5447ba

reloginn added 6 commits June 2, 2025 17:49

refactor: simplify utf8.codes

2c0320c

fix: index conversion

eccc6fa

fix: try to fix indexing bug

e743975

fix: compile error

b23cbb5

fix: indexing bug

b214a11

fix: remove useless if branch

e01396d

reloginn added 2 commits June 3, 2025 21:09

refactor: simplify utf8.len

a787394

refactor: simplify utf8.codepoint

0b5b72c

reloginn marked this pull request as draft June 3, 2025 17:13

Aeledfyr reviewed Jun 3, 2025

View reviewed changes

fix: recommendations for change have been implemented

ebe528e

reloginn marked this pull request as ready for review August 10, 2025 15:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Feature`: implemented all `utf8` functions in stdlib #110

`Feature`: implemented all `utf8` functions in stdlib #110

Uh oh!

reloginn commented May 2, 2025

Uh oh!

Aeledfyr commented May 25, 2025

Uh oh!

Aeledfyr commented May 25, 2025

Uh oh!

reloginn commented Jun 2, 2025

Uh oh!

Aeledfyr commented Jun 2, 2025

Uh oh!

reloginn commented Jun 3, 2025

Uh oh!

Aeledfyr commented Jun 3, 2025

Uh oh!

Aeledfyr Jun 3, 2025

Uh oh!

reloginn commented Aug 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Feature: implemented all utf8 functions in stdlib #110

Are you sure you want to change the base?

Feature: implemented all utf8 functions in stdlib #110

Uh oh!

Conversation

reloginn commented May 2, 2025

Features

Uh oh!

Aeledfyr commented May 25, 2025

Uh oh!

Aeledfyr commented May 25, 2025

Uh oh!

reloginn commented Jun 2, 2025

Uh oh!

Aeledfyr commented Jun 2, 2025

Uh oh!

reloginn commented Jun 3, 2025

Uh oh!

Aeledfyr commented Jun 3, 2025

Uh oh!

Aeledfyr Jun 3, 2025

Choose a reason for hiding this comment

Uh oh!

reloginn commented Aug 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`Feature`: implemented all `utf8` functions in stdlib #110

`Feature`: implemented all `utf8` functions in stdlib #110