Skip to content

phd: run_shell_command got serial echo out of order? #1024

@iximeow

Description

@iximeow

this is from seeing migrate_running_process flake in https://github.com/oxidecomputer/propolis/pull/1023/checks?check_run_id=61509073090 , where the logs from phd-runner are like:

1996	2026-01-26T18:51:34.848Z	INFO	phd-runner: [WAIT_FOR_SERIAL_OUTPUT - EVENT] Waiting for output on serial console
    file = phd-tests/framework/src/test_vm/mod.rs
    line = 804
    path = phd_tests::migrate::running_process::migrate_running_process
    target = phd_framework::test_vm
    timeout_duration = Explicit(15s)
    vm = migrate_running_process_source
    vm_id = b463172b-ae55-498a-9f94-e772d657dc5a
1997	2026-01-26T18:51:49.849Z	INFO	phd-runner: [WAIT_FOR_SERIAL_OUTPUT - END]
    elapsed_milliseconds = 15001
    file = phd-tests/framework/src/test_vm/mod.rs
    line = 797
    path = phd_tests::migrate::running_process::migrate_running_process
    target = phd_framework::test_vm
    vm = migrate_running_process_source
    vm_id = b463172b-ae55-498a-9f94-e772d657dc5a
1998	2026-01-26T18:51:49.849Z	DEBG	phd-runner: [TEST - EVENT] Killing Propolis server process
    file = phd-tests/framework/src/test_vm/server.rs
    line = 140
    path = phd_tests::migrate::running_process::migrate_running_process
    self.address = 127.0.0.1:9002
    target = phd_framework::test_vm::server
1999	2026-01-26T18:51:49.858Z	INFO	phd-runner: [VM CLEANUP - START]
    file = phd-tests/framework/src/test_vm/mod.rs
    line = 1054
    path = phd_tests::migrate::running_process::migrate_running_process
    target = phd_framework::test_vm
    vm = migrate_running_process_source
    vm_id = b463172b-ae55-498a-9f94-e772d657dc5a
2000	2026-01-26T18:51:49.859Z	DEBG	phd-runner: [SERIAL CONSOLE TASK - EVENT] serial console command channel was closed
    file = phd-tests/framework/src/serial/mod.rs
    line = 213
    target = phd_framework::serial
2001	2026-01-26T18:51:49.859Z	INFO	phd-runner: [SERIAL CONSOLE TASK - END]
    elapsed_milliseconds = 19426
    file = phd-tests/framework/src/serial/mod.rs
    line = 200
    target = phd_framework::serial
2002	2026-01-26T18:51:49.994Z	INFO	phd-runner: test phd_tests::migrate::running_process::migrate_running_process ... FAILED: deadline has elapsed
    
    Stack backtrace:
       0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
                 at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/anyhow-1.0.100/src/backtrace.rs:27:14
       1: <T as core::convert::Into<U>>::into
                 at /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/core/src/convert/mod.rs:784:9
       2: anyhow::kind::Trait::new
                 at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/anyhow-1.0.100/src/kind.rs:95:15
       3: phd_framework::test_vm::TestVm::wait_for_serial_output::{{closure}}::{{closure}}
                 at ./oxidecomputer/propolis/phd-tests/framework/src/test_vm/mod.rs:822:33
       4: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
                 at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tracing-0.1.40/src/instrument.rs:321:15
       5: phd_framework::test_vm::TestVm::wait_for_serial_output::{{closure}}
                 at ./oxidecomputer/propolis/phd-tests/framework/src/test_vm/mod.rs:797:5
       6: phd_framework::test_vm::TestVm::run_command_sequence::{{closure}}
                 at ./oxidecomputer/propolis/phd-tests/framework/src/test_vm/mod.rs:933:22
       7: phd_framework::test_vm::TestVm::run_shell_command::{{closure}}
                 at ./oxidecomputer/propolis/phd-tests/framework/src/test_vm/mod.rs:894:53
       8: phd_tests::migrate::running_process::mk_dirt::{{closure}}
                 at ./oxidecomputer/propolis/phd-tests/tests/src/migrate.rs:276:10

where we timed out waiting for the echo of this cat.

what we got instead, in migrate_running_process_source.serial.log (skipping lines before which are all normal) is:

localhost:~# cat >dirt.sh <<'EOF'
> #!/usr/bin/env sh
> # a simple script for testing migration of running process memory.
> #
> # this script will write a bunch of "hello"s to an in-memory string, then
> # suspend to wait for a migration to occur. after the migration happens, the
> # script can be foregrounded, and it will then check that the data is still                                                                                > # there.                                                                                                                                                   > dirt="hello"                                                                                                                                               > len=8192 # 8 KiB ought to be enough for anyone                                                                                                             >                                                                                                                                                            > # store 8k of data in memory
> i=0
> while [ "$i" -lt "$len" ]
> do
>     dirt="$dirt
> hello"
>     i=$(( i + 1 ))
> done    
> 
> echo "made dirt"
> 
> # suspend this process and wait for it to be resumed before checking that the
> # data still exists.
> #
> # N.B.: posix sh doesn't have a suspend builtin, but we can make our own!
> kill -s TSTP $$
> 
> # check that the data is still correct.
> #
> # we do this by writing it out to a file and then looping through the file,
> # because i wasn't sure how to loop over a variable line by line in posix sh
> # without using a file.
> dirtfile="/tmp/dirt.txt"
> echo "$dirt" > "$dirtfile"
> actual_len=$(wc -l "$dirtfile" | cut -d " " -f 1)
> echo "found $actual_len lines of dirt"
one < "$dirtfile"
EOF> 
> # pre-check the file's length
> if [ "$actual_len" -lt "$len" ]
> then
>     echo "not enough dirt: $actual_len < $len"
>     exit 1
> fi
>  
> i=0
> while read -r line
> do
>     if [ $i -eq 8192 ]; then
>         echo "all good"
>         exit 0
>     fi
>     if [ "$line" != "hello" ]
>     then
>         echo "bad dirt $i: $line"
>         exit 1
>     fi
>     i=$(( i + 1 ))
> done < "$dirtfile"

so at the end we're missing > EOF. but looking a few lines up, it's there and has a partial duplicate of done < "$dirtfile":

> echo "found $actual_len lines of dirt"
one < "$dirtfile"
EOF>#
> # pre-check the file's length

what gives. this is apparently a flake (rerunning the test went fine) and uncommon (I went back through a bunch of old Propolis CI runs and didn't see anything like it). I also don't see anything that looks like a new prompt in the output, but i think given how long we waited I would have expected a trailing localhost:~# ? I'm fuzzy on how bytes get to serial.log and maybe this is more explainable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    testingRelated to testing and/or the PHD test framework.

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions