Description
During evaluations, the runtime-api’s idle GC force-stops runtimes after ~20–24 minutes without runtime-api traffic. Once the controller stops the pod/service, any subsequent conversation/exec/health request returns 404 "Remote conversation not found", which aborts the run. The evaluation is still in progress when this happens; the only trigger is that the runtime-api has received no requests for ~20+ minutes.
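To make the suspected failure mode concrete, here is a minimal sketch of an idle check keyed only on the time of the last API request. It is not the runtime-api's actual code: `RuntimeRecord`, `find_idle_runtimes`, and the 20-minute `idle_timeout_s` are hypothetical names and values chosen to match the timing described above. It shows how a runtime that is busy with an evaluation but has sent no API traffic would be classified as idle.

```python
from dataclasses import dataclass


@dataclass
class RuntimeRecord:
    """Hypothetical bookkeeping entry; not the runtime-api's real data model."""
    runtime_id: str
    last_request_at: float  # time of the last runtime-api request, in seconds


def find_idle_runtimes(records, now, idle_timeout_s=20 * 60):
    """Return IDs of runtimes whose last API request is at least idle_timeout_s old.

    The check looks only at request timestamps, so work running inside the
    runtime does not count as activity.
    """
    return [
        r.runtime_id
        for r in records
        if now - r.last_request_at >= idle_timeout_s
    ]


now = 10_000.0  # fixed clock value so the example is deterministic
records = [
    # Still evaluating, but no API request for 22 minutes.
    RuntimeRecord("eval-in-progress", last_request_at=now - 22 * 60),
    # Received a request 30 seconds ago.
    RuntimeRecord("recently-polled", last_request_at=now - 30),
]

idle = find_idle_runtimes(records, now)
print(idle)  # ['eval-in-progress']
```

If the real GC works along these lines, the stop decision never consults evaluation state, which would be consistent with runs being cut off while still in progress.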
Impact
- Evaluations fail mid-run with 404s.
- No conversation artifacts are saved for the affected runs.
Evidence (from runtime-api logs)
- Repeated pattern: Stopping idle runtime (idle for N seconds) → Force-stopping runtime … → Runtime stopped successfully → pod/service removal messages (Calico/felix, log tailers closing).
- The 404s occur immediately after the forced stop.
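
For triage, the idle durations can be pulled out of the logs with a small script. The sketch below assumes the message text matches the "Stopping idle runtime (idle for N seconds)" wording quoted above; real log lines may carry timestamps, levels, or runtime IDs, which the regex ignores. The sample lines are made up for illustration and are not copied from the actual logs.

```python
import re

# Assumed message shape, based only on the wording quoted in this report.
IDLE_STOP_RE = re.compile(
    r"Stopping idle runtime \(idle for (\d+(?:\.\d+)?) seconds\)"
)


def idle_durations_minutes(lines):
    """Return the idle duration, in minutes, for each idle-stop log line."""
    durations = []
    for line in lines:
        match = IDLE_STOP_RE.search(line)
        if match:
            durations.append(float(match.group(1)) / 60)
    return durations


# Illustrative input only; the numbers are invented to fall in the
# ~20-24 minute range described above.
sample_lines = [
    "Stopping idle runtime (idle for 1260 seconds)",
    "Runtime stopped successfully",
    "Stopping idle runtime (idle for 1440 seconds)",
]

durations = idle_durations_minutes(sample_lines)
print(durations)  # [21.0, 24.0]
```

Running this over the real runtime-api logs and comparing the results against the timestamps of the 404s would help confirm that every failure follows an idle stop.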