Fix broken idle detection and websocket disconnect handling #1634
Summary
This PR fixes two critical bugs causing runtime instability as reported in #1633:
1. Broken Idle Time Detection (PRIMARY FIX)
Problem: The runtime-api was killing runtimes that were actively serving requests because it incorrectly calculated them as "idle". The idle time was only updated when events were processed in the conversation service, not when actual HTTP requests were being served.
Solution: Added ActivityTrackingMiddleware, which updates the last activity timestamp on every HTTP request. This ensures that the /server_info endpoint accurately reflects actual HTTP activity, allowing runtime-api to correctly detect idle vs active runtimes.
2. Websocket Disconnect Handling (SECONDARY FIX)
Problem: When websocket connections encountered RuntimeError or ConnectionError, these exceptions were re-raised, potentially causing server instability.
Solution: Modified websocket handlers to gracefully handle these errors by logging a warning and returning normally instead of re-raising. Cleanup (unsubscription) still happens in the finally block.
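The graceful-disconnect pattern described above might look roughly like the following. This is a minimal sketch, not the code in sockets.py: the handler name `serve_events`, the `subscribers` set, and the stub socket are all hypothetical, and the only behaviour taken from the PR is "log a warning, return normally, still unsubscribe in `finally`".

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def serve_events(websocket, subscribers: set) -> None:
    """Hypothetical handler: read messages until the connection fails."""
    subscribers.add(websocket)
    try:
        while True:
            await websocket.receive_text()
    except (RuntimeError, ConnectionError) as exc:
        # Previously re-raised; now logged, and the handler returns normally.
        logger.warning("Websocket closed unexpectedly: %s", exc)
    finally:
        # Cleanup (unsubscription) runs whether or not an error occurred.
        subscribers.discard(websocket)


class BrokenSocket:
    """Hypothetical stand-in for a websocket whose peer has gone away."""

    async def receive_text(self) -> str:
        raise ConnectionResetError("peer reset the connection")


if __name__ == "__main__":
    subs: set = set()
    asyncio.run(serve_events(BrokenSocket(), subs))
    print(len(subs))  # the socket was unsubscribed despite the error
```

Because `ConnectionResetError` is a subclass of `ConnectionError`, the same `except` clause covers it without a separate branch.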
Changes
• middleware.py: Added ActivityTrackingMiddleware class that calls update_last_execution_time() on every HTTP request
• api.py: Added the ActivityTrackingMiddleware to the FastAPI application
• sockets.py: Modified exception handling to gracefully handle RuntimeError and ConnectionError instead of re-raising them
• test_middleware.py: Added tests for the activity tracking middleware
• test_event_router_websocket.py: Updated test to reflect new graceful error handling behavior
Checklist
Fixes #1633
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bb0eaac-python
About Multi-Architecture Support
• Each variant tag (e.g. bb0eaac-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (e.g. bb0eaac-python-amd64) are also available if needed