@simonrosenberg simonrosenberg commented Jan 8, 2026

Summary

This PR fixes two critical bugs causing runtime instability as reported in #1633:

1. Broken Idle Time Detection (PRIMARY FIX)

Problem: The runtime-api was killing runtimes that were actively serving requests because it incorrectly calculated them as "idle". The idle time was only updated when events were processed in the conversation service, not when actual HTTP requests were being served.

Solution: Added ActivityTrackingMiddleware that updates the last activity timestamp on every HTTP request. This ensures that the /server_info endpoint accurately reflects actual HTTP activity, allowing runtime-api to correctly detect idle vs active runtimes.
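The idea can be illustrated with a minimal, dependency-free sketch. This is not the PR's actual code: it is written as a plain ASGI middleware so it runs without FastAPI installed, and the module-level timestamp, `get_last_activity()`, `idle_seconds()`, `demo_app`, and `simulate_request` are hypothetical helpers invented for the illustration. Only the names `ActivityTrackingMiddleware` and `update_last_execution_time()` come from the PR description.

```python
import asyncio
import time

# Hypothetical stand-in for the server's activity state. The PR calls
# update_last_execution_time(); its real implementation is not shown here.
_last_activity = time.monotonic()


def update_last_execution_time() -> None:
    """Record that the server did something just now."""
    global _last_activity
    _last_activity = time.monotonic()


def get_last_activity() -> float:
    """Return the timestamp of the most recent recorded activity."""
    return _last_activity


def idle_seconds() -> float:
    """Seconds since the last recorded activity (what an idle check would read)."""
    return time.monotonic() - _last_activity


class ActivityTrackingMiddleware:
    """Plain ASGI middleware that records activity on every HTTP request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Only HTTP requests count as activity in this sketch.
        if scope["type"] == "http":
            update_last_execution_time()
        await self.app(scope, receive, send)


async def demo_app(scope, receive, send):
    """Trivial ASGI app that always answers 200 'ok'."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def simulate_request(app):
    """Drive one fake HTTP request through the app and collect what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return sent
```

With this shape, any request that reaches the server resets the idle clock, so a poller that reads the idle time sees an active runtime as active even when no conversation events are being processed.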

2. Websocket Disconnect Handling (SECONDARY FIX)

Problem: When websocket connections encountered RuntimeError or ConnectionError, these exceptions were re-raised, potentially causing server instability.

Solution: Modified websocket handlers to gracefully handle these errors by logging a warning and returning normally instead of re-raising. Cleanup (unsubscription) still happens in the finally block.
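A rough sketch of the described control flow, under stated assumptions: `FakeWebSocket`, `FakeEventService`, and `handle_socket` are hypothetical stand-ins written for this illustration, not the classes or handlers in sockets.py. The point is only that the handler logs a warning and returns normally, while the `finally` block still unsubscribes.

```python
import asyncio
import logging

logger = logging.getLogger("sockets_sketch")


class FakeWebSocket:
    """Hypothetical websocket that fails on its second receive."""

    def __init__(self):
        self.calls = 0

    async def receive_json(self):
        self.calls += 1
        if self.calls > 1:
            raise ConnectionError("peer went away")
        return {"kind": "ping"}


class FakeEventService:
    """Hypothetical subscription registry."""

    def __init__(self):
        self.subscribers = set()

    def subscribe(self, sub_id):
        self.subscribers.add(sub_id)

    def unsubscribe(self, sub_id):
        self.subscribers.discard(sub_id)


async def handle_socket(websocket, events, conversation_id):
    """Read messages until the connection fails, then clean up and return."""
    events.subscribe(conversation_id)
    handled = 0
    try:
        while True:
            await websocket.receive_json()
            handled += 1
    except (RuntimeError, ConnectionError):
        # Log and fall through instead of re-raising.
        logger.warning(
            "WebSocket connection error for %s, closing connection gracefully",
            conversation_id,
        )
    finally:
        # Cleanup runs whether the loop ended by error or otherwise.
        events.unsubscribe(conversation_id)
    return handled
```

One trade-off of this approach is that a `RuntimeError` unrelated to the connection is also swallowed, which is the concern raised in review.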

Changes

  • middleware.py: Added ActivityTrackingMiddleware class that calls update_last_execution_time() on every HTTP request
  • api.py: Added the ActivityTrackingMiddleware to the FastAPI application
  • sockets.py: Modified exception handling to gracefully handle RuntimeError and ConnectionError instead of re-raising them
  • test_middleware.py: Added tests for the activity tracking middleware
  • test_event_router_websocket.py: Updated test to reflect new graceful error handling behavior

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the github CI passing?

Fixes #1633



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bb0eaac-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-bb0eaac-python \
  ghcr.io/openhands/agent-server:bb0eaac-python

All tags pushed for this build

ghcr.io/openhands/agent-server:bb0eaac-golang-amd64
ghcr.io/openhands/agent-server:bb0eaac-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:bb0eaac-golang-arm64
ghcr.io/openhands/agent-server:bb0eaac-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:bb0eaac-java-amd64
ghcr.io/openhands/agent-server:bb0eaac-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:bb0eaac-java-arm64
ghcr.io/openhands/agent-server:bb0eaac-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:bb0eaac-python-amd64
ghcr.io/openhands/agent-server:bb0eaac-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:bb0eaac-python-arm64
ghcr.io/openhands/agent-server:bb0eaac-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:bb0eaac-golang
ghcr.io/openhands/agent-server:bb0eaac-java
ghcr.io/openhands/agent-server:bb0eaac-python

About Multi-Architecture Support

  • Each variant tag (e.g., bb0eaac-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., bb0eaac-python-amd64) are also available if needed

This commit addresses two critical bugs causing runtime instability:

1. **Broken Idle Time Detection (PRIMARY FIX)**
   - Added ActivityTrackingMiddleware that updates last activity timestamp
     on every HTTP request
   - Previously, idle time was only updated when events were processed,
     causing runtime-api to incorrectly kill active runtimes
   - Now the /server_info endpoint accurately reflects actual HTTP activity

2. **Websocket Disconnect Handling (SECONDARY FIX)**
   - Modified websocket handlers to gracefully handle RuntimeError and
     ConnectionError instead of re-raising them
   - This prevents server shutdown when websocket connections encounter
     errors
   - Cleanup (unsubscription) still happens in the finally block

Fixes #1633

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions bot commented Jan 8, 2026

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| openhands-agent-server/openhands/agent_server | | | | |
| api.py | 147 | 49 | 66% | 57, 63, 67–69, 71, 83, 85, 108, 147–149, 151–155, 168, 201, 208–209, 213–216, 218, 230, 238, 243–244, 246–247, 250–253, 255, 261, 265, 271, 279–280, 288–289, 297, 301–302, 304, 310 |
| middleware.py | 23 | 7 | 69% | 27–29, 32–33, 36–37 |
| sockets.py | 101 | 51 | 49% | 50–51, 57–59, 68–70, 76–78, 84–85, 88–89, 93, 107–108, 110–111, 113–115, 118, 120–123, 125–126, 128–132, 135–138, 141–142, 145, 148, 155–156, 170–174, 184 |
| TOTAL | 14794 | 7094 | 52% | |

raise
logger.warning(
f"WebSocket connection error for {conversation_id}, "
"closing connection gracefully"

Are we sure about not raising on RuntimeError? 🤔

Tiny nit: we could refactor these as except (exceptions) as e: like the WebSocketDisconnect code above, rather than isinstance
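For illustration, the two styles the nit contrasts might look like the sketch below. The PR's actual handler code is not shown in this thread, so both functions are hypothetical reconstructions; only the exception types `RuntimeError` and `ConnectionError` come from the PR description.

```python
def handle_with_isinstance(exc: BaseException) -> str:
    """Catch broadly, then filter by type with isinstance."""
    try:
        raise exc
    except Exception as e:
        if isinstance(e, (RuntimeError, ConnectionError)):
            return "handled"
        raise  # anything else still propagates


def handle_with_tuple(exc: BaseException) -> str:
    """Let the except clause do the filtering via a tuple of types."""
    try:
        raise exc
    except (RuntimeError, ConnectionError):
        return "handled"
```

Both forms handle the same exceptions, including subclasses such as `ConnectionResetError`, and both let unrelated exceptions propagate. The tuple form is shorter and avoids the explicit re-raise.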


Closed this PR and split it into two smaller PRs:
#1636
#1635

Development

Successfully merging this pull request may close these issues.

Bug: Bash command polling stops after 2 attempts, causing agent loop to hang
