fix: several issues related to scalability #1619
base: main
Conversation
Pull request overview
This PR addresses scalability issues encountered during large-scale agentic data synthesis by increasing resource limits and improving cleanup mechanisms for remote workspaces.
- Increases HTTP read timeout from 60s to 600s (10 minutes) to accommodate long-running LLM queries
- Removes the default 100 connection limit in httpx client to support high parallelism (>100 workers)
- Adds DELETE request on conversation close to clean up server-side tmux sessions and prevent resource accumulation
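For context, the first two items correspond to an httpx client configured roughly as follows. This is only a sketch of the settings described above, not the SDK's actual code; the base URL and the non-read timeouts are placeholders:

```python
import httpx

# Sketch of the client settings this PR describes (everything other than the
# read timeout and the connection limit is a placeholder).
client = httpx.Client(
    base_url="http://localhost:8000",           # hypothetical agent-server URL
    timeout=httpx.Timeout(30.0, read=600.0),    # long read timeout for slow LLM responses
    limits=httpx.Limits(max_connections=None),  # no pool cap, to allow >100 parallel workers
)
```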
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| openhands-workspace/openhands/workspace/docker/workspace.py | Increases file descriptor limit (nofile) to 65536 via Docker ulimit flag to prevent "too many open files" errors |
| openhands-sdk/openhands/sdk/workspace/remote/base.py | Increases httpx read timeout to 600s and removes max_connections limit to support large-scale parallel operations |
| openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py | Adds DELETE request to trigger server-side conversation cleanup and prevent tmux session accumulation |
enyst left a comment:
Thank you for this PR! I find your use case amazing; I honestly didn't realize it could work so well at that scale. I've only seen it working with, idk, a dozen or so conversations in the same workspace - tiny play. 😅
I'd love to know what @xingyaoww thinks about the proposal here.
xingyaoww left a comment:
Thanks for your contributions - these are pretty legit stability fixes!
Force-pushed from d638fc8 to dd3b07c.
```python
def close(self) -> None:
    """Close the conversation and clean up all tool executors."""
    # Use getattr for safety - object may be partially constructed
    if getattr(self, "_cleanup_initiated", False):
        return
    self._cleanup_initiated = True
    logger.debug("Closing conversation and cleaning up tool executors")
    hook_processor = getattr(self, "_hook_processor", None)
    if hook_processor is not None:
        hook_processor.run_session_end()
    try:
        self._end_observability_span()
    except AttributeError:
        # Object may be partially constructed; span fields may be missing.
        pass
    try:
        tools_map = self.agent.tools_map
    except (AttributeError, RuntimeError):
        # Agent not initialized or partially constructed
        return
    for tool in tools_map.values():
        if self.stop_agent_on_close:
            try:
                executable_tool = tool.as_executable()
                executable_tool.executor.close()
            except NotImplementedError:
                # Tool has no executor, skip it without erroring
                continue
            except Exception as e:
                logger.warning(f"Error closing executor for tool '{tool.name}': {e}")
```
Hi @xingyaoww, I checked the current implementation in LocalConversation, and I think it already stops the agent from running when the conversation is closed.
Therefore, in the last commit of this PR, I added a new attribute `stop_agent_on_close` to both LocalConversation and RemoteConversation to decide whether to keep the agent running after the conversation is closed.
I don't actually quite understand the need for `stop_agent_on_close`, since it only closes the tools 🤔
And putting it inside the `close()` function might not be the best idea: when you call `agent.close()`, you kinda expect the agent to be closed.
Thanks to both of you for following up on this! :) Let me know what you think of the current version.
all-hands-bot left a comment:
This PR addresses important scalability issues, but introduces several behavioral changes that need attention. The most critical issue is that the fixes are opt-in (via `stop_agent_on_close`, which defaults to `False`), which means the stated problems will still occur unless users explicitly enable the fixes.
```python
base_url=self.host,
timeout=timeout,
headers=self._headers,
limits=httpx.Limits(max_connections=None),
```
Critical: Unbounded connection limit is dangerous
Setting `max_connections=None` removes all connection pooling limits, which could lead to resource exhaustion. While the PR description mentions needing >100 concurrent connections, unlimited connections can cause:
- Memory exhaustion
- File descriptor exhaustion (even with ulimit fixes)
- Performance degradation
Recommend setting a large but bounded limit instead:
```diff
-limits=httpx.Limits(max_connections=None),
+limits=httpx.Limits(max_connections=1000, max_keepalive_connections=500),
```
This allows scaling to 1000 concurrent connections (10x the default) while still providing protection against runaway resource usage.
How about we keep the limits at the default?
```python
        type[ConversationVisualizerBase] | ConversationVisualizerBase | None
    ) = DefaultConversationVisualizer,
    secrets: dict[str, SecretValue] | dict[str, str] | None = None,
    stop_agent_on_close: bool = False,
```
Issue: Confusing parameter name
The name `stop_agent_on_close` is misleading - it sounds like it stops the agent, but it actually controls whether server-side resources (tmux sessions, executors) are cleaned up. This will confuse users.
Consider a more descriptive name:
- `cleanup_server_resources`
- `delete_conversation_on_close`
- `release_resources_on_close`
Also, defaulting to `False` means the scalability issues described in the PR (tmux session accumulation, resource leaks) will still occur by default. Users must explicitly opt in to the fixes. Should this default to `True` to actually fix the problems?
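For illustration, with the current default a caller has to opt in explicitly to get the cleanup behavior the PR describes. A hypothetical call site (the constructor and the other arguments are assumptions; only `stop_agent_on_close` comes from this PR):

```python
# Hypothetical call site: cleanup only happens if the flag is set explicitly.
conversation = Conversation(
    agent=agent,               # placeholder agent
    workspace=workspace,       # placeholder remote workspace
    stop_agent_on_close=True,  # opt in so close() releases server-side resources
)
```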
```diff
         tools_map = self.agent.tools_map
     except (AttributeError, RuntimeError):
         # Agent not initialized or partially constructed
         return
     for tool in tools_map.values():
-        try:
-            executable_tool = tool.as_executable()
-            executable_tool.executor.close()
-        except NotImplementedError:
-            # Tool has no executor, skip it without erroring
-            continue
-        except Exception as e:
-            logger.warning(
-                f"Error closing executor for tool '{tool.name}': {e}"
-            )
+        if self.stop_agent_on_close:
+            try:
+                executable_tool = tool.as_executable()
+                executable_tool.executor.close()
+            except NotImplementedError:
+                # Tool has no executor, skip it without erroring
+                continue
+            except Exception as e:
+                logger.warning(f"Error closing executor for tool '{tool.name}': {e}")
```
Breaking Change: Tool executors no longer cleaned up by default
Previously, tool executors were always closed when conversations ended. Now they are only closed if `stop_agent_on_close=True` (and the flag defaults to `False`).
This is a breaking behavioral change that will:
- Break existing tests (`tests/tools/terminal/test_conversation_cleanup.py` expects executors to be cleaned up)
- Lead to resource leaks (tmux sessions, open files, etc.) unless users know to set the flag
- Contradict the PR description, which claims to fix resource accumulation
Why was this behavior changed? If there was a specific reason to make cleanup opt-in, it should be documented. Otherwise, executors should always be cleaned up to prevent resource leaks.
```python
if self.stop_agent_on_close:
    try:
        # trigger server-side delete_conversation to release resources
        # like tmux sessions
        _send_request(self._client, "DELETE", f"/api/conversations/{self.id}")
    except Exception:
        pass
```
Breaking Change: Server-side cleanup now opt-in
An earlier commit (236ed33) always called DELETE to clean up tmux sessions. This was later changed to be opt-in via `stop_agent_on_close` (default `False`).
This means:
- By default, tmux sessions will still accumulate (the problem is NOT fixed)
- The example code in the PR description does not set `stop_agent_on_close=True`, so it would still have the accumulation problem
- Users must know to explicitly enable this flag
Additionally, the bare `except Exception: pass` on line 1003 silently swallows all errors, which could hide important issues like network failures or permission errors. Consider logging the exception:
```diff
 if self.stop_agent_on_close:
     try:
         # trigger server-side delete_conversation to release resources
         # like tmux sessions
         _send_request(self._client, "DELETE", f"/api/conversations/{self.id}")
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Failed to delete conversation on server: {e}")
```
```python
self.platform,
"--rm",
"--ulimit",
"nofile=65536:65536",  # prevent "too many open files" errors
```
Consider ApptainerWorkspace
This ulimit fix is only applied to `DockerWorkspace`. Does `ApptainerWorkspace` need the same fix? If users run large-scale jobs with Apptainer, they could still hit "too many open files" errors.
Check whether `apptainer run` has similar ulimit options that should be set.
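If `apptainer run` turns out not to offer an equivalent flag, one possible fallback (a sketch under that assumption, not a verified Apptainer feature) is to raise the launching process's own soft limit before spawning many sessions:

```python
import resource

# Sketch: raise this process's soft "open files" limit toward 65536,
# without exceeding the hard limit (raising the hard limit needs privileges).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
desired = 65536
if hard != resource.RLIM_INFINITY:
    desired = min(desired, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
```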
[Automatic Post]: It has been a while since there was any activity on this PR. @Co1lin, are you still working on it? If so, please go ahead; if not, please request review, close it, or request that someone else follow up.
```python
_client: httpx.Client
_hook_processor: HookEventProcessor | None
_cleanup_initiated: bool
stop_agent_on_close: bool = False
```
Shall we consider renaming it to something like `delete_on_close`?
Summary
This PR tries to fix several issues related to scalability. It is motivated by my large-scale agentic data synthesis jobs.
- I use `vertex_ai/gemini-3-pro-preview` and `vertex_ai/gemini-2.5-pro` in my jobs, and it is quite common for them to take more than 60 seconds to respond when the context is long and the reasoning effort is high.
- Without cleanup, tmux sessions accumulate on the server side, to the point where I could not even run `tmux ls` or attach to any session inside the container. To solve this issue, we can send the DELETE request from the client to trigger the server-side `delete_conversation`, which I think will eventually trigger `TmuxTerminal.close` after a long chain. I verified on my ~10k-scale job that this effectively prevents tmux session accumulation over time (I monitored the number of sessions in the container) and keeps the system smooth; a minimal sketch of the client-side call is shown below.
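The sketch: a standalone version of that client-side cleanup call. The endpoint path comes from this PR's diff, while the ad-hoc `httpx` client and the function name are illustrative (the SDK routes this through its internal `_send_request` helper instead):

```python
import httpx

def delete_remote_conversation(host: str, conversation_id: str) -> None:
    """Ask the agent server to drop a finished conversation (and its tmux session)."""
    with httpx.Client(base_url=host, timeout=30.0) as client:
        response = client.delete(f"/api/conversations/{conversation_id}")
        response.raise_for_status()
```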
The "too many open files" error I got:

```
[DOCKER] {"asctime": "2025-12-11 12:11:49,946", "levelname": "ERROR", "name": "openhands.agent_server.api", "filename": "api.py", "lineno": 223, "message": "Unhandled exception on POST /api/conversations", "exc_info": "Traceback (most recent call last):\n File \"starlette/middleware/errors.py\", line 164, in __call__\n File \"starlette/middleware/cors.py\", line 85, in __call__\n File \"starlette/middleware/exceptions.py\", line 63, in __call__\n File \"starlette/_exception_handler.py\", line 53, in wrapped_app\n File \"starlette/_exception_handler.py\", line 42, in wrapped_app\n File \"fastapi/middleware/asyncexitstack.py\", line 18, in __call__\n File \"starlette/routing.py\", line 716, in __call__\n File \"starlette/routing.py\", line 736, in app\n File \"starlette/routing.py\", line 290, in handle\n File \"fastapi/routing.py\", line 123, in app\n File \"starlette/_exception_handler.py\", line 53, in wrapped_app\n File \"starlette/_exception_handler.py\", line 42, in wrapped_app\n File \"fastapi/routing.py\", line 109, in app\n File \"fastapi/routing.py\", line 389, in app\n File \"fastapi/routing.py\", line 288, in run_endpoint_function\n File \"openhands/agent_server/conversation_router.py\", line 137, in start_conversation\n File \"openhands/agent_server/conversation_service.py\", line 196, in start_conversation\n File \"openhands/agent_server/conversation_service.py\", line 418, in _start_event_service\n File \"openhands/agent_server/event_service.py\", line 355, in start\n File \"openhands/sdk/conversation/impl/local_conversation.py\", line 164, in __init__\n File \"openhands/sdk/agent/agent.py\", line 100, in init_state\n File \"openhands/sdk/agent/base.py\", line 189, in init_state\n File \"openhands/sdk/agent/base.py\", line 200, in _initialize\n File \"openhands/sdk/tool/registry.py\", line 156, in resolve_tool\n File \"openhands/sdk/tool/registry.py\", line 103, in _resolve\n File \"openhands/tools/terminal/definition.py\", line 272, in create\n File \"openhands/tools/terminal/impl.py\", line 57, in __init__\n File \"openhands/tools/terminal/terminal/terminal_session.py\", line 83, in initialize\n File \"openhands/tools/terminal/terminal/tmux_terminal.py\", line 57, in initialize\n File \"libtmux/server.py\", line 574, in new_session\nlibtmux.exc.LibTmuxException: ['create window failed: fork failed: Too many open files']"}
```

The toy example showing the concept of a large-scale data synthesis job:
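A minimal illustrative sketch of such a job; the `run_one_task` helper, worker count, and task count are placeholders, not the author's actual setup:

```python
# Illustrative sketch: many independent agent conversations driven with high parallelism.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_one_task(task_id: int) -> None:
    """Hypothetical helper: create a conversation for one task, drive the agent, then close it."""
    ...

with ThreadPoolExecutor(max_workers=256) as pool:  # >100 workers is what motivates the httpx changes
    futures = [pool.submit(run_one_task, i) for i in range(10_000)]
    for future in as_completed(futures):
        future.result()  # surface any per-task errors
```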
Checklist