
Conversation

@Co1lin (Contributor) commented Jan 7, 2026

Summary

This PR fixes several scalability-related issues. It is motivated by my large-scale agentic data synthesis jobs.

  • In RemoteWorkspace, the httpx client makes all network requests, including LLM queries. The current 60-second read timeout is not enough in the following settings:
    • Queries with long context and reasoning: I used vertex_ai/gemini-3-pro-preview and vertex_ai/gemini-2.5-pro in my jobs, and it is quite common for them to take more than 60 seconds to respond when the context is long and the reasoning effort is high.
    • Querying self-hosted LLMs: models served with vLLM and SGLang can often have latencies above 60 seconds, depending on GPU speed and query batch size (large batches are typical in data synthesis settings).
  • Consider the example code below, a toy version of a large-scale agentic data synthesis job in which I want the agent to do read-only exploration of a repo. To reduce overhead, it is natural to use a single workspace for all tasks instead of launching an independent workspace for each sample. With this setup I encountered several issues and therefore propose the following fixes:
    • Remove the default limit of 100 on max connections. In my tasks it is quite common to run > 100 workers in parallel querying LLMs to speed up large-scale data synthesis.
    • When one DockerWorkspace serves many conversations, there is one tmux session per conversation, which can trigger "too many open files" errors (though there may be other ways to hit this limit). I therefore propose raising the file-descriptor limit when starting the container.
    • When many conversations reuse one container, the tmux sessions are currently not closed when their conversations close, so sessions accumulate and the whole system becomes slow and eventually gets stuck. (When I entered the container, I could not even run tmux ls or attach to any session.) To fix this, the client can send a DELETE request to trigger the server-side delete_conversation, which I think eventually triggers TmuxTerminal.close after a long call chain; a minimal sketch of this call follows the error log below. I verified on my ~10k-scale job that this effectively prevents tmux session accumulation over time (I monitored the number of sessions in the container) and keeps the system smooth.
The "too many open files" error I got:

```
[DOCKER] {"asctime": "2025-12-11 12:11:49,946", "levelname": "ERROR", "name": "openhands.agent_server.api", "filename": "api.py", "lineno": 223, "message": "Unhandled exception on POST /api/conversations", "exc_info": "Traceback (most recent call last):
  File "starlette/middleware/errors.py", line 164, in __call__
  File "starlette/middleware/cors.py", line 85, in __call__
  File "starlette/middleware/exceptions.py", line 63, in __call__
  File "starlette/_exception_handler.py", line 53, in wrapped_app
  File "starlette/_exception_handler.py", line 42, in wrapped_app
  File "fastapi/middleware/asyncexitstack.py", line 18, in __call__
  File "starlette/routing.py", line 716, in __call__
  File "starlette/routing.py", line 736, in app
  File "starlette/routing.py", line 290, in handle
  File "fastapi/routing.py", line 123, in app
  File "starlette/_exception_handler.py", line 53, in wrapped_app
  File "starlette/_exception_handler.py", line 42, in wrapped_app
  File "fastapi/routing.py", line 109, in app
  File "fastapi/routing.py", line 389, in app
  File "fastapi/routing.py", line 288, in run_endpoint_function
  File "openhands/agent_server/conversation_router.py", line 137, in start_conversation
  File "openhands/agent_server/conversation_service.py", line 196, in start_conversation
  File "openhands/agent_server/conversation_service.py", line 418, in _start_event_service
  File "openhands/agent_server/event_service.py", line 355, in start
  File "openhands/sdk/conversation/impl/local_conversation.py", line 164, in __init__
  File "openhands/sdk/agent/agent.py", line 100, in init_state
  File "openhands/sdk/agent/base.py", line 189, in init_state
  File "openhands/sdk/agent/base.py", line 200, in _initialize
  File "openhands/sdk/tool/registry.py", line 156, in resolve_tool
  File "openhands/sdk/tool/registry.py", line 103, in _resolve
  File "openhands/tools/terminal/definition.py", line 272, in create
  File "openhands/tools/terminal/impl.py", line 57, in __init__
  File "openhands/tools/terminal/terminal/terminal_session.py", line 83, in initialize
  File "openhands/tools/terminal/terminal/tmux_terminal.py", line 57, in initialize
  File "libtmux/server.py", line 574, in new_session
libtmux.exc.LibTmuxException: ['create window failed: fork failed: Too many open files']"}
```
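A minimal sketch of the client-side cleanup call described in the last bullet, assuming direct access to the workspace's httpx client; the helper name is hypothetical, but the DELETE endpoint matches the one this PR sends from RemoteConversation:

```python
import httpx


def delete_remote_conversation(client: httpx.Client, conversation_id: str) -> None:
    """Ask the agent server to delete a conversation, releasing resources
    such as its tmux session (via server-side delete_conversation)."""
    response = client.request("DELETE", f"/api/conversations/{conversation_id}")
    response.raise_for_status()
```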

The toy example below shows the concept of a large-scale data synthesis job:

```python
import asyncio
from pathlib import Path

import httpx
from tqdm.asyncio import tqdm as async_tqdm

from openhands.sdk import (
    LLM,
    Conversation,
    RemoteConversation,
)
from openhands.tools.preset.default import get_default_agent
from openhands.workspace import DockerWorkspace


llm = LLM(
    model='openai/gpt-5.1-codex-mini',
)

agent = get_default_agent(llm=llm, cli_mode=True)


async def work(workspace: DockerWorkspace, code_path: str, semaphore: asyncio.Semaphore) -> RemoteConversation:
    async with semaphore:
        loop = asyncio.get_running_loop()

        conversation = Conversation(agent=agent, workspace=workspace)

        def run():
            conversation.send_message(
                f"Read the code in {code_path} and summarize it. You should respond with a report in markdown format in your last message. Your report should only focus on the code in {code_path} without information about other files. Your report should be informative, detailed, and easy to understand. Do NOT change or write any files into the environment."
            )
            conversation.run()

        # conversation.run() is blocking, so push it onto a thread pool
        await loop.run_in_executor(None, run)
        return conversation


if __name__ == "__main__":
    host_dir = Path(__file__).parent / "z_code_test"
    src_dir = host_dir / "src"
    src_dir.mkdir(parents=True, exist_ok=True)
    (src_dir / "code_0.py").write_text("print('Hello, world!')\n")
    (src_dir / "code_1.py").write_text("print('Goodbye, world!')\n")
    (src_dir / "code_2.py").write_text("print('Another code file')\n")
    # there could be more code files for analysis...

    num_samples = 100_000
    num_workers = 128
    semaphore = asyncio.Semaphore(num_workers)

    with DockerWorkspace(
        server_image="ghcr.io/openhands/agent-server:latest-python",
        host_port=8011,
        mount_dir=str(host_dir.resolve()),
        working_dir='/workspace/src',
    ) as workspace:
        # workaround before this PR: replace the default client with one that
        # has a longer read timeout and no connection cap
        workspace._client = httpx.Client(
            base_url=workspace.host,
            headers=workspace._headers,
            timeout=httpx.Timeout(connect=10.0, read=600.0, write=10.0, pool=10.0),
            limits=httpx.Limits(max_connections=None),
        )

        code_paths = [f"code_{i % 3}.py" for i in range(num_samples)]

        rets = asyncio.run(
            async_tqdm.gather(
                *[work(workspace, code_path, semaphore) for code_path in code_paths],
                desc="working",
            )
        )
```

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Copilot AI review requested due to automatic review settings January 7, 2026 04:21
Copilot AI (Contributor) left a comment


Pull request overview

This PR addresses scalability issues encountered during large-scale agentic data synthesis by increasing resource limits and improving cleanup mechanisms for remote workspaces.

  • Increases HTTP read timeout from 60s to 600s (10 minutes) to accommodate long-running LLM queries
  • Removes the default 100 connection limit in httpx client to support high parallelism (>100 workers)
  • Adds DELETE request on conversation close to clean up server-side tmux sessions and prevent resource accumulation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| openhands-workspace/openhands/workspace/docker/workspace.py | Increases the file descriptor limit (nofile) to 65536 via the Docker ulimit flag to prevent "too many open files" errors |
| openhands-sdk/openhands/sdk/workspace/remote/base.py | Increases the httpx read timeout to 600s and removes the max_connections limit to support large-scale parallel operations |
| openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py | Adds a DELETE request to trigger server-side conversation cleanup and prevent tmux session accumulation |
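
For illustration, the nofile change corresponds to passing Docker's --ulimit flag when the container is started; a hedged sketch of the resulting argv (the image tag is taken from the toy example above, and the list shape is assumed, not the actual DockerWorkspace internals):

```python
# Sketch only: docker run supports `--ulimit nofile=<soft>:<hard>`.
docker_run_args = [
    "docker", "run", "--rm",
    "--ulimit", "nofile=65536:65536",  # raise the container's fd limit
    "ghcr.io/openhands/agent-server:latest-python",
]
```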


@enyst (Collaborator) left a comment


Thank you for this PR! I find your use case amazing; I honestly didn't realize it could work so well at that scale. I saw it working with, idk, a dozen or so conversations in the same workspace - tiny play. 😅

I'd love to know what @xingyaoww thinks about the proposal here.

@enyst enyst requested a review from xingyaoww January 7, 2026 18:27
@xingyaoww (Collaborator) left a comment


Thanks for your contributions - these are pretty legit stability fixes!

Comment on lines -523 to -551

```python
def close(self) -> None:
    """Close the conversation and clean up all tool executors."""
    # Use getattr for safety - object may be partially constructed
    if getattr(self, "_cleanup_initiated", False):
        return
    self._cleanup_initiated = True
    logger.debug("Closing conversation and cleaning up tool executors")
    hook_processor = getattr(self, "_hook_processor", None)
    if hook_processor is not None:
        hook_processor.run_session_end()
    try:
        self._end_observability_span()
    except AttributeError:
        # Object may be partially constructed; span fields may be missing.
        pass
    try:
        tools_map = self.agent.tools_map
    except (AttributeError, RuntimeError):
        # Agent not initialized or partially constructed
        return
    for tool in tools_map.values():
        if self.stop_agent_on_close:
            try:
                executable_tool = tool.as_executable()
                executable_tool.executor.close()
            except NotImplementedError:
                # Tool has no executor, skip it without erroring
                continue
            except Exception as e:
                logger.warning(f"Error closing executor for tool '{tool.name}': {e}")
```
@Co1lin (Contributor, Author) replied:

Hi @xingyaoww , I checked the current implementation in LocalConversation, and I think it already stops the agent when the conversation is closed.
Therefore, in the last commit of this PR, I added a new attribute stop_agent_on_close to both LocalConversation and RemoteConversation to decide whether to keep the agent running after the conversation is closed.

Collaborator:

I don't actually quite understand the need to have stop_agent_on_close, since it only closes the tools 🤔

And putting it inside the close() function might not be the best idea; when you call agent.close() you kinda expect the agent to be closed.

@Co1lin (Contributor, Author) commented Jan 9, 2026

Thanks to both of you for following up on this! :) Let me know what you think of the current version.

@Co1lin Co1lin requested review from enyst and xingyaoww January 9, 2026 20:29
@xingyaoww xingyaoww added the review-this This label triggers a PR review by OpenHands label Jan 10, 2026
@all-hands-bot (Collaborator) left a comment


This PR addresses important scalability issues, but introduces several behavioral changes that need attention. The most critical issue is that the fixes are opt-in (via stop_agent_on_close=False by default), which means the stated problems will still occur unless users explicitly enable the fixes.

```python
base_url=self.host,
timeout=timeout,
headers=self._headers,
limits=httpx.Limits(max_connections=None),
```
Collaborator:

Critical: Unbounded connection limit is dangerous

Setting max_connections=None removes all connection pooling limits, which could lead to resource exhaustion. While the PR description mentions needing >100 concurrent connections, unlimited connections can cause:

  • Memory exhaustion
  • File descriptor exhaustion (even with ulimit fixes)
  • Performance degradation

Recommend setting a large but bounded limit instead:

Suggested change:

```diff
- limits=httpx.Limits(max_connections=None),
+ limits=httpx.Limits(max_connections=1000, max_keepalive_connections=500),
```

This allows scaling to 1000 concurrent connections (10x the default) while still providing protection against runaway resource usage.

Collaborator:

How about we keep the limits to be default?

```python
        type[ConversationVisualizerBase] | ConversationVisualizerBase | None
    ) = DefaultConversationVisualizer,
    secrets: dict[str, SecretValue] | dict[str, str] | None = None,
    stop_agent_on_close: bool = False,
```
Collaborator:

Issue: Confusing parameter name

The name stop_agent_on_close is misleading - it sounds like it stops the agent, but it actually controls whether server-side resources (tmux sessions, executors) are cleaned up. This will confuse users.

Consider a more descriptive name:

  • cleanup_server_resources
  • delete_conversation_on_close
  • release_resources_on_close

Also, defaulting to False means the scalability issues described in the PR (tmux session accumulation, resource leaks) will still occur by default; users must explicitly opt in to the fixes. Should this default to True to actually fix the problems?

Comment on lines +541 to +557

```python
if self.stop_agent_on_close:
    try:
        executable_tool = tool.as_executable()
        executable_tool.executor.close()
    except NotImplementedError:
        # Tool has no executor, skip it without erroring
        continue
    except Exception as e:
        logger.warning(f"Error closing executor for tool '{tool.name}': {e}")
tools_map = self.agent.tools_map
except (AttributeError, RuntimeError):
    # Agent not initialized or partially constructed
    return
for tool in tools_map.values():
    try:
        executable_tool = tool.as_executable()
        executable_tool.executor.close()
    except NotImplementedError:
        # Tool has no executor, skip it without erroring
        continue
    except Exception as e:
        logger.warning(
            f"Error closing executor for tool '{tool.name}': {e}"
        )
```
Collaborator:

Breaking Change: Tool executors no longer cleaned up by default

Previously, tool executors were always closed when conversations ended. Now they are only closed if stop_agent_on_close=True (which defaults to False).

This is a breaking behavioral change that will:

  1. Break existing tests (tests/tools/terminal/test_conversation_cleanup.py expects executors to be cleaned up)
  2. Lead to resource leaks (tmux sessions, open files, etc.) unless users know to set the flag
  3. Contradict the PR description which claims to fix resource accumulation

Why was this behavior changed? If there was a specific reason to make cleanup opt-in, it should be documented. Otherwise, executors should always be cleaned up to prevent resource leaks.

Comment on lines +998 to +1004

```python
if self.stop_agent_on_close:
    try:
        # trigger server-side delete_conversation to release resources
        # like tmux sessions
        _send_request(self._client, "DELETE", f"/api/conversations/{self.id}")
    except Exception:
        pass
```
Collaborator:

Breaking Change: Server-side cleanup now opt-in

An earlier commit (236ed33) always called DELETE to clean up tmux sessions. This was later changed to be opt-in via stop_agent_on_close=False.

This means:

  1. By default, tmux sessions will still accumulate (the problem is NOT fixed)
  2. The example code in the PR description does not set stop_agent_on_close=True, so it would still have the accumulation problem
  3. Users must know to explicitly enable this flag

Additionally, the bare except Exception: pass on line 1003 silently swallows all errors, which could hide important issues like network failures or permission errors. Consider logging the exception:

Suggested change:

```diff
 if self.stop_agent_on_close:
     try:
         # trigger server-side delete_conversation to release resources
         # like tmux sessions
         _send_request(self._client, "DELETE", f"/api/conversations/{self.id}")
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Failed to delete conversation on server: {e}")
```

```python
self.platform,
"--rm",
"--ulimit",
"nofile=65536:65536",  # prevent "too many open files" errors
```
Collaborator:

Consider ApptainerWorkspace

This ulimit fix is only applied to DockerWorkspace. Does ApptainerWorkspace need the same fix? If users run large-scale jobs with Apptainer, they could still hit "too many open files" errors.

Check if apptainer run has similar ulimit options that should be set.
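
For context, a hedged sketch: as far as I know, apptainer run has no direct --ulimit analogue; since Apptainer containers run as ordinary host processes and inherit the caller's resource limits, one assumed workaround is to raise the soft nofile limit in the launching process before starting the container:

```python
# Assumption: the Apptainer container inherits this process's rlimits
# (it runs without a separate resource namespace, unlike Docker).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```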

@all-hands-bot (Collaborator) commented:

[Automatic Post]: It has been a while since there was any activity on this PR. @Co1lin, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.


```python
_client: httpx.Client
_hook_processor: HookEventProcessor | None
_cleanup_initiated: bool
stop_agent_on_close: bool = False
```
Collaborator:

Shall we consider renaming it to something like delete_on_close?
