
Conversation

@Co1lin (Contributor) commented Jan 7, 2026

Summary

This PR fixes several scalability-related issues. It is motivated by my large-scale agentic data synthesis jobs.

  • In RemoteWorkspace, the httpx client makes all network requests, including LLM queries. The current 60-second read timeout is not enough in the following settings:
    • Queries with long context and reasoning: I used vertex_ai/gemini-3-pro-preview and vertex_ai/gemini-2.5-pro in my jobs, and it is quite common for them to take more than 60 seconds to respond when the context is long and the reasoning effort is high.
    • Querying self-hosted LLMs: models served with vLLM and SGLang can often have latencies above 60 seconds, depending on GPU speed and query batch size (large batches are typical in data synthesis settings).
  • Consider the example code below, a toy version of a large-scale agentic data synthesis job in which I want the agent to do read-only exploration of a repo. To reduce overhead, it is natural to use a single workspace for all tasks instead of launching an independent workspace for each sample. With this setup I encountered several issues and therefore propose the following fixes:
    • Remove the default limit of 100 on max connections. In my tasks it is quite common to run > 100 workers in parallel querying LLMs to speed up large-scale data synthesis.
    • When one DockerWorkspace serves many conversations, there is one tmux session per conversation, which can trigger "too many open files" errors (though there may be other ways to hit this limit). I therefore propose raising the file-descriptor limit when starting the container.
    • When many conversations reuse one container, the tmux sessions are currently not closed when their conversations close, so sessions accumulate and the whole system becomes slow and eventually gets stuck. (When I entered the container, I could not even run tmux ls or attach to any session.) To fix this, the client can send a DELETE request to trigger the server-side delete_conversation, which I think eventually triggers TmuxTerminal.close after a long call chain; a minimal sketch of this call follows the error log below. I verified on my ~10k-scale job that this effectively prevents tmux session accumulation over time (I monitored the number of sessions in the container) and keeps the system smooth.
The "too many open files" error I got:

```
[DOCKER] {"asctime": "2025-12-11 12:11:49,946", "levelname": "ERROR", "name": "openhands.agent_server.api", "filename": "api.py", "lineno": 223, "message": "Unhandled exception on POST /api/conversations", "exc_info": "Traceback (most recent call last):
  File "starlette/middleware/errors.py", line 164, in __call__
  File "starlette/middleware/cors.py", line 85, in __call__
  File "starlette/middleware/exceptions.py", line 63, in __call__
  File "starlette/_exception_handler.py", line 53, in wrapped_app
  File "starlette/_exception_handler.py", line 42, in wrapped_app
  File "fastapi/middleware/asyncexitstack.py", line 18, in __call__
  File "starlette/routing.py", line 716, in __call__
  File "starlette/routing.py", line 736, in app
  File "starlette/routing.py", line 290, in handle
  File "fastapi/routing.py", line 123, in app
  File "starlette/_exception_handler.py", line 53, in wrapped_app
  File "starlette/_exception_handler.py", line 42, in wrapped_app
  File "fastapi/routing.py", line 109, in app
  File "fastapi/routing.py", line 389, in app
  File "fastapi/routing.py", line 288, in run_endpoint_function
  File "openhands/agent_server/conversation_router.py", line 137, in start_conversation
  File "openhands/agent_server/conversation_service.py", line 196, in start_conversation
  File "openhands/agent_server/conversation_service.py", line 418, in _start_event_service
  File "openhands/agent_server/event_service.py", line 355, in start
  File "openhands/sdk/conversation/impl/local_conversation.py", line 164, in __init__
  File "openhands/sdk/agent/agent.py", line 100, in init_state
  File "openhands/sdk/agent/base.py", line 189, in init_state
  File "openhands/sdk/agent/base.py", line 200, in _initialize
  File "openhands/sdk/tool/registry.py", line 156, in resolve_tool
  File "openhands/sdk/tool/registry.py", line 103, in _resolve
  File "openhands/tools/terminal/definition.py", line 272, in create
  File "openhands/tools/terminal/impl.py", line 57, in __init__
  File "openhands/tools/terminal/terminal/terminal_session.py", line 83, in initialize
  File "openhands/tools/terminal/terminal/tmux_terminal.py", line 57, in initialize
  File "libtmux/server.py", line 574, in new_session
libtmux.exc.LibTmuxException: ['create window failed: fork failed: Too many open files']"}
```
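A minimal sketch of the client-side cleanup call described in the last bullet, assuming direct access to the workspace's httpx client; the helper name is hypothetical, but the DELETE endpoint matches the one this PR sends from RemoteConversation:

```python
import httpx


def delete_remote_conversation(client: httpx.Client, conversation_id: str) -> None:
    """Ask the agent server to delete a conversation, releasing resources
    such as its tmux session (via server-side delete_conversation)."""
    response = client.request("DELETE", f"/api/conversations/{conversation_id}")
    response.raise_for_status()
```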

The toy example below shows the concept of a large-scale data synthesis job:

```python
import asyncio
from pathlib import Path

import httpx
from tqdm.asyncio import tqdm as async_tqdm

from openhands.sdk import (
    LLM,
    Conversation,
    RemoteConversation,
)
from openhands.tools.preset.default import get_default_agent
from openhands.workspace import DockerWorkspace


llm = LLM(
    model='openai/gpt-5.1-codex-mini',
)

agent = get_default_agent(llm=llm, cli_mode=True)


async def work(workspace: DockerWorkspace, code_path: str, semaphore: asyncio.Semaphore) -> RemoteConversation:
    async with semaphore:
        loop = asyncio.get_running_loop()

        conversation = Conversation(agent=agent, workspace=workspace)

        def run():
            conversation.send_message(
                f"Read the code in {code_path} and summarize it. You should respond with a report in markdown format in your last message. Your report should only focus on the code in {code_path} without information about other files. Your report should be informative, detailed, and easy to understand. Do NOT change or write any files into the environment."
            )
            conversation.run()

        # conversation.run() is blocking, so push it onto a thread pool
        await loop.run_in_executor(None, run)
        return conversation


if __name__ == "__main__":
    host_dir = Path(__file__).parent / "z_code_test"
    src_dir = host_dir / "src"
    src_dir.mkdir(parents=True, exist_ok=True)
    (src_dir / "code_0.py").write_text("print('Hello, world!')\n")
    (src_dir / "code_1.py").write_text("print('Goodbye, world!')\n")
    (src_dir / "code_2.py").write_text("print('Another code file')\n")
    # there could be more code files for analysis...

    num_samples = 100_000
    num_workers = 128
    semaphore = asyncio.Semaphore(num_workers)

    with DockerWorkspace(
        server_image="ghcr.io/openhands/agent-server:latest-python",
        host_port=8011,
        mount_dir=str(host_dir.resolve()),
        working_dir='/workspace/src',
    ) as workspace:
        # workaround before this PR: replace the default client with one that
        # has a longer read timeout and no connection cap
        workspace._client = httpx.Client(
            base_url=workspace.host,
            headers=workspace._headers,
            timeout=httpx.Timeout(connect=10.0, read=600.0, write=10.0, pool=10.0),
            limits=httpx.Limits(max_connections=None),
        )

        code_paths = [f"code_{i % 3}.py" for i in range(num_samples)]

        rets = asyncio.run(
            async_tqdm.gather(
                *[work(workspace, code_path, semaphore) for code_path in code_paths],
                desc="working",
            )
        )
```

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

Copilot AI review requested due to automatic review settings January 7, 2026 04:21
Copilot AI (Contributor) left a comment


Pull request overview

This PR addresses scalability issues encountered during large-scale agentic data synthesis by increasing resource limits and improving cleanup mechanisms for remote workspaces.

  • Increases HTTP read timeout from 60s to 600s (10 minutes) to accommodate long-running LLM queries
  • Removes the default 100 connection limit in httpx client to support high parallelism (>100 workers)
  • Adds DELETE request on conversation close to clean up server-side tmux sessions and prevent resource accumulation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| openhands-workspace/openhands/workspace/docker/workspace.py | Increases the file descriptor limit (nofile) to 65536 via the Docker ulimit flag to prevent "too many open files" errors |
| openhands-sdk/openhands/sdk/workspace/remote/base.py | Increases the httpx read timeout to 600s and removes the max_connections limit to support large-scale parallel operations |
| openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py | Adds a DELETE request to trigger server-side conversation cleanup and prevent tmux session accumulation |
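
For illustration, the nofile change corresponds to passing Docker's --ulimit flag when the container is started; a hedged sketch of the resulting argv (the image tag is taken from the toy example above, and the list shape is assumed, not the actual DockerWorkspace internals):

```python
# Sketch only: docker run supports `--ulimit nofile=<soft>:<hard>`.
docker_run_args = [
    "docker", "run", "--rm",
    "--ulimit", "nofile=65536:65536",  # raise the container's fd limit
    "ghcr.io/openhands/agent-server:latest-python",
]
```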


@enyst (Collaborator) left a comment


Thank you for this PR! I find your use case amazing; I honestly didn't realize it could work so well at that scale. I saw it working with, idk, a dozen or so conversations in the same workspace - tiny play. 😅

I'd love to know what @xingyaoww thinks about the proposal here.

@enyst enyst requested a review from xingyaoww January 7, 2026 18:27
@xingyaoww (Collaborator) left a comment


Thanks for your contributions - these are pretty legit stability fixes!

Comment on lines -523 to -551

```python
def close(self) -> None:
    """Close the conversation and clean up all tool executors."""
    # Use getattr for safety - object may be partially constructed
    if getattr(self, "_cleanup_initiated", False):
        return
    self._cleanup_initiated = True
    logger.debug("Closing conversation and cleaning up tool executors")
    hook_processor = getattr(self, "_hook_processor", None)
    if hook_processor is not None:
        hook_processor.run_session_end()
    try:
        self._end_observability_span()
    except AttributeError:
        # Object may be partially constructed; span fields may be missing.
        pass
    try:
        tools_map = self.agent.tools_map
    except (AttributeError, RuntimeError):
        # Agent not initialized or partially constructed
        return
    for tool in tools_map.values():
        if self.stop_agent_on_close:
            try:
                executable_tool = tool.as_executable()
                executable_tool.executor.close()
            except NotImplementedError:
                # Tool has no executor, skip it without erroring
                continue
            except Exception as e:
                logger.warning(f"Error closing executor for tool '{tool.name}': {e}")
```
@Co1lin (Contributor, Author) replied:

Hi @xingyaoww , I checked the current implementation in LocalConversation, and I think it already stops the agent when the conversation is closed.
Therefore, in the last commit of this PR, I added a new attribute stop_agent_on_close to both LocalConversation and RemoteConversation to decide whether to keep the agent running after the conversation is closed.

Collaborator:

I don't actually quite understand the need to have stop_agent_on_close, since it only closes the tools 🤔

And putting it inside the close() function might not be the best idea; when you call agent.close() you kinda expect the agent to be closed.

@Co1lin (Contributor, Author) commented Jan 9, 2026

Thanks to both of you for following up on this! :) Let me know what you think of the current version.

@Co1lin Co1lin requested review from enyst and xingyaoww January 9, 2026 20:29
@xingyaoww xingyaoww added the review-this This label triggers a PR review by OpenHands label Jan 10, 2026
@all-hands-bot (Collaborator) left a comment


This PR addresses important scalability issues, but introduces several behavioral changes that need attention. The most critical issue is that the fixes are opt-in (via stop_agent_on_close=False by default), which means the stated problems will still occur unless users explicitly enable the fixes.

```python
base_url=self.host,
timeout=timeout,
headers=self._headers,
limits=httpx.Limits(max_connections=None),
```
Collaborator:

Critical: Unbounded connection limit is dangerous

Setting max_connections=None removes all connection pooling limits, which could lead to resource exhaustion. While the PR description mentions needing >100 concurrent connections, unlimited connections can cause:

  • Memory exhaustion
  • File descriptor exhaustion (even with ulimit fixes)
  • Performance degradation

Recommend setting a large but bounded limit instead:

Suggested change:

```diff
- limits=httpx.Limits(max_connections=None),
+ limits=httpx.Limits(max_connections=1000, max_keepalive_connections=500),
```

This allows scaling to 1000 concurrent connections (10x the default) while still providing protection against runaway resource usage.

Collaborator:

How about we keep the limits to be default?

```python
        type[ConversationVisualizerBase] | ConversationVisualizerBase | None
    ) = DefaultConversationVisualizer,
    secrets: dict[str, SecretValue] | dict[str, str] | None = None,
    stop_agent_on_close: bool = False,
```
Collaborator:

Issue: Confusing parameter name

The name stop_agent_on_close is misleading - it sounds like it stops the agent, but it actually controls whether server-side resources (tmux sessions, executors) are cleaned up. This will confuse users.

Consider a more descriptive name:

  • cleanup_server_resources
  • delete_conversation_on_close
  • release_resources_on_close

Also, defaulting to False means the scalability issues described in the PR (tmux session accumulation, resource leaks) will still occur by default; users must explicitly opt in to the fixes. Should this default to True to actually fix the problems?

Comment on lines +541 to +557

```python
if self.stop_agent_on_close:
    try:
        executable_tool = tool.as_executable()
        executable_tool.executor.close()
    except NotImplementedError:
        # Tool has no executor, skip it without erroring
        continue
    except Exception as e:
        logger.warning(f"Error closing executor for tool '{tool.name}': {e}")
tools_map = self.agent.tools_map
except (AttributeError, RuntimeError):
    # Agent not initialized or partially constructed
    return
for tool in tools_map.values():
    try:
        executable_tool = tool.as_executable()
        executable_tool.executor.close()
    except NotImplementedError:
        # Tool has no executor, skip it without erroring
        continue
    except Exception as e:
        logger.warning(
            f"Error closing executor for tool '{tool.name}': {e}"
        )
```
Collaborator:

Breaking Change: Tool executors no longer cleaned up by default

Previously, tool executors were always closed when conversations ended. Now they are only closed if stop_agent_on_close=True (which defaults to False).

This is a breaking behavioral change that will:

  1. Break existing tests (tests/tools/terminal/test_conversation_cleanup.py expects executors to be cleaned up)
  2. Lead to resource leaks (tmux sessions, open files, etc.) unless users know to set the flag
  3. Contradict the PR description which claims to fix resource accumulation

Why was this behavior changed? If there was a specific reason to make cleanup opt-in, it should be documented. Otherwise, executors should always be cleaned up to prevent resource leaks.

Comment on lines +998 to +1004

```python
if self.stop_agent_on_close:
    try:
        # trigger server-side delete_conversation to release resources
        # like tmux sessions
        _send_request(self._client, "DELETE", f"/api/conversations/{self.id}")
    except Exception:
        pass
```
Collaborator:

Breaking Change: Server-side cleanup now opt-in

An earlier commit (236ed33) always called DELETE to clean up tmux sessions. This was later changed to be opt-in via stop_agent_on_close=False.

This means:

  1. By default, tmux sessions will still accumulate (the problem is NOT fixed)
  2. The example code in the PR description does not set stop_agent_on_close=True, so it would still have the accumulation problem
  3. Users must know to explicitly enable this flag

Additionally, the bare except Exception: pass on line 1003 silently swallows all errors, which could hide important issues like network failures or permission errors. Consider logging the exception:

Suggested change:

```diff
 if self.stop_agent_on_close:
     try:
         # trigger server-side delete_conversation to release resources
         # like tmux sessions
         _send_request(self._client, "DELETE", f"/api/conversations/{self.id}")
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Failed to delete conversation on server: {e}")
```

```python
self.platform,
"--rm",
"--ulimit",
"nofile=65536:65536",  # prevent "too many open files" errors
```
Collaborator:

Consider ApptainerWorkspace

This ulimit fix is only applied to DockerWorkspace. Does ApptainerWorkspace need the same fix? If users run large-scale jobs with Apptainer, they could still hit "too many open files" errors.

Check if apptainer run has similar ulimit options that should be set.
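
For context, a hedged sketch: as far as I know, apptainer run has no direct --ulimit analogue; since Apptainer containers run as ordinary host processes and inherit the caller's resource limits, one assumed workaround is to raise the soft nofile limit in the launching process before starting the container:

```python
# Assumption: the Apptainer container inherits this process's rlimits
# (it runs without a separate resource namespace, unlike Docker).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```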

@all-hands-bot (Collaborator) commented:

[Automatic Post]: It has been a while since there was any activity on this PR. @Co1lin, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.


```python
_client: httpx.Client
_hook_processor: HookEventProcessor | None
_cleanup_initiated: bool
stop_agent_on_close: bool = False
```
Collaborator:

Shall we consider renaming it to something like delete_on_close?
