@yucc-leon
Summary

This PR includes several bug fixes and improvements to enhance training stability and usability.

Changes

  1. Fix: Pin verifiers version and update skyrl-train dependency
  • Pin verifiers==0.1.6.post0 to avoid SDK interface incompatibilities
  • Update skyrl-train to commit 7cd23ea to fix a crash when saving the final checkpoint
  2. Fix: Activate max_input_tokens to prevent context overflow
  • Ensure the existing max_input_tokens configuration is properly applied
  • Prevent context-overflow issues that could cause vLLM to crash and terminate training
  3. Fix: Tool call metrics counting using ActionEvent
  • Correct the tool call counting logic to use ActionEvent instead of ToolCallEvent
  • Improve metrics accuracy for tool usage tracking
  4. Fix: LiteLLM model routing with correct provider prefix
  • Ensure proper OpenAI-compatible routing with the openai/ prefix
  • Support both HuggingFace model IDs and local model paths
  5. Fix: HTTP endpoint connection when host is 0.0.0.0
  • Use 127.0.0.1 for client connections when the server binds to 0.0.0.0
  • Prevent connection errors in distributed training setups

Testing

All changes have been tested with:

  • Local model paths
  • HuggingFace model IDs
  • Distributed training configurations

Impact

  • Improved training stability
  • Reduced risk of runtime crashes
  • Clearer documentation for users

When http_endpoint_host is set to 0.0.0.0 for server binding,
clients cannot connect to 0.0.0.0. This fix automatically uses
127.0.0.1 for client connections when the host is 0.0.0.0 or ::.

This resolves connection failures in distributed training setups
where the server binds to all interfaces but clients need a
specific loopback address to connect.

Technical details:
- Server binding: 0.0.0.0 (listens on all interfaces)
- Client connection: 127.0.0.1 (connects to loopback)
- Also handles IPv6 :: binding
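The bind-to-client address mapping described above can be sketched as a small helper (the function name `client_host` is hypothetical, not the actual identifier used in the fix):

```python
def client_host(bind_host: str) -> str:
    """Map a server bind address to an address a client can dial.

    Servers commonly bind to the wildcard address (0.0.0.0, or :: for
    IPv6) to listen on all interfaces, but clients cannot connect to
    the wildcard address itself, so we substitute the loopback address.
    """
    if bind_host in ("0.0.0.0", "::"):
        return "127.0.0.1"
    return bind_host
```

A concrete (non-wildcard) bind address is passed through unchanged, so the substitution only kicks in for the two wildcard forms.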

Change from litellm_proxy/ to openai/ prefix to properly route
requests to SkyRL's OpenAI-compatible endpoint. The litellm_proxy/
prefix was causing 'LLM Provider NOT provided' errors.

LiteLLM requires a provider prefix to route requests correctly.
Using the openai/ prefix ensures:
1. Proper routing to OpenAI-compatible endpoints
2. Preservation of SkyRL's strict model name checking
3. Correct model identifier extraction after the first slash

Example:
  model_name = "/path/to/qwen3-4b"
  litellm_model_name = "openai//path/to/qwen3-4b"
  SkyRL receives: model="/path/to/qwen3-4b" (matches loaded model)

This fixes LLM initialization failures in training.
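The prefixing rule from the example above can be expressed as a one-line helper (the name `to_litellm_model_name` is illustrative, not the identifier used in the PR):

```python
def to_litellm_model_name(model_name: str) -> str:
    """Prefix a model name so LiteLLM routes it to an OpenAI-compatible endpoint.

    LiteLLM uses the text before the first slash as the provider and
    forwards everything after it as the model identifier, so a local
    path like /path/to/qwen3-4b survives the round trip intact.
    """
    return f"openai/{model_name}"
```

Note the double slash in `openai//path/to/qwen3-4b` is intentional: stripping the provider prefix at the first slash recovers the original absolute path.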

Change tool call counting from assistant messages to ActionEvent
messages, which represent actual tool invocations. The previous
method of counting the tool_calls field in assistant messages was
inaccurate and did not reflect actual tool executions.

Changes:
- Count ActionEvent messages instead of assistant messages
- Extract tool_name directly from ActionEvent.tool_name field
- Simplify logic by removing complex dict/object handling
- Update tests to use ActionEvent-based test data

This provides accurate tool usage statistics for training analysis
and metrics reporting.

Technical details:
- ActionEvent represents actual tool invocations in the system
- Assistant messages may contain tool_calls but don't guarantee execution
- This fix ensures metrics match actual tool usage behavior
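The counting logic described above reduces to a tally over ActionEvent entries. The sketch below uses a minimal stand-in dataclass for `ActionEvent` (only the `tool_name` field from the commit message); the real SDK class carries more fields:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionEvent:
    # Minimal stand-in for the SDK's ActionEvent; only the field
    # the metrics code reads is modeled here.
    tool_name: str


def count_tool_calls(events) -> Counter:
    """Tally tool invocations per tool, considering only ActionEvents.

    Assistant messages (or any other event type) are skipped, so the
    counts reflect tools that were actually invoked.
    """
    return Counter(e.tool_name for e in events if isinstance(e, ActionEvent))
```

Because non-ActionEvent entries are filtered out, an assistant message that merely *mentions* tool_calls no longer inflates the metrics.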

Add max_input_tokens configuration to LLM initialization to prevent
context-length overflow errors during training. Without this setting,
the system may crash when input exceeds the model's context window.

Changes:
- Read max_input_length from generator config (default: 38400)
- Calculate effective_max_input by reserving 2000 tokens for overhead
- Pass max_input_tokens to LLM initialization
- Let OpenHands handle context truncation automatically

This prevents training crashes due to context overflow and ensures
stable long-running training sessions.

Technical details:
- Reserves tokens for system prompt, tools, and response generation
- OpenHands will automatically truncate context when needed
- Default 38400 tokens accommodates most training scenarios
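The budget arithmetic from the commit message can be sketched as follows (the constant names and the `effective_max_input` helper are illustrative; the defaults 38400 and 2000 come from the commit message):

```python
DEFAULT_MAX_INPUT_LENGTH = 38400  # default read from the generator config
RESERVED_TOKENS = 2000            # headroom for system prompt, tools, response


def effective_max_input(max_input_length: int = DEFAULT_MAX_INPUT_LENGTH,
                        reserved: int = RESERVED_TOKENS) -> int:
    """Compute the max_input_tokens value passed to LLM initialization.

    Reserving a fixed overhead keeps the prompt plus generation within
    the model's context window; clamped at zero for tiny budgets.
    """
    return max(max_input_length - reserved, 0)
```

With the defaults this yields a 36400-token input budget, leaving the reserved 2000 tokens for overhead before the context window is exhausted.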

Pin verifiers to exact version 0.1.6.post0 for SDK/API compatibility
and reproducibility. Update skyrl-train to rev 7cd23ea for latest fixes.

Changes:
- Pin verifiers: 0.1.6.post0 (was >=0.1.6.post0)
- Update skyrl-train: rev 7cd23ea (was 69ca4d9)
- Add note in README_verifiers.md about version pinning

This ensures:
- Reproducible builds across environments
- SDK/API compatibility
- Latest skyrl-train fixes and improvements

Note: If you hit version mismatch issues, re-run 'uv sync' and avoid
installing verifiers via pip outside the uv environment.

- Update code_search_generator.py comment to clarify LiteLLM routing
  supports both HuggingFace model IDs and local paths
- Add model path format documentation to README_Training.md
- Simplify TESTBED_ROOT environment variable handling
yucc-leon commented Jan 15, 2026

Sorry. For the fourth fix, I initially thought the OpenAIExceptions were caused by the input model prefix, but it now seems that Aditya/Lintang had already identified the problem and submitted a PR: NovaSky-AI/SkyRL#796
