Fix: Training stability and configuration improvements #62
Open
yucc-leon wants to merge 8 commits into OpenHands:main from yucc-leon:leonli/training-bug-fixes
Conversation
When `http_endpoint_host` is set to `0.0.0.0` for server binding, clients cannot connect to `0.0.0.0`. This fix automatically uses `127.0.0.1` for client connections when the host is `0.0.0.0` or `::`. This resolves connection failures in distributed training setups where the server binds to all interfaces but clients need a concrete loopback address to connect.

Technical details:
- Server binding: `0.0.0.0` (listens on all interfaces)
- Client connection: `127.0.0.1` (connects to loopback)
- Also handles IPv6 `::` binding
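A minimal sketch of the rewrite described above, assuming the fix lives in a small helper (the function name `resolve_client_host` and the URL construction are illustrative, not the actual code):

```python
def resolve_client_host(bind_host: str) -> str:
    """Map a wildcard bind address to one a client can actually dial."""
    # 0.0.0.0 (all IPv4 interfaces) and :: (all IPv6 interfaces) are valid
    # listen addresses but not routable destinations, so clients fall back
    # to loopback; any concrete host passes through unchanged.
    if bind_host in ("0.0.0.0", "::"):
        return "127.0.0.1"
    return bind_host

# e.g. building the client-side URL while the server binds to 0.0.0.0:
client_url = f"http://{resolve_client_host('0.0.0.0')}:8000"  # http://127.0.0.1:8000
```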
Change the model prefix from `litellm_proxy/` to `openai/` to properly route requests to SkyRL's OpenAI-compatible endpoint. The `litellm_proxy/` prefix was causing 'LLM Provider NOT provided' errors; LiteLLM requires a provider prefix to route requests correctly. Using the `openai/` prefix ensures:

1. Proper routing to OpenAI-compatible endpoints
2. Preservation of SkyRL's strict model name checking
3. Correct model identifier extraction after the first slash

Example:
- `model_name = "/path/to/qwen3-4b"`
- `litellm_model_name = "openai//path/to/qwen3-4b"`
- SkyRL receives `model="/path/to/qwen3-4b"` (matches the loaded model)

This fixes LLM initialization failures in training.
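A hedged sketch of the prefix arithmetic (variable names are illustrative): LiteLLM treats everything before the first `/` as the provider, so an `openai/` prefix is split off during routing and the original model path survives intact.

```python
model_name = "/path/to/qwen3-4b"             # model path as loaded by SkyRL

# Prepend the provider prefix LiteLLM needs for routing.
litellm_model_name = f"openai/{model_name}"  # "openai//path/to/qwen3-4b"

# LiteLLM-style extraction: everything after the first slash is the model
# identifier forwarded to the OpenAI-compatible endpoint.
received_model = litellm_model_name.split("/", 1)[1]
assert received_model == model_name          # SkyRL's strict check passes
```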
Change tool-call counting from assistant messages to `ActionEvent` messages, which represent actual tool invocations. The previous method of counting the `tool_calls` field in assistant messages was inaccurate and didn't reflect actual tool executions.

Changes:
- Count `ActionEvent` messages instead of assistant messages
- Extract `tool_name` directly from the `ActionEvent.tool_name` field
- Simplify the logic by removing complex dict/object handling
- Update tests to use `ActionEvent`-based test data

This provides accurate tool-usage statistics for training analysis and metrics reporting.

Technical details:
- An `ActionEvent` represents an actual tool invocation in the system
- Assistant messages may contain `tool_calls` but don't guarantee execution
- This fix ensures metrics match actual tool-usage behavior
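A minimal sketch of the new counting logic, assuming events expose a `tool_name` attribute as described above; the class-name check stands in for whatever the SDK's real `ActionEvent` import is:

```python
from collections import Counter

def count_tool_calls(events):
    """Tally tool usage from ActionEvent messages (actual invocations),
    not from the tool_calls field of assistant messages."""
    counts = Counter()
    for event in events:
        # Stand-in for an isinstance() check against the SDK's ActionEvent.
        if type(event).__name__ == "ActionEvent":
            counts[event.tool_name] += 1
    return counts
```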
Add `max_input_tokens` configuration to LLM initialization to prevent context-length overflow errors during training. Without this setting, the system may crash when the input exceeds the model's context window.

Changes:
- Read `max_input_length` from the generator config (default: 38400)
- Calculate `effective_max_input` by reserving 2000 tokens for overhead
- Pass `max_input_tokens` to LLM initialization
- Let OpenHands handle context truncation automatically

This prevents training crashes due to context overflow and ensures stable long-running training sessions.

Technical details:
- Reserves tokens for the system prompt, tools, and response generation
- OpenHands automatically truncates the context when needed
- The default of 38400 tokens accommodates most training scenarios
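A sketch of the token-budget arithmetic under the stated defaults; the config key and helper name are assumptions for illustration:

```python
OVERHEAD_TOKENS = 2000  # reserved for system prompt, tools, and response

def effective_max_input(generator_config: dict) -> int:
    """Compute the budget passed to the LLM as max_input_tokens."""
    # 38400 mirrors the documented default for max_input_length.
    max_input_length = generator_config.get("max_input_length", 38400)
    return max_input_length - OVERHEAD_TOKENS

print(effective_max_input({}))  # -> 36400 with the defaults above
```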
Pin `verifiers` to the exact version 0.1.6.post0 for SDK/API compatibility and reproducibility, and update `skyrl-train` to rev 7cd23ea for the latest fixes.

Changes:
- Pin `verifiers`: 0.1.6.post0 (was >=0.1.6.post0)
- Update `skyrl-train`: rev 7cd23ea (was 69ca4d9)
- Add a note in README_verifiers.md about version pinning

This ensures:
- Reproducible builds across environments
- SDK/API compatibility
- The latest skyrl-train fixes and improvements

Note: if you hit version-mismatch issues, re-run `uv sync` and avoid installing verifiers via pip outside the uv environment.
- Update a comment in code_search_generator.py to clarify that LiteLLM routing supports both HuggingFace model IDs and local paths
- Add model-path format documentation to README_Training.md
- Simplify TESTBED_ROOT environment variable handling
Author:
Sorry. For the fourth fix, I initially thought the OpenAIExceptions were caused by the input model prefix, but it now seems that Aditya/Lintang had already identified the problem and submitted a PR: NovaSky-AI/SkyRL#796
Summary
This PR includes several bug fixes and improvements to enhance training stability and usability.
Changes
- Fix client connections when the server binds to `0.0.0.0`
- Route requests through the `openai/` prefix instead of `litellm_proxy/`
- Count tool calls from `ActionEvent` messages
- Add `max_input_tokens` configuration to prevent context overflow
- Pin `verifiers` version and update `skyrl-train` dependency
- Update documentation and comments

Testing
All changes have been tested with:
Impact