-
Notifications
You must be signed in to change notification settings - Fork 42
feat: callback support #805 #875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: callback support #805 #875
Conversation
…o#805) Add PyTorch Lightning-style callbacks with hooks for train start/end, batch end, validation end, checkpoint save, and exceptions. Includes @rank_zero_only decorator for distributed training. Callbacks are passed programmatically to recipe constructors and receive full training context (recipe object, MetricsSample, checkpoint_info, etc.). Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
|
|
||
| # Special case for VLM finetune which uses finetune.py instead of train_ft.py | ||
| if args.domain == "vlm" and args.command == "finetune": | ||
| command = "finetune" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, I also fix a bug here; previusly the aliases would make the vlm finetune map to non-existing file
|
Hello! Quick check-in on this PR. I noticed I missed adding callback integration to the If this aligns with your plans, I’m happy to iterate or expand it. If not, no worries. I know this may not be a high priority. Just want to understand the direction so I can adjust accordingly. Thanks! |
Implements PyTorch Lightning-style callback system for Automodel recipes to enable custom integrations (e.g., Customizer metrics reporting, custom monitoring).
Key Features
on_train_start,on_train_batch_end,on_validation_end,on_save_checkpoint,on_exception,on_train_end@rank_zero_onlydecorator for distributed training convenienceMetricsSampleobjects, checkpoint info, etc.Changes
Callbackbase class andCallbackRunnerinnemo_automodel/components/callbacks/examples/llm_finetune/finetune_with_callback.pydocs/guides/callbacks.mdUsage
Validation
Unit tests: All 9 tests pass ✅
python -m unittest tests.unit_tests.recipes.test_callbacks -v Ran 9 tests in 0.006s OKExample run:
Verified callbacks execute at correct lifecycle points with custom log prefixes:
Closes #805
Note: Happy to implement this callback feature! I designed it to be familiar (PyTorch Lightning-style) and practical for real integrations. Free free to provide feedback on the design/API if you'd like any adjustments. (AI assisted with documentation generation, but I personally reviewed and refined everything to ensure quality and accuracy.)