Fix: Configure gradient accumulation and chunking collator for stable training #301
Summary
This PR updates the training configuration to enable gradient accumulation. This change stabilizes the training loss and prevents the `ValueError: matrix contains invalid numeric entries` error, which often occurs due to unstable gradients when training with small batch sizes.
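For context, gradient accumulation sums the gradients of several small micro-batches before each optimizer step, so each update reflects a larger effective batch and is less noisy. A minimal PyTorch-style sketch of the pattern (generic placeholders, not the actual sam3 trainer):

```python
def train_with_accumulation(model, loader, optimizer, loss_fn,
                            gradient_accumulation_steps=16):
    """Accumulate gradients over micro-batches before each optimizer step."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so the accumulated gradient equals the mean
        # over the effective (larger) batch rather than the sum.
        (loss / gradient_accumulation_steps).backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```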
Changes

- Set `gradient_accumulation_steps` to 16.
- Set `train_batch_size` to 16.
- Switched the collator to `sam3.train.data.collator.collate_fn_api_with_chunking`. This is critical because the trainer expects a list of micro-batches (chunks) when gradient accumulation is enabled, whereas the original `collate_fn_api` returned a single batch dict, causing an `AssertionError` (see the sketch after this list).
- Added a `num_chunks` parameter linked to `gradient_accumulation_steps`.
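To illustrate why the chunking collator matters, here is a hypothetical sketch of both behaviors. The function names mirror the ones above, but the bodies are illustrative and assume a simple `{"image", "target"}` sample layout rather than the actual sam3 data structures:

```python
import torch

def collate_fn_api(samples):
    # Original behavior: one batch dict for the whole batch.
    return {
        "images": torch.stack([s["image"] for s in samples]),
        "targets": [s["target"] for s in samples],
    }

def collate_fn_api_with_chunking(samples, num_chunks=16):
    # Chunking behavior: a list of micro-batch dicts that the trainer
    # can iterate over, one per gradient accumulation step.
    chunk_size = max(1, len(samples) // num_chunks)
    return [
        collate_fn_api(samples[i:i + chunk_size])
        for i in range(0, len(samples), chunk_size)
    ]
```

With `train_batch_size` at 16 and `num_chunks` tied to `gradient_accumulation_steps` (also 16), each chunk holds a single sample, and the trainer performs one optimizer step per full batch.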
Verified

Tested locally. The training process is now stable, and the `ValueError` regarding invalid numeric entries in the cost matrix is resolved.