@jundi69 (Collaborator) commented May 5, 2025

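This PR makes the following changes:

  1. Clip gradients after the all-reduce, so the clip applies to the averaged gradient and improves training stability
  2. Zero the outer optimizer's gradients after each outer step, so gradients do not accumulate across outer steps
  3. Remove the GradScaler, since we train in bfloat16, which has the same exponent range as float32 and does not need loss scaling
  4. Reduce the effective batch size back to 512, as in the DiLoCo paper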
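
For reviewers, here is a minimal, framework-free sketch of the ordering that items 1 and 2 describe. It is not the code in this PR. Plain Python lists stand in for tensors, `all_reduce_mean` stands in for the real averaging all-reduce, and plain SGD stands in for whatever outer optimizer the training script uses. `OuterState`, `outer_step`, and the `lr` and `max_norm` defaults are hypothetical names and values chosen for illustration.

```python
import math


def all_reduce_mean(worker_grads):
    """Stand-in for an averaging all-reduce: element-wise mean across workers."""
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]


def clip_by_global_norm(grad, max_norm):
    """Scale grad so its L2 norm is at most max_norm.

    Returns the clipped gradient and the norm measured before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad], norm


class OuterState:
    """Hypothetical container for outer parameters and their gradient buffer."""

    def __init__(self, params):
        self.params = list(params)
        self.grad = [0.0] * len(self.params)


def outer_step(state, worker_grads, lr=0.5, max_norm=1.0):
    # Average the per-worker gradients first (the all-reduce).
    reduced = all_reduce_mean(worker_grads)

    # Add into the buffer; anything left over from a previous
    # step would leak into this update if the buffer were not zeroed.
    state.grad = [g + r for g, r in zip(state.grad, reduced)]

    # Clip AFTER the reduce, so the bound applies to the averaged gradient.
    clipped, norm = clip_by_global_norm(state.grad, max_norm)

    # Plain SGD update (an assumption; the real outer optimizer may differ).
    state.params = [p - lr * g for p, g in zip(state.params, clipped)]

    # Zero the outer gradients so the next outer step starts clean.
    state.grad = [0.0] * len(state.grad)
    return norm


state = OuterState([1.0, 2.0])
pre_clip_norm = outer_step(state, [[3.0, 0.0], [3.0, 8.0]])
```

In this toy run the two workers average to `[3.0, 4.0]`, whose norm of 5 is clipped down to 1 before the update. Clipping each worker's gradient before averaging would in general give a different result, because the bound would then apply per worker rather than to the averaged gradient.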

2 participants