Open

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Description
Whether through torch.compile or Triton, the forward/backward operations currently materialize too many activations, which is probably bottlenecking training.

For some reason I got about a 30% speedup at the 1B scale, but it does not seem to do better at larger scales. Either way, attaching a good fused linear, fused layernorm, and fused multi-head layernorm would be very helpful.

The reason I would prefer this over torch.compile is that torch.compile with max-autotune takes an eternity. :P
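
To make "fused layernorm" concrete, below is a minimal sketch of a single-pass LayerNorm forward in Triton, following the standard one-program-per-row tutorial pattern. The mean and variance stay in registers instead of being written out as separate activations, which is exactly the saving this issue is after. The kernel name, wrapper, and block-size choice are illustrative assumptions, not existing code in this repo.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _layernorm_fwd(X, Y, W, B, stride, N, eps, BLOCK_N: tl.constexpr):
    # One program normalizes one row of the (M, N) input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    # Mean/variance are computed in registers; no intermediate tensors
    # are materialized in HBM between these steps.
    mean = tl.sum(x, axis=0) / N
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    b = tl.load(B + cols, mask=mask, other=0.0).to(tl.float32)
    y = (x - mean) * rstd * w + b
    tl.store(Y + row * stride + cols, y.to(Y.dtype.element_ty), mask=mask)


def fused_layernorm(x, weight, bias, eps=1e-5):
    # Hypothetical wrapper: x is (M, N), contiguous along the last dim.
    M, N = x.shape
    y = torch.empty_like(x)
    BLOCK_N = triton.next_power_of_2(N)
    _layernorm_fwd[(M,)](x, y, weight, bias, x.stride(0), N, eps, BLOCK_N=BLOCK_N)
    return y
```

For comparison, the torch.compile route being weighed against here is just `torch.compile(model, mode="max-autotune")`; that mode is what triggers the long autotuning sweep complained about above.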
Some references:

- trident has a lot of these implemented: https://github.com/kakaobrain/trident
- unsloth has many backward passes implemented (a sketch in that style follows below): https://github.com/unslothai/unsloth/blob/main/unsloth/kernels/geglu.py
- Triton code extraction: https://youtu.be/LuhJEEJQgUM?t=2234
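
Since the unsloth link is specifically about handwritten backward passes, here is a hedged sketch of what a fused GEGLU backward can look like in that style, using the tanh GELU approximation. With h = gelu(e) * g, the product rule gives dg = dh * gelu(e) and de = dh * g * gelu'(e); gelu(e) is recomputed inside the kernel rather than saved, which is the activation saving being asked for. All names here (`_geglu_bwd`, `geglu_backward`) are illustrative, not unsloth's actual API.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _geglu_bwd(DH, E, G, DE, DG, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    dh = tl.load(DH + offs, mask=mask, other=0.0).to(tl.float32)
    e = tl.load(E + offs, mask=mask, other=0.0).to(tl.float32)
    g = tl.load(G + offs, mask=mask, other=0.0).to(tl.float32)
    # Recompute gelu(e) (tanh approximation) instead of loading a saved
    # activation: a few extra FLOPs in exchange for less memory traffic.
    u = 0.7978845608028654 * (e + 0.044715 * e * e * e)
    t = tl.math.tanh(u)
    gelu_e = 0.5 * e * (1.0 + t)
    # d(gelu)/de under the same tanh approximation.
    du_de = 0.7978845608028654 * (1.0 + 3.0 * 0.044715 * e * e)
    dgelu = 0.5 * (1.0 + t) + 0.5 * e * (1.0 - t * t) * du_de
    tl.store(DG + offs, (dh * gelu_e).to(DG.dtype.element_ty), mask=mask)
    tl.store(DE + offs, (dh * g * dgelu).to(DE.dtype.element_ty), mask=mask)


def geglu_backward(dh, e, g):
    # Hypothetical wrapper: h = gelu(e) * g; returns (de, dg) in one launch.
    de, dg = torch.empty_like(e), torch.empty_like(g)
    n = e.numel()
    BLOCK = 1024
    _geglu_bwd[(triton.cdiv(n, BLOCK),)](dh, e, g, de, dg, n, BLOCK=BLOCK)
    return de, dg
```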