# Transformer Implementation

This repository holds the code for the training and architectural design of an auto-regressive decoder module.

**Attention trick:** A lower-triangular mask is applied to the attention scores, so each token can only attend to itself and the tokens generated before it.
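
Below is a minimal sketch of how such a lower-triangular (causal) mask can be applied to attention scores; the function name `causal_attention` and the tensor shapes are illustrative and not taken from this repository's actual code.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask.

    q, k, v: tensors of shape (batch, seq_len, d_k) -- illustrative shapes,
    not necessarily those used in this repository.
    """
    d_k = q.size(-1)
    seq_len = q.size(-2)

    # Raw attention scores: (batch, seq_len, seq_len)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)

    # Lower-triangular mask: position i may attend to positions 0..i only.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))

    # Positions above the diagonal get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(~mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

Setting the masked positions to negative infinity before the softmax is what enforces auto-regressive generation: future tokens contribute zero attention weight, so predictions at step `i` depend only on tokens `0..i`.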