GPTs trained on the Shakespeare dataset. Includes a small 10.8M-parameter GPT following Andrej Karpathy's video lecture, and a Universal Transformer with Adaptive Computation Time (ACT). Features:
- Multi-head Attention
- KV Caching
- SwiGLU (Swish Gated Linear Unit) FeedForward
- RoPE: Rotary Positional Embeddings
- einops
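As a rough sketch of the SwiGLU feed-forward block listed above (the dimensions and weight shapes here are illustrative assumptions, not the repo's exact configuration), the core idea is an element-wise gate computed with the SiLU/Swish activation:

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w1, w3, w2):
    # SwiGLU feed-forward: (SiLU(x @ w1) * (x @ w3)) @ w2
    # w1 is the gate projection, w3 the value projection, w2 projects back down.
    return (silu(x @ w1) * (x @ w3)) @ w2

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
dim, hidden = 64, 256
x = rng.standard_normal((8, dim))
w1 = rng.standard_normal((dim, hidden)) * 0.02
w3 = rng.standard_normal((dim, hidden)) * 0.02
w2 = rng.standard_normal((hidden, dim)) * 0.02
y = swiglu_ffn(x, w1, w3, w2)
print(y.shape)  # (8, 64)
```

The multiplicative gate is what distinguishes SwiGLU from a plain two-layer MLP; it adds a third weight matrix but tends to improve quality at the same parameter budget.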
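A minimal sketch of RoPE for a single attention head, assuming the common half-split pairing of dimensions (this is an illustration, not the repo's implementation):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, head_dim) with even head_dim.
    # Each (x1, x2) dimension pair is rotated by a position-dependent angle,
    # so attention scores end up depending on relative position.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(1).standard_normal((5, 8))
y = rope(x)
print(y.shape)  # (5, 8)
```

Because RoPE is a pure rotation, it preserves vector norms, and the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n.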