This project provides an implementation of the classic Sequence-to-Sequence (Seq2Seq) architecture, built with recurrent neural networks (RNN/LSTM) in PyTorch, to address the problem of Neural Machine Translation (NMT).
The model is trained to translate from German (Source) to English (Target), utilizing the Multi30k dataset.
The implementation is based on the paper "Sequence to Sequence Learning with Neural Networks" by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le: https://arxiv.org/abs/1409.3215
The project implements the standard Encoder-Decoder approach (a minimal code sketch follows the list below):
- Encoder: A multi-layer LSTM that reads the input sequence (the German sentence, reversed to improve RNN performance) and produces a context vector (its final hidden and cell states).
- Decoder: A multi-layer LSTM that takes the context vector and generates the target sequence (the English sentence) one token at a time.
- Teacher Forcing: Used during training to accelerate convergence and stabilize gradients.
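To make this flow concrete, here is a minimal, hedged sketch of such an Encoder/Decoder pair with teacher forcing. The class names, dimensions, and the teacher_forcing_ratio default are illustrative assumptions; the actual definitions live in model.py and config.py.

```python
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: [src_len, batch]; the source sentence is assumed already reversed
        embedded = self.dropout(self.embedding(src))
        _, (hidden, cell) = self.rnn(embedded)
        return hidden, cell  # the context vector

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token, hidden, cell):
        # token: [batch] -> one decoding step
        embedded = self.dropout(self.embedding(token.unsqueeze(0)))
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(0))
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder, self.decoder, self.device = encoder, decoder, device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        trg_len, batch_size = trg.shape
        outputs = torch.zeros(trg_len, batch_size, self.decoder.output_dim, device=self.device)
        hidden, cell = self.encoder(src)
        token = trg[0]  # <sos> tokens
        for t in range(1, trg_len):
            prediction, hidden, cell = self.decoder(token, hidden, cell)
            outputs[t] = prediction
            # Teacher forcing: sometimes feed the ground-truth token instead of the prediction
            teacher_force = random.random() < teacher_forcing_ratio
            token = trg[t] if teacher_force else prediction.argmax(1)
        return outputs
```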
The project is organized into a modular structure for ease of testing and scalability.
nmt_project/
├── config.py # Core hyperparameters and constants (dimensions, dropout, learning rate, special tokens).
├── data.py # Functions for loading the Multi30k dataset, tokenization (WordPunctTokenizer), vocabulary building, and PyTorch DataLoader creation.
├── model.py # Neural network module classes: Encoder, Decoder, Seq2Seq.
├── utils.py # Helper functions (setting the random seed, weight initialization).
├── train.py # Main script for launching model training and evaluation. Saves the best model.
├── inference.py # Script for testing the trained model on arbitrary sentences.
├── best-model.pt # Saved weights of the best model (created after the first training run).
└── notebooks/
└── seq2seq_for_nmt.ipynb # The original code in Jupyter Notebook format for experimentation and visualization.
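As a rough illustration of the preprocessing described for data.py, the snippet below tokenizes with NLTK's WordPunctTokenizer and builds a vocabulary with torchtext. The function names, the min_freq value, and the exact special-token spellings are assumptions; the project's real implementation may differ.

```python
from nltk.tokenize import WordPunctTokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = WordPunctTokenizer()

def tokenize(text):
    # Lowercase and split on word/punctuation boundaries
    return tokenizer.tokenize(text.lower())

def build_vocab(sentences, min_freq=2):
    # Special tokens assumed here; the real names live in config.py
    specials = ["<unk>", "<pad>", "<sos>", "<eos>"]
    vocab = build_vocab_from_iterator(
        (tokenize(s) for s in sentences), min_freq=min_freq, specials=specials
    )
    vocab.set_default_index(vocab["<unk>"])
    return vocab
```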
Install the necessary libraries using pip:
```bash
pip install -r requirements.txt
```

(Note: a GPU (CUDA) is recommended for fast training, but the code will automatically fall back to the CPU if no GPU is available.)
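The CPU fallback mentioned in the note usually comes down to a single device check like the following (the exact line in the project may differ):

```python
import torch

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```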
To start the training process, execute the main script:
```bash
python train.py
```

The script will perform the following steps (a condensed sketch of the loop follows this list):
- Load and prepare the data.
- Initialize the model with the parameters defined in config.py.
- Run the training loop, evaluating the model on the validation set after each epoch.
- Save the best-performing model weights to best-model.pt.
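Below is a condensed, hedged sketch of such a train/evaluate/save loop. The helper name run_training, the loss setup, and defaults like n_epochs and clip are assumptions; the real hyperparameters come from config.py and the real loop lives in train.py.

```python
import torch
import torch.nn as nn

def run_training(model, train_loader, valid_loader, optimizer, pad_idx,
                 n_epochs=10, clip=1.0, save_path="best-model.pt"):
    # Ignore padding positions when computing the loss
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    best_valid_loss = float("inf")

    for epoch in range(n_epochs):
        model.train()
        for src, trg in train_loader:
            optimizer.zero_grad()
            output = model(src, trg)                       # [trg_len, batch, vocab]
            output = output[1:].reshape(-1, output.shape[-1])
            loss = criterion(output, trg[1:].reshape(-1))  # skip the <sos> position
            loss.backward()
            # Clip gradients to stabilize RNN training
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

        model.eval()
        valid_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for src, trg in valid_loader:
                output = model(src, trg, teacher_forcing_ratio=0.0)
                output = output[1:].reshape(-1, output.shape[-1])
                valid_loss += criterion(output, trg[1:].reshape(-1)).item()
                n_batches += 1
        valid_loss /= max(n_batches, 1)

        # Keep only the weights of the best model seen so far
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), save_path)
```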
Once the best-model.pt file has been created, you can use the inference.py script to translate test sentences:
```bash
python inference.py
```

This script loads the saved model and performs a translation of a predefined test sentence (which can be modified within the if __name__ == "__main__": block).
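Conceptually, translating a single sentence is a greedy decoding loop along these lines. The translate helper, the vocabulary API (torchtext-style), and the token handling are illustrative assumptions; the actual logic lives in inference.py.

```python
import torch

def translate(model, sentence, src_vocab, trg_vocab, tokenize, device, max_len=50):
    model.eval()
    # Tokenize, reverse the source (as in the paper), and add <sos>/<eos>
    tokens = ["<sos>"] + tokenize(sentence)[::-1] + ["<eos>"]
    src_ids = torch.tensor([src_vocab[t] for t in tokens], device=device).unsqueeze(1)

    with torch.no_grad():
        hidden, cell = model.encoder(src_ids)

    trg_ids = [trg_vocab["<sos>"]]
    for _ in range(max_len):
        prev = torch.tensor([trg_ids[-1]], device=device)
        with torch.no_grad():
            prediction, hidden, cell = model.decoder(prev, hidden, cell)
        next_id = prediction.argmax(1).item()
        trg_ids.append(next_id)
        if next_id == trg_vocab["<eos>"]:
            break

    itos = trg_vocab.get_itos()
    words = [itos[i] for i in trg_ids[1:]]   # drop the leading <sos>
    if words and words[-1] == "<eos>":
        words = words[:-1]                   # drop the trailing <eos>, if generated
    return words
```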
The notebooks/seq2seq_for_nmt.ipynb file contains the step-by-step version of the implementation used for initial experimentation. You can use it to:
- Visualize sentence length distributions (see the sketch after this list).
- Examine tokenization steps.
- Debug the model logic line by line.
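For example, a sentence-length histogram like the one mentioned in the first bullet can be produced with a few lines of matplotlib; this is only a sketch, and the notebook may plot it differently.

```python
import matplotlib.pyplot as plt

def plot_length_distribution(sentences, tokenize):
    # Histogram of token counts per sentence
    lengths = [len(tokenize(s)) for s in sentences]
    plt.hist(lengths, bins=30)
    plt.xlabel("Tokens per sentence")
    plt.ylabel("Number of sentences")
    plt.title("Sentence length distribution")
    plt.show()
```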