This repository contains the code and model checkpoints for the paper "Curriculum Learning for Small Code Language Models". It implements training and evaluation of code language models on the TinyPy dataset using several curriculum learning techniques.
The paper is available at https://aclanthology.org/2024.acl-srw.44/
The repository is organized as follows:
├── README.md
├── code
│   ├── evaluate_*.py - Evaluation scripts
│   ├── model.py - Model definition
│   ├── train_*.py - Training scripts for different techniques
│   └── __pycache__
├── data
│   ├── TinyPy - Raw TinyPy dataset
│   │   └── [Split] - Data splits
│   └── TinyPy_processed - Preprocessed binary data
└── models_checkpoint - Saved model checkpoints
The following models are included:
- `baseline_model_1M.pth` - Baseline model trained on all data
- `hybrid_cl_model_1M.pth` - Hybrid curriculum learning
- `incremental_cl_model_1M.pth` - Incremental curriculum learning
- `sequential_cl_model_1M.pth` - Sequential curriculum learning
The data folder should be added manually by downloading the dataset from Kaggle:
- Create a new folder named `data` inside the `code` folder.
- Go to the Kaggle dataset "TinyPy for Curriculum Learning".
- Download the dataset.
- Extract the contents and place them in the `data` folder.
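After extracting the archive, it can help to confirm that the expected folders are in place before training. The following is a minimal sketch, not part of the repository: the subfolder names `TinyPy` and `TinyPy_processed` are taken from the repository tree, while the helper name `missing_data_folders` and the `code/data` location are assumptions that may need adjusting to match your checkout.

```python
from pathlib import Path

# Subfolder names taken from the repository tree; adjust them if the
# extracted Kaggle archive uses a different layout.
EXPECTED_SUBFOLDERS = ("TinyPy", "TinyPy_processed")


def missing_data_folders(data_root):
    """Return the expected subfolders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: run from the repository root, with data placed in code/data.
    missing = missing_data_folders(Path("code") / "data")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

If any folder is reported missing, check that the archive was extracted directly into the `data` folder rather than into a nested subfolder.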
The following Python packages are required:
- numpy
- pandas
- torch
- tqdm
- wandb (optional)
- fuzzywuzzy
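To see which of the packages listed above are available in the current environment, a quick check along these lines may be useful. This is a sketch using only the standard library; the helper name `missing_packages` is hypothetical, and it assumes each package's import name matches the name in the list.

```python
import importlib.util

# Package names copied from the list above; wandb is marked optional there.
REQUIRED = ("numpy", "pandas", "torch", "tqdm", "fuzzywuzzy")
OPTIONAL = ("wandb",)


def missing_packages(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing_required = missing_packages(REQUIRED)
    missing_optional = missing_packages(OPTIONAL)
    if missing_required:
        print("Missing required packages:", ", ".join(missing_required))
    else:
        print("All required packages were found.")
    if missing_optional:
        print("Missing optional packages:", ", ".join(missing_optional))
```

The check only locates the modules and does not import them, so it runs quickly even when `torch` is installed.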
To train a model, use one of the training scripts. For example, to train the baseline model:

    python train_baseline.py

To train with hybrid curriculum learning:

    python code/train_hybrid_cl.py

To train with incremental curriculum learning:

    python code/train_incremental_cl.py

To train with sequential curriculum learning:

    python code/train_sequential_cl.py

To evaluate a model, use one of the evaluation scripts. For example, to evaluate code execution:

    python evaluate_code_execution.py

To evaluate code completion at the token level:

    python code/evaluate_code_completion_tokenlevel.py

To evaluate code completion at the line level:

    python code/evaluate_code_completion_linelevel.py

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.