A machine learning project that fine-tunes a T5 model to generate structured summaries of medical research papers from PubMed abstracts.
This project trains a sequence-to-sequence model to automatically generate comprehensive summaries of medical papers including:
- Plain-language summary
- Key findings
- Clinical relevance
- Methodology brief
- Python 3.12 (required; Python 3.13 has compatibility issues with PyTorch on macOS)
- PyTorch
- Transformers (Hugging Face)
- Datasets
- Other dependencies listed below
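The Python dependencies (matching the `pip install` command in the installation steps) can also be pinned in a `requirements.txt`. This is a suggested layout, not a file shipped with the repo:

```
torch
transformers
datasets
evaluate
rouge-score
sentencepiece
accelerate
```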
Using Homebrew (recommended for macOS):
```bash
brew install python@3.12
cd MedicalPaperSummarizer
/opt/homebrew/bin/python3.12 -m venv venv312
source venv312/bin/activate
pip install torch transformers datasets evaluate rouge-score sentencepiece accelerate
```

To train the model:

```bash
# Activate the virtual environment
source venv312/bin/activate

# Run the training script (the entrypoint is `train_model.py`)
python train_model.py
```

The script will:
- Load PubMed articles from `pubmed_abstracts.json`
- Process and structure the abstracts
- Fine-tune the T5-small model
- Save the trained model to `pubmed-summarizer-best/`
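The "process and structure" step can be sketched roughly as follows. This is an illustrative sketch only: the field names (`abstract`, `summary`) and the `summarize:` task prefix are assumptions about what `train_model.py` does, not a copy of it.

```python
# Hypothetical sketch of turning a structured PubMed abstract into a
# T5 source/target pair (field names are assumptions, not taken from
# train_model.py).

def build_example(article: dict) -> dict:
    """Flatten a structured abstract into T5 source/target strings."""
    sections = article["abstract"]  # e.g. {"Background": ..., "Methods": ...}
    source = "summarize: " + " ".join(
        f"{name}: {text}" for name, text in sections.items()
    )
    target = article.get("summary", "")  # reference summary, if present
    return {"source": source, "target": target}

example = build_example({
    "abstract": {"Background": "Statins lower LDL.", "Methods": "RCT, n=200."},
    "summary": "# Plain-language summary\nStatins reduced cholesterol.",
})
```

T5 is trained with task prefixes, so prepending a short instruction such as `summarize:` to the input text is the conventional way to frame summarization for this model family.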
- Model: `google-t5/t5-small`
- Max steps: 5 (for quick testing)
- Batch size: 4
- Learning rate: 5e-5
- Device: CPU (configured for compatibility)
MedicalPaperSummarizer/
├── train_model.py # Main training script (Python 3.12 compatible)
├── get_data.py # Script to fetch PubMed data
├── run_train.sh # Helper script to launch training with env vars set
├── run_training.sh # Alternative launcher with additional macOS tweaks
├── pubmed_abstracts.json # Input data (PubMed articles)
├── pubmed-sum/ # Training outputs
└── pubmed-summarizer-best/ # Final trained model
If you encounter `[mutex.cc : 452] RAW: Lock blocking` errors, you are likely using Python 3.13. This is a known PyTorch bug on macOS. Solution: use Python 3.12 as shown in the installation steps.
Fixed in `train_model.py` by properly handling tensor conversions and clipping values to valid ranges.
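The fix described above is along these lines. This is a hedged sketch, not the actual code from `train_model.py`: the function and parameter names are illustrative, and 32128 is t5-small's vocabulary size. During seq2seq evaluation, label tensors are commonly padded with `-100`, which is not a valid token id, so ids must be sanitized before decoding.

```python
# Illustrative sketch of the kind of fix described above (names are
# hypothetical; the real code lives in train_model.py). Replaces the
# -100 label padding with the pad token id and clips every id into the
# valid vocabulary range before decoding.

def sanitize_token_ids(ids, pad_token_id=0, vocab_size=32128):
    """Replace -100 padding with the pad token and clip to the vocab range."""
    cleaned = [pad_token_id if i == -100 else i for i in ids]
    return [min(max(i, 0), vocab_size - 1) for i in cleaned]
```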
The input data (pubmed_abstracts.json) should contain PubMed articles with structured abstracts including sections like:
- Background/Introduction
- Methods
- Results
- Conclusions
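An input record might look like the following. This is an illustrative sketch; the exact field names produced by `get_data.py` may differ:

```json
{
  "pmid": "12345678",
  "title": "Example trial of drug X",
  "abstract": {
    "Background": "...",
    "Methods": "...",
    "Results": "...",
    "Conclusions": "..."
  }
}
```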
The model generates structured summaries in the following format:
# Plain-language summary
[3-sentence accessible summary]
# Key findings
- [Finding 1]
- [Finding 2]
- [Finding 3]
- [Finding 4]
# Clinical relevance
[Clinical implications and applications]
# Methodology brief
[Brief description of study methodology]
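Because the summary is emitted as plain markdown with `#` headings, it can be split back into sections downstream. A minimal sketch (the section headings match the format above; this parser is not part of the repo):

```python
# Minimal sketch of parsing a generated summary back into its sections.
# Assumes the "# Heading" format shown above; not part of this repository.

def parse_summary(text: str) -> dict:
    """Map each '# Heading' to the body text that follows it."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}
```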
Feel free to submit issues and pull requests!
MIT License
- Built with Hugging Face Transformers
- Uses Google's T5 model
- PubMed data from NCBI