This project predicts the likelihood of clinical trial sites completing their studies successfully using historical data from ClinicalTrials.gov. It was developed as a hands-on, end-to-end ML project focused on real-world challenges in clinical trial operations.
site-performance-prediction/
├── data/
│   ├── raw/
│   │   └── aact-data/                      # Extracted AACT pipe-delimited files
│   └── site_performance.csv                # Cleaned and enriched dataset
├── notebooks/
│   ├── 01_prepare_site_data.ipynb          # ETL: loads, joins, cleans AACT data
│   └── 02_model_site_performance.ipynb     # Modeling and evaluation
└── README.md                               # Project overview and instructions
Clinical trial site performance (completion vs. dropout) is critical to successful studies. By modeling historical data, we aim to:
- Identify high-risk sites before activation
- Surface factors that influence site success
- Support predictive trial planning
Data is sourced from the AACT (Aggregate Analysis of ClinicalTrials.gov) public dataset:
- studies.txt, facilities.txt, sponsors.txt, etc.
- Pipe-delimited format
- Updated monthly by CTTI
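As a rough illustration of how pipe-delimited AACT files can be read and joined, here is a minimal standard-library sketch. It is not the code from the ETL notebook. The helper names (`read_pipe_delimited`, `count_sites_per_study`) are hypothetical, the inline sample rows are made-up placeholders rather than real AACT records, and the column names (`nct_id`, `phase`, `enrollment`, `country`) are assumed to match the AACT schema.

```python
import csv
import io
from collections import Counter


def read_pipe_delimited(handle):
    """Parse a pipe-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="|"))


def count_sites_per_study(facility_rows):
    """Count facility rows per study, keyed by nct_id (assumed column name)."""
    return Counter(row["nct_id"] for row in facility_rows)


# Made-up placeholder rows standing in for studies.txt and facilities.txt.
SAMPLE_STUDIES = (
    "nct_id|phase|enrollment\n"
    "NCT00000001|Phase 2|120\n"
    "NCT00000002|Phase 3|400\n"
)
SAMPLE_FACILITIES = (
    "nct_id|country\n"
    "NCT00000001|United States\n"
    "NCT00000001|Canada\n"
    "NCT00000002|Germany\n"
)

studies = read_pipe_delimited(io.StringIO(SAMPLE_STUDIES))
facilities = read_pipe_delimited(io.StringIO(SAMPLE_FACILITIES))
sites_per_study = count_sites_per_study(facilities)

# Attach a num_sites value to each study row (0 if no facilities are listed).
for study in studies:
    study["num_sites"] = sites_per_study.get(study["nct_id"], 0)
```

With real data, the `io.StringIO` objects would be replaced by files opened from data/raw/aact-data/, for example `open(path, newline="", encoding="utf-8")`.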
- Clone this repo and download the AACT dataset (see instructions at AACT Download)
- Extract AACT files into data/raw/aact-data/
- Open Jupyter and run:
  jupyter notebook notebooks/01_prepare_site_data.ipynb
- Then run the modeling notebook:
  jupyter notebook notebooks/02_model_site_performance.ipynb

- Random Forest Classifier
- Logistic Regression
- XGBoost (optional)
- Evaluation via Accuracy, Precision, Recall, F1, ROC-AUC
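To make the evaluation metrics listed above concrete, the following sketch computes them from scratch for a binary label (1 = completed, 0 = terminated, an assumed encoding). It is illustrative only and is not taken from the modeling notebook; in practice the equivalent functions in `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`) would likely be used instead. The function names and sample labels here are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                 [1, 1, 0, 0, 0, 1, 1, 0])
auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

Note that ROC-AUC is computed from predicted scores or probabilities, while the other four metrics use thresholded class predictions.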
| Feature Name | Description |
|---|---|
| phase | Phase of the clinical trial |
| enrollment | Target enrollment number |
| study_type | Interventional, Observational, etc. |
| location_country | Site location |
| duration_days | Trial duration |
| num_sites | Number of sites per study |
| site_status | Outcome label (Completed/Terminated) |
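Two of the features above are derived rather than read directly: duration_days and the binary outcome taken from site_status. The sketch below shows one plausible way to compute them. It assumes ISO-formatted start and completion dates and a 1/0 encoding for Completed/Terminated; the function names are hypothetical and the notebook may handle these steps differently.

```python
from datetime import date


def duration_days(start_iso, end_iso):
    """Days between two ISO dates (YYYY-MM-DD); assumes end is not before start."""
    start = date.fromisoformat(start_iso)
    end = date.fromisoformat(end_iso)
    return (end - start).days


def encode_label(site_status):
    """Map site_status to 1 (Completed) or 0 (Terminated).

    Any other status returns None so the row can be excluded from training.
    """
    normalized = site_status.strip().lower()
    if normalized == "completed":
        return 1
    if normalized == "terminated":
        return 0
    return None


# Illustrative values only.
example_duration = duration_days("2020-01-01", "2020-12-31")
example_label = encode_label("Completed")
```

Returning None for statuses outside the two outcome classes keeps still-running trials out of the labeled training set instead of silently treating them as failures.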
- site_performance.csv: final dataset for modeling
- .pkl model file: saved trained model for reuse
- Performance plots and feature importance charts
- Develop an interactive dashboard (e.g., Streamlit or Dash)
- Publish a blog post or case study on this work
- Add Docker + API for full deployment
- Incorporate SHAP or LIME for explainability
- Add hyperparameter tuning via Optuna or GridSearchCV
- Include a small synthetic dataset to make the repo self-contained
Andy Waters
Sr. Director of Software Engineering
MIT (or update as appropriate)