End-to-end ML project for clinical trial site performance prediction: data engineering (AACT), feature modeling, and predictive analytics.

andy-waters/site-performance-prediction

🧠 Site Performance Prediction

This project predicts the likelihood of clinical trial sites completing their studies successfully using historical data from ClinicalTrials.gov. It was developed as a hands-on, end-to-end ML project focused on real-world challenges in clinical trial operations.


📦 Project Structure

site-performance-prediction/
├── data/
│   ├── raw/
│   │   └── aact-data/                   # Extracted AACT pipe-delimited files
│   └── site_performance.csv             # Cleaned and enriched dataset
├── notebooks/
│   ├── 01_prepare_site_data.ipynb       # ETL: loads, joins, cleans AACT data
│   └── 02_model_site_performance.ipynb  # Modeling and evaluation
└── README.md                            # Project overview and instructions

📊 Problem Statement

Whether a clinical trial site completes its study or terminates early is critical to study success. By modeling historical site outcomes, we aim to:

  • Identify high-risk sites before activation
  • Surface factors that influence site success
  • Support predictive trial planning

🧪 Data Source

Data is sourced from the AACT (Aggregate Analysis of ClinicalTrials.gov) public dataset:

  • studies.txt, facilities.txt, sponsors.txt, etc.
  • Pipe-delimited format
  • Updated monthly by CTTI

🚀 How to Run

  1. Clone this repo and download the AACT dataset
    (see instructions at AACT Download)
  2. Extract the AACT files into data/raw/aact-data/
  3. Open Jupyter and run the data preparation notebook:
    jupyter notebook notebooks/01_prepare_site_data.ipynb
  4. Then run the modeling notebook:
    jupyter notebook notebooks/02_model_site_performance.ipynb

🤖 Models Used

  • Random Forest Classifier
  • Logistic Regression
  • XGBoost (optional)
  • Evaluation via Accuracy, Precision, Recall, F1, ROC-AUC

📈 Sample Features

| Feature Name     | Description                          |
|------------------|--------------------------------------|
| phase            | Phase of the clinical trial          |
| enrollment       | Target enrollment number             |
| study_type       | Interventional, Observational, etc.  |
| location_country | Site location                        |
| duration_days    | Trial duration                       |
| num_sites        | Number of sites per study            |
| site_status      | Outcome label (Completed/Terminated) |
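The derived features in the table (duration_days, num_sites, and a binary label from site_status) could be computed along the following lines. This is a sketch under assumptions: the input records, the start_date and completion_date fields, the ISO date format, and the 1/0 label encoding are all illustrative and may differ from what 01_prepare_site_data.ipynb actually does.

```python
from collections import Counter
from datetime import date

# Illustrative records only; field names and values are assumptions.
studies = [
    {"nct_id": "NCT00000001", "start_date": "2015-01-10", "completion_date": "2016-01-10"},
    {"nct_id": "NCT00000002", "start_date": "2018-03-01", "completion_date": "2018-09-01"},
]
facilities = [
    {"nct_id": "NCT00000001", "country": "United States", "site_status": "Completed"},
    {"nct_id": "NCT00000001", "country": "Canada", "site_status": "Terminated"},
    {"nct_id": "NCT00000002", "country": "Germany", "site_status": "Completed"},
]


def duration_days(start, end):
    """Number of days between two ISO-formatted dates (YYYY-MM-DD)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# num_sites: how many facility rows share the same study identifier.
sites_per_study = Counter(f["nct_id"] for f in facilities)
study_by_id = {s["nct_id"]: s for s in studies}

site_rows = []
for f in facilities:
    study = study_by_id[f["nct_id"]]
    site_rows.append({
        "location_country": f["country"],
        "duration_days": duration_days(study["start_date"], study["completion_date"]),
        "num_sites": sites_per_study[f["nct_id"]],
        # Assumed encoding: 1 = Completed, 0 = anything else (e.g. Terminated).
        "label": 1 if f["site_status"] == "Completed" else 0,
    })

for row in site_rows:
    print(row)
```

Each output row describes one site, with study-level values repeated across the sites of the same study.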

📌 Output

  • site_performance.csv: final dataset for modeling
  • .pkl model file: saved trained model for reuse
  • Performance plots and feature importance charts

📝 Next Steps

  • ✅ Develop an interactive dashboard (e.g., Streamlit or Dash)
  • 🔜 Publish a blog post or case study on this work
  • 🔜 Add Docker + API for full deployment

🛠️ Possible Enhancements

  • Incorporate SHAP or LIME for explainability
  • Add hyperparameter tuning via Optuna or GridSearchCV
  • Include a small synthetic dataset to make the repo self-contained

👨‍💻 Author

Andy Waters
Sr. Director of Software Engineering


📜 License

MIT (or update as appropriate)
