- Description
- Key Features
- Installation
- Quickstart
- Configuration
- Documentation
- Contributing
- Diagrams
- License
- Acknowledgments
This project provides a genetic algorithm designed to evolve an optimal ensemble of machine learning classifiers for binary classification tasks. It applies a grid search over the feature space and genetic algorithm hyperparameters to find the best-performing model ensemble.
- Evolve Ensembles: Uses a genetic algorithm to automatically find the best combination of models.
- Extensible Model Library: Easily add any scikit-learn compatible classifier.
- Comprehensive Search: Performs a grid search over data preprocessing, feature selection, and GA hyperparameters.
- Advanced Weighting: Includes methods like Differential Evolution and ANNs to find optimal ensemble weights.
- Model Caching: Re-use trained base learners to dramatically speed up subsequent experiments.
The recommended way to install the package is from PyPI:
```bash
pip install ensemble-genetic-algorithm
```

**Note on PyTorch:** This package requires PyTorch. For GPU support, it is highly recommended that you first install PyTorch manually by following the official instructions at pytorch.org to ensure the correct version for your CUDA toolkit is installed.
If you wish to contribute to the project, you can install it from a cloned repository. This method gives you access to the setup.sh script and development dependencies.
- Python: Version 3.10 or higher.
- Git: For cloning the repository.
- (Optional) NVIDIA GPU with CUDA: For GPU-accelerated computations.
1. **Clone the repository:**

   ```bash
   git clone https://github.com/SamoraHunter/ensemble_genetic_algorithm.git
   cd ensemble_genetic_algorithm
   ```

2. **(Optional) Run the setup script:** The `setup.sh` script automates the creation of a virtual environment and dependency installation.

   ```bash
   chmod +x setup.sh
   ./setup.sh
   ```

   This will create a virtual environment named `ga_env`, install the default dependencies, and set up a Jupyter kernel. The environment will be activated for your current terminal session.
The setup script supports different installation profiles. You can specify one using flags:
- `./setup.sh --cpu`: Installs the CPU-only version of PyTorch. Ideal for systems without a dedicated GPU.
- `./setup.sh --gpu`: Installs dependencies with GPU support (requires a compatible NVIDIA GPU and CUDA toolkit).
- `./setup.sh --dev`: Installs all development dependencies, including tools for testing.
- `./setup.sh --all`: Installs everything, including GPU and development dependencies.
To see all available options, run:
```bash
./setup.sh --help
```

The setup script activates the `ga_env` environment for your current session. For future sessions, you can activate it manually:

```bash
source ga_env/bin/activate
```

The pipeline expects a numeric data matrix (a pandas DataFrame) with a binary outcome variable whose column name carries the suffix `_outcome_var_1`. For more details, see https://github.com/SamoraHunter/pat2vec/tree/main.
- Python: >=3.10
- Operating System: Linux-5.4.0-125-generic-x86_64-with-debian-buster-sid
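For illustration, the snippet below writes a minimal conforming input file; the feature names, outcome name, and path are invented for this example, and only the `_outcome_var_1` suffix follows the convention described above.

```python
import os

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 500

# Purely synthetic numeric features; real inputs would come from your own
# feature pipeline (e.g. pat2vec).
df = pd.DataFrame({
    "feature_a": rng.normal(size=n_rows),
    "feature_b": rng.normal(size=n_rows),
    # The binary outcome column carries the required "_outcome_var_1" suffix.
    "infection_outcome_var_1": rng.integers(0, 2, size=n_rows),
})

os.makedirs("data", exist_ok=True)
df.to_csv("data/my_dataset.csv", index=False)
```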
You can run experiments either from the command line (recommended for most users) or programmatically within a script for development purposes.
The `main.py` script is the primary entry point for running experiments.
1. **Activate the virtual environment:**

   ```bash
   source ga_env/bin/activate
   ```

2. **Run the experiment:**

   - To run with the default `config.yml`:

     ```bash
     python main.py
     ```

   - To specify a different configuration file:

     ```bash
     python main.py --config path/to/your/config.yml
     ```

   - To run, evaluate the best model, and generate all analysis plots:

     ```bash
     python main.py --config path/to/your/config.yml --evaluate --plot
     ```
For development or debugging, you can still run the pipeline within a Python script or Jupyter notebook.
```python
import datetime
import os
import pathlib

from tqdm import tqdm

from ml_grid.pipeline import data, main_ga
from ml_grid.util.global_params import global_parameters
from ml_grid.util.grid_param_space_ga import Grid

# 1. Initialize parameters from a config file
config_path = 'config.yml'
global_params = global_parameters(config_path=config_path)

# 2. Create a unique, timestamped directory for the experiment run
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
run_specific_dir = os.path.join(global_params.base_project_dir, timestamp)
pathlib.Path(run_specific_dir).mkdir(parents=True, exist_ok=True)

# 3. Update global_params to use this new directory for all outputs
global_params.base_project_dir = run_specific_dir

# 4. Define the search space from the config
grid = Grid(
    global_params=global_params,
    config_path=config_path,
)

# 5. Run the main experiment loop
for i in tqdm(range(global_params.n_iter)):
    local_param_dict = next(grid.settings_list_iterator)

    # The data pipeline prepares the data for this specific iteration
    ml_grid_object = data.pipe(
        global_params=global_params,
        file_name=global_params.input_csv_path,
        local_param_dict=local_param_dict,
        param_space_index=i,
    )

    main_ga.run(
        ml_grid_object,
        local_param_dict=local_param_dict,
        global_params=global_params,
    ).execute()
```

The recommended way to configure the project is by creating a `config.yml` file in your project's root directory. This allows you to manage all settings in one place.
1. **Create `config.yml`:** Copy the `config.yml.example` file from the repository to a new file named `config.yml`.
2. **Edit:** Uncomment and modify the parameters you wish to change. Any parameter not specified in your `config.yml` will use its default value.
```yaml
# config.yml
global_params:
  # --- Experiment Settings ---
  input_csv_path: "data/my_dataset.csv"
  n_iter: 20
  model_list: ["logisticRegression", "randomForest", "XGBoost", "Pytorch_binary_class"]

  # --- Execution & Logging ---
  verbose: 2
  grid_n_jobs: 8
  base_project_dir: "HFE_GA_experiments"
  store_base_learners: True

ga_params:
  nb_params: [8, 16, 24]  # Number of base learners per ensemble
  pop_params: [64, 128]   # Population size
  g_params: [100]         # Number of generations

grid_params:
  weighted: ["unweighted", "de"]  # Ensemble weighting methods
  resample: ["undersample", None]
  corr: [0.95]
```

For detailed user guides, tutorials, and the full API reference, please see the Official Documentation. The documentation provides comprehensive information on everything from data preparation to interpreting results and extending the framework.
This section contains visual representations of the genetic algorithm implementation and model architecture.
- Source: `assets/example_usage_permutations.mmd`
- Description: Illustrates the genetic algorithm search over grid parameters (see example usage).
- Source: `assets/ga_data_diagram.mmd`
- Description: Illustrates the data flow through the genetic algorithm pipeline.
- Source: `assets/model_classes.mmd`
- Description: Shows the inheritance hierarchy and relationships between model classes.
- Source: `assets/ga_weighting.mmd`
- Description: Demonstrates the weighting mechanism used in the genetic algorithm.
- Source: `assets/grid_param_space_ga.mmd`
- Description: Visualizes the parameter space exploration grid used by the genetic algorithm.
- Source: `assets/svc_model_gen.mmd`
- Description: Flow diagram for the Support Vector Classifier model generation process.
- Source: `assets/torch_model_gen.mmd`
- Description: Flow diagram for the PyTorch neural network model generation process.
- Source: `assets/xgb_model_gen.mmd`
- Description: Flow diagram for the XGBoost model generation process.
All diagrams are available in both Mermaid source format (.mmd) and rendered formats (.png/.svg). The Mermaid source files can be edited and re-rendered as needed for documentation updates.
MIT License
Copyright (c) 2023 Samora Hunter
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
If you want to contribute to this project, please follow our contributing guidelines.
- Samora Hunter
This software is based primarily on machine learning methodology originally described in:
Agius, R., Brieghel, C., Andersen, M.A. et al. Machine learning can identify newly diagnosed patients with CLL at high risk of infection. Nat Commun 11, 363 (2020). https://doi.org/10.1038/s41467-019-14225-8
The following are relevant excerpts from the manuscript presenting this work:
A classification problem can be expressed as the problem of learning a function $f: X \to y$. A binary classification problem is one where $y$ is binary, 0 or 1. Several learning algorithms have been developed to this end by researchers in mathematical statistics and, more recently, machine learning (Bishop and Nasrabadi, 2006; Haykin, 1998). The problem is further described by a dataset $\{d_1, \ldots, d_n\}$ of training data points $d_i = (x_i, y_i) \in X \times y$. A learning algorithm typically has associated hyperparameters $\lambda \in \Lambda$ which alter its behaviour. The task of machine learning typically entails the optimisation of these hyperparameters. Several approaches to efficient and effective hyperparameter optimisation have been proposed (Komer, Bergstra and Eliasmith, 2014).
Given a set of learning algorithms $\mathcal{A}$ and a limited dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the objective of model selection is to identify the algorithm $A^{*} \in \mathcal{A}$ that achieves optimal generalization performance. Generalization performance is assessed by splitting $D$ into disjoint training and validation sets $D_{\text{train}}^{(i)}$ and $D_{\text{valid}}^{(i)}$. The learning functions are constructed using $A^{*}$ on $D_{\text{train}}^{(i)}$, and their predictive performance is then evaluated on $D_{\text{valid}}^{(i)}$. This frames the model selection problem as follows:
$$A^{*} = \underset{A \in \mathcal{A}}{\arg\min} \; \sum_{i} L\left(A,\, D_{\text{train}}^{(i)},\, D_{\text{valid}}^{(i)}\right) \qquad \text{(Equation 1)}$$

where $L(A, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)})$ represents the loss (e.g., misclassification rate) attained by $A$ when trained on $D_{\text{train}}^{(i)}$ and assessed on $D_{\text{valid}}^{(i)}$ (Thornton et al., 2013). Performance can be further evaluated by partitioning the data into $k$ equally sized folds, with the learning algorithm fitted to $k-1$ folds and evaluated on the held-out fold.
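To make Equation 1 concrete, here is a minimal sketch of this selection procedure using scikit-learn cross-validation; the candidate algorithms, synthetic data, and accuracy-based loss are illustrative choices, not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# The candidate set A: each entry is one learning algorithm.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Approximate the loss L by the mean misclassification rate over k = 5 folds.
losses = {
    name: 1.0 - cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

a_star = min(losses, key=losses.get)  # the argmin over the candidate set
print(a_star, losses)
```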
The choice of learning algorithm, hyperparameters and feature set can itself be viewed as a hyperparameter set to be optimised (Komer, Bergstra and Eliasmith, 2014). This set may then be optimised using generally available optimisation methods such as Bayesian hyperparameter optimisation (Komer, Bergstra and Eliasmith, 2014) and genetic algorithms.
To address the optimisation problem in Equation 1, a method inspired by recent applied machine learning in medicine was developed (Agius et al., 2020); note that the feature engineering methods used were likewise heavily inspired by those found in that manuscript. The precise predictive problem addressed by that method is different; however, the general problem is very similar across available data sources, types of features, numbers of samples, and more. In its reference form, the method uses a genetic algorithm to search for the optimal ensemble of machine learning classifiers for a binary outcome. The method developed and extended here adds a grid search over genetic algorithm hyperparameters, the feature space, and feature transformations for the optimal ensemble of machine learning classifiers for a binary outcome, along with ensemble weighting, additional base learning algorithms, early stopping, model recycling, neural architecture search, and more.
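Schematically, such a search can be pictured as sampling from the Cartesian product of GA hyperparameters and feature-space options. The sketch below is illustrative only; the keys and values do not mirror the project's internal `Grid` schema.

```python
import itertools
import random

# Illustrative search space; keys and values are examples, not the project's schema.
search_space = {
    "ensemble_size": [8, 16, 24],
    "population_size": [64, 128],
    "generations": [100],
    "weighting": ["unweighted", "de"],
    "resample": ["undersample", None],
}

# Every combination of settings is one candidate configuration.
grid = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]

# As in the text, a random subsample of the grid keeps compute time manageable.
random.seed(0)
subsample = random.sample(grid, k=min(10, len(grid)))
print(len(grid), "configurations;", len(subsample), "sampled")
```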
A genetic algorithm is an optimization technique inspired by the process of natural selection. It is used in machine learning to find optimal solutions to complex problems by mimicking the evolutionary process. It starts with a population of potential solutions represented as individuals. Through iterations, individuals are selected, recombined, and mutated to create new generations. Selection is based on the fitness of individuals, which measures their quality in solving the problem. Over time, this process tends to improve the overall fitness of the population, leading to solutions that are better adapted to the problem at hand.
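As a generic illustration of this loop (not the project's implementation), the following toy example evolves bit strings towards an all-ones optimum:

```python
import random

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 40

def fitness(ind):
    # Toy objective: count of 1-bits ("one-max").
    return sum(ind)

pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Selection: binary tournament based on fitness.
    def select():
        a, b = random.sample(pop, 2)
        return max(a, b, key=fitness)

    offspring = []
    while len(offspring) < POP_SIZE:
        p1, p2 = select(), select()
        cut = random.randrange(1, GENOME_LEN)  # one-point crossover
        child = p1[:cut] + p2[cut:]
        if random.random() < 0.1:              # mutation: flip one random bit
            i = random.randrange(GENOME_LEN)
            child[i] = 1 - child[i]
        offspring.append(child)
    pop = offspring

print(max(fitness(ind) for ind in pop))
```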
The ensemble method described in (Agius et al., 2020; Nickolls et al., 2008) was first replicated in software. An ensemble modelling approach combines predictions from multiple independent classifiers. The authors cite (Hansen and Salamon, 1990) and (Perrone and Cooper, 1995), arguing that ensemble methods can reduce overfitting and that multiple independent, uncorrelated predictors can reduce the error. This method was designed to consider far more variables than are typically found in the medical diagnosis prediction literature. It was then adapted largely to increase the available configuration and to explore a greater algorithm, feature and transformation hyperparameter space. Several developments are the result of reengineering to deploy the algorithm in a resource-constrained environment. The hyperparameter configuration of the genetic algorithm and the feature space were not optimised for; however, in principle and given sufficient compute resource, they could be optimised with Bayesian hyperparameter optimisation (Komer, Bergstra and Eliasmith, 2014), as in Ma.
The following is a description of the method as finally implemented for this project. All data transformations and feature-space segmenting methods previously described for Ma were used to form a grid of datasets for consideration for the primary datasets Da and Db. A random subsample of the possible grids was selected to reduce compute time. Candidate base learners from scikit-learn were implemented as in CLL-TIM. Hyperparameter spaces were expanded. An additional base learner, a binary classifier implemented in PyTorch, was developed and included. This base learner implements rudimentary aspects of neural architecture search (Elsken, Metzen and Hutter, 2019) by exposing elementary artificial neural network architecture hyperparameters in the search space. This is an improvement on the scikit-learn multilayer perceptron classifier implemented in CLL-TIM, as it is accelerated by CUDA (Nickolls et al., 2008) and is far more extensible.
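A minimal sketch of what such a base learner might look like: a PyTorch binary classifier whose depth and layer widths are themselves hyperparameters. The builder function and sizes below are illustrative assumptions, not the project's actual module.

```python
import torch
import torch.nn as nn

def build_mlp(n_features: int, hidden_sizes: list[int]) -> nn.Sequential:
    """Build a binary classifier; hidden_sizes is drawn from the search space."""
    layers: list[nn.Module] = []
    in_dim = n_features
    for h in hidden_sizes:
        layers += [nn.Linear(in_dim, h), nn.ReLU()]
        in_dim = h
    layers += [nn.Linear(in_dim, 1)]  # single logit for a binary outcome
    return nn.Sequential(*layers)

# Architecture hyperparameters exposed to the search, e.g. depth and width.
model = build_mlp(n_features=32, hidden_sizes=[64, 32])
device = "cuda" if torch.cuda.is_available() else "cpu"  # CUDA acceleration if available
model = model.to(device)

x = torch.randn(8, 32, device=device)
probs = torch.sigmoid(model(x))  # predicted probabilities for the positive class
print(probs.shape)
```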
The genetic algorithm's process begins with creating a population through random selection from a pool of base learners. Each base learner comprises a specific learning algorithm along with its corresponding hyperparameter space. Initializing a base learner involves training the learning algorithm with a randomized arrangement of hyperparameters on a training dataset. Prior to this, a feature reduction technique is applied to the dataset. The trained model is then used to assess performance on a test dataset, which is reduced by the same feature selection process. Once created, the base learner includes the learning algorithm, a subset of features, relevant metrics, an evaluation score (AUC), and a prediction vector for the test dataset.
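The sketch below illustrates this initialisation with scikit-learn; the `BaseLearner` container, the particular feature selector, and the hyperparameter draw are illustrative stand-ins for the project's internals.

```python
import random
from dataclasses import dataclass

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

@dataclass
class BaseLearner:
    """Hypothetical container mirroring the attributes listed above."""
    model: object
    feature_idx: np.ndarray
    auc: float
    predictions: np.ndarray

X, y = make_classification(n_samples=400, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Randomised draw from this learner's hyperparameter space.
random.seed(0)
params = {"n_estimators": random.choice([50, 100, 200]),
          "max_depth": random.choice([3, 5, None])}

# Feature reduction is fitted on the training set and applied to both splits.
selector = SelectKBest(f_classif, k=20).fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

model = RandomForestClassifier(random_state=0, **params).fit(X_train_sel, y_train)
preds = model.predict_proba(X_test_sel)[:, 1]  # prediction vector for the test set

learner = BaseLearner(model=model,
                      feature_idx=selector.get_support(indices=True),
                      auc=roc_auc_score(y_test, preds),
                      predictions=preds)
print(learner.auc)
```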
The pool of base learners is invoked to construct ensembles ranging from two to the maximum specified ensemble size. To introduce a skew towards smaller ensembles, a skew normal distribution function is applied. The ensembles generated by this method serve as individuals within the genetic algorithm. These individuals are defined by the chromosomes of the individual base learners. The fitness of each ensemble is determined by its performance on the test set, as assessed by the AUC metric. Notably, no measures were taken to implement ensemble diversity weighting. However, a hyperparameter for ensemble weighting offers three options. The first is no weighting: each base learner's prediction is collapsed into a matrix and transformed into a binary prediction, akin to applying a sigmoid to the mean. The second uses differential evolution (Virtanen et al., 2020) to find the weights for each base learner that maximise AUC on the training set; far fewer iterations than usual are run for this algorithm to reduce compute time. Differential-evolution-weighted ensemble individuals then have their AUC attribute set to the weighted ensemble score. The third entails a similar optimisation problem, except that an artificial neural network implemented in PyTorch is used to learn the optimal ensemble weighting. Artificial-neural-network-weighted ensemble individuals then have their AUC attribute set to the weighted ensemble score.
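As an illustration of the differential evolution option, the following sketch uses `scipy.optimize.differential_evolution` to weight a matrix of (here random) base-learner predictions towards maximum AUC; the bounds and iteration budget are assumptions.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_learners, n_samples = 5, 200

# Random stand-ins for the base learners' prediction vectors and true labels.
y_true = rng.integers(0, 2, size=n_samples)
preds = rng.random((n_learners, n_samples))  # rows = learners, columns = samples

def neg_auc(weights):
    # Weighted mean of the base-learner predictions; DE minimises, so negate AUC.
    w = np.asarray(weights) + 1e-9  # guard against an all-zero weight vector
    combined = np.average(preds, axis=0, weights=w)
    return -roc_auc_score(y_true, combined)

result = differential_evolution(
    neg_auc,
    bounds=[(0.0, 1.0)] * n_learners,
    maxiter=10,  # deliberately few iterations, as in the method description
    seed=0,
)
print("weights:", result.x, "weighted AUC:", -result.fun)
```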
Individuals are generated to fill a population of size 96. These individuals then undergo evaluation, whereby they are measured on their classification performance; the Matthews correlation coefficient (MCC) on the test set was used as the individual's fitness. Parents are selected from these individuals by tournament selection, with the tournament size given by a hyperparameter, and 2-point crossover is applied. Mutation of ensembles occurs with the probability given by a hyperparameter: one base learner is swapped out for a newly randomly generated one. The fitness of the offspring is recalculated. This cycle is repeated for a maximum of 128 generations. Early stopping, defined as a failure to improve on the maximum MCC score reached after five cycles, was implemented. A full description and illustration of this process is available in (Agius et al., 2020, Supplementary 17). The genetic algorithm was implemented with the DEAP Python library (Fortin et al., 2012).
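A condensed sketch of this evolutionary loop with DEAP is shown below. The bit-string individuals and toy fitness stand in for ensembles and MCC, while the population size (96), tournament selection, 2-point crossover, 128-generation cap, and five-generation early-stopping rule follow the description above.

```python
import random

from deap import base, creator, tools

random.seed(0)

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr, 16)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", lambda ind: (sum(ind),))  # toy fitness standing in for MCC
toolbox.register("mate", tools.cxTwoPoint)             # 2-point crossover, as in the text
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=96)  # population size from the text
best, stall = float("-inf"), 0

for gen in range(128):  # maximum number of generations from the text
    for ind in pop:
        ind.fitness.values = toolbox.evaluate(ind)
    top = max(ind.fitness.values[0] for ind in pop)
    # Early stopping: no improvement on the best score for five generations.
    best, stall = (top, 0) if top > best else (best, stall + 1)
    if stall >= 5:
        break
    offspring = [toolbox.clone(ind) for ind in toolbox.select(pop, len(pop))]
    for c1, c2 in zip(offspring[::2], offspring[1::2]):
        toolbox.mate(c1, c2)
        toolbox.mutate(c1)
        toolbox.mutate(c2)
    pop = offspring

print(best)
```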
- Agius, R., Brieghel, C., Andersen, M.A., Pearson, A.T., Ledergerber, B., Cozzi-Lepri, A., Louzoun, Y., Andersen, C.L., Bergstedt, J., von Stemann, J.H., Jorgensen, M., Tang, M.E., Fontes, M., Bahlo, J., Herling, C.D., Hallek, M., Lundgren, J., MacPherson, C.R., Larsen, J. and Niemann, C.U. (2020) 'Machine learning can identify newly diagnosed patients with CLL at high risk of infection', Nature Communications, 11(1), pp. 363-8. doi: 10.1038/s41467-019-14225-8
- Bishop, C.M. and Nasrabadi, N.M. (2006) Pattern Recognition and Machine Learning, Springer.
- Haykin, S. (1998) Neural Networks: A Comprehensive Foundation, Prentice Hall PTR.
- Komer, B., Bergstra, J. and Eliasmith, C. (2014) 'Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn', Citeseer Austin, TX, pp. 50.
- Thornton, C., Hutter, F., Hoos, H.H. and Leyton-Brown, K. (2013) 'Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms', pp. 847.
- Nickolls, J., Buck, I., Garland, M. and Skadron, K. (2008) 'Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?', Queue, 6(2), pp. 40-53.
- Fortin, F., De Rainville, F., Gardner, M.G., Parizeau, M. and Gagné, C. (2012) 'DEAP: Evolutionary algorithms made easy', The Journal of Machine Learning Research, 13(1), pp. 2171-2175.
- Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W. and Bright, J. (2020) 'SciPy 1.0: fundamental algorithms for scientific computing in Python', Nature Methods, 17(3), pp. 261-272.
