Repository for the paper "Deep Learning Approaches to Molecular Classification Using Voxel-Based Representations" by István Lakatos, András Hajdu, and Balázs Harangi.
Abstract:
Accurate molecular classification is a key step in computational drug discovery, toxicological risk assessment, and high-throughput screening. While most machine learning approaches rely on 2D molecular fingerprints, these projections often fail to capture stereochemistry and 3D spatial interactions that critically influence molecular activity. In this study, we propose a voxel-based 3D molecular representation combined with convolutional neural networks (CNNs) for end-to-end molecular classification for the Tox21 toxicity dataset. We systematically compare three architectures: a 2D CNN operating on molecular images, a dense 3D CNN using volumetric grids, and a sparse 3D CNN implemented with the TensorFlow 3D framework. All models are trained under consistent preprocessing and multi-task settings to isolate the effects of molecular representation and network design. The 2D CNN achieves the highest mean ROC–AUC score, followed by the dense 3D CNN and the sparse 3D CNN, indicating that simple voxel occupancy grids provide limited benefit over 2D projections. Moreover, the sparse 3D CNN does not provide substantial computational savings relative to the dense 3D model, despite reducing the number of processed voxels. These results suggest that, while voxel-based CNNs remain viable for toxicity prediction, traditional 2D approaches currently offer a more favorable balance between predictive accuracy and resource efficiency.
Install the Conda environments. The environment.yml files can be found in the envs folders:
- jmol-scripts
- molecule36-tf21
- molecule39-tf210
- molecule-threedim
- wsl-molecule37-tf23-tf3d
To install TensorFlow 3D, follow https://github.com/google-research/google-research/tree/master/tf3d
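Creating the environments listed above can be scripted. The following is a minimal sketch, not part of the repository: it assumes a hypothetical layout of envs/<name>/environment.yml (adjust the path if the files are organized differently) and only prints the commands unless dry_run is set to False. The helper names build_create_commands and create_all are illustrative.

```python
import subprocess
from pathlib import Path

# Environment names as listed in this README.
ENV_NAMES = [
    "jmol-scripts",
    "molecule36-tf21",
    "molecule39-tf210",
    "molecule-threedim",
    "wsl-molecule37-tf23-tf3d",
]


def build_create_commands(envs_root="envs"):
    """Return one `conda env create` command per environment.

    Assumption: each file lives at <envs_root>/<name>/environment.yml.
    """
    commands = []
    for name in ENV_NAMES:
        yml = Path(envs_root) / name / "environment.yml"
        commands.append(["conda", "env", "create", "-f", str(yml)])
    return commands


def create_all(envs_root="envs", dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in build_create_commands(envs_root):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    create_all(dry_run=True)
```

Running the sketch with dry_run=True is a quick way to check the paths before any environment is actually created.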
- Generate 2D images
  Environment: jmol-scripts
  Script: twodim/dataset_generator/generate25d-universal.py
- Pack images into TFRecords
  Environment: molecule36-tf21
  Scripts:
    twodim/tensorflow/pack-data-tfrecords.py
    twodim/tensorflow/pack-data-tfrecords-multitask.py
  Optionally verify the TFRecord files:
    twodim/tensorflow/check-tfrecord-files.py
    twodim/tensorflow/check-tfrecord-files-multitask.py
- Train 2D CNN
  Environment: molecule39-tf210
  Scripts:
    twodim/tensorflow/Molecule25D-train-small.py
    twodim/tensorflow/Molecule25D-train-small-multitask.py
- Preprocess data
  Environment: molecule-threedim
  Scripts:
    threedim/moleculenet-tox21-task-preprocess.py
    threedim/tox21-smiles-to-inchi.py
- Generate voxelboxes
  Environment: molecule-threedim
  Script: threedim/dataset_generator/dataset-generator.py
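To illustrate what a voxel occupancy grid is, here is a minimal NumPy sketch. It is not the code in dataset-generator.py: the function name voxelize, the grid size, and the resolution are illustrative assumptions, and it produces a single binary channel, whereas the training script names in this repository mention a multichannel setup.

```python
import numpy as np


def voxelize(coords, grid_size=32, resolution=0.5):
    """Build a binary occupancy grid from 3D atom coordinates.

    grid_size (voxels per side) and resolution (length units per voxel)
    are illustrative values, not the repository's settings.
    """
    coords = np.asarray(coords, dtype=np.float64)
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.uint8)
    # Center the molecule so its centroid sits in the middle of the box.
    centered = coords - coords.mean(axis=0)
    idx = np.floor(centered / resolution).astype(int) + grid_size // 2
    # Drop atoms that fall outside the box.
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx = idx[inside]
    # Mark each voxel that contains at least one atom.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid


if __name__ == "__main__":
    # Two atoms one unit apart along the x axis.
    demo = voxelize([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    print(demo.shape, int(demo.sum()))
```

A per-element variant would add a trailing channel axis and write each atom into the channel of its element type.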
- Convert 3D voxelboxes to 2D images
  Environment: molecule-threedim
  Script: threedim/dataset_generator/voxel-to-2d-converter.py
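One simple way to turn a voxel grid into a 2D image is a maximum projection along one axis. The sketch below shows that idea only; whether voxel-to-2d-converter.py uses this projection is an assumption, and the function name project_voxels is hypothetical.

```python
import numpy as np


def project_voxels(grid, axis=2):
    """Collapse a 3D occupancy grid to a 2D 8-bit image by max projection.

    A pixel is white (255) if any voxel along `axis` is occupied.
    """
    grid = np.asarray(grid)
    if grid.ndim != 3:
        raise ValueError("expected a 3D grid")
    image = grid.max(axis=axis)
    return (image > 0).astype(np.uint8) * 255


if __name__ == "__main__":
    demo = np.zeros((4, 4, 4), dtype=np.uint8)
    demo[1, 2, 3] = 1
    print(project_voxels(demo, axis=2))
```

A max projection discards depth information, which is the kind of loss a 3D representation is meant to avoid.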
- Convert 3D voxelboxes to sparse voxelboxes
  Environment: molecule-threedim
  Script: threedim/dataset_sparse_converter.py
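A sparse voxel representation typically stores only the occupied voxels, as a list of integer coordinates plus a feature vector per voxel. The following NumPy sketch shows that conversion and its inverse under the assumption of a dense grid shaped (D, H, W, C). The exact input format expected by TensorFlow 3D and by dataset_sparse_converter.py is not described here, so treat the function names and layout as illustrative.

```python
import numpy as np


def dense_to_sparse(grid):
    """Return (indices, features) for the occupied voxels of a dense grid.

    grid has shape (D, H, W, C); a voxel counts as occupied if any
    of its channels is non-zero.
    """
    grid = np.asarray(grid)
    occupied = np.any(grid != 0, axis=-1)
    indices = np.argwhere(occupied)   # (N, 3) integer coordinates
    features = grid[occupied]         # (N, C) channel values, same order
    return indices, features


def sparse_to_dense(indices, features, shape):
    """Rebuild the dense (D, H, W, C) grid from indices and features."""
    grid = np.zeros(tuple(shape) + (features.shape[1],), dtype=features.dtype)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = features
    return grid


if __name__ == "__main__":
    demo = np.zeros((4, 4, 4, 2), dtype=np.uint8)
    demo[0, 1, 2, 0] = 1
    demo[3, 3, 3, 1] = 1
    idx, feats = dense_to_sparse(demo)
    print(idx.shape, feats.shape)
```

The round trip through sparse_to_dense is a convenient sanity check that no occupied voxel was lost during conversion.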
- Train 3D CNN
  Environment: molecule39-tf210
  Scripts:
    threedim/tensorflow/alexnet3d_keras_res294_multitask_training_regularization_binary_multichannel.py
    threedim/tensorflow/alexnet3d_keras_res294_singletask_training_regularization_binary_multichannel.py
- Train sparse 3D CNN
  Environment: wsl-molecule37-tf23-tf3d
  Scripts:
    threedim/tensorflow/alexnet3d_keras_res294_multitask_training_regularization_binary_multichannel_sparse.py
    threedim/tensorflow/alexnet3d_keras_res294_singletask_training_regularization_binary_multichannel_sparse.py