MolFTP: Molecular Fragment-Target Prevalence

High-performance molecular feature generation based on fragment-target prevalence statistics. MolFTP generates interpretable, statistically-grounded features for molecular property prediction with state-of-the-art performance.

📄 Research Paper: Fast Leave-One-Out Approximation from Fragment-Target Prevalence Vectors (molFTP) (arXiv:2510.06029)

Features

✨ Key-LOO Method: Statistical filtering with leave-one-out rescaling for improved extrapolation to novel fragments

🎯 Dummy-Masking Method: Per-fold feature masking for fair cross-validation while maximizing statistical power

🚀 Multi-Task Learning: Native support for multiple related prediction tasks with sparse labels (NaN handling)

⚡ High Performance: Optimized C++ implementation with Python bindings (10-100x faster than pure Python)

📊 Interpretable: Features based on fragment prevalence statistics (chi-squared, McNemar, Fisher's exact tests)

🔬 Production-Ready: Extensively validated, mathematically proven correct, publication-quality code

Installation

Requirements

Python >= 3.11
RDKit >= 2025.3.0
NumPy >= 1.19.0
C++17 compatible compiler

Install from source

# Clone the repository
git clone https://github.com/osmoai/molftp.git
cd molftp

# Create and activate conda environment with build tools
mamba create -n rdkit_dev cmake librdkit-dev eigen libboost-devel compilers
conda activate rdkit_dev

# Install Python dependencies
conda install -c conda-forge numpy pandas scikit-learn
conda install -c conda-forge rdkit

# Build and install
python setup.py install

Note: Use mamba for faster dependency resolution, or replace with conda if mamba is not installed.

Quick install with pip (coming soon)

pip install molftp

Quick Start

Single-Task Key-LOO

from molftp import MultiTaskPrevalenceGenerator
import numpy as np

# Your molecular data
smiles = ["CC", "CCC", "CCCC", "CCCCC", "CCCCCC"]
labels = np.array([0, 1, 0, 1, 0])

# Generate features with Key-LOO
gen = MultiTaskPrevalenceGenerator(radius=6, method='key_loo')
gen.fit(smiles, labels.reshape(-1, 1), task_names=['activity'])
features = gen.transform(smiles)

print(f"Features shape: {features.shape}")
# Features shape: (5, 27)  # 27 features per molecule

Multi-Task with Sparse Labels

# Multi-task labels (NaN = not measured)
labels_multitask = np.array([
    [0, 1, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
], dtype=float)

# Generate multi-task features
gen = MultiTaskPrevalenceGenerator(radius=6, method='key_loo')
gen.fit(smiles, labels_multitask, task_names=['task1', 'task2', 'task3'])
features = gen.transform(smiles)

print(f"Multi-task features shape: {features.shape}")
# Features shape: (3, 81)  # 27 features per task × 3 tasks

Examples

See the examples/ directory for comprehensive examples:

example_single_task_keyloo.py: Basic single-task feature generation
example_single_task_dummymask.py: Cross-validation with Dummy-Masking
example_multitask_keyloo.py: Multi-task feature generation
example_multitask_dummymask.py: Multi-task CV with sparse labels

Methods

Key-LOO (Key Leave-One-Out)

Filters keys appearing in <= k molecules (default k=2)
Applies rescaling factor: (n - k) / n for better extrapolation
Best for: Final model training, prediction on new molecules
Features are task-independent (can be pre-computed once)

Dummy-Masking

Builds prevalence on all available data
Masks test-only keys per fold (set to 0)
Renormalizes training keys by N_train / N_full
Best for: Fair cross-validation, hyperparameter tuning
Features are fold-dependent (computed per CV fold)

API Reference

MultiTaskPrevalenceGenerator

MultiTaskPrevalenceGenerator(
    radius=6,                    # Morgan fingerprint radius
    method='key_loo',           # 'key_loo' or 'dummy_masking'
    key_loo_k=2,               # Min molecules per key (Key-LOO only, default 2)
    rescale_key_loo=True,      # Apply rescaling (Key-LOO only)
    num_threads=-1,            # Number of threads (-1 = all cores)
    counting_method='total'    # 'total', 'unique', or 'binary'
)

Methods:

fit(smiles, labels, task_names): Build prevalence statistics
- smiles: List of SMILES strings
- labels: np.array of shape (n_molecules, n_tasks)
- task_names: List of task names
transform(smiles, train_indices_per_task=None): Generate features
- smiles: List of SMILES strings
- train_indices_per_task: For Dummy-Masking only, list of train indices per task
- Returns: np.array of shape (n_molecules, n_features)

Performance

On the BBBP dataset (Blood-Brain Barrier Penetration, 2039 molecules):

Method	Single-Task AUROC	Multi-Task AUROC	Speedup vs Python
Key-LOO	0.9369 ± 0.0115	0.9513 ± 0.0085	50-100x
Dummy-Masking	0.9205 ± 0.0149	0.9110 ± 0.0165	50-100x

Multi-task Key-LOO achieves state-of-the-art performance on BBB prediction tasks (new paper in preparation).

Citation

If you use MolFTP in your research, please cite:

@article{godin2025molftp,
  title={Fast Leave-One-Out Approximation from Fragment-Target Prevalence Vectors (molFTP): From Dummy Masking to Key-LOO for Leakage-Free Feature Construction},
  author={Godin, Guillaume},
  journal={arXiv preprint arXiv:2510.06029},
  year={2025},
  url={https://arxiv.org/abs/2510.06029}
}

Paper: Fast Leave-One-Out Approximation from Fragment-Target Prevalence Vectors (molFTP)
Code: https://github.com/osmoai/molftp

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

Author: Guillaume GODIN (Osmo labs pbc)
Built on RDKit for molecular structure handling
Uses pybind11 for Python-C++ interoperability
Inspired by statistical methods in cheminformatics and bioinformatics

Support

Issues: GitHub Issues
Documentation: See examples/ directory and this README
Contact: Open an issue for questions or bug reports

MolFTP - High-performance, interpretable molecular features for the modern ML era.

Developed by Guillaume GODIN @ Osmo labs pbc.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
.github/workflows		.github/workflows
examples		examples
molftp		molftp
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG_v1.3.0.md		CHANGELOG_v1.3.0.md
COMMIT_INSTRUCTIONS.md		COMMIT_INSTRUCTIONS.md
FILES_CHANGED.md		FILES_CHANGED.md
LICENSE		LICENSE
MULTITHREADING_PLAN.md		MULTITHREADING_PLAN.md
PHASE2_PHASE3_COMPLETE.md		PHASE2_PHASE3_COMPLETE.md
PHASE2_PHASE3_SUMMARY.md		PHASE2_PHASE3_SUMMARY.md
PR1_BODY.md		PR1_BODY.md
PR2_BODY.md		PR2_BODY.md
PR_BODY.md		PR_BODY.md
PR_DESCRIPTION.md		PR_DESCRIPTION.md
PR_MULTITHREAD_2D_3D.md		PR_MULTITHREAD_2D_3D.md
PR_STRUCTURE.md		PR_STRUCTURE.md
PR_SUMMARY.md		PR_SUMMARY.md
README.md		README.md
V1.5.0_READY_FOR_PR.md		V1.5.0_READY_FOR_PR.md
biodegradation_metrics_results.log		biodegradation_metrics_results.log
biodegradation_test_results.log		biodegradation_test_results.log
compile.log		compile.log
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
setup.py		setup.py
test_biodegradation_speed_metrics.py		test_biodegradation_speed_metrics.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MolFTP: Molecular Fragment-Target Prevalence

Features

Installation

Requirements

Install from source

Quick install with pip (coming soon)

Quick Start

Single-Task Key-LOO

Multi-Task with Sparse Labels

Examples

Methods

Key-LOO (Key Leave-One-Out)

Dummy-Masking

API Reference

MultiTaskPrevalenceGenerator

Performance

Citation

License

Contributing

Acknowledgments

Support

About

Uh oh!

Releases

Packages

Languages

License

osmoai/molftp

Folders and files

Latest commit

History

Repository files navigation

MolFTP: Molecular Fragment-Target Prevalence

Features

Installation

Requirements

Install from source

Quick install with pip (coming soon)

Quick Start

Single-Task Key-LOO

Multi-Task with Sparse Labels

Examples

Methods

Key-LOO (Key Leave-One-Out)

Dummy-Masking

API Reference

MultiTaskPrevalenceGenerator

Performance

Citation

License

Contributing

Acknowledgments

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages