diverse-seq implements computationally efficient, alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of k-mer frequencies corresponds well to sampling via conventional genetic distances. Compute time scales linearly with the number of sequences, and the work can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. diverse-seq can further boost the performance of phylogenetic estimation by providing a seed phylogeny that a more sophisticated algorithm can then refine. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
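To make the "entropy measure of k-mer frequencies" concrete, here is a minimal sketch of the Jensen-Shannon divergence (JSD) of a set of sequences, computed from their k-mer frequency profiles. It is illustrative only and is not the diverse-seq implementation: the function names (`kmer_freqs`, `entropy`, `jsd`) and the default `k` are assumptions chosen for this example, and it uses the textbook definition (entropy of the mean profile minus the mean of the per-sequence entropies).

```python
from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer in seq (hypothetical helper)."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency profile."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(seqs: list[str], k: int = 2) -> float:
    """Jensen-Shannon divergence of the k-mer profiles of seqs.

    Textbook definition: H(mean profile) - mean(H(profile_i)).
    """
    profiles = [kmer_freqs(s, k) for s in seqs]
    n = len(profiles)
    mean_profile: Counter = Counter()
    for profile in profiles:
        for kmer, p in profile.items():
            mean_profile[kmer] += p / n
    return entropy(mean_profile) - sum(entropy(p) for p in profiles) / n


if __name__ == "__main__":
    # Identical sequences add no diversity; disjoint ones add the most.
    print(jsd(["ACGTACGT", "ACGTACGT"]))
    print(jsd(["AAAA", "CCCC"], k=1))
```

A set whose members have very different k-mer profiles has a high JSD, which is the intuition behind picking a subset that maximises it.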
You can read more about the methods implemented in diverse-seq in the preprint here.
The user documentation is here.
For the full Jupyter experience, we recommend installing diverse-seq from PyPI with the `extra` dependencies:
pip install "diverse-seq[extra]"
For command-line-only usage, install as follows:
pip install diverse-seq
NOTE: If you experience any errors during installation, we recommend using `uv pip`, which provides much better error messages than the standard `pip` command. If you cannot resolve the installation problem, please open an issue on the GitHub repository.
uv also offers a simpler way to install dvs as a command-line-only tool:
uv tool install diverse-seq
Usage in this case is then:
uvx --from diverse-seq dvs
For a full listing of dependencies, see the pyproject.toml file.
dvs is the command line interface for diverse-seq.
The `dvs` subcommands
Usage: dvs [OPTIONS] COMMAND [ARGS]...
dvs -- alignment free detection of the most diverse sequences using JSD
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
demo-data Export a demo sequence file
prep Writes processed sequences to a <HDF5 file>.dvseqs.
max Identify the seqs that maximise average delta JSD
nmost Identify n seqs that maximise average delta JSD
ctree Quickly compute a cluster tree based on kmers for a collection...
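The summary above mentions estimating a tree from mash distances, and `ctree` builds its cluster tree from k-mers. As a rough illustration of what a mash-style distance is, the sketch below computes it from the exact Jaccard index of two k-mer sets. This is an assumption-laden simplification and not how `dvs ctree` is implemented: real mash estimates the Jaccard index from MinHash sketches, and the names `kmer_set` and `mash_distance`, the default `k`, and the cap at 1.0 are choices made for this example.

```python
from math import log


def kmer_set(seq: str, k: int) -> set[str]:
    """Set of distinct overlapping k-mers in seq (hypothetical helper)."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def mash_distance(a: str, b: str, k: int = 3) -> float:
    """Mash-style distance, D = -(1/k) * ln(2j / (1 + j)).

    j is the exact Jaccard index of the two k-mer sets; mash itself
    estimates j from MinHash sketches rather than from full sets.
    """
    kmers_a, kmers_b = kmer_set(a, k), kmer_set(b, k)
    union = kmers_a | kmers_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    j = len(kmers_a & kmers_b) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: treated as maximally distant here
    # Cap at 1.0, a choice made for this sketch so small k stays bounded.
    return min(1.0, -log(2 * j / (1 + j)) / k)


if __name__ == "__main__":
    print(mash_distance("ACGTACGT", "ACGTACGT"))  # identical sequences
    print(mash_distance("ACGTACGT", "ACGTTCGT"))  # one substitution
    print(mash_distance("AAAA", "CCCC"))          # nothing shared
```

A matrix of such pairwise distances is the usual input to a distance-based clustering method that produces a bifurcating tree.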
We make comparable capabilities available as cogent3 apps. The main difference is the app instances directly operate on, and return, cogent3 sequence collections. See the docs for demonstrations of how to use the apps.
diverse-seq is released under the BSD-3 license. If you want to contribute to the diverse-seq project (and we hope you do! 😇) the code of conduct and other useful developer information is available on the wiki.