sr2silo can convert millions of Short-Read nucleotide reads in the form of .bam CIGAR
alignments to cleartext alignments compatible with LAPIS-SILO v0.8.0+. It gracefully extracts insertions
and deletions. Optionally, sr2silo can translate and align each read using diamond / blastX, handling insertions and deletions in amino acid sequences as well.
Your input .bam/.sam with one line as:
294 163 NC_045512.2 79 60 31S220M = 197 400 CTCTTGTAGAT FGGGHHHHLMM ...
sr2silo outputs per read a JSON (compatible with LAPIS-SILO v0.8.0+):
{
"readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
"sampleId": "A1_05_2024_10_08",
"batchId": "20241024_2411515907",
"samplingDate": "2024-10-08",
"locationName": "Lugano (TI)",
"locationCode": "5",
"sr2siloVersion": "1.3.0",
"main": {
"sequence": "CGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTG",
"insertions": ["10:ACTG", "456:TACG"],
"offset": 4545
},
"S": {
"sequence": "MESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGV",
"insertions": ["23:A", "145:KLM"],
"offset": 78
},
"ORF1a": {
"sequence": "XXXMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLV",
"insertions": ["2323:TG", "2389:CA"],
"offset": 678
},
"E": null,
"M": null,
"N": null,
"ORF1b": null,
"ORF3a": null,
"ORF6": null,
"ORF7a": null,
"ORF7b": null,
"ORF8": null,
"ORF10": null
}The total output is handled in an .ndjson.zst.
When running sr2silo, particularly the process-from-vpipe command, be aware of memory and storage requirements:
- Standard configuration uses 8GB RAM and one CPU core
- Processing batches of 100k reads requires ~3GB RAM plus ~3GB for Diamond
- Temporary storage needs (especially on clusters) can reach 30-50GB
For detailed information about resource requirements, especially for cluster environments, please refer to the Resource Requirements documentation.
Originally this was started for wrangling short-read genomic alignments from wastewater-sampling, into a format for easy import into Loculus and its sequence database SILO.
sr2silo is designed to process nucleotide alignments from .bam files with metadata, translate and align reads in amino acids, gracefully handling all insertions and deletions and upload the results to the backend LAPIS-SILO v0.8.0+.
Output Format for LAPIS-SILO v0.8.0+:
- Metadata fields use camelCase naming (e.g.,
readId,sampleId,batchId) to align with Loculus standards - Metadata fields are at the root level (no nested "metadata" object)
- Genomic segments use a structured format with
sequence,insertions, andoffsetfields - The main nucleotide segment is required and contains the primary alignment
- Gene segments (S, ORF1a, etc.) contain amino acid sequences or
nullif empty - Insertions use the format
"position:sequence"(e.g.,"123:ACGT")
Output Schema Configuration:
The output schema is defined in src/sr2silo/silo_read_schema.py using Pydantic models with field aliases for camelCase output. To modify the metadata fields:
- Edit
src/sr2silo/silo_read_schema.py- Add/modify fields inReadMetadataclass - Update
resources/silo/database_config.yaml- Ensure field names match the Pydantic aliases - Run validation:
python tests/test_database_config_validation.py
The validation ensures your Pydantic schema matches the SILO database configuration.
For the V-Pipe to Silo implementation we include the following metadata fields at the root level:
{
"readId": "AV233803:AV044:2411515907:1:10805:5199:3294",
"sampleId": "A1_05_2024_10_08",
"batchId": "20241024_2411515907",
"samplingDate": "2024-10-08",
"locationName": "Lugano (TI)",
"locationCode": "5",
"sr2siloVersion": "1.3.0"
}To build the package and maintain dependencies, we use Poetry. In particular, it's good to install it and become familiar with its basic functionalities by reading the documentation.
sr2silo can be installed either from Bioconda or from source.
The easiest way to install sr2silo is through the Bioconda channel:
# Add necessary channels if you haven't already
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# Install sr2silo
conda install sr2siloFor development purposes or to install the latest version, you can install from source using Poetry:
The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the environments/ directory:
For basic usage of sr2silo:
make setupThis creates the core conda environment with essential dependencies and installs the package using Poetry.
For development work:
make setup-devThis command sets up the development environment with Poetry.
For working with the snakemake workflow:
make setup-workflowThis creates an environment specifically configured for running the sr2silo in snakemake workflows.
You can set up all environments at once:
make setup-allAfter setting up the development environment:
conda activate sr2silo-dev
poetry install --with dev
poetry run pre-commit installmake testor
conda activate sr2silo-dev
pytestsr2silo follows a two-step workflow:
- Process data:
sr2silo process-from-vpipe --help - Submit to Loculus:
sr2silo submit-to-loculus --help
# Example: Process V-Pipe data
sr2silo process-from-vpipe \
--input-file input.bam \
--sample-id SAMPLE_001 \
--timeline-file timeline.tsv \
--output-fp output.ndjson
# Example: Submit to Loculus (use environment variables for credentials)
export KEYCLOAK_TOKEN_URL=https://auth.example.com/token
export BACKEND_URL=https://api.example.com/submit
export GROUP_ID=123
export USERNAME=your-username
export PASSWORD=your-password
sr2silo submit-to-loculus --processed-file output.ndjson.zstNote: Use environment variables for credentials to avoid exposing sensitive information in command history.
Note: The --lapis-url parameter is optional. If not provided, sr2silo uses default SARS-CoV-2 references (NC_045512.2). See sr2silo process-from-vpipe --help for details.
sr2silo supports flexible configuration through environment variables, making it easy to use in different deployment scenarios including conda packages and pip installations.
Note: CLI parameters override environment variables
Common configuration via environment variables:
# Authentication credentials (recommended approach for security)
export KEYCLOAK_TOKEN_URL=https://auth.example.com/token
export BACKEND_URL=https://backend.example.com/api
export GROUP_ID=123
export USERNAME=your-username
export PASSWORD=your-password
# Run with environment variables set
sr2silo process-from-vpipe \
--input-file input.bam \
--sample-id SAMPLE_001 \
--timeline-file /path/to/timeline.tsv \
--output-fp output.ndjson
# Submission using environment variables for credentials
sr2silo submit-to-loculus \
--processed-file output.ndjson.zst