ChromDetect

A utility to classify scaffolds in genome assemblies based on naming conventions and size.

What It Does

ChromDetect classifies each scaffold in a genome assembly as a chromosome, an unlocalized sequence, or an unplaced sequence. It works by:

Matching scaffold names against common naming patterns (chr1, Super_scaffold_1, LG_1, NC_*, etc.)
Using size heuristics (scaffolds above a minimum size, 10 Mb by default, are likely chromosomes)
Adjusting for expected karyotype if you know the chromosome count

Why Use It?

Genome assemblies use inconsistent naming conventions:

Super_scaffold_1, chr1, LG_1, HiC_scaffold_1, NC_000001.11, scaffold_1_cov50...

If you need to quickly identify which scaffolds are chromosomes (for filtering, statistics, or downstream analysis), ChromDetect automates that classification.

This is a utility tool, not a validator. It doesn't detect misassemblies or verify correctness. For assembly QC, use tools like QUAST or Merqury.

Installation

pip install chromdetect

Or install from source:

git clone https://github.com/shandley/chromdetect.git
cd chromdetect
pip install -e .

Example Data

ChromDetect includes synthetic test assemblies in the examples/ directory:

# Try it immediately after installation
chromdetect examples/synthetic_assembly.fasta

# Compare two assembly versions
chromdetect examples/synthetic_assembly.fasta --compare examples/synthetic_assembly_v2.fasta

Downloading Real Genome Assemblies

For testing with real data, we recommend these small, well-annotated assemblies:

Saccharomyces cerevisiae S288C (Yeast, ~12 Mb, 16 chromosomes):

# Using NCBI datasets CLI (install: conda install -c conda-forge ncbi-datasets-cli)
datasets download genome accession GCF_000146045.2 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna

Caenorhabditis elegans (Nematode, ~100 Mb, 6 chromosomes):

datasets download genome accession GCF_000002985.6 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000002985.6/*.fna

Arabidopsis thaliana (Plant, ~135 Mb, 5 chromosomes):

datasets download genome accession GCF_000001735.4 --include genome
unzip ncbi_dataset.zip
chromdetect ncbi_dataset/data/GCF_000001735.4/*.fna --karyotype 5

For more test data options, see NCBI Datasets or GenomeArk (VGP assemblies).

Quick Start

Command Line

# Basic usage - get summary
chromdetect assembly.fasta

# Output JSON for programmatic use
chromdetect assembly.fasta --format json --output results.json

# Use karyotype information for better accuracy
chromdetect assembly.fasta --karyotype 24

# Export only chromosome-level scaffolds as TSV
chromdetect assembly.fasta --chromosomes-only --format tsv > chromosomes.tsv

# Export as BED or GFF format for pipeline integration
chromdetect assembly.fasta --format bed > scaffolds.bed
chromdetect assembly.fasta --format gff > scaffolds.gff

# Extract chromosome sequences to a new FASTA file
chromdetect assembly.fasta --extract-chromosomes chromosomes.fasta

# Batch process multiple assemblies
chromdetect --batch assemblies_dir/ --output results_dir/

# Compare two assemblies side-by-side
chromdetect assembly_v1.fasta --compare assembly_v2.fasta

# Generate visual HTML report
chromdetect assembly.fasta --format html -o report.html

# Use custom naming patterns
chromdetect assembly.fasta --patterns custom_patterns.yaml

# Use NCBI assembly report for accurate classification
chromdetect assembly.fasta --assembly-report GCF_000001405.assembly_report.txt

Python API

# Simple one-liner classification (recommended for most use cases)
from chromdetect import classify_fasta, compare_fasta_files

results, stats = classify_fasta("assembly.fasta")
print(f"Found {stats.chromosome_count} chromosomes")
print(f"N50: {stats.n50 / 1e6:.1f} Mb")

# Compare two assemblies
comparison = compare_fasta_files("assembly_v1.fasta", "assembly_v2.fasta")
print(f"Shared chromosomes: {len(comparison.shared_chromosomes)}")
print(f"N50 change: {comparison.summary()['n50_difference']:,} bp")

For more control, use the lower-level API:

from chromdetect import (
    parse_fasta, classify_scaffolds, write_fasta, format_bed, format_gff,
    parse_assembly_report
)

# Parse and classify with options
scaffolds = parse_fasta("assembly.fasta.gz")
results, stats = classify_scaffolds(scaffolds, expected_chromosomes=24)

# Print summary
print(f"Found {stats.chromosome_count} chromosomes")
print(f"Total assembly: {stats.total_length / 1e9:.2f} Gb")

# Access individual scaffold classifications
for r in results:
    if r.classification == "chromosome":
        print(f"{r.name}: {r.length:,} bp (confidence: {r.confidence:.2f})")

# Export to BED or GFF format
bed_output = format_bed(results)
gff_output = format_gff(results)

# Use NCBI assembly report for authoritative classification
report = parse_assembly_report("assembly_report.txt")
results, stats = classify_scaffolds(scaffolds, assembly_report=report)

Output Formats

Summary (default)

============================================================
CHROMDETECT ASSEMBLY ANALYSIS
============================================================

Total scaffolds:     1,234
Total length:        2,876,543,210 bp (2.88 Gb)
N50:                 45,678,901 bp (45.7 Mb)
N90:                 12,345,678 bp
Largest scaffold:    198,765,432 bp

Scaffold Classification:
  Chromosomes:       24 (2.85 Gb)
  Unlocalized:       15
  Unplaced:          1,195

Chromosome N50:      118,234,567 bp (118.2 Mb)
GC content:          41.2%

JSON

{
  "summary": {
    "total_scaffolds": 1234,
    "chromosome_count": 24,
    "n50": 45678901,
    ...
  },
  "scaffolds": [
    {
      "name": "chr1",
      "length": 198765432,
      "classification": "chromosome",
      "confidence": 0.95,
      "detection_method": "name_chr_explicit",
      "chromosome_id": "1"
    },
    ...
  ]
}
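The JSON output can be consumed with the standard library alone. The sketch below uses only the field names shown above; the sample values and the helper name `chromosome_names` are made up for illustration.

```python
import json

# Made-up stand-in for the contents of results.json.
SAMPLE = """
{
  "summary": {"total_scaffolds": 3, "chromosome_count": 2, "n50": 198765432},
  "scaffolds": [
    {"name": "chr1", "length": 198765432, "classification": "chromosome",
     "confidence": 0.95, "detection_method": "name_chr_explicit", "chromosome_id": "1"},
    {"name": "chr2", "length": 175432198, "classification": "chromosome",
     "confidence": 0.93, "detection_method": "name_chr_explicit", "chromosome_id": "2"},
    {"name": "scaffold_77", "length": 12345, "classification": "unplaced",
     "confidence": 0.6}
  ]
}
"""


def chromosome_names(report, min_confidence=0.0):
    """List names of scaffolds called as chromosomes at or above a confidence."""
    return [
        s["name"]
        for s in report["scaffolds"]
        if s["classification"] == "chromosome" and s["confidence"] >= min_confidence
    ]


report = json.loads(SAMPLE)
confident = chromosome_names(report, min_confidence=0.94)
```

For a real run, replace `json.loads(SAMPLE)` with `json.load(open("results.json"))`.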

TSV

name    length    classification    confidence    method    chromosome_id
chr1    198765432    chromosome    0.95    name_chr_explicit    1
chr2    175432198    chromosome    0.93    name_chr_explicit    2
...

BED

Standard BED6 format for integration with bedtools, IGV, and other genomics tools:

chr1    0    198765432    chromosome    950    .
chr2    0    175432198    chromosome    930    .
...
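Judging from the examples above, the BED score column appears to be the confidence scaled to the 0-1000 range that BED allows (0.95 becomes 950). A sketch of that mapping, inferred from the sample output rather than taken from ChromDetect's source (`to_bed6` is a hypothetical helper):

```python
def to_bed6(name, length, classification, confidence):
    """Format one scaffold as a BED6 line spanning the whole scaffold."""
    # BED scores are integers in 0..1000; clamp in case of out-of-range input.
    score = max(0, min(1000, round(confidence * 1000)))
    # BED is 0-based, half-open, so a full scaffold runs from 0 to its length.
    return "\t".join([name, "0", str(length), classification, str(score), "."])


line = to_bed6("chr1", 198765432, "chromosome", 0.95)
```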

GFF3

GFF3 format with classification metadata in attributes:

##gff-version 3
chr1    chromdetect    chromosome    1    198765432    0.950    .    .    ID=chr1;Name=chr1;classification=chromosome;detection_method=name_chr_explicit;chromosome_id=1
...
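To read the classification back out of a GFF3 record, split the ninth column on `;` and `=`. This is a small stdlib-only sketch; the function names are ours and the parser handles only the simple key=value attributes shown above.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Turn 'ID=chr1;Name=chr1;...' into a dict (GFF3 values are URL-escaped)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[unquote(key)] = unquote(value)
    return attrs


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    return {
        "seqid": cols[0],
        "start": int(cols[3]),  # GFF3 coordinates are 1-based, inclusive
        "end": int(cols[4]),
        "score": float(cols[5]),
        "attributes": parse_gff3_attributes(cols[8]),
    }


record = parse_gff3_line(
    "chr1\tchromdetect\tchromosome\t1\t198765432\t0.950\t.\t.\t"
    "ID=chr1;Name=chr1;classification=chromosome;"
    "detection_method=name_chr_explicit;chromosome_id=1"
)
```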

Options

Option	Description
`-f, --format`	Output format: `summary`, `json`, `tsv`, `bed`, `gff`, `html` (default: summary)
`-o, --output`	Write output to file instead of stdout
`-k, --karyotype`	Expected chromosome count for karyotype-informed detection
`-s, --min-size`	Minimum scaffold size (bp) to be considered chromosome-level (default: 10 Mb)
`-c, --chromosomes-only`	Only output chromosome-level scaffolds
`--extract-chromosomes`	Extract chromosome sequences to a FASTA file
`--batch`	Process all FASTA files in a directory
`--compare`	Compare with a second assembly (side-by-side analysis)
`--patterns`	Custom patterns file (YAML or JSON) for scaffold name matching
`--assembly-report`	NCBI assembly report file for authoritative classification
`--min-confidence`	Minimum confidence threshold (0.0-1.0) to include scaffolds
`--min-length`	Minimum scaffold length (bp) to include in output
`-q, --quiet`	Suppress progress messages
`-v, --verbose`	Show detailed processing information

Supported Naming Conventions

ChromDetect recognizes these naming patterns (case-insensitive):

Pattern	Examples	Method
Explicit chromosome	`chr1`, `chromosome_X`, `Chr_MT`	`name_chr_explicit`
Super scaffold	`Super_scaffold_1`, `Superscaffold_X`	`name_super_scaffold`
SUPER	`SUPER_1`, `SUPER1`	`name_SUPER`
Linkage group	`LG1`, `LG_X`	`name_linkage_group`
NCBI RefSeq	`NC_000001.11`	`name_ncbi_refseq`
NCBI GenBank	`CM000001.1`	`name_ncbi_genbank`
HiC scaffold	`HiC_scaffold_1`	`name_hic_scaffold`
RaGOO	`Scaffold_1_RaGOO`	`name_ragoo`
Simple numeric	`1`, `X`, `MT`	`name_numeric`

Patterns that indicate unlocalized scaffolds:

*_random, *_unloc*, chrUn_*

Patterns that indicate unplaced scaffolds (contigs/fragments):

*_ctg*, *contig*, *_arrow_*, *_pilon*, *_hap*
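Name matching of this kind can be sketched with a handful of regular expressions. The patterns below are illustrative approximations covering a few rows of the table, not the ones in chromdetect/patterns.py; the method names `name_unlocalized` and `name_fragment`, and the choice to test unlocalized and fragment patterns before chromosome patterns, are assumptions.

```python
import re

# Approximate chromosome patterns; group 1 captures the chromosome ID.
CHROMOSOME_PATTERNS = [
    (re.compile(r"^chr(?:omosome)?_?(\d+|[XYZW]|MT?)$", re.IGNORECASE), "name_chr_explicit"),
    (re.compile(r"^super_?scaffold_?(\d+|[XYZW])$", re.IGNORECASE), "name_super_scaffold"),
    (re.compile(r"^LG_?(\d+|[XYZW])$", re.IGNORECASE), "name_linkage_group"),
    (re.compile(r"^HiC_scaffold_(\d+)$", re.IGNORECASE), "name_hic_scaffold"),
]
# Regex equivalents of the glob-style lists above.
UNLOCALIZED = re.compile(r"_random$|_unloc|^chrUn_", re.IGNORECASE)
FRAGMENT = re.compile(r"_ctg|contig|_arrow_|_pilon|_hap", re.IGNORECASE)


def match_name(name):
    """Return (classification, method, chromosome_id); all None if nothing matches."""
    if UNLOCALIZED.search(name):
        return ("unlocalized", "name_unlocalized", None)
    if FRAGMENT.search(name):
        return ("unplaced", "name_fragment", None)
    for regex, method in CHROMOSOME_PATTERNS:
        m = regex.match(name)
        if m:
            return ("chromosome", method, m.group(1))
    return (None, None, None)


calls = {n: match_name(n) for n in ["chr1", "LG_X", "chr1_random", "scaffold_1_cov50"]}
```

A name with no match, such as `scaffold_1_cov50`, is left for the size heuristics to decide.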

How It Works

ChromDetect combines name-based and size-based detection with these priority rules:

Strong name match (confidence ≥ 0.8) takes priority
Large scaffold + weak name match = chromosome with boosted confidence
Large scaffold + no name match = chromosome with reduced confidence
Small scaffold = unplaced regardless of name
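One way to read these rules is as a short decision chain evaluated top to bottom. The sketch below assumes exactly that ordering (so a strong name match wins even for a small scaffold); the boost of 0.2 and the size-only confidence of 0.5 are placeholder values, not ChromDetect's actual numbers.

```python
MIN_CHROMOSOME_SIZE = 10_000_000  # the documented --min-size default (10 Mb)
STRONG_MATCH = 0.8                # threshold from the first rule above
WEAK_BOOST = 0.2                  # placeholder: how much size boosts a weak match
SIZE_ONLY_CONFIDENCE = 0.5        # placeholder: confidence with no name evidence


def classify_by_rules(name_confidence, length, min_size=MIN_CHROMOSOME_SIZE):
    """Combine a name-match confidence (0.0 = no match) with a size check."""
    is_large = length >= min_size
    if name_confidence >= STRONG_MATCH:
        return ("chromosome", name_confidence)
    if is_large and name_confidence > 0:
        return ("chromosome", min(1.0, name_confidence + WEAK_BOOST))
    if is_large:
        return ("chromosome", SIZE_ONLY_CONFIDENCE)
    return ("unplaced", SIZE_ONLY_CONFIDENCE)


strong_small = classify_by_rules(0.95, 5_000_000)
weak_large = classify_by_rules(0.5, 50_000_000)
```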

When --karyotype is provided:

If too many candidates: demote lowest-confidence chromosomes
If too few candidates: promote largest unplaced scaffolds
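The karyotype adjustment can be sketched as a trim-or-fill step over the initial calls. This is an illustration of the two rules above, not ChromDetect's implementation; the dict layout and the choice to demote to "unplaced" are assumptions.

```python
def adjust_for_karyotype(calls, expected):
    """Adjust chromosome calls in place so their count moves toward `expected`.

    calls: list of dicts with 'name', 'length', 'classification', 'confidence'.
    """
    chroms = [c for c in calls if c["classification"] == "chromosome"]
    if len(chroms) > expected:
        # Too many candidates: demote the lowest-confidence chromosomes.
        chroms.sort(key=lambda c: c["confidence"])
        for c in chroms[: len(chroms) - expected]:
            c["classification"] = "unplaced"
    elif len(chroms) < expected:
        # Too few candidates: promote the largest unplaced scaffolds.
        unplaced = sorted(
            (c for c in calls if c["classification"] == "unplaced"),
            key=lambda c: c["length"],
            reverse=True,
        )
        for c in unplaced[: expected - len(chroms)]:
            c["classification"] = "chromosome"
    return calls
```

If fewer unplaced scaffolds exist than are needed, the result simply stays below the expected count.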

Use Cases

VGP Assembly Classification

# Classify scaffolds in a VGP curated assembly
chromdetect species.pri.cur.fasta.gz --karyotype 24 --format json

Multi-Assembly Classification

from chromdetect import classify_fasta

# Classify multiple assemblies independently
species = [
    ("human.fa", 23),
    ("mouse.fa", 20),
    ("zebrafish.fa", 25),
]

for fasta, expected_chr in species:
    results, stats = classify_fasta(fasta)
    print(f"{fasta}: {stats.chromosome_count} chromosomes detected (expected {expected_chr})")

Note: This classifies each assembly independently. ChromDetect does not perform synteny analysis or identify homologous chromosomes across species.

Pipeline Integration

# As part of assembly QC pipeline
chromdetect assembly.fasta --format json | jq '.summary.chromosome_count'

# Export scaffold regions in BED format for downstream analysis
chromdetect assembly.fasta --format bed --chromosomes-only > chromosomes.bed
bedtools getfasta -fi assembly.fasta -bed chromosomes.bed -fo chr_regions.fa

Batch Processing

# Process all assemblies in a directory
chromdetect --batch assemblies/ --format json --output results/

# This creates:
# - results/assembly1.json
# - results/assembly2.json
# - ...
# - results/batch_summary.tsv  (overview of all assemblies)

Extract Chromosome Sequences

# Extract only chromosome-level sequences to a new FASTA
chromdetect assembly.fasta --extract-chromosomes chromosomes.fasta

# Combine with other options
chromdetect assembly.fasta \
    --karyotype 24 \
    --extract-chromosomes chromosomes.fasta \
    --format json --output report.json

Using NCBI Assembly Reports

# Download an assembly report from NCBI
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40

# Use it for authoritative scaffold classification
chromdetect GRCh38.fasta --assembly-report GCF_000001405.40_GRCh38.p14_assembly_report.txt
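An NCBI assembly report is a tab-separated file whose comment lines start with `#` and whose Sequence-Role column already states whether each sequence is an assembled molecule, an unlocalized scaffold, or an unplaced scaffold. The sketch below shows how those roles map onto ChromDetect's three classes. It is independent of `parse_assembly_report`, and the sample rows and accessions are made up.

```python
ROLE_TO_CLASS = {
    "assembled-molecule": "chromosome",
    "unlocalized-scaffold": "unlocalized",
    "unplaced-scaffold": "unplaced",
}


def read_assembly_report(lines):
    """Map sequence name, GenBank and RefSeq accessions to a classification."""
    classes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # not a full data row
        seq_name, role, genbank, refseq = cols[0], cols[1], cols[4], cols[6]
        label = ROLE_TO_CLASS.get(role, "other")
        # FASTA headers may use any of these identifiers, so index all of them.
        for key in (seq_name, genbank, refseq):
            if key and key != "na":
                classes[key] = label
    return classes


# Made-up rows in the report's 10-column layout.
EXAMPLE_REPORT = [
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\tAssigned-Molecule-Location/Type"
    "\tGenBank-Accn\tRelationship\tRefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name",
    "1\tassembled-molecule\t1\tChromosome\tXX000001.1\t=\tZZ_000001.1\tPrimary Assembly\t5000000\tna",
    "1_unloc_1\tunlocalized-scaffold\t1\tChromosome\tXX000002.1\t=\tna\tPrimary Assembly\t40000\tna",
    "scaf_9\tunplaced-scaffold\tna\tna\tXX000003.1\t=\tna\tPrimary Assembly\t1200\tna",
]
classes = read_assembly_report(EXAMPLE_REPORT)
```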

Limitations

ChromDetect relies on heuristics and pattern matching, so it has inherent limitations:

Not a validator: ChromDetect classifies scaffolds but cannot detect misassemblies, inversions, or sequence errors. Use QUAST, Merqury, or similar tools for assembly validation.
Pattern-dependent: Classification relies on naming conventions. Unusual or custom naming schemes may not be recognized without custom patterns.
Size heuristics are approximate: A 50 Mb scaffold is assumed to be chromosome-level, but could be a misassembly or concatenated contigs.
No reference comparison: ChromDetect doesn't compare against reference genomes, so it cannot identify missing chromosomes or structural variants.

For critical applications, combine ChromDetect with comprehensive QC tools and manual curation.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Adding New Patterns

To add support for a new naming convention:

Add the regex pattern to chromdetect/patterns.py
Include a descriptive method name
Ensure the pattern captures chromosome ID in group 1
Add tests in tests/test_patterns.py

Example:

# In patterns.py
CHROMOSOME_PATTERNS.append(
    (r'^MyConvention_(\d+)$', 'my_convention'),
)
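Before adding a pattern, it helps to confirm that group 1 really captures the chromosome ID. A quick stdlib check along these lines could also serve as the basis for a test; `extract_chromosome_id` is a hypothetical helper, not part of ChromDetect.

```python
import re


def extract_chromosome_id(pattern, name):
    """Return what group 1 captures for `name`, or None if there is no match."""
    m = re.match(pattern, name)
    return m.group(1) if m else None


pattern = r"^MyConvention_(\d+)$"
captured = extract_chromosome_id(pattern, "MyConvention_12")
rejected = extract_chromosome_id(pattern, "MyConvention_X")
```

Here `rejected` is `None` because `\d+` does not accept `X`; widen the group if the convention includes sex chromosomes.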

Using Custom Patterns

You can also use custom patterns without modifying the source code:

# custom_patterns.yaml
chromosome_patterns:
  - pattern: "^MyScaffold_(\\d+)$"
    name: "my_scaffold"
  - pattern: "^CustomChr_(\\d+)$"
    name: "custom_chr"
unlocalized_patterns:
  - my_random
fragment_patterns:
  - my_contig

chromdetect assembly.fasta --patterns custom_patterns.yaml
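Since `--patterns` also accepts JSON, a patterns file can be sanity-checked with the standard library before it is handed to ChromDetect. The sketch assumes the JSON form uses the same keys as the YAML example above; `load_patterns` is our own helper.

```python
import json
import re

# Assumed JSON equivalent of custom_patterns.yaml.
CUSTOM_PATTERNS_JSON = r'''
{
  "chromosome_patterns": [
    {"pattern": "^MyScaffold_(\\d+)$", "name": "my_scaffold"},
    {"pattern": "^CustomChr_(\\d+)$", "name": "custom_chr"}
  ],
  "unlocalized_patterns": ["my_random"],
  "fragment_patterns": ["my_contig"]
}
'''


def load_patterns(text):
    """Compile chromosome patterns, rejecting any without a capture group."""
    config = json.loads(text)
    compiled = []
    for entry in config.get("chromosome_patterns", []):
        regex = re.compile(entry["pattern"], re.IGNORECASE)
        if regex.groups < 1:
            raise ValueError(f"pattern {entry['pattern']!r} needs a capture group for the ID")
        compiled.append((regex, entry["name"]))
    return compiled


patterns = load_patterns(CUSTOM_PATTERNS_JSON)
```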

Citation

If you use ChromDetect in your research, please cite it using the metadata from our CITATION.cff file:

@software{chromdetect,
  author = {Handley, Scott A.},
  title = {ChromDetect: A utility for classifying scaffolds in genome assemblies},
  url = {https://github.com/shandley/chromdetect},
  version = {0.5.0},
  year = {2024}
}

License

MIT License - see LICENSE for details.

Related Projects

QUAST - Quality assessment tool for genome assemblies
Verity - Hi-C-based assembly validation framework
