
Direct GFF reading: 1001G scalability #101

@josiahseaman

For the 1001 Genomes project we have 23 A. thaliana genomes of 150 Mbp each. Each genome will have its own annotation of >30,000 genes, for a total of roughly 750,000 paths. This calls for some thought about scalability.

Important Note: the 1001G graph was previously built with RevealGraph; it is now built with seqwish.

@ekg a key question: will the alignment-based approach really scale to 750,000 short sequences, or will it break down at some point? My main concern is that it eventually becomes a statistical certainty that we'll get off-target matches (even at 100% identity), or that some annotations will fail to meet the alignment criteria. @AndreaGuarracino and @mandosoft mentioned needing to tweak the alignment parameters to get it to work with the 30 SARS-CoV-2 genes. The obvious case is short UTRs, where the sequences are clearly not unique but we annotate short stretches because of their context.
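To put a rough number on the off-target concern, here is a back-of-the-envelope sketch in Python. It assumes a uniformly random genome model, which is an assumption of this sketch and not something measured on the 1001G data; real genomes are repetitive, so the true off-target rate is likely higher. The function name `expected_chance_matches` is hypothetical.

```python
def expected_chance_matches(query_len: int, total_bp: float,
                            both_strands: bool = True) -> float:
    """Expected number of exact chance hits for one random query.

    Crude model (an assumption): every base is drawn uniformly and
    independently from A/C/G/T, so a query of length k matches at any
    given position with probability 4**-k.
    """
    positions = max(total_bp - query_len + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** query_len


# 23 genomes x 150 Mbp, the figures quoted in this issue.
TOTAL_BP = 23 * 150e6

for k in (12, 16, 20, 30):
    print(f"query length {k:>2}: "
          f"{expected_chance_matches(k, TOTAL_BP):.3g} expected chance hits")
```

Under this model, very short features are expected to hit somewhere by chance alone, and the expectation falls off quickly as the query gets longer. That is a floor, not a prediction: repeats and gene families will add off-target hits that the random model does not capture.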

Changes Needed

  • Remove list of path names from each chunk file
  • Stress test Annotation from alignment
  • Translate Annotation coordinates directly into Node ids rather than using alignment from sequence
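The last item could look something like the sketch below. It assumes each genome is already embedded in the graph as a path (an ordered list of node ids) and that node lengths are known; the names `build_offsets` and `interval_to_nodes` are hypothetical and not part of any existing tool. Node orientation is ignored for simplicity. GFF coordinates are 1-based and inclusive, so they are converted to 0-based half-open before the lookup.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


def build_offsets(node_lengths: Sequence[int]) -> List[int]:
    """0-based start offset of each path step, plus the total path length."""
    return [0] + list(accumulate(node_lengths))


def interval_to_nodes(path_nodes: Sequence[int], offsets: Sequence[int],
                      gff_start: int, gff_end: int) -> List[int]:
    """Node ids overlapped by a GFF interval (1-based, inclusive) on one path."""
    start0 = gff_start - 1   # 0-based, inclusive
    end0 = gff_end           # 0-based, exclusive
    if start0 < 0 or end0 > offsets[-1] or start0 >= end0:
        raise ValueError("interval falls outside the path")
    # Binary search: index of the step containing each endpoint.
    first = bisect_right(offsets, start0) - 1
    last = bisect_right(offsets, end0 - 1) - 1
    return list(path_nodes[first:last + 1])


# Toy path: four nodes with lengths 4, 3, 5 and 2 bp.
path_nodes = [1, 2, 5, 7]
offsets = build_offsets([4, 3, 5, 2])

print(interval_to_nodes(path_nodes, offsets, 3, 8))
```

Each lookup is a binary search over the path's offsets, so the cost per annotation is logarithmic in the number of steps on the path and does not depend on sequence uniqueness, which is the failure mode of the alignment approach.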

Labels: enhancement (New feature or request), question (Further information is requested)
