Direct GFF reading: 1001G scalability #101

For the 1001 Genomes project we have 23 *A. thaliana* genomes of about 150 Mbp each. Each genome will have its own annotation of >30,000 genes, for a total of roughly 750,000 paths. This requires some thought about scalability.

**Important Note:** The 1001G graph was previously built with RevealGraph; it is now built with seqwish.

@ekg a key question: will the alignment-based approach really scale to 750,000 short sequences, or will it break down at some point? My main concern is that with this many sequences it becomes a statistical certainty that we'll get off-target matches (even at 100% identity), or that some annotations will fail to meet the alignment criteria. @AndreaGuarracino and @mandosoft mentioned needing to tweak the alignment parameters to get it to work with the 30 SARS-CoV-2 genes. The obvious case is short UTRs, whose sequences are clearly not unique but which we annotate as short stretches because of their context.
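To put a rough number on the off-target concern, here is a minimal Python sketch. It assumes a uniform random base model, which real genomes violate (repeats make the true count higher), so treat the estimate as optimistic. The function names `expected_chance_hits` and `count_occurrences` are hypothetical and not part of any existing tool.

```python
def expected_chance_hits(k: int, genome_len: int, n_genomes: int = 1) -> float:
    """Expected number of exact chance matches for one random sequence of
    length k, counting both strands, under a uniform base model (assumption)."""
    positions = max(genome_len - k + 1, 0)
    return 2 * positions * n_genomes / 4 ** k


def count_occurrences(feature: str, genome: str) -> int:
    """Count overlapping exact occurrences of `feature` on the forward strand.
    A count above 1 means exact matching alone cannot place the feature."""
    count = 0
    start = genome.find(feature)
    while start != -1:
        count += 1
        start = genome.find(feature, start + 1)
    return count


if __name__ == "__main__":
    # A 10 bp feature against one 150 Mbp genome: hundreds of chance hits.
    print(expected_chance_hits(10, 150_000_000))
    # Toy uniqueness check on a made-up sequence.
    print(count_occurrences("ACG", "TTACGTTACG"))
```

Under this model the expected count falls by a factor of 4 per extra base, so the risk is concentrated in the shortest features, which matches the UTR case above.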

### Changes Needed
- [ ] Remove list of path names from each chunk file
- [ ] Stress-test annotation from alignment
- [ ] Translate annotation coordinates directly into node IDs rather than aligning the annotated sequences
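
The last item could look something like the sketch below. It assumes each genome is stored in the graph as a path, given here as an ordered list of node IDs with known node lengths, and that GFF coordinates are 1-based and inclusive. The names `build_offsets` and `feature_to_nodes` are hypothetical, and the sketch ignores node orientation and offsets within the first and last node.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(node_lengths):
    """Cumulative 0-based start offset of each path step, plus the path length."""
    return [0] + list(accumulate(node_lengths))


def feature_to_nodes(path_nodes, offsets, gff_start, gff_end):
    """Return the node IDs overlapped by a GFF feature on this path.

    gff_start and gff_end are 1-based inclusive, as in GFF.
    """
    start0 = gff_start - 1  # 0-based inclusive start
    end0 = gff_end          # 0-based exclusive end
    if start0 < 0 or end0 > offsets[-1] or start0 >= end0:
        raise ValueError("feature lies outside the path or is empty")
    first = bisect_right(offsets, start0) - 1    # step containing the first base
    last = bisect_right(offsets, end0 - 1) - 1   # step containing the last base
    return path_nodes[first:last + 1]


if __name__ == "__main__":
    # Toy path: three nodes of length 4, 3 and 5 (made-up values).
    path_nodes = [11, 12, 13]
    offsets = build_offsets([4, 3, 5])
    print(feature_to_nodes(path_nodes, offsets, 4, 5))  # spans nodes 11 and 12
```

Each lookup is two binary searches over the path's offsets, so no sequence comparison is involved and short or non-unique features are placed by position alone.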

