Skip to content

Conversation

@adamnovak
Copy link
Member

Changelog Entry

To be copied to the draft changelog by merger:

  • Giraffe now just uses a single chaining pass, instead of a fragmenting pass and then a chaining pass

Description

To avoid metaphysical angst about why recombination penalties at fragmenting make things worse instead of better, this PR removes fragmenting entirely (on top of some commits merely bypassing it).

Bypassing fragmenting seems to decrease speed substantially on simulated hifi reads, increase accuracy somewhat on simulated hifi reads, and decrease speed somewhat on real hifi reads. (I haven't gotten R10 results yet because my whole-node timing jobs are still in queue.)

This code has been almost all synthesized by Anthropic Claude, using almost all of its patience (aka token limit for the day). I reviewed it and it appears to have done what I wanted to do and glommed the two step functions together (even though it did this by writing a new one and then deleting the old ones), but this still needs to be tested for mapping and calling accuracy effects (vs. d1625a9).

This is what I managed to get out of Anthropic Claude on the subject of
removing fragmenting and coalescing things to go straight from zip code
trees to chaining.

I had to make a couple changes to make it pass the Giraffe tests.

I read through the code and it looks plausible, but it's possible the
funnel logic is wrong or that the less apt parameter defaults get kept.

This needs to be evaluated for mapping and calling accuracy against the
version that has the fragmenting code but defaults to bypassing it.
@faithokamoto
Copy link
Contributor

If we're getting rid of chaining, the generate_zip_tree_transitions() function can be significantly simplified. Currently we have to be very careful about checking whether each seed actually corresponds to an anchor border, e.g.

// Both were traversed in the same orientation as the read.
// They might not be at anchor borders though, so check.
auto found_source_anchor = seed_to_ending.find(source_seed.seed);
if (found_source_anchor != seed_to_ending.end()) {

But if we simply chain all seeds directly, then every seed will correspond to an anchor border, since every seed will be its own anchor.

@adamnovak
Copy link
Member Author

@faithokamoto We can't get rid of the abstraction of having an Anchor represent several seeds unless we also get rid of the gapless extension feature in the chaining codepath, since we use it there too:

// We have seeds here and can make an anchor
// Note the index of the new anchor
extension_anchor_indexes.push_back(extension_anchors.size());
// Make the actual anchor out of this range of seeds and this read range.
extension_anchors.push_back(to_anchor(aln, anchor_interval.first, anchor_interval.second, anchor_seeds, seed_anchors, internal_mismatch_begin, internal_mismatch_end, gbwt_graph, this->get_regular_aligner()));

So I think even with the removal of fragmenting, we still have to deal with having seeds in play that are not Anchor boundaries.

@adamnovak
Copy link
Member Author

I checked this on calling with ac3735ca988928881a06b84ba46e500fec89f275 of https://github.com/vgteam/recombination-aware-giraffe-experiments.

==> ./output/experiments/hifi_real_full_call_chm13/results/snp_errors.tsv <==
giraffe-0793c7	24255
giraffe-55e964	24364

==> ./output/experiments/hifi_real_full_call_chm13/results/indel_errors.tsv <==
giraffe-0793c7	61665
giraffe-55e964	61669

==> ./output/experiments/hifi_real_full_call_chm13/results/total_errors.tsv <==
giraffe-0793c7	85920
giraffe-55e964	86033

It looks like this removes a few calling errors. I also evaluated speed previously and we don't get too much slower.

So I think this is ready.

@adamnovak adamnovak marked this pull request as ready for review January 6, 2026 15:30
@adamnovak
Copy link
Member Author

adamnovak commented Jan 6, 2026

This should also be tested on R10 and Illumina reads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants