Remove fragmenting from Giraffe #4765

adamnovak · 2025-12-03T20:42:21Z

Changelog Entry

To be copied to the draft changelog by merger:

Giraffe now uses a single chaining pass, instead of a fragmenting pass followed by a chaining pass.

Description

To avoid metaphysical angst about why recombination penalties at fragmenting make things worse instead of better, this PR removes fragmenting entirely (on top of some commits merely bypassing it).

On simulated HiFi reads, bypassing fragmenting seems to decrease speed substantially and increase accuracy somewhat. On real HiFi reads, it seems to decrease speed somewhat. (I haven't gotten R10 results yet because my whole-node timing jobs are still in queue.)

This code has been almost all synthesized by Anthropic Claude, using almost all of its patience (aka token limit for the day). I reviewed it and it appears to have done what I wanted to do and glommed the two step functions together (even though it did this by writing a new one and then deleting the old ones), but this still needs to be tested for mapping and calling accuracy effects (vs. d1625a9).

This is what I managed to get out of Anthropic Claude on the subject of removing fragmenting and coalescing things to go straight from zip code trees to chaining. I had to make a couple changes to make it pass the Giraffe tests. I read through the code and it looks plausible, but it's possible the funnel logic is wrong or that the less apt parameter defaults get kept. This needs to be evaluated for mapping and calling accuracy against the version that has the fragmenting code but defaults to bypassing it.
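For readers unfamiliar with the term, here is a minimal sketch of what a single chaining pass means: one dynamic-programming pass that scores colinear chains of anchors. This is a generic toy, not the code in this PR. `ToyAnchor`, `chain_scores`, `best_chain_score`, the linear reference coordinate, and the gap penalty are all assumptions made for illustration; Giraffe chains over a graph and its scoring differs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical toy anchor: a match of `length` bases that starts at
// read_start in the read and ref_start on one linear path.
// This is NOT vg's Anchor class.
struct ToyAnchor {
    long read_start;
    long ref_start;
    long length;
};

// Compute the best chain score ending at each anchor in one DP pass.
// Anchors must be sorted by read_start.
std::vector<long> chain_scores(const std::vector<ToyAnchor>& anchors, long gap_penalty) {
    std::vector<long> best(anchors.size(), 0);
    for (std::size_t i = 0; i < anchors.size(); ++i) {
        const ToyAnchor& here = anchors[i];
        assert(i == 0 || anchors[i - 1].read_start <= here.read_start);
        // A chain can always consist of just this anchor.
        best[i] = here.length;
        for (std::size_t j = 0; j < i; ++j) {
            const ToyAnchor& prev = anchors[j];
            long read_gap = here.read_start - (prev.read_start + prev.length);
            long ref_gap = here.ref_start - (prev.ref_start + prev.length);
            if (read_gap < 0 || ref_gap < 0) {
                // Overlapping or out of order, so not colinear.
                continue;
            }
            // Penalize the implied indel between the two anchors.
            long indel = std::labs(read_gap - ref_gap);
            best[i] = std::max(best[i], best[j] + here.length - gap_penalty * indel);
        }
    }
    return best;
}

// Score of the best chain overall, or 0 if there are no anchors.
long best_chain_score(const std::vector<ToyAnchor>& anchors, long gap_penalty) {
    std::vector<long> best = chain_scores(anchors, gap_penalty);
    return best.empty() ? 0 : *std::max_element(best.begin(), best.end());
}
```

In this toy, a two-pass scheme would run the same kind of DP once to build local fragments and again to chain the fragments; the single-pass version runs it once over all anchors.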

faithokamoto · 2025-12-04T18:13:14Z

If we're getting rid of fragmenting, the generate_zip_tree_transitions() function can be significantly simplified. Currently we have to be very careful about checking whether each seed actually corresponds to an anchor border, e.g.

vg/src/algorithms/chain_items.cpp

Lines 329 to 332 in aa8171c

    // Both were traversed in the same orientation as the read.
    // They might not be at anchor borders though, so check.
    auto found_source_anchor = seed_to_ending.find(source_seed.seed);
    if (found_source_anchor != seed_to_ending.end()) {

But if we simply chain all seeds directly, then every seed will correspond to an anchor border, since every seed will be its own anchor.
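To make the border check concrete, here is a small self-contained sketch of the lookup pattern. Only the name `seed_to_ending` comes from the snippet above; `ToySeedRun`, `NO_ANCHOR`, `index_anchor_endings`, and `anchor_ending_at` are hypothetical names invented for this illustration and are not vg's types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an anchor: it covers the seeds numbered
// first_seed through last_seed, inclusive.
struct ToySeedRun {
    std::size_t first_seed;
    std::size_t last_seed;
};

// Sentinel meaning "no anchor ends at this seed".
const std::size_t NO_ANCHOR = std::numeric_limits<std::size_t>::max();

// Map each anchor's last seed to that anchor's index, playing the role
// that seed_to_ending plays in the snippet above.
std::unordered_map<std::size_t, std::size_t> index_anchor_endings(
    const std::vector<ToySeedRun>& anchors) {
    std::unordered_map<std::size_t, std::size_t> seed_to_ending;
    for (std::size_t i = 0; i < anchors.size(); ++i) {
        assert(anchors[i].first_seed <= anchors[i].last_seed);
        seed_to_ending.emplace(anchors[i].last_seed, i);
    }
    return seed_to_ending;
}

// Return the index of the anchor ending at this seed, or NO_ANCHOR if the
// seed is interior to an anchor or not used by any anchor.
std::size_t anchor_ending_at(
    const std::unordered_map<std::size_t, std::size_t>& seed_to_ending,
    std::size_t seed) {
    auto found = seed_to_ending.find(seed);
    if (found == seed_to_ending.end()) {
        return NO_ANCHOR;
    }
    return found->second;
}
```

With anchors `{0, 2}` and `{3, 3}`, the lookup for seed 0 or seed 1 finds nothing, so the check is needed. With one anchor per seed, every lookup succeeds, which is the situation where the check could be dropped.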

adamnovak · 2026-01-05T22:24:23Z

@faithokamoto We can't get rid of the abstraction of having an Anchor represent several seeds unless we also get rid of the gapless extension feature in the chaining codepath, since we use it there too:

vg/src/minimizer_mapper_from_chains.cpp

Lines 1386 to 1391 in 0c86c8f

    // We have seeds here and can make an anchor
    // Note the index of the new anchor
    extension_anchor_indexes.push_back(extension_anchors.size());
    // Make the actual anchor out of this range of seeds and this read range.
    extension_anchors.push_back(to_anchor(aln, anchor_interval.first, anchor_interval.second, anchor_seeds, seed_anchors, internal_mismatch_begin, internal_mismatch_end, gbwt_graph, this->get_regular_aligner()));

So I think even with the removal of fragmenting, we still have to deal with having seeds in play that are not Anchor boundaries.

adamnovak · 2026-01-06T15:30:42Z

I checked this on calling with ac3735ca988928881a06b84ba46e500fec89f275 of https://github.com/vgteam/recombination-aware-giraffe-experiments.

==> ./output/experiments/hifi_real_full_call_chm13/results/snp_errors.tsv <==
giraffe-0793c7	24255
giraffe-55e964	24364

==> ./output/experiments/hifi_real_full_call_chm13/results/indel_errors.tsv <==
giraffe-0793c7	61665
giraffe-55e964	61669

==> ./output/experiments/hifi_real_full_call_chm13/results/total_errors.tsv <==
giraffe-0793c7	85920
giraffe-55e964	86033

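giraffe-0793c7 makes 113 fewer calling errors in total than giraffe-55e964 (109 fewer SNP errors and 4 fewer indel errors), so it looks like this removes a few calling errors. I also evaluated speed previously and we don't get too much slower.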

So I think this is ready.

adamnovak · 2026-01-06T18:21:53Z

This should also be tested on R10 ~~and Illumina~~ reads.

adamnovak added 4 commits December 2, 2025 14:11

Hackily add a --skip-fragmenting bypass to make all seeds fragments

4fa4451

Default fragmenting to off

d1625a9

Add missing include guard to log.hpp

74fb0b1

adamnovak added 2 commits December 12, 2025 16:07

Merge remote-tracking branch 'origin/master' into no-fragmenting

0c4b31d

Protect forest_state.open_chains.back() access with empty check

0c86c8f

adamnovak added 2 commits January 5, 2026 17:25

Merge remote-tracking branch 'origin/master' into no-fragmenting

7410ff7

Merge remote-tracking branch 'origin/no-fragmenting' into no-fragmenting

0793c7d

adamnovak marked this pull request as ready for review January 6, 2026 15:30
