Add functionality to compute discontiguous supplementary alignments in short-read Giraffe #4767
base: master
Conversation
Hi Jordan,
This is really cool. We are, of course, very interested in getting such
functionality going for the chaining code path, with Faith on deck to take
it on, specifically targeting longer reads (but we want the chaining code
path to work for both short and long). Mobin, et al. - we should test with
DV.
…On Fri, Dec 5, 2025 at 5:04 PM eizengaj-roche ***@***.***> wrote:
Changelog Entry
To be copied to the draft changelog
<https://github.com/vgteam/vg/wiki/Draft-Changelog> by merger:
- vg giraffe can compute supplementary alignments with the
--supplementary option.
Description
This PR provides the capability to identify supplementary alignments from
non-adjacent regions of the graph in vg giraffe. It complements #4699
<#4699>, which added the capability to
preserve contiguous alignments that became separated during surjection as
supplementary alignments.
The strategy for identifying supplementary alignments is intentionally
fairly conservative. Supplementaries are only reported if it's possible to
find a set of alignments that are disjoint on the read and completely cover
the read. There are also some conditions on score and size that get at some
notion of the supplementary alignment being "significant". These conditions
are largely borrowed from BWA-MEM.
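For intuition, here is a minimal sketch of the kind of acceptance test described above. It is not the PR's actual code; the type and function names (ReadInterval, accept_as_split_alignment) and the thresholds are illustrative assumptions only.

    // Illustrative sketch only -- not the code in this PR. Candidate pieces must be
    // disjoint on the read, tile it completely, and each piece must look "significant"
    // (rough score/length conditions in the spirit of BWA-MEM's).
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct ReadInterval {
        size_t begin;   // first read position covered by this alignment piece
        size_t end;     // one past the last read position covered
        int score;      // alignment score of this piece
    };

    bool accept_as_split_alignment(std::vector<ReadInterval> pieces, size_t read_length,
                                   size_t min_piece_length = 20, int min_piece_score = 30) {
        std::sort(pieces.begin(), pieces.end(),
                  [](const ReadInterval& a, const ReadInterval& b) { return a.begin < b.begin; });
        size_t covered_to = 0;
        for (const auto& piece : pieces) {
            if (piece.begin != covered_to) {
                return false;  // overlaps the previous piece or leaves a gap on the read
            }
            if (piece.end - piece.begin < min_piece_length || piece.score < min_piece_score) {
                return false;  // too small or too weak to call "significant"
            }
            covered_to = piece.end;
        }
        return covered_to == read_length;  // the pieces must completely cover the read
    }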
In order to get better candidates for supplementary alignments, the
--supplementary flag additionally triggers behavior that allows clusters
to escape being filtered by the funnel if they cover a part of the read
that is disjoint with all alignments computed previously. My testing
suggests that this has a negligible impact on mapping speed.
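As a rough illustration of that escape condition (not the PR's implementation; the function name and the plain interval representation here are assumptions), a cluster could be allowed past the score filter roughly like this:

    // Sketch only: a cluster escapes filtering if its read interval is disjoint
    // from every alignment computed so far, so it can still seed a supplementary.
    #include <cstddef>
    #include <utility>
    #include <vector>

    bool cluster_escapes_filter(const std::pair<size_t, size_t>& cluster_read_interval,
                                const std::vector<std::pair<size_t, size_t>>& aligned_intervals) {
        for (const auto& interval : aligned_intervals) {
            // overlap test on half-open read intervals [begin, end)
            if (cluster_read_interval.first < interval.second &&
                interval.first < cluster_read_interval.second) {
                return false;  // overlaps an existing alignment: normal filtering applies
            }
        }
        return true;  // disjoint from everything aligned so far: keep it as a candidate
    }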
The current implementation is entirely inside the map_from_extensions
code path. If this code path is eventually eliminated, there will be a bit
of work to port this functionality over to the chaining code path. There's
probably some benefit in developing a supplementary alignment
identification step in the chaining code path regardless, since split
alignments play a bigger role in long read and contig mapping than in short
read mapping.
------------------------------
You can view, comment on, or merge this pull request online at:
#4767
Commit Summary
- a1e0c01
<a1e0c01>
add functionality to compute supplementary alignments in short read giraffe
- b8d7f67
<b8d7f67>
add 2 files missing from the previous commit
File Changes
(19 files <https://github.com/vgteam/vg/pull/4767/files>)
- *M* src/alignment.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-8f01228bbe0df1d13b69f55375e7b872ae186b7709eb8fab4afd58cfe047d4b4>
(167)
- *M* src/alignment.hpp
<https://github.com/vgteam/vg/pull/4767/files#diff-5e2f32e4d9ff46f08e72414c4b0288f49ad96941f9a66e5780476289a0d8ec9c>
(161)
- *A* src/interval_union.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-29c3a9442228493522cd2cd3a1d146b840b1e70d34d9cd49d0945511b7a75bde>
(113)
- *A* src/interval_union.hpp
<https://github.com/vgteam/vg/pull/4767/files#diff-7bbb57c988b95f5ad895dd8bea493664578a43aeb7b9b9225fdd08327817c5d1>
(67)
- *M* src/minimizer_mapper.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-ff9203cb83c40b72594f6b0da724f79ddda30b86e12e2aae291afa9cc423483a>
(465)
- *M* src/minimizer_mapper.hpp
<https://github.com/vgteam/vg/pull/4767/files#diff-742746d66f9edc4cc39ac8e5efc947a3ca890d36d2c9289f6d26e7904e6e6e6a>
(135)
- *M* src/subcommand/giraffe_main.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-5ab1ab0c5bf48c0616257ab6ee3e4f4f48f0458d159f36b3497f00f96b373456>
(18)
- *M* src/subcommand/surject_main.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-281d97a60ecc118e402475d4328d25f5d8ed35618559870bfabcfad1fa73eccf>
(41)
- *M* src/surjecting_alignment_emitter.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-c1b16e6b06356876f685ecfb13abcc8c1fbd3c5015526a5fcf57e92db2054e41>
(52)
- *M* src/surjector.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-e49ffade5deb0d7cac24851cffd7efaeb00bd2b6b95e4bc40ff31697bcbe5149>
(348)
- *M* src/surjector.hpp
<https://github.com/vgteam/vg/pull/4767/files#diff-704141acbb9573b918973695ce760965dfcaac00c694a9ea97d06d1335930a0a>
(58)
- *M* src/unittest/alignment.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-d755120722849282c20e011853870ec2d6c85144d92ec2947d3e98898849f5b1>
(309)
- *A* src/unittest/interval_union.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-0ba887817f5293462ad48116e2acce2bde5b1d6d9f5ac6d820eeaa6356e2aa77>
(114)
- *M* src/unittest/surject.cpp
<https://github.com/vgteam/vg/pull/4767/files#diff-868a5a0e8dba293afa200392bfd43b3825be04a4071415e8feab6e6530c6ba3b>
(84)
- *M* src/utility.hpp
<https://github.com/vgteam/vg/pull/4767/files#diff-f3ba931e1d09a1035fc6cc31cc986f9057a0acdd1597d852225aac90d9b3b14e>
(39)
- *A* test/graphs/disconnected.gfa
<https://github.com/vgteam/vg/pull/4767/files#diff-ad1602d7da7e6cc42dbbe0a63de6a24772e2de292d56ff1b2db071b768fe445d>
(7)
- *A* test/reads/disconnected.fq
<https://github.com/vgteam/vg/pull/4767/files#diff-b31792ce587aa7a5f82201587b31be9e6cacb51b313dd1ea567ff9099f2745d7>
(4)
- *A* test/reads/disconnected_pair.fq
<https://github.com/vgteam/vg/pull/4767/files#diff-d2465b39ba475b6a4d57e4b6e4d96fff2bcdf0611fabc28a77347ba67a21a8f9>
(8)
- *M* test/t/50_vg_giraffe.t
<https://github.com/vgteam/vg/pull/4767/files#diff-8010715941cd97a5d74cfd3c0e80f72f9de21b96cdac5f872aa365a59dac9154>
(18)
Patch Links:
- https://github.com/vgteam/vg/pull/4767.patch
- https://github.com/vgteam/vg/pull/4767.diff
I think some of the components here should be pretty easy to repurpose. If someone else plans to work on this, maybe we should meet to figure out how the joints could fit together.
adamnovak
left a comment
I think this is pretty good.
I think there's one place in Surjector::surject() where the new code is not really sure what level of abstraction it lives at, where it probably should be split up so the top-level surject method doesn't need to think about individual edits, even if that results in functions that only get used once.
I don't like the idea of hiding supplementary alignments inside annotations, especially as an externally-visible feature of the GAM format. Then you can't do things like put annotations on them themselves, and you have to deal with packing and unpacking these strings. I would recommend instead adding a real Protobuf repeated Alignment message field inside Alignment in libvgio, to hold supplementary alignments. That's still going to have a good impedance match with storing them in GAF tags (which I think is what we're doing there?), but it's more natural for how GAM would do it if we weren't now backporting ideas from GAF. I think it does make sense to have the supplementary alignments physically attached to the primary one, because it makes pairing easier and sort of captures that this is "really" one longer alignment with the gaps elided.
(The real right way to handle it would be to be operating on some higher-level frame object that covers a set of pieces of an alignment, and maybe also represents a pair, which might be able to fit into the same notion of partial-ness, but I don't think we're ever going to refactor over to that.)
It looks like the way the supplementary-assignment logic works is, if you have a currently-winning primary alignment, it can pick up other smaller alignment pieces as supplementary. This adds a constraint that means two smaller pieces can't gang up as supplementary with each other and dethrone an incompatible primary that's only a little better than either alone. Is that what we want? Does that match the behavior of current linear mappers?
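To make the repeated-field suggestion concrete, here is a hypothetical sketch of what it might look like from the C++ side. Nothing like this exists in libvgio today; the field name supplementary and the generated add_supplementary() accessor are assumptions for illustration only.

    #include <vector>
    #include <vg/vg.pb.h>

    // Hypothetical: assumes libvgio's Alignment gained `repeated Alignment supplementary`.
    void attach_supplementaries(vg::Alignment& primary, const std::vector<vg::Alignment>& pieces) {
        for (const auto& piece : pieces) {
            // assumed generated accessor for the new repeated field
            *primary.add_supplementary() = piece;
        }
    }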
/**
 * The union of a collection of intervals
 */
class IntervalUnion
This is a candidate for living in structures.
/**
 * Same as the process_until_threshold_b, except that items can escape the score
 * threshold by a function-specified condition. The process_item function receives
 * an additional bool indicating whether an item passed by threshold escape
 */
template<typename Score = double>
void process_until_threshold_d(const vector<Score>& scores,
                               const function<bool(size_t)>& threshold_escape,
                               double threshold, size_t min_count, size_t max_count,
                               LazyRNG& get_seed,
                               const function<bool(size_t, size_t, bool)>& process_item,
                               const function<void(size_t)>& discard_item_by_count,
                               const function<void(size_t)>& discard_item_by_score) const;

/**
 * Same as the process_until_threshold_c, except that items can escape the score
 * threshold by a function-specified condition. The process_item function receives
 * an additional bool indicating whether an item passed by threshold escape
 */
template<typename Score = double>
void process_until_threshold_e(size_t items,
                               const function<Score(size_t)>& get_score,
                               const function<bool(size_t, size_t)>& comparator,
                               const function<bool(size_t)>& threshold_escape,
                               double threshold, size_t min_count, size_t max_count,
                               LazyRNG& get_seed,
                               const function<bool(size_t, size_t, bool)>& process_item,
                               const function<void(size_t)>& discard_item_by_count,
                               const function<void(size_t)>& discard_item_by_score) const;
These are getting pretty complex, and we're now filling in a matrix of possible combinations of features, so adding one feature gives us two new functions.
Maybe these (eventually?) need to become some kind of builder pattern, where you have an object you construct, and then you hand it the various thresholds and sorting functions and discard event listener callbacks with specific methods for each, so you don't need a function version for every combination one might use. Then once you've set it up you run it.
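A rough sketch of what that builder-style alternative could look like; every name here is invented for illustration and nothing like this exists in vg today.

    // Hypothetical builder: configure thresholds, escape conditions, and discard
    // listeners individually instead of growing a new function overload per combination.
    #include <cstddef>
    #include <functional>
    #include <utility>

    template<typename Score = double>
    class ThresholdProcessor {
    public:
        ThresholdProcessor& with_scores(std::function<Score(size_t)> get_score) {
            get_score_ = std::move(get_score); return *this;
        }
        ThresholdProcessor& with_threshold(double threshold) { threshold_ = threshold; return *this; }
        ThresholdProcessor& with_counts(size_t min_count, size_t max_count) {
            min_count_ = min_count; max_count_ = max_count; return *this;
        }
        ThresholdProcessor& with_threshold_escape(std::function<bool(size_t)> escape) {
            escape_ = std::move(escape); return *this;
        }
        ThresholdProcessor& on_discard_by_score(std::function<void(size_t)> cb) {
            discard_by_score_ = std::move(cb); return *this;
        }
        ThresholdProcessor& on_discard_by_count(std::function<void(size_t)> cb) {
            discard_by_count_ = std::move(cb); return *this;
        }

        // Run over `items` indices, calling process_item for the survivors; the real
        // thing would also handle sorting and LazyRNG tie-breaking.
        void run(size_t items, const std::function<bool(size_t, size_t, bool)>& process_item) const;

    private:
        std::function<Score(size_t)> get_score_;
        std::function<bool(size_t)> escape_;
        std::function<void(size_t)> discard_by_score_;
        std::function<void(size_t)> discard_by_count_;
        double threshold_ = 0.0;
        size_t min_count_ = 0;
        size_t max_count_ = 0;
    };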
src/surjector.cpp
Outdated
// figure out which interval should be assigned to the primary

// collect the supplementary intervals (begin, end, idx)
vector<tuple<size_t, size_t, size_t>> suppl_intervals;
suppl_intervals.reserve(mapper_supplementaries.size());
for (size_t i = 0; i < mapper_supplementaries.size(); ++i) {
    const auto& suppl = mapper_supplementaries[i];
    auto interval = aligned_interval(suppl);
    if (interval.second < interval.first) {
        // unaligned -- TODO: this shouldn't ever happen though, remove this?
        continue;
    }
    suppl_intervals.emplace_back(interval.first, interval.second, i);
}
auto primary_interval = aligned_interval(source);
suppl_intervals.emplace_back(primary_interval.first, primary_interval.second, -1);
sort(suppl_intervals.begin(), suppl_intervals.end());

// make hard-clipped Alignments
for (size_t i = 0; i < suppl_intervals.size(); ++i) {

    const auto& interval = suppl_intervals[i];
    const auto& unclipped = get<2>(interval) < mapper_supplementaries.size() ? mapper_supplementaries[get<2>(interval)] : source;

    Alignment clipped;

    // copy over metadata
    clipped.set_score(unclipped.score());
    clipped.set_mapping_quality(unclipped.mapping_quality());
    clipped.set_name(source.name());
    clipped.set_read_group(source.read_group());
    clipped.set_sample_name(source.sample_name());
    clipped.set_is_secondary(source.is_secondary());
    if (source.has_fragment_next()) {
        *clipped.mutable_fragment_next() = source.fragment_next();
    }
    if (source.has_fragment_prev()) {
        *clipped.mutable_fragment_prev() = source.fragment_prev();
    }
    if (Annotation<Alignment>::get(source).fields().size() > 1) {
        // there are additional annotations to the supplementaries that we already parsed
        *clipped.mutable_annotation() = source.annotation();
        // don't pass through the mapper supplementaries
        clear_annotation(clipped, "supplementaries");
    }
    if (&unclipped != &source) {
        set_annotation<bool>(clipped, "supplementary", true);
    }

    // possibly expand the sequence interval into "unclaimed" adjacent sequence
    size_t left_bnd = i != 0 ? get<1>(suppl_intervals[i - 1]) : 0;
    size_t right_bnd = i + 1 < suppl_intervals.size() ? get<0>(suppl_intervals[i + 1]) : source.sequence().size();
    size_t subseq_begin = min(left_bnd, get<0>(interval));
    size_t subseq_end = max(right_bnd, get<1>(interval));

#ifdef debug_anchored_surject
    cerr << "hard clipping ";
    if (get<2>(interval) >= mapper_supplementaries.size()) {
        cerr << "primary alignment";
    }
    else {
        cerr << "supplementary " << get<2>(interval);
    }
    cerr << " to the interval " << subseq_begin << ":" << subseq_end << " for surjection" << endl;
#endif

    // make the clipped sequence
    clipped.set_sequence(source.sequence().substr(subseq_begin, subseq_end - subseq_begin));
    if (!source.quality().empty()) {
        clipped.set_quality(source.quality().substr(subseq_begin, subseq_end - subseq_begin));
    }

    auto clipped_path = clipped.mutable_path();
    const auto& unclipped_path = unclipped.path();
    for (size_t m = 0; m < unclipped_path.mapping_size(); ++m) {
        const auto& unclipped_mapping = unclipped_path.mapping(m);
        auto clipped_mapping = clipped_path->add_mapping();
        *clipped_mapping->mutable_position() = unclipped_mapping.position();
        if (m != 0 && m + 1 != unclipped_path.mapping_size()) {
            // internal mapping
            *clipped_mapping = unclipped_mapping;
            clipped_mapping->set_rank(clipped_path->mapping_size());
        }
        else {
            // first and/or last mapping
            for (size_t e = 0; e < unclipped_mapping.edit_size(); ++e) {
                if (m == 0 && e == 0 && subseq_begin != 0) {
                    // initial hard-clip
                    const auto& softclip_edit = unclipped_mapping.edit(e);
                    assert(softclip_edit.from_length() == 0);
                    if (subseq_begin < softclip_edit.to_length()) {
                        auto clipped_edit = clipped_mapping->add_edit();
                        clipped_edit->set_from_length(0);
                        clipped_edit->set_to_length(softclip_edit.to_length() - subseq_begin);
                        clipped_edit->set_sequence(softclip_edit.sequence().substr(softclip_edit.to_length() - clipped_edit->to_length(), clipped_edit->to_length()));
                    }
                }
                else if (m + 1 == unclipped_path.mapping_size() && e + 1 == unclipped_mapping.edit_size()
                         && subseq_end != source.sequence().size()) {
                    // final hard-clip
                    const auto& softclip_edit = unclipped_mapping.edit(e);
                    assert(softclip_edit.from_length() == 0);
                    if (source.sequence().size() - softclip_edit.to_length() < subseq_end) {
                        auto clipped_edit = clipped_mapping->add_edit();
                        clipped_edit->set_from_length(0);
                        clipped_edit->set_to_length(subseq_end + softclip_edit.to_length() - source.sequence().size());
                        clipped_edit->set_sequence(softclip_edit.sequence().substr(0, clipped_edit->to_length()));
                    }
                }
                else {
                    // internal edit
                    *clipped_mapping->add_edit() = unclipped_mapping.edit(e);
                }
            }

            if (clipped_mapping->edit_size() == 0) {
                // the only edits were filtered out, don't leave an empty mapping
                clipped_path->mutable_mapping()->RemoveLast();
            }
            else {
                clipped_mapping->set_rank(clipped_path->mapping_size());
            }
        }
    }
#ifdef debug_anchored_surject
    cerr << "hard-clipped surjection input:" << endl;
    cerr << pb2json(clipped) << endl;
#endif
    // do the surjection
    vector<Alignment> clipped_surjected;
    vector<tuple<string, int64_t, bool>> clipped_positions_out;
    surject_internal(&clipped, nullptr, &clipped_surjected, nullptr, paths, clipped_positions_out,
                     allow_negative_scores, preserve_deletions);

#ifdef debug_anchored_surject
    cerr << "got hard-clipped surjections:" << endl;
    for (const auto& surj : clipped_surjected) {
        cerr << pb2json(surj) << endl;
    }
    size_t curr_surj_size = surjected.size();
#endif

    // restore soft-clips and add to output
    for (size_t j = 0; j < clipped_surjected.size(); ++j) {
        auto& clipped_surj = clipped_surjected[j];

        clipped_surj.set_sequence(source.sequence());
        clipped_surj.set_quality(source.quality());

        if (clipped_surj.path().mapping_size() != 0) {

            if (subseq_begin != 0) {
                // restore a softclip on the front
                auto first_mapping = clipped_surj.mutable_path()->mutable_mapping(0);
                auto first_edit = first_mapping->mutable_edit(0);
                if (first_edit->from_length() == 0) {
                    first_edit->set_sequence(clipped_surj.sequence().substr(0, subseq_begin) + first_edit->sequence());
                    first_edit->set_to_length(first_edit->sequence().size());
                }
                else {
                    auto softclip_edit = first_mapping->add_edit();
                    softclip_edit->set_from_length(0);
                    softclip_edit->set_to_length(subseq_begin);
                    softclip_edit->set_sequence(clipped_surj.sequence().substr(0, subseq_begin));
                    // move it to the front
                    for (size_t k = first_mapping->edit_size() - 1; k > 0; --k) {
                        first_mapping->mutable_edit()->SwapElements(k, k - 1);
                    }
                }
            }
            if (subseq_end != source.sequence().size()) {
                // restore a softclip on the back
                auto final_mapping = clipped_surj.mutable_path()->mutable_mapping(clipped_surj.path().mapping_size() - 1);
                auto final_edit = final_mapping->mutable_edit(final_mapping->edit_size() - 1);
                if (final_edit->from_length() == 0) {
                    final_edit->set_sequence(final_edit->sequence() +
                                             clipped_surj.sequence().substr(subseq_end, clipped_surj.sequence().size() - subseq_end));
                    final_edit->set_to_length(final_edit->sequence().size());
                }
                else {
                    auto softclip_edit = final_mapping->add_edit();
                    softclip_edit->set_from_length(0);
                    softclip_edit->set_sequence(clipped_surj.sequence().substr(subseq_end, clipped_surj.sequence().size() - subseq_end));
                    softclip_edit->set_to_length(softclip_edit->sequence().size());
                }
            }
        }
I think this piece gets a bit deep into the weeds of juggling individual edits to belong in the top-level ::surject() function. It probably should be split off into a few functions that do the different manipulations and then orchestrated here.
Factored this out into separate functions
@eizengaj-roche Is the string-packed representation used in the annotation also used in GAF?
GAF output is actually something I hadn't really considered, but I do think this structured string probably could/should be used in GAF output. I designed this format to be similar to the standard SAM SA tag.
I agree that using the annotation system is not exactly ideal. The main reason I turned to it is the GAM format's dependence on interleaving for expressing pairing, which fits poorly with the SAM/BAM formulation using separate alignment records. I guess I'm also still a bit allergic to further changes to the core GAM record after the early days when it was super unstable, but I can see the argument for a repeated Alignment field. However, considering GAF output as well, I think that using that format would probably only succeed in pushing the Alignment <-> structured string conversion into the GAM-to-GAF re-encoding, rather than eliminating it entirely.
This was mostly an attempt to err on the side of being conservative in identifying supplementary alignments. Once you allow in fragmentary alignments, the search space grows a lot. For instance, my suspicion is that, in repetitive sequences, it will be common that a read can be partitioned into fragmentary alignments that erroneously beat out a primary that's just ok, score-wise. There's probably a way to parameterize around this issue through something like a "supplementary alignment penalty", but initially, I just have it focusing on the most clear-cut cases.
@adamnovak Pending CI, how are you feeling about merging this? I've updated libvgio with vgteam/libvgio#86 so that GAF output will also preserve supplementary alignments, and the refactor in
I think I'm positive on this overall, but I'm rushing around before curtailment and kind of not paying attention here, sorry. I didn't know about the SAM SA tag, and while that in turn raises questions about how it interacts with the SAM supplementary flag, it makes cramming the supplementary records into e.g. GAF tags more reasonable.
If we are going to move supplementary alignments to real GAM fields, and we merge/release something with them in annotations instead, we're going to need conversion code kind of forever unless we're willing to silently throw them out, right? Which maybe we are willing to do. So if you want to merge it with the supplementary alignments in an annotation, you can, but I think you should move them to real GAM fields.