Skip to content

Conversation

@eizengaj-roche
Copy link

Changelog Entry

To be copied to the draft changelog by merger:

  • vg giraffe can compute supplementary alignments with the --supplementary option.

Description

This PR provides the capability to identify supplementary alignments from non-adjacent regions of the graph in vg giraffe. It compliments #4699, which added the capability to preserve contiguous alignments that became separated during surjection as supplementary alignments.

The strategy for identifying supplementary alignments is intentionally fairly conservative. Supplementaries are only reported if it's possible to find a set of alignments that are disjoint on the read and completely cover the read. There are also some conditions on score and size that get at some notion of the supplementary alignment being "significant". These conditions are largely borrowed from BWA-MEM.

In order to get better candidates for supplementary alignments, the --supplementary flag additionally triggers behavior that allows clusters to escape being filtered by the funnel if they cover a part of the read that is disjoint with all alignments computed previously. My testing suggest that this has a negligible impact on mapping speed.

The current implementation is entirely inside the map_from_extensions code path. If this code path is eventually eliminated, there will be a bit of work to port this functionality over to the chaining code path. There's probably some benefit in developing a supplementary alignment identification step in the chaining code path regardless, since split alignments play a bigger role in long read and contig mapping than in short read mapping.

@benedictpaten
Copy link
Contributor

benedictpaten commented Dec 6, 2025 via email

@jeizenga
Copy link
Contributor

jeizenga commented Dec 6, 2025

I think some of the components here should be pretty easy to repurpose. If someone else plans to work on this, maybe we should meet to figure out how the joints could fit together.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty good.

I think there's one place in Surjector::surject() where the new code is not really sure what level of abstraction it lives at, where it probably should be split up so the top-level surject method doesn't need to think about individual edits, even if that results in functions that only get used once.

I don't like the idea of hiding supplementary alignments inside annotations, especially as an externally-visible feature of the GAM format. Then you can't do things like put annotations on them themselves, and you have to deal with packing and unpacking these strings. I would recommend instead adding a real Protobuf repeated Alignment message field inside Alignment in libvgio, to hold supplementary alignments. That's still going to have a good impedance match with storing them in GAF tags (which I think is what we're doing there?), but it's more natural for how GAM would do it if we weren't now backporting ideas from GAF. I think it does make sense to have the supplementary alignments physically attached to the primary one, because it makes pairing easier and sort of captures that this is "really" one longer alignment with the gaps elided.

(The real right way to handle it would be to be operating on some higher-level frame object that covers a set of pieces of an alignment, and maybe also represents a pair, which might be able to fit into the same notion of partial-ness, but I don't think we're ever going to refactor over to that.)

It looks like the way the supplementary-assignment logic works is, if you have a currently-winning primary alignment, it can pick up other smaller alignment pieces as supplementary. This adds a constraint that means two smaller pieces can't gang up as supplementary with each other and dethrone an incompatible primary that's only a little better than either alone. Is that what we want? Does that match the behavior of current linear mappers?

Comment on lines +16 to +19
/**
* The union of a collection of intervals
*/
class IntervalUnion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a candidate for living in structures.

Comment on lines +1469 to +1497
/**
* Same as the process_until_threshold_b, except that items can escape the score
* threshold by a function-specified condition. The process_item function receives
* an additional bool indicating whether an item passed by threshold escape
*/
template<typename Score = double>
void process_until_threshold_d(const vector<Score>& scores,
const function<bool(size_t)>& threshold_escape,
double threshold, size_t min_count, size_t max_count,
LazyRNG& get_seed,
const function<bool(size_t, size_t, bool)>& process_item,
const function<void(size_t)>& discard_item_by_count,
const function<void(size_t)>& discard_item_by_score) const;

/**
* Same as the process_until_threshold_c, except that items can escape the score
* threshold by a function-specified condition. The process_item function receives
* an additional bool indicating whether an item passed by threshold escape
*/
template<typename Score = double>
void process_until_threshold_e(size_t items,
const function<Score(size_t)>& get_score,
const function<bool(size_t, size_t)>& comparator,
const function<bool(size_t)>& threshold_escape,
double threshold, size_t min_count, size_t max_count,
LazyRNG& get_seed,
const function<bool(size_t, size_t, bool)>& process_item,
const function<void(size_t)>& discard_item_by_count,
const function<void(size_t)>& discard_item_by_score) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are getting pretty complex, and we're now filling in a matrix of possible combinations of features, so adding one feature gives us two new functions.

Maybe these (eventually?) need to become some kind of builder pattern, where you have an object you construct, and then you hand it the various thresholds and sorting functions and discard event listener callbacks with specific methods for each, so you don't need a function version for every combination one might use. Then once you've set it up you run it.

Comment on lines 161 to 347
// figure out which interval should be assigned to the primary

// collect the supplementary intervals (begin, end, idx)
vector<tuple<size_t, size_t, size_t>> suppl_intervals;
suppl_intervals.reserve(mapper_supplementaries.size());
for (size_t i = 0; i < mapper_supplementaries.size(); ++i) {
const auto& suppl = mapper_supplementaries[i];
auto interval = aligned_interval(suppl);
if (interval.second < interval.first) {
// unaligned -- TODO: this shouldn't ever happen though, remove this?
continue;
}
suppl_intervals.emplace_back(interval.first, interval.second, i);
}
auto primary_interval = aligned_interval(source);
suppl_intervals.emplace_back(primary_interval.first, primary_interval.second, -1);
sort(suppl_intervals.begin(), suppl_intervals.end());

// make hard-clipped Alignments
for (size_t i = 0; i < suppl_intervals.size(); ++i) {

const auto& interval = suppl_intervals[i];
const auto& unclipped = get<2>(interval) < mapper_supplementaries.size() ? mapper_supplementaries[get<2>(interval)] : source;

Alignment clipped;

// copy over metadata
clipped.set_score(unclipped.score());
clipped.set_mapping_quality(unclipped.mapping_quality());
clipped.set_name(source.name());
clipped.set_read_group(source.read_group());
clipped.set_sample_name(source.sample_name());
clipped.set_is_secondary(source.is_secondary());
if (source.has_fragment_next()) {
*clipped.mutable_fragment_next() = source.fragment_next();
}
if (source.has_fragment_prev()) {
*clipped.mutable_fragment_prev() = source.fragment_prev();
}
if (Annotation<Alignment>::get(source).fields().size() > 1) {
// there are additional annotations to the supplementaries that we already parsed
*clipped.mutable_annotation() = source.annotation();
// don't pass through the mapper supplementaries
clear_annotation(clipped, "supplementaries");
}
if (&unclipped != &source) {
set_annotation<bool>(clipped, "supplementary", true);
}

// possibly expand the sequence interval into "unclaimed" adjacent sequence
size_t left_bnd = i != 0 ? get<1>(suppl_intervals[i - 1]) : 0;
size_t right_bnd = i + 1 < suppl_intervals.size() ? get<0>(suppl_intervals[i + 1]) : source.sequence().size();
size_t subseq_begin = min(left_bnd, get<0>(interval));
size_t subseq_end = max(right_bnd, get<1>(interval));

#ifdef debug_anchored_surject
cerr << "hard clipping ";
if (get<2>(interval) >= mapper_supplementaries.size()) {
cerr << "primary alignment";
}
else {
cerr << "supplementary " << get<2>(interval);
}
cerr << " to the interval " << subseq_begin << ":" << subseq_end << " for surjection" << endl;
#endif

// make the clipped sequence
clipped.set_sequence(source.sequence().substr(subseq_begin, subseq_end - subseq_begin));
if (!source.quality().empty()) {
clipped.set_quality(source.quality().substr(subseq_begin, subseq_end - subseq_begin));
}

auto clipped_path = clipped.mutable_path();
const auto& unclipped_path = unclipped.path();
for (size_t m = 0; m < unclipped_path.mapping_size(); ++m) {
const auto& unclipped_mapping = unclipped_path.mapping(m);
auto clipped_mapping = clipped_path->add_mapping();
*clipped_mapping->mutable_position() = unclipped_mapping.position();
if (m != 0 && m + 1 != unclipped_path.mapping_size()) {
// internal mapping
*clipped_mapping = unclipped_mapping;
clipped_mapping->set_rank(clipped_path->mapping_size());
}
else {
// first and/or last mapping
for (size_t e = 0; e < unclipped_mapping.edit_size(); ++e) {
if (m == 0 && e == 0 && subseq_begin != 0) {
// initial hard-clip
const auto& softclip_edit = unclipped_mapping.edit(e);
assert(softclip_edit.from_length() == 0);
if (subseq_begin < softclip_edit.to_length()) {
auto clipped_edit = clipped_mapping->add_edit();
clipped_edit->set_from_length(0);
clipped_edit->set_to_length(softclip_edit.to_length() - subseq_begin);
clipped_edit->set_sequence(softclip_edit.sequence().substr(softclip_edit.to_length() - clipped_edit->to_length(), clipped_edit->to_length()));
}
}
else if (m + 1 == unclipped_path.mapping_size() && e + 1 == unclipped_mapping.edit_size()
&& subseq_end != source.sequence().size()) {
// final hard-clip
const auto& softclip_edit = unclipped_mapping.edit(e);
assert(softclip_edit.from_length() == 0);
if (source.sequence().size() - softclip_edit.to_length() < subseq_end) {
auto clipped_edit = clipped_mapping->add_edit();
clipped_edit->set_from_length(0);
clipped_edit->set_to_length(subseq_end + softclip_edit.to_length() - source.sequence().size());
clipped_edit->set_sequence(softclip_edit.sequence().substr(0, clipped_edit->to_length()));
}
}
else {
// internal edit
*clipped_mapping->add_edit() = unclipped_mapping.edit(e);
}
}

if (clipped_mapping->edit_size() == 0) {
// the only edits were filtered out, don't leave an empty mapping
clipped_path->mutable_mapping()->RemoveLast();
}
else {
clipped_mapping->set_rank(clipped_path->mapping_size());
}
}
}
#ifdef debug_anchored_surject
cerr << "hard-clipped surjection input:" << endl;
cerr << pb2json(clipped) << endl;
#endif
// do the surjection
vector<Alignment> clipped_surjected;
vector<tuple<string, int64_t, bool>> clipped_positions_out;
surject_internal(&clipped, nullptr, &clipped_surjected, nullptr, paths, clipped_positions_out,
allow_negative_scores, preserve_deletions);

#ifdef debug_anchored_surject
cerr << "got hard-clipped surjections:" << endl;
for (const auto& surj : clipped_surjected) {
cerr << pb2json(surj) << endl;
}
size_t curr_surj_size = surjected.size();
#endif

// restore soft-clips and add to output
for (size_t j = 0; j < clipped_surjected.size(); ++j) {
auto& clipped_surj = clipped_surjected[j];

clipped_surj.set_sequence(source.sequence());
clipped_surj.set_quality(source.quality());

if (clipped_surj.path().mapping_size() != 0) {

if (subseq_begin != 0) {
// restore a softclip on the front
auto first_mapping = clipped_surj.mutable_path()->mutable_mapping(0);
auto first_edit = first_mapping->mutable_edit(0);
if (first_edit->from_length() == 0) {
first_edit->set_sequence(clipped_surj.sequence().substr(0, subseq_begin) + first_edit->sequence());
first_edit->set_to_length(first_edit->sequence().size());
}
else {
auto softclip_edit = first_mapping->add_edit();
softclip_edit->set_from_length(0);
softclip_edit->set_to_length(subseq_begin);
softclip_edit->set_sequence(clipped_surj.sequence().substr(0, subseq_begin));
// move it to the front
for (size_t k = first_mapping->edit_size() - 1; k > 0; --k) {
first_mapping->mutable_edit()->SwapElements(k, k - 1);
}
}
}
if (subseq_end != source.sequence().size()) {
// restore a softclip on the back
auto final_mapping = clipped_surj.mutable_path()->mutable_mapping(clipped_surj.path().mapping_size() - 1);
auto final_edit = final_mapping->mutable_edit(final_mapping->edit_size() - 1);
if (final_edit->from_length() == 0) {
final_edit->set_sequence(final_edit->sequence() +
clipped_surj.sequence().substr(subseq_end, clipped_surj.sequence().size() - subseq_end));
final_edit->set_to_length(final_edit->sequence().size());
}
else {
auto softclip_edit = final_mapping->add_edit();
softclip_edit->set_from_length(0);
softclip_edit->set_sequence(clipped_surj.sequence().substr(subseq_end, clipped_surj.sequence().size() - subseq_end));
softclip_edit->set_to_length(softclip_edit->sequence().size());
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this piece gets a bit deep into the weeds of juggling individual edits to belong in the top-level ::surject() function. It probably should be split off into a few functions that do the different manipulations and then orchestrated here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factored this out into separate functions

@adamnovak
Copy link
Member

@eizengaj-roche Is the string-packed representation used in the annotation also used in GAF?

@eizengaj-roche
Copy link
Author

@eizengaj-roche Is the string-packed representation used in the annotation also used in GAF?

GAF output is actually something I hadn't really considered, but I do think this structured string probably could/should be used in GAF output. I designed this format to be similar to the standard SA tag that's used in SAM/BAM files to indicate supplementary alignments.

I don't like the idea of hiding supplementary alignments inside annotations, especially as an externally-visible feature of the GAM format. Then you can't do things like put annotations on them themselves, and you have to deal with packing and unpacking these strings. I would recommend instead adding a real Protobuf repeated Alignment message field inside Alignment in libvgio, to hold supplementary alignments. That's still going to have a good impedance match with storing them in GAF tags (which I think is what we're doing there?), but it's more natural for how GAM would do it if we weren't now backporting ideas from GAF. I think it does make sense to have the supplementary alignments physically attached to the primary one, because it makes pairing easier and sort of captures that this is "really" one longer alignment with the gaps elided.

I agree that using the annotation system is not exactly ideal. The main reason I turned to that is because of the GAM format's dependence on interleaving for expressing pairing, which fits poorly with the SAM/BAM formulation using separate alignment records. I guess I'm also still a bit allergic to further changes to the core GAM record after the early days when it was super unstable, but I can see the argument for a repeated Alignment field. However, considering GAF output as well, I think that using format would probably only succeed in pushing the Alignment <-> structured string conversion into the GAM-to-GAF re-encoding, rather than eliminating it entirely.

It looks like the way the supplementary-assignment logic works is, if you have a currently-winning primary alignment, it can pick up other smaller alignment pieces as supplementary. This adds a constraint that means two smaller pieces can't gang up as supplementary with each other and dethrone an incompatible primary that's only a little better than either alone. Is that what we want? Does that match the behavior of current linear mappers?

This was mostly an attempt to err on the side of being conservative in identifying supplementary alignments. Once you allow in fragmentary alignments, the search space grows a lot. For instance, my suspicion is that, in repetitive sequences, it will be common that a read can be partitioned into fragmentary alignments that erroneously beat out a primary that's just ok, score-wise. There's probably a way to parameterize around this issue through something like a "supplementary alignment penalty", but initially, I just have it focusing on the most clear-cut cases.

@eizengaj-roche
Copy link
Author

@adamnovak Pending CI, how are you feeling about merging this? I've updated libvgio with vgteam/libvgio#86 so that GAF output will also preserve supplementary alignments, and the refactor in Surjector::surject() is finished. That leaves unaddressed the possibility of moving IntervalUnion to structures, moving the GAM annotation into a first-rank message field, and the exact format of the supplementary annotation/tag. I'm open to adjustments on any of those, but AFAIK they shouldn't block this PR from being merged.

@adamnovak
Copy link
Member

adamnovak commented Dec 17, 2025

I think I'm positive on this overall, but I'm rushing around before curtailment and kind of not paying attention here, sorry.

I didn't know about the SAM SA tag, and while that in turn raises questions about how it interacts with the SAM supplementary flag, it makes cramming the supplementary records into e.g. GAF tags more reasonable.

IntervalUnion is fine to stay where it is for now.

If we are going to move supplementary alignments to real GAM fields in GAM, and we merge/release something with them in annotations instead, we're going to need conversion code kind of forever unless we're willing to silently throw them out, right? Which maybe we are willing to do.

If the same data will be an sa tag in GAF, we could store it in GAM with the GAF-tag-round-tripping machinery. But I'm not sure those are indexed for access by tag name...

So if you want to merge it with the supplementary alignments in an annotation, you can, but I think you should move them to real GAM Alignment fields and put the sa tag generation in the GAF IO layer before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants