Skip to content

Conversation

@ilia-kats
Copy link
Collaborator

  • add unit tests for the new update()
  • remove redundant/obsolete code and streamline the new update()
  • fix bugs and make it pass all tests, in particular with duplicated index entries and different axes

@ilia-kats ilia-kats requested a review from gtca September 30, 2025 09:26
@ilia-kats ilia-kats force-pushed the update_fixes branch 2 times, most recently from 74444ba to 0cd6110 Compare September 30, 2025 12:44
@ilia-kats ilia-kats marked this pull request as draft September 30, 2025 14:13
@ilia-kats ilia-kats marked this pull request as ready for review October 2, 2025 16:15
@ilia-kats ilia-kats requested a review from ilan-gold October 9, 2025 16:11
@gtca
Copy link
Collaborator

gtca commented Oct 9, 2025

Thanks, @ilia-kats! I especially appreciate the code reduction — it's a great direction, and maybe we can eliminate even more code in the .update() functionality to make it simpler in future.

Because it's a difficult part of the codebase to track the logic across so many lines, if you think there are more comments explaining the logic that can help this to be more maintainable, feel free to add them!

The only question before we merge it is about potential performance regressions with the changes to the .update() details: do you know if the execution time was impacted by any of the changes?

@ilia-kats
Copy link
Collaborator Author

I haven't measured performance. On the one hand, I would expect it to be faster since I replaced this entire code block at the bottom of the function with a single call to data_global.index.get_indexer() further up, but on the other hand it now does a full comparison of the modmaps for some modalities, which will be slower. However, that was the only way I could think of to make all the tests pass (i.e. the previous version was buggy and didn't work in some cases), so I don't think we have much choice there. But I'll be happy to run some benchmarks if you can point me to a particularly nasty dataset performance-wise, I've never had issues with update() performance myself.

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix bugs and make it pass all tests, in particular with duplicated index entries and different axes

Can you highlight which lines fix which bugs?

@ilia-kats
Copy link
Collaborator Author

ilia-kats commented Oct 29, 2025

fix bugs and make it pass all tests, in particular with duplicated index entries and different axes

Can you highlight which lines fix which bugs?

I'm afraid not for everything. The previous version was not being tested in the unit tests, so I just enabled the tests and tried to make them all pass. This particularly concerns the first commit in this series. The others are more incremental, where one commit introduces some tests and makes them pass. I can tell you that these changes fix the case where there are duplicate indices and stuff in obsm/varm: The previous version (the big code block that I removed) attempted to reconstruct the mapping from old indices to new indices from scratch and was buggy. We have this information at the point where we join data_mod with data_global, but it is then discarded as data_mod is prepared to become the new .obs/.var. So now I just get the mapping from data_mod and data_global.

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this seems better than what was there. And I'm glad there's test coverage. It's really hard to reason about what the function does so I think deduping is a good place to start. This _update_attr function is almost 300 lines (which is better than before and is awesome!). Sorry this took so long

Comment on lines 733 to 748
kept_idx = data_global.index[data_global.index.isin(data_mod.index)]
new_idx = data_mod.index[~data_mod.index.isin(data_global.index)]
data_mod = data_mod.loc[kept_idx.append(new_idx), :]

index_order = data_global.index.get_indexer(data_mod.index)
can_update = (
new_idx.shape[0] == 0 # filtered or reordered
or kept_idx.shape[0] == data_global.shape[0] # new rows only
or data_mod.shape[0]
== data_global.shape[
0
] # renamed (since new_idx.shape[0] > 0 and kept_idx.shape[0] < data_global.shape[0])
or (
axis == self._axis
and axis != -1
and data_mod.shape[0] > data_global.shape[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also looks extremely similar to the other branch of attr_duplicated

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, although I don't really like it. In terms of what this is doing, it should be two functions: The top part just reorders data_mod, while the bottom part prepares to subset obsm/varm. However, the bottom part uses results from the top part, so having this as two functions would mean even more function arguments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to inner functions with nonlocal variables now. Slightly arcane I suppose, but imho still cleaner (as in less boilerplate) than the previous version.

@codecov
Copy link

codecov bot commented Nov 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.12%. Comparing base (0d646ed) to head (b0091b2).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
+ Coverage   96.78%   98.12%   +1.33%     
==========================================
  Files          13       13              
  Lines         872     1011     +139     
==========================================
+ Hits          844      992     +148     
+ Misses         28       19       -9     
Files with missing lines Coverage Δ
tests/test_update.py 99.23% <100.00%> (+8.11%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ilia-kats ilia-kats requested a review from ilan-gold November 17, 2025 08:34
Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, seems good enough. These comments are take or leave. Thanks for sticking with this, hopefully something good came out of all this reviewing.

index_order = None
can_update = True

def fix_attrmap_col(data_mod: pd.DataFrame, mod: str, rowcol: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is readability, I think it's fine to make these inlined functions separate (i.e., not inline) and just give them docstrings. People's IDEs can pick those strings up. I really don't like these

dfs = [
getattr(a, attr)
.loc[:, []]
.assign(**{f"{m}:{rowcol}": np.arange(getattr(a, attr).shape[0])})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a memory bomb - why do we need to arange here? Are there other options?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because we need to handle the case where the index (say obs_names) has duplicates (this case is what's making this entire update method so complex and why we have the .obsmap and .varmap attributes). So let's say you have two modalities, with partially intersecting obs_names, in different order, and one name is duplicated in one modality. We need to somehow make a global .obs_names from all of that, while retaining the ability to map the global .obs_names to observations in the individual modalities. So what happens here is that each obs in every modality gets a guaranteed unique index, which is kept through all the joins and reorders and is in the end assigned to .obsmap[modname], allowing us to unambiguosly map each global obs to a single local obs.

In the example above, the first duplicate obs would be mapped to the unique obs in the other modality, and the second duplicate obs would be treated as unique. I admit that this is somewhat arbitrary, but having .obsmap and .varmap allows us to at east be consistent about this through all update() / push_obs() / pull_obs() invocations. Also, when subsetting a mudata object, using integer indices from obsmap/varmap is faster than doing a Pandas index get_indexer or similar.

As far as a memory bomb is concerned, this is a 1D integer array of length n_obs or n_var for each modality. It should take less space than obs_names or var_names that are already there. If this modality already has arrays in obsmap or varmap, these will be replaced by the new array, so the increased memory consumption is temporary.

@ilia-kats ilia-kats merged commit adc5f53 into scverse:main Dec 9, 2025
5 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants