Fixes for the new update() #104

ilia-kats · 2025-09-30T09:26:02Z

add unit tests for the new update()
remove redundant/obsolete code and streamline the new update()
fix bugs and make it pass all tests, in particular with duplicated index entries and different axes

gtca · 2025-10-09T23:03:45Z

Thanks, @ilia-kats! I especially appreciate the code reduction — it's a great direction, and maybe we can eliminate even more code in the .update() functionality to make it simpler in future.

Because it's a difficult part of the codebase to track the logic across so many lines, if you think there are more comments explaining the logic that can help this to be more maintainable, feel free to add them!

The only question before we merge it is about potential performance regressions with the changes to the .update() details: do you know if the execution time was impacted by any of the changes?

ilia-kats · 2025-10-10T06:57:12Z

I haven't measured performance. On the one hand, I would expect it to be faster since I replaced this entire code block at the bottom of the function with a single call to data_global.index.get_indexer() further up, but on the other hand it now does a full comparison of the modmaps for some modalities, which will be slower. However, that was the only way I could think of to make all the tests pass (i.e. the previous version was buggy and didn't work in some cases), so I don't think we have much choice there. But I'll be happy to run some benchmarks if you can point me to a particularly nasty dataset performance-wise, I've never had issues with update() performance myself.

ilan-gold

fix bugs and make it pass all tests, in particular with duplicated index entries and different axes

Can you highlight which lines fix which bugs?

- all tests now also test with duplicated entries - streamline update() to remove redundant code, duplicated calcuations - fix bugs and make it pass all tests

ilia-kats · 2025-10-29T15:22:18Z

fix bugs and make it pass all tests, in particular with duplicated index entries and different axes

Can you highlight which lines fix which bugs?

I'm afraid not for everything. The previous version was not being tested in the unit tests, so I just enabled the tests and tried to make them all pass. This particularly concerns the first commit in this series. The others are more incremental, where one commit introduces some tests and makes them pass. I can tell you that these changes fix the case where there are duplicate indices and stuff in obsm/varm: The previous version (the big code block that I removed) attempted to reconstruct the mapping from old indices to new indices from scratch and was buggy. We have this information at the point where we join data_mod with data_global, but it is then discarded as data_mod is prepared to become the new .obs/.var. So now I just get the mapping from data_mod and data_global.

ilan-gold

Overall, this seems better than what was there. And I'm glad there's test coverage. It's really hard to reason about what the function does so I think deduping is a good place to start. This _update_attr function is almost 300 lines (which is better than before and is awesome!). Sorry this took so long

src/mudata/_core/mudata.py

ilan-gold · 2025-11-03T10:12:15Z

src/mudata/_core/mudata.py

+                kept_idx = data_global.index[data_global.index.isin(data_mod.index)]
+                new_idx = data_mod.index[~data_mod.index.isin(data_global.index)]
+                data_mod = data_mod.loc[kept_idx.append(new_idx), :]
+
+                index_order = data_global.index.get_indexer(data_mod.index)
+                can_update = (
+                    new_idx.shape[0] == 0  # filtered or reordered
+                    or kept_idx.shape[0] == data_global.shape[0]  # new rows only
+                    or data_mod.shape[0]
+                    == data_global.shape[
+                        0
+                    ]  # renamed (since new_idx.shape[0] > 0 and kept_idx.shape[0] < data_global.shape[0])
+                    or (
+                        axis == self._axis
+                        and axis != -1
+                        and data_mod.shape[0] > data_global.shape[0]


This also looks extremely similar to the other branch of attr_duplicated

Done, although I don't really like it. In terms of what this is doing, it should be two functions: The top part just reorders data_mod, while the bottom part prepares to subset obsm/varm. However, the bottom part uses results from the top part, so having this as two functions would mean even more function arguments.

I switched to inner functions with nonlocal variables now. Slightly arcane I suppose, but imho still cleaner (as in less boilerplate) than the previous version.

codecov · 2025-11-03T12:52:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.12%. Comparing base (0d646ed) to head (b0091b2).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
+ Coverage   96.78%   98.12%   +1.33%     
==========================================
  Files          13       13              
  Lines         872     1011     +139     
==========================================
+ Hits          844      992     +148     
+ Misses         28       19       -9

Files with missing lines	Coverage Δ
tests/test_update.py	`99.23% <100.00%> (+8.11%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ilan-gold

At this point, seems good enough. These comments are take or leave. Thanks for sticking with this, hopefully something good came out of all this reviewing.

ilan-gold · 2025-12-09T13:05:34Z

src/mudata/_core/mudata.py

+        index_order = None
+        can_update = True
+
+        def fix_attrmap_col(data_mod: pd.DataFrame, mod: str, rowcol: str) -> str:


If the goal is readability, I think it's fine to make these inlined functions separate (i.e., not inline) and just give them docstrings. People's IDEs can pick those strings up. I really don't like these

ilan-gold · 2025-12-09T13:18:56Z

src/mudata/_core/mudata.py

+        dfs = [
+            getattr(a, attr)
+            .loc[:, []]
+            .assign(**{f"{m}:{rowcol}": np.arange(getattr(a, attr).shape[0])})


This seems like a memory bomb - why do we need to arange here? Are there other options?

This is because we need to handle the case where the index (say obs_names) has duplicates (this case is what's making this entire update method so complex and why we have the .obsmap and .varmap attributes). So let's say you have two modalities, with partially intersecting obs_names, in different order, and one name is duplicated in one modality. We need to somehow make a global .obs_names from all of that, while retaining the ability to map the global .obs_names to observations in the individual modalities. So what happens here is that each obs in every modality gets a guaranteed unique index, which is kept through all the joins and reorders and is in the end assigned to .obsmap[modname], allowing us to unambiguosly map each global obs to a single local obs.

In the example above, the first duplicate obs would be mapped to the unique obs in the other modality, and the second duplicate obs would be treated as unique. I admit that this is somewhat arbitrary, but having .obsmap and .varmap allows us to at east be consistent about this through all update() / push_obs() / pull_obs() invocations. Also, when subsetting a mudata object, using integer indices from obsmap/varmap is faster than doing a Pandas index get_indexer or similar.

As far as a memory bomb is concerned, this is a 1D integer array of length n_obs or n_var for each modality. It should take less space than obs_names or var_names that are already there. If this modality already has arrays in obsmap or varmap, these will be replaced by the new array, so the increased memory consumption is temporary.

src/mudata/_core/utils.py

ilia-kats requested a review from gtca September 30, 2025 09:26

ilia-kats force-pushed the update_fixes branch 2 times, most recently from 74444ba to 0cd6110 Compare September 30, 2025 12:44

ilia-kats marked this pull request as draft September 30, 2025 14:13

ilia-kats force-pushed the update_fixes branch from 5256ea9 to 8040ea1 Compare October 1, 2025 09:22

ilia-kats marked this pull request as ready for review October 2, 2025 16:15

ilia-kats force-pushed the update_fixes branch from b29c475 to ed46755 Compare October 8, 2025 07:58

ilia-kats mentioned this pull request Oct 8, 2025

concat error if mdata.obsm contains multi-dimensional array and merge is not None #103

Closed

ilia-kats requested a review from ilan-gold October 9, 2025 16:11

ilia-kats mentioned this pull request Oct 22, 2025

fixes for push_attr/pull_attr #105

Merged

ilan-gold reviewed Oct 29, 2025

View reviewed changes

ilia-kats added 7 commits October 29, 2025 15:44

update(): remove redundant code, fix some bugs and add tests

0cf3501

make update() work well when indices have duplicates

723a63c

- all tests now also test with duplicated entries - streamline update() to remove redundant code, duplicated calcuations - fix bugs and make it pass all tests

fix inserting modalities

22a1b14

fix deleting modalities

dfe9ec0

remove obsolete asserts

0be9ecf

more comprehensive tests and fixes for _update_attr()

082433f

add test for extremely duplicated names

826c8ce

ilia-kats force-pushed the update_fixes branch from ed46755 to 9477514 Compare October 29, 2025 15:11

add tests for dtype

d8610cb

ilia-kats force-pushed the update_fixes branch from 9477514 to d8610cb Compare October 29, 2025 15:24

ilan-gold reviewed Nov 3, 2025

View reviewed changes

ilia-kats added 3 commits November 3, 2025 13:22

deduplicate processing of the map column

c1ba851

deduplicate reordering of data_mod and attrm

106a362

remove unnecessary boolean coercion

2d04cea

use inner functions

b0091b2

silence ruff false positives

dd212ed

ilia-kats requested a review from ilan-gold November 17, 2025 08:34

ilan-gold approved these changes Dec 9, 2025

View reviewed changes

minor simplification

826c2b9

ilia-kats merged commit adc5f53 into scverse:main Dec 9, 2025
5 of 8 checks passed

Fixes for the new update() #104

Fixes for the new update() #104

Uh oh!

Conversation

ilia-kats commented Sep 30, 2025

Uh oh!

gtca commented Oct 9, 2025

Uh oh!

ilia-kats commented Oct 10, 2025

Uh oh!

ilan-gold left a comment

Choose a reason for hiding this comment

Uh oh!

ilia-kats commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilan-gold left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ilan-gold Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

ilia-kats Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

ilia-kats Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ilan-gold left a comment

Choose a reason for hiding this comment

Uh oh!

ilan-gold Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ilan-gold Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ilia-kats Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ilia-kats commented Oct 29, 2025 •

edited

Loading

codecov bot commented Nov 3, 2025 •

edited

Loading