Fixes for the new update() #104
Conversation
ilia-kats commented on Sep 30, 2025
- add unit tests for the new update()
- remove redundant/obsolete code and streamline the new update()
- fix bugs and make it pass all tests, in particular with duplicated index entries and different axes
Force-pushed: 74444ba → 0cd6110, 5256ea9 → 8040ea1, b29c475 → ed46755
Thanks, @ilia-kats! I especially appreciate the code reduction; it's a great direction, and maybe we can eliminate even more code. Since this is a difficult part of the codebase, with the logic spread across so many lines, feel free to add more comments explaining the logic if you think that would help maintainability! The only remaining question before we merge is about potential performance regressions from the changes.
I haven't measured performance. On the one hand, I would expect it to be faster, since I replaced an entire code block at the bottom of the function with a single function call.
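The replaced call isn't named above, but the general intuition that a single vectorized pandas call beats an equivalent Python-level loop can be sketched with a toy benchmark (all names here are illustrative and not the actual `update()` internals):

```python
import pandas as pd

# illustrative data: a large string index and a reversed target ordering
idx = pd.Index([f"obs{i}" for i in range(100_000)])
target = idx[::-1]

def loop_lookup():
    # Python-level loop: build a dict, then look up every label one by one
    pos = {label: i for i, label in enumerate(idx)}
    return [pos[label] for label in target]

def vectorized_lookup():
    # a single pandas call performing the same mapping internally
    return idx.get_indexer(target)

# both produce the same positions; the vectorized call avoids the Python loop
assert list(vectorized_lookup()) == loop_lookup()
```

Wrapping both in `timeit` would give actual numbers for the specific workload.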
ilan-gold
left a comment
> fix bugs and make it pass all tests, in particular with duplicated index entries and different axes
Can you highlight which lines fix which bugs?
- all tests now also test with duplicated entries
- streamline update() to remove redundant code and duplicated calculations
- fix bugs and make it pass all tests
Force-pushed: ed46755 → 9477514
I'm afraid not for everything. The previous version was not covered by the unit tests, so I just enabled the tests and tried to make them all pass. This particularly concerns the first commit in this series. The others are more incremental: each commit introduces some tests and makes them pass. I can tell you that these changes fix, in particular, the case where there are duplicate indices.
Force-pushed: 9477514 → d8610cb
ilan-gold
left a comment
Overall, this seems better than what was there, and I'm glad there's test coverage. It's really hard to reason about what the function does, so I think deduping is a good place to start. This _update_attr function is still almost 300 lines (which is better than before, and that's awesome!). Sorry this took so long.
src/mudata/_core/mudata.py
```python
kept_idx = data_global.index[data_global.index.isin(data_mod.index)]
new_idx = data_mod.index[~data_mod.index.isin(data_global.index)]
data_mod = data_mod.loc[kept_idx.append(new_idx), :]

index_order = data_global.index.get_indexer(data_mod.index)
can_update = (
    new_idx.shape[0] == 0  # filtered or reordered
    or kept_idx.shape[0] == data_global.shape[0]  # new rows only
    or data_mod.shape[0] == data_global.shape[0]  # renamed (since new_idx.shape[0] > 0 and kept_idx.shape[0] < data_global.shape[0])
    or (
        axis == self._axis
        and axis != -1
        and data_mod.shape[0] > data_global.shape[0]
```
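For readers following along, here is a small standalone illustration of the indexing primitives used in the snippet above (the data is made up, not from the mudata test suite):

```python
import pandas as pd

data_global = pd.DataFrame(index=pd.Index(["a", "b", "c"]))
data_mod = pd.DataFrame(index=pd.Index(["c", "a", "d"]))

# global labels that survive in the modality (in global order)
kept_idx = data_global.index[data_global.index.isin(data_mod.index)]  # ['a', 'c']
# labels that exist only in the modality
new_idx = data_mod.index[~data_mod.index.isin(data_global.index)]  # ['d']
# reorder the modality rows: kept labels first, then new ones
data_mod = data_mod.loc[kept_idx.append(new_idx), :]

# position of each reordered modality label in the global index, -1 if new
index_order = data_global.index.get_indexer(data_mod.index)  # [0, 2, -1]
```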
This also looks extremely similar to the other branch of `attr_duplicated`.
Done, although I don't really like it. In terms of what this is doing, it should be two functions: the top part just reorders data_mod, while the bottom part prepares to subset obsm/varm. However, the bottom part uses results from the top part, so splitting it into two functions would mean even more function arguments.
I switched to inner functions with nonlocal variables now. Slightly arcane, I suppose, but imho still cleaner (as in: less boilerplate) than the previous version.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##            main     #104     +/-  ##
=========================================
+ Coverage  96.78%   98.12%   +1.33%
=========================================
  Files         13       13
  Lines        872     1011     +139
=========================================
+ Hits         844      992     +148
+ Misses        28       19       -9
=========================================
```
ilan-gold
left a comment
At this point, this seems good enough. The following comments are take-or-leave. Thanks for sticking with this; hopefully something good came out of all this reviewing.
```python
index_order = None
can_update = True

def fix_attrmap_col(data_mod: pd.DataFrame, mod: str, rowcol: str) -> str:
```
If the goal is readability, I think it's fine to make these inlined functions separate (i.e., not inline) and just give them docstrings. People's IDEs can pick those docstrings up. I really don't like these.
```python
dfs = [
    getattr(a, attr)
    .loc[:, []]
    .assign(**{f"{m}:{rowcol}": np.arange(getattr(a, attr).shape[0])})
```
This seems like a memory bomb. Why do we need to `arange` here? Are there other options?
This is because we need to handle the case where the index (say, obs_names) has duplicates (this case is what makes this entire update method so complex, and why we have the .obsmap and .varmap attributes). So let's say you have two modalities with partially intersecting obs_names, in different order, and one name is duplicated in one modality. We need to somehow construct a global .obs_names from all of that, while retaining the ability to map the global .obs_names to observations in the individual modalities. So what happens here is that each obs in every modality gets a guaranteed-unique index, which is kept through all the joins and reorders and is in the end assigned to .obsmap[modname], allowing us to unambiguously map each global obs to a single local obs.
In the example above, the first duplicate obs would be mapped to the unique obs in the other modality, and the second duplicate obs would be treated as unique. I admit that this is somewhat arbitrary, but having .obsmap and .varmap allows us to at least be consistent about this across all update() / push_obs() / pull_obs() invocations. Also, when subsetting a MuData object, using integer indices from obsmap/varmap is faster than doing a Pandas index get_indexer or similar.
As far as the memory bomb is concerned, this is a 1D integer array of length n_obs or n_var per modality. It should take less space than the obs_names or var_names that are already there. If the modality already has arrays in obsmap or varmap, those will be replaced by the new arrays, so the increased memory consumption is temporary.
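To make the tagging concrete, here is a minimal sketch (hypothetical modality name; the real duplicate-resolution logic in update() is more involved than this):

```python
import numpy as np
import pandas as pd

# one modality whose obs_names contain a duplicate
obs = pd.DataFrame(index=pd.Index(["x", "y", "y"]))

# tag every local obs with a guaranteed-unique integer, as in the snippet above
tagged = obs.loc[:, []].assign(**{"mod:rowidx": np.arange(obs.shape[0])})

# the tags stay unique even though the index has duplicates, so after all
# joins and reorders each global obs still maps back to exactly one local row
print(tagged["mod:rowidx"].tolist())  # [0, 1, 2]
```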