@stsnel

@stsnel stsnel commented Apr 1, 2025

Yoda is a research data management solution developed by Utrecht University and used by multiple institutes around the world. It enables researchers and their partners to securely deposit, share, publish and preserve large amounts of research data during all stages of a research project.

This adds support for the Yoda repositories of Utrecht University (UU) and Vrije Universiteit Amsterdam (VU).

This PR addresses issue #5.

@stsnel stsnel force-pushed the add-yoda-support branch from d7d2b65 to 9437016 Compare May 20, 2025 16:18
@stsnel
Author

stsnel commented May 20, 2025

Just made a small edit to also add the WUR Yoda repository, which published its first data package recently.

@J535D165
Owner

Thanks a lot, @stsnel. I will review it tomorrow!

Owner

@J535D165 J535D165 left a comment

Sorry it took so long to review @stsnel. I'm slowly catching up with my GitHub notifications now.

I love the PR. I hope to see a REST API for Yoda in the future, but having support for Yoda in DataHugger before that is amazing. I have a few small pieces of feedback for you. I hope to merge soon.

Comment on lines 472 to 476
if not hasattr(self, "_files"):
    self._requests_cache_file = tempfile.NamedTemporaryFile(delete=False)
    requests_cache.install_cache(self._requests_cache_file.name)
    self._files = self._harvest_files()
    self._cleanup_requests_cache()
return self._files
Owner

I wonder why you cache this. It makes sense, but is there a reason to do this for Yoda specifically? Or should we implement this feature for all services in a generic way?

Author

The original intent of this part of the code was to cache DNS query responses. In the case of Yoda, we need to request every file in the data set individually (rather than, say, requesting a single zip file that contains all the files). This can result in significant overhead for name resolution. Apart from that, flaky DNS servers can cause the harvest of some files to fail.

However, this solution ultimately involves monkey patching the requests module (or one of the lower-level modules), which could interfere with other software that depends on datahugger. The implementation also did little to improve performance.

After reconsidering, I have removed this part of the code.
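
For illustration only, here is a minimal sketch of what the lazy accessor could look like once the cache-related lines are gone. It assumes the `hasattr` check and the `_harvest_files()` call from the excerpt above stay as they are; the class name `LazyFileList` and the injected `harvest_files` callable are hypothetical stand-ins, not names from the PR.

```python
from typing import Callable, List


class LazyFileList:
    """Hypothetical stand-in for the service class discussed in this review."""

    def __init__(self, harvest_files: Callable[[], List[str]]) -> None:
        # The harvesting routine is injected here so the sketch runs offline.
        self._harvest_files = harvest_files

    @property
    def files(self) -> List[str]:
        # Harvest on first access only; later accesses reuse the stored list.
        if not hasattr(self, "_files"):
            self._files = self._harvest_files()
        return self._files
```

The file list is still computed once per instance, but nothing is patched globally, so other code that uses requests is unaffected.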

folders_to_process = [contents_url]
files_to_download = []

while True:
Owner

I wonder whether the while loop is needed here.

Author

The purpose of the loop is to iterate through any subcollections (subdirectories) of the data packages if needed. I've adjusted the loop condition and added a comment to make this clearer.
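
As a rough sketch of the idea (not the exact code in the PR), the traversal can be written as a loop that runs while there are still folders left to visit. The names `folders_to_process`, `files_to_download` and `contents_url` come from the excerpt above; `collect_files` and the `list_folder` callable, which returns `(url, is_folder)` pairs for one folder, are hypothetical and only there to keep the example self-contained.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

# Hypothetical listing function: given a folder URL, yield (url, is_folder) pairs.
Lister = Callable[[str], Iterable[Tuple[str, bool]]]


def collect_files(contents_url: str, list_folder: Lister) -> List[str]:
    """Walk a data package iteratively and return the URLs of all files."""
    folders_to_process = deque([contents_url])
    files_to_download: List[str] = []

    # Keep going until no unvisited (sub)collections remain.
    while folders_to_process:
        folder_url = folders_to_process.popleft()
        for entry_url, is_folder in list_folder(folder_url):
            if is_folder:
                # Subcollection: queue it so its contents are listed later.
                folders_to_process.append(entry_url)
            else:
                files_to_download.append(entry_url)

    return files_to_download


# Example with an in-memory tree instead of real HTTP requests.
example_tree = {
    "root/": [("root/a.txt", False), ("root/sub/", True)],
    "root/sub/": [("root/sub/b.txt", False)],
}
example_files = collect_files("root/", lambda url: example_tree[url])
```

Using the queue itself as the loop condition makes the termination criterion explicit: the loop ends once every subcollection has been listed.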

Owner

Great!

@J535D165
Owner

By the way, I'm fixing some of the broken tests, so don't worry about them.

@stsnel
Author

stsnel commented Jun 27, 2025

Thank you for the feedback 👍 ! I expect to be able to process the feedback and respond within a few days.

@stsnel stsnel force-pushed the add-yoda-support branch 2 times, most recently from 3705bb7 to 165fb6c Compare July 1, 2025 20:45
Yoda is a research data management solution developed by Utrecht University and used
by multiple institutes around the world. It enables researchers and their partners to
securely deposit, share, publish and preserve large amounts of research data during
all stages of a research project.

This adds support for the Yoda repositories of Utrecht University (UU),
Vrije Universiteit Amsterdam (VU), as well as Wageningen University &
Research (WUR).
@stsnel stsnel force-pushed the add-yoda-support branch from 031dd5f to bafb85b Compare July 1, 2025 21:04