-
Notifications
You must be signed in to change notification settings - Fork 13
Add support for Yoda repositories #100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Just made a small edit to also add the WUR Yoda repository, which published its first data package recently. |
|
Thanks a lot, @stsnel. I will review it tomorrow! |
J535D165
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took so long to review @stsnel. I'm slowly catching up with my GitHub notifications now.
I love the PR. I hope to see a REST API for Yoda in the future, but having support for Yoda in DataHugger before that is amazing. I have a couple of small feedback for you. I hope to merge soon.
| if not hasattr(self, "_files"): | ||
| self._requests_cache_file = tempfile.NamedTemporaryFile(delete=False) | ||
| requests_cache.install_cache(self._requests_cache_file.name) | ||
| self._files = self._harvest_files() | ||
| self._cleanup_requests_cache() | ||
| return self._files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why you cache this. It makes sense, but is there a reason to do this for Yoda specifically? Or should we implement this feature for all services in a generic way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original intent of this part of the code was to cache DNS query responses. In case of Yoda we need to request every file in the data set individually (rather than say, just requesting a single zip file that contains all the files). This can result in significant overhead for name resolution. Apart from that, flaky DNS servers can result in failures to harvest all files.
However, this solution ultimately involves monkey patching the requests module (or one of the lower-level modules), which can potentially interfere with other software that depends on datahugger. The implementation also didn't help (that much) with improving performance.
After reconsidering, I have removed this part of the code.
datahugger/services.py
Outdated
| folders_to_process = [contents_url] | ||
| files_to_download = [] | ||
|
|
||
| while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether the while loop is needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of the loop is to iterate through any subcollections (subdirectories) of the data packages if needed. I've adjusted the loop condition and added a comment to make this clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
|
By the way, I'm fixing some of the broken tests, so don't worry about them. |
|
Thank you for the feedback 👍 ! I expect to be able to process the feedback and respond within a few days. |
3705bb7 to
165fb6c
Compare
Yoda is a research data management solution developed by Utrecht University and used by multiple institutes around the world. It enables researchers and their partners to securely deposit, share, publish and preserve large amounts of research data during all stages of a research project. This adds support for the Yoda repositories of Utrecht University (UU), Vrije Universiteit Amsterdam (VU), as well as Wageningen University & Research (WUR).
for more information, see https://pre-commit.ci
Yoda is a research data management solution developed by Utrecht University and used by multiple institutes around the world. It enables researchers and their partners to securely deposit, share, publish and preserve large amounts of research data during all stages of a research project.
This adds support for the Yoda repositories of Utrecht University (UU) and Vrije Universiteit Amsterdam (VU).
This PR addresses issue #5