Optimise remote file handling #46
Draft
Preprocess remote file lists in batches and stagger job submission
Attempts to help prevent XRootD error issues during parallelisation with large numbers of jobs (50+).

- Added a 50-200 ms random sleep with `time.sleep(random.uniform(0.05, 0.20))` to each worker function, to reduce I/O hammering (see the worker sketch below).
- Modified `pyprocess.py:Processor` to resolve xrootd URLs prior to job submission, so each job no longer needs to make an `mdh` subprocess call in `pyread`, which can lead to XRootD error issues if you submit too many jobs. This also means that `pyread` can be deleted entirely, since we can just `uproot.open(file_name)` in `pyimport.py`.

I have not done any performance testing yet, so I can't say whether the tradeoff between submitting more jobs and the staggering + preprocessing improves processing time. It looks like it might be a slight improvement...
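For reference, here is a minimal sketch of what a staggered worker might look like. The function name `process_file` and the return value are purely illustrative; the only parts taken from this PR are the jitter sleep and opening a pre-resolved URL directly with `uproot.open`:

```python
import random
import time

import uproot


def process_file(file_url):
    """Hypothetical worker: file_url is a fully resolved xrootd URL,
    produced once in the main process before job submission."""
    # 50-200 ms random jitter so that many workers starting at once
    # do not all hit the XRootD server at exactly the same time.
    time.sleep(random.uniform(0.05, 0.20))

    # No per-job mdh subprocess call needed; open the URL directly.
    with uproot.open(file_url) as f:
        return f.keys()
```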
This is what it looks like with the preprocessing:
I also added a `postprocess` method to `Skeleton`, which by default does nothing, but it can be overridden so that users can combine their results there.
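A rough sketch of the intended usage, assuming `postprocess` receives the list of per-job results (the stub `Skeleton` and the `MyAnalysis` subclass below are made up for illustration; the real signature may differ):

```python
# Minimal sketch only; the real Skeleton lives in this package and its
# actual postprocess signature may differ.
class Skeleton:
    def postprocess(self, results):
        # Default: do nothing, just pass the per-job results through.
        return results


class MyAnalysis(Skeleton):
    def postprocess(self, results):
        # User override: combine per-file results, e.g. sum event counts.
        return sum(results)
```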