Optimise remote file handling #46
Draft
Preprocess remote file lists in batches and stagger job submission
Attempts to help prevent XRootD error issues during parallelisation with large numbers of jobs (50+).

- Added a 50-200 ms random sleep with `time.sleep(random.uniform(0.05, 0.20))` to each worker function, to reduce I/O hammering (see the worker sketch below).
- Modified `pyprocess.py:Processor` to resolve xrootd URLs prior to job submission, so each job no longer needs to make an `mdh` subprocess call in `pyread`, which can lead to XRootD error issues if you submit too many jobs. This also means that `pyread` can be deleted entirely, since we can just `uproot.open(file_name)` in `pyimport.py`.

I have not done any performance testing yet, so I can't say whether the tradeoff between submitting more jobs and the staggering + preprocessing improves processing time. It looks like it might be a slight improvement...
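For reference, here is a minimal sketch of what a staggered worker might look like. The function name `process_file` and the return value are purely illustrative; the only parts taken from this PR are the jitter sleep and opening a pre-resolved URL directly with `uproot.open`:

```python
import random
import time

import uproot


def process_file(file_url):
    """Hypothetical worker: file_url is a fully resolved xrootd URL,
    produced once in the main process before job submission."""
    # 50-200 ms random jitter so that many workers starting at once
    # do not all hit the XRootD server at exactly the same time.
    time.sleep(random.uniform(0.05, 0.20))

    # No per-job mdh subprocess call needed; open the URL directly.
    with uproot.open(file_url) as f:
        return f.keys()
```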
This is what it looks like with the preprocessing:
I also added a `postprocess` method to `Skeleton`, which by default does nothing, but it can be overridden so that users can combine their results there.
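A rough sketch of the intended usage, assuming `postprocess` receives the list of per-job results (the stub `Skeleton` and the `MyAnalysis` subclass below are made up for illustration; the real signature may differ):

```python
# Minimal sketch only; the real Skeleton lives in this package and its
# actual postprocess signature may differ.
class Skeleton:
    def postprocess(self, results):
        # Default: do nothing, just pass the per-job results through.
        return results


class MyAnalysis(Skeleton):
    def postprocess(self, results):
        # User override: combine per-file results, e.g. sum event counts.
        return sum(results)
```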