Description
Hello,
I finally got the cactus-pangenome pipeline to run, after encountering a number of different issues, most of which I believe are due to differences in the cluster environment where I'm trying to run this. Notably, our main filesystems are NFS (also Weka/Lustre). I often use docker, though our cluster uses podman for 'rootless' docker execution.
I hit these issues in various ways, but they fell into two categories: 1) issues with jemalloc, and 2) what I assume are issues with NFS filesystems. I'm reporting them here in case others hit them, and the NFS piece in particular might make sense as a more obvious note in your documentation.
-
Category 1: I ran into many issues with jemalloc and vg. Initially reported here: <jemalloc>: Error in munmap(): Cannot allocate memory #1871. This occurred both with docker and without docker. I ended up creating a new docker container in which vg is built with "mimalloc=on" (see: https://github.com/bimberlabinternal/DevOps/blob/master/containers/cactus/Dockerfile), based on this thread: vg giraffe hangs or stalls when reading large gzipped FASTQ files vgteam/vg#4645.
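For anyone who wants to reproduce that without pulling my container, the relevant part is roughly the following. This is only a sketch: dependency installation is omitted, and the `mimalloc=on` make setting is simply the one used in the Dockerfile linked above.

```bash
# Sketch of the allocator swap: build vg with mimalloc instead of jemalloc.
# The `mimalloc=on` setting is taken from the Dockerfile linked above;
# build dependencies (see vg's README) are assumed to be installed already.
git clone --recursive https://github.com/vgteam/vg.git
cd vg
make -j"$(nproc)" mimalloc=on

# Sanity check: confirm the binary no longer links jemalloc dynamically
# (it may also be statically linked, in which case ldd shows neither).
ldd bin/vg | grep -i -E 'jemalloc|mimalloc' || true
```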
-
Category 2: what I assume are issues with an NFS or Lustre filesystem. Both with docker and without it, I initially ran the pipeline with the cactus jobStore and output directories on our default NFS filesystem. The exact error varied from run to run, but one example is below:
```
[2026-01-15T11:16:29+0000] [MainThread] [I] [toil.lib.history] Workflow a10623d8-fe9f-4af7-9162-ba2eca93bebf stopped. Success: False
Traceback (most recent call last):
File "/home/cactus/cactus_env/bin/cactus-pangenome", line 7, in <module>
sys.exit(main())
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_pangenome.py", line 269, in main
toil.start(Job.wrapJobFn(pangenome_end_to_end_workflow, options, config_wrapper, input_seq_id_map, input_path_map, input_seq_order, ref_collapse_paf_id, last_scores_id))
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1185, in start
return self._runMainLoop(rootJobDescription)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1795, in _runMainLoop
).run()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 341, in run
raise FailedJobsException(
toil.exceptions.FailedJobsException: The job store '/home/exacloud/gscratch/prime-seq/Bimber/ONT/cactus/js' contains 12 failed jobs: 'Job' kind-Job/instance-umzrgkrn v2, 'batch_align_jobs' kind-export_split_wrapper/instance-qriu1rae v12, 'Job' kind-export_minigraph_wrapper/instance-finl2_qq v8, 'join_vg' kind-join_vg/instance-jnr68g00 v3, 'Job' kind-Job/instance-a6njlacf v2, 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v6, 'sort_minigraph_input_with_mash' kind-minigraph_construct_workflow/instance-q3fvzif2 v7, 'sanitize_fasta_headers' kind-pangenome_end_to_end_workflow/instance-xj5a6703 v6, 'graphmap_join_workflow' kind-export_align_wrapper/instance-fl2ts3oy v6, 'Job' kind-make_vcf/instance-jttebdk7 v5, 'Job' kind-export_graphmap_wrapper/instance-jcdts05s v11, 'Job' kind-Job/instance-m9sar_yg v2
Log from job "'vcf_cat' kind-vcf_cat/instance-msw9hw0m v6" follows:
=========>
[2026-01-15T11:15:56+0000] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2026-01-15T11:15:56+0000] [MainThread] [I] [toil] Running Toil version 9.1.2-355b72ba0e425619c0367562caf2a337078dba65 on host cbcebd992853.
[2026-01-15T11:15:56+0000] [MainThread] [I] [toil.worker] Working on job 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v4
[2026-01-15T11:16:03+0000] [MainThread] [I] [toil.worker] Loaded body Job('vcf_cat' kind-vcf_cat/instance-msw9hw0m v4) from description 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v4
[2026-01-15T11:16:03+0000] [MainThread] [C] [toil.worker] Worker crashed with traceback:
Traceback (most recent call last):
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 595, in workerScript
job._runner(
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3254, in _runner
returnValues = self._run(jobGraph=None, fileStore=fileStore)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3137, in _run
return self.run(fileStore)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3452, in run
rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 1467, in vcf_cat
job.fileStore.readGlobalFile(vcf_id, vcf_path)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/fileStores/nonCachingFileStore.py", line 162, in readGlobalFile
self.jobStore.read_file(fileStoreID, localFilePath, symlink=symlink)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 551, in read_file
self._check_job_store_file_id(file_id)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 1051, in _check_job_store_file_id
raise NoSuchFileException(jobStoreFileID)
toil.jobStores.abstractJobStore.NoSuchFileException: File 'files/for-job/kind-deconstruct/instance-4wy55cn1/file-e3f47756c3b041f183faa91a72f72605/rm-pg.chr1.clip.raw.vcf.gz' does not exist.
[2026-01-15T11:16:03+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host cbcebd992853
<=========
```
I also frequently got errors about /var/run/lock* files not existing. Eventually I ran cactus using:
```bash
# $WORK_DIR is a local SSD connected to the node:
docker run \
    --rm \
    -v /mnt/scratch:/mnt/scratch \
    -e TMPDIR=$WORK_DIR \
    $CONTAINER_NAME \
    cactus-pangenome \
    $JOB_STORE \
    --mgCores $THREADS \
    --maxMemory ${MEM}G \
    --workDir $WORK_DIR \
    --coordinationDir $WORK_DIR \
    $GENOME_FILE \
    --outDir $OUTPUT \
    --outName $OUTPUT \
    --reference $REF_NAME \
    --vcf \
    --giraffe \
    --gfa \
    --gbz
```
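In case it's useful, the same idea without docker looks roughly like the following. This is only a sketch: it assumes a single-machine run with cactus-pangenome installed in a local virtualenv, and the seqfile path is a placeholder. The point is just that the jobStore, TMPDIR, --workDir, and --coordinationDir all live on node-local storage rather than NFS.

```bash
# Sketch of the same fix without docker: keep everything Toil writes on local disk.
# $WORK_DIR is a node-local SSD; /path/to/seqfile.txt is a placeholder.
export TMPDIR=$WORK_DIR
cactus-pangenome \
    $WORK_DIR/js \
    /path/to/seqfile.txt \
    --reference $REF_NAME \
    --outDir $OUTPUT \
    --outName $OUTPUT \
    --workDir $WORK_DIR \
    --coordinationDir $WORK_DIR \
    --mgCores $THREADS \
    --maxMemory ${MEM}G \
    --vcf --giraffe --gfa --gbz
```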
I didn't see 'NFS' mentioned anywhere in your documentation, so I thought I'd report this in case there is user guidance that would make sense to add there.