Description
Hello,
I finally got the cactus-pangenome pipeline to run, after encountering a number of different issues, most of which I believe are due to differences in the cluster environment where I'm trying to run this. Notably, our main filesystems are NFS (also Weka/Lustre). I often use docker, though our cluster uses podman for 'rootless' docker execution.
I hit these issues in various ways, but they fell into two categories: 1) issues with jemalloc, and 2) what I assume are issues with NFS filesystems. I'm reporting them here in case others hit them, and the NFS piece in particular might make sense as a more obvious note in your documentation.
-
Category 1: I ran into many issues with jemalloc and vg. Initially reported here: <jemalloc>: Error in munmap(): Cannot allocate memory #1871. This occurred both with docker and without docker. I ended up creating a new docker container in which vg is built with "mimalloc=on" (see: https://github.com/bimberlabinternal/DevOps/blob/master/containers/cactus/Dockerfile), based on this thread: vg giraffe hangs or stalls when reading large gzipped FASTQ files vgteam/vg#4645.
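For anyone who wants to reproduce that without pulling my container, the relevant part is roughly the following. This is only a sketch: dependency installation is omitted, and the `mimalloc=on` make setting is simply the one used in the Dockerfile linked above.

```bash
# Sketch of the allocator swap: build vg with mimalloc instead of jemalloc.
# The `mimalloc=on` setting is taken from the Dockerfile linked above;
# build dependencies (see vg's README) are assumed to be installed already.
git clone --recursive https://github.com/vgteam/vg.git
cd vg
make -j"$(nproc)" mimalloc=on

# Sanity check: confirm the binary no longer links jemalloc dynamically
# (it may also be statically linked, in which case ldd shows neither).
ldd bin/vg | grep -i -E 'jemalloc|mimalloc' || true
```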
-
Category 2: what I assume are issues with an NFS or Lustre filesystem. Both with docker and without it, I initially ran the pipeline with the cactus jobStore and output directories on our default NFS filesystem. The exact error varied from run to run, but one example is below:
```
[2026-01-15T11:16:29+0000] [MainThread] [I] [toil.lib.history] Workflow a10623d8-fe9f-4af7-9162-ba2eca93bebf stopped. Success: False
Traceback (most recent call last):
File "/home/cactus/cactus_env/bin/cactus-pangenome", line 7, in <module>
sys.exit(main())
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_pangenome.py", line 269, in main
toil.start(Job.wrapJobFn(pangenome_end_to_end_workflow, options, config_wrapper, input_seq_id_map, input_path_map, input_seq_order, ref_collapse_paf_id, last_scores_id))
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1185, in start
return self._runMainLoop(rootJobDescription)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1795, in _runMainLoop
).run()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 341, in run
raise FailedJobsException(
toil.exceptions.FailedJobsException: The job store '/home/exacloud/gscratch/prime-seq/Bimber/ONT/cactus/js' contains 12 failed jobs: 'Job' kind-Job/instance-umzrgkrn v2, 'batch_align_jobs' kind-export_split_wrapper/instance-qriu1rae v12, 'Job' kind-export_minigraph_wrapper/instance-finl2_qq v8, 'join_vg' kind-join_vg/instance-jnr68g00 v3, 'Job' kind-Job/instance-a6njlacf v2, 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v6, 'sort_minigraph_input_with_mash' kind-minigraph_construct_workflow/instance-q3fvzif2 v7, 'sanitize_fasta_headers' kind-pangenome_end_to_end_workflow/instance-xj5a6703 v6, 'graphmap_join_workflow' kind-export_align_wrapper/instance-fl2ts3oy v6, 'Job' kind-make_vcf/instance-jttebdk7 v5, 'Job' kind-export_graphmap_wrapper/instance-jcdts05s v11, 'Job' kind-Job/instance-m9sar_yg v2
Log from job "'vcf_cat' kind-vcf_cat/instance-msw9hw0m v6" follows:
=========>
[2026-01-15T11:15:56+0000] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2026-01-15T11:15:56+0000] [MainThread] [I] [toil] Running Toil version 9.1.2-355b72ba0e425619c0367562caf2a337078dba65 on host cbcebd992853.
[2026-01-15T11:15:56+0000] [MainThread] [I] [toil.worker] Working on job 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v4
[2026-01-15T11:16:03+0000] [MainThread] [I] [toil.worker] Loaded body Job('vcf_cat' kind-vcf_cat/instance-msw9hw0m v4) from description 'vcf_cat' kind-vcf_cat/instance-msw9hw0m v4
[2026-01-15T11:16:03+0000] [MainThread] [C] [toil.worker] Worker crashed with traceback:
Traceback (most recent call last):
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 595, in workerScript
job._runner(
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3254, in _runner
returnValues = self._run(jobGraph=None, fileStore=fileStore)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3137, in _run
return self.run(fileStore)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 3452, in run
rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 1467, in vcf_cat
job.fileStore.readGlobalFile(vcf_id, vcf_path)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/fileStores/nonCachingFileStore.py", line 162, in readGlobalFile
self.jobStore.read_file(fileStoreID, localFilePath, symlink=symlink)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 551, in read_file
self._check_job_store_file_id(file_id)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 1051, in _check_job_store_file_id
raise NoSuchFileException(jobStoreFileID)
toil.jobStores.abstractJobStore.NoSuchFileException: File 'files/for-job/kind-deconstruct/instance-4wy55cn1/file-e3f47756c3b041f183faa91a72f72605/rm-pg.chr1.clip.raw.vcf.gz' does not exist.
[2026-01-15T11:16:03+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host cbcebd992853
<=========
```
I also frequently got errors about /var/run/lock* files not existing. Eventually I ran cactus using:
```bash
# $WORK_DIR is a local SSD connected to the node:
docker run \
    --rm \
    -v /mnt/scratch:/mnt/scratch \
    -e TMPDIR=$WORK_DIR \
    $CONTAINER_NAME \
    cactus-pangenome \
    $JOB_STORE \
    --mgCores $THREADS \
    --maxMemory ${MEM}G \
    --workDir $WORK_DIR \
    --coordinationDir $WORK_DIR \
    $GENOME_FILE \
    --outDir $OUTPUT \
    --outName $OUTPUT \
    --reference $REF_NAME \
    --vcf \
    --giraffe \
    --gfa \
    --gbz
```
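In case it's useful, the same idea without docker looks roughly like the following. This is only a sketch: it assumes a single-machine run with cactus-pangenome installed in a local virtualenv, and the seqfile path is a placeholder. The point is just that the jobStore, TMPDIR, --workDir, and --coordinationDir all live on node-local storage rather than NFS.

```bash
# Sketch of the same fix without docker: keep everything Toil writes on local disk.
# $WORK_DIR is a node-local SSD; /path/to/seqfile.txt is a placeholder.
export TMPDIR=$WORK_DIR
cactus-pangenome \
    $WORK_DIR/js \
    /path/to/seqfile.txt \
    --reference $REF_NAME \
    --outDir $OUTPUT \
    --outName $OUTPUT \
    --workDir $WORK_DIR \
    --coordinationDir $WORK_DIR \
    --mgCores $THREADS \
    --maxMemory ${MEM}G \
    --vcf --giraffe --gfa --gbz
```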
I didn't see 'NFS' mentioned anywhere in your documentation, so I thought I'd report this in case there is user guidance that would make sense to add there.