Description
Describe the bug
We noticed some strange behavior when using Trident 25.10.0 with the concurrency feature enabled.
We really need this feature because of our high PVC count and GPU workloads; with it disabled, Trident stops responding to API calls.
Issue 1:
When an application with a PVC is deleted, the following happens:
The PVC is deleted.
The PV is marked as Released with reclaim policy "Delete".
The export policy on the volume is changed to trident_empty, with no rules attached.
A few seconds later the trident-controller pod starts complaining, and we see the following error messages in its logs:
csi-attacher I1202 10:35:30.821720 1 csi_handler.go:241] "Error processing" driver="csi.trident.netapp.io" VolumeAttachment="csi-d1c1b612d22ab7761af8eb1b3f04b60e3872179efc2c279a0fa1425b53fe10d2" err="failed to detach: rpc error: code = Internal desc = error listing export policy rules; could not get export policy trident_pvc_e73bc978_1395_4b93_a5dc_3daa9be182fd"
csi-provisioner I1202 10:39:21.311833 1 controller.go:1317] volume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd does not need any deletion secrets
csi-provisioner E1202 10:39:21.311871 1 controller.go:1569] "Volume deletion failed" err="persistentvolume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd is still attached to node <dns-name>" PV="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd"
csi-provisioner I1202 10:39:21.311886 1 controller.go:1021] "Retrying syncing volume" key="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd" failures=575
csi-provisioner E1202 10:39:21.311895 1 controller.go:1039] "Unhandled Error" err="error syncing volume \"pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd\": persistentvolume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd is still attached to node <dns-name>" logger="UnhandledError"
csi-provisioner I1202 10:39:21.311907 1 event.go:389] "Event occurred" object="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="persistentvolume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd is still attached to node <dns-name>"
When I manually create an export policy with the same name as the volume and a subnet rule attached to it, Trident retries the volume deletion after some time and succeeds.
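The manual workaround looks roughly like the following against the ONTAP REST API (a sketch only; cluster address, credentials, SVM name, and subnet are placeholders, and the endpoint and field names are my reading of the ONTAP 9.x REST API, not something Trident does for you):

```python
# Sketch of the manual workaround: recreate the export policy Trident expects,
# with a single subnet rule. Endpoint and field names follow my reading of the
# ONTAP 9.x REST API; cluster address, credentials, SVM, and subnet are placeholders.
import requests

ONTAP = "https://<cluster-mgmt-lif>"
AUTH = ("admin", "<password>")
POLICY_NAME = "trident_pvc_e73bc978_1395_4b93_a5dc_3daa9be182fd"  # same name as the volume

body = {
    "name": POLICY_NAME,
    "svm": {"name": "<svm-name>"},
    "rules": [{
        "clients": [{"match": "<worker-node-subnet>/24"}],
        "protocols": ["nfs"],
        "ro_rule": ["sys"],
        "rw_rule": ["sys"],
        "superuser": ["sys"],
    }],
}

r = requests.post(f"{ONTAP}/api/protocols/nfs/export-policies",
                  json=body, auth=AUTH, verify=False)
r.raise_for_status()
```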
Trident-controller logs after some time:
trident-main time="2025-12-02T10:41:26Z" level=info msg="Unpublishing volume from node." logLayer=core node=<dns-name> requestID=38124ad7-5355-4e22-89ad-275a64a4ce8c requestSource=CSI volume=pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd workflow="controller_server=unpublish"
csi-provisioner I1202 10:41:51.316845 1 controller.go:1317] volume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd does not need any deletion secrets
csi-attacher E1202 10:41:56.216292 1 csi_handler.go:718] "Failed to remove finalizer from PersistentVolume" err="PersistentVolume \"pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"kubernetes.io/pv-protection\"}" driver="csi.trident.netapp.io" PersistentVolume="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd"
csi-attacher I1202 10:41:56.223849 1 csi_handler.go:723] "Removed finalizer from PersistentVolume" driver="csi.trident.netapp.io" PersistentVolume="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd"
Issue 2:
The same thing happens when an application pod is restarted and Trident needs to remount the PV on the correct node. The application pod's events show:
Normal   Scheduled    29s                 default-scheduler  Successfully assigned <namespace>/dex-postgres-cluster-1 to <dns-name>
Warning  FailedMount  24s (x25 over 30s)  kubelet            MountVolume.SetUp failed for volume "pvc-a726b4e1-5955-41f0-9c2a-b01f13d3740b" : rpc error: code = Internal desc = error mounting NFS volume <dns-name>:/trident_pvc_a726b4e1_5955_41f0_9c2a_b01f13d3740b on mountpoint /var/lib/kubelet/pods/bc07bb4d-9796-4bbd-bb8e-9a7bcaed3e43/volumes/kubernetes.io~csi/pvc-a726b4e1-5955-41f0-9c2a-b01f13d3740b/mount: exit status 32
Trident changes the volume's export policy to one named after the backend UUID, and that policy has no rules attached to it.
Again: when I manually create an export policy with the same name as the volume and a subnet rule attached, Trident starts functioning again and mounts the volume on the correct node.
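To see what state Trident leaves the volume in, the export policy assigned to the volume and its rules can be inspected with something like this (again just a sketch against what I believe are the ONTAP 9.x REST endpoints, with placeholder cluster address and credentials):

```python
# Sketch: look up which export policy a Trident volume points at and list that
# policy's rules. Endpoints and fields follow my reading of the ONTAP 9.x REST API;
# cluster address and credentials are placeholders.
import requests

ONTAP = "https://<cluster-mgmt-lif>"
AUTH = ("admin", "<password>")
VOLUME = "trident_pvc_a726b4e1_5955_41f0_9c2a_b01f13d3740b"

vols = requests.get(f"{ONTAP}/api/storage/volumes",
                    params={"name": VOLUME, "fields": "nas.export_policy.name"},
                    auth=AUTH, verify=False).json()

for vol in vols.get("records", []):
    policy = vol["nas"]["export_policy"]["name"]
    policies = requests.get(f"{ONTAP}/api/protocols/nfs/export-policies",
                            params={"name": policy, "fields": "rules"},
                            auth=AUTH, verify=False).json()
    for p in policies.get("records", []):
        print(policy, p.get("rules", []))
```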
NetApp Support
A NetApp support case is already open, with a Trident engineering ticket (CPE-11180) as well.
Environment
Provide accurate information about the environment to help us reproduce the issue.
- Trident version: 25.10.0
- Trident installation flags used: Helm deployment and --enableConcurrency: true
- Container runtime: containerd
- Kubernetes version: v1.33.4+rke2r1
- Kubernetes orchestrator: Rancher v2.12.4
- OS: Ubuntu 22.04.5 LTS
- NetApp backend types: ONTAP AFF 9.15.1P13
- Backend config: ontap-nas
- Other:
To Reproduce
Enable the concurrency feature
Additional context
Clusters with the feature disabled are running fine.
Maybe this information helps other people who run into this issue.