failed to update export-policies with version 25.10.0 when concurrency enabled #1093

@jdm85nl

Description

Describe the bug

We noticed some strange behavior when using Trident 25.10.0 with the concurrency feature enabled.
We really need this feature: with it disabled, Trident does not respond to API calls in time because of our high PVC count and GPU workloads.

Issue 1:

When an application with a PVC is deleted, the following happens:

1. The PVC is deleted.
2. The PV is marked as Released with reclaim policy "Delete".
3. The export policy on the volume is changed to trident_empty, with no rules attached (see the CLI sketch below).
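
For reference, this state can be inspected from the ONTAP CLI. A sketch only; <svm> is a placeholder for the SVM backing the ontap-nas backend:

# which export policy the Trident FlexVol currently uses
volume show -vserver <svm> -volume trident_pvc_e73bc978_1395_4b93_a5dc_3daa9be182fd -fields policy

# list the rules on trident_empty; this returns no entries, confirming the policy has no rules attached
vserver export-policy rule show -vserver <svm> -policyname trident_empty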

A few seconds later, the trident-controller pod starts complaining. In its logs we see the following error messages:

csi-attacher I1202 10:35:30.821720 1 csi_handler.go:241] "Error processing" driver="csi.trident.netapp.io" VolumeAttachment="csi-d1c1b612d22ab7761af8eb1b3f04b60e3872179efc2c279a0fa1425b53fe10d2" err="failed to detach: rpc error: code = Internal desc = error listing export policy rules; could not get export policy trident_pvc_e73bc978_1395_4b93_a5dc_3daa9be182fd"

csi-provisioner I1202 10:39:21.311833 1 controller.go:1317] volume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd does not need any deletion secrets

csi-provisioner E1202 10:39:21.311871 1 controller.go:1569] "Volume deletion failed" err="persistentvolume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd is still attached to node <dns-name>" PV="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd"

csi-provisioner I1202 10:39:21.311886 1 controller.go:1021] "Retrying syncing volume" key="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd" failures=575

csi-provisioner E1202 10:39:21.311895 1 controller.go:1039] "Unhandled Error" err="error syncing volume \"pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd\": persistentvolume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd is still attached to node <dns-name>" logger="UnhandledError"

csi-provisioner I1202 10:39:21.311907 1 event.go:389] "Event occurred" object="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="persistentvolume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd is still attached to node <dns-name>"

When I manually create an export policy with the same name as the volume and attach a subnet rule to it, Trident retries the volume deletion after some time and succeeds.
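
For anyone hitting the same problem, this is roughly what the manual workaround looks like on the ONTAP CLI. A sketch only; <svm> and the 10.0.0.0/24 client match are placeholders for your environment:

# recreate the export policy Trident is looking for, named after the volume
vserver export-policy create -vserver <svm> -policyname trident_pvc_e73bc978_1395_4b93_a5dc_3daa9be182fd

# attach a subnet rule so the nodes can reach the volume again
vserver export-policy rule create -vserver <svm> -policyname trident_pvc_e73bc978_1395_4b93_a5dc_3daa9be182fd -clientmatch 10.0.0.0/24 -rorule sys -rwrule sys -superuser sys -protocol nfs

After this, Trident's retry loop picks the volume up and the deletion completes.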

Trident-controller logs after some time:
trident-main time="2025-12-02T10:41:26Z" level=info msg="Unpublishing volume from node." logLayer=core node=<dns-name> requestID=38124ad7-5355-4e22-89ad-275a64a4ce8c requestSource=CSI volume=pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd workflow="controller_server=unpublish"

csi-provisioner I1202 10:41:51.316845 1 controller.go:1317] volume pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd does not need any deletion secrets

csi-attacher E1202 10:41:56.216292 1 csi_handler.go:718] "Failed to remove finalizer from PersistentVolume" err="PersistentVolume \"pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"kubernetes.io/pv-protection\"}" driver="csi.trident.netapp.io" PersistentVolume="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd"

csi-attacher I1202 10:41:56.223849 1 csi_handler.go:723] "Removed finalizer from PersistentVolume" driver="csi.trident.netapp.io" PersistentVolume="pvc-e73bc978-1395-4b93-a5dc-3daa9be182fd"

Issue 2:

The same thing happens when an application pod is restarted and Trident needs to remount the PV on the correct node. The application pod reports:

Normal   Scheduled    29s                 default-scheduler  Successfully assigned <namespace>/dex-postgres-cluster-1 to <dns-name>
Warning  FailedMount  24s (x25 over 30s)  kubelet            MountVolume.SetUp failed for volume "pvc-a726b4e1-5955-41f0-9c2a-b01f13d3740b" : rpc error: code = Internal desc = error mounting NFS volume <dns-name>:/trident_pvc_a726b4e1_5955_41f0_9c2a_b01f13d3740b on mountpoint /var/lib/kubelet/pods/bc07bb4d-9796-4bbd-bb8e-9a7bcaed3e43/volumes/kubernetes.io~csi/pvc-a726b4e1-5955-41f0-9c2a-b01f13d3740b/mount: exit status 32

Trident switches the volume to an export policy named after the backend UUID, again with no rules attached to it.

Again, when I manually create an export policy with the same name as the volume and attach a subnet rule to it (the same workaround as in Issue 1 above), Trident starts functioning again and mounts the volume on the correct node.

NetApp Support
A NetApp support case is already open, with a Trident engineering ticket (CPE-11180) as well.

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Trident version: 25.10.0
  • Trident installation flags used: Helm deployment with --enableConcurrency: true
  • Container runtime: containerd
  • Kubernetes version: v1.33.4+rke2r1
  • Kubernetes orchestrator: Rancher v2.12.4
  • OS: Ubuntu 22.04.5 LTS
  • NetApp backend types: ONTAP AFF 9.15.1P13
  • Backend config: ontap-nas
  • Other:

To Reproduce
Enable the concurrency feature.
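
For completeness, this is roughly how we enable it in our Helm deployment. A sketch only; the release name, repo alias, and namespace are placeholders, and we assume the chart value matches the --enableConcurrency flag named above:

helm upgrade trident netapp-trident/trident-operator --namespace trident --set enableConcurrency=true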

Additional context
Clusters with the feature disabled are running fine.

Maybe this information can help other people who run into this issue.
