Detach in the vSphere CSI driver is failing: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vmware-vsphere-csi-driver-operator/98/pull-ci-openshift-vmware-vsphere-csi-driver-operator-release-4.11-e2e-vsphere-csi/1546935068826537984
I wonder if detach calls in drivers are idempotent. Looking at the failure, it looks like a genuine error:

./e2e-vsphere-csi/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-dkpbsvnr-4d32a-nm678-master-0_kube-controller-manager.log:
E0712 20:20:34.091621 1 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3 podName: nodeName:}" failed. No retries permitted until 2022-07-12 20:22:36.091601709 +0000 UTC m=+2029.047486195 (durationBeforeRetry 2m2s). Error: DetachVolume.Detach failed for volume "pvc-b13f4fb2-36ba-49db-ac2f-bbe69414b47f" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3") on node "ci-op-dkpbsvnr-4d32a-nm678-worker-sq4hq" : rpc error: code = Internal desc = volumeID "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3" not found in QueryVolume

So detach is failing because of the above error, and delete is not even attempted because Kubernetes thinks the volume is still attached to the node.
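To show why delete stays blocked: the attach state Kubernetes acts on is the VolumeAttachment object, and the CSI external-attacher keeps that object, with the detach error recorded in its status, until ControllerUnpublishVolume finally succeeds. A rough sketch of what the stuck object would look like in this run; the object name and timestamp are invented, while the PV name, node name and error text are taken from the log above:

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-0123456789abcdef            # real names are hash-based; this one is made up
spec:
  attacher: csi.vsphere.vmware.com
  nodeName: ci-op-dkpbsvnr-4d32a-nm678-worker-sq4hq
  source:
    persistentVolumeName: pvc-b13f4fb2-36ba-49db-ac2f-bbe69414b47f
status:
  attached: true                         # stays true while ControllerUnpublishVolume keeps failing
  detachError:
    message: 'rpc error: code = Internal desc = volumeID "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3" not found in QueryVolume'
    time: "2022-07-12T20:20:34Z"
```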
I manually detached a volume from a VM and the driver in 4.10 correctly recovered from that:

{"level":"info","time":"2022-07-14T12:48:30.496357788Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:96733190-dc4f-48b6-8c3b-f4be69588369 NodeId:jsafrane-vwz44-worker-xgtf8 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816164097Z","caller":"volume/manager.go:753","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\", vm: \"VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]\", opId: \"3e21db9f\"","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816237359Z","caller":"volume/util.go:364","msg":"Extract vimfault type: +*types.NotFound vimFault: +&{{{<nil> []}}} Fault: &{DynamicData:{} Fault:0xc0010e4360 LocalizedMessage:The object or item referred to could not be found.} from resp: +&{{} {{} } 0xc0010e4320}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819082913Z","caller":"volume/manager.go:780","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\" not found on vm: VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]. Assuming it is already detached","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819126652Z","caller":"vanilla/controller.go:1159","msg":"ControllerUnpublishVolume successful for volume ID: 96733190-dc4f-48b6-8c3b-f4be69588369","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}

Investigating further.
If I detach + delete the volume in vCenter, I get the same output as above and everything succeeds. vCenter does not allow me to delete a volume that is attached to a node, so I cannot test that scenario.
It's reproducible only with the test "provisioning should mount multiple PV pointing to the same storage on the same node". In this test, two PVs have the same VolumeHandle (see the sketch below) and the CSI driver does not seem to handle that correctly, most probably because it tries to talk to the API server.
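For reference, the test pre-provisions two PV objects that differ only in name but point at the same underlying volume, roughly like this; the PV names and capacity are illustrative, and the volumeHandle value is just reused from the failed CI run:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: same-volume-pv-1
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: csi.vsphere.vmware.com
    volumeHandle: 06e1f6b8-cbe9-4229-8c93-f1a827eb36f3
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: same-volume-pv-2
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: csi.vsphere.vmware.com
    volumeHandle: 06e1f6b8-cbe9-4229-8c93-f1a827eb36f3   # same handle as pv-1
```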
I debugged the CSI driver + syncer; it does not support using two separate PVs with the same `volumeHandle`, which is exactly what the test does. I filed https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1913 upstream.
PR to skip the failing test: https://github.com/openshift/release/pull/31209
Discussed at yesterday's upstream CSI meeting: there should be a new test capability for CSI drivers that can't handle two PVs with the same VolumeHandle.
Upstream PR: https://github.com/kubernetes/kubernetes/pull/113046
The code has been merged upstream. Now we are waiting for the rebase in OCP, and we need to update the e2e test manifests of *all* CSI drivers to include "multiplePVsSameID: true"; only vSphere needs "multiplePVsSameID: false". A sketch of the manifest change is below.
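For illustration, this is roughly where the capability goes in a driver's e2e test manifest; the layout follows the upstream external-storage DriverDefinition format, and the surrounding fields are examples rather than a copy of any operator's actual manifest:

```yaml
DriverInfo:
  Name: csi.vsphere.vmware.com
  Capabilities:
    persistence: true
    # vSphere is the only driver that needs "false" here;
    # every other driver's manifest gets "multiplePVsSameID: true".
    multiplePVsSameID: false
```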
Waiting for Kubernetes rebase.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9389