Bug 2106736

Summary: Detach code in vsphere csi driver is failing
Product: OpenShift Container Platform
Component: Storage
Sub component: Operators
Version: 4.10
Target Release: 4.13.0
Reporter: Hemant Kumar <hekumar>
Assignee: Jan Safranek <jsafrane>
QA Contact: Wei Duan <wduan>
CC: jsafrane
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
Keywords: Rebase
Hardware: Unspecified
OS: Unspecified
Last Closed: 2023-03-09 01:24:29 UTC

Comment 1 Hemant Kumar 2022-07-13 12:09:44 UTC
I wonder if the detach calls in the driver are idempotent. Looking at the failure, it looks like a genuine error:

./e2e-vsphere-csi/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-dkpbsvnr-4d32a-nm678-master-0_kube-controller-manager.log:E0712 20:20:34.091621       1 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3 podName: nodeName:}" failed. No retries permitted until 2022-07-12 20:22:36.091601709 +0000 UTC m=+2029.047486195 (durationBeforeRetry 2m2s). Error: DetachVolume.Detach failed for volume "pvc-b13f4fb2-36ba-49db-ac2f-bbe69414b47f" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3") on node "ci-op-dkpbsvnr-4d32a-nm678-worker-sq4hq" : rpc error: code = Internal desc = volumeID "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3" not found in QueryVolume


So detach is failing because of the above error, and delete is not even attempted because Kubernetes thinks the volume is still attached to the node.
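
For reference, a minimal sketch (not the actual vSphere driver code) of what an idempotent ControllerUnpublishVolume looks like: if the volume is already gone from the VM, the RPC should still return success so the attach/detach controller can mark the volume detached and the delete can proceed. findVolumeOnVM and detachFromVM are hypothetical helpers standing in for the driver's vCenter calls.

package main

import (
	"context"
	"errors"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errNotFound stands in for a "volumeID not found" result from vCenter.
var errNotFound = errors.New("volume not found on VM")

// Hypothetical helpers standing in for the driver's vCenter calls.
func findVolumeOnVM(ctx context.Context, volumeID, nodeID string) error { return errNotFound }
func detachFromVM(ctx context.Context, volumeID, nodeID string) error   { return nil }

// controllerUnpublishVolume sketches the idempotent detach path.
func controllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	err := findVolumeOnVM(ctx, req.GetVolumeId(), req.GetNodeId())
	if errors.Is(err, errNotFound) {
		// Idempotent path: the disk is no longer attached to the VM, so report
		// success instead of an Internal error; otherwise kube-controller-manager
		// keeps retrying the detach, the volume is never considered detached,
		// and the subsequent delete is never attempted.
		return &csi.ControllerUnpublishVolumeResponse{}, nil
	}
	if err != nil {
		return nil, status.Errorf(codes.Internal, "checking volume on node failed: %v", err)
	}
	if err := detachFromVM(ctx, req.GetVolumeId(), req.GetNodeId()); err != nil {
		return nil, status.Errorf(codes.Internal, "detach failed: %v", err)
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}

func main() {
	resp, err := controllerUnpublishVolume(context.Background(), &csi.ControllerUnpublishVolumeRequest{
		VolumeId: "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3",
		NodeId:   "worker-0",
	})
	fmt.Println(resp, err)
}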

Comment 2 Jan Safranek 2022-07-14 12:51:28 UTC
I manually detached a volume from a VM and the driver in 4.10 correctly recovered from that:

{"level":"info","time":"2022-07-14T12:48:30.496357788Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:96733190-dc4f-48b6-8c3b-f4be69588369 NodeId:jsafrane-vwz44-worker-xgtf8 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816164097Z","caller":"volume/manager.go:753","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\", vm: \"VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]\", opId: \"3e21db9f\"","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816237359Z","caller":"volume/util.go:364","msg":"Extract vimfault type: +*types.NotFound  vimFault: +&{{{<nil> []}}} Fault: &{DynamicData:{} Fault:0xc0010e4360 LocalizedMessage:The object or item referred to could not be found.} from resp: +&{{} {{} } 0xc0010e4320}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819082913Z","caller":"volume/manager.go:780","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\" not found on vm: VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]. Assuming it is already detached","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819126652Z","caller":"vanilla/controller.go:1159","msg":"ControllerUnpublishVolume successful for volume ID: 96733190-dc4f-48b6-8c3b-f4be69588369","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}

Investigating further.
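
For context, the "Extract vimfault type: +*types.NotFound" line above is the driver recognizing vCenter's NotFound fault and treating the volume as already detached. A minimal govmomi-based sketch of that check (illustrative only, not the driver's actual volume/util.go code):

package main

import (
	"fmt"

	"github.com/vmware/govmomi/vim25/types"
)

// isNotFoundFault reports whether a vSphere method fault is types.NotFound
// ("The object or item referred to could not be found."). A detach that fails
// with this fault can be treated as "already detached", which is what the 4.10
// driver does in the log above.
func isNotFoundFault(fault types.BaseMethodFault) bool {
	_, ok := fault.(*types.NotFound)
	return ok
}

func main() {
	// Illustrative only: the same fault type that the driver extracted from
	// the DetachVolume task result.
	var fault types.BaseMethodFault = &types.NotFound{}
	fmt.Println(isNotFoundFault(fault)) // true
}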

Comment 3 Jan Safranek 2022-07-14 13:00:30 UTC
If I detach + delete the volume in vCenter, I get the same output as above and everything succeeds.
vCenter does not allow me to delete a volume that is still attached to a node, so I cannot test that scenario.

Comment 4 Jan Safranek 2022-08-05 11:36:57 UTC
It's reproducible only with the test "provisioning should mount multiple PV pointing to the same storage on the same node". In this test, two PVs have the same VolumeHandle, and the CSI driver does not seem to handle that correctly, most probably because it tries to talk to the API server.
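
To illustrate what the test creates, here is a trimmed sketch (placeholder names, not the e2e test's actual code): two distinct PersistentVolume objects whose CSI sources share one volumeHandle, so both resolve to the same backing vSphere volume.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvWithHandle builds a minimal CSI PV; real PVs also carry capacity,
// reclaim policy, etc., omitted here for brevity.
func pvWithHandle(name, handle string) *corev1.PersistentVolume {
	return &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PersistentVolumeSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				CSI: &corev1.CSIPersistentVolumeSource{
					Driver:       "csi.vsphere.vmware.com",
					VolumeHandle: handle, // both PVs reference the same backing volume
				},
			},
		},
	}
}

func main() {
	// Both PVs resolve to the same vSphere volume; the driver/syncer keys its
	// metadata by volumeHandle and gets confused when two PVs map to it.
	pv1 := pvWithHandle("pv-a", "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3")
	pv2 := pvWithHandle("pv-b", "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3")
	fmt.Println(pv1.Name, pv2.Name)
}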

Comment 5 Jan Safranek 2022-08-09 12:42:59 UTC
I debugged the CSI driver + syncer: it does not support two separate PVs with the same `volumeHandle`, which is exactly what the test does.
I filed https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1913 upstream.

Comment 6 Jan Safranek 2022-08-10 09:07:46 UTC
PR to skip the failing test: https://github.com/openshift/release/pull/31209

Comment 7 Jan Safranek 2022-10-13 12:30:48 UTC
Discussed at yesterday's upstream CSI meeting: there should be a new test capability for CSI drivers that can't handle two PVs with the same VolumeHandle.

Comment 8 Jan Safranek 2022-10-13 15:36:34 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/113046

Comment 9 Jan Safranek 2022-11-24 14:33:11 UTC
The code has been merged upstream; we are waiting for the rebase in OCP. We also need to update the test manifests of *all* CSI drivers to include "multiplePVsSameID: true"; only vSphere needs "multiplePVsSameID: false".
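
To make the intent concrete, a self-contained sketch of how such a capability gate works. The types below are local stand-ins, not the Kubernetes e2e storage framework's actual API (see kubernetes/kubernetes#113046 for the real change); only the capability name "multiplePVsSameID" is taken from this bug.

package main

import "fmt"

// Capability and DriverInfo are illustrative stand-ins for the per-driver
// test manifest settings mentioned above.
type Capability string

const CapMultiplePVsSameID Capability = "multiplePVsSameID"

type DriverInfo struct {
	Name         string
	Capabilities map[Capability]bool
}

// skipUnlessSupported skips a test when the driver does not declare the
// required capability, which is how the failing vSphere test gets excluded.
func skipUnlessSupported(d DriverInfo, c Capability, test string) {
	if !d.Capabilities[c] {
		fmt.Printf("skipping %q: driver %s does not support %s\n", test, d.Name, c)
		return
	}
	fmt.Printf("running %q against %s\n", test, d.Name)
}

func main() {
	vsphere := DriverInfo{
		Name:         "csi.vsphere.vmware.com",
		Capabilities: map[Capability]bool{CapMultiplePVsSameID: false},
	}
	skipUnlessSupported(vsphere, CapMultiplePVsSameID,
		"provisioning should mount multiple PV pointing to the same storage on the same node")
}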

Comment 11 Jan Safranek 2023-01-18 12:05:27 UTC
Waiting for Kubernetes rebase.

Comment 14 Shiftzilla 2023-03-09 01:24:29 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9389