Bug 2106736

Summary: Detach code in vsphere csi driver is failing
Product: OpenShift Container Platform
Component: Storage
Sub component: Operators
Version: 4.10
Target Release: 4.13.0
Reporter: Hemant Kumar <hekumar>
Assignee: Jan Safranek <jsafrane>
QA Contact: Wei Duan <wduan>
CC: jsafrane
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
Keywords: Rebase
Hardware: Unspecified
OS: Unspecified
Last Closed: 2023-03-09 01:24:29 UTC

Comment 1 Hemant Kumar 2022-07-13 12:09:44 UTC
I wonder if the detach calls in the driver are idempotent. Looking at the failure, it looks like a genuine error:

./e2e-vsphere-csi/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-dkpbsvnr-4d32a-nm678-master-0_kube-controller-manager.log:E0712 20:20:34.091621       1 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3 podName: nodeName:}" failed. No retries permitted until 2022-07-12 20:22:36.091601709 +0000 UTC m=+2029.047486195 (durationBeforeRetry 2m2s). Error: DetachVolume.Detach failed for volume "pvc-b13f4fb2-36ba-49db-ac2f-bbe69414b47f" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3") on node "ci-op-dkpbsvnr-4d32a-nm678-worker-sq4hq" : rpc error: code = Internal desc = volumeID "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3" not found in QueryVolume


So detach is failing because of the above error, and delete is not even attempted because Kubernetes thinks the volume is still attached to the node.
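
For reference, a minimal sketch (not the actual vSphere driver code) of what an idempotent ControllerUnpublishVolume looks like: if the volume is already gone from the VM, the RPC should still return success so the attach/detach controller can mark the volume detached and the delete can proceed. findVolumeOnVM and detachFromVM are hypothetical helpers standing in for the driver's vCenter calls.

package main

import (
	"context"
	"errors"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errNotFound stands in for a "volumeID not found" result from vCenter.
var errNotFound = errors.New("volume not found on VM")

// Hypothetical helpers standing in for the driver's vCenter calls.
func findVolumeOnVM(ctx context.Context, volumeID, nodeID string) error { return errNotFound }
func detachFromVM(ctx context.Context, volumeID, nodeID string) error   { return nil }

// controllerUnpublishVolume sketches the idempotent detach path.
func controllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	err := findVolumeOnVM(ctx, req.GetVolumeId(), req.GetNodeId())
	if errors.Is(err, errNotFound) {
		// Idempotent path: the disk is no longer attached to the VM, so report
		// success instead of an Internal error; otherwise kube-controller-manager
		// keeps retrying the detach, the volume is never considered detached,
		// and the subsequent delete is never attempted.
		return &csi.ControllerUnpublishVolumeResponse{}, nil
	}
	if err != nil {
		return nil, status.Errorf(codes.Internal, "checking volume on node failed: %v", err)
	}
	if err := detachFromVM(ctx, req.GetVolumeId(), req.GetNodeId()); err != nil {
		return nil, status.Errorf(codes.Internal, "detach failed: %v", err)
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}

func main() {
	resp, err := controllerUnpublishVolume(context.Background(), &csi.ControllerUnpublishVolumeRequest{
		VolumeId: "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3",
		NodeId:   "worker-0",
	})
	fmt.Println(resp, err)
}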

Comment 2 Jan Safranek 2022-07-14 12:51:28 UTC
I manually detached a volume from a VM and the driver in 4.10 correctly recovered from that:

{"level":"info","time":"2022-07-14T12:48:30.496357788Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:96733190-dc4f-48b6-8c3b-f4be69588369 NodeId:jsafrane-vwz44-worker-xgtf8 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816164097Z","caller":"volume/manager.go:753","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\", vm: \"VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]\", opId: \"3e21db9f\"","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816237359Z","caller":"volume/util.go:364","msg":"Extract vimfault type: +*types.NotFound  vimFault: +&{{{<nil> []}}} Fault: &{DynamicData:{} Fault:0xc0010e4360 LocalizedMessage:The object or item referred to could not be found.} from resp: +&{{} {{} } 0xc0010e4320}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819082913Z","caller":"volume/manager.go:780","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\" not found on vm: VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]. Assuming it is already detached","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819126652Z","caller":"vanilla/controller.go:1159","msg":"ControllerUnpublishVolume successful for volume ID: 96733190-dc4f-48b6-8c3b-f4be69588369","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}

Investigating further.
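
For context, the "Extract vimfault type: +*types.NotFound" line above is the driver recognizing vCenter's NotFound fault and treating the volume as already detached. A minimal govmomi-based sketch of that check (illustrative only, not the driver's actual volume/util.go code):

package main

import (
	"fmt"

	"github.com/vmware/govmomi/vim25/types"
)

// isNotFoundFault reports whether a vSphere method fault is types.NotFound
// ("The object or item referred to could not be found."). A detach that fails
// with this fault can be treated as "already detached", which is what the 4.10
// driver does in the log above.
func isNotFoundFault(fault types.BaseMethodFault) bool {
	_, ok := fault.(*types.NotFound)
	return ok
}

func main() {
	// Illustrative only: the same fault type that the driver extracted from
	// the DetachVolume task result.
	var fault types.BaseMethodFault = &types.NotFound{}
	fmt.Println(isNotFoundFault(fault)) // true
}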

Comment 3 Jan Safranek 2022-07-14 13:00:30 UTC
If I detach + delete the volume in vCenter, I get the same output as above and everything succeeds.
vCenter does not allow me to delete a volume that is still attached to a node, so I cannot test that scenario.

Comment 4 Jan Safranek 2022-08-05 11:36:57 UTC
It's reproducible only with the test "provisioning should mount multiple PV pointing to the same storage on the same node". In this test, two PVs have the same VolumeHandle, and the CSI driver does not seem to handle that correctly, most probably because it tries to talk to the API server.
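
To illustrate what the test creates, here is a trimmed sketch (placeholder names, not the e2e test's actual code): two distinct PersistentVolume objects whose CSI sources share one volumeHandle, so both resolve to the same backing vSphere volume.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvWithHandle builds a minimal CSI PV; real PVs also carry capacity,
// reclaim policy, etc., omitted here for brevity.
func pvWithHandle(name, handle string) *corev1.PersistentVolume {
	return &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PersistentVolumeSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				CSI: &corev1.CSIPersistentVolumeSource{
					Driver:       "csi.vsphere.vmware.com",
					VolumeHandle: handle, // both PVs reference the same backing volume
				},
			},
		},
	}
}

func main() {
	// Both PVs resolve to the same vSphere volume; the driver/syncer keys its
	// metadata by volumeHandle and gets confused when two PVs map to it.
	pv1 := pvWithHandle("pv-a", "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3")
	pv2 := pvWithHandle("pv-b", "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3")
	fmt.Println(pv1.Name, pv2.Name)
}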

Comment 5 Jan Safranek 2022-08-09 12:42:59 UTC
I debugged the CSI driver + syncer: it does not support two separate PVs with the same `volumeHandle`, which is exactly what the test does.
I filed https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1913 upstream.

Comment 6 Jan Safranek 2022-08-10 09:07:46 UTC
PR to skip the failing test: https://github.com/openshift/release/pull/31209

Comment 7 Jan Safranek 2022-10-13 12:30:48 UTC
Discussed at yesterday's upstream CSI meeting: there should be a new test capability for CSI drivers that can't handle two PVs with the same VolumeHandle.

Comment 8 Jan Safranek 2022-10-13 15:36:34 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/113046

Comment 9 Jan Safranek 2022-11-24 14:33:11 UTC
The code has been merged upstream; we are waiting for the rebase in OCP. We also need to update the test manifests of *all* CSI drivers to include "multiplePVsSameID: true"; only vSphere needs "multiplePVsSameID: false".
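
To make the intent concrete, a self-contained sketch of how such a capability gate works. The types below are local stand-ins, not the Kubernetes e2e storage framework's actual API (see kubernetes/kubernetes#113046 for the real change); only the capability name "multiplePVsSameID" is taken from this bug.

package main

import "fmt"

// Capability and DriverInfo are illustrative stand-ins for the per-driver
// test manifest settings mentioned above.
type Capability string

const CapMultiplePVsSameID Capability = "multiplePVsSameID"

type DriverInfo struct {
	Name         string
	Capabilities map[Capability]bool
}

// skipUnlessSupported skips a test when the driver does not declare the
// required capability, which is how the failing vSphere test gets excluded.
func skipUnlessSupported(d DriverInfo, c Capability, test string) {
	if !d.Capabilities[c] {
		fmt.Printf("skipping %q: driver %s does not support %s\n", test, d.Name, c)
		return
	}
	fmt.Printf("running %q against %s\n", test, d.Name)
}

func main() {
	vsphere := DriverInfo{
		Name:         "csi.vsphere.vmware.com",
		Capabilities: map[Capability]bool{CapMultiplePVsSameID: false},
	}
	skipUnlessSupported(vsphere, CapMultiplePVsSameID,
		"provisioning should mount multiple PV pointing to the same storage on the same node")
}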

Comment 11 Jan Safranek 2023-01-18 12:05:27 UTC
Waiting for Kubernetes rebase.

Comment 14 Shiftzilla 2023-03-09 01:24:29 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9389