Detach in the vSphere CSI driver is failing: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vmware-vsphere-csi-driver-operator/98/pull-ci-openshift-vmware-vsphere-csi-driver-operator-release-4.11-e2e-vsphere-csi/1546935068826537984
I wonder if detach calls in drivers are idempotent. Looking at the failure, it looks like a genuine error:

./e2e-vsphere-csi/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-dkpbsvnr-4d32a-nm678-master-0_kube-controller-manager.log:
E0712 20:20:34.091621 1 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3 podName: nodeName:}" failed. No retries permitted until 2022-07-12 20:22:36.091601709 +0000 UTC m=+2029.047486195 (durationBeforeRetry 2m2s). Error: DetachVolume.Detach failed for volume "pvc-b13f4fb2-36ba-49db-ac2f-bbe69414b47f" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3") on node "ci-op-dkpbsvnr-4d32a-nm678-worker-sq4hq" : rpc error: code = Internal desc = volumeID "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3" not found in QueryVolume

So detach is failing because of the above error, and delete is not even attempted because Kubernetes thinks the volume is still attached to the node.
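To show why delete stays blocked: the attach state Kubernetes acts on is the VolumeAttachment object, and the CSI external-attacher keeps that object, with the detach error recorded in its status, until ControllerUnpublishVolume finally succeeds. A rough sketch of what the stuck object would look like in this run; the object name and timestamp are invented, while the PV name, node name and error text are taken from the log above:

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-0123456789abcdef            # real names are hash-based; this one is made up
spec:
  attacher: csi.vsphere.vmware.com
  nodeName: ci-op-dkpbsvnr-4d32a-nm678-worker-sq4hq
  source:
    persistentVolumeName: pvc-b13f4fb2-36ba-49db-ac2f-bbe69414b47f
status:
  attached: true                         # stays true while ControllerUnpublishVolume keeps failing
  detachError:
    message: 'rpc error: code = Internal desc = volumeID "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3" not found in QueryVolume'
    time: "2022-07-12T20:20:34Z"
```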
I manually detached a volume from a VM and the driver in 4.10 correctly recovered from that:

{"level":"info","time":"2022-07-14T12:48:30.496357788Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:96733190-dc4f-48b6-8c3b-f4be69588369 NodeId:jsafrane-vwz44-worker-xgtf8 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816164097Z","caller":"volume/manager.go:753","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\", vm: \"VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]\", opId: \"3e21db9f\"","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816237359Z","caller":"volume/util.go:364","msg":"Extract vimfault type: +*types.NotFound vimFault: +&{{{<nil> []}}} Fault: &{DynamicData:{} Fault:0xc0010e4360 LocalizedMessage:The object or item referred to could not be found.} from resp: +&{{} {{} } 0xc0010e4320}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819082913Z","caller":"volume/manager.go:780","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\" not found on vm: VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]. Assuming it is already detached","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819126652Z","caller":"vanilla/controller.go:1159","msg":"ControllerUnpublishVolume successful for volume ID: 96733190-dc4f-48b6-8c3b-f4be69588369","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}

Investigating further.
If I detach + delete the volume in vCenter, I get the same output as above and everything succeeds. vCenter does not allow me to delete a volume that is attached to a node, so I cannot test that scenario.
It's reproducible only with the test "provisioning should mount multiple PV pointing to the same storage on the same node". In this test, two PVs have the same VolumeHandle (see the sketch below) and the CSI driver does not seem to handle that correctly, most probably because it tries to talk to the API server.
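For reference, the test pre-provisions two PV objects that differ only in name but point at the same underlying volume, roughly like this; the PV names and capacity are illustrative, and the volumeHandle value is just reused from the failed CI run:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: same-volume-pv-1
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: csi.vsphere.vmware.com
    volumeHandle: 06e1f6b8-cbe9-4229-8c93-f1a827eb36f3
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: same-volume-pv-2
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: csi.vsphere.vmware.com
    volumeHandle: 06e1f6b8-cbe9-4229-8c93-f1a827eb36f3   # same handle as pv-1
```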
I debugged the CSI driver + syncer; it does not support using two separate PVs with the same `volumeHandle`, which is exactly what the test does. I filed https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1913 upstream.
PR to skip the failing test: https://github.com/openshift/release/pull/31209
Discussed at yesterday's upstream CSI meeting: there should be a new test capability for CSI drivers that can't handle two PVs with the same VolumeHandle.
Upstream PR: https://github.com/kubernetes/kubernetes/pull/113046
The code has been merged upstream. Now we are waiting for the rebase in OCP, and we need to update the e2e test manifests of *all* CSI drivers to include "multiplePVsSameID: true"; only vSphere needs "multiplePVsSameID: false". A sketch of the manifest change is below.
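For illustration, this is roughly where the capability goes in a driver's e2e test manifest; the layout follows the upstream external-storage DriverDefinition format, and the surrounding fields are examples rather than a copy of any operator's actual manifest:

```yaml
DriverInfo:
  Name: csi.vsphere.vmware.com
  Capabilities:
    persistence: true
    # vSphere is the only driver that needs "false" here;
    # every other driver's manifest gets "multiplePVsSameID: true".
    multiplePVsSameID: false
```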
Waiting for Kubernetes rebase.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9389