Bug 2106736 - Detach code in vsphere csi driver is failing
Summary: Detach code in vsphere csi driver is failing
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.13.0
Assignee: Jan Safranek
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-13 12:08 UTC by Hemant Kumar
Modified: 2023-03-09 01:24 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:24:29 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift alibaba-disk-csi-driver-operator pull 43 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:21 UTC
Github | openshift aws-ebs-csi-driver-operator pull 175 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:37 UTC
Github | openshift aws-efs-csi-driver-operator pull 61 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:37 UTC
Github | openshift azure-disk-csi-driver-operator pull 65 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-18 10:56:58 UTC
Github | openshift azure-file-csi-driver-operator pull 44 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:36 UTC
Github | openshift csi-driver-manila-operator pull 164 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:24 UTC
Github | openshift gcp-filestore-csi-driver-operator pull 25 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:25 UTC
Github | openshift gcp-pd-csi-driver-operator pull 60 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:35 UTC
Github | openshift ibm-vpc-block-csi-driver-operator pull 50 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-18 12:08:18 UTC
Github | openshift openstack-cinder-csi-driver-operator pull 106 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:33 UTC
Github | openshift vmware-vsphere-csi-driver-operator pull 131 | 0 | None | Merged | Bug 2106736: Add multiplePVsSameID capability | 2023-01-16 16:54:33 UTC
Github | openshift vmware-vsphere-csi-driver-operator pull 132 | 0 | None | Merged | Bug 2106736: Fix multiplePVsSameID value in tests | 2023-01-17 13:47:29 UTC

Comment 1 Hemant Kumar 2022-07-13 12:09:44 UTC
I wonder if detach calls in the driver are idempotent. Looking at the failure, it looks like a genuine error:

./e2e-vsphere-csi/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-dkpbsvnr-4d32a-nm678-master-0_kube-controller-manager.log:E0712 20:20:34.091621       1 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3 podName: nodeName:}" failed. No retries permitted until 2022-07-12 20:22:36.091601709 +0000 UTC m=+2029.047486195 (durationBeforeRetry 2m2s). Error: DetachVolume.Detach failed for volume "pvc-b13f4fb2-36ba-49db-ac2f-bbe69414b47f" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^06e1f6b8-cbe9-4229-8c93-f1a827eb36f3") on node "ci-op-dkpbsvnr-4d32a-nm678-worker-sq4hq" : rpc error: code = Internal desc = volumeID "06e1f6b8-cbe9-4229-8c93-f1a827eb36f3" not found in QueryVolume


So detach is failing because of the above error, and delete is not even attempted because Kubernetes thinks the volume is still attached to the node.
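
For context, an idempotent detach at the CSI layer would treat a volume that can no longer be found as already detached rather than returning an Internal error. A minimal sketch of that pattern (not the actual vSphere driver code; detachFromVM and errVolumeNotFound are hypothetical stand-ins for the driver's QueryVolume/DetachVolume calls against vCenter):

// Minimal sketch (not the actual vSphere CSI driver code) of an idempotent
// ControllerUnpublishVolume: a volume that can no longer be found is treated
// as already detached instead of surfacing an Internal error.
package driver

import (
	"context"
	"errors"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errVolumeNotFound and detachFromVM are hypothetical stand-ins for the
// driver's calls against vCenter.
var errVolumeNotFound = errors.New("volume not found")

func detachFromVM(ctx context.Context, volumeID, nodeID string) error {
	// ... talk to vCenter here ...
	return nil
}

type controller struct{}

func (c *controller) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	err := detachFromVM(ctx, req.GetVolumeId(), req.GetNodeId())
	if errors.Is(err, errVolumeNotFound) {
		// Idempotency: the volume is gone or was detached out of band;
		// report success so the A/D controller can move on to deletion.
		return &csi.ControllerUnpublishVolumeResponse{}, nil
	}
	if err != nil {
		return nil, status.Errorf(codes.Internal, "failed to detach %q from node %q: %v",
			req.GetVolumeId(), req.GetNodeId(), err)
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}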

Comment 2 Jan Safranek 2022-07-14 12:51:28 UTC
I manually detached a volume from a VM and the driver in 4.10 correctly recovered from that:

{"level":"info","time":"2022-07-14T12:48:30.496357788Z","caller":"vanilla/controller.go:1075","msg":"ControllerUnpublishVolume: called with args {VolumeId:96733190-dc4f-48b6-8c3b-f4be69588369 NodeId:jsafrane-vwz44-worker-xgtf8 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816164097Z","caller":"volume/manager.go:753","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\", vm: \"VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]\", opId: \"3e21db9f\"","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.816237359Z","caller":"volume/util.go:364","msg":"Extract vimfault type: +*types.NotFound  vimFault: +&{{{<nil> []}}} Fault: &{DynamicData:{} Fault:0xc0010e4360 LocalizedMessage:The object or item referred to could not be found.} from resp: +&{{} {{} } 0xc0010e4320}","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819082913Z","caller":"volume/manager.go:780","msg":"DetachVolume: volumeID: \"96733190-dc4f-48b6-8c3b-f4be69588369\" not found on vm: VirtualMachine:vm-891067 [VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com, UUID: 422cee8b-36f8-e69b-ea9f-2d8098e4e73a, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.sddc-44-236-21-251.vmwarevmc.com]]. Assuming it is already detached","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}
{"level":"info","time":"2022-07-14T12:48:30.819126652Z","caller":"vanilla/controller.go:1159","msg":"ControllerUnpublishVolume successful for volume ID: 96733190-dc4f-48b6-8c3b-f4be69588369","TraceId":"ea430ead-3f9a-4716-b909-c15e8414e406"}

Investigating further.
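
The "Assuming it is already detached" path in the log above hinges on recognizing vCenter's NotFound fault. A rough sketch of how such a check could be written with govmomi, assuming the fault surfaces as a task.Error (the real driver has its own fault-extraction helpers, as the volume/util.go line above shows):

// Sketch of recognizing vCenter's NotFound fault on a detach task with
// govmomi; illustrative only, not the vSphere CSI driver's actual logic.
package driver

import (
	"errors"

	"github.com/vmware/govmomi/task"
	"github.com/vmware/govmomi/vim25/types"
)

// isVolumeNotFound reports whether a task failed because the backing disk no
// longer exists on the VM, i.e. vCenter raised a *types.NotFound fault.
func isVolumeNotFound(err error) bool {
	var taskErr task.Error
	if errors.As(err, &taskErr) {
		_, notFound := taskErr.Fault().(*types.NotFound)
		return notFound
	}
	return false
}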

Comment 3 Jan Safranek 2022-07-14 13:00:30 UTC
If I detach + delete the volume in vCenter, I get the same output as above and everything succeeds.
vCenter does not allow me to delete a volume that is still attached to a node, so I could not test that scenario.

Comment 4 Jan Safranek 2022-08-05 11:36:57 UTC
It's reproducible only with the test "provisioning should mount multiple PV pointing to the same storage on the same node". In this test, two PVs have the same VolumeHandle, and the CSI driver does not seem to handle that correctly, most probably because it tries to talk to the API server.

Comment 5 Jan Safranek 2022-08-09 12:42:59 UTC
I debugged the CSI driver and syncer; they do not support using two separate PVs with the same `volumeHandle`, which is exactly what the test does.
I filed https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1913 upstream.
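
To illustrate what the test does: it creates two distinct PV objects whose CSI sources share one volumeHandle. A sketch using client-go types (the names and handle value below are hypothetical, not taken from the CI run):

// Illustration of the test scenario: two distinct PV objects that share one
// CSI volumeHandle (names and the handle value are hypothetical).
package driver

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func buildPV(name, volumeHandle string) *v1.PersistentVolume {
	return &v1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PersistentVolumeSpec{
			Capacity: v1.ResourceList{
				v1.ResourceStorage: resource.MustParse("1Gi"),
			},
			AccessModes: []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			PersistentVolumeSource: v1.PersistentVolumeSource{
				CSI: &v1.CSIPersistentVolumeSource{
					Driver:       "csi.vsphere.vmware.com",
					VolumeHandle: volumeHandle, // same backing volume for both PVs
				},
			},
		},
	}
}

// Both PVs reference the same volume ID, which the vSphere driver/syncer does
// not support (see the upstream issue above).
var (
	pvA = buildPV("same-id-pv-a", "example-volume-id-1234")
	pvB = buildPV("same-id-pv-b", "example-volume-id-1234")
)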

Comment 6 Jan Safranek 2022-08-10 09:07:46 UTC
PR to skip the failing test: https://github.com/openshift/release/pull/31209

Comment 7 Jan Safranek 2022-10-13 12:30:48 UTC
Discussed at yesterday's upstream CSI meeting: there should be a new test capability for CSI drivers that can't handle two PVs with the same VolumeHandle.

Comment 8 Jan Safranek 2022-10-13 15:36:34 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/113046

Comment 9 Jan Safranek 2022-11-24 14:33:11 UTC
The code has been merged upstream; now we are waiting for the rebase in OCP. We also need to update the test manifests of *all* CSI drivers to include "multiplePVsSameID: true"; only vSphere needs "multiplePVsSameID: false".
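
For reference, a rough sketch of how an e2e test can be gated on such a capability. The constant name CapMultiplePVsSameID and the package paths follow the upstream e2e storage framework conventions but should be treated as assumptions here; the per-driver value itself lives in each operator's test manifest (multiplePVsSameID: true/false):

// Rough sketch of gating an e2e test on the new capability; the constant name
// and package layout follow the upstream e2e storage framework but are
// assumptions here, not copied from the merged PR.
package driver

import (
	e2eskipper "k8s.io/kubernetes/test/e2e/framework/skipper"
	storageframework "k8s.io/kubernetes/test/e2e/storage/framework"
)

// skipIfNoMultiplePVsSameID skips a test for drivers (like vSphere) whose
// manifest sets multiplePVsSameID: false.
func skipIfNoMultiplePVsSameID(driver storageframework.TestDriver) {
	dInfo := driver.GetDriverInfo()
	if !dInfo.Capabilities[storageframework.CapMultiplePVsSameID] {
		e2eskipper.Skipf("Driver %q does not support multiple PVs with the same volume ID -- skipping", dInfo.Name)
	}
}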

Comment 11 Jan Safranek 2023-01-18 12:05:27 UTC
Waiting for Kubernetes rebase.

Comment 14 Shiftzilla 2023-03-09 01:24:29 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9389

