Description of problem:
Volumes for Jenkins pods failed to mount during an upgrade of the cluster. We had to manually go into vSphere and remove the disks from the old node so the new node could pick them up.

Version-Release number of selected component (if applicable):
OCP 4.4

How reproducible:
Rare.

Actual results:
Many volumes failed to attach to pods.

Expected results:
No volume failures when moving pods.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):
kubernetes.io/vsphere-volume

Additional info:
I have backported the fix: https://github.com/openshift/kubernetes/pull/445. Note to QE: this requires at least one pod with a vSphere volume to be running in the cluster.
The limitation of the fix: the dangling-volume mechanism in vSphere relies on the verify-volumes-are-attached feature of the attach/detach controller (ADC), so there must be at least one pod with a vSphere (or other attachable) volume running in the cluster. This means that if a cluster had only one pod with a vSphere volume, and that volume was attached to the wrong node (i.e. dangling on the wrong node), then this mechanism won't work. In practice, I hope it is unlikely that we will have just one pod with a dangling volume.
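To make the mechanism concrete, here is a minimal simulation of the dangling-volume recovery idea: an attach fails because the disk is still attached to another VM, the stale attachment is detached, and the attach is retried. All class and function names here are illustrative, not the real Kubernetes or vSphere API, and the real fix does the detach-and-retry across ADC sync loops rather than inline.

```python
class DanglingVolumeError(Exception):
    """Raised when a volume is still attached to a different node."""
    def __init__(self, volume, current_node):
        super().__init__(f"{volume} is dangling on {current_node}")
        self.volume = volume
        self.current_node = current_node


class FakeVSphere:
    """Toy stand-in for vCenter: tracks which node holds each VMDK."""
    def __init__(self, attachments):
        # volume name -> node currently holding the disk
        self.attachments = dict(attachments)

    def attach(self, volume, node):
        holder = self.attachments.get(volume)
        if holder is not None and holder != node:
            # vCenter would refuse with "Failed to add disk 'scsi0:1'."
            raise DanglingVolumeError(volume, holder)
        self.attachments[volume] = node

    def detach(self, volume, node):
        if self.attachments.get(volume) == node:
            del self.attachments[volume]


def attach_with_dangling_recovery(cloud, volume, node):
    """Attach `volume` to `node`; if it dangles on a stale node,
    detach it there first and retry, mirroring the fix's behavior."""
    try:
        cloud.attach(volume, node)
    except DanglingVolumeError as err:
        cloud.detach(volume, err.current_node)
        cloud.attach(volume, node)
    return cloud.attachments[volume]
```

The limitation described above corresponds to the fact that, in the real controller, the `DanglingVolumeError` path is only exercised by the verify-volumes-are-attached sweep, which runs only while some pod with an attachable volume exists.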
It is hard to reproduce during an upgrade or node reboot. I checked with Hemant: the root cause is the attach/detach issue when the controller meets an unexpected node status, and the fix is to check and make sure the volume can eventually be attached. With the following scenario, the issue reproduced on an OCP 4.5 cluster and passed with the fix in 4.7.0-0.nightly-2021-01-18-000316:

1. Create a volume and attach it to compute-0
2. Create a pv/pvc with this volume
3. Create a pod with the pvc and schedule it to compute-1
4. After some failed tries, the volume attaches/mounts to compute-1, with the following events and kcm log:

Events:
  Type     Reason                  Age                            From                     Message
  ----     ------                  ----                           ----                     -------
  Normal   Scheduled               <unknown>                                               Successfully assigned wduan-1/pod-vsphere to compute-1
  Warning  FailedAttachVolume      <invalid> (x3 over <invalid>)  attachdetach-controller  AttachVolume.Attach failed for volume "pv-vsphere-1" : Failed to add disk 'scsi0:1'.
  Normal   SuccessfulAttachVolume  <invalid>                      attachdetach-controller  AttachVolume.Attach succeeded for volume "pv-vsphere-1"
  Normal   AddedInterface          <invalid>                      multus                   Add eth0 [10.131.0.62/23]
  Normal   Pulled                  <invalid>                      kubelet, compute-1       Container image "quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40" already present on machine
  Normal   Created                 <invalid>                      kubelet, compute-1       Created container myfrontend
  Normal   Started                 <invalid>                      kubelet, compute-1       Started container myfrontend

KCM log:
E0119 02:50:59.148140 1 vsphere.go:945] Failed to attach disk: [WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk for node: compute-1. err: +Failed to add disk 'scsi0:1'.
E0119 02:50:59.161836 1 attacher.go:93] Error attaching volume "[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" to node "compute-1": Failed to add disk 'scsi0:1'.
I0119 02:50:59.161884 1 actual_state_of_world.go:322] Volume "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" is already added to attachedVolume list to node "compute-0", update device path ""
E0119 02:50:59.161950 1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk podName: nodeName:}" failed. No retries permitted until 2021-01-19 02:51:01.161924255 +0000 UTC m=+3807.819892376 (durationBeforeRetry 2s). Error: "AttachVolume.Attach failed for volume \"pv-vsphere-1\" (UniqueName: \"kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk\") from node \"compute-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.162422 1 event.go:291] "Event occurred" object="wduan-1/pod-vsphere" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pv-vsphere-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.220845 1 reconciler.go:203] attacherDetacher.DetachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.226117 1 operation_generator.go:1409] Verified volume is safe to detach for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.578699 1 operation_generator.go:470] DetachVolume.Detach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.621533 1 reconciler.go:275] attacherDetacher.AttachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
I0119 02:51:25.411742 1 operation_generator.go:360] AttachVolume.Attach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
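For reference, the pv/pvc/pod in the reproduction steps can be expressed as manifests like the following. This is an illustrative sketch: the object names and namespace are taken from the events above, but the capacity, access mode, fsType, mount path, and the nodeSelector used to pin the pod to compute-1 are assumptions, not taken from the actual test cluster.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-vsphere-1
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  vsphereVolume:
    volumePath: "[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk"
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-vsphere-1
  namespace: wduan-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  volumeName: pv-vsphere-1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-vsphere
  namespace: wduan-1
spec:
  # Pin the pod to compute-1 while the disk is still attached to compute-0
  nodeSelector:
    kubernetes.io/hostname: compute-1
  containers:
    - name: myfrontend
      image: quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40
      volumeMounts:
        - mountPath: /mnt/storage
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-vsphere-1
```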
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633