Bug 1889420
| Summary: | OCP failed to add vsphere disk when pod moved to new node during cluster upgrade | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Neil Girard <ngirard> |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Storage sub component: | Kubernetes | QA Contact: | Wei Duan <wduan> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, dgautam, hekumar |
| Version: | 4.4 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:26:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1981187 | | |
Description
Neil Girard
2020-10-19 15:55:06 UTC
I have backported the fix - https://github.com/openshift/kubernetes/pull/445

Note to QE though - this requires at least one pod with a vSphere volume to be running in the cluster. The limitation of the fix is that the dangling-volume mechanism in vSphere relies on the verify-volumes-are-attached feature of the attach/detach controller (ADC), so there must be at least one pod running with a vSphere (or other attachable) volume in the cluster. This means that if a cluster had only one pod with a vSphere volume and that volume was attached to the wrong node (i.e. dangling on the wrong node), then this mechanism won't work. In practice, I hope it is unlikely that we will have just one pod with a dangling volume. (A sketch of how dangling-volume reporting works is at the end of this report.)

It is hard to reproduce during an upgrade or a node reboot. Checked with Hemant: the root cause is an attach/detach issue hit when an unexpected node status is encountered, and the fix is to check and make sure the volume can eventually be attached. With the following scenario, the issue was reproduced on an OCP 4.5 cluster and passed with the fix in 4.7.0-0.nightly-2021-01-18-000316 (a sketch of the objects for steps 2 and 3 is at the end of this report):

1. Create a volume and attach it to compute-0.
2. Create a PV/PVC with this volume.
3. Create a pod with the PVC and schedule it to compute-1.
4. After some failed tries, the volume attaches and mounts to compute-1, with the following events and KCM log:

```
Events:
  Type     Reason                  Age                            From                     Message
  ----     ------                  ----                           ----                     -------
  Normal   Scheduled               <unknown>                                               Successfully assigned wduan-1/pod-vsphere to compute-1
  Warning  FailedAttachVolume      <invalid> (x3 over <invalid>)  attachdetach-controller  AttachVolume.Attach failed for volume "pv-vsphere-1" : Failed to add disk 'scsi0:1'.
  Normal   SuccessfulAttachVolume  <invalid>                      attachdetach-controller  AttachVolume.Attach succeeded for volume "pv-vsphere-1"
  Normal   AddedInterface          <invalid>                      multus                   Add eth0 [10.131.0.62/23]
  Normal   Pulled                  <invalid>                      kubelet, compute-1       Container image "quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40" already present on machine
  Normal   Created                 <invalid>                      kubelet, compute-1       Created container myfrontend
  Normal   Started                 <invalid>                      kubelet, compute-1       Started container myfrontend
```

KCM log:

```
E0119 02:50:59.148140       1 vsphere.go:945] Failed to attach disk: [WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk for node: compute-1. err: +Failed to add disk 'scsi0:1'.
E0119 02:50:59.161836       1 attacher.go:93] Error attaching volume "[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" to node "compute-1": Failed to add disk 'scsi0:1'.
I0119 02:50:59.161884       1 actual_state_of_world.go:322] Volume "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" is already added to attachedVolume list to node "compute-0", update device path ""
E0119 02:50:59.161950       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk podName: nodeName:}" failed. No retries permitted until 2021-01-19 02:51:01.161924255 +0000 UTC m=+3807.819892376 (durationBeforeRetry 2s). Error: "AttachVolume.Attach failed for volume \"pv-vsphere-1\" (UniqueName: \"kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk\") from node \"compute-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.162422       1 event.go:291] "Event occurred" object="wduan-1/pod-vsphere" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pv-vsphere-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.220845       1 reconciler.go:203] attacherDetacher.DetachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.226117       1 operation_generator.go:1409] Verified volume is safe to detach for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.578699       1 operation_generator.go:470] DetachVolume.Detach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.621533       1 reconciler.go:275] attacherDetacher.AttachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
I0119 02:51:25.411742       1 operation_generator.go:360] AttachVolume.Attach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
```

Note the sequence in the log: the failed attach marks the volume as attached to compute-0, the ADC detaches it there, and the subsequent attach to compute-1 succeeds.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
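For context on the dangling-volume mechanism discussed above, here is a minimal sketch of how a volume plugin can report a dangling attachment. `volerr.NewDanglingError` and `volerr.IsDanglingError` are real helpers from `k8s.io/cloud-provider/volume/errors`; `attachDisk`, the node names, and the pretend vCenter response are illustrative stand-ins, not the actual code from the PR linked above.

```go
package main

import (
	"fmt"

	k8stypes "k8s.io/apimachinery/pkg/types"
	volerr "k8s.io/cloud-provider/volume/errors"
)

// attachDisk is a stand-in for a cloud-provider attach call. When it finds
// the disk attached to a different VM than requested, it reports a dangling
// attachment naming the node that actually holds the disk.
func attachDisk(volPath string, requested k8stypes.NodeName) error {
	// Pretend vCenter reports the disk as attached to compute-0.
	current := k8stypes.NodeName("compute-0")
	if current != requested {
		return volerr.NewDanglingError(
			fmt.Sprintf("disk %s is attached to node %q, not %q", volPath, current, requested),
			current,
			"", // device path, if known
		)
	}
	return nil
}

func main() {
	err := attachDisk("[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk", "compute-1")
	if volerr.IsDanglingError(err) {
		danglingErr := err.(*volerr.DanglingAttachError)
		// The ADC reacts to this error by recording the volume as attached to
		// danglingErr.CurrentNode; its reconciler then detaches it there and
		// retries the attach on the intended node - the DetachVolume/AttachVolume
		// sequence visible in the KCM log above.
		fmt.Printf("volume dangling on %s; detach there before retrying\n", danglingErr.CurrentNode)
	}
}
```

This is also why the fix depends on the ADC's verify-volumes-are-attached pass: something has to re-check attachments against the cloud provider for the dangling state to be noticed at all.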
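And a minimal sketch of the Kubernetes objects behind steps 2 and 3 of the reproduction. The names (pv-vsphere-1, pod-vsphere, wduan-1, myfrontend), the datastore path, and the image come from the events and KCM log above; the PVC name, 1Gi size, mount path, and the use of spec.nodeName to pin the pod to compute-1 are assumptions for illustration (QE may have used a node selector instead). Field types match the k8s.io/api versions of the 4.5-4.7 era.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reproObjects builds the PV, PVC, and Pod from steps 2 and 3. The VMDK in
// the PV is the disk that step 1 attached to compute-0 out of band, which is
// what leaves the volume dangling once the pod lands on compute-1.
func reproObjects() (*corev1.PersistentVolume, *corev1.PersistentVolumeClaim, *corev1.Pod) {
	pv := &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "pv-vsphere-1"},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse("1Gi"), // assumed size
			},
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				VsphereVolume: &corev1.VsphereVirtualDiskVolumeSource{
					VolumePath: "[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk",
				},
			},
		},
	}
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "pvc-vsphere-1", Namespace: "wduan-1"}, // PVC name assumed
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			VolumeName:  "pv-vsphere-1", // bind directly to the pre-created PV
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"),
				},
			},
		},
	}
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-vsphere", Namespace: "wduan-1"},
		Spec: corev1.PodSpec{
			NodeName: "compute-1", // pin to the node that does NOT hold the disk
			Containers: []corev1.Container{{
				Name:  "myfrontend",
				Image: "quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40",
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "data",
					MountPath: "/mnt/data", // assumed mount path
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
						ClaimName: "pvc-vsphere-1",
					},
				},
			}},
		},
	}
	return pv, pvc, pod
}

func main() {
	pv, pvc, pod := reproObjects()
	// In the actual reproduction these would be applied with oc/kubectl or client-go.
	fmt.Println(pv.Name, pvc.Name, pod.Name)
}
```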