Bug 1889420 - OCP failed to add vsphere disk when pod moved to new node during cluster upgrade
Summary: OCP failed to add vsphere disk when pod moved to new node during cluster upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Hemant Kumar
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks: 1981187
 
Reported: 2020-10-19 15:55 UTC by Neil Girard
Modified: 2021-12-09 01:08 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:26:50 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift/kubernetes pull 445 (closed): Bug 1889420: Add dangling volume check for vsphere (last updated 2021-02-08 13:57:01 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:27:08 UTC)

Description Neil Girard 2020-10-19 15:55:06 UTC
Description of problem:

Volumes for Jenkins pods failed to mount during an upgrade of the cluster.  We had to manually go into vSphere and remove the disks from the old node so the new node could pick them up.

Version-Release number of selected component (if applicable):

OCP 4.4

How reproducible:

Rare.

Actual results:

Many volumes failed to attach to pods

Expected results:

No volume failures when moving pods

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

kubernetes.io/vsphere-volume
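
The PV/PVC in this report used the in-tree provisioner named above. For reference, a dynamically provisioning StorageClass for that provisioner might look like the following sketch (the class name and the diskformat parameter are illustrative, not a dump from this cluster):

# Hypothetical StorageClass for the in-tree vSphere provisioner.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thin-example        # hypothetical name
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin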

Additional info:

Comment 15 Hemant Kumar 2021-01-12 03:53:35 UTC
I have backported the fix - https://github.com/openshift/kubernetes/pull/445


Note to QE though - this requires at least one pod with a vsphere volume to be running in the cluster.

Comment 16 Hemant Kumar 2021-01-12 14:49:35 UTC
The limitation of the fix is:

The dangling volume mechanism in vsphere has one limitation though - since this code relies on the verify-volumes-are-attached feature of the attach/detach controller (ADC), there must be at least one pod with a vsphere (or other attachable) volume running in the cluster. This means that if a cluster had only one pod with a vsphere volume and that volume was attached to the wrong node (i.e. dangling on the wrong node), then this mechanism won't work. In practice, I hope it is unlikely that we will have just one pod with a dangling volume.
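
To illustrate the mechanism (a conceptual sketch only, not the code from https://github.com/openshift/kubernetes/pull/445; all identifiers below are hypothetical): when the ADC verifies which volumes are attached, the vsphere provider can report a disk that is attached to a node other than the expected one as "dangling", and the controller then detaches it from the stale node so the normal attach can go through.

// Conceptual Go sketch of a dangling-volume check; not the actual fix.
package main

import (
	"errors"
	"fmt"
)

// danglingError reports a volume that is attached to an unexpected node.
type danglingError struct {
	volumePath  string
	currentNode string // node the disk is actually attached to
}

func (e *danglingError) Error() string {
	return fmt.Sprintf("volume %s is attached to %s, not the expected node", e.volumePath, e.currentNode)
}

// verifyAttached pretends to ask vCenter whether volumePath is attached to
// expectedNode; attachedTo simulates the answer from the vSphere API.
func verifyAttached(volumePath, expectedNode, attachedTo string) error {
	if attachedTo == "" {
		return fmt.Errorf("volume %s is not attached to any node", volumePath)
	}
	if attachedTo != expectedNode {
		return &danglingError{volumePath: volumePath, currentNode: attachedTo}
	}
	return nil
}

func main() {
	// The repro in comment 18: the disk was attached to compute-0 by hand,
	// but the pod (and therefore the expected attachment) is on compute-1.
	err := verifyAttached("[WorkloadDatastore] .../compute-0_1.vmdk", "compute-1", "compute-0")
	var dangling *danglingError
	if errors.As(err, &dangling) {
		// The ADC would now queue a detach from the stale node, after which
		// the normal attach to compute-1 can succeed (see the KCM log below).
		fmt.Printf("dangling volume: detach from %s, then retry the attach\n", dangling.currentNode)
	}
}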

Comment 18 Wei Duan 2021-01-19 04:38:20 UTC
This is hard to reproduce during an upgrade or node reboot. I checked with Hemant: the root cause is an attach/detach issue when the node is in an unexpected state, and the fix adds a check to make sure the volume can eventually be attached.

With the following scenario, the issue was reproduced on an OCP 4.5 cluster and passed with the fix in 4.7.0-0.nightly-2021-01-18-000316:
1. Create a volume and attach it to compute-0
2. Create a PV/PVC with this volume (example manifests of this shape are sketched after the KCM log below)
3. Create a pod with the PVC and schedule it to compute-1
4. After some failed attempts, the volume is attached/mounted to compute-1, with the following events and KCM log:

Events:
  Type     Reason                  Age                            From                     Message
  ----     ------                  ----                           ----                     -------
  Normal   Scheduled               <unknown>                                               Successfully assigned wduan-1/pod-vsphere to compute-1
  Warning  FailedAttachVolume      <invalid> (x3 over <invalid>)  attachdetach-controller  AttachVolume.Attach failed for volume "pv-vsphere-1" : Failed to add disk 'scsi0:1'.
  Normal   SuccessfulAttachVolume  <invalid>                      attachdetach-controller  AttachVolume.Attach succeeded for volume "pv-vsphere-1"
  Normal   AddedInterface          <invalid>                      multus                   Add eth0 [10.131.0.62/23]
  Normal   Pulled                  <invalid>                      kubelet, compute-1       Container image "quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40" already present on machine
  Normal   Created                 <invalid>                      kubelet, compute-1       Created container myfrontend
  Normal   Started                 <invalid>                      kubelet, compute-1       Started container myfrontend

KCM log:
E0119 02:50:59.148140       1 vsphere.go:945] Failed to attach disk: [WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk for node: compute-1. err: +Failed to add disk 'scsi0:1'.
E0119 02:50:59.161836       1 attacher.go:93] Error attaching volume "[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" to node "compute-1": Failed to add disk 'scsi0:1'.
I0119 02:50:59.161884       1 actual_state_of_world.go:322] Volume "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" is already added to attachedVolume list to node "compute-0", update device path ""
E0119 02:50:59.161950       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk podName: nodeName:}" failed. No retries permitted until 2021-01-19 02:51:01.161924255 +0000 UTC m=+3807.819892376 (durationBeforeRetry 2s). Error: "AttachVolume.Attach failed for volume \"pv-vsphere-1\" (UniqueName: \"kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk\") from node \"compute-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.162422       1 event.go:291] "Event occurred" object="wduan-1/pod-vsphere" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pv-vsphere-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.220845       1 reconciler.go:203] attacherDetacher.DetachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.226117       1 operation_generator.go:1409] Verified volume is safe to detach for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.578699       1 operation_generator.go:470] DetachVolume.Detach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.621533       1 reconciler.go:275] attacherDetacher.AttachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
I0119 02:51:25.411742       1 operation_generator.go:360] AttachVolume.Attach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
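
In the log above the attach to compute-1 fails at first, the dangling disk is then detached from compute-0 (02:50:59), and the retried attach to compute-1 succeeds (02:51:25). For steps 1-3, minimal manifests of the shape used here might look like the sketch below; the volumePath, mount path, and node selector are hypothetical placeholders, while the pod name, namespace, container name, and image are taken from the events above.

# Hypothetical manifests for steps 1-3 of the reproduction. The vmdk was
# first attached to compute-0 by hand in vCenter before creating the pod.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-vsphere-1
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  vsphereVolume:
    volumePath: "[WorkloadDatastore] kubevols/example.vmdk"   # placeholder path
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-vsphere-1
  namespace: wduan-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  volumeName: pv-vsphere-1
  storageClassName: ""        # bind statically to the PV above
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-vsphere
  namespace: wduan-1
spec:
  nodeSelector:
    kubernetes.io/hostname: compute-1   # force scheduling to the other node
  containers:
    - name: myfrontend
      image: quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40
      volumeMounts:
        - name: data
          mountPath: /mnt/storage       # placeholder mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-vsphere-1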

Comment 21 errata-xmlrpc 2021-02-24 15:26:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

