Bug 1889420

Summary: OCP failed to add vsphere disk when pod moved to new node during cluster upgrade
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Kubernetes
Version: 4.4
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: VERIFIED
Severity: medium
Priority: unspecified
Type: Bug
Regression: ---
Reporter: Neil Girard <ngirard>
Assignee: Hemant Kumar <hekumar>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, hekumar

Description Neil Girard 2020-10-19 15:55:06 UTC
Description of problem:

Volumes for Jenkins pods failed to mount during an upgrade of the cluster. We had to manually detach them in vSphere from the old node so the new node could pick them up.

Version-Release number of selected component (if applicable):

OCP 4.4

How reproducible:

Rare.

Actual results:

Many volumes failed to attach to pods

Expected results:

No volume failures when moving pods

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

kubernetes.io/vsphere-volume
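No PV/PVC dump was attached; for reference, a typical in-tree vsphere PV/PVC pair using this plugin looks like the following. This is a hypothetical example: the names, `volumePath`, and sizes are illustrative, not taken from this bug.

```yaml
# Hypothetical in-tree vsphere PV/PVC pair (illustrative values only).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-vsphere-1
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  vsphereVolume:
    volumePath: "[WorkloadDatastore] kubevols/example.vmdk"
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-vsphere-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  volumeName: pv-vsphere-1
```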

Additional info:

Comment 15 Hemant Kumar 2021-01-12 03:53:35 UTC
I have backported the fix - https://github.com/openshift/kubernetes/pull/445


Note to QE though - this requires at least one pod with vsphere volume to be running in the cluster.

Comment 16 Hemant Kumar 2021-01-12 14:49:35 UTC
The limitation of the fix is:

The dangling-volume mechanism in vsphere has one limitation: because this code relies on the verify-volumes-are-attached feature of the attach/detach controller (ADC), there must be at least one pod with an attachable vsphere volume running in the cluster. This means that if a cluster had only one pod with a vsphere volume, and that volume was attached to the wrong node (i.e. dangling on the wrong node), then this mechanism won't work. In practice, I hope it is unlikely that we will have just one pod with a dangling volume.
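A minimal sketch of the correction path described above. This is a Python toy model, not the actual Kubernetes code; `DanglingAttachError` mirrors the name of the upstream error type, while `FakeVSphere` and `reconcile_attach` are invented stand-ins for the cloud provider and one ADC reconcile pass.

```python
# Toy model of the dangling-volume correction path (illustration only;
# the real logic lives in the Kubernetes attach/detach controller and
# the vsphere cloud provider).

class DanglingAttachError(Exception):
    """Volume found attached to a node other than the one we tried to
    attach it to (mirrors the upstream error type's intent)."""
    def __init__(self, volume, current_node):
        super().__init__(f"{volume} is dangling on {current_node}")
        self.volume = volume
        self.current_node = current_node

class FakeVSphere:
    """Toy stand-in for the cloud provider: tracks disk -> VM attachments."""
    def __init__(self, attachments):
        self.attachments = dict(attachments)   # volume -> node

    def attach(self, volume, node):
        holder = self.attachments.get(volume)
        if holder is not None and holder != node:
            # Attach fails (like "Failed to add disk 'scsi0:1'" in the
            # bug), but reports where the disk actually is.
            raise DanglingAttachError(volume, holder)
        self.attachments[volume] = node

    def detach(self, volume, node):
        if self.attachments.get(volume) == node:
            del self.attachments[volume]

def reconcile_attach(cloud, actual_state, volume, desired_node):
    """One ADC-style pass: on a dangling error, record the real
    attachment so the normal detach path frees the disk, then retry."""
    try:
        cloud.attach(volume, desired_node)
    except DanglingAttachError as err:
        # Mark the volume as attached to the *wrong* node. This is the
        # step that depends on verify-volumes-are-attached running,
        # hence the at-least-one-pod limitation above.
        actual_state[volume] = err.current_node
        cloud.detach(volume, err.current_node)   # safe-to-detach assumed
        cloud.attach(volume, desired_node)       # retry now succeeds
    actual_state[volume] = desired_node
    return actual_state

# The disk starts attached to compute-0 (the old node), as in the bug.
cloud = FakeVSphere({"pv-vsphere-1": "compute-0"})
state = reconcile_attach(cloud, {}, "pv-vsphere-1", "compute-1")
print(state["pv-vsphere-1"])
```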

Comment 18 Wei Duan 2021-01-19 04:38:20 UTC
It is hard to reproduce during an upgrade or node reboot. Checked with Hemant: the root cause is an attach/detach issue when the controller encounters an unexpected node status, and the fix checks for this and ensures the volume can eventually be attached to the correct node.

With the following scenario, the issue reproduced on an OCP 4.5 cluster and passed with the fix in 4.7.0-0.nightly-2021-01-18-000316:
1. Create a volume and attach it to compute-0
2. Create a PV/PVC with this volume
3. Create a pod with the PVC, scheduled to compute-1
4. After some failed attach attempts, the volume attaches/mounts to compute-1, with the following events and KCM log:

Events:
  Type     Reason                  Age                            From                     Message
  ----     ------                  ----                           ----                     -------
  Normal   Scheduled               <unknown>                                               Successfully assigned wduan-1/pod-vsphere to compute-1
  Warning  FailedAttachVolume      <invalid> (x3 over <invalid>)  attachdetach-controller  AttachVolume.Attach failed for volume "pv-vsphere-1" : Failed to add disk 'scsi0:1'.
  Normal   SuccessfulAttachVolume  <invalid>                      attachdetach-controller  AttachVolume.Attach succeeded for volume "pv-vsphere-1"
  Normal   AddedInterface          <invalid>                      multus                   Add eth0 [10.131.0.62/23]
  Normal   Pulled                  <invalid>                      kubelet, compute-1       Container image "quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40" already present on machine
  Normal   Created                 <invalid>                      kubelet, compute-1       Created container myfrontend
  Normal   Started                 <invalid>                      kubelet, compute-1       Started container myfrontend

KCM log:
E0119 02:50:59.148140       1 vsphere.go:945] Failed to attach disk: [WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk for node: compute-1. err: +Failed to add disk 'scsi0:1'.
E0119 02:50:59.161836       1 attacher.go:93] Error attaching volume "[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" to node "compute-1": Failed to add disk 'scsi0:1'.
I0119 02:50:59.161884       1 actual_state_of_world.go:322] Volume "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" is already added to attachedVolume list to node "compute-0", update device path ""
E0119 02:50:59.161950       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk podName: nodeName:}" failed. No retries permitted until 2021-01-19 02:51:01.161924255 +0000 UTC m=+3807.819892376 (durationBeforeRetry 2s). Error: "AttachVolume.Attach failed for volume \"pv-vsphere-1\" (UniqueName: \"kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk\") from node \"compute-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.162422       1 event.go:291] "Event occurred" object="wduan-1/pod-vsphere" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pv-vsphere-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.220845       1 reconciler.go:203] attacherDetacher.DetachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.226117       1 operation_generator.go:1409] Verified volume is safe to detach for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.578699       1 operation_generator.go:470] DetachVolume.Detach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.621533       1 reconciler.go:275] attacherDetacher.AttachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
I0119 02:51:25.411742       1 operation_generator.go:360] AttachVolume.Attach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
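The "No retries permitted until ... (durationBeforeRetry 2s)" line in the log above comes from the controller's per-operation exponential backoff in nestedpendingoperations. A minimal sketch of that pattern follows; the constants are illustrative, not the exact upstream values.

```python
# Sketch of per-operation exponential backoff: after each failure the
# wait before the next retry doubles, up to a cap. Constants here are
# illustrative, not the exact Kubernetes values.

INITIAL_BACKOFF = 0.5   # seconds
BACKOFF_FACTOR = 2.0
MAX_BACKOFF = 120.0     # cap so retries never wait unboundedly long

def next_backoff(current):
    """Return the delay to wait before the next retry."""
    if current == 0:
        return INITIAL_BACKOFF
    return min(current * BACKOFF_FACTOR, MAX_BACKOFF)

delays, d = [], 0
for _ in range(4):
    d = next_backoff(d)
    delays.append(d)
print(delays)   # the 2s in the log corresponds to one such step
```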