Bug 1889420

Summary: OCP failed to add vsphere disk when pod moved to new node during cluster upgrade
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Kubernetes
Version: 4.4
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: VERIFIED
Severity: medium
Priority: unspecified
Type: Bug
Regression: ---
Reporter: Neil Girard <ngirard>
Assignee: Hemant Kumar <hekumar>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, hekumar

Description Neil Girard 2020-10-19 15:55:06 UTC
Description of problem:

Volumes for Jenkins pods failed to mount during an upgrade of the cluster. We had to manually detach them in vSphere from the old node so the new node could pick them up.

Version-Release number of selected component (if applicable):

OCP 4.4

How reproducible:

Rare.

Actual results:

Many volumes failed to attach to pods

Expected results:

No volume failures when moving pods

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

kubernetes.io/vsphere-volume
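No PV/PVC dump was attached; for reference, a typical in-tree vsphere PV/PVC pair using this plugin looks like the following. This is a hypothetical example: the names, `volumePath`, and sizes are illustrative, not taken from this bug.

```yaml
# Hypothetical in-tree vsphere PV/PVC pair (illustrative values only).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-vsphere-1
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  vsphereVolume:
    volumePath: "[WorkloadDatastore] kubevols/example.vmdk"
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-vsphere-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  volumeName: pv-vsphere-1
```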

Additional info:

Comment 15 Hemant Kumar 2021-01-12 03:53:35 UTC
I have backported the fix - https://github.com/openshift/kubernetes/pull/445


Note to QE though - this requires at least one pod with vsphere volume to be running in the cluster.

Comment 16 Hemant Kumar 2021-01-12 14:49:35 UTC
The limitation of the fix is:

The dangling-volume mechanism in vsphere has one limitation: because this code relies on the verify-volumes-are-attached feature of the attach/detach controller (ADC), there must be at least one pod with an attachable vsphere volume running in the cluster. This means that if a cluster had only one pod with a vsphere volume, and that volume was attached to the wrong node (i.e. dangling on the wrong node), then this mechanism won't work. In practice, I hope it is unlikely that we will have just one pod with a dangling volume.
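A minimal sketch of the correction path described above. This is a Python toy model, not the actual Kubernetes code; `DanglingAttachError` mirrors the name of the upstream error type, while `FakeVSphere` and `reconcile_attach` are invented stand-ins for the cloud provider and one ADC reconcile pass.

```python
# Toy model of the dangling-volume correction path (illustration only;
# the real logic lives in the Kubernetes attach/detach controller and
# the vsphere cloud provider).

class DanglingAttachError(Exception):
    """Volume found attached to a node other than the one we tried to
    attach it to (mirrors the upstream error type's intent)."""
    def __init__(self, volume, current_node):
        super().__init__(f"{volume} is dangling on {current_node}")
        self.volume = volume
        self.current_node = current_node

class FakeVSphere:
    """Toy stand-in for the cloud provider: tracks disk -> VM attachments."""
    def __init__(self, attachments):
        self.attachments = dict(attachments)   # volume -> node

    def attach(self, volume, node):
        holder = self.attachments.get(volume)
        if holder is not None and holder != node:
            # Attach fails (like "Failed to add disk 'scsi0:1'" in the
            # bug), but reports where the disk actually is.
            raise DanglingAttachError(volume, holder)
        self.attachments[volume] = node

    def detach(self, volume, node):
        if self.attachments.get(volume) == node:
            del self.attachments[volume]

def reconcile_attach(cloud, actual_state, volume, desired_node):
    """One ADC-style pass: on a dangling error, record the real
    attachment so the normal detach path frees the disk, then retry."""
    try:
        cloud.attach(volume, desired_node)
    except DanglingAttachError as err:
        # Mark the volume as attached to the *wrong* node. This is the
        # step that depends on verify-volumes-are-attached running,
        # hence the at-least-one-pod limitation above.
        actual_state[volume] = err.current_node
        cloud.detach(volume, err.current_node)   # safe-to-detach assumed
        cloud.attach(volume, desired_node)       # retry now succeeds
    actual_state[volume] = desired_node
    return actual_state

# The disk starts attached to compute-0 (the old node), as in the bug.
cloud = FakeVSphere({"pv-vsphere-1": "compute-0"})
state = reconcile_attach(cloud, {}, "pv-vsphere-1", "compute-1")
print(state["pv-vsphere-1"])
```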

Comment 18 Wei Duan 2021-01-19 04:38:20 UTC
It is hard to reproduce during an upgrade or node reboot. Checked with Hemant: the root cause is an attach/detach issue when the controller encounters an unexpected node status, and the fix checks for this and ensures the volume can eventually be attached to the correct node.

With the following scenario, the issue reproduced on an OCP 4.5 cluster and passed with the fix in 4.7.0-0.nightly-2021-01-18-000316:
1. Create a volume and attach it to compute-0
2. Create a PV/PVC with this volume
3. Create a pod with the PVC, scheduled to compute-1
4. After some failed attach attempts, the volume attaches/mounts to compute-1, with the following events and KCM log:

Events:
  Type     Reason                  Age                            From                     Message
  ----     ------                  ----                           ----                     -------
  Normal   Scheduled               <unknown>                                               Successfully assigned wduan-1/pod-vsphere to compute-1
  Warning  FailedAttachVolume      <invalid> (x3 over <invalid>)  attachdetach-controller  AttachVolume.Attach failed for volume "pv-vsphere-1" : Failed to add disk 'scsi0:1'.
  Normal   SuccessfulAttachVolume  <invalid>                      attachdetach-controller  AttachVolume.Attach succeeded for volume "pv-vsphere-1"
  Normal   AddedInterface          <invalid>                      multus                   Add eth0 [10.131.0.62/23]
  Normal   Pulled                  <invalid>                      kubelet, compute-1       Container image "quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40" already present on machine
  Normal   Created                 <invalid>                      kubelet, compute-1       Created container myfrontend
  Normal   Started                 <invalid>                      kubelet, compute-1       Started container myfrontend

KCM log:
E0119 02:50:59.148140       1 vsphere.go:945] Failed to attach disk: [WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk for node: compute-1. err: +Failed to add disk 'scsi0:1'.
E0119 02:50:59.161836       1 attacher.go:93] Error attaching volume "[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" to node "compute-1": Failed to add disk 'scsi0:1'.
I0119 02:50:59.161884       1 actual_state_of_world.go:322] Volume "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk" is already added to attachedVolume list to node "compute-0", update device path ""
E0119 02:50:59.161950       1 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk podName: nodeName:}" failed. No retries permitted until 2021-01-19 02:51:01.161924255 +0000 UTC m=+3807.819892376 (durationBeforeRetry 2s). Error: "AttachVolume.Attach failed for volume \"pv-vsphere-1\" (UniqueName: \"kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk\") from node \"compute-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.162422       1 event.go:291] "Event occurred" object="wduan-1/pod-vsphere" kind="Pod" apiVersion="v1" type="Warning" reason="FailedAttachVolume" message="AttachVolume.Attach failed for volume \"pv-vsphere-1\" : Failed to add disk 'scsi0:1'."
I0119 02:50:59.220845       1 reconciler.go:203] attacherDetacher.DetachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.226117       1 operation_generator.go:1409] Verified volume is safe to detach for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.578699       1 operation_generator.go:470] DetachVolume.Detach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") on node "compute-0"
I0119 02:50:59.621533       1 reconciler.go:275] attacherDetacher.AttachVolume started for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
I0119 02:51:25.411742       1 operation_generator.go:360] AttachVolume.Attach succeeded for volume "pv-vsphere-1" (UniqueName: "kubernetes.io/vsphere-volume/[WorkloadDatastore] 6a360660-caca-8237-786c-06b05fd3a2f7/compute-0_1.vmdk") from node "compute-1"
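The "No retries permitted until ... (durationBeforeRetry 2s)" line in the log above comes from the controller's per-operation exponential backoff in nestedpendingoperations. A minimal sketch of that pattern follows; the constants are illustrative, not the exact upstream values.

```python
# Sketch of per-operation exponential backoff: after each failure the
# wait before the next retry doubles, up to a cap. Constants here are
# illustrative, not the exact Kubernetes values.

INITIAL_BACKOFF = 0.5   # seconds
BACKOFF_FACTOR = 2.0
MAX_BACKOFF = 120.0     # cap so retries never wait unboundedly long

def next_backoff(current):
    """Return the delay to wait before the next retry."""
    if current == 0:
        return INITIAL_BACKOFF
    return min(current * BACKOFF_FACTOR, MAX_BACKOFF)

delays, d = [], 0
for _ in range(4):
    d = next_backoff(d)
    delays.append(d)
print(delays)   # the 2s in the log corresponds to one such step
```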