Bug 1622245 - Unable to start an ESXi node after a pod got migrated from a shutdown VM
Summary: Unable to start an ESXi node after a pod got migrated from a shutdown VM
Keywords:
Status: CLOSED DUPLICATE of bug 1619514
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.9.z
Assignee: Hemant Kumar
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-24 21:26 UTC by Hemant Kumar
Modified: 2018-11-02 17:11 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-02 17:11:45 UTC
Target Upstream Version:
Embargoed:



Description Hemant Kumar 2018-08-24 21:26:58 UTC
Description of problem:


Version-Release number of selected component (if applicable): latest 3.9/rhel-7.5


How reproducible:


Steps to Reproduce:
1. Create a 2-VM OpenShift cluster on vSphere.
2. Create some deployments on the cluster with vSphere persistent volumes. Make sure the slave node gets some pods.
3. Shut down the slave node.
4. The node API object of the shutdown node gets removed and all pods from it are migrated.
5. Try to bring the old node back.

What happens:
The old node refuses to start; see https://gist.github.com/gnufied/40fe436dd885311e8ee520ac67bd84ad.

This happens because the volumes never get detached from the old node, and those same volumes get attached to a new node.

Aug 24 11:52:06 vim-master.lan atomic-openshift-master-controllers[23519]: E0824 11:52:06.691814   23519 attacher.go:260] Error checking if volume ("[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk") is already attached to current node ("vim-node.lan"). Will continue and try detach anyway. err=No VM found
Aug 24 11:52:06 vim-master.lan atomic-openshift-master-controllers[23519]: E0824 11:52:06.691828   23519 attacher.go:274] Error detaching volume "[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk": No VM found
Aug 24 11:52:06 vim-master.lan atomic-openshift-master-controllers[23519]: E0824 11:52:06.691846   23519 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/vsphere-volume/[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk\"" failed. No retries permitted until 2018-08-24 11:52:07.191836175 -0400 EDT m=+3859.550119407 (durationBeforeRetry 500ms). Error: "DetachVolume.Detach failed for volume \"pvc-0e972db5-a7b1-11e8-a954-00505694a8ab\" (UniqueName: \"kubernetes.io/vsphere-volume/[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk\") on node \"vim-node.lan\" : No VM found"
Aug 24 11:52:07 vim-master.lan atomic-openshift-master-controllers[23519]: W0824 11:52:07.192523   23519 reconciler.go:235] attacherDetacher.DetachVolume started for volume "pvc-0e972db5-a7b1-11e8-a954-00505694a8ab" (UniqueName: "kubernetes.io/vsphere-volume/[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk") on node "vim-node.lan" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
Aug 24 11:52:07 vim-master.lan atomic-openshift-master-controllers[23519]: E0824 11:52:07.192729   23519 attacher.go:260] Error checking if volume ("[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk") is already attached to current node ("vim-node.lan"). Will continue and try detach anyway. err=No VM found
Aug 24 11:52:07 vim-master.lan atomic-openshift-master-controllers[23519]: E0824 11:52:07.192743   23519 attacher.go:274] Error detaching volume "[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk": No VM found
Aug 24 11:52:07 vim-master.lan atomic-openshift-master-controllers[23519]: E0824 11:52:07.192759   23519 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/vsphere-volume/[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk\"" failed. No retries permitted until 2018-08-24 11:52:08.192751187 -0400 EDT m=+3860.551034418 (durationBeforeRetry 1s). Error: "DetachVolume.Detach failed for volume \"pvc-0e972db5-a7b1-11e8-a954-00505694a8ab\" (UniqueName: \"kubernetes.io/vsphere-volume/[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk\") on node \"vim-node.lan\" : No VM found"
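For context, here is a minimal sketch in Go (purely illustrative, not the actual controller code) of the decision visible in the reconciler.go line above: once maxWaitForUnmountDuration (6m) has expired without the node confirming unmount, the controller force-detaches anyway and keeps retrying on failure. The type and function names (attachedVolume, shouldForceDetach) are invented for this sketch.

// Hypothetical sketch of the force-detach timing decision seen in the log.
package main

import (
	"fmt"
	"time"
)

type attachedVolume struct {
	name            string
	node            string
	safeToDetach    bool      // the node has confirmed the volume is unmounted
	detachRequested time.Time // when the controller first wanted it off the node
}

const maxWaitForUnmountDuration = 6 * time.Minute

// shouldForceDetach mirrors "This volume is not safe to detach, but
// maxWaitForUnmountDuration 6m0s expired, force detaching".
func shouldForceDetach(v attachedVolume, now time.Time) bool {
	if v.safeToDetach {
		return true
	}
	return now.Sub(v.detachRequested) > maxWaitForUnmountDuration
}

func main() {
	v := attachedVolume{
		name:            "pvc-0e972db5-a7b1-11e8-a954-00505694a8ab",
		node:            "vim-node.lan",
		safeToDetach:    false, // the shutdown node can never confirm the unmount
		detachRequested: time.Now().Add(-7 * time.Minute),
	}
	if shouldForceDetach(v, time.Now()) {
		fmt.Printf("force detaching %s from %s\n", v.name, v.node)
	}
}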



Actual results:
Can't resume a shutdown node.

Expected results:
Should be able to resume a shutdown node.

Additional info:
The main bug here is that detaching volumes from a shutdown node never actually works, which is why the shutdown node cannot be resumed.
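To make that failure mode concrete, here is a minimal, hypothetical sketch (not the real vSphere cloud-provider code) of why every detach attempt ends in "No VM found": the detach path looks the VM up by node name, and once the VM is shut down and its node object is gone, that lookup never succeeds, so the volume stays attached to the powered-off VM while the controller retries forever. The names inventory, detachDisk, and errNoVMFound are invented for illustration.

// Hypothetical sketch of the detach path that can never succeed for a
// shutdown node whose VM is no longer found in the vCenter inventory.
package main

import (
	"errors"
	"fmt"
)

var errNoVMFound = errors.New("No VM found")

// inventory stands in for the vCenter view of registered, running VMs.
type inventory map[string]bool

// detachDisk is illustrative: a real implementation would call the vSphere
// API; here the failure mode is reduced to the VM lookup by node name.
func detachDisk(inv inventory, volPath, nodeName string) error {
	if !inv[nodeName] {
		// The state the shutdown node is in: detach can never complete,
		// so the volume stays attached to the powered-off VM.
		return fmt.Errorf("detaching %q from %q: %w", volPath, nodeName, errNoVMFound)
	}
	fmt.Printf("detached %q from %q\n", volPath, nodeName)
	return nil
}

func main() {
	inv := inventory{"vim-master.lan": true} // vim-node.lan is shut down
	vol := "[datastore1] kubevols/kubernetes-dynamic-pvc-0e972db5-a7b1-11e8-a954-00505694a8ab.vmdk"
	if err := detachDisk(inv, vol, "vim-node.lan"); err != nil {
		fmt.Println("detach failed:", err) // retried with backoff, never succeeds
	}
}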



Comment 1 Hemant Kumar 2018-08-24 21:29:05 UTC
Possibly related upstream PR: https://github.com/kubernetes/kubernetes/pull/67825

It looks like vSphere volumes do not support multi-attach, yet that flag is enabled, which causes the same volume to be mounted in multiple places without first being detached from the old node.
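As an illustration of the kind of guard that would prevent this (a hedged sketch, not the code in the linked PR), the attach/detach controller could refuse to attach a volume to a second node while the plugin reports no multi-attach support and the volume is still attached elsewhere. volumePlugin, attachments, and canAttach below are hypothetical names.

// Hypothetical multi-attach guard: reject attaching to a new node while the
// volume is still attached to the old (shutdown) node.
package main

import "fmt"

type volumePlugin struct {
	name                string
	supportsMultiAttach bool
}

type attachments map[string]string // volume -> node it is currently attached to

func canAttach(p volumePlugin, att attachments, volume, newNode string) error {
	if oldNode, ok := att[volume]; ok && oldNode != newNode && !p.supportsMultiAttach {
		return fmt.Errorf("%s: volume %q is still attached to %q; refusing multi-attach", p.name, volume, oldNode)
	}
	return nil
}

func main() {
	vsphere := volumePlugin{name: "kubernetes.io/vsphere-volume", supportsMultiAttach: false}
	att := attachments{"pvc-0e972db5": "vim-node.lan"} // never detached from the shutdown node
	if err := canAttach(vsphere, att, "pvc-0e972db5", "vim-node2.lan"); err != nil {
		fmt.Println(err) // with the flag wrongly enabled, this check is skipped today
	}
}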

Comment 2 Hemant Kumar 2018-09-18 18:59:08 UTC
Opened https://github.com/openshift/origin/pull/21025 to backport the fix to OpenShift.

Comment 6 Jianwei Hou 2018-10-22 15:18:46 UTC
This failed my test on v3.9.45: after the node is shut down, it cannot be started unless the volume is manually removed from it.

Steps:
1. Prepare two nodes and create some deployments on node a.
2. Shut down node a. Pods are scheduled to node b and the volume is attached to node b.
3. Start node a.

Node a cannot be started.

