Bug 1645260

Summary: vSphere Cloud provider: detach volume when node is not present/ powered off
Product: OpenShift Container Platform
Component: Storage
Version: 3.10.0
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: medium
Status: CLOSED ERRATA
Reporter: Hemant Kumar <hekumar>
Assignee: Hemant Kumar <hekumar>
QA Contact: Wenqi He <wehe>
CC: aos-bugs, aos-storage-staff, bbennett, bchilds, fshaikh, hekumar, hgomes, jkaur, jokerman, lxia, misalunk, mmccomas, sgarciam, tmanor
Keywords: NeedsTestCase
Type: Bug
Clone Of: 1619514
Last Closed: 2018-12-13 17:09:08 UTC
Bug Depends On: 1619514
Bug Blocks: 1645258

Comment 1 Hemant Kumar 2018-11-01 18:34:02 UTC
https://github.com/openshift/origin/pull/21409

Comment 7 Hemant Kumar 2018-11-29 15:07:43 UTC
Hmm, small modifications.

1. It is better to create a deployment rather than a plain pod for verifying this bug, because a pod backed by a deployment will be rescheduled when the node where it was running is shut down (see the command sketch after this list).
2. You will notice that the newly created pod (on the new node) is stuck in the ContainerCreating state while the old pod (on the shut-down node) is in the Unknown state. At this point you must force delete the old pod. Typically this can be done by:

oc delete pod <pod> --force=true --grace-period=0

3. This should allow the new pod to start running on the new node.
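
For example, one minimal way to back the test pod with a deployment config and attach the claim is sketched below; the dc name, claim name, image, and mount path are placeholders rather than required values:

# oc create deploymentconfig vsphere --image=aosqe/hello-openshift
# oc set volume dc/vsphere --add --type=persistentVolumeClaim \
    --claim-name=azpvc --mount-path=/mnt/vsphere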

Comment 8 Wenqi He 2018-11-30 02:52:04 UTC
Thanks HK!
I have verified this bug on the following version:

openshift v3.10.79
kubernetes v1.10.0+b81c8f8

1. Create sc, pvc and dc
# oc get dc
NAME               REVISION   DESIRED   CURRENT   TRIGGERED BY
vsphere            1          1         1         config

# oc get pvc
NAME      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
azpvc     Bound     pvc-6696150c-f3af-11e8-8288-0050569f4627   1Gi        RWO            standard       18h

# oc get pods
NAME              READY     STATUS    RESTARTS   AGE
vsphere-1-4kwj7   1/1       Running   0          2m

# oc get pods vsphere-1-4kwj7 -o yaml | grep node
  nodeName: ocp310.node1.vsphere.local
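
For reference, manifests roughly like the following would produce the objects above; the StorageClass parameters, image, and mount path are illustrative sketches, not necessarily the exact ones used in this run:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azpvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard
---
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: vsphere
spec:
  replicas: 1
  selector:
    app: vsphere
  triggers:
    - type: ConfigChange
  template:
    metadata:
      labels:
        app: vsphere
    spec:
      containers:
        - name: hello-openshift
          image: aosqe/hello-openshift
          volumeMounts:
            - name: azure
              mountPath: /mnt/vsphere
      volumes:
        - name: azure
          persistentVolumeClaim:
            claimName: azpvc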

2. Power off the node and delete the pod

# oc get nodes
NAME                          STATUS     ROLES     AGE       VERSION
ocp310.master.vsphere.local   Ready      master    49d       v1.10.0+b81c8f8
ocp310.node1.vsphere.local    NotReady   compute   49d       v1.10.0+b81c8f8

# oc delete pod vsphere-1-4kwj7

# oc get pods
NAME              READY     STATUS              RESTARTS   AGE
vsphere-1-4kwj7   1/1       Terminating         0          5m
vsphere-1-vh4zl   0/1       ContainerCreating   0          10s

3. Then force delete the pod
# oc delete pod vsphere-1-4kwj7 --force=true --grace-period=0

# oc get pods
NAME              READY     STATUS              RESTARTS   AGE
vsphere-1-vh4zl   0/1       ContainerCreating   0          4m

# oc get pods
NAME              READY     STATUS    RESTARTS   AGE
vsphere-1-vh4zl   1/1       Running   0          11m

# oc describe pods vsphere-1-vh4zl
Events:
  Type     Reason                  Age              From                                  Message
  ----     ------                  ----             ----                                  -------
  Normal   Scheduled               11m              default-scheduler                     Successfully assigned vsphere-1-vh4zl to ocp310.master.vsphere.local
  Warning  FailedAttachVolume      11m              attachdetach-controller               Multi-Attach error for volume "pvc-6696150c-f3af-11e8-8288-0050569f4627" Volume is already used by pod(s) vsphere-1-4kwj7
  Warning  FailedMount             2m (x4 over 9m)  kubelet, ocp310.master.vsphere.local  Unable to mount volumes for pod "vsphere-1-vh4zl_default(ed1be1a9-f448-11e8-8288-0050569f4627)": timeout expired waiting for volumes to attach or mount for pod "default"/"vsphere-1-vh4zl". list of unmounted volumes=[azure]. list of unattached volumes=[azure default-token-t9wrs]
  Normal   SuccessfulAttachVolume  1m               attachdetach-controller               AttachVolume.Attach succeeded for volume "pvc-6696150c-f3af-11e8-8288-0050569f4627"
  Normal   Pulled                  1m               kubelet, ocp310.master.vsphere.local  Container image "aosqe/hello-openshift" already present on machine
  Normal   Created                 1m               kubelet, ocp310.master.vsphere.local  Created container
  Normal   Started                 1m               kubelet, ocp310.master.vsphere.local  Started container

This fix is awesome: even when the node is shut down, the disk can still be detached and attached to the new node again. Thanks!
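
As an additional check (a sketch, not part of the original verification steps), the attach/detach controller's view can be confirmed from the node objects: once the detach happens, the volume should disappear from the powered-off node's status.volumesAttached and show up on the new node's:

# oc get node ocp310.node1.vsphere.local -o jsonpath='{.status.volumesAttached}'
# oc get node ocp310.master.vsphere.local -o jsonpath='{.status.volumesAttached}'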

Comment 10 errata-xmlrpc 2018-12-13 17:09:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3750