Affects all OpenShift versions. If a user shuts off an AWS node, the node is removed from the node list and pods managed by a replication controller etc. are evicted. However, if the controller-manager is restarted while this eviction is in progress (or the active controller-manager switches in an HA environment), volumes attached to the shutdown node are never detached, even when the node comes back online.

Steps to reproduce:
1. Create a multi-node cluster and schedule a number of Deployments to different nodes of the cluster.
2. Shut down one of the nodes of the cluster. Wait a while for the node to be removed from the node list (spam oc get nodes).
3. Once the node is removed, you will notice that pods are being migrated to healthy nodes, but pods with volumes will not start correctly because the volumes are still attached to the old (now switched-off) node.
4. Right at this point, restart the controller-manager.
5. Observe that the pods being migrated are stuck forever in ContainerCreating.
6. Bring the old node back. The volumes remain attached to it.

The root cause of this problem is that the controller-manager rebuilds its set of "known" attached volumes from each node's status. If a node goes down and the controller-manager is restarted right after, the Attach/Detach controller cannot recover the volumes attached to the old node. (A sketch for confirming the dangling attachment follows below.)
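To confirm the dangling attachment from the reproduction above, one way (a sketch, assuming an AWS-backed cluster; the node name, pod name, and EBS volume ID below are placeholders) is to compare what the cluster reports with what AWS reports:

# Volumes the node still reports in its status (this is what the
# controller-manager rebuilds its "known" attachments from).
oc get node <old-node-name> -o jsonpath='{.status.volumesAttached}'
oc get node <old-node-name> -o jsonpath='{.status.volumesInUse}'

# The Events section of the rescheduled pod should show attach/mount failures
# while the volume is still held by the old node.
oc describe pod <stuck-pod-name>

# Cross-check on the AWS side which instance the EBS volume is attached to.
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 --query 'Volumes[0].Attachments'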
We are discussing a couple of approaches to fix this category of problems for good, but I do not think it will be ready in time for 3.7.
We have a working fix for this problem: https://github.com/openshift/origin/pull/17544. Detaches from shutdown nodes are now handled properly, and any dangling volume errors will correct themselves.
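To observe that self-correction (a sketch; the pod name is a placeholder), watch the events and the rescheduled pod until the attach succeeds on the new node:

# Attach failures (e.g. FailedAttachVolume) should eventually be followed
# by a successful attach once the volume is detached from the shutdown node.
oc get events --watch

# The rescheduled pod should leave ContainerCreating and reach Running.
oc get pod <rescheduled-pod-name> -o wide --watch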
This has been fixed in 3.9.
Verified as passed on:
oc v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip-172-18-1-63.ec2.internal:443
openshift v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657

Verification steps (see the command sketch below):
1. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
2. oc volume dc/ruby-ex --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=ebsc1
3. Shut down the node the pod is scheduled to.
4. The pod is rescheduled to a new node.
5. Restart atomic-openshift-master-controllers.service.
6. The pod is running on the new node.
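The same verification, consolidated as a command sketch (assuming a running OpenShift 3.9 cluster on AWS and a pre-created PVC named ebsc1, as in the steps above; shutting down the node itself is done out of band, e.g. from the AWS console):

# Deploy the sample app and attach the persistent volume claim to it.
oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
oc volume dc/ruby-ex --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=ebsc1

# Find the node the pod landed on, then shut that node down.
oc get pods -o wide

# On the master, restart the controllers while the pod is being rescheduled.
systemctl restart atomic-openshift-master-controllers.service

# Confirm the pod reaches Running on a new node.
oc get pods -o wide --watch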
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489