Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1510178

Summary: Restarting the controller-manager while pod migration is in process can leave volumes permanently attached
Product: OpenShift Container Platform
Reporter: Hemant Kumar <hekumar>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Status: CLOSED ERRATA
QA Contact: Chao Yang <chaoyang>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.7.1
CC: aos-bugs, aos-storage-staff, jupierce
Target Milestone: ---
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-28 14:11:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Hemant Kumar 2017-11-06 20:58:42 UTC
Affects all OpenShift versions.

If a user shuts off an AWS node, the node is removed from the node list and pods managed by a replication controller (or similar) are evicted. If, while this eviction is in progress, the controller-manager is restarted (or the active controller-manager switches in an HA environment), the volumes attached to the shut-down node will never be detached, even when the node comes back online.

Steps to reproduce:
1. Create a multi-node cluster and schedule a bunch of Deployments across different nodes of the cluster.
2. Shut down one of the nodes of the cluster. Wait a while for the node to be removed from the node list (poll "oc get nodes").
3. Once the node is removed, you will notice that pods are migrated to healthy nodes, but pods with volumes will not start correctly because the volumes are still attached to the old (now switched-off) node.
4. Right at this point, restart the controller-manager.
5. Observe that the pods being migrated are stuck forever in ContainerCreating.
6. Bring back the old node. The volumes remain attached to it.

The root cause of this problem is that the controller-manager rebuilds its list of "known" attached volumes from each node's status. If a node is shut down and the controller-manager is restarted right after the node object is removed, the Attach/Detach controller cannot recover the volumes attached to the old node, so it never issues detaches for them.
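The rebuild-from-node-status behavior can be sketched as follows. This is a simplified illustration, not the actual controller code; all types and names here are invented for the example:

```go
package main

import "fmt"

// AttachedVolume mirrors the volume entries kept in a node's status.
type AttachedVolume struct {
	Name   string // volume ID, e.g. an EBS volume
	Device string
}

// Node is a minimal stand-in for the API node object.
type Node struct {
	Name            string
	VolumesAttached []AttachedVolume // from node.Status.VolumesAttached
}

// rebuildActualState imitates how a freshly started Attach/Detach
// controller recovers "known" attachments: it can only see volumes on
// nodes that still exist in the API. A node deleted while shut down
// contributes nothing, so its attachments are silently forgotten.
func rebuildActualState(nodes []Node) map[string][]string {
	state := make(map[string][]string)
	for _, n := range nodes {
		for _, v := range n.VolumesAttached {
			state[n.Name] = append(state[n.Name], v.Name)
		}
	}
	return state
}

func main() {
	// The shut-down node has already been removed from the node list,
	// so only the healthy node is visible after the restart.
	nodes := []Node{
		{Name: "node-1", VolumesAttached: []AttachedVolume{{Name: "vol-aaa", Device: "/dev/xvdba"}}},
		// node-2 (shut down, still holding vol-bbb) is absent.
	}
	state := rebuildActualState(nodes)
	fmt.Println(state["node-1"]) // [vol-aaa]
	fmt.Println(state["node-2"]) // [] -- vol-bbb is lost, so no detach is ever issued
}
```

Because the shut-down node's attachments never make it into the rebuilt state, the reconciler has nothing to detach, which matches the "stuck in ContainerCreating" symptom above.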

Comment 1 Hemant Kumar 2017-11-07 16:56:52 UTC
We are discussing a couple of approaches to fix this category of problems for good, but I do not think a fix will be ready in time for 3.7.

Comment 2 Hemant Kumar 2017-12-20 01:51:22 UTC
We have a working fix for this problem - https://github.com/openshift/origin/pull/17544

We now handle detaches from shut-down nodes properly, and any dangling-volume errors will correct themselves automatically.

Comment 3 Hemant Kumar 2018-01-18 22:54:18 UTC
This has been fixed in 3.9.

Comment 5 Chao Yang 2018-02-02 07:08:35 UTC
Verified on:
oc v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-1-63.ec2.internal:443
openshift v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657

1. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
2. oc volume dc/ruby-ex --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=ebsc1
3. Shut down the node the pod is scheduled to.
4. The pod is rescheduled to a new node.
5. Restart atomic-openshift-master-controllers.service.
6. The pod is running on the new node.

Comment 8 errata-xmlrpc 2018-03-28 14:11:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489