Bug 1510178 - Restarting the controller-manager while pod migration is in process can leave volumes permanently attached
Summary: Restarting the controller-manager while pod migration is in process can leave volumes permanently attached
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.9.0
Assignee: Hemant Kumar
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-06 20:58 UTC by Hemant Kumar
Modified: 2018-03-28 14:11 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 14:11:22 UTC
Target Upstream Version:
Embargoed:




Links:
  Red Hat Product Errata RHBA-2018:0489 (Last Updated: 2018-03-28 14:11:45 UTC)

Description Hemant Kumar 2017-11-06 20:58:42 UTC
Affects all OpenShift versions.

If a user shuts off an AWS node, the node is removed from the node list and pods managed by replication controllers etc. are evicted. But if, while this eviction is in progress, the controller-manager is restarted (or the active controller-manager switches over in an HA environment), volumes attached to the shut-down node will never be detached, even when the node comes back online.

Steps to reproduce:
1. Create a multi-node cluster and schedule a bunch of Deployments to different nodes of the cluster.
2. Shut down one of the nodes of the cluster. Wait a while for the node to be removed from the node list (spam oc get nodes).
3. Once the node is removed, you will notice that pods are being migrated to healthy nodes, but pods with volumes will not start correctly because the volumes are still attached to the old (now switched-off) node.
4. Right at this time, restart the controller-manager.
5. Observe that the pods being migrated are stuck forever in ContainerCreating.
6. Bring back the old node. Volumes remain attached to the node. (Example commands below.)
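A rough command sketch for these steps, assuming an AWS-backed cluster; instance IDs, volume IDs, and node names are placeholders:

  oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
  oc volume dc/ruby-ex --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=ebsc1
  aws ec2 stop-instances --instance-ids <instance-id>              # step 2: shut down one node
  oc get nodes -w                                                  # wait for the node to drop out of the list
  oc get pods -o wide -w                                           # step 3/5: migrated pods hang in ContainerCreating
  systemctl restart atomic-openshift-master-controllers.service    # step 4: restart the controller-manager
  aws ec2 start-instances --instance-ids <instance-id>             # step 6: bring the old node back
  aws ec2 describe-volumes --volume-ids <vol-id> --query 'Volumes[].Attachments'   # volume still shows as attached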

The root cause is that the controller-manager rebuilds its "known" attached volumes from the node's status. If a node is shut down and the controller-manager is restarted right afterwards, the Attach/Detach controller cannot recover the volumes attached to the old node.
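The state the controller rebuilds lives in the node object's status.volumesAttached field, so one way to see the discrepancy is to compare it against AWS's view (a sketch; node and volume names are placeholders):

  # What the Attach/Detach controller believes is attached to the node
  oc get node <node-name> -o jsonpath='{.status.volumesAttached}'

  # What AWS actually reports for the backing EBS volume
  aws ec2 describe-volumes --volume-ids <vol-id> --query 'Volumes[].Attachments[].{Instance:InstanceId,State:State}'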

Comment 1 Hemant Kumar 2017-11-07 16:56:52 UTC
We are discussing a couple of approaches to fix this category of problems for good, but I do not think a fix will be ready in time for 3.7.

Comment 2 Hemant Kumar 2017-12-20 01:51:22 UTC
We have a working fix for this problem - https://github.com/openshift/origin/pull/17544

We now handle detaches from shut-down nodes properly, and any dangling-volume errors will correct themselves.
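One way to watch that correction happen (a sketch; it assumes the controller logs to the atomic-openshift-master-controllers unit and that the affected pods emit the usual attach/mount events):

  journalctl -u atomic-openshift-master-controllers.service | grep -i detach
  oc get events --all-namespaces | grep -iE 'FailedAttachVolume|FailedMount'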

Comment 3 Hemant Kumar 2018-01-18 22:54:18 UTC
This has been fixed in 3.9.

Comment 5 Chao Yang 2018-02-02 07:08:35 UTC
Verification passed on:
oc v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-1-63.ec2.internal:443
openshift v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657

1. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
2. oc volume dc/ruby-ex --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=ebsc1
3. Shut down the node the pod is scheduled on.
4. The pod is rescheduled to a new node.
5. Restart the atomic-openshift-master-controllers.service service.
6. The pod is running on the new node. (See the check below.)
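To confirm the volume actually followed the pod (a sketch; node names are placeholders):

  oc get node <old-node> -o jsonpath='{.status.volumesAttached}'   # should no longer list the EBS volume
  oc get node <new-node> -o jsonpath='{.status.volumesAttached}'   # should list the volume backing ebsc1
  oc get pods -o wide                                              # ruby-ex pod Running on the new node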

Comment 8 errata-xmlrpc 2018-03-28 14:11:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

