Bug 1510178 - Restarting the controller-manager while pod migration is in process can leave volumes permanently attached
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.7.1
Hardware/OS: Unspecified / Unspecified
Priority/Severity: unspecified / unspecified
Target Release: 3.9.0
Assigned To: Hemant Kumar
QA Contact: chaoyang
Reported: 2017-11-06 15:58 EST by Hemant Kumar
Modified: 2018-03-28 10:11 EDT
Last Closed: 2018-03-28 10:11:22 EDT
Doc Type: If docs needed, set a value
Type: Bug

External Trackers:
Red Hat Product Errata RHBA-2018:0489 (last updated 2018-03-28 10:11 EDT)

Description Hemant Kumar 2017-11-06 15:58:42 EST
Affects all OpenShift versions.

If a user shuts off an AWS node, the node gets removed from the node list and pods managed by a replication controller etc. are evicted. But if the controller-manager is restarted while this eviction is in progress (or the active controller-manager switches over in an HA environment), volumes attached to the shut-down node will never be detached, even when the node comes back online.

Steps to reproduce:
1. Create a multi-node cluster and schedule a bunch of Deployments to different nodes of the cluster.
2. Shut down one of the nodes of the cluster. Wait a while for the node to be removed from the node list (repeatedly run oc get nodes).
3. Once the node is removed, you will notice that pods are being migrated to healthy nodes, but pods with volumes will not start correctly because the volumes are still attached to the old (now switched-off) node.
4. Right at this time, restart the controller-manager.
5. Observe that the pods being migrated are stuck forever in ContainerCreating.
6. Bring the old node back. The volumes remain attached to it.

The root cause of this problem is that the controller-manager rebuilds its set of "known" attached volumes from each node's status. If a node is shut down (and removed from the node list) and the controller-manager is restarted right afterwards, the Attach/Detach controller cannot recover the volumes that were attached to the old node.
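For illustration only (a simplified sketch, not the actual controller code): rebuilding attachment state purely from node status loses any volume whose node object has already been deleted. The client-go usage and the flat map below are assumptions made for the example.

// Sketch: rebuild "known attached" volumes from node status after a restart.
// A node that was shut down and already removed from the node list never
// appears here, so its attachments are forgotten and never detached.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Only nodes that still exist are listed; a deleted node is absent.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Volume name -> node name, reconstructed solely from node status.
	attached := map[string]string{}
	for _, node := range nodes.Items {
		for _, va := range node.Status.VolumesAttached {
			attached[string(va.Name)] = node.Name
		}
	}
	fmt.Printf("recovered %d attachments\n", len(attached))
}

Any volume that was attached to a node no longer in the list never re-enters the rebuilt state, so no detach is ever issued for it.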
Comment 1 Hemant Kumar 2017-11-07 11:56:52 EST
We are discussing a couple of approaches to fix this category of problems for good, but I do not think a fix will be ready in time for 3.7.
Comment 2 Hemant Kumar 2017-12-19 20:51:22 EST
We have a working fix for this problem - https://github.com/openshift/origin/pull/17544

We now handle detaches from shut-down nodes properly, and any dangling-volume errors will auto-correct themselves.
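For illustration only (this is not the code from the linked PR), a minimal sketch of the general "dangling volume" idea: ask the cloud provider where the volume is actually attached, and if it is attached to an instance the cluster no longer tracks, request a detach. The aws-sdk-go calls are real, but the knownInstances map, the placeholder volume ID, and the error handling are assumptions made for the example.

// detachIfDangling: if an EBS volume is still attached to an instance the
// cluster no longer knows about, request a detach so it can be reattached
// to the node hosting the rescheduled pod. Illustrative sketch only.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func detachIfDangling(svc *ec2.EC2, volumeID string, knownInstances map[string]bool) error {
	out, err := svc.DescribeVolumes(&ec2.DescribeVolumesInput{
		VolumeIds: []*string{aws.String(volumeID)},
	})
	if err != nil || len(out.Volumes) == 0 {
		return err
	}
	for _, att := range out.Volumes[0].Attachments {
		instance := aws.StringValue(att.InstanceId)
		if aws.StringValue(att.State) == "attached" && !knownInstances[instance] {
			// Attached to an instance we no longer track: detach it.
			if _, err := svc.DetachVolume(&ec2.DetachVolumeInput{
				VolumeId:   aws.String(volumeID),
				InstanceId: att.InstanceId,
			}); err != nil {
				return err
			}
			fmt.Printf("requested detach of %s from %s\n", volumeID, instance)
		}
	}
	return nil
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	_ = detachIfDangling(svc, "vol-0123456789abcdef0", map[string]bool{})
}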
Comment 3 Hemant Kumar 2018-01-18 17:54:18 EST
This has been fixed in 3.9.
Comment 5 chaoyang 2018-02-02 02:08:35 EST
Verification passed on
oc v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-1-63.ec2.internal:443
openshift v3.9.0-0.34.0
kubernetes v1.9.1+a0ce1bc657

1. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git
2. oc volume dc/ruby-ex --add --type=persistentVolumeClaim --mount-path=/opt1 --name=v1 --claim-name=ebsc1
3. Shut down the node the pod is scheduled on.
4. The pod is rescheduled to a new node.
5. Restart atomic-openshift-master-controllers.service.
6. The pod is running on the new node (a scripted version of this check is sketched below).
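A minimal sketch of checking programmatically that the rescheduled pod reached Running instead of sticking in ContainerCreating. Assumptions for the example: a kubeconfig at the default location, the project/namespace "myproject", and the app=ruby-ex label that oc new-app typically applies.

// Sketch: report phase and node for the rescheduled ruby-ex pods.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods("myproject").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=ruby-ex"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s phase=%s node=%s\n", p.Name, p.Status.Phase, p.Spec.NodeName)
		if p.Status.Phase != corev1.PodRunning {
			fmt.Println("  not Running yet (e.g. stuck in ContainerCreating)")
		}
	}
}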
Comment 8 errata-xmlrpc 2018-03-28 10:11:22 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
