Bug 1318948 - Master container not running after docker restart
Summary: Master container not running after docker restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Scott Dodson
QA Contact: Ma xiaoqiang
URL:
Whiteboard:
Depends On: 1322728
Blocks:
 
Reported: 2016-03-18 08:55 UTC by Gaoyun Pei
Modified: 2016-07-04 00:46 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 16:38:52 UTC
Target Upstream Version:


Attachments: none


Links:
Red Hat Product Errata RHBA-2016:1065 (normal, SHIPPED_LIVE): Red Hat OpenShift Enterprise atomic-openshift-utils bug fix update. Last updated 2016-05-12 20:32:56 UTC

Description Gaoyun Pei 2016-03-18 08:55:52 UTC
Description of problem:
During a containerized installation, the master container did not come back up after the docker service was restarted while the node was being installed on the same host.

Ansible fails at this step:
TASK: [openshift_node | Wait for master API to become available before proceeding] *** 
failed: [openshift-x.x.com] => {"attempts": 120, "changed": false, "cmd": ["curl", "--silent", "--cacert", "/etc/origin/node/ca.crt", "https://openshift-x.x.com:8443/healthz/ready"], "delta": "0:00:00.131259", "end": "2016-03-18 03:47:16.844996", "failed": true, "rc": 7, "start": "2016-03-18 03:47:16.713737", "stdout_lines": [], "warnings": ["Consider using get_url module rather than running curl"]}
msg: Task failed as maximum retries was encountered
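The failing task above is just a curl loop against the master's readiness endpoint. The probe can be re-run by hand to confirm the master really is down; a minimal sketch (the URL and CA path are the placeholders from the log above, and `check_master_ready` is a hypothetical helper name):

```shell
# Re-run the readiness probe ansible performs. curl exit code 7 means
# "failed to connect to host", i.e. nothing is listening on 8443.
check_master_ready() {
  curl --silent --max-time 5 --cacert /etc/origin/node/ca.crt \
    "https://openshift-x.x.com:8443/healthz/ready" >/dev/null 2>&1
}

if check_master_ready; then
  echo "master API ready"
else
  # $? here still holds curl's failing exit status
  echo "master API not ready (curl exit $?)"
fi
```

On an affected host this reports not ready for the full 120 retries, because the master container was never restarted after docker came back.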


Version-Release number of selected component (if applicable):
openshift-ansible-3.0.61-1.git.0.8150c45.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install a containerized environment using the openshift-ansible playbook


Actual results:
The master container is running at first. When Ansible starts installing the node, it changes the docker options and restarts docker on the host, after which the master container never comes back up.

Expected results:
The master container should come back up after the docker restart, allowing Ansible to finish the installation.

Additional info:

Comment 1 Jason DeTiberus 2016-03-18 15:49:19 UTC
I've run the master branch against multiple containerized environments and the issue does not reproduce on my local test system, so there must be an edge case involved, either in the way the handlers are processed or in the systemd unit ordering/dependencies.

Comment 2 Gaoyun Pei 2016-03-21 10:41:24 UTC
This issue is still reproducible with the latest openshift-ansible.

When installing an all-in-one env (the master host is also a node), the master container did not come back up after the docker restart during node configuration.

Comment 5 Scott Dodson 2016-03-21 13:26:54 UTC
I was working on a fix for this on Friday but I'm still having some problems with it. I should have this fixed in openshift-ansible by the end of the day, RDU hours. Sorry for the delay.

Comment 6 Scott Dodson 2016-03-24 20:39:49 UTC
I've spent quite a bit of time trying to figure this one out. The root of the problem is that when we restart docker, the services that depend on docker aren't restarted in all cases. With the recent refactoring of the docker role, docker now restarts after the master has started, and at that point the master doesn't restart even though its systemd unit has Restart=always.

I've tried reverting commit 8f7b31051dae0cdb853ca2f7fb68c31a40ae2967, which got me further, but it wasn't a complete solution.

Docker has always been restarted during containerized installs, even when starting the node service for the first time, so I'm baffled as to why it has suddenly stopped restarting the containers that depend on it. For instance, when starting up for the first time, the node must re-configure docker's bridge after configuring it for the SDN. The node container does this by notifying systemd via dbus.

I believe we should ensure that docker is fully configured before we start deploying the containers, but I haven't had a chance to figure out what needs to be done after the role refactoring.
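The usual systemd-level answer to "a container unit doesn't restart when docker restarts" is to tie the unit to docker's lifecycle via unit dependencies. A minimal sketch of such a drop-in, assuming the unit name `atomic-openshift-master.service` and not necessarily reflecting what the eventual fix does:

```ini
# Hypothetical drop-in: /etc/systemd/system/atomic-openshift-master.service.d/docker-dep.conf
[Unit]
# Order this unit after docker at boot, and fail it if docker is absent.
Requires=docker.service
After=docker.service
# PartOf= propagates docker's stop/restart to this unit, so
# "systemctl restart docker" also restarts the master container unit.
PartOf=docker.service
```

Note that Restart=always in [Service] only covers the container process dying on its own; without something like PartOf=, systemd has no reason to restart the unit when docker is restarted cleanly.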

Comment 7 Seth Jennings 2016-03-28 14:59:02 UTC
Found an upstream bug against systemd that may be relevant here (unfortunately, there has been no action on it for almost a year):

https://bugs.freedesktop.org/show_bug.cgi?id=89087

Comment 8 Seth Jennings 2016-03-28 16:56:57 UTC
Proposed fix:
https://github.com/openshift/openshift-ansible/pull/1671
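On the Ansible side, the general pattern for this class of problem is to chain handlers so that restarting docker also notifies restarts of the services that run on top of it. A rough sketch of that pattern (handler and service names are illustrative; this is not a summary of the actual contents of PR 1671):

```yaml
# Hypothetical handlers file: restarting docker cascades to dependent services.
# Ansible handlers may notify other handlers, so "restart docker" triggers
# "restart master" whenever it actually runs.
- name: restart docker
  service:
    name: docker
    state: restarted
  notify: restart master

- name: restart master
  service:
    name: atomic-openshift-master
    state: restarted
```

This keeps the restart ordering in the playbook itself rather than relying on systemd unit dependencies alone.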

Comment 9 Gaoyun Pei 2016-03-29 05:26:20 UTC
Tried with https://github.com/sjenning/openshift-ansible -b systemd-unit-fixes; the containerized installation finished successfully and the master container started again after the docker restart.

Comment 10 Scott Dodson 2016-03-29 13:52:50 UTC
Fix has been merged.

Comment 11 Troy Dawson 2016-03-30 18:56:10 UTC
Should be in openshift-ansible-3.0.67-1.git.0.d4d0e1d.el7 which is in the latest aos puddles.

Comment 12 Gaoyun Pei 2016-04-01 01:56:19 UTC
Verified this bug with openshift-ansible-3.0.68-1.git.0.d80d45e.el7.noarch.

The master container comes back up after the docker restart during node configuration, and no errors were encountered during the installation.

Comment 14 errata-xmlrpc 2016-05-12 16:38:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1065

