1451192 – atomic-openshift-node service entered failed state after restarting container-engine in containerized environment

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Giuseppe Scrivano
QA Contact:	Gan Huang
Docs Contact:
URL:
Whiteboard:
Depends On:	1450307
Blocks:

Reported:	2017-05-16 05:53 UTC by Gan Huang
Modified:	2017-08-16 19:51 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-10 05:24:06 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2017:1716	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.6 RPM Release Advisory	2017-08-10 09:02:50 UTC

Description Gan Huang 2017-05-16 05:53:41 UTC

Description of problem:
The atomic-openshift-node service entered the failed state after `container-engine` was restarted in a containerized environment.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.68-1.git.0.9cbe2b7.el7.noarch.rpm

How reproducible:
always

Steps to Reproduce:
1. Run a containerized installation using the docker system container:
# cat inventory_hosts
<--snip-->
openshift_docker_use_system_container=true
openshift_docker_systemcontainer_image_registry_override=brew-xxx.redhat.com:8888/rhel7
containerized=true
<--snip-->

2. Restart `container-engine` after the installation succeeds.


Actual results:
`container-engine` restarted successfully, but the atomic-openshift-node service entered the failed state.

# journalctl -u atomic-openshift-node
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: Stopping atomic-openshift-node.service...
May 16 01:37:31 openshift-147.lab.sjc.redhat.com atomic-openshift-node[21144]: I0516 01:37:31.613392   21206 docker_server.go:87] Stop docker server
May 16 01:37:31 openshift-147.lab.sjc.redhat.com atomic-openshift-node[22447]: atomic-openshift-node
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: atomic-openshift-node.service failed.
May 16 01:37:37 openshift-147.lab.sjc.redhat.com systemd[1]: Dependency failed for atomic-openshift-node.service.
May 16 01:37:37 openshift-147.lab.sjc.redhat.com systemd[1]: Job atomic-openshift-node.service/start failed with result 'dependency'.


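Expected results:
The atomic-openshift-node service returns to the active state after `container-engine` is restarted.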

Additional info:
# cat /etc/systemd/system/atomic-openshift-node.service 
[Unit]
After=atomic-openshift-master.service
After=container-engine.service
After=openvswitch.service
PartOf=container-engine.service
Requires=container-engine.service
Requires=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
Wants=atomic-openshift-master.service
Requires=atomic-openshift-node-dep.service
After=atomic-openshift-node-dep.service

[Service]
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
EnvironmentFile=/etc/sysconfig/atomic-openshift-node-dep
ExecStartPre=-/usr/bin/docker rm -f atomic-openshift-node
ExecStart=/usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=${CONFIG_FILE} -e OPTIONS=${OPTIONS} -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/origin/openvswitch:/etc/openvswitch -v /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS openshift3/node:${IMAGE_VERSION}
ExecStartPost=/usr/bin/sleep 10
ExecStop=/usr/bin/docker stop atomic-openshift-node
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s

[Install]
WantedBy=container-engine.service

Comment 1 Giuseppe Scrivano 2017-05-16 14:54:42 UTC

atomic-openshift-node has to wait for other services (openvswitch, atomic-openshift-master, ovsdb-server, ovs-vswitchd...) to start before it can start itself, so it takes some time. After a while it starts correctly for me.

Can you verify that?

I am going to change the container-engine container to use systemd-notify so that it tells systemd exactly when it is ready, although that won't change the fact that atomic-openshift-node needs some time to become ready after container-engine is restarted.
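
A minimal sketch of the readiness notification idea mentioned above, assuming the unit is declared with `Type=notify` and `NotifyAccess=all`, and assuming readiness can be judged by a socket path appearing. The function names and the `/var/run/docker.sock` path are illustrative assumptions, not taken from the actual container-engine change.

```shell
#!/bin/sh
# Hypothetical helpers; names and paths are assumptions for illustration.

# wait_for_path PATH TIMEOUT_SECONDS
# Polls once per second until PATH exists; returns 1 if the timeout expires.
wait_for_path() {
    path=$1
    remaining=$2
    while [ ! -e "$path" ]; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# notify_when_ready PATH TIMEOUT_SECONDS
# Tells systemd the service is ready only once PATH has appeared.
notify_when_ready() {
    wait_for_path "$1" "$2" && systemd-notify --ready
}

# Usage inside the service's start script (socket path is an assumption):
#   notify_when_ready /var/run/docker.sock 60
```

With a notification like this, units ordered `After=container-engine.service` are only started once the engine reports that it is ready, rather than as soon as its process has been spawned.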

Comment 2 Gan Huang 2017-05-17 06:26:19 UTC

atomic-openshift-node never becomes active again after container-engine is restarted.

In my testing, openvswitch became active in about 18 seconds, atomic-openshift-master took 28 seconds, and atomic-openshift-node never became active.

It looks like this is the issue described here: https://github.com/coreos/bugs/issues/1395#issuecomment-224741608

After modifying /etc/systemd/system/atomic-openshift-node.service:

-Requires=openvswitch.service
+Wants=openvswitch.service

atomic-openshift-node became active automatically in 39 seconds.

Hopefully this is useful for you.
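
The manual edit described in this comment could be applied roughly as follows. This is a sketch that assumes GNU sed (for `-i`) and that the `Requires=openvswitch.service` line appears exactly as in the unit file quoted in the description; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; the name is an assumption for illustration.

# swap_requires_for_wants UNIT_FILE
# Rewrites the hard dependency on openvswitch into a soft one, in place.
# Other Requires= lines (e.g. container-engine.service) are left untouched.
swap_requires_for_wants() {
    unit_file=$1
    sed -i 's/^Requires=openvswitch\.service$/Wants=openvswitch.service/' "$unit_file"
}

# Usage on an affected host (as root):
#   swap_requires_for_wants /etc/systemd/system/atomic-openshift-node.service
#   systemctl daemon-reload
```

The unit file is edited directly rather than through a drop-in because a drop-in can only add dependencies; it cannot remove the existing `Requires=` entry.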

Comment 3 Giuseppe Scrivano 2017-05-17 08:38:02 UTC

Thanks for investigating it.

I've opened a PR to add that patch:

https://github.com/openshift/openshift-ansible/pull/4213

I've tested it locally and it still works for me (the node container restarts after some time).

Comment 5 Gan Huang 2017-06-12 07:11:56 UTC

Verified with openshift-ansible-3.6.98-1.git.0.e651d65.el7.noarch.rpm

The atomic-openshift-node service became active after a while when the container-engine service was restarted.
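
One way to repeat this check is to restart container-engine and poll the node unit until it reports active. The sketch below is not the procedure QA actually used; the function name and the timeout value are assumptions.

```shell
#!/bin/sh
# Hypothetical helper; the name is an assumption for illustration.

# wait_until_active UNIT TIMEOUT_SECONDS
# Polls `systemctl is-active` once per second. Prints the number of seconds
# waited when the unit becomes active; returns 1 if the timeout expires.
wait_until_active() {
    unit=$1
    timeout=$2
    elapsed=0
    until systemctl is-active --quiet "$unit"; do
        [ "$elapsed" -lt "$timeout" ] || return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$elapsed"
}

# Usage on an affected host (as root); the 120-second limit is arbitrary:
#   systemctl restart container-engine
#   wait_until_active atomic-openshift-node 120
```

A non-zero exit status from `wait_until_active` corresponds to the original failure, where the unit stayed in the failed state and never came back.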

Comment 7 errata-xmlrpc 2017-08-10 05:24:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
