Description of problem:
The atomic-openshift-node service entered a failed state after restarting `container-engine` in a containerized environment.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.68-1.git.0.9cbe2b7.el7.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Trigger a containerized installation with the docker system container.
# cat inventory_hosts
<--snip-->
openshift_docker_use_system_container=true
openshift_docker_systemcontainer_image_registry_override=brew-xxx.redhat.com:8888/rhel7
containerized=true
<--snip-->
2. Restart `container-engine` after the installation succeeds.

Actual results:
`container-engine` restarted successfully, but the atomic-openshift-node service entered a failed state.
# journalctl -u atomic-openshift-node
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: Stopping atomic-openshift-node.service...
May 16 01:37:31 openshift-147.lab.sjc.redhat.com atomic-openshift-node[21144]: I0516 01:37:31.613392 21206 docker_server.go:87] Stop docker server
May 16 01:37:31 openshift-147.lab.sjc.redhat.com atomic-openshift-node[22447]: atomic-openshift-node
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state.
May 16 01:37:31 openshift-147.lab.sjc.redhat.com systemd[1]: atomic-openshift-node.service failed.
May 16 01:37:37 openshift-147.lab.sjc.redhat.com systemd[1]: Dependency failed for atomic-openshift-node.service.
May 16 01:37:37 openshift-147.lab.sjc.redhat.com systemd[1]: Job atomic-openshift-node.service/start failed with result 'dependency'.
Expected results:

Additional info:
# cat /etc/systemd/system/atomic-openshift-node.service
[Unit]
After=atomic-openshift-master.service
After=container-engine.service
After=openvswitch.service
PartOf=container-engine.service
Requires=container-engine.service
Requires=openvswitch.service
After=ovsdb-server.service
After=ovs-vswitchd.service
Wants=atomic-openshift-master.service
Requires=atomic-openshift-node-dep.service
After=atomic-openshift-node-dep.service

[Service]
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
EnvironmentFile=/etc/sysconfig/atomic-openshift-node-dep
ExecStartPre=-/usr/bin/docker rm -f atomic-openshift-node
ExecStart=/usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host --env-file=/etc/sysconfig/atomic-openshift-node -v /:/rootfs:ro,rslave -e CONFIG_FILE=${CONFIG_FILE} -e OPTIONS=${OPTIONS} -e HOST=/rootfs -e HOST_ETC=/host-etc -v /var/lib/origin:/var/lib/origin:rslave -v /etc/origin/node:/etc/origin/node -v /etc/localtime:/etc/localtime:ro -v /etc/machine-id:/etc/machine-id:ro -v /run:/run -v /sys:/sys:rw -v /sys/fs/cgroup:/sys/fs/cgroup:rw -v /usr/bin/docker:/usr/bin/docker:ro -v /var/lib/docker:/var/lib/docker -v /lib/modules:/lib/modules -v /etc/origin/openvswitch:/etc/openvswitch -v /etc/origin/sdn:/etc/openshift-sdn -v /var/lib/cni:/var/lib/cni -v /etc/systemd/system:/host-etc/systemd/system -v /var/log:/var/log -v /dev:/dev $DOCKER_ADDTL_BIND_MOUNTS openshift3/node:${IMAGE_VERSION}
ExecStartPost=/usr/bin/sleep 10
ExecStop=/usr/bin/docker stop atomic-openshift-node
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s

[Install]
WantedBy=container-engine.service
atomic-openshift-node has to wait for its dependencies (openvswitch, atomic-openshift-master, ovsdb-server, ovs-vswitchd...) to start before it can start itself, so it takes some time. After a while it comes up correctly for me. Can you verify that?

I am going to change the container-engine container to use systemd-notify so that it notifies systemd exactly when it is ready, although that won't change the fact that atomic-openshift-node needs some time to become ready after container-engine is restarted.
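One way to check how long the node service takes to come back is to poll its state after the restart. The following is only a rough sketch: `wait_until` is a hypothetical helper name, and the one-second poll interval is an arbitrary choice.

```shell
#!/bin/sh
# wait_until is a hypothetical helper (not part of OpenShift or systemd).
# It runs the given command once per second until it succeeds or the
# timeout in seconds has elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout="$1"
    shift
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

For example, after `systemctl restart container-engine`, running `wait_until 120 systemctl is-active --quiet atomic-openshift-node` should return 0 once the unit reports active, or 1 if it is still not active after roughly two minutes. The 120-second limit is an assumption, not a value taken from this report.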
atomic-openshift-node never becomes active again after container-engine is restarted. In my testing, openvswitch became active in about 18 seconds and atomic-openshift-master needed 28 seconds, but atomic-openshift-node never became active.

This looks like the issue described here:
https://github.com/coreos/bugs/issues/1395#issuecomment-224741608

After modifying /etc/systemd/system/atomic-openshift-node.service:
-Requires=openvswitch.service
+Wants=openvswitch.service
atomic-openshift-node became active automatically in 39 seconds. Hopefully this is useful for you.
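For anyone who wants to try the same change on a test host, the edit can be scripted roughly as below. This is a sketch under assumptions: `relax_ovs_dependency` is a hypothetical helper name, and a hand-edited unit file may be overwritten the next time the installer runs. As far as I understand the systemd documentation, `Requires=` makes the start job fail when the required unit fails to start, while `Wants=` is the weaker form that does not, which would be consistent with the "Dependency failed" message in the journal.

```shell
#!/bin/sh
# relax_ovs_dependency is a hypothetical helper (not part of openshift-ansible).
# It rewrites the exact line "Requires=openvswitch.service" to
# "Wants=openvswitch.service" in the given unit file and leaves every other
# line, including the other Requires= entries, unchanged.
relax_ovs_dependency() {
    unit_file="$1"
    tmp_file="$(mktemp)" || return 1
    if ! sed 's/^Requires=openvswitch\.service$/Wants=openvswitch.service/' \
            "$unit_file" > "$tmp_file"; then
        rm -f "$tmp_file"
        return 1
    fi
    # Write back through the original path so its ownership and mode are kept.
    cat "$tmp_file" > "$unit_file"
    status=$?
    rm -f "$tmp_file"
    return "$status"
}
```

Usage would be something like `relax_ovs_dependency /etc/systemd/system/atomic-openshift-node.service` followed by `systemctl daemon-reload` so that systemd picks up the modified unit.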
Thanks for investigating it. I've opened a PR to add that patch: https://github.com/openshift/openshift-ansible/pull/4213 I've tested it locally and it still works for me (the node container restarts after some time).
Verified with openshift-ansible-3.6.98-1.git.0.e651d65.el7.noarch.rpm.

The atomic-openshift-node service became active after a while when the container-engine service was restarted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716