Description of problem:
After a node reboot, the node service does not start automatically.

openshift_ansible_vars:
  containerized: true
  openshift_use_system_containers: true
  system_images_registry: registry.ops.openshift.com
  openshift_docker_use_system_container: true
  openshift_docker_systemcontainer_image_override: registry.ops.openshift.com/openshift3/container-engine:v3.7
  openshift_use_crio: true
  openshift_crio_systemcontainer_image_override: registry.ops.openshift.com/openshift3/cri-o:v3.7

Version-Release number of the following components:
openshift v3.7.0-0.184.0 and openshift-ansible-3.7.0-0.184.0

How reproducible:
Always

Steps to Reproduce:
1. Install the OCP cluster using the ansible vars listed above
2. Reboot one of the nodes

Actual results:
The atomic-openshift-node service is not started.

Expected results:
The atomic-openshift-node service is started.

Additional info:
Docker is hardcoded as the `WantedBy` dependency in the node unit file. The node can be started correctly after changing `WantedBy=docker.service` to `WantedBy=container-engine.service`.

# systemctl cat atomic-openshift-node
# /etc/systemd/system/atomic-openshift-node.service
[Unit]
After=container-engine.service
After=openvswitch.service
Wants=container-engine.service
After=atomic-openshift-node-dep.service
After=atomic-openshift.service
Requires=dnsmasq.service
After=dnsmasq.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
EnvironmentFile=/etc/sysconfig/atomic-openshift-node-dep
ExecStartPre=/bin/bash -c 'export -p > /run/atomic-openshift-node-env'
ExecStart=/bin/runc --systemd-cgroup run 'atomic-openshift-node'
ExecStop=/bin/runc --systemd-cgroup kill 'atomic-openshift-node'
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
WorkingDirectory=/var/lib/containers/atomic/atomic-openshift-node.0
RuntimeDirectory=atomic-openshift-node

[Install]
WantedBy=docker.service
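For anyone who wants to script the manual workaround described under "Additional info", here is a minimal sketch in Python. It only rewrites the `WantedBy=` entry in the `[Install]` section of a unit file's text. The function name `retarget_wanted_by` and the sample unit text are hypothetical and not part of openshift-ansible or atomic; this is not the fix that was eventually shipped.

```python
def retarget_wanted_by(unit_text, old="docker.service", new="container-engine.service"):
    """Return unit_text with `old` replaced by `new` in [Install] WantedBy= lines.

    Hypothetical helper for illustration only. Lines outside [Install]
    (for example Wants= or After= in [Unit]) are left untouched.
    """
    section = None
    out = []
    for line in unit_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track which section we are in, e.g. "Unit", "Service", "Install".
            section = stripped[1:-1]
        elif section == "Install" and "=" in stripped:
            key, _, value = stripped.partition("=")
            if key.strip() == "WantedBy":
                # WantedBy= may carry several space-separated units.
                targets = [new if t == old else t for t in value.split()]
                line = "WantedBy=" + " ".join(targets)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Assumed sample input, shortened from the unit file in this report.
    sample = (
        "[Unit]\n"
        "Wants=container-engine.service\n"
        "\n"
        "[Install]\n"
        "WantedBy=docker.service\n"
    )
    print(retarget_wanted_by(sample))
```

Note that systemd only reads the `[Install]` section when a unit is enabled, so after writing the changed file back one would still need `systemctl daemon-reload` and to re-enable the unit (for example with `systemctl reenable atomic-openshift-node`) before the new `.wants` symlink exists.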
I'm not sure how this bug is possible. This was solved quite some time ago: https://github.com/openshift/openshift-ansible/pull/4131
This problem still exists in OCP v3.9.0-0.23.0.
I have figured out what the root of this issue is.

1) We install the origin-node.service / atomic-openshift-node.service unit file before installing the system container.
2) The system container for the node is installed, which causes atomic to overwrite the service unit we created.
3) The service starts correctly after installing the system container, but since the 'docker' target is not available on crio, the unit is never linked to a 'target.wants'.

We need to consult with the syscontainer team to figure out the best way forward. Should we not install the node as a system container, so that we can create our own unit file? Even if we fix the ordering in ansible, it is likely that if the user were to update the container, the unit file would be overwritten again. Perhaps there is a way to instruct atomic not to create the unit file?
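To make step 3 concrete, here is a rough sketch in Python of a check that flags `WantedBy=` entries pointing at units that are not present on the host. The function names `install_targets` and `dangling_install_targets` are hypothetical, and the set of available units is passed in by the caller. How that set is gathered (for example by parsing `systemctl list-unit-files` output) is an assumption left out of the sketch.

```python
def install_targets(unit_text):
    """Return the units listed in WantedBy= lines of the [Install] section."""
    section = None
    targets = []
    for line in unit_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped[1:-1]
        elif section == "Install" and "=" in stripped:
            key, _, value = stripped.partition("=")
            if key.strip() == "WantedBy":
                targets.extend(value.split())
    return targets


def dangling_install_targets(unit_text, available_units):
    """Return WantedBy= units that are missing from available_units.

    available_units is a set of unit names assumed to exist on the host;
    collecting it is outside the scope of this sketch.
    """
    return [t for t in install_targets(unit_text) if t not in available_units]


if __name__ == "__main__":
    # Assumed sample: the [Install] section from the unit in this report,
    # checked against a host that has container-engine but not docker.
    sample = "[Install]\nWantedBy=docker.service\n"
    host_units = {"container-engine.service", "multi-user.target"}
    print(dangling_install_targets(sample, host_units))
```

A check like this could run after the system container is installed or updated, since that is the point where atomic rewrites the unit file.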
Looks like we can indeed template the service unit files. I have submitted a PR against origin here: https://github.com/openshift/origin/pull/18314
Should be fixed in v3.9.0-0.38.0
Verified in v3.9.0-0.38.0.