Description of problem: Installation sometimes (10%) failed at "TASK [openshift_node : Start and enable node]" Version-Release number of the following components: openshift-ansible-3.6.161-1 # openshift version openshift v3.6.153 kubernetes v1.6.1+5115d708d7 etcd 3.2.1 docker-1.12.6-48.git0fdc778.el7.x86_64 RHEL-7.4 # uname -r 3.10.0-693.el7.x86_64 How reproducible: almost one fail in 10 attempts with same configurations Steps to Reproduce: 1. Trigger installation via openshift-ansible Actual results: TASK [openshift_node : Start and enable node] ********************************** Friday 21 July 2017 05:00:12 +0000 (0:00:00.109) 0:08:52.795 *********** FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left). FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left). fatal: [host-8-241-28.host.centralci.eng.rdu2.redhat.com]: FAILED! => { "attempts": 1, "changed": false, "failed": true } MSG: Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details. ...ignoring fatal: [host-8-241-40.host.centralci.eng.rdu2.redhat.com]: FAILED! => { "attempts": 1, "changed": false, "failed": true } MSG: Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details. ...ignoring Expected results: No errors Additional info: The node logs on the master-node and app-node indicate 2 different issues, and the node should able to back active after restart atomic-openshift-master service. On the master node: # journalctl -u atomic-openshift-node Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.795384 51700 start_node.go:251] Reading node configuration from /etc/origin/node/node-config.yaml Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.827406 51700 node.go:123] Initializing SDN node of type "redhat/openshift-ovs-subnet" with configured hostname "host-8-241-28.host.centralci.eng.rdu2.redhat.com" (IP ""), iptables sync period "30s" Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.827852 51700 ipcmd.go:44] Executing: /usr/sbin/ip link set lbr0 down Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.831722 51700 ipcmd.go:48] Error executing /usr/sbin/ip: Cannot find device "lbr0" Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.831835 51700 docker.go:364] Connecting to docker on unix:///var/run/docker.sock Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: I0721 01:35:39.831848 51700 docker.go:384] Start docker client with request timeout=2m0s Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: W0721 01:35:39.833592 51700 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[51700]: F0721 01:35:39.841989 51700 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: Failed to start OpenShift Node. Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state. Jul 21 01:35:39 host-8-241-28.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service failed. On the app node: # journalctl -u atomic-openshift-node Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: W0721 01:36:19.644933 33431 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.645039 33431 kubelet.go:2069] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: E0721 01:36:19.645060 33431 kubelet.go:2072] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.645156 33431 iptables.go:394] running iptables -C [KUBE-FIREWALL -t filter -m comment --comment kubernetes firewall for dropping marked packets -m mark --mark 0x00008000/0x00008000 -j DROP] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.648548 33431 iptables.go:394] running iptables -C [OUTPUT -t filter -j KUBE-FIREWALL] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.650726 33431 iptables.go:394] running iptables -C [INPUT -t filter -j KUBE-FIREWALL] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.652538 33431 iptables.go:394] running iptables -N [KUBE-MARK-MASQ -t nat] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.653987 33431 iptables.go:394] running iptables -N [KUBE-POSTROUTING -t nat] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.655618 33431 iptables.go:394] running iptables -C [KUBE-MARK-MASQ -t nat -j MARK --set-xmark 0x00004000/0x00004000] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.657336 33431 iptables.go:394] running iptables -C [POSTROUTING -t nat -m comment --comment kubernetes postrouting rules -j KUBE-POSTROUTING] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.659052 33431 iptables.go:394] running iptables -C [KUBE-POSTROUTING -t nat -m comment --comment kubernetes service traffic requiring SNAT -m mark --mark 0x00004000/0x00004000 -j MASQUERADE] Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: E0721 01:36:19.661405 33431 factory.go:330] devicemapper filesystem stats will not be reported: usage of thin_ls is disabled to preserve iops Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.661418 33431 factory.go:342] Registering Docker factory Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: W0721 01:36:19.661428 33431 manager.go:247] Registration of the rkt container factory failed: unable to communicate with Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp [::1]:15441: getsockopt: connection refused Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.661432 33431 factory.go:54] Registering systemd factory Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: I0721 01:36:19.661530 33431 factory.go:86] Registering Raw factory Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[33431]: F0721 01:36:19.661621 33431 kubelet.go:1235] Failed to start cAdvisor inotify_init: too many open files Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: Failed to start OpenShift Node. Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: Unit atomic-openshift-node.service entered failed state. Jul 21 01:36:19 host-8-241-40.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service failed.
The upstream commit that exposed this is being reverted: https://github.com/openshift/openshift-ansible/pull/4863 (master) https://github.com/openshift/openshift-ansible/pull/4864 (3.6) I'm going to keep this bug open though and reduce the severity and moving target to 3.7 because we should stop the exec'ed journalctl processes cleanly but that will require an upstream kube/cadvisor change.
installer change that exposed this issue has been reverted so the journalctl process is killed with the node process at unit stop time.