Description of problem:
Triggering an installation on RHEL hosts configured with a static IP causes atomic-openshift-node to fail to start. According to the logs, the issue can also be hit in a DHCP environment, but less frequently.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.169-1.git.0.440d532.el7.noarch.rpm

How reproducible:
Always in the environment with a static IP
Sometimes in the DHCP environment

Steps to Reproduce:
1. Prepare hosts with static interfaces to be installed
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
ONBOOT="yes"
TYPE="Ethernet"
NM_CONTROLLED=no
GATEWAY=172.16.120.1
NETMASK=255.255.255.0
IPADDR=172.16.120.79
PEERDNS=yes
DNS1=172.16.120.2
2. Trigger 3.6 installation

Actual results:
<--snip-->
TASK [openshift_node : Start and enable node] **********************************
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
fatal: [host-8-241-90.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
...ignoring

Expected results:
No errors

Additional info:
# journalctl -u atomic-openshift-node
<--snip-->
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827186 52953 ipcmd.go:48] Error executing /usr/sbin/ip: Cannot find device "lbr0"
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827314 52953 server.go:137] Running kubelet in containerized mode (experimental)
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827358 52953 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827369 52953 docker.go:384] Start docker client with request timeout=2m0s
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: W0725 08:14:01.829429 52953 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.837666 52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.841192 52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.843542 52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: F0725 08:14:01.844672 52953 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
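The fatal log line points at a missing /etc/origin/node/resolv.conf. A minimal sketch of a check that could be run on a node before starting the service is shown here. The function name check_node_resolv is hypothetical and not part of openshift-ansible; the default path is taken from the log line above, and nothing here explains why the file was not created.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the node's resolv.conf exists.
# The default path comes from the fatal log line in this report;
# pass a different path as the first argument to override it.
check_node_resolv() {
  local path
  path="${1:-/etc/origin/node/resolv.conf}"
  if [ -f "$path" ]; then
    echo "present: $path"
    return 0
  fi
  echo "missing: $path"
  return 1
}
```

Calling check_node_resolv with no argument on an affected host would be expected to print the "missing" line and return non-zero, matching the failure seen in the journal.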
The problem is almost certainly related to the 'NM_CONTROLLED=no' line. Could you try configuring a static IP with nmcli: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Using_the_NetworkManager_Command_Line_Tool_nmcli.html If disabling NetworkManager was ever supported I think we should update our documentation to mention the new requirement. For now I'm moving this to 3.6.1 unless we find that it still doesn't work with nmcli.
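For reference, a rough sketch of what the nmcli equivalent of the reported ifcfg-eth0 might look like. The helper below only prints the commands (a dry run) so they can be reviewed first. The function names ifcfg_to_nmcli and netmask_to_prefix are hypothetical, and the sketch assumes the NetworkManager connection is named after DEVICE, which may not hold on every host. It also assumes the 'NM_CONTROLLED=no' line has been removed so that NetworkManager manages the interface.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a dotted netmask to a prefix length.
# Note: does not verify that the mask bits are contiguous.
netmask_to_prefix() {
  local bits octet
  bits=0
  for octet in $(echo "$1" | tr '.' ' '); do
    case "$octet" in
      255) bits=$((bits + 8)) ;;
      254) bits=$((bits + 7)) ;;
      252) bits=$((bits + 6)) ;;
      248) bits=$((bits + 5)) ;;
      240) bits=$((bits + 4)) ;;
      224) bits=$((bits + 3)) ;;
      192) bits=$((bits + 2)) ;;
      128) bits=$((bits + 1)) ;;
      0) ;;
      *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
    esac
  done
  echo "$bits"
}

# Hypothetical helper: print (do not run) nmcli commands for a static ifcfg file.
# The file is sourced as shell, so only use it on trusted files.
# The body runs in a subshell so sourced variables do not leak to the caller.
# Assumption: the connection name equals DEVICE.
ifcfg_to_nmcli() (
  . "$1" || exit 1
  prefix="$(netmask_to_prefix "$NETMASK")" || exit 1
  echo "nmcli connection modify $DEVICE ipv4.addresses $IPADDR/$prefix ipv4.gateway $GATEWAY ipv4.dns $DNS1 ipv4.method manual"
  echo "nmcli connection up $DEVICE"
)
```

Running ifcfg_to_nmcli /etc/sysconfig/network-scripts/ifcfg-eth0 prints two commands that can be inspected and then executed by hand once the connection name has been confirmed with nmcli connection show.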
Hi Brenton, Actually I don't think it's related to the configuration of the interface. In today's testing I could easily reproduce it in a DHCP environment, though not 100% of the time. With a static interface it reproduces 100% of the time. I'll spin up a new cluster in a DHCP environment (with no 'NM_CONTROLLED=no' setting) for you tomorrow, my time, if needed.
Thanks for the additional info. I'll move it back to the blocker list then.
Thanks for pointing it out. I've set up another env with DHCP.
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=eth0
HWADDR=fa:16:3e:48:80:91
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
It appears to fail about once in every 3 attempts. Looking at the logs, it seems to be the same issue as in Comment 6. Logs for NetworkManager-dispatcher are attached.
I think this may be fixed in https://github.com/openshift/openshift-ansible/pull/4890 Can you try it?
I set up a large number of environments and haven't hit it yet. Based on my past experience, I assume it has been fixed. Verified with openshift-ansible-3.6.172.0.0-1.git.0.d90ca2b.el7.noarch.rpm Thanks Scott!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716