Bug 1474707

Summary:	could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf
Product:	OpenShift Container Platform	Reporter:	Gan Huang <ghuang>
Component:	Installer	Assignee:	Scott Dodson <sdodson>
Status:	CLOSED ERRATA	QA Contact:	Johnny Liu <jialiu>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.6.0	CC:	aos-bugs, bleanhar, jokerman, mmccomas, sdodson, tkimura
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-10 05:32:16 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Gan Huang 2017-07-25 08:48:41 UTC

Description of problem:
Triggering an installation on RHEL hosts configured with a static IP causes atomic-openshift-node to fail to start.

According to the logs, the issue can also be hit in a DHCP environment, but less frequently.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.169-1.git.0.440d532.el7.noarch.rpm

How reproducible:
Always in environments with a static IP
Sometimes in DHCP environments

Steps to Reproduce:
1. Prepare the hosts to be installed with statically configured interfaces
# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
BOOTPROTO=static
ONBOOT="yes"
TYPE="Ethernet"
NM_CONTROLLED=no
GATEWAY=172.16.120.1
NETMASK=255.255.255.0
IPADDR=172.16.120.79
PEERDNS=yes
DNS1=172.16.120.2


2. Trigger 3.6 installation


Actual results:
<--snip-->
TASK [openshift_node : Start and enable node] **********************************
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
fatal: [host-8-241-90.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
...ignoring

Expected results:
No errors

Additional info:
#journalctl -u atomic-openshift-node
<--snip-->
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827186   52953 ipcmd.go:48] Error executing /usr/sbin/ip: Cannot find device "lbr0"
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827314   52953 server.go:137] Running kubelet in containerized mode (experimental)
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827358   52953 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827369   52953 docker.go:384] Start docker client with request timeout=2m0s
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: W0725 08:14:01.829429   52953 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.837666   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.841192   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.843542   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: F0725 08:14:01.844672   52953 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
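
The fatal line above shows the node exiting because /etc/origin/node/resolv.conf does not exist. As a minimal sketch (not part of the installer or of the eventual fix), a check like the following could be run on a host before starting the node. The function names are hypothetical, and treating a file with no nameserver entries as unusable is an assumption of this sketch.

```python
from pathlib import Path

# Path taken from the fatal log line in this report.
NODE_RESOLV_CONF = "/etc/origin/node/resolv.conf"


def read_nameservers(path=NODE_RESOLV_CONF):
    """Return the nameserver addresses listed in a resolv.conf-style file.

    Raises FileNotFoundError when the file is missing, which is the
    condition reported in the log.
    """
    servers = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        # Entries of interest look like "nameserver 172.16.120.2".
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def preflight(path=NODE_RESOLV_CONF):
    """Return (ok, message) describing whether the file looks usable."""
    try:
        servers = read_nameservers(path)
    except FileNotFoundError:
        return False, f"{path} is missing"
    if not servers:
        # Assumption of this sketch: an empty server list is a failure.
        return False, f"{path} has no nameserver entries"
    return True, f"{path} lists {', '.join(servers)}"
```

Calling preflight() with no arguments checks the path from the log; passing another path makes it possible to try the check against a copy of the file.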

Comment 3 Brenton Leanhardt 2017-07-25 14:53:16 UTC

The problem is almost certainly related to the 'NM_CONTROLLED=no' line.  Could you try configuring a static IP with nmcli:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Using_the_NetworkManager_Command_Line_Tool_nmcli.html

If disabling NetworkManager was ever supported, I think we should update our documentation to mention the new requirement.  For now I'm moving this to 3.6.1 unless we find that it still doesn't work with nmcli.
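
For anyone trying the suggestion above, here is a rough Python sketch that reads an ifcfg file like the one in the description, reports whether NM_CONTROLLED=no is set, and builds the argument list for an equivalent `nmcli connection modify` call. The helper names are hypothetical. The sketch assumes the connection profile is named after DEVICE and recognises only the literal value "no" for NM_CONTROLLED. It prints nothing and changes nothing on the host.

```python
import ipaddress


def parse_ifcfg(text):
    """Parse KEY=value lines from an ifcfg file, dropping comments and quotes."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def managed_by_networkmanager(settings):
    """Return False only when the file explicitly sets NM_CONTROLLED=no."""
    return settings.get("NM_CONTROLLED", "yes").lower() != "no"


def nmcli_static_args(settings, connection=None):
    """Build an argv list for `nmcli connection modify` from static values.

    Assumption: the profile name equals DEVICE unless `connection` is given.
    """
    # Convert the dotted netmask (255.255.255.0) into a prefix length (24).
    prefix = ipaddress.IPv4Network(f"0.0.0.0/{settings['NETMASK']}").prefixlen
    return [
        "nmcli", "connection", "modify", connection or settings["DEVICE"],
        "ipv4.addresses", f"{settings['IPADDR']}/{prefix}",
        "ipv4.gateway", settings["GATEWAY"],
        "ipv4.dns", settings["DNS1"],
        "ipv4.method", "manual",
    ]
```

The returned list can be inspected first and then handed to subprocess.run on a test host; the profile name should be checked against `nmcli connection show` beforehand.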

Comment 4 Gan Huang 2017-07-25 15:01:31 UTC

Hi Brenton,

Actually, I don't think it's related to the configuration of the interface. In today's testing, I could easily reproduce it in a DHCP environment, though not 100% of the time.

I found that it can be reproduced 100% of the time when using a static interface.

If needed, I'll spin up a new cluster in a DHCP environment (with no 'NM_CONTROLLED=no' setting) for you tomorrow, my time.

Comment 5 Brenton Leanhardt 2017-07-25 15:03:12 UTC

Thanks for the additional info.  I'll move it back to the blocker list then.

Comment 12 Gan Huang 2017-07-26 00:13:29 UTC

Thanks for pointing it out.

I've set up another env with DHCP.
# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=eth0
HWADDR=fa:16:3e:48:80:91
ONBOOT=yes
TYPE=Ethernet
USERCTL=no

It appears to fail roughly once in every three attempts.

Looking at the logs, it seems to be the same issue as in Comment 6.

Logs for NetworkManager-dispatcher attached

Comment 15 Scott Dodson 2017-07-27 12:06:49 UTC

I think this may be fixed in https://github.com/openshift/openshift-ansible/pull/4890

Can you try it?

Comment 16 Gan Huang 2017-07-28 10:09:13 UTC

I set up a large number of environments and haven't hit it yet.

Based on my past experience reproducing it, I assume it has been fixed.

Verified with openshift-ansible-3.6.172.0.0-1.git.0.d90ca2b.el7.noarch.rpm

Thanks Scott!
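
As a rough aid for judging a verification like this, the sketch below estimates how likely a run of clean installs would be if the bug were still present. It assumes the "roughly once in three attempts" DHCP failure rate from Comment 12 and that attempts are independent. Neither assumption is established in this report, and the function names are hypothetical.

```python
from fractions import Fraction


def chance_of_clean_runs(failure_rate, runs):
    """Probability that `runs` independent installs all succeed by luck."""
    return (1 - failure_rate) ** runs


def runs_needed(failure_rate, max_chance):
    """Smallest number of clean runs whose chance by luck is below max_chance."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    runs = 0
    while chance_of_clean_runs(failure_rate, runs) >= max_chance:
        runs += 1
    return runs


# Assumed rate: one failure in three attempts (Comment 12, DHCP environment).
ASSUMED_RATE = Fraction(1, 3)
```

Under those assumptions, chance_of_clean_runs(ASSUMED_RATE, 10) is 1024/59049, a little under 2%, and runs_needed(ASSUMED_RATE, Fraction(1, 20)) is 8. A static-IP environment, reported as failing every time, would need far fewer runs to show a difference.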

Comment 18 errata-xmlrpc 2017-08-10 05:32:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716