Bug 1474707 - could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Scott Dodson
QA Contact: Johnny Liu
Depends On:
Blocks:
Reported: 2017-07-25 04:48 EDT by Gan Huang
Modified: 2017-10-10 20:43 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 01:32:16 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Gan Huang 2017-07-25 04:48:41 EDT
Description of problem:
Triggering an installation on RHEL hosts configured with static IPs causes atomic-openshift-node to fail to start.

According to the logs, the issue can also be hit in a DHCP environment, but less frequently.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.169-1.git.0.440d532.el7.noarch.rpm

How reproducible:
Always in environments with static IPs
Sometimes in DHCP environments

Steps to Reproduce:
1. Prepare hosts with static interfaces to be installed
# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
BOOTPROTO=static
ONBOOT="yes"
TYPE="Ethernet"
NM_CONTROLLED=no
GATEWAY=172.16.120.1
NETMASK=255.255.255.0
IPADDR=172.16.120.79
PEERDNS=yes
DNS1=172.16.120.2


2. Trigger 3.6 installation


Actual results:
<--snip-->
TASK [openshift_node : Start and enable node] **********************************
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
fatal: [host-8-241-90.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
...ignoring

Expected results:
No errors

Additional info:
#journalctl -u atomic-openshift-node
<--snip-->
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827186   52953 ipcmd.go:48] Error executing /usr/sbin/ip: Cannot find device "lbr0"
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827314   52953 server.go:137] Running kubelet in containerized mode (experimental)
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827358   52953 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827369   52953 docker.go:384] Start docker client with request timeout=2m0s
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: W0725 08:14:01.829429   52953 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.837666   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.841192   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.843542   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: F0725 08:14:01.844672   52953 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
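
When triaging a journal like the one above, the line that matters is the single fatal entry (severity letter F in the glog-style header), which is easy to miss among the informational ones. A minimal sketch of pulling those entries out of `journalctl` output follows; the regular expression and the helper name `fatal_entries` are illustrative assumptions, not part of OpenShift or openshift-ansible.

```python
import re

# glog-style header as it appears in the log above:
#   <severity letter><MMDD> <HH:MM:SS.micros> <thread id> <file>:<line>] <message>
KLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)


def fatal_entries(journal_lines):
    """Return (source location, message) pairs for every F-severity entry."""
    hits = []
    for line in journal_lines:
        match = KLOG_RE.search(line)
        if match and match.group("sev") == "F":
            hits.append((match.group("src"), match.group("msg")))
    return hits


# Journal prefix and three entries copied from the report above.
PREFIX = (
    "Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com "
    "atomic-openshift-node[52883]: "
)
sample = [
    PREFIX + "W0725 08:14:01.829429   52953 cni.go:157] "
    "Unable to update cni config: No networks found in /etc/cni/net.d",
    PREFIX + "I0725 08:14:01.843542   52953 iptables.go:562] "
    "couldn't get iptables-restore version; assuming it doesn't support --wait",
    PREFIX + "F0725 08:14:01.844672   52953 start_node.go:140] "
    "could not start DNS, unable to read config file: "
    "open /etc/origin/node/resolv.conf: no such file or directory",
]

for src, msg in fatal_entries(sample):
    print(src, "->", msg)
```

Lines that do not carry a glog header, such as the systemd exit notice, are skipped rather than treated as errors.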
Comment 3 Brenton Leanhardt 2017-07-25 10:53:16 EDT
The problem is almost certainly related to the 'NM_CONTROLLED=no' line.  Could you try configuring a static IP with nmcli:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Using_the_NetworkManager_Command_Line_Tool_nmcli.html

If disabling network manager was ever supported I think we should update our documentation to mention the new requirement.  For now I'm moving this to 3.6.1 unless we find that it still doesn't work with nmcli.
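
To check a batch of hosts for the setting suspected here, the ifcfg file can be parsed as plain KEY=value pairs. The sketch below is only an illustration: the helper names `parse_ifcfg` and `managed_by_network_manager` are made up for this example, only the literal value "no" is treated as opting out, and an absent NM_CONTROLLED key is assumed to mean the interface is managed.

```python
def parse_ifcfg(text):
    """Parse KEY=value lines from an ifcfg file, skipping comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Values may be quoted (ONBOOT="yes") or bare (ONBOOT=yes).
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def managed_by_network_manager(settings):
    # Assumption: a missing NM_CONTROLLED key means the interface is managed.
    return settings.get("NM_CONTROLLED", "yes").lower() != "no"


# Static configuration from the description of this bug.
STATIC_IFCFG = """\
DEVICE=eth0
BOOTPROTO=static
ONBOOT="yes"
TYPE="Ethernet"
NM_CONTROLLED=no
GATEWAY=172.16.120.1
NETMASK=255.255.255.0
IPADDR=172.16.120.79
PEERDNS=yes
DNS1=172.16.120.2
"""

static_settings = parse_ifcfg(STATIC_IFCFG)
print(managed_by_network_manager(static_settings))
```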
Comment 4 Gan Huang 2017-07-25 11:01:31 EDT
Hi Brenton,

Actually, I don't think it's related to the configuration of the interface. In today's testing I could easily reproduce it in a DHCP environment, though not 100% of the time.

With a static interface, I found that it reproduces 100% of the time.

If needed, I'll spin up a new cluster in a DHCP environment (with no 'NM_CONTROLLED=no' setting) for you tomorrow, my time.
Comment 5 Brenton Leanhardt 2017-07-25 11:03:12 EDT
Thanks for the additional info.  I'll move it back to the blocker list then.
Comment 12 Gan Huang 2017-07-25 20:13:29 EDT
Thanks for pointing it out.

I've set up another env with DHCP.
# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=eth0
HWADDR=fa:16:3e:48:80:91
ONBOOT=yes
TYPE=Ethernet
USERCTL=no

It appears to fail roughly once in every 3 attempts.

Looking at the logs, it seems to be the same issue as in Comment 6.

Logs for NetworkManager-dispatcher are attached.
Comment 15 Scott Dodson 2017-07-27 08:06:49 EDT
I think this may be fixed in https://github.com/openshift/openshift-ansible/pull/4890

Can you try it?
Comment 16 Gan Huang 2017-07-28 06:09:13 EDT
I set up a large number of environments and haven't hit it yet.

Based on my past experience reproducing it, I assume it has been fixed.

Verified with openshift-ansible-3.6.172.0.0-1.git.0.d90ca2b.el7.noarch.rpm

Thanks Scott!
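
For an intermittent failure, a run of clean installs is only probabilistic evidence of a fix. As a rough sketch of the arithmetic: Comment 12 reported roughly one failure in three attempts on DHCP, so if the bug were still present and attempts were independent, the chance of seeing no failure shrinks geometrically with the number of environments. The count of 10 environments below is a hypothetical figure chosen for illustration; the actual number used for verification is not stated in this report.

```python
def chance_of_clean_streak(failure_rate, attempts):
    """Probability of zero failures in `attempts` independent installs."""
    return (1.0 - failure_rate) ** attempts


# Failure rate taken from Comment 12 (about 1 in 3); 10 attempts is hypothetical.
p = chance_of_clean_streak(1 / 3, 10)
print(f"{p:.4f}")
```

Under these assumptions the chance of ten clean installs with the bug still present is under 2%, which is why a large batch of clean runs is reasonable grounds for marking the bug verified.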
Comment 18 errata-xmlrpc 2017-08-10 01:32:16 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
