Bug 1474707 - could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Scott Dodson
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-25 08:48 UTC by Gan Huang
Modified: 2017-10-11 00:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-10 05:32:16 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata
ID: RHEA-2017:1716
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.6 RPM Release Advisory
Last Updated: 2017-08-10 09:02:50 UTC

Description Gan Huang 2017-07-25 08:48:41 UTC
Description of problem:
Triggering an installation on RHEL hosts configured with static IP addresses causes atomic-openshift-node to fail to start.

According to the logs, the issue can also occur in a DHCP environment, but less frequently.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.169-1.git.0.440d532.el7.noarch.rpm

How reproducible:
Always in an environment with static IPs
Sometimes in a DHCP environment

Steps to Reproduce:
1. Prepare the hosts to be installed with statically configured interfaces:
# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
BOOTPROTO=static
ONBOOT="yes"
TYPE="Ethernet"
NM_CONTROLLED=no
GATEWAY=172.16.120.1
NETMASK=255.255.255.0
IPADDR=172.16.120.79
PEERDNS=yes
DNS1=172.16.120.2


2. Trigger 3.6 installation


Actual results:
<--snip-->
TASK [openshift_node : Start and enable node] **********************************
FAILED - RETRYING: TASK: openshift_node : Start and enable node (1 retries left).
fatal: [host-8-241-90.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}
...ignoring

Expected results:
No errors

Additional info:
# journalctl -u atomic-openshift-node
<--snip-->
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827186   52953 ipcmd.go:48] Error executing /usr/sbin/ip: Cannot find device "lbr0"
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827314   52953 server.go:137] Running kubelet in containerized mode (experimental)
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827358   52953 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.827369   52953 docker.go:384] Start docker client with request timeout=2m0s
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: W0725 08:14:01.829429   52953 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.837666   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.841192   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: I0725 08:14:01.843542   52953 iptables.go:562] couldn't get iptables-restore version; assuming it doesn't support --wait
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com atomic-openshift-node[52883]: F0725 08:14:01.844672   52953 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Jul 25 08:14:01 host-8-241-90.host.centralci.eng.rdu2.redhat.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
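
The fatal line in this log says the node could not open /etc/origin/node/resolv.conf. A minimal shell sketch for checking that precondition before restarting the service might look like the following. The helper name check_node_resolv is hypothetical and not part of OpenShift or openshift-ansible; the default path is simply the one quoted in the log.

```shell
# Hypothetical helper (not shipped with OpenShift): report whether the
# resolv.conf that the node service tries to read exists and is non-empty.
check_node_resolv() {
    # Default to the path quoted in the journal output.
    path="${1:-/etc/origin/node/resolv.conf}"
    if [ -s "$path" ]; then
        echo "present: $path"
    else
        echo "missing or empty: $path"
        return 1
    fi
}
```

Calling check_node_resolv with no argument checks the path from the log. A non-zero exit status suggests the node would likely hit the same error on start.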

Comment 3 Brenton Leanhardt 2017-07-25 14:53:16 UTC
The problem is almost certainly related to the 'NM_CONTROLLED=no' line.  Could you try configuring a static IP with nmcli:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Using_the_NetworkManager_Command_Line_Tool_nmcli.html

If disabling NetworkManager was ever supported, I think we should update our documentation to mention the new requirement.  For now I'm moving this to 3.6.1 unless we find that it still doesn't work with nmcli.
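
As a rough sketch of the nmcli approach suggested here, the helper below only prints the command so it can be reviewed before it is run, since changing the network configuration over SSH can drop the session. The function name static_ip_cmd and the connection name eth0 are assumptions; the addresses in the example are the ones from the reporter's ifcfg-eth0, with the 255.255.255.0 netmask written as a /24 prefix.

```shell
# Hypothetical dry-run helper: print (do not execute) an nmcli command that
# switches a connection to a static IPv4 configuration.
#   $1 connection name, $2 address/prefix, $3 gateway, $4 DNS server
static_ip_cmd() {
    echo "nmcli connection modify $1 ipv4.method manual ipv4.addresses $2 ipv4.gateway $3 ipv4.dns $4"
}

# Example with the values from the reporter's host (connection name assumed):
#   static_ip_cmd eth0 172.16.120.79/24 172.16.120.1 172.16.120.2
```

After running the printed command as root, `nmcli connection up eth0` would re-activate the connection. This only has an effect if NetworkManager manages the interface, so the 'NM_CONTROLLED=no' line would need to be removed from the ifcfg file first.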

Comment 4 Gan Huang 2017-07-25 15:01:31 UTC
Hi Brenton,

Actually, I don't think it's related to the interface configuration. In today's testing I could easily reproduce it in a DHCP environment, although not 100% of the time.

With a static interface, I found that it reproduces 100% of the time.

If needed, I'll spin up a new cluster in a DHCP environment (with no 'NM_CONTROLLED=no' setting) for you tomorrow, my time.

Comment 5 Brenton Leanhardt 2017-07-25 15:03:12 UTC
Thanks for the additional info.  I'll move it back to the blocker list then.

Comment 12 Gan Huang 2017-07-26 00:13:29 UTC
Thanks for pointing it out.

I've set up another env with DHCP.
# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=eth0
HWADDR=fa:16:3e:48:80:91
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
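
Comment 3 suspected the 'NM_CONTROLLED=no' line, and this DHCP file does not contain it. A small sketch for checking any ifcfg file the same way is given here. The function name nm_controlled is hypothetical, and it assumes that an absent NM_CONTROLLED key means NetworkManager manages the interface.

```shell
# Hypothetical helper: exit 0 unless the ifcfg file explicitly sets
# NM_CONTROLLED=no (optionally quoted, any letter case).
nm_controlled() {
    if grep -Eiq "^[[:space:]]*NM_CONTROLLED=[\"']?no[\"']?[[:space:]]*$" "$1"; then
        echo "unmanaged: $1"
        return 1
    fi
    # Assumption: a missing key is treated as "managed by NetworkManager".
    echo "managed: $1"
}
```

For example, `nm_controlled /etc/sysconfig/network-scripts/ifcfg-eth0` could be run on each host before triggering the installation.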

It appears to fail roughly once in every 3 attempts.

Looking at the logs, it seems to be the same issue as in Comment 6.

Logs for NetworkManager-dispatcher are attached.

Comment 15 Scott Dodson 2017-07-27 12:06:49 UTC
I think this may be fixed in https://github.com/openshift/openshift-ansible/pull/4890

Can you try it?

Comment 16 Gan Huang 2017-07-28 10:09:13 UTC
I set up a large number of environments and haven't hit it yet.

Based on my past experience reproducing it, I assume it has been fixed.

Verified with openshift-ansible-3.6.172.0.0-1.git.0.d90ca2b.el7.noarch.rpm

Thanks Scott!

Comment 18 errata-xmlrpc 2017-08-10 05:32:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

