Bug 1480438 - [3.6] Installer fails due to missing /etc/origin/node/resolv.conf
Summary: [3.6] Installer fails due to missing /etc/origin/node/resolv.conf
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assignee: Scott Dodson
QA Contact: Gan Huang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-11 06:07 UTC by Marko Myllynen
Modified: 2018-08-13 07:59 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When nameservers were specified directly in /etc/resolv.conf, the node DNS configuration scripts failed to determine the correct nameservers. The configuration scripts have been updated to pull the nameservers from /etc/resolv.conf when they are not specified via other means.
Clone Of:
Environment:
Last Closed: 2017-09-05 17:42:58 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible.hosts (3.28 KB, text/plain)
2017-08-11 06:07 UTC, Marko Myllynen
no flags Details
ansible.log (6.77 MB, text/plain)
2017-08-11 06:08 UTC, Marko Myllynen
no flags Details
fixed script (5.07 KB, application/x-shellscript)
2017-08-18 20:31 UTC, Scott Dodson
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2639 0 normal SHIPPED_LIVE OpenShift Container Platform atomic-openshift-utils bug fix and enhancement 2017-09-05 21:42:36 UTC

Description Marko Myllynen 2017-08-11 06:07:41 UTC
Created attachment 1311986 [details]
ansible.hosts

Description of problem:
OCP 3.6 installation fails with:

Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

Investigating this further reveals:

F0810 09:06:10.912359    4298 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory

If I touch the files before installation, then the service starts up OK.
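
Roughly, the manual workaround looks like this (a sketch; only the file named in the error message is shown):

# Rough manual workaround: pre-create the file the node process expects,
# then restart the service (run as root on the affected node).
touch /etc/origin/node/resolv.conf
systemctl restart atomic-openshift-node.service
systemctl status atomic-openshift-node.service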

I'm attaching the full ansible log, the inventory file used, and a sosreport from the node after the failure.

Version-Release number of the following components:
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch
ansible-2.3.1.0-3.el7.noarch
ansible 2.3.1.0
  config file = /root/.ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

How reproducible:
Always.

Comment 1 Marko Myllynen 2017-08-11 06:08:52 UTC
Created attachment 1311987 [details]
ansible.log

ansible.log from the failed installation

Comment 3 Gan Huang 2017-08-11 06:36:27 UTC
Very weird to me that this can still be reproduced.

I thought it should have been fixed since openshift-ansible-3.6.172.0.0-1.

Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707

What kind of network configuration are you using? Is NM_CONTROLLED disabled on the interface?
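
For reference, a quick way to check is something like this (a sketch, assuming the interface is eth0):

# Check whether the interface config tells NetworkManager to ignore the device
grep -i NM_CONTROLLED /etc/sysconfig/network-scripts/ifcfg-eth0
# Check whether NetworkManager actually manages the device
nmcli device status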

Comment 4 Marko Myllynen 2017-08-11 06:40:59 UTC
(In reply to Gan Huang from comment #3)
> Very weird to me that this can still be reproduced.
> 
> I thought it should have been fixed since openshift-ansible-3.6.172.0.0-1.
> 
> Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707
> 
> What kind of network configuration are you using? Is NM_CONTROLLED disabled
> on the interface?

One interface per node, ifcfg-eth0 being something like:

[root@master01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
# Generated by parse-kickstart
IPV6INIT="yes"
BOOTPROTO=none
DEVICE="eth0"
ONBOOT="yes"
UUID="6a31ba44-ce22-48ef-941b-fab4b5defacc"
IPADDR=192.168.200.254
NETWORK=192.168.200.0
NETMASK=255.255.255.0
GATEWAY=192.168.200.200

Thanks.

Comment 5 Marko Myllynen 2017-08-11 06:41:39 UTC
And NetworkManager is in use, as required by the documentation.

Comment 6 Gan Huang 2017-08-11 08:18:52 UTC
Could you please attach the log for the NetworkManager-dispatcher service, and your /etc/resolv.conf?

journalctl -u NetworkManager-dispatcher > dispatcher.log

cat /etc/resolv.conf

Comment 7 Marko Myllynen 2017-08-11 08:30:28 UTC
Can you please check the attached sosreport? It should contain at least most of this information. Thanks.

Comment 9 Scott Dodson 2017-08-11 12:57:49 UTC
From the logs it's happening because there are no nameservers in IP4_NAMESERVERS. How are your nameservers for this host defined?

https://github.com/openshift/openshift-ansible/issues/4935 probably related
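
For context, IP4_NAMESERVERS is one of the environment variables NetworkManager passes to dispatcher scripts. A throwaway dispatcher script like the following (a sketch; the script name and log path are arbitrary, not part of openshift-ansible) shows what 99-origin-dns.sh is handed:

#!/bin/bash
# /etc/NetworkManager/dispatcher.d/98-debug-dns.sh -- hypothetical debug helper.
# NetworkManager passes the device as $1, the action as $2, and the nameserver
# list in IP4_NAMESERVERS.
echo "$(date) dev=$1 action=$2 IP4_NAMESERVERS='${IP4_NAMESERVERS}'" >> /var/tmp/nm-dispatcher-debug.log
# (remember to chmod +x the script so the dispatcher runs it)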

Comment 10 Scott Dodson 2017-08-11 12:59:55 UTC
When I say "how are they configured" I mean: what configuration method is used to define them? They're not in the interface config you mentioned above, and it's not using DHCP. Are they in /etc/sysconfig/network? If they're manually added to /etc/resolv.conf then we definitely don't account for that, but that seems like an invalid method of defining the nameservers.

Comment 11 Marko Myllynen 2017-08-11 13:03:54 UTC
The host has:

[root@master01 ~]# cat /etc/resolv.conf 
search test.example.com
nameserver 192.168.200.200

But in fact the nameserver is currently unreachable:

[root@master01 ~]# host www.redhat.com
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached
[root@master01 ~]# 

Nothing in /etc/sysconfig/network. I'm not expecting this configuration to survive the installation (we can, for example, use openshift_node_dnsmasq_additional_config_file for custom local configs if needed), but the installer should not choke just because there is something, working or not, in /etc/resolv.conf.
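
For the record, using that variable looks roughly like this (a sketch; the file path, domain, and IP below are just examples):

# Hypothetical example of keeping extra local DNS config across the install.
# 1) Put a dnsmasq snippet on the Ansible control host, e.g.:
cat > /root/extra-dnsmasq.conf <<'EOF'
# forward an internal zone to a site-local DNS server (domain/IP are examples)
server=/test.example.com/192.168.200.200
EOF
# 2) Point the inventory variable at it, e.g. in [OSEv3:vars]:
#    openshift_node_dnsmasq_additional_config_file=/root/extra-dnsmasq.conf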

Thanks.

Comment 12 Scott Dodson 2017-08-11 13:11:05 UTC
Right, we shouldn't have touched /etc/resolv.conf and we need to fix that. It's still not clear to me how your DNS servers were configured prior to any openshift-ansible playbook. Can you help me understand that? Did some process directly edit /etc/resolv.conf?

If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and reboot the host what do you have in /etc/resolv.conf and what method was used to populate the nameservers there?
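
Spelled out, that check could look like this (a sketch; nmcli output formatting varies by version, and eth0 is assumed from the config above):

# Stop the dispatcher hook from rewriting /etc/resolv.conf, then reboot
chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
reboot
# After the reboot, compare what NetworkManager reports as DNS servers
# with what actually ends up in /etc/resolv.conf
nmcli device show eth0 | grep IP4.DNS
cat /etc/resolv.conf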

Comment 13 Marko Myllynen 2017-08-11 13:38:41 UTC
(In reply to Scott Dodson from comment #12)
> Right, we shouldn't have touched /etc/resolv.conf and we need to fix that.
> It's still not clear to me how your DNS servers were configured prior to any
> openshift-ansible playbook. Can you help me understand that? Did some process
> directly edit /etc/resolv.conf?

What happened roughly was 1) RHEL 7 base installation and network configuration, 2) test something by manually updating /etc/resolv.conf and forget about it, 3) follow the host preparation steps, 4) install.

> If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and
> reboot the host what do you have in /etc/resolv.conf and what method was
> used to populate the nameservers there?

After this the contents are still the same.

Thanks.

Comment 14 Scott Dodson 2017-08-11 14:16:06 UTC
Thanks, I think I understand the problem now.

Comment 15 Scott Dodson 2017-08-18 20:29:50 UTC
https://github.com/openshift/openshift-ansible/pull/5145 proposed fix

Comment 16 Scott Dodson 2017-08-18 20:31:08 UTC
Created attachment 1315400 [details]
fixed script

can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with this version?

Comment 17 Marko Myllynen 2017-08-21 02:20:39 UTC
(In reply to Scott Dodson from comment #16)
> Created attachment 1315400 [details]
> fixed script
> 
> can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with
> this version?

I replaced the file after installation and rebooted a node; this seems to work.

Thanks.

Comment 18 Scott Dodson 2017-08-24 12:17:18 UTC
PR merged

Comment 20 Gan Huang 2017-08-28 08:46:54 UTC
Reproduced with openshift-ansible-3.6.173.0.5-3

Verified with openshift-ansible-3.6.173.0.19-2.git.0.eb719a4.el7.noarch

Steps:

1) Create 2 VMs using libvirt on the local host, with NAT network mode

2) Kill the dnsmasq process on the local host so that no nameservers are generated on the VMs by NetworkManager

3) On the VMs, manually add a nameserver to /etc/resolv.conf, and modify /etc/sysconfig/network-scripts/ifcfg-eth0 to use a static IP (a rough sketch of steps 2 and 3 follows after the results)

4) Trigger the installation.

Results:

Can be reproduced with openshift-ansible-3.6.173.0.5-3

No errors with openshift-ansible-3.6.173.0.19-2.git.0.eb719a4.el7.noarch
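
For the record, the VM-side preparation in steps 2) and 3) was roughly the following (a sketch; the nameserver IP is just an example):

# On the libvirt host: stop the dnsmasq instance serving the NAT network
pkill dnsmasq

# On each VM: add a nameserver by hand and switch eth0 to a static address
echo "nameserver 192.168.200.200" >> /etc/resolv.conf
sed -i 's/^BOOTPROTO=.*/BOOTPROTO=none/' /etc/sysconfig/network-scripts/ifcfg-eth0
# (plus static IPADDR/NETMASK/GATEWAY entries, as in the ifcfg-eth0 shown earlier)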

Comment 22 errata-xmlrpc 2017-09-05 17:42:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2639

Comment 23 Kamesh Sampath 2017-11-10 02:00:54 UTC
How do I know the right upstream release version for this bug? I am facing this issue with the latest versions of atomic-openshift-utils installed via the upstream centos-release-openshift-origin repo. I have to rebounce the servers to see `/etc/origin/node/resolv.conf` created. I am wondering whether the Ansible handlers queue is fired a bit later than needed; this usually happens after the node restart handler is fired, following the Open vSwitch install and restart.

Comment 24 Marc Jadoul 2017-11-10 15:43:42 UTC
Hello,
We are also facing the same issue today on the latest OCP 3.6 on one of our nodes.
And I am not aware that we changed /etc/resolv.conf manually. But I do see that it is slightly different on the failing node. Still, it is generated at boot on the failing server!

Regards,

Marc

Comment 25 Marc Jadoul 2017-11-10 15:50:45 UTC
The fix has apparently NOT been pushed to the latest RHEL 7 packages yet!



# diff /etc/NetworkManager/dispatcher.d/99-origin-dns.sh 99-origin-dns.sh
66c66
<       if [[ -z "${IP4_NAMESERVERS}" || "${IP4_NAMESERVERS}" == "${def_route_ip}" ]]; then
---
>       if ! [[ -n "${IP4_NAMESERVERS}" && "${IP4_NAMESERVERS}" != "${def_route_ip}" ]]; then
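
For reference, the two forms of that test are logically equivalent; the fallback described in the Doc Text (pulling nameservers from /etc/resolv.conf when none are supplied otherwise) looks roughly like this sketch (not the literal script contents):

# Sketch only, not the literal 99-origin-dns.sh: when NetworkManager supplies
# no usable nameservers, fall back to whatever /etc/resolv.conf already lists,
# ignoring the node's own dnsmasq address (def_route_ip, computed earlier in
# the real script).
if [[ -z "${IP4_NAMESERVERS}" || "${IP4_NAMESERVERS}" == "${def_route_ip}" ]]; then
    IP4_NAMESERVERS=$(awk '/^nameserver/ { print $2 }' /etc/resolv.conf | grep -v "${def_route_ip}")
fi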

Comment 26 Marc Jadoul 2017-11-10 16:05:13 UTC
Could you clarify why IP4_NAMESERVERS would be null?
We do not modify /etc/resolv.conf manually and on one node we had this issue. I would like to understand why this node is different.

Comment 27 Marc Jadoul 2017-11-10 16:40:16 UTC
Hello,

We tried to implement the fix but it does not solve the issue.

Comment 28 Marc Jadoul 2017-11-10 16:53:49 UTC
OK.
We found our issue.
In our case the watermark was always there!! This is because an operator had disabled the NetworkManager plugin that manages DNS. Therefore the watermark was not "removed" by NetworkManager at boot.

So I think the logic of the script is not fully bulletproof!
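
For anyone hitting the same thing, a quick sanity check could be (a sketch; the exact watermark text may differ between versions):

# Is NetworkManager allowed to manage DNS? A value of dns=none means it will
# not rewrite /etc/resolv.conf, so the dispatcher script's watermark never
# gets cleared.
grep -i '^dns=' /etc/NetworkManager/NetworkManager.conf

# Is the watermark left by the dispatcher script still present?
grep -i '99-origin-dns' /etc/resolv.conf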

Regards,

Marc

