Bug 1480438 - [3.6] Installer fails due to missing /etc/origin/node/resolv.conf
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assigned To: Scott Dodson
QA Contact: Gan Huang
Whiteboard: NeedsTestCase
Depends On:
Blocks:
 
Reported: 2017-08-11 02:07 EDT by Marko Myllynen
Modified: 2017-11-10 11:53 EST
CC: 11 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When the nameservers were specified directly in /etc/resolv.conf, the node DNS configuration scripts failed to determine the correct nameservers. The configuration scripts have been updated to pull the nameservers from /etc/resolv.conf when they are not specified via other means.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-05 13:42:58 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
ansible.hosts (3.28 KB, text/plain), 2017-08-11 02:07 EDT, Marko Myllynen
ansible.log (6.77 MB, text/plain), 2017-08-11 02:08 EDT, Marko Myllynen
fixed script (5.07 KB, application/x-shellscript), 2017-08-18 16:31 EDT, Scott Dodson

Description Marko Myllynen 2017-08-11 02:07:41 EDT
Created attachment 1311986 [details]
ansible.hosts

Description of problem:
OCP 3.6 installation fails with:

Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

Investigating this further reveals:

F0810 09:06:10.912359    4298 start_node.go:140] could not start DNS,
unable to read config file: open /etc/origin/node/resolv.conf: no such
file or directory

If I touch the file before installation, the service starts up OK.
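(For clarity, the workaround in shell form; the file path and unit name are taken from the error messages above:)

# Pre-create the file the node service expects, then restart it.
touch /etc/origin/node/resolv.conf
systemctl restart atomic-openshift-node.service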

I'm attaching the full ansible log, the inventory file used, and a sosreport from the node after the failure.

Version-Release number of the following components:
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch
ansible-2.3.1.0-3.el7.noarch
ansible 2.3.1.0
  config file = /root/.ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

How reproducible:
Always.
Comment 1 Marko Myllynen 2017-08-11 02:08 EDT
Created attachment 1311987 [details]
ansible.log

ansible.log from the failed installation
Comment 3 Gan Huang 2017-08-11 02:36:27 EDT
It's very strange to me that this can still be reproduced.

I thought this had been fixed since openshift-ansible-3.6.172.0.0-1.

Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707

What kind of network configuration are you using? Is NM_CONTROLLED disabled on the interface?
Comment 4 Marko Myllynen 2017-08-11 02:40:59 EDT
(In reply to Gan Huang from comment #3)
> It's very strange to me that this can still be reproduced.
> 
> I thought this had been fixed since openshift-ansible-3.6.172.0.0-1.
> 
> Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707
> 
> What kind of network configuration are you using? Is NM_CONTROLLED disabled
> on the interface?

One interface per node, ifcfg-eth0 being something like:

[root@master01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0 
# Generated by parse-kickstart
IPV6INIT="yes"
BOOTPROTO=none
DEVICE="eth0"
ONBOOT="yes"
UUID="6a31ba44-ce22-48ef-941b-fab4b5defacc"
IPADDR=192.168.200.254
NETWORK=192.168.200.0
NETMASK=255.255.255.0
GATEWAY=192.168.200.200

Thanks.
Comment 5 Marko Myllynen 2017-08-11 02:41:39 EDT
And NetworkManager is in use, as required by the documentation.
Comment 6 Gan Huang 2017-08-11 04:18:52 EDT
Could you please attach the log from the NetworkManager-dispatcher service, and /etc/resolv.conf?

journalctl -u NetworkManager-dispatcher > dispatcher.log

cat /etc/resolv.conf
Comment 7 Marko Myllynen 2017-08-11 04:30:28 EDT
Can you please check the attached sosreport? It should contain at least most of this information. Thanks.
Comment 9 Scott Dodson 2017-08-11 08:57:49 EDT
From the logs it's happening because there are no nameservers in IP4_NAMESERVERS. How are your nameservers for this host defined?

https://github.com/openshift/openshift-ansible/issues/4935 is probably related.
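(A quick way to check what NetworkManager will hand to dispatcher scripts as IP4_NAMESERVERS; the interface name eth0 is an assumption based on the config shown above:)

# Show the DNS servers NetworkManager knows about for eth0; these are what
# end up in IP4_NAMESERVERS for scripts under /etc/NetworkManager/dispatcher.d/.
nmcli device show eth0 | grep IP4.DNS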
Comment 10 Scott Dodson 2017-08-11 08:59:55 EDT
When I ask how they are configured, I mean: what configuration method is used to define them? They're not in the interface config you mentioned above, and it's not using DHCP. Are they in /etc/sysconfig/network? If they're manually added to /etc/resolv.conf then we definitely don't account for that, but that seems like an invalid method of defining the nameservers.
Comment 11 Marko Myllynen 2017-08-11 09:03:54 EDT
The host has:

[root@master01 ~]# cat /etc/resolv.conf 
search test.example.com
nameserver 192.168.200.200

But in fact the nameserver is currently unreachable:

[root@master01 ~]# host www.redhat.com
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached
[root@master01 ~]# 

Nothing in /etc/sysconfig/network. I'm not expecting this configuration to survive the installation (we can use e.g. openshift_node_dnsmasq_additional_config_file for custom local configs if needed), but the installer should not choke regardless of whether whatever is in /etc/resolv.conf works or not.
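(As an aside, a hypothetical inventory snippet for the variable mentioned above; the file path is made up for illustration:)

# In the [OSEv3:vars] section of the Ansible inventory:
openshift_node_dnsmasq_additional_config_file=/etc/dnsmasq.d/99-local-extra.conf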

Thanks.
Comment 12 Scott Dodson 2017-08-11 09:11:05 EDT
Right, we shouldn't have touched /etc/resolv.conf, and we need to fix that. It's still not clear to me how your DNS servers were configured prior to running any openshift-ansible playbook. Can you help me understand that? Did some process directly edit /etc/resolv.conf?

If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and reboot the host what do you have in /etc/resolv.conf and what method was used to populate the nameservers there?
Comment 13 Marko Myllynen 2017-08-11 09:38:41 EDT
(In reply to Scott Dodson from comment #12)
> Right, we shouldn't have touched /etc/resolv.conf, and we need to fix that.
> It's still not clear to me how your DNS servers were configured prior to
> running any openshift-ansible playbook. Can you help me understand that?
> Did some process directly edit /etc/resolv.conf?

What happened, roughly: 1) RHEL 7 base installation and network configuration, 2) tested something by manually updating /etc/resolv.conf and forgot about it, 3) followed the host preparation steps, 4) installed.

> If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and
> reboot the host what do you have in /etc/resolv.conf and what method was
> used to populate the nameservers there?

After this the contents are still the same.

Thanks.
Comment 14 Scott Dodson 2017-08-11 10:16:06 EDT
Thanks, I think I understand the problem now.
Comment 15 Scott Dodson 2017-08-18 16:29:50 EDT
https://github.com/openshift/openshift-ansible/pull/5145 proposed fix
Comment 16 Scott Dodson 2017-08-18 16:31 EDT
Created attachment 1315400 [details]
fixed script

Can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with this version?
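(For anyone reading along, a minimal sketch of the idea behind the fix as described in the Doc Text: if NetworkManager supplies no nameservers, fall back to the ones already in /etc/resolv.conf. This is not the actual attached script.)

#!/bin/bash
# IP4_NAMESERVERS is supplied by NetworkManager to dispatcher scripts; it can
# be empty when the nameservers were written straight into /etc/resolv.conf.
if [[ -z "${IP4_NAMESERVERS}" ]]; then
    # Fall back to the nameserver lines already present in /etc/resolv.conf.
    IP4_NAMESERVERS=$(awk '/^nameserver/ { print $2 }' /etc/resolv.conf | tr '\n' ' ')
fi
echo "upstream nameservers for the node dnsmasq: ${IP4_NAMESERVERS}"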
Comment 17 Marko Myllynen 2017-08-20 22:20:39 EDT
(In reply to Scott Dodson from comment #16)
> Created attachment 1315400 [details]
> fixed script
> 
> Can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with
> this version?

I replaced the file after installation and rebooted a node; this seems to work.

Thanks.
Comment 18 Scott Dodson 2017-08-24 08:17:18 EDT
PR merged
Comment 20 Gan Huang 2017-08-28 04:46:54 EDT
Reproduced with openshift-ansible-3.6.173.0.5-3

Verified with openshift-ansible-3.6.173.0.19-2.git.0.eb719a4.el7.noarch

Steps:

1) Created 2 VMs using libvirt on the local host, using NAT network mode.

2) Killed the dnsmasq process on the local host so that no nameservers were generated on the VMs by NetworkManager.

3) On the VMs, manually added a nameserver to /etc/resolv.conf and modified /etc/sysconfig/network-scripts/ifcfg-eth0 to use a static IP.

4) Triggered the installation.

Results:

Can be reproduced with openshift-ansible-3.6.173.0.5-3

No errors with openshift-ansible-3.6.173.0.19-2.git.0.eb719a4.el7.noarch
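
(For reference, a rough shell sketch of the reproduction steps above; the addresses, interface name, and inventory path are assumptions:)

# On the libvirt host: kill the dnsmasq serving the NAT network so the VMs
# get no nameservers from NetworkManager/DHCP.
pkill dnsmasq

# On each VM: switch eth0 to a static IP...
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.122.10
NETMASK=255.255.255.0
GATEWAY=192.168.122.1
EOF
# ...and manually add a nameserver directly to /etc/resolv.conf.
echo "nameserver 192.168.122.1" >> /etc/resolv.conf

# Then trigger the installation from the Ansible control host.
ansible-playbook -i hosts playbooks/byo/config.yml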
Comment 22 errata-xmlrpc 2017-09-05 13:42:58 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2639
Comment 23 Kamesh Sampath 2017-11-09 21:00:54 EST
How can I find the right upstream release version for this bug? I am facing this issue with the latest versions of atomic-openshift-utils installed via the upstream centos-release-openshift-origin repo. I have to rebounce the servers to get /etc/origin/node/resolv.conf created. I'm wondering whether the Ansible handlers queue fires a bit later than needed; this usually happens when the node restart handler fires after the Open vSwitch install and restart.
Comment 24 Marc Jadoul 2017-11-10 10:43:42 EST
Hello,
We are also facing the same issue today on the latest OCP 3.6, on one of our nodes. I am not aware of us having changed /etc/resolv.conf manually, but I do see that it is slightly different on the failing node. It is still generated at boot on the failing server!

Regards,

Marc
Comment 25 Marc Jadoul 2017-11-10 10:50:45 EST
It has apparently NOT been pushed to the latest RHEL 7 packages yet!



# diff /etc/NetworkManager/dispatcher.d/99-origin-dns.sh 99-origin-dns.sh
66c66
<       if [[ -z "${IP4_NAMESERVERS}" || "${IP4_NAMESERVERS}" == "${def_route_ip}" ]]; then
---
>       if ! [[ -n "${IP4_NAMESERVERS}" && "${IP4_NAMESERVERS}" != "${def_route_ip}" ]]; then
Comment 26 Marc Jadoul 2017-11-10 11:05:13 EST
Could you clarify why IP4_NAMESERVERS would be null?
We do not modify /etc/resolv.conf manually, and yet we had this issue on one node. I would like to understand why this node is different.
Comment 27 Marc Jadoul 2017-11-10 11:40:16 EST
Hello,

We tried to implement the fix but it does not solve the issue.
Comment 28 Marc Jadoul 2017-11-10 11:53:49 EST
OK.
We found our issue.
In our case the watermark was always there! This is because an operator disabled the NetworkManager plugin that manages DNS, so the watermark was not "removed" by NetworkManager at boot.

So I think the logic of the script is not fully bulletproof!

Regards,

Marc
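
(For context, the "watermark" is a comment line the dispatcher script adds to /etc/resolv.conf so it can tell whether it has already rewritten the file; a simplified sketch of that pattern, with the marker text assumed for illustration:)

#!/bin/bash
# Simplified sketch only; the marker text is an assumption, not the real one.
MARK='# Updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh'
if ! grep -qF "$MARK" /etc/resolv.conf; then
    # Normally NetworkManager regenerates /etc/resolv.conf at boot (removing
    # the mark), so the script rewrites it to point at the local dnsmasq.
    # If NetworkManager's DNS handling is disabled, the mark survives a reboot
    # and this branch is skipped, which is the situation described above.
    printf '%s\nsearch example.com\nnameserver 192.168.200.254\n' "$MARK" > /etc/resolv.conf
fi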
