Red Hat Bugzilla – Bug 1480438
[3.6] Installer fails due to missing /etc/origin/node/resolv.conf
Last modified: 2017-09-05 13:42:58 EDT
Created attachment 1311986 [details]
Description of problem:
OCP 3.6 installation fails with:
Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
Investigating this further reveals:
F0810 09:06:10.912359 4298 start_node.go:140] could not start DNS,
unable to read config file: open /etc/origin/node/resolv.conf: no such
file or directory
If I touch the files before installation, the service starts up OK.
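A minimal sketch of that workaround, assuming it is run as root on each node before the installer starts. The helper name `ensure_node_resolv_conf` is hypothetical; the default path is the one from the error message above. This only gets the service past the missing-file error and is not a fix for the underlying problem.

```shell
#!/usr/bin/env bash
# Workaround sketch: make sure the node's resolv.conf exists before the
# installer (re)starts atomic-openshift-node.

# ensure_node_resolv_conf is a hypothetical helper name.
# $1: path to create (defaults to the path from the error message).
ensure_node_resolv_conf() {
    local path="${1:-/etc/origin/node/resolv.conf}"
    # Create the parent directory in case the installer has not done so yet.
    mkdir -p "$(dirname "$path")" || return 1
    # touch creates the file if it is missing and leaves existing content alone.
    touch "$path"
}

# Usage (as root, before running the installer):
#   ensure_node_resolv_conf
```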
I'm attaching full ansible log, the used inventory file, and sosreport from the node after the failure.
Version-Release number of the following components:
config file = /root/.ansible.cfg
configured module search path = Default w/o overrides
python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
Created attachment 1311987 [details]
ansible.log from the failed installation
Very weird to me it still can be reproduced.
I thought it should have been fixed since openshift-ansible-126.96.36.199.0-1
Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707
What kind of network configurations are you using? NM_CONTROLLED disabled on the interface?
(In reply to Gan Huang from comment #3)
> Very weird to me it still can be reproduced.
> I thought it should have been fixed since openshift-ansible-188.8.131.52.0-1
> Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707
> What kind of network configurations are you using? NM_CONTROLLED disabled on
> the interface?
One interface per node, ifcfg-eth0 being something like:
[root@master01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by parse-kickstart
And NM in use, as required per the documentation.
Could you please attach the log for the NetworkManager-dispatcher service, and your /etc/resolv.conf?
journalctl -u NetworkManager-dispatcher > dispatcher.log
Can you please check the attached sosreport? It should contain at least most of this information. Thanks.
From the logs it's happening because there are no nameservers in IP4_NAMESERVERS. How are your nameservers for this host defined?
https://github.com/openshift/openshift-ansible/issues/4935 probably related
When I say how are they configured I mean, what configuration method is used to define them? They're not in the interface config you mentioned above and it's not using DHCP. Are they in /etc/sysconfig/network ? If they're manually added to /etc/resolv.conf then we definitely don't account for that but that seems like an invalid method to define the nameservers.
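To illustrate the case described here: a dispatcher script that relies only on IP4_NAMESERVERS (the variable NetworkManager passes to dispatcher scripts) sees nothing when the nameservers were written by hand into /etc/resolv.conf. The following is only a sketch of one way such a script could account for that, not the shipped 99-origin-dns.sh; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: fall back to "nameserver" lines in a resolv.conf-style file
# when NetworkManager supplies no nameservers in IP4_NAMESERVERS.

# nameservers_from_resolv_conf (hypothetical helper): print the addresses
# found on "nameserver" lines of the given file, space-separated on one line.
nameservers_from_resolv_conf() {
    local file="${1:-/etc/resolv.conf}"
    # A missing or unreadable file simply yields no nameservers.
    [ -r "$file" ] || return 0
    awk '$1 == "nameserver" { printf "%s%s", sep, $2; sep = " " }
         END { if (sep) print "" }' "$file"
}

# effective_nameservers (hypothetical helper): prefer what NetworkManager
# passed in IP4_NAMESERVERS; otherwise read them from the file.
effective_nameservers() {
    local file="${1:-/etc/resolv.conf}"
    if [ -n "${IP4_NAMESERVERS:-}" ]; then
        printf '%s\n' "$IP4_NAMESERVERS"
    else
        nameservers_from_resolv_conf "$file"
    fi
}
```

A real script would also need to avoid reading back a resolv.conf that it generated itself on an earlier run.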
The host has:
[root@master01 ~]# cat /etc/resolv.conf
But in fact the nameserver is currently unreachable:
[root@master01 ~]# host www.redhat.com
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached
Nothing in /etc/sysconfig/network. I'm not expecting this configuration to survive the installation (we can use e.g. openshift_node_dnsmasq_additional_config_file for custom local configs if needed), but the installer should not choke whether or not what is in /etc/resolv.conf works.
Right, we shouldn't have touched /etc/resolv.conf and we need to fix that. It's still not clear to me how, prior to any openshift-ansible playbook, your dns servers were configured? Can you help me understand that? Did some process directly edit /etc/resolv.conf?
If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and reboot the host what do you have in /etc/resolv.conf and what method was used to populate the nameservers there?
(In reply to Scott Dodson from comment #12)
> Right, we shouldn't have touched /etc/resolv.conf and we need to fix that.
> It's still not clear to me how, prior to any openshift-ansible playbook,
> your dns servers were configured? Can you help me understand that? Did some
> process directly edit /etc/resolv.conf?
What happened roughly was 1) RHEL 7 base installation and network configuration, 2) test something by manually updating /etc/resolv.conf and forget about it, 3) follow the host preparation steps, 4) install.
> If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and
> reboot the host what do you have in /etc/resolv.conf and what method was
> used to populate the nameservers there?
After this the contents are still the same.
Thanks, I think I understand the problem now.
https://github.com/openshift/openshift-ansible/pull/5145 proposed fix
Created attachment 1315400 [details]
can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with this version?
(In reply to Scott Dodson from comment #16)
> Created attachment 1315400 [details]
> fixed script
> can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with
> this version?
I replaced the file after installation and rebooted a node; this seems to work.
Reproduced with openshift-ansible-184.108.40.206.5-3
Verified with openshift-ansible-220.127.116.11.19-2.git.0.eb719a4.el7.noarch
1) Create 2 VMs using libvirt on the local host, in NAT network mode.
2) Kill the dnsmasq process on the local host so that NetworkManager generates no nameservers on the VMs.
3) On the VMs, manually add a nameserver to /etc/resolv.conf, and modify /etc/sysconfig/network-scripts/ifcfg-eth0 to use a static IP.
4) Trigger the installation.
Can be reproduced with openshift-ansible-18.104.22.168.5-3
No errors with openshift-ansible-22.214.171.124.19-2.git.0.eb719a4.el7.noarch
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.