Created attachment 1311986 [details]
ansible.hosts

Description of problem:
OCP 3.6 installation fails with:

Unable to restart service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.

Investigating this further reveals:

F0810 09:06:10.912359 4298 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory

If I touch the file before installation, the service starts up OK. I'm attaching the full ansible log, the inventory file used, and an sosreport from the node after the failure.

Version-Release number of the following components:
openshift-ansible-3.6.173.0.5-3.git.0.522a92a.el7.noarch
ansible-2.3.1.0-3.el7.noarch

ansible 2.3.1.0
  config file = /root/.ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

How reproducible:
Always.
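The workaround described above ("touch the file before installation") can be sketched as a couple of shell commands. This is my own illustrative sketch, not from the report; the `ROOT` prefix exists only so the sketch is runnable outside a real node, where the path would simply be /etc/origin/node/resolv.conf (and would require root).

```shell
#!/usr/bin/env bash
# Hedged sketch of the pre-installation workaround: pre-create the file
# atomic-openshift-node expects before running the installer.
# ROOT is a demonstration prefix; on a real node it would be "/".
ROOT="${ROOT:-$(mktemp -d)}"

mkdir -p "${ROOT}/etc/origin/node"
touch "${ROOT}/etc/origin/node/resolv.conf"

# Confirm the file now exists, as the node service requires at startup.
test -f "${ROOT}/etc/origin/node/resolv.conf" && echo "resolv.conf present"
```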
Created attachment 1311987 [details] ansible.log ansible.log from the failed installation
Very weird to me that it can still be reproduced.

I thought it should have been fixed since openshift-ansible-3.6.172.0.0-1.

Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707

What kind of network configuration are you using? NM_CONTROLLED disabled on the interface?
(In reply to Gan Huang from comment #3)
> Very weird to me it still can be reproduced.
>
> I thought it should have been fixed since openshift-ansible-3.6.172.0.0-1
>
> Please see: https://bugzilla.redhat.com/show_bug.cgi?id=1474707
>
> What kind of network configurations are you using? NM_CONTROLLED disabled on
> the interface?

One interface per node, with ifcfg-eth0 being something like:

[root@master01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Generated by parse-kickstart
IPV6INIT="yes"
BOOTPROTO=none
DEVICE="eth0"
ONBOOT="yes"
UUID="6a31ba44-ce22-48ef-941b-fab4b5defacc"
IPADDR=192.168.200.254
NETWORK=192.168.200.0
NETMASK=255.255.255.0
GATEWAY=192.168.200.200

Thanks.
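For reference, and as a generic illustration rather than anything taken from this host: on a statically configured interface, nameservers that NetworkManager should own are normally declared in the ifcfg file itself via DNS1/DNS2 keys, which is what lets NM pass them to dispatcher scripts. Note the ifcfg above has no such entries, which becomes relevant later in this thread.

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 (illustrative fragment only)
BOOTPROTO=none
IPADDR=192.168.200.254
NETMASK=255.255.255.0
GATEWAY=192.168.200.200
DNS1=192.168.200.200   # nameservers NM will know about and export
```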
And NetworkManager is in use, as required by the documentation.
Could you please attach the log for the NetworkManager-dispatcher service, and /etc/resolv.conf?

journalctl -u NetworkManager-dispatcher > dispatcher.log
cat /etc/resolv.conf
Can you please check the attached sosreport? It should contain at least most of this information. Thanks.
From the logs it's happening because there are no nameservers in IP4_NAMESERVERS. How are the nameservers for this host defined?

https://github.com/openshift/openshift-ansible/issues/4935 is probably related.
When I ask how they are configured, I mean: what configuration method is used to define them? They're not in the interface config you mentioned above, and the host isn't using DHCP. Are they in /etc/sysconfig/network? If they were manually added to /etc/resolv.conf then we definitely don't account for that, but that seems like an invalid way to define the nameservers.
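To make the IP4_NAMESERVERS discussion concrete: NetworkManager runs the scripts in /etc/NetworkManager/dispatcher.d/ with connection details exported as environment variables (per NetworkManager-dispatcher(8)), and IP4_NAMESERVERS is populated only from sources NM knows about, such as DNS1/DNS2 in the ifcfg file or DHCP. A hand-edited /etc/resolv.conf never feeds into it. A minimal sketch of the failing condition (my own, simplified):

```shell
#!/usr/bin/env bash
# Sketch of what a dispatcher script sees. IP4_NAMESERVERS stays empty when
# the interface is static with no DNS1/DNS2 and no DHCP, even if
# /etc/resolv.conf was edited by hand -- the situation in this bug.
IP4_NAMESERVERS="${IP4_NAMESERVERS:-}"

if [[ -z "${IP4_NAMESERVERS}" ]]; then
  echo "no nameservers known to NetworkManager"
else
  echo "nameservers: ${IP4_NAMESERVERS}"
fi
```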
The host has:

[root@master01 ~]# cat /etc/resolv.conf
search test.example.com
nameserver 192.168.200.200

But in fact that nameserver is currently unreachable:

[root@master01 ~]# host www.redhat.com
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached
[root@master01 ~]#

Nothing in /etc/sysconfig/network.

I'm not expecting this configuration to survive the installation (we can e.g. use openshift_node_dnsmasq_additional_config_file for custom local configs if needed), but the installer should not choke regardless of whether what's in /etc/resolv.conf is working or not.

Thanks.
Right, we shouldn't have touched /etc/resolv.conf, and we need to fix that.

It's still not clear to me how your DNS servers were configured prior to running any openshift-ansible playbook. Can you help me understand that? Did some process directly edit /etc/resolv.conf?

If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and reboot the host, what do you have in /etc/resolv.conf, and what method was used to populate the nameservers there?
(In reply to Scott Dodson from comment #12)
> Right, we shouldn't have touched /etc/resolv.conf and we need to fix that.
> It's still not clear to me how, prior to any openshift-ansible playbook,
> your dns servers were configured? Can you help me understand that? Did some
> process directly edit /etc/resolv.conf?

What happened, roughly:
1) RHEL 7 base installation and network configuration,
2) tested something by manually updating /etc/resolv.conf and forgot about it,
3) followed the host preparation steps,
4) installed.

> If you `chmod -x /etc/NetworkManager/dispatcher.d/99-origin-dns.sh` and
> reboot the host what do you have in /etc/resolv.conf and what method was
> used to populate the nameservers there?

After doing this, the contents are still the same.

Thanks.
Thanks, I think I understand the problem now.
Proposed fix: https://github.com/openshift/openshift-ansible/pull/5145
Created attachment 1315400 [details]
fixed script

Can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with this version?
(In reply to Scott Dodson from comment #16)
> Created attachment 1315400 [details]
> fixed script
>
> can you try replacing /etc/NetworkManager/dispatcher.d/99-origin-dns.sh with
> this version?

I replaced the file after installation and rebooted a node; this seems to work. Thanks.
PR merged
Reproduced with openshift-ansible-3.6.173.0.5-3
Verified with openshift-ansible-3.6.173.0.19-2.git.0.eb719a4.el7.noarch

Steps:
1) Create 2 VMs using libvirt on the local host, in NAT network mode.
2) Kill the dnsmasq process on the local host so that NetworkManager generates no nameservers on the VMs.
3) On the VMs, manually add a nameserver to /etc/resolv.conf, and modify /etc/sysconfig/network-scripts/ifcfg-eth0 to use a static IP.
4) Trigger the installation.

Results:
Can be reproduced with openshift-ansible-3.6.173.0.5-3
No errors with openshift-ansible-3.6.173.0.19-2.git.0.eb719a4.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2639
How can I find out which upstream release carries this fix? I am facing this issue in the latest versions of atomic-openshift-utils installed via the upstream centos-release-openshift-origin repo. I have to restart the servers for `/etc/origin/node/resolv.conf` to be created. I wonder whether the ansible handler queue fires a bit later than needed: the node restart handler usually fires after the openvswitch install and restart.
Hello,

We are also facing the same issue today, on the latest OCP 3.6, on one of our nodes. I am not aware that we changed /etc/resolv.conf manually, but I do see that it is slightly different on the failing node. Still, it is generated at boot on the failing server!

Regards,
Marc
It is apparently NOT pushed to the latest RHEL 7 yet!

# diff /etc/NetworkManager/dispatcher.d/99-origin-dns.sh 99-origin-dns.sh
66c66
< if [[ -z "${IP4_NAMESERVERS}" || "${IP4_NAMESERVERS}" == "${def_route_ip}" ]]; then
---
> if ! [[ -n "${IP4_NAMESERVERS}" && "${IP4_NAMESERVERS}" != "${def_route_ip}" ]]; then
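Worth noting (my own reading, not stated in the thread): the two guard lines in this diff are logically equivalent — `! [[ -n A && A != B ]]` is just the De Morgan rewrite of `[[ -z A || A == B ]]` — so if this were the only difference between the two script versions, their behavior for that check would be identical, and the substantive part of the fix would have to live elsewhere in the script. A quick check:

```shell
#!/usr/bin/env bash
# Demonstration (my own sketch, not from the attachment): the two guard
# conditions from the diff above agree for empty, default-route, and
# external nameserver values.
old_check() { [[ -z "$1" || "$1" == "$2" ]]; }
new_check() { ! [[ -n "$1" && "$1" != "$2" ]]; }

def_route_ip="192.168.200.200"
for ns in "" "192.168.200.200" "8.8.8.8"; do
  old_check "${ns}" "${def_route_ip}" && a=skip || a=use
  new_check "${ns}" "${def_route_ip}" && b=skip || b=use
  [ "${a}" = "${b}" ] && echo "agree for ns='${ns}' (${a})"
done
```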
Could you clarify why IP4_NAMESERVERS would be null? We do not modify /etc/resolv.conf manually, and yet one node had this issue. I would like to understand why this node is different.
Hello, We tried to implement the fix but it does not solve the issue.
OK, we found our issue. In our case the watermark was always there! This is because an operator had disabled NetworkManager's DNS plugin, so the watermark was not removed by NetworkManager at boot. So I think the logic of the script is not fully bullet-proof!

Regards,
Marc
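For readers hitting the same failure mode: the dispatcher script guards its work with a watermark comment in /etc/resolv.conf, on the assumption that NetworkManager regenerates the file at boot and thereby clears the mark. If NM's DNS handling is disabled and the file is never regenerated, the watermark survives forever and the script wrongly skips its rewrite. Roughly this logic, as a simplified, hypothetical sketch (not the real 99-origin-dns.sh; paths and the watermark text are stand-ins):

```shell
#!/usr/bin/env bash
# Simplified sketch of watermark-guarded rewriting, showing why a
# never-regenerated resolv.conf defeats the guard.
RESOLV="${RESOLV:-$(mktemp)}"
WATERMARK="# nameserver updated by a dispatcher script"

update_resolv() {
  if grep -qF "${WATERMARK}" "${RESOLV}" 2>/dev/null; then
    # Assumes NM removed the mark at boot; if NM's DNS plugin is disabled,
    # the mark persists and the rewrite is wrongly skipped (Marc's scenario).
    echo "watermark present: skipping rewrite"
  else
    { echo "${WATERMARK}"; echo "nameserver 127.0.0.1"; } > "${RESOLV}"
    echo "rewrote resolv.conf"
  fi
}

: > "${RESOLV}"   # fresh file, as after an NM-driven regeneration
update_resolv     # first run rewrites the file and adds the watermark
update_resolv     # second run finds the watermark and skips
```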