Description of problem:

The Ansible openshift playbook run fails on the following task:

TASK [openshift_service_catalog : wait for api server to be ready]

Looking at the command that is run for that task, the customer reran it on one of the masters:

[usrnrp@wi01vmd-ospc1 ~]$ curl -k https://apiserver.kube-service-catalog.svc/healthz
curl: (6) Could not resolve host: apiserver.kube-service-catalog.svc; Name or service not known

Looking into this a bit, the resolv.conf has the following contents:

[usrnrp@wi01vmd-ospc1 ~]$ cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search tds.net svc.tds.net web.tds.net dns.tds.net sec.tds.net ent.tds.net cluster.local
nameserver 10.35.129.211
[usrnrp@wi01vmd-ospc1 ~]$

One problem may be that cluster.local is the 7th search entry, which according to some brief reading may be an issue given the glibc limit on the number of search domains. Going down this route, the customer deleted the 6th entry (ent.tds.net) on just the masters and reran the playbook, but it failed on that same ansible task. The customer retried the curl command as well and got a new failure:

[usrnrp@wi01vmd-ospc1 ~]$ curl -vvv -k https://apiserver.kube-service-catalog.svc/healthz
* About to connect() to apiserver.kube-service-catalog.svc port 443 (#0)
*   Trying 172.30.61.126...
* Connection refused
* Failed connect to apiserver.kube-service-catalog.svc:443; Connection refused
* Closing connection 0
curl: (7) Failed connect to apiserver.kube-service-catalog.svc:443; Connection refused
[usrnrp@wi01vmd-ospc1 ~]$

Version-Release number of selected component (if applicable):
OCP 3.7

How reproducible:
Customer verified

Steps to Reproduce:
/etc/NetworkManager/dispatcher.d/99-origin-dns.sh appends a search suffix of cluster.local to /etc/resolv.conf without taking into account the glibc limitation on /etc/resolv.conf as documented at https://access.redhat.com/solutions/58028. In the customer environment this makes cluster.local the 7th search suffix (they already add 6 search suffixes of their own - see the /etc/resolv.conf content below), so when the ansible playbook that performs the OpenShift install attempts to verify the cluster via the service short name apiserver.kube-service-catalog.svc (i.e. not an FQDN), curl fails with "Name or service not known". (A quick way to check this on a host is sketched after the patch diff below.)

The customer's /etc/hosts file looks like this:

[usrnrp@wi01vmd-ospc1 ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[usrnrp@wi01vmd-ospc1 ~]$

The customer's /etc/resolv.conf file looks like this:

[usrnrp@wi01vmd-ospc1 ~]$ cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search tds.net svc.tds.net web.tds.net dns.tds.net sec.tds.net ent.tds.net
nameserver 69.128.137.195
nameserver 69.128.137.196

Actual results:
The customer patched /usr/share/ansible/openshift-ansible/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh before running the playbook to install OpenShift, and the installer is then able to successfully verify apiserver.kube-service-catalog.svc:
[usrnrp@wi01vmd-ospc0 ~]$ diff /usr/share/ansible/openshift-ansible/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh /tmp/99-origin-dns.sh
120,121c120
< #sed -i '/^search/ s/$/ cluster.local/' ${NEW_RESOLV_CONF}
< sed -i 's/^search.*$/search cluster.local/g' ${NEW_RESOLV_CONF}
---
> sed -i '/^search/ s/$/ cluster.local/' ${NEW_RESOLV_CONF}
[usrnrp@wi01vmd-ospc0 ~]$

This patch might not be ideal, since it again does not take the glibc limitation into consideration and it wipes out whatever was in the search path to begin with, but for now it has gotten the customer moving forward.
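Not part of the customer's original commands, but a quick way to see whether the glibc search-domain limit is in play on an affected host is to count the search entries and compare short-name vs. fully qualified resolution (apiserver.kube-service-catalog.svc.cluster.local is assumed here purely for comparison):

# Count the search domains; resolvers in older glibc only honor the first 6
awk '/^search/ {print NF-1, "search domains"}' /etc/resolv.conf

# The short name fails when cluster.local sits beyond the honored search entries...
getent hosts apiserver.kube-service-catalog.svc

# ...while the fully qualified name should still resolve through the cluster DNS
getent hosts apiserver.kube-service-catalog.svc.cluster.local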
It seems "cluster.local" should be listed first instead of being appended.
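For illustration only - the change that was eventually merged may differ - prepending rather than appending inside 99-origin-dns.sh could look something like the following, where NEW_RESOLV_CONF is the temp file the script already writes (see the diff above):

# Put cluster.local at the front of the existing search line instead of the end,
# so it stays within the search domains that older glibc honors
sed -i '/^search/ s/^search /search cluster.local /' ${NEW_RESOLV_CONF}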
Created https://github.com/openshift/openshift-ansible/pull/7103 for master
Customer still affected by this issue and requesting update on Bug. Has there been any movement on this issue at all and are there any updates I can provide to the customer?
(In reply to Greg Rodriguez II from comment #9)
> Customer still affected by this issue and requesting update on Bug. Has
> there been any movement on this issue at all and are there any updates I can
> provide to the customer?

The solution is to update to RHEL 7.5 (the glibc resolver is fixed there) and to update to a later version of openshift-ansible. The PR to put cluster.local first has been merged and has been available since openshift-ansible-3.7.31-1.
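Not from the bug itself, but a quick way to confirm a host already carries both fixes might be to check the installed package versions (glibc-2.17-222.el7 is the RHEL 7.5 build shown in the verification below):

# RHEL 7.5 ships the glibc resolver fix (glibc-2.17-222.el7 or later)
rpm -q glibc

# openshift-ansible 3.7.31-1 or later puts cluster.local first in the search line
rpm -q openshift-ansible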
Verified successfully.

Test steps:
1) Configure "cluster.local" as the 8th search entry and restart dnsmasq, like below:

[root@host-172-16-120-115 ~]# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search tds.net svc.tds.net web.tds.net dns.net sec.net ent.tds.net tds.net cluster.local openstacklocal
nameserver 172.16.120.115
[root@host-172-16-120-115 ~]# systemctl restart dnsmasq.service

2) Install the service catalog using openshift-ansible; the install succeeds and the service catalog works well.

System info:
[root@host-172-16-120-115 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)
[root@host-172-16-120-115 ~]# rpm -qa | grep glibc
glibc-common-2.17-222.el7.x86_64
glibc-2.17-222.el7.x86_64

ansible version: openshift-ansible-3.9.27

Additional info:
I can reproduce this bug on RHEL 7.4:

[root@qe-zitang-gcemaster-etcd-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[root@qe-zitang-gcemaster-etcd-1 ~]# rpm -qa | grep glibc
glibc-common-2.17-196.el7_4.2.x86_64
glibc-2.17-196.el7_4.2.x86_64
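As an additional sanity check (not part of the QE steps above), the probe the installer task performs can be re-run by hand on a master; with the fix in place it should resolve and return a healthy response:

# Same check the "wait for api server to be ready" task performs
curl -k https://apiserver.kube-service-catalog.svc/healthz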
(In reply to Vadim Rutkovsky from comment #10)
> (In reply to Greg Rodriguez II from comment #9)
> > Customer still affected by this issue and requesting update on Bug. Has
> > there been any movement on this issue at all and are there any updates I can
> > provide to the customer?
>
> The solution is to update to RHEL 7.5 (the glibc resolver is fixed there) and
> to update to a later version of openshift-ansible.

The advisory with the glibc fix is https://access.redhat.com/errata/RHSA-2018:0805