Bug 1544001 - Ansible openshift playbooks run fails on TASK [openshift_service_catalog : wait for api server to be ready]
Summary: Ansible openshift playbooks run fails on TASK [openshift_service_catalog : wait for api server to be ready]
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.z
Assignee: Vadim Rutkovsky
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-09 19:31 UTC by Greg Rodriguez II
Modified: 2019-02-01 12:18 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-18 15:59:59 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Bugzilla 677316 (low, CLOSED): glibc: Increase number of search domains supported by /etc/resolv.conf - last updated 2023-03-24 13:25:57 UTC

Internal Links: 677316

Description Greg Rodriguez II 2018-02-09 19:31:33 UTC
Description of problem:
Ansible openshift playbooks run fails on the following task:

TASK [openshift_service_catalog : wait for api server to be ready]

Looking at the command that task runs, the customer reran it manually on one of the masters:

[usrnrp@wi01vmd-ospc1 ~]$ curl -k https://apiserver.kube-service-catalog.svc/healthz
curl: (6) Could not resolve host: apiserver.kube-service-catalog.svc; Name or service not known

Looking into this a bit, /etc/resolv.conf has the following contents:

[usrnrp@wi01vmd-ospc1 ~]$ cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search tds.net svc.tds.net web.tds.net dns.tds.net sec.tds.net ent.tds.net cluster.local
nameserver 10.35.129.211
[usrnrp@wi01vmd-ospc1 ~]$

One problem may be that cluster.local is the 7th search entry; glibc only honors the first 6 search domains, so the 7th is never tried. Going down this route, the customer deleted the 6th entry (ent.tds.net) on just the masters and reran the playbook, but it failed on that same Ansible task. The customer retried the curl command as well and got a new failure:

[usrnrp@wi01vmd-ospc1 ~]$ curl -vvv -k https://apiserver.kube-service-catalog.svc/healthz
* About to connect() to apiserver.kube-service-catalog.svc port 443 (#0)
*   Trying 172.30.61.126...
* Connection refused
* Failed connect to apiserver.kube-service-catalog.svc:443; Connection refused
* Closing connection 0
curl: (7) Failed connect to apiserver.kube-service-catalog.svc:443; Connection refused
[usrnrp@wi01vmd-ospc1 ~]$
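The second failure (connection refused) suggests the name now resolves to the service IP but nothing is answering on 172.30.61.126:443. A hedged follow-up check, assuming cluster-admin access from a master, would be to confirm whether the service-catalog apiserver pod and service actually exist:

  oc get pods -n kube-service-catalog
  oc get svc -n kube-service-catalog
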

Version-Release number of selected component (if applicable):
OCP 3.7

How reproducible:
Customer verified

Steps to Reproduce:
/etc/NetworkManager/dispatcher.d/99-origin-dns.sh appends a search suffix of cluster.local to /etc/resolv.conf without taking into account the glibc limitation on search domains documented at https://access.redhat.com/solutions/58028. In the customer environment, which already has 6 search suffixes (see the /etc/resolv.conf content below), this makes cluster.local the 7th suffix. When the Ansible playbook that installs OpenShift then tries to verify the cluster via the service short name apiserver.kube-service-catalog.svc (i.e. not an FQDN), curl fails with "Name or service not known".
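A quick diagnostic (just a sketch, not part of the playbook) is to count the domains on the search line; glibc only honors the first six, so anything at position 7 or later, including cluster.local here, is never tried:

  awk '/^search/ {print NF-1; exit}' /etc/resolv.conf   # prints the number of search domains
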

Customer /etc/hosts file looks like this:
[usrnrp@wi01vmd-ospc1 ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[usrnrp@wi01vmd-ospc1 ~]$

Customer /etc/resolv.conf file looks like this:
[usrnrp@wi01vmd-ospc1 ~]$ cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search tds.net svc.tds.net web.tds.net dns.tds.net sec.tds.net ent.tds.net
nameserver 69.128.137.195
nameserver 69.128.137.196

Actual results:
The customer patched /usr/share/ansible/openshift-ansible/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh before running the playbook to install OpenShift, and the installer was then able to verify apiserver.kube-service-catalog.svc successfully.

[usrnrp@wi01vmd-ospc0 ~]$ diff /usr/share/ansible/openshift-ansible/roles/openshift_node_dnsmasq/files/networkmanager/99-origin-dns.sh /tmp/99-origin-dns.sh
120,121c120
<         #sed -i '/^search/ s/$/ cluster.local/' ${NEW_RESOLV_CONF}
<         sed -i 's/^search.*$/search cluster.local/g' ${NEW_RESOLV_CONF}
---
>         sed -i '/^search/ s/$/ cluster.local/' ${NEW_RESOLV_CONF}
[usrnrp@wi01vmd-ospc0 ~]$ 

This patch might not be ideal since it again does not take glibc limitations into consideration and it wipes out whatever was in the search path to begin with, but for now it has gotten the customer moving forward.
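
For comparison, a less destructive variant (only a sketch, assuming the same ${NEW_RESOLV_CONF} variable used by 99-origin-dns.sh, and not necessarily what was merged upstream) would prepend cluster.local so it stays within glibc's first six search domains while keeping the customer's existing domains:

  sed -i '/^search/ s/^search /search cluster.local /' ${NEW_RESOLV_CONF}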

Comment 1 Vadim Rutkovsky 2018-02-12 10:52:32 UTC
It seems "cluster.local" should be listed first instead of being appended.

Comment 2 Vadim Rutkovsky 2018-02-12 11:36:54 UTC
Created https://github.com/openshift/openshift-ansible/pull/7103 for master

Comment 9 Greg Rodriguez II 2018-05-01 17:26:16 UTC
Customer still affected by this issue and requesting update on Bug.  Has there been any movement on this issue at all and are there any updates I can provide to the customer?

Comment 10 Vadim Rutkovsky 2018-05-11 09:34:09 UTC
(In reply to Greg Rodriguez II from comment #9)
> Customer still affected by this issue and requesting update on Bug.  Has
> there been any movement on this issue at all and are there any updates I can
> provide to the customer?

The solution is to update to RHEL 7.5 (the glibc resolver is fixed there) and to update to a later version of openshift-ansible. The PR to use cluster.local first has been merged and is available since openshift-ansible-3.7.31-1.
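
For reference, a quick way to confirm both pieces of the fix are present on a host (a rough check; the openshift-ansible RPM is typically only installed on the host running the playbooks):

  cat /etc/redhat-release
  rpm -q glibc openshift-ansible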

Comment 11 Jian Zhang 2018-05-15 10:02:35 UTC
Verified successfully.

Test steps:
1) Configure "cluster.local" as the 8th search domain and restart dnsmasq, as shown below:
[root@host-172-16-120-115 ~]# cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search tds.net svc.tds.net web.tds.net dns.net sec.net ent.tds.net tds.net cluster.local openstacklocal
nameserver 172.16.120.115

[root@host-172-16-120-115 ~]# systemctl restart dnsmasq.service

2) Installed the service catalog successfully using openshift-ansible, and it works well.

system info:
[root@host-172-16-120-115 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)

[root@host-172-16-120-115 ~]# rpm -qa | grep glibc
glibc-common-2.17-222.el7.x86_64
glibc-2.17-222.el7.x86_64

ansible version: openshift-ansible-3.9.27


Additional info:
I can reproduce this bug in RHEL 7.4.
[root@qe-zitang-gcemaster-etcd-1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[root@qe-zitang-gcemaster-etcd-1 ~]# rpm -qa | grep glibc
glibc-common-2.17-196.el7_4.2.x86_64
glibc-2.17-196.el7_4.2.x86_64

Comment 14 Vadim Rutkovsky 2019-01-16 12:56:45 UTC
(In reply to Vadim Rutkovsky from comment #10)
> (In reply to Greg Rodriguez II from comment #9)
> > Customer still affected by this issue and requesting update on Bug.  Has
> > there been any movement on this issue at all and are there any updates I can
> > provide to the customer?
> 
> The solution is to update to RHEL 7.5 (the glibc resolver is fixed there) and
> to update to a later version of openshift-ansible.

The advisory with the glibc fix is https://access.redhat.com/errata/RHSA-2018:0805

