Bug 1566037

Summary: dnsmasq is not correctly forwarding to upstream nameservers when having internal nameserver on the network
Product: OpenShift Container Platform
Reporter: Filip Brychta <fbrychta>
Component: Installer
Assignee: Scott Dodson <sdodson>
Status: CLOSED NOTABUG
QA Contact: Johnny Liu <jialiu>
Severity: medium
Priority: medium
Version: 3.9.0
CC: aos-bugs, bward, jokerman, mgugino, mmccomas
Target Milestone: ---   
Target Release: 3.9.z   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2018-11-29 16:51:04 UTC
Type: Bug
Attachments:
Description                Flags
dnsmasq conf files         none
grepped forwarding msgs    none

Description Filip Brychta 2018-04-11 11:57:25 UTC
Created attachment 1420270
dnsmasq conf files

Description of problem:
There are two nameservers (master 10.16.23.35, slave 10.16.23.54) for jonqe.lab.eng.bos.redhat.com. The OpenShift master was installed on 10.16.23.46.

/etc/resolv.conf before installation:
# Generated by NetworkManager
search jonqe.lab.eng.bos.redhat.com
nameserver 10.16.23.35
nameserver 10.16.23.54
nameserver 10.11.5.19
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.5.30.160
nameserver 10.38.5.26

Master was installed successfully and everything was working fine.

/etc/resolv.conf after installation:
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local jonqe.lab.eng.bos.redhat.com
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.16.23.46

dnsmasq.conf, node-dnsmasq.conf, origin-dns.conf and origin-upstream-dns.conf are attached.

The content of origin-upstream-dns.conf is interesting (note that no domain is set for the internal nameservers):
server=10.16.23.35
server=10.16.23.54
server=10.11.5.19
server=10.5.30.160
server=10.38.5.26


With this configuration everything works for a while and DNS lookups behave as expected, e.g.:
Apr 11 02:30:33 b22 dnsmasq[11420]: query[A] ldap.corp.redhat.com from 10.16.23.46
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.16.23.35
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.38.5.26
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.5.30.160
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.11.5.19
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.16.23.54
Apr 11 02:30:33 b22 dnsmasq[11420]: query[AAAA] ldap.corp.redhat.com from 10.16.23.46
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.16.23.35
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.38.5.26
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.5.30.160
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.11.5.19
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.16.23.54
Apr 11 02:30:33 b22 dnsmasq[11420]: reply ldap.corp.redhat.com is <CNAME>$
Apr 11 02:30:33 b22 dnsmasq[11420]: reply corp.ldap.prod.int.rdu2.redhat.com is 10.11.200.20
Apr 11 02:30:33 b22 dnsmasq[11420]: reply ldap.corp.redhat.com is <CNAME>$
Apr 11 02:30:33 b22 dnsmasq[11420]: query[A] ldap.corp.redhat.com from 10.16.23.46
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.11.5.19
Apr 11 02:30:33 b22 dnsmasq[11420]: query[AAAA] ldap.corp.redhat.com from 10.16.23.46
Apr 11 02:30:33 b22 dnsmasq[11420]: forwarded ldap.corp.redhat.com to 10.11.5.19
Apr 11 02:30:33 b22 dnsmasq[11420]: reply ldap.corp.redhat.com is <CNAME>$
Apr 11 02:30:33 b22 dnsmasq[11420]: reply corp.ldap.prod.int.rdu2.redhat.com is 10.11.200.20
Apr 11 02:30:33 b22 dnsmasq[11420]: reply ldap.corp.redhat.com is <CNAME>

dnsmasq is correctly forwarding to the upstream nameservers and the reply is delivered.

After some time, resolving the same name again produces the following error:
Apr 11 02:55:13 b22 atomic-openshift-master-api: E0411 02:55:13.833410  129029 login.go:187] Error authenticating "fbrychta" with provider "rht_ldap_provider": LDAP Result Code 200 "Network Error": dial tcp: lookup ldap.corp.redhat.com on 10.16.23.46:53: read udp 10.16.23.46:53826->10.16.23.46:53: i/o timeout

The cause is the following:
Apr 11 02:55:03 b22 dnsmasq[128666]: query[A] ldap.corp.redhat.com from 10.16.23.46$
Apr 11 02:55:03 b22 dnsmasq[128666]: forwarded ldap.corp.redhat.com to 10.16.23.54

dnsmasq forwards the query only to 10.16.23.54, which does not know the answer.
After 5 seconds (the default timeout) the lookup is retried:

Apr 11 02:55:08 b22 dnsmasq[128666]: query[A] ldap.corp.redhat.com from 10.16.23.46$
Apr 11 02:55:08 b22 dnsmasq[128666]: forwarded ldap.corp.redhat.com to 10.16.23.54

Again the query is forwarded only to 10.16.23.54, without trying the other upstream nameservers from origin-upstream-dns.conf.
After another 5 seconds the lookup is retried with the cluster.local search domain appended:

Apr 11 02:55:13 b22 dnsmasq[128666]: query[A] ldap.corp.redhat.com.cluster.local from 10.16.23.46$
Apr 11 02:55:13 b22 dnsmasq[128666]: forwarded ldap.corp.redhat.com.cluster.local to 127.0.0.1$
Apr 11 02:55:13 b22 dnsmasq[128666]: forwarded ldap.corp.redhat.com.cluster.local to 127.0.0.1

and jonqe.lab.eng.bos.redhat.com:
Apr 11 02:55:13 b22 dnsmasq[128666]: query[A] ldap.corp.redhat.com.jonqe.lab.eng.bos.redhat.com from 10.16.23.46$
Apr 11 02:55:13 b22 dnsmasq[128666]: forwarded ldap.corp.redhat.com.jonqe.lab.eng.bos.redhat.com to 10.16.23.54

Note that the i/o timeout error is logged at the same time, Apr 11 02:55:13.
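
The order of the queries in the log excerpts above (bare name first, then the name with each search domain appended) is consistent with how a stub resolver expands a name using the `search` line of /etc/resolv.conf. The following is a minimal Python sketch of that ordering, assuming the rules described in resolv.conf(5) and the default `ndots` of 1. `candidate_names` is a hypothetical helper written for illustration, not a function from any resolver library.

```python
def candidate_names(name, search, ndots=1):
    """Return the names a stub resolver would try, in order.

    Assumption: rules as described in resolv.conf(5), ndots defaults to 1.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: search list first, the bare name last.
    return expanded + [name]


# Search list taken from /etc/resolv.conf after installation.
names = candidate_names(
    "ldap.corp.redhat.com",
    ["cluster.local", "jonqe.lab.eng.bos.redhat.com"],
)
for candidate in names:
    print(candidate)
```

Because the bare name already contains three dots, it is tried first; the search-domain variants follow only after that attempt has failed or timed out.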

I edited origin-upstream-dns.conf (added domains for internal nameservers):
server=/jonqe.lab.eng.bos.redhat.com/10.16.23.35
server=/jonqe.lab.eng.bos.redhat.com/10.16.23.54
server=10.11.5.19
server=10.5.30.160
server=10.38.5.26

With this change the issue is no longer visible and dnsmasq correctly forwards to an upstream nameserver that can answer:
Apr 11 07:36:10 b22 dnsmasq[9236]: query[A] ldap.corp.redhat.com from 10.16.23.46
Apr 11 07:36:10 b22 dnsmasq[9236]: forwarded ldap.corp.redhat.com to 10.11.5.19
Apr 11 07:36:10 b22 dnsmasq[9236]: reply ldap.corp.redhat.com is <CNAME>
Apr 11 07:36:10 b22 dnsmasq[9236]: reply ldap.corp.redhat.com is <CNAME>
Apr 11 07:36:10 b22 dnsmasq[9236]: reply corp.ldap.prod.int.rdu2.redhat.com is 10.11.200.20
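
The effect of the edited file can be sketched in Python. This is only an illustration of how domain-scoped `server=/domain/address` lines restrict which upstreams are eligible for a given name, assuming dnsmasq's documented rule that the most specific matching domain wins and unscoped servers handle everything else. `parse_server_lines` and `upstreams_for` are hypothetical helpers, not dnsmasq code, and they handle only the two line forms shown in this report.

```python
def parse_server_lines(conf_text):
    """Parse 'server=ADDR' and 'server=/DOMAIN/ADDR' lines into (domains, address)."""
    servers = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith("server="):
            continue
        value = line[len("server="):]
        if value.startswith("/"):
            # "/a/b/1.2.3.4" splits into ["", "a", "b", "1.2.3.4"]
            parts = value.split("/")
            domains, address = parts[1:-1], parts[-1]
        else:
            domains, address = [], value
        servers.append((domains, address))
    return servers


def upstreams_for(name, servers):
    """Return the upstream addresses eligible to receive a query for name."""
    best_len, chosen = -1, []
    for domains, address in servers:
        for domain in domains:
            if name == domain or name.endswith("." + domain):
                if len(domain) > best_len:
                    best_len, chosen = len(domain), [address]
                elif len(domain) == best_len:
                    chosen.append(address)
    if chosen:
        return chosen
    # No scoped domain matched: fall back to the unscoped servers.
    return [address for domains, address in servers if not domains]


conf = """\
server=/jonqe.lab.eng.bos.redhat.com/10.16.23.35
server=/jonqe.lab.eng.bos.redhat.com/10.16.23.54
server=10.11.5.19
server=10.5.30.160
server=10.38.5.26
"""
servers = parse_server_lines(conf)
print(upstreams_for("ldap.corp.redhat.com", servers))
print(upstreams_for("b22.jonqe.lab.eng.bos.redhat.com", servers))
```

Under these assumptions, external names such as ldap.corp.redhat.com are never offered to 10.16.23.35 or 10.16.23.54, which is consistent with the log above showing the query going to 10.11.5.19.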


Question 1: Why is dnsmasq forwarding only to 10.16.23.54 and not to the other upstream nameservers defined in origin-upstream-dns.conf (scenario without the workaround)?
Question 2: Should the installer be responsible for editing origin-upstream-dns.conf (adding domains for internal nameservers)?


Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.9.14-1.git.3.c62bc34.el7.noarch
rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch
ansible --version
ansible 2.4.3.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]


How reproducible:
1/1


Additional info:
The same issue applies to other external hostnames such as registry-1.docker.io and registry.access.redhat.com.

Comment 1 Scott Dodson 2018-04-11 12:23:44 UTC
Filip,

The dispatcher script /etc/NetworkManager/dispatcher.d/99-origin-dns.sh is what maintains the origin-upstream-dns.conf file, and it should put all nameservers that NetworkManager is informed about via DHCP or configuration into that file.

It seems like that's happening, right? You modified the config file and restarted, and things started working. What happens if you simply restart dnsmasq? I'm wondering if dnsmasq timed out once to those servers and then stopped sending requests there.

Can you look for other clues in the dnsmasq logs as to why it stopped sending requests to those hosts, and/or attach the logs to the bug?

Comment 2 Filip Brychta 2018-04-11 13:49:48 UTC
Yes, the dispatcher script /etc/NetworkManager/dispatcher.d/99-origin-dns.sh correctly puts all nameservers into origin-upstream-dns.conf.

I grepped all forwarding messages for registry-1.docker.io (attached). dnsmasq switches between nameservers according to some key (I guess some internal dnsmasq algorithm).
That means the timeout happens only when it picks a nameserver that doesn't know the answer (in our case 10.16.23.54; the other servers reply with an IP).

So the original description is not accurate: dnsmasq does not always use the 10.16.23.54 nameserver. It switches between the upstream servers, and DNS lookups time out only when it uses 10.16.23.54.
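
This kind of analysis can be repeated with a short script instead of grep. The following Python sketch counts, per upstream server, how often a given name was forwarded there. It assumes the `forwarded NAME to ADDRESS` log format shown in the description; `forwarded_counts` is a hypothetical helper, and the sample lines are copied from the description rather than from the attachment.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... dnsmasq[128666]: forwarded ldap.corp.redhat.com to 10.16.23.54"
FORWARDED = re.compile(r"dnsmasq\[\d+\]: forwarded (\S+) to (\S+)")


def forwarded_counts(log_lines, name):
    """Count how often queries for name were forwarded to each upstream."""
    counts = Counter()
    for line in log_lines:
        match = FORWARDED.search(line)
        if match and match.group(1) == name:
            # Some pasted lines end in '$' (an end-of-line marker); drop it.
            counts[match.group(2).rstrip("$")] += 1
    return counts


sample = [
    "Apr 11 02:55:03 b22 dnsmasq[128666]: query[A] ldap.corp.redhat.com from 10.16.23.46$",
    "Apr 11 02:55:03 b22 dnsmasq[128666]: forwarded ldap.corp.redhat.com to 10.16.23.54",
    "Apr 11 02:55:08 b22 dnsmasq[128666]: forwarded ldap.corp.redhat.com to 10.16.23.54",
    "Apr 11 07:36:10 b22 dnsmasq[9236]: forwarded ldap.corp.redhat.com to 10.11.5.19",
]
counts = forwarded_counts(sample, "ldap.corp.redhat.com")
for address, count in counts.most_common():
    print(address, count)
```

Comparing these counts with the timestamps of the timeouts shows whether failures line up with one particular upstream.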


Adding domain for 10.16.23.54 in origin-upstream-dns.conf so it looks like this:
server=/jonqe.lab.eng.bos.redhat.com/10.16.23.35
server=/jonqe.lab.eng.bos.redhat.com/10.16.23.54
server=10.11.5.19
server=10.5.30.160
server=10.38.5.26

probably makes dnsmasq use 10.16.23.35 and 10.16.23.54 only for the jonqe.lab.eng.bos.redhat.com domain, which resolves the issue because all the other nameservers can resolve registry-1.docker.io correctly.

Now I don't know whether this should be closed as not a bug, or whether the OpenShift installer should always set domains for internal nameservers in origin-upstream-dns.conf so that dnsmasq does not use them to resolve external hostnames.
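
To make the second option concrete, here is a Python sketch of how such a file could be generated. It is purely hypothetical: nothing in this report indicates that the installer or 99-origin-dns.sh does this, and `render_upstream_conf` and its `internal_domains` mapping are names invented for the illustration.

```python
def render_upstream_conf(nameservers, internal_domains=None):
    """Render dnsmasq 'server=' lines, scoping internal nameservers to a domain.

    internal_domains maps a nameserver address to the only domain it should
    be asked about (hypothetical input, e.g. supplied by an administrator).
    """
    internal_domains = internal_domains or {}
    lines = []
    for address in nameservers:
        domain = internal_domains.get(address)
        if domain:
            lines.append(f"server=/{domain}/{address}")
        else:
            lines.append(f"server={address}")
    return "\n".join(lines) + "\n"


conf_text = render_upstream_conf(
    ["10.16.23.35", "10.16.23.54", "10.11.5.19", "10.5.30.160", "10.38.5.26"],
    internal_domains={
        "10.16.23.35": "jonqe.lab.eng.bos.redhat.com",
        "10.16.23.54": "jonqe.lab.eng.bos.redhat.com",
    },
)
print(conf_text, end="")
```

The output is the same five lines as the manually edited file above. The open question remains where the address-to-domain mapping would come from, since /etc/resolv.conf does not record which nameservers are internal.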

Comment 3 Filip Brychta 2018-04-11 13:50:21 UTC
Created attachment 1420331
grepped forwarding msgs

Comment 4 Scott Dodson 2018-04-11 17:16:37 UTC
In the past we added the 'strict-order' option, which ensured that dnsmasq always progressed through the list in order, but that also caused problems because it had to wait for each server to time out. By default dnsmasq biases against any server that times out. Rather than timing out, can 10.16.23.35 / 10.16.23.54 be made to return NXDOMAIN or SERVFAIL for the requests on which they refuse to recurse?

You can probably add those two lines to a new dropin file but I'm not certain which would win.

Comment 5 Filip Brychta 2018-04-16 16:23:00 UTC
The 'strict-order' option resolved the issue too.
As a final solution I updated the configuration of our internal nameservers to allow recursive queries and added forwarders, so they always send an answer. This way there is no need for the 'strict-order' option or for adding a domain as described in comment 2.

I have only basic knowledge in this area, so I don't know how to resolve this bz.
Maybe just add a note to the documentation about the default behavior of dnsmasq after installation?

Comment 6 Scott Dodson 2018-04-16 19:24:12 UTC
Ok, we can do that. I'll write up a summary and send this over to the docs component.

Comment 7 Michael Gugino 2018-11-29 16:51:04 UTC
Attached customer case no longer seems related.

This scenario is expected behavior for dnsmasq. Since dnsmasq receives a reply from the first upstream server it tries (a reply that the record does not exist), it does not continue to try other servers. While this does not match the behavior of /etc/hosts, it does match the behavior specified for DNS. All upstream DNS servers configured in dnsmasq should be able to resolve the same set of records. The reporter appears to have achieved this by enabling recursive DNS on their internal DNS servers.