Bug 1399756

Summary:	Changes in dnsmasq to increase resilience when external primary DNS is down
Product:	OpenShift Container Platform
Component:	Installer
Version:	3.3.0
Status:	CLOSED DUPLICATE
Severity:	medium
Priority:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Reporter:	Javier Ramirez <javier.ramirez>
Assignee:	Scott Dodson <sdodson>
QA Contact:	Johnny Liu <jialiu>
CC:	aos-bugs, ghuang, jokerman, mmccomas, rhowe
Keywords:	UpcomingRelease
Type:	Bug
Last Closed:	2016-11-30 13:39:30 UTC

Description Javier Ramirez 2016-11-29 16:36:28 UTC

In an OpenShift Container Platform environment, when the primary DNS server fails, we see general issues on OpenShift masters and nodes even though a working secondary DNS server is available. These issues are mostly caused by the time it takes to time out and fall back to the secondary DNS server.

By default, dnsmasq sends queries to any of the upstream servers it knows about and tries to favour servers that are known to be up. This behaviour is disabled by the `strict-order` option, which makes dnsmasq try the servers one at a time, strictly in order.

What would be the consequences of removing strict-order?

What other options do we have to tune dnsmasq timeouts?

Comment 1 Ryan Howe 2016-11-29 20:11:46 UTC

1. I do not know the consequences of removing strict-order, but I cannot see much harm in doing so, as dnsmasq will favour DNS servers configured for more specific domains. That means it should still favour SkyDNS for all queries for the domain "cluster.local".

2. There is no way to directly configure a timeout in dnsmasq.

The timeout would be configured in resolv.conf, which sets the timeout for the resolver used by dnsmasq. The lowest value that can be set is 1 second.

* Also note: if NetworkManager is managing your resolv.conf, then to set this value either modify NetworkManager's configuration by adding dns=none, or create a dispatch script that adds this option.

Example: 

# cat /etc/resolv.conf 

search example.com
nameserver 192.168.0.6
options timeout:10


# cat /etc/dnsmasq.d/origin-upstream-dns.conf 
strict-order
no-resolv
domain-needed
server=8.8.8.8
server=192.168.0.3


# time dig master-1.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.4 <<>> master-1.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30392
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master-1.example.com.		IN	A

;; ANSWER SECTION:
master-1.example.com.	60	IN	A	192.168.0.4

;; AUTHORITY SECTION:
example.com.		10	IN	NS	dns.example.com.

;; ADDITIONAL SECTION:
infra.example.com.	60	IN	A	192.168.0.3

;; Query time: 2 msec
;; SERVER: 192.168.0.6#53(192.168.0.6)
;; WHEN: Tue Nov 29 14:57:41 EST 2016
;; MSG SIZE  rcvd: 101


real	0m10.019s
user	0m0.010s
sys	0m0.008s
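
For the dispatch script option mentioned above, a minimal sketch could look like the following. It is not taken from the installer. The file name, the function name ensure_timeout_option, and the RESOLV_CONF and TIMEOUT variables are all assumptions made for illustration. It relies only on NetworkManager passing the interface name and the action to dispatcher scripts as the first and second arguments.

```shell
#!/bin/sh
# Hypothetical NetworkManager dispatcher script, for example saved as
# /etc/NetworkManager/dispatcher.d/99-resolv-timeout (name is an assumption).
# NetworkManager calls dispatcher scripts with: $1 = interface, $2 = action.

# Append "options timeout:N" to a resolv.conf-style file unless an options
# line that already sets a timeout is present.
ensure_timeout_option() {
    file=$1
    seconds=$2
    if grep -Eq '^options.*timeout:' "$file"; then
        return 0
    fi
    # Make sure the new line does not get glued onto an unterminated last line.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    printf 'options timeout:%s\n' "$seconds" >> "$file"
}

# Only act when an interface comes up. RESOLV_CONF and TIMEOUT can be
# overridden so the script can be tried against a copy first.
case "${2:-}" in
    up)
        ensure_timeout_option "${RESOLV_CONF:-/etc/resolv.conf}" "${TIMEOUT:-1}"
        ;;
esac
```

Trying it against a scratch copy of resolv.conf first (by setting RESOLV_CONF to that copy and passing "up" as the second argument) is safer than pointing it at the live file.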

Comment 2 Scott Dodson 2016-11-29 22:57:17 UTC

I think we can remove the strict-order option. If we do that, dnsmasq will prefer servers that it knows to be up, which should avoid any need to tune the timeout. This will probably also address the issues in Bug 1399577.
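
As a rough illustration of that change (not the installer's actual fix), the sketch below writes a copy of a dnsmasq config file with the strict-order line dropped and every other directive left untouched. The function name drop_strict_order and the scratch output path are assumptions made for this example.

```shell
#!/bin/sh
# Write a copy of a dnsmasq config file without its strict-order line.
# All other directives (no-resolv, domain-needed, server=...) are kept as-is.
drop_strict_order() {
    src=$1
    dst=$2
    sed '/^[[:space:]]*strict-order[[:space:]]*$/d' "$src" > "$dst"
}

# Example against a scratch copy (the output path is illustrative only):
# drop_strict_order /etc/dnsmasq.d/origin-upstream-dns.conf /tmp/origin-upstream-dns.conf
```

Applied to the origin-upstream-dns.conf shown in comment 1, this would leave no-resolv, domain-needed and the two server= lines. The result should be reviewed, and dnsmasq restarted, before relying on it.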

Comment 4 Scott Dodson 2016-11-30 13:39:30 UTC


*** This bug has been marked as a duplicate of bug 1399577 ***

Comment 5 Ryan Howe 2016-11-30 14:58:28 UTC

Correction to comment 1: the timeout option does work when set in resolv.conf, but I am not sure how dnsmasq picks it up, as I do not know how dnsmasq uses the glibc resolver. It might just accept some of the options that are set there. I have confirmed that it works, even when the no-resolv option is set in the dnsmasq config.