Bug 1399756 - Changes in dnsmasq to increase resilience when external primary DNS is down
Summary: Changes in dnsmasq to increase resilience when external primary DNS is down
Keywords:
Status: CLOSED DUPLICATE of bug 1399577
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Scott Dodson
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-11-29 16:36 UTC by Javier Ramirez
Modified: 2016-11-30 14:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-30 13:39:30 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1399577 | 0 | medium | CLOSED | [3.4] dnsmasq should not set strict-order | 2021-02-22 00:41:40 UTC

Internal Links: 1399577

Description Javier Ramirez 2016-11-29 16:36:28 UTC
In an OpenShift Container Platform environment, when the primary DNS fails, we see general issues on OpenShift masters and nodes even though there is a working secondary DNS. These are mostly caused by the timeout that has to expire before the secondary DNS is reached.

By default dnsmasq actually queries multiple servers simultaneously. This is disabled by the `strict-order` configuration which causes dnsmasq to query servers one at a time. 

What would be the consequences of removing strict-order?

What other options do we have to tune dnsmasq timeouts?

Comment 1 Ryan Howe 2016-11-29 20:11:46 UTC
1.
I do not know the consequences of removing strict-order, but I cannot see much harm in doing so, as dnsmasq will favour DNS servers with more specific domains. This means it should still favour SkyDNS for all queries in the domain "cluster.local".


2.
There is no way to directly configure a timeout in dnsmasq.

Instead, the timeout is configured in resolv.conf, which sets the timeout for the resolver that dnsmasq uses. The lowest value that can be set is 1 second.

* Also note: if NetworkManager manages your resolv.conf, then to set this value either add dns=none to NetworkManager's config or create a dispatcher script that adds the option.

Example: 

# cat /etc/resolv.conf 

search example.com
nameserver 192.168.0.6
options timeout:10


# cat /etc/dnsmasq.d/origin-upstream-dns.conf 
strict-order
no-resolv
domain-needed
server=8.8.8.8
server=192.168.0.3


# time dig master-1.example.com

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.4 <<>> master-1.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30392
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;master-1.example.com.		IN	A

;; ANSWER SECTION:
master-1.example.com.	60	IN	A	192.168.0.4

;; AUTHORITY SECTION:
example.com.		10	IN	NS	dns.example.com.

;; ADDITIONAL SECTION:
infra.example.com.	60	IN	A	192.168.0.3

;; Query time: 2 msec
;; SERVER: 192.168.0.6#53(192.168.0.6)
;; WHEN: Tue Nov 29 14:57:41 EST 2016
;; MSG SIZE  rcvd: 101


real	0m10.019s
user	0m0.010s
sys	0m0.008s
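
The dispatcher-script approach mentioned in the NetworkManager note could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `set_resolv_timeout` is hypothetical, and it handles only lines that consist of a single timeout option.

```shell
#!/bin/sh
# Hypothetical helper (not part of NetworkManager or OpenShift):
# make sure a resolv.conf-style file ends with "options timeout:<seconds>".
set_resolv_timeout() {
    _rt_file="$1"
    _rt_secs="$2"
    _rt_work="$(mktemp)" || return 1
    # Drop lines that contain only a timeout option; combined option lines
    # such as "options timeout:1 attempts:2" are deliberately left alone.
    if sed -e '/^options[[:space:]][[:space:]]*timeout:[0-9][0-9]*[[:space:]]*$/d' \
            "$_rt_file" > "$_rt_work"; then
        printf 'options timeout:%s\n' "$_rt_secs" >> "$_rt_work"
        # Write back through the original path so ownership and mode are kept.
        cat "$_rt_work" > "$_rt_file"
        _rt_rc=$?
    else
        _rt_rc=1
    fi
    rm -f "$_rt_work"
    return "$_rt_rc"
}
```

A script placed under /etc/NetworkManager/dispatcher.d/ could call `set_resolv_timeout /etc/resolv.conf 1` when an interface comes up. Whether that ordering works on a given node depends on what else rewrites resolv.conf there, so treat it as a starting point.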

Comment 2 Scott Dodson 2016-11-29 22:57:17 UTC
I think we can remove the strict-order option. If we do that, dnsmasq will prefer servers that it knows to be up, which should avoid any need to tune the timeout. This will probably also address the issues in Bug 1399577.
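
As a rough sketch of what dropping the option by hand might look like on an already-installed node, assuming the config file shown in comment 1: the function name `drop_strict_order` is hypothetical and is not something the installer provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of the installer):
# remove a bare "strict-order" line from a dnsmasq config file and
# leave every other line (no-resolv, domain-needed, server=...) untouched.
drop_strict_order() {
    _dso_conf="$1"
    _dso_work="$(mktemp)" || return 1
    # Match lines holding only "strict-order", allowing surrounding whitespace.
    if sed -e '/^[[:space:]]*strict-order[[:space:]]*$/d' \
            "$_dso_conf" > "$_dso_work"; then
        # Write back through the original path so ownership and mode are kept.
        cat "$_dso_work" > "$_dso_conf"
        _dso_rc=$?
    else
        _dso_rc=1
    fi
    rm -f "$_dso_work"
    return "$_dso_rc"
}
```

Usage would be along the lines of `drop_strict_order /etc/dnsmasq.d/origin-upstream-dns.conf` followed by `systemctl restart dnsmasq`. If something on the node regenerates that file, the manual change would be lost, so the lasting fix belongs in whatever writes the file.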

Comment 4 Scott Dodson 2016-11-30 13:39:30 UTC

*** This bug has been marked as a duplicate of bug 1399577 ***

Comment 5 Ryan Howe 2016-11-30 14:58:28 UTC
Correction to comment 1: the timeout option works when set in resolv.conf, but I am not sure how dnsmasq picks it up, as I do not know how dnsmasq uses the glibc resolver. It might just accept some of the options that are set there. I have confirmed that it works (even when the no-resolv option is set in the dnsmasq config).

