Created attachment 1198922
Benchmark graph showing the errors when dnsmasq is restarted

Description of problem:
During load tests performed on our app in the AWS environment, we experienced intermittent hostname resolution errors. In the test run where we saw the DNS errors, it happened 3 times out of roughly 216000 attempts (test done via 10 threads). The attached graph from this benchmark shows that all 3 errors happened at the same time, i.e. during one and the same dnsmasq restart. So this DNS error is rare, but it can still happen (it took us 'only' half an hour of load to hit this issue).

The cause seems to be that "systemctl restart dnsmasq" is called from 99-origin-dns.sh each time a DHCP lease is renewed. DHCP leases renew every 30-60 minutes, and the probability of a request being affected by such a restart increases with the number of requests per second received on each node.

Version-Release number of selected component (if applicable):
oc v3.2.1.13-1-gc2a90e1
kubernetes v1.2.0-36-g4a3f9c5

How reproducible:
Always

Steps to Reproduce:
1. Benchmark against the pod
2. Wait for dhcp renewal
3.

Actual results:
Hostname resolution fails for a moment while dnsmasq is restarted.

Expected results:
Hostname resolution should not fail every time a DHCP lease is renewed.

Additional info:
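For anyone trying to reproduce this without a full benchmark, a lookup loop along these lines may be enough to catch the failure window. This is only a rough sketch: the function name probe_dns and the RESOLVE_CMD override are made up for illustration, and it assumes getent is available on the node or in the pod.

```shell
# Hypothetical probe: resolve a name repeatedly and count failed lookups.
# RESOLVE_CMD is an assumption for illustration; it defaults to "getent hosts"
# and can be overridden to dry-run the loop without touching DNS.
probe_dns() {
  # $1 = hostname to resolve, $2 = number of attempts
  failures=0
  i=0
  while [ "$i" -lt "$2" ]; do
    if ! ${RESOLVE_CMD:-getent hosts} "$1" >/dev/null 2>&1; then
      failures=$((failures + 1))
      # Timestamp each failure so it can be matched against dnsmasq restarts.
      echo "$(date '+%F %T') lookup failed for $1" >&2
    fi
    i=$((i + 1))
  done
  # Print the total number of failed lookups on stdout.
  echo "$failures"
}
```

Running something like `probe_dns www.google.com 216000` across a DHCP renewal and comparing the failure timestamps with the dnsmasq entries in /var/log/messages should show whether the two line up.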
Not sure if the installer is the right group for this... but it was my best guess for who owns the offending script.
Scott, I suspect we need to do the restart conditionally instead of every time the script is invoked.
https://github.com/openshift/openshift-ansible/pull/2690
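To illustrate the idea of a conditional restart: the sketch below compares the sorted contents of the installed upstream DNS config with a freshly generated candidate and restarts dnsmasq only when they differ. It is not taken from the linked pull request; the function names and the RESTART_CMD override are hypothetical, and the real dispatcher script may be structured differently.

```shell
# Hypothetical sketch: restart dnsmasq only when the upstream DNS config changed.

config_changed() {
  # $1 = installed config, $2 = newly generated candidate config
  # A missing installed file counts as a change.
  [ -f "$1" ] || return 0
  # Compare sorted contents so that a reordering alone is not a change.
  [ "$(sort "$1")" != "$(sort "$2")" ]
}

apply_upstream_dns() {
  # $1 = installed config, $2 = newly generated candidate config
  if config_changed "$1" "$2"; then
    cp "$2" "$1"
    # RESTART_CMD is an assumption for illustration; it allows a dry run.
    ${RESTART_CMD:-systemctl restart dnsmasq}
  fi
}
```

With this shape, a lease renewal that hands back the same nameservers leaves dnsmasq running, so the restart window only opens when the upstream servers actually change.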
Verified with openshift-ansible-3.4.20.

Created two environments, one with the new version of the script (openshift-ansible-3.4.20) and the other with the old version (openshift-ansible-3.2.13-1). Each environment has a pod that keeps trying to visit www.google.com. When the DHCP lease renewed, the new-script node did not restart dnsmasq, while the old-script node did.

[root@new-script-node ~]# grep "dnsmasq" /var/log/messages
...
Nov 9 17:56:51 new-script-node nm-dispatcher: + UPSTREAM_DNS=/etc/dnsmasq.d/origin-upstream-dns.conf
Nov 9 17:56:51 new-script-node nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']'
Nov 9 17:56:51 new-script-node nm-dispatcher: + sort /etc/dnsmasq.d/origin-upstream-dns.conf
...

[root@old-script-node ~]# grep "dnsmasq" /var/log/messages
...
Nov 9 17:41:35 old-script-node nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']'
Nov 9 17:41:35 old-script-node nm-dispatcher: + systemctl restart dnsmasq
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066