| Summary: | hostname resolution errors | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Brendan Mchugh <bmchugh> | ||||
| Component: | Installer | Assignee: | Tim Bielawa <tbielawa> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Wenkai Shi <weshi> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 3.2.1 | CC: | aos-bugs, jokerman, mmccomas, tbielawa | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: |
Cause: The openshift-ansible networkmanager config script was unconditionally restarting the dnsmasq service every time it ran.
Consequence: Hostname resolution would fail temporarily while the dnsmasq service is restarting.
Fix: The openshift-ansible networkmanager config script only restarts the dnsmasq service if a change was detected in the upstream DNS resolvers.
Result: Hostname resolution will continue to function as expected.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2017-01-18 12:53:46 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
Not sure if the installer is the right group for this... but it was my best guess for who owns the offending script. Scott, I suspect we need to do the restart conditionally instead of every time the script is invoked. Verified with openshift-ansible-3.4.20 Create two environment, one of them with new version script(openshift-ansible-3.4.20), another with old version script(openshift-ansible-3.2.13-1). In each environment has a pod that keep trying visit www.google.com. When DHCP leases renew, the new-script-node didn't restart dnsmasq, and the old-script-node restart dnsmasq. [root@new-script-node ~]# grep "dnsmasq" /var/log/messages ... Nov 9 17:56:51 new-script-node nm-dispatcher: + UPSTREAM_DNS=/etc/dnsmasq.d/origin-upstream-dns.conf Nov 9 17:56:51 new-script-node nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']' Nov 9 17:56:51 new-script-node nm-dispatcher: + sort /etc/dnsmasq.d/origin-upstream-dns.conf ... [root@old-script-node ~]# grep "dnsmasq" /var/log/messages ... Nov 9 17:41:35 old-script-node nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']' Nov 9 17:41:35 old-script-node nm-dispatcher: + systemctl restart dnsmasq ... Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066 |
Created attachment 1198922 [details] Benchmark graph showing the errors when dnsmasq is restarted Description of problem: During load tests performed on the our app in the AWS environment we experienced intermittent hostname resolution errors. From this test run [where we have experienced the dns errors] it happened 3 times out of +/- 216000 attempts [test done via 10 threads]. Attached you shall find the graphics from this benchmark, where it is seen that all 3 errors happened at the same time, i.e. during one and the same dnsmasq restart - so the occurrence of this dns error is rare, but still may happen [it took us 'only' half an hour of load to hit this issue]. The cause seems to be when "systemctl restart dnsmasq" is called from 99-origin-dns.sh each time a DHCP lease is renewed. DHCP leases renew ever 30-60 minutes and the probability of a request being affected by such restarts increases with the number of requests per second received on each node. Version-Release number of selected component (if applicable): oc v3.2.1.13-1-gc2a90e1 kubernetes v1.2.0-36-g4a3f9c5 How reproducible: Always Steps to Reproduce: 1. Benchmark against the pod 2. Wait for dhcp renewal 3. Actual results: hostname resolution fails for a moment while dnsmasq is restarted. Expected results: hostname resolution should not fail every time a dhcp lease is renewed. Additional info: