1374170 – hostname resolution errors

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.2.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Tim Bielawa
QA Contact:	Wenkai Shi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-09-08 07:40 UTC by Brendan Mchugh
Modified:	2017-03-08 18:43 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The openshift-ansible networkmanager config script was unconditionally restarting the dnsmasq service every time it ran. Consequence: Hostname resolution would fail temporarily while the dnsmasq service is restarting. Fix: The openshift-ansible networkmanager config script only restarts the dnsmasq service if a change was detected in the upstream DNS resolvers. Result: Hostname resolution will continue to function as expected.
Clone Of:
Environment:
Last Closed:	2017-01-18 12:53:46 UTC
Target Upstream Version:
Embargoed:

Attachments
Benchmark graph showing the errors when dnsmasq is restarted (41.44 KB, image/png) 2016-09-08 07:40 UTC, Brendan Mchugh

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:0066	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.4 RPM Release Advisory	2017-01-18 17:23:26 UTC

Description Brendan Mchugh 2016-09-08 07:40:39 UTC

Created attachment 1198922
Benchmark graph showing the errors when dnsmasq is restarted

Description of problem:
During load tests performed on our app in the AWS environment, we experienced intermittent hostname resolution errors.

In the test run where we saw the DNS errors, they occurred 3 times out of approximately 216,000 attempts [test run with 10 threads].

The attached graph from this benchmark shows that all 3 errors occurred at the same time, i.e. during a single dnsmasq restart. The error is therefore rare, but it can still happen [it took us only half an hour of load to hit this issue].

The cause appears to be that 99-origin-dns.sh calls "systemctl restart dnsmasq" each time a DHCP lease is renewed.

DHCP leases renew every 30-60 minutes, and the probability of a request being affected by such a restart increases with the number of requests per second received on each node.


Version-Release number of selected component (if applicable):
oc v3.2.1.13-1-gc2a90e1
kubernetes v1.2.0-36-g4a3f9c5


How reproducible:
Always


Steps to Reproduce:
1. Benchmark against the pod
2. Wait for dhcp renewal
3. 
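
For anyone trying to reproduce this without the full benchmark, a minimal sketch of a lookup loop follows. It is not the tool used for the attached graph. The function name count_resolution_failures and its arguments are made up for illustration, and the default resolver command (getent hosts) is an assumption about what is available on the node or in the pod.

```shell
#!/bin/bash
# Hypothetical helper, not part of openshift-ansible.
# count_resolution_failures HOST ATTEMPTS [RESOLVER_CMD...]
# Runs RESOLVER_CMD HOST the given number of times and prints how many
# lookups failed. RESOLVER_CMD defaults to "getent hosts" (assumption).
count_resolution_failures() {
  local host=$1 attempts=$2 failures=0 i
  shift 2
  # Fall back to the default resolver command when none was given.
  [ $# -gt 0 ] || set -- getent hosts
  for ((i = 0; i < attempts; i++)); do
    # Any non-zero exit status is counted as a failed lookup.
    "$@" "$host" > /dev/null 2>&1 || failures=$((failures + 1))
  done
  echo "$failures"
}
```

Running something like count_resolution_failures www.google.com 100000 across a DHCP lease renewal should report a small non-zero count on an affected node, and 0 otherwise.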


Actual results:
Hostname resolution fails for a moment while dnsmasq is restarted.


Expected results:
Hostname resolution should not fail every time a DHCP lease is renewed.

Additional info:

Comment 1 Ben Bennett 2016-09-08 13:37:50 UTC

Not sure if the installer is the right group for this... but it was my best guess for who owns the offending script.

Comment 2 Jason DeTiberus 2016-09-08 17:32:08 UTC

Scott, I suspect we need to do the restart conditionally instead of every time the script is invoked.
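
A rough sketch of what a conditional restart could look like, for illustration only. This is not the code from 99-origin-dns.sh or from the eventual fix. The function name update_upstream_dns and its arguments are hypothetical, and comparing sorted file contents is an assumption about how a change in the upstream resolvers might be detected.

```shell
#!/bin/bash
# Hypothetical helper, not the actual openshift-ansible implementation.
# update_upstream_dns NEW_FILE CURRENT_FILE [RESTART_CMD...]
# Installs NEW_FILE as CURRENT_FILE and runs RESTART_CMD only when the
# resolver list differs. Returns 0 if a change was applied, 1 if unchanged.
update_upstream_dns() {
  local new_file=$1 current_file=$2
  shift 2
  # Compare sorted contents so that a different ordering of the same
  # resolvers does not count as a change.
  if [ -f "$current_file" ] &&
     [ "$(sort "$new_file")" = "$(sort "$current_file")" ]; then
    return 1
  fi
  cp "$new_file" "$current_file"
  # Run the restart command only if one was supplied.
  if [ $# -gt 0 ]; then
    "$@"
  fi
  return 0
}
```

A dispatcher script could then call it as update_upstream_dns "$new_conf" /etc/dnsmasq.d/origin-upstream-dns.conf systemctl restart dnsmasq, where new_conf is a temporary file written from the resolvers in the current lease (also an assumption).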

Comment 4 Tim Bielawa 2016-11-01 16:28:18 UTC

https://github.com/openshift/openshift-ansible/pull/2690

Comment 6 Wenkai Shi 2016-11-10 02:58:12 UTC

Verified with openshift-ansible-3.4.20

Created two environments, one with the new version of the script (openshift-ansible-3.4.20) and one with the old version (openshift-ansible-3.2.13-1). Each environment has a pod that keeps trying to reach www.google.com. When the DHCP lease was renewed, the node with the new script did not restart dnsmasq, while the node with the old script did.

[root@new-script-node ~]# grep "dnsmasq" /var/log/messages
...
Nov  9 17:56:51 new-script-node nm-dispatcher: + UPSTREAM_DNS=/etc/dnsmasq.d/origin-upstream-dns.conf
Nov  9 17:56:51 new-script-node nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']'
Nov  9 17:56:51 new-script-node nm-dispatcher: + sort /etc/dnsmasq.d/origin-upstream-dns.conf
...

[root@old-script-node ~]# grep "dnsmasq" /var/log/messages
...
Nov  9 17:41:35 old-script-node nm-dispatcher: + '[' '!' -f /etc/dnsmasq.d/origin-dns.conf ']'
Nov  9 17:41:35 old-script-node nm-dispatcher: + systemctl restart dnsmasq
...

Comment 8 errata-xmlrpc 2017-01-18 12:53:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
