Bug 1600551 - Intermittent dnsmasq outages [NEEDINFO]
Summary: Intermittent dnsmasq outages
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: Miciah Dashiel Butler Masters
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1620230
Blocks: 1267746 1609390 1614981 1614983 1614984
Reported: 2018-07-12 13:25 UTC by Robert Bost
Modified: 2019-03-29 06:35 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: By default, older versions of dnsmasq can use privileged, lower-numbered source ports for outbound DNS queries.
Consequence: Outbound DNS queries may be dropped; for example, firewall rules may drop queries coming from reserved ports.
Fix: We now configure dnsmasq using its min-port setting to set the minimum port number for outbound queries to 1024.
Result: DNS queries should no longer be dropped.
Additional information: dnsmasq 2.79 changes the default min-port setting to 1024.
Clone Of:
Clones: 1609390 1614981 1614983 1614984 1620230
Environment:
Last Closed: 2019-01-29 16:16:38 UTC
Target Upstream Version:
dmace: needinfo? (jfoots)


Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible pull 9539 | None | closed | [3.7] Adding min-port to dnsmasq configuration | 2019-12-04 05:36:27 UTC
Red Hat Knowledge Base (Solution) | 3558531 | None | None | None | 2018-08-16 15:51:46 UTC

Description Robert Bost 2018-07-12 13:25:57 UTC
Description of problem: Pods are experiencing intermittent DNS lookup failures when reaching out to dnsmasq. A similar upstream issue has been reported: https://github.com/kubernetes/kubernetes/issues/45976


Version-Release number of selected component (if applicable):
atomic-openshift-3.7.54-1.git.0.4a2366a.el7.x86_64 (or atomic-openshift-3.7.23-1.git.5.83efd71.el7.x86_64)
dnsmasq-2.76-5.el7.x86_64


How reproducible: Intermittently


Steps to Reproduce:
(This uses the amazonaws.com address from the upstream issue mentioned above, but it could really be any hostname.)
1) oc new-app https://github.com/bostrt/java-inetaddress.git
2) oc set env dc/java-inetaddress TEST_HOSTNAME=dynamodb.us-east-1.amazonaws.com
3) oc set env dc/java-inetaddress DELAY=1 # Set 1s delay between each DNS lookup.
4) oc scale dc/java-inetaddress --replicas=12 # Scale up to increase probability of failure.
5) Let it run for a while (mine started reproducing the issue around 3 or 4 hours after starting the test, though I imagine this could vary widely since the issue is so intermittent).
6) Use this to quickly check logs on all the running pods:
# for pod in $(oc get pods -l app=java-inetaddress | grep java-inetaddress | awk '{print $1}'); do echo "$pod"; oc logs "$pod"; done
7) The app will print out the current time and UnknownHostException when it occurs.

Actual results:
Example of real failure from one of my pods:
Thu Jul 12 00:35:16 UTC 2018
java.net.UnknownHostException: dynamodb.us-east-1.amazonaws.com: Name or service not known
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at java.net.InetAddress.getByName(InetAddress.java:1076)
	at Main.main(Main.java:29)
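
To get a per-pod failure count instead of reading full logs, the check in step 6 could be extended along these lines. This is only a sketch and not part of the original test: the function names count_unknown_host and summarize_pods are made up for illustration, and it assumes each failed lookup logs exactly one line containing java.net.UnknownHostException, as in the trace above.

```shell
#!/usr/bin/env bash

# Count failed lookups in log text read from stdin.
# Assumption: one "java.net.UnknownHostException" line per failure.
count_unknown_host() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'java.net.UnknownHostException' || true
}

# Print "<pod> <failure count>" for every pod of the test app.
# Requires a logged-in oc client; the label is the one used in step 6.
summarize_pods() {
    local pod
    for pod in $(oc get pods -l app=java-inetaddress --no-headers | awk '{print $1}'); do
        printf '%s %s\n' "$pod" "$(oc logs "$pod" | count_unknown_host)"
    done
}
```

After sourcing the file, running summarize_pods prints one line per pod, which makes it easier to see whether failures cluster on particular pods.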

Comment 9 Ryan Howe 2018-08-09 14:28:47 UTC
While working on another OpenShift dnsmasq issue, we determined that the problem occurs when dnsmasq uses a low source port number.

Setting min-port=1024 in the dnsmasq configuration worked around the issue.

--min-port=<port>
              Do not use ports less than that given as source for outbound DNS queries. Dnsmasq picks random ports as source for outbound queries: when this option is given, the ports used will always be larger than that specified. Useful for systems behind firewalls.


A dnsmasq bug was logged:
  https://bugzilla.redhat.com/show_bug.cgi?id=1614331


I was not able to reproduce the issue again with this configuration in place.
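
One way to check on a node whether queries are still leaving from reserved source ports is to capture only that traffic. The following is a sketch under stated assumptions: the helper names low_port_filter and watch_low_port_queries are hypothetical, upstream DNS is assumed to use UDP port 53, and tcpdump must be installed and run as root.

```shell
#!/usr/bin/env bash

# Build a pcap filter matching UDP DNS queries whose source port is
# below the given minimum (default 1024).
low_port_filter() {
    local min_port="${1:-1024}"
    printf 'udp and dst port 53 and src portrange 0-%d' "$((min_port - 1))"
}

# Capture matching packets on all interfaces (needs root).
# Output suggests that some process, likely dnsmasq, sent a query
# from a reserved source port.
watch_low_port_queries() {
    tcpdump -n -i any "$(low_port_filter "$@")"
}
```

Note that the filter matches any process sending DNS queries from a low source port, not only dnsmasq, so a hit should be cross-checked against the source address.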

Comment 10 Ryan Howe 2018-08-09 15:11:26 UTC
Created a PR to add this configuration via the Ansible installer for OpenShift:

https://github.com/openshift/openshift-ansible/pull/9505

Comment 13 Miciah Dashiel Butler Masters 2018-08-10 23:10:06 UTC
Backport for OCP 3.7.z: https://github.com/openshift/openshift-ansible/pull/9539

Comment 19 Ryan Howe 2018-08-16 15:53:17 UTC
Workaround:

# echo "min-port=1024" > /etc/dnsmasq.d/lowport.conf
# systemctl restart dnsmasq 

https://access.redhat.com/solutions/3558531
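
After applying the workaround, it can help to confirm which min-port values the dnsmasq configuration files actually contain. This is a minimal sketch with a hypothetical function name, assuming the min-port=<n> line format used in the workaround above.

```shell
#!/usr/bin/env bash

# Print every min-port value found in the given dnsmasq config files,
# one per line, in file order. Commented-out lines are ignored.
# Prints nothing if the option is not set in any of the files.
configured_min_ports() {
    sed -n 's/^[[:space:]]*min-port=\([0-9][0-9]*\).*$/\1/p' "$@"
}
```

For example, `configured_min_ports /etc/dnsmasq.conf /etc/dnsmasq.d/*.conf` would be expected to print 1024 once the workaround file is in place, provided no other file sets the option.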

Comment 28 Miciah Dashiel Butler Masters 2018-08-22 18:37:14 UTC
I opened bug 1620230 to track the dnsmasq configuration change for OCP 3.7.z so we can get that fix verified and shipped while we continue to determine other issues that could be causing the problems that we are tracking with this bug.

Comment 44 Dan Mace 2019-01-29 16:16:38 UTC
All linked cases are closed with resolutions that didn't require a new release. I'm going to close this bug, and we can open new bugs as necessary.

