Description of problem:
Pods are experiencing intermittent DNS lookup failures when reaching out to dnsmasq. A similar upstream issue has been reported: https://github.com/kubernetes/kubernetes/issues/45976

Version-Release number of selected component (if applicable):
atomic-openshift-3.7.54-1.git.0.4a2366a.el7.x86_64 (or atomic-openshift-3.7.23-1.git.5.83efd71.el7.x86_64)
dnsmasq-2.76-5.el7.x86_64

How reproducible:
Intermittently

Steps to Reproduce (uses the amazonaws.com address like in the upstream issue mentioned above; it could really be any hostname):
1) oc new-app https://github.com/bostrt/java-inetaddress.git
2) oc set env dc/java-inetaddress TEST_HOSTNAME=dynamodb.us-east-1.amazonaws.com
3) oc set env dc/java-inetaddress DELAY=1  # Set a 1s delay between DNS lookups.
4) oc scale dc/java-inetaddress --replicas=12  # Scale up to increase the probability of failure.
5) Let it run for a while. (Mine started reproducing the issue around 3 or 4 hours after starting the test, though this could vary wildly since the issue is so intermittent.)
6) Use this to quickly check the logs on all the running pods:
   # for pod in $(oc get pods -l app=java-inetaddress | grep java-inetaddress | awk '{print $1}'); do echo $pod; oc logs $pod; done
7) The app prints the current time and an UnknownHostException when one occurs.

Actual results:
Example of a real failure from one of my pods:

Thu Jul 12 00:35:16 UTC 2018
java.net.UnknownHostException: dynamodb.us-east-1.amazonaws.com: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
        at java.net.InetAddress.getAllByName(InetAddress.java:1192)
        at java.net.InetAddress.getAllByName(InetAddress.java:1126)
        at java.net.InetAddress.getByName(InetAddress.java:1076)
        at Main.main(Main.java:29)
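For reference, the test app just loops on InetAddress.getByName and prints a timestamped error when a lookup fails. A rough Python analogue of that loop (the real app is Java; `resolve` and `lookup_loop` are my names, not code from the repo):

```python
# Hypothetical sketch of the java-inetaddress test app's behavior. The real
# app is Java (see Main.java in the repo above); this mirrors its loop.
import os
import socket
import time
from datetime import datetime

def resolve(hostname):
    """Return the address list for hostname, raising socket.gaierror on failure."""
    return socket.gethostbyname_ex(hostname)[2]

def lookup_loop():
    # TEST_HOSTNAME and DELAY mirror the env vars set on the DeploymentConfig.
    hostname = os.environ.get("TEST_HOSTNAME", "dynamodb.us-east-1.amazonaws.com")
    delay = float(os.environ.get("DELAY", "1"))
    while True:
        try:
            resolve(hostname)
        except socket.gaierror as exc:
            # Same shape as the Java app's output: timestamp plus the error.
            print(datetime.now().strftime("%c"), exc)
        time.sleep(delay)

# lookup_loop()  # uncomment to run the loop indefinitely
```

Running many replicas of a loop like this simply raises the odds of catching the intermittent failure in the logs.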
While working another OpenShift dnsmasq issue, we found that the problem occurs when dnsmasq uses a low source port number. Setting min-port=1024 in the dnsmasq configuration worked around the issue. From the dnsmasq man page:

--min-port=<port>
    Do not use ports less than that given as source for outbound DNS queries. Dnsmasq picks random ports as source for outbound queries: when this option is given, the ports used will always be larger than that specified. Useful for systems behind firewalls.

A dnsmasq bug was logged: https://bugzilla.redhat.com/show_bug.cgi?id=1614331

I was not able to reproduce the issue again with this configuration in place.
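dnsmasq implements this port selection in C, but the effect of min-port can be sketched in a few lines: pick a random source port no lower than the configured minimum before sending the outbound query. A minimal illustration (bind_query_socket is a made-up name for this sketch, not a dnsmasq API):

```python
# Illustration only: mimics how a resolver might honour a min-port setting
# when choosing the source port for an outbound UDP DNS query.
import random
import socket

def bind_query_socket(min_port=1024, max_port=65535):
    """Bind a UDP socket to a random source port >= min_port, retrying on collisions."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        port = random.randint(min_port, max_port)
        try:
            sock.bind(("0.0.0.0", port))
            return sock  # source port is now guaranteed to be >= min_port
        except OSError:
            continue  # port already in use; pick another
```

With min_port=1024, no query is ever sourced from a privileged (<1024) port, which is what the workaround relies on.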
Created PR to add this configuration via the ansible installer for OpenShift: https://github.com/openshift/openshift-ansible/pull/9505
Backport for OCP 3.7.z: https://github.com/openshift/openshift-ansible/pull/9539
Workaround:
# echo "min-port=1024" > /etc/dnsmasq.d/lowport.conf
# systemctl restart dnsmasq

https://access.redhat.com/solutions/3558531
I opened bug 1620230 to track the dnsmasq configuration change for OCP 3.7.z so we can get that fix verified and shipped while we continue to determine other issues that could be causing the problems that we are tracking with this bug.
All linked cases are closed and with resolutions that didn't require a new release. I'm going to close this bug and we can open new bugs as necessary.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days