Description of problem:

Customer is seeing intermittent UnknownHostException errors in multiple apps on multiple clusters.

Sep 6 12:12:01 node1.example.com ocp.dev2.container: {
  "docker": {
    "container_id": "587ae1d24d171b6eb73ef3b0970f9625eadf338eddba2f20e085b7cfa67a764c"
  },
  "kubernetes": {
    "namespace_name": "app-b4ea56-1",
    "pod_name": "app-2-cvxz6",
    "labels": {
      "application": "myapp",
      "deploymentConfig": "app",
      "microservice": "app"
    },
    "host": "node1.example.com",
    "container_name": "app"
  },
  "hostname": "node1",
  "message": "org.springframework.web.client.ResourceAccessException: I\/O error on GET request for \"http:\/\/some-other-service:8080\/health\": some-other-service; nested exception is java.net.UnknownHostException: some-other-service",
  "@timestamp": "2018-09-06T12:12:01.193478-04:00"
}

Version-Release number of selected component (if applicable):

3.9

Actual results:

We enabled dnsmasq query logging. For the message above we see:

Sep 06 12:12:01 node1 dnsmasq[14214]: query[A] some-other-service from 100.xx.xx.157
Sep 06 12:12:01 node1 dnsmasq[14214]: config some-other-service is NODATA-IPv4
Sep 06 12:12:01 node1 dnsmasq[14214]: query[AAAA] some-other-service from 100.xx.xx.157
Sep 06 12:12:01 node1 dnsmasq[14214]: config some-other-service is NODATA-IPv6

We also see the same errors periodically when reaching out to external (non-OpenShift) services:

Sep 6 13:38:20 node2.example.com ocp.dev2.container: {
  "docker": {
    "container_id": "f258f0680dfa0e369a1cfca7387456736f5c8b41a510b4508fa0cfcbe89eb539"
  },
  "kubernetes": {
    "namespace_name": "myproject",
    "pod_name": "mypod",
    "labels": {
      "application": "someapp",
      "deploymentConfig": "somedc",
      "microservice": "someapp"
    },
    "host": "node2",
    "container_name": "mycontainer"
  },
  "hostname": "node2",
  "message": "Caused by: java.net.UnknownHostException: SOMEOTHERHOST.text.example2.com",
  "@timestamp": "2018-09-06T13:38:20.059681-04:00"
}

Strangely, the dnsmasq query logs do NOT show anything about this query at that time.

We also tried enabling SkyDNS debug logging, but SkyDNS does not seem to log anything in atomic-openshift-node (possibly we are setting the flag wrong?):

root 50072 1 55 Sep05 ? 15:27:19 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2 --logspec=dns*=8

I can't rule out that this is an issue with dnsmasq being overloaded or something similar, but that would not explain why dnsmasq logs a NODATA answer, which implies it searched its cache and upstream and did not find the record. (At least that's my understanding from https://umbrella.cisco.com/blog/2014/06/23/nxdomain-nodata-debugging-dns-dual-stacked-hosts/)
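For reference, the checks below are what we plan to run on an affected node during a failure window to separate dnsmasq behaviour from SkyDNS behaviour. This is only a sketch: the drop-in file name, log path, and placeholder addresses are examples, and the SkyDNS listener address/port should be read from the server=/cluster.local/... forwarding line the installer places under /etc/dnsmasq.d/ (exact file name varies by version).

# enable/confirm dnsmasq query logging via a drop-in (file name and log path are examples only)
printf 'log-queries\nlog-facility=/var/log/dnsmasq-queries.log\n' > /etc/dnsmasq.d/zz-query-log.conf
systemctl restart dnsmasq

# find where dnsmasq forwards cluster queries; that address is the node-local SkyDNS listener
grep -r 'server=/' /etc/dnsmasq.conf /etc/dnsmasq.d/

# reproduce the failing lookup against dnsmasq itself...
dig @<node-ip> some-other-service.app-b4ea56-1.svc.cluster.local A +short

# ...and directly against the SkyDNS address/port found above, bypassing dnsmasq
dig @<skydns-addr> -p <skydns-port> some-other-service.app-b4ea56-1.svc.cluster.local A +short

Running both dig queries while an application is failing should show whether the NODATA answer is coming from dnsmasq's own config/cache or from SkyDNS.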
- For the external failure, we see nothing in the dnsmasq logs. There are plenty of entries around the time of the failure, but nothing for the failure itself. It's as if the pod never reached out to the node (the in-pod polling sketch after this list is how we plan to timestamp these failures for correlation).
- For the internal service failure, we do see a NODATA-IPv4 message in the dnsmasq logs. We have some logging options added to the node but cannot find any info in the journal. We'd like to find a way to get some visibility into what SkyDNS is doing; the "--loglevel=2 --logspec=dns*=8" options don't seem to give us what we are looking for.
- We do not see any evidence that dnsmasq is being overwhelmed, and the times of failure do not correlate with peak utilization times for the apps involved.
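To correlate the in-pod failures with the node-side dnsmasq/SkyDNS logs, here is a minimal poller that could be run inside an affected pod. It assumes the image provides getent and a date that understands %FT%T (worth checking in the actual base image); the service name is just the example from the logs above.

# run inside an affected pod; prints a timestamped line whenever resolution fails
while true; do
  if ! getent hosts some-other-service > /dev/null; then
    echo "$(date -u +%FT%TZ) lookup of some-other-service FAILED"
  fi
  sleep 1
done

The timestamped failure lines can then be matched against the dnsmasq query log on the node the pod is scheduled on.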
Note: The issue resolves itself within a few seconds to a minute or two, without further action, and the pod appears to keep running. I will upload a few attachments with examples in a private comment.
*** This bug has been marked as a duplicate of bug 1614331 ***
Ansible installer has had the min-port fix since 8/13: https://github.com/openshift/openshift-ansible/pull/9541
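For anyone verifying whether a given node already has that change: the fix raises dnsmasq's minimum outbound source port via dnsmasq's min-port option. The file name and value below are only examples; the authoritative ones come from the installer.

# check that the min-port setting landed on the node
grep -r 'min-port' /etc/dnsmasq.conf /etc/dnsmasq.d/
# expected output is a single line similar to:
#   /etc/dnsmasq.d/origin-dns.conf:min-port=1024

dnsmasq must be restarted for a newly added setting to take effect.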
*** This bug has been marked as a duplicate of bug 1614983 ***