Bug 1626248

Summary: intermittent short-lived DNS resolution failures for internal and external traffic
Product: OpenShift Container Platform
Component: Networking
Networking sub component: router
Version: 3.9.0
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Status: CLOSED DUPLICATE
Type: Bug
Reporter: Steven Walter <stwalter>
Assignee: Ben Bennett <bbennett>
QA Contact: zhaozhanqi <zzhao>
CC: aos-bugs, bbennett, dkaylor, dmace, jmalde, knakai, openshift-bugs-escalate, rhowe, scuppett, stwalter
Last Closed: 2018-10-05 17:12:25 UTC

Description Steven Walter 2018-09-06 20:59:14 UTC
Description of problem:
Customer is seeing intermittent UnknownHostException errors in multiple apps across multiple clusters.

Sep  6 12:12:01 node1.example.com ocp.dev2.container: { "docker": { "container_id": "587ae1d24d171b6eb73ef3b0970f9625eadf338eddba2f20e085b7cfa67a764c" }, "kubernetes": { "namespace_name": "app-b4ea56-1", "pod_name": "app-2-cvxz6", "labels": { "application": "myapp", "deploymentConfig": "app", "microservice": "app" }, "host": "node1.example.com", "container_name": "app" }, "hostname": "node1", "message": "org.springframework.web.client.ResourceAccessException: I\/O error on GET request for \"http:\/\/some-other-service:8080\/health\": some-other-service; nested exception is java.net.UnknownHostException: some-other-service", "@timestamp": "2018-09-06T12:12:01.193478-04:00" }



Version-Release number of selected component (if applicable):
3.9

Actual results:
We enabled dnsmasq query logging. For the above message we see:

Sep 06 12:12:01 node1 dnsmasq[14214]: query[A] some-other-service from 100.xx.xx.157
Sep 06 12:12:01 node1 dnsmasq[14214]: config some-other-service is NODATA-IPv4
Sep 06 12:12:01 node1 dnsmasq[14214]: query[AAAA] some-other-service from 100.xx.xx.157
Sep 06 12:12:01 node1 dnsmasq[14214]: config some-other-service is NODATA-IPv6
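
For reference, this is roughly how query logging was enabled. The drop-in name and log path below are illustrative; log-queries and log-facility are standard dnsmasq options:

# /etc/dnsmasq.d/query-logging.conf (illustrative drop-in name)
# log every query dnsmasq receives
log-queries
# optional: send query logs to a dedicated file instead of the journal
log-facility=/var/log/dnsmasq-queries.log

After adding the drop-in, dnsmasq has to be restarted (systemctl restart dnsmasq) for the options to take effect.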

We also see the same messages periodically when reaching out to external (non-OpenShift) services:

Sep  6 13:38:20 node2.example.com ocp.dev2.container: { "docker": { "container_id": "f258f0680dfa0e369a1cfca7387456736f5c8b41a510b4508fa0cfcbe89eb539" }, "kubernetes": { "namespace_name": "myproject", "pod_name": "mypod", "labels": { "application": "someapp", "deploymentConfig": "somedc", "microservice": "someapp" }, "host": "node2", "container_name": "mycontainer" }, "hostname": "node2", "message": "Caused by: java.net.UnknownHostException: SOMEOTHERHOST.text.example2.com", "@timestamp": "2018-09-06T13:38:20.059681-04:00" }

Strangely, dnsmasq query logs do NOT show anything about this query at this time!


We also tried enabling SkyDNS debug logging, but SkyDNS does not appear to log anything through atomic-openshift-node (possibly we are setting the flags wrong?):
root      50072      1 55 Sep05 ?        15:27:19 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2 --logspec=dns*=8
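
(For completeness, the flags are being passed through the node's sysconfig file; the path below is what a stock 3.9 RPM install uses and is only meant as a sketch of how we set them:)

# /etc/sysconfig/atomic-openshift-node
# extra verbosity for the embedded SkyDNS server, using the logspec pattern shown above
OPTIONS="--loglevel=2 --logspec=dns*=8"

followed by systemctl restart atomic-openshift-node.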


I can't rule out this being an issue with dnsmasq being overloaded or something similar, but that would not explain why dnsmasq logs a NODATA answer, which implies it checked its cache and upstream and did not find the record. (At least that's my understanding from https://umbrella.cisco.com/blog/2014/06/23/nxdomain-nodata-debugging-dns-dual-stacked-hosts/)
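
As a sanity check on that distinction: with dig, a NODATA answer shows up as status NOERROR with an empty answer section, while a truly missing name is reported as NXDOMAIN. The names below just reuse the ones from the logs and are illustrative:

# name exists but has no A record -> header shows "status: NOERROR" and "ANSWER: 0" (NODATA)
dig A some-other-service.app-b4ea56-1.svc.cluster.local

# name does not exist at all -> header shows "status: NXDOMAIN"
dig A no-such-host.text.example2.com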

Comment 1 Steven Walter 2018-09-06 21:00:21 UTC
- For the external failure, we see nothing in the dnsmasq logs. There are plenty of entries around the time of failure but nothing shows for the failure itself. It's as if the pod didn't reach out to the node.

- For the internal service failure, we do see a NODATA-IPv4 message in the dnsmasq logs. We have some logging options added to the node but cannot find any info in the journal. We'd like to find a way to get some visibility into what SkyDNS is doing. The "--loglevel=2 --logspec=dns*=8" options don't seem to give us what we are looking for.

- We do not see any evidence that dnsmasq is being overwhelmed and the times of failure do not correlate with peak utilization times for the apps involved.

Comment 2 Steven Walter 2018-09-06 21:01:46 UTC
Note: The issue resolves itself within a few seconds to a minute or two, without any intervention, and the pod appears to keep running. I will upload a few attachments with examples in a private comment.

Comment 20 Stephen Cuppett 2018-10-05 17:12:25 UTC

*** This bug has been marked as a duplicate of bug 1614331 ***

Comment 21 Stephen Cuppett 2018-10-05 17:26:52 UTC
Ansible installer has had the min-port fix since 8/13:

https://github.com/openshift/openshift-ansible/pull/9541
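
For anyone hitting this on older installs, the change boils down to dnsmasq's min-port option, which stops dnsmasq from using very low source ports for its upstream queries. A sketch of the resulting setting (the exact file and value are whatever the PR lays down; shown here only for illustration):

# /etc/dnsmasq.d/origin-dns.conf (per the PR; illustrative)
min-port=1024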

Comment 22 Stephen Cuppett 2018-10-05 17:27:26 UTC

*** This bug has been marked as a duplicate of bug 1614983 ***