Bug 1626248 - intermittent shortlived dns resolution failures for internal and external traffic
Summary: intermittent shortlived dns resolution failures for internal and external traffic
Keywords:
Status: CLOSED DUPLICATE of bug 1614983
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-06 20:59 UTC by Steven Walter
Modified: 2022-08-04 22:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-05 17:12:25 UTC
Target Upstream Version:



Description Steven Walter 2018-09-06 20:59:14 UTC
Description of problem:
The customer is seeing intermittent UnknownHostException errors in multiple apps on multiple clusters.

Sep  6 12:12:01 node1.example.com ocp.dev2.container: { "docker": { "container_id": "587ae1d24d171b6eb73ef3b0970f9625eadf338eddba2f20e085b7cfa67a764c" }, "kubernetes": { "namespace_name": "app-b4ea56-1", "pod_name": "app-2-cvxz6", "labels": { "application": "myapp", "deploymentConfig": "app", "microservice": "app" }, "host": "node1.example.com", "container_name": "app" }, "hostname": "node1", "message": "org.springframework.web.client.ResourceAccessException: I\/O error on GET request for \"http:\/\/some-other-service:8080\/health\": some-other-service; nested exception is java.net.UnknownHostException: some-other-service", "@timestamp": "2018-09-06T12:12:01.193478-04:00" }



Version-Release number of selected component (if applicable):
3.9

Actual results:
We enabled dnsmasq query logging. For the above message we see:

Sep 06 12:12:01 node1 dnsmasq[14214]: query[A] some-other-service from 100.xx.xx.157
Sep 06 12:12:01 node1 dnsmasq[14214]: config some-other-service is NODATA-IPv4
Sep 06 12:12:01 node1 dnsmasq[14214]: query[AAAA] some-other-service from 100.xx.xx.157
Sep 06 12:12:01 node1 dnsmasq[14214]: config some-other-service is NODATA-IPv6
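For anyone triaging similar logs, here is a minimal Python sketch that pulls the NODATA replies out of a dnsmasq query log. It assumes the journal line format shown in the excerpt above; the regular expression and the function name find_nodata are illustrative and not part of dnsmasq.

```python
import re

# Matches lines such as:
#   Sep 06 12:12:01 node1 dnsmasq[14214]: config some-other-service is NODATA-IPv4
# The pattern is an assumption derived from the log excerpt in this bug.
NODATA_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ dnsmasq\[\d+\]: "
    r"(?P<source>\S+) (?P<name>\S+) is (?P<answer>NODATA(?:-IPv[46])?)$"
)


def find_nodata(lines):
    """Return (timestamp, source, name, answer) for every NODATA reply line."""
    hits = []
    for line in lines:
        m = NODATA_RE.match(line.strip())
        if m:
            hits.append((m["ts"], m["source"], m["name"], m["answer"]))
    return hits
```

The source field ("config" in the lines above) is kept in the result because it shows where dnsmasq says the answer came from, which may help when deciding whether the reply was produced locally or forwarded.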

We also see the same messages periodically when reaching out to external (non-OpenShift) services:

Sep  6 13:38:20 node2.example.com ocp.dev2.container: { "docker": { "container_id": "f258f0680dfa0e369a1cfca7387456736f5c8b41a510b4508fa0cfcbe89eb539" }, "kubernetes": { "namespace_name": "myproject", "pod_name": "mypod", "labels": { "application": "someapp", "deploymentConfig": "somedc", "microservice": "someapp" }, "host": "node2", "container_name": "mycontainer" }, "hostname": "node2", "message": "Caused by: java.net.UnknownHostException: SOMEOTHERHOST.text.example2.com", "@timestamp": "2018-09-06T13:38:20.059681-04:00" }

Strangely, dnsmasq query logs do NOT show anything about this query at this time!
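To double-check that a query really is missing from the dnsmasq log, the application error timestamp can be compared with the logged queries. The following is a rough sketch under stated assumptions: the log format matches the excerpts in this bug, the journal lines carry no year (so one is passed in), and month names parse under an English locale. The helper name queries_near and the default 5-second window are hypothetical choices. Names that extend the requested name with a further dotted suffix are also accepted, on the assumption that the pod's resolver may append search domains.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   Sep 06 12:12:01 node1 dnsmasq[14214]: query[A] some-other-service from 100.xx.xx.157
QUERY_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ dnsmasq\[\d+\]: "
    r"query\[(?P<qtype>\w+)\] (?P<name>\S+) from (?P<client>\S+)$"
)


def queries_near(lines, name, when, window_seconds=5, year=2018):
    """Return (timestamp, qtype, client) for queries for `name` close to `when`."""
    window = timedelta(seconds=window_seconds)
    wanted = name.lower()
    matches = []
    for line in lines:
        m = QUERY_RE.match(line.strip())
        if not m:
            continue
        queried = m["name"].lower()
        # Accept the bare name or the name with a search domain appended.
        if queried != wanted and not queried.startswith(wanted + "."):
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if abs(ts - when) <= window:
            matches.append((ts, m["qtype"], m["client"]))
    return matches
```

An empty result for the external hostname around 13:38:20 would be consistent with the observation above that the query never reached dnsmasq.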


We also tried enabling SkyDNS debug logging, but SkyDNS doesn't seem to log anything under atomic-openshift-node (possibly we're setting the flag wrong?):
root      50072      1 55 Sep05 ?        15:27:19 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2 --logspec=dns*=8


I can't rule out dnsmasq being overloaded or something similar, but that wouldn't explain why dnsmasq is able to log a NODATA error, which implies it searched its cache and upstream and did not find the record. (At least that's my understanding from https://umbrella.cisco.com/blog/2014/06/23/nxdomain-nodata-debugging-dns-dual-stacked-hosts/)

Comment 1 Steven Walter 2018-09-06 21:00:21 UTC
- For the external failure, we see nothing in the dnsmasq logs. There are plenty of entries around the time of failure but nothing shows for the failure itself. It's as if the pod didn't reach out to the node.

- For the internal service failure, we do see a NODATA-IPv4 message in the dnsmasq logs. We have some logging options added to the node but cannot find any info in the journal. We'd like to find a way to get some visibility into what SkyDNS is doing. The "--loglevel=2 --logspec=dns*=8" options don't seem to give us what we are looking for.

- We do not see any evidence that dnsmasq is being overwhelmed, and the times of failure do not correlate with peak utilization times for the apps involved.

Comment 2 Steven Walter 2018-09-06 21:01:46 UTC
Note: The issue resolves itself within a few seconds to a minute or two, without further action, and the pod appears to keep running. I will upload a few attachments with examples in a private comment.
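Since the failures are short-lived, one way to measure how long each one lasts is to resolve the name in a loop from inside an affected pod and record every attempt. This is only a sketch: probe and longest_failure_run are hypothetical helper names, the attempt count and interval are arbitrary, and the resolver, sleep and clock are parameters so the loop can be exercised without a network.

```python
import socket
import time


def probe(hostname, attempts=60, interval=1.0,
          resolve=socket.getaddrinfo, sleep=time.sleep, clock=time.time):
    """Resolve `hostname` repeatedly; return a list of (start_time, ok) tuples."""
    results = []
    for i in range(attempts):
        started = clock()
        try:
            resolve(hostname, None)
            ok = True
        except socket.gaierror:
            # Name resolution failed (what Java reports as UnknownHostException).
            ok = False
        results.append((started, ok))
        if i < attempts - 1:
            sleep(interval)
    return results


def longest_failure_run(results):
    """Return the largest number of consecutive failed attempts."""
    longest = current = 0
    for _, ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest
```

Multiplying the longest failure run by the interval gives a rough lower bound on the outage duration, and the recorded start times can be lined up against the dnsmasq query log.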

Comment 20 Stephen Cuppett 2018-10-05 17:12:25 UTC

*** This bug has been marked as a duplicate of bug 1614331 ***

Comment 21 Stephen Cuppett 2018-10-05 17:26:52 UTC
Ansible installer has had the min-port fix since 8/13:

https://github.com/openshift/openshift-ansible/pull/9541
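dnsmasq has a min-port option that sets the lowest source port it will use for outbound queries. To see whether a node's dnsmasq configuration already sets it, a small check like the one below may help. It is a sketch only: the helper name find_min_port is hypothetical, the contents of the linked pull request are not quoted in this bug, and the path of the config file on a given node is left to the reader.

```python
def find_min_port(config_text):
    """Return the last min-port value set in dnsmasq config text, or None."""
    value = None
    for raw in config_text.splitlines():
        # Drop comments and surrounding whitespace before inspecting the line.
        line = raw.split("#", 1)[0].strip()
        key, sep, val = line.partition("=")
        if key.strip() == "min-port" and sep and val.strip().isdigit():
            value = int(val.strip())
    return value
```

A result of None means the text does not set min-port explicitly, in which case whatever default the installed dnsmasq version uses applies.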

Comment 22 Stephen Cuppett 2018-10-05 17:27:26 UTC

*** This bug has been marked as a duplicate of bug 1614983 ***

