Bug 1459630

Summary:	OpenShift Dedicated outgoing pod DNS requests failing
Product:	OpenShift Container Platform	Reporter:	Eric Jones <erjones>
Component:	Networking	Assignee:	Ben Bennett <bbennett>
Status:	CLOSED NOTABUG	QA Contact:	Meng Bo <bmeng>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	3.4.1	CC:	aos-bugs, erjones
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-06-09 18:12:02 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Eric Jones 2017-06-07 15:51:48 UTC

Description of problem:
Our app, retail-locator-production, connects to PostgreSQL at [0]. This has been working (correctly resolving to an IP address, currently 172.16.70.146) for months. However, beginning May 25th, DNS lookups for that hostname began failing at a rate of about 2000 per day. Other hostnames, like [1], have also begun failing, but not at nearly the same rate as [0].

The actual curl command will always fail since that's not an HTTP endpoint, but you can easily see whether the DNS hostname lookup fails or succeeds. That's the only important part. It should reliably and quickly resolve to an IP address in 172.16.0.0/12. If it fails to resolve or the resolution takes more than a few milliseconds, something is broken. Outside of OpenShift, DNS resolution of those hostnames works reliably and quickly.
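
Since only the lookup matters here, it can also be timed on its own, without curl. The following is a minimal sketch, assuming Python 3 is available inside the pod; the helper names `time_lookup` and `in_expected_range` are hypothetical and exist only for illustration, and the lookup goes through the pod's normal resolver via `socket.getaddrinfo`.

```python
import ipaddress
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once; return (elapsed_seconds, addresses, error)."""
    start = time.monotonic()
    try:
        infos = resolver(hostname, None)
        # Each getaddrinfo entry ends with a sockaddr tuple; element 0 is the IP.
        addresses = sorted({info[4][0] for info in infos})
        error = None
    except socket.gaierror as exc:
        # Resolution failed (for example, name not found or resolver timeout).
        addresses = []
        error = str(exc)
    elapsed = time.monotonic() - start
    return elapsed, addresses, error


def in_expected_range(address, network="172.16.0.0/12"):
    """Return True if address falls inside the network named in the report."""
    try:
        return ipaddress.ip_address(address) in ipaddress.ip_network(network)
    except ValueError:
        # Not a parseable IP address.
        return False
```

Calling `time_lookup("<hostname>")` from inside the pod returns how long the lookup took, the addresses it produced, and the resolver error text if it failed. A slow or failed lookup, or an address for which `in_expected_range` returns False, would match the broken behavior described above.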

Version-Release number of selected component (if applicable):
3.4.1.18

How reproducible:
Customer indicates it is very reproducible

Steps to Reproduce:
1. oc rsh <POD>
2. curl -v http://[0]/
   or
   curl -v http://[1]/

Actual results:
Failure that takes a long time

Expected results:
very rapid failure or success

Additional info:
I removed the hostnames from the description to keep it generic. I will add them as [0] and [1] in a private comment.

Comment 2 Ben Bennett 2017-06-07 15:56:54 UTC

Is it failing from the node, or from pods?

Can I get /etc/resolv.conf and /etc/nsswitch.conf, and the output from:

  dig retail-locator-prod.cx4w7ilrr6c0.us-east-1.rds.amazonaws.com

Comment 3 Ben Bennett 2017-06-09 18:12:02 UTC

Ok.  This was caused by the cron job we put in to clear the ARP cache every minute.  That was causing simple curl commands from a node to a remote pod to take more than 40 seconds after the cache was flushed.  That's weird and still merits investigation, which will happen on https://bugzilla.redhat.com/show_bug.cgi?id=1451854 where we are looking at the other ARP issues.

However, given that this was due to the cron job, and that has now been removed, I'm closing this.
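
For anyone checking whether slow requests line up with a once-a-minute job like the one described above, here is a minimal sketch. It assumes timing samples have already been collected as (epoch_seconds, elapsed_seconds) pairs; the function name `slow_samples_by_second` and the 40-second default threshold are illustrative assumptions, not part of any tooling used on this bug.

```python
from collections import Counter


def slow_samples_by_second(samples, threshold=40.0):
    """Count slow samples by the second-of-minute at which they started.

    samples: iterable of (epoch_seconds, elapsed_seconds) pairs.
    Returns a Counter mapping second-of-minute (0-59) to the number of
    samples whose elapsed time exceeded threshold.
    """
    counts = Counter()
    for started_at, elapsed in samples:
        if elapsed > threshold:
            counts[int(started_at) % 60] += 1
    return counts
```

If the slow samples cluster in the first second or two of each minute, that points at something running on a per-minute schedule, such as a cron job. An even spread across the minute would suggest a different cause.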