Bug 1459630 - OpenShift Dedicated outgoing pod DNS requests failing [NEEDINFO]
Status: CLOSED NOTABUG
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Ben Bennett
QA Contact: Meng Bo
Depends On:
Blocks:
Reported: 2017-06-07 11:51 EDT by Eric Jones
Modified: 2017-06-09 14:12 EDT (History)

Last Closed: 2017-06-09 14:12:02 EDT
Type: Bug
bbennett: needinfo? (erjones)


Attachments: None
Description Eric Jones 2017-06-07 11:51:48 EDT
Description of problem:
Our app, retail-locator-production, connects to PostgreSQL at [0]. This has been working (correctly resolving to an IP address, currently 172.16.70.146) for months. However, beginning May 25th, DNS lookups for that hostname began failing at a rate of about 2000 per day. Other hostnames, like [1], have also begun failing, but not at nearly the same rate as [0].

The actual curl command will always fail since that's not an HTTP endpoint, but you can easily see whether the DNS hostname lookup fails or succeeds. That's the only important part. It should reliably and quickly resolve to an IP address in 172.16.0.0/12. If it fails to resolve or the resolution takes more than a few milliseconds, something is broken. Outside of OpenShift, DNS resolution of those hostnames works reliably and quickly.
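
For anyone who wants to check the lookup on its own, without curl's connection attempt in the way, the following is a minimal sketch (not part of the original report) that times a lookup through the system resolver and checks the result against 172.16.0.0/12. The function names and the 0.1 s slowness threshold are assumptions chosen for illustration; the report only says "more than a few milliseconds". It also assumes Python 3 is available wherever it is run.

```python
import ipaddress
import socket
import sys
import time

# Range the report says the hostnames should resolve into.
EXPECTED_NET = ipaddress.ip_network("172.16.0.0/12")
# Hypothetical threshold; the report only says "more than a few milliseconds".
SLOW_THRESHOLD_S = 0.1


def classify(addresses, elapsed_s, net=EXPECTED_NET, slow_s=SLOW_THRESHOLD_S):
    """Label one lookup result as FAILED, UNEXPECTED_ADDRESS, SLOW, or OK."""
    if not addresses:
        return "FAILED"
    if not all(ipaddress.ip_address(a) in net for a in addresses):
        return "UNEXPECTED_ADDRESS"
    if elapsed_s > slow_s:
        return "SLOW"
    return "OK"


def timed_lookup(hostname):
    """Resolve hostname to IPv4 addresses via the system resolver and time it."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # Resolution failed; report it as an empty result.
        addresses = []
    return addresses, time.monotonic() - start


if __name__ == "__main__":
    # Hostnames are passed on the command line, e.g. the redacted [0] and [1].
    for host in sys.argv[1:]:
        addrs, elapsed = timed_lookup(host)
        print(f"{host}: {classify(addrs, elapsed)} {addrs} {elapsed * 1000:.1f} ms")
```

Running it in a loop from inside a pod and again from the node would show whether failures or slow lookups cluster in one place.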

Version-Release number of selected component (if applicable):
3.4.1.18

How reproducible:
Customer indicates it is very reproducible.

Steps to Reproduce:
1. oc rsh <POD>
2. curl -v http://[0]/
   or
   curl -v http://[1]/

Actual results:
Failure that takes a long time

Expected results:
Very rapid failure or success

Additional info:
I removed the hostnames from the description to keep it generic. I will add them as [0] and [1] in a private comment.
Comment 2 Ben Bennett 2017-06-07 11:56:54 EDT
Is it failing from the node, or from pods?

Can I get /etc/resolv.conf and /etc/nsswitch.conf, and the output from:

  dig retail-locator-prod.cx4w7ilrr6c0.us-east-1.rds.amazonaws.com
Comment 3 Ben Bennett 2017-06-09 14:12:02 EDT
OK. This was caused by the cron job we put in to clear the ARP cache every minute. That was causing simple curl commands from a node to a remote pod to take > 40s after the cache was flushed. That's weird and still merits investigation, which will happen on https://bugzilla.redhat.com/show_bug.cgi?id=1451854, where we are looking at the other ARP issues.

However, given that this was due to the cronjob, and that has now been removed, I'm closing this.
