Bug 1459630 - OpenShift Dedicated outgoing pod DNS requests failing
Summary: OpenShift Dedicated outgoing pod DNS requests failing
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-07 15:51 UTC by Eric Jones
Modified: 2018-06-26 14:33 UTC
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-09 18:12:02 UTC
Target Upstream Version:
Embargoed:



Description Eric Jones 2017-06-07 15:51:48 UTC
Description of problem:
Our app, retail-locator-production, connects to PostgreSQL at [0]. This has been working (correctly resolving to an IP address, currently 172.16.70.146) for months. However, beginning May 25th, DNS lookups for that hostname began failing at a rate of about 2,000 per day. Other hostnames, such as [1], have also begun failing, but at nowhere near the rate of [0].

The actual curl command will always fail, since that's not an HTTP endpoint, but you can easily see whether the DNS hostname lookup fails or succeeds; that's the only important part. It should reliably and quickly resolve to an IP address in 172.16.0.0/12. If it fails to resolve, or resolution takes more than a few milliseconds, something is broken. Outside of OpenShift, DNS resolution of those hostnames works reliably and quickly.
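The resolution check described above can be sketched as a small script run from inside the pod (via `oc rsh`). This is a minimal illustration, not part of the original report: `HOST` is a placeholder for hostname [0], and `getent hosts` is used instead of curl so that only name resolution is exercised.

```shell
#!/bin/sh
# Time a DNS lookup and report whether it resolved.
# HOST is a placeholder; substitute the real RDS hostname ([0]).
HOST="${HOST:-localhost}"

start=$(date +%s%N)
addr=$(getent hosts "$HOST" | awk '{print $1; exit}')
end=$(date +%s%N)
elapsed_ms=$(( (end - start) / 1000000 ))

if [ -n "$addr" ]; then
  echo "resolved $HOST -> $addr in ${elapsed_ms}ms"
else
  echo "FAILED to resolve $HOST after ${elapsed_ms}ms"
fi
```

Per the description, a healthy result resolves to an address in 172.16.0.0/12 within a few milliseconds; anything slower, or a failure, reproduces the symptom.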

Version-Release number of selected component (if applicable):
3.4.1.18

How reproducible:
Customer indicates it is very reproducible

Steps to Reproduce:
1. oc rsh <POD>
2. curl -v http://[0]/
   or
   curl -v http://[1]/

Actual results:
Failure, but only after a long delay

Expected results:
Very rapid failure or success

Additional info:
I removed the hostnames from the description to keep it generic; I will add them as [0] and [1] in a private comment.

Comment 2 Ben Bennett 2017-06-07 15:56:54 UTC
Is it failing from the node, or from pods?

Can I get /etc/resolv.conf and /etc/nsswitch.conf, and the output from:

  dig retail-locator-prod.cx4w7ilrr6c0.us-east-1.rds.amazonaws.com

Comment 3 Ben Bennett 2017-06-09 18:12:02 UTC
Ok.  This was caused by the cron job we put in to clear the ARP cache every minute.  That was causing simple curl commands from a node to a remote pod to take > 40s after the cache was flushed.  That's weird and still merits investigation; that will happen on https://bugzilla.redhat.com/show_bug.cgi?id=1451854, where we are looking at the other ARP issues.

However, given that this was due to the cron job, and that it has now been removed, I'm closing this.
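The neighbour-cache behaviour behind this diagnosis can be observed on a node with iproute2's `ip neigh`. This is a sketch; the exact contents of the cron job are not shown in this bug, but `ip neigh flush all` is the standard command for clearing the cache the way the job reportedly did.

```shell
#!/bin/sh
# Count current ARP/neighbour cache entries on the node.
# Right after a flush this count drops to (near) zero, and the next packet
# to each remote pod must wait for ARP resolution again.
entries_before=$(ip neigh show 2>/dev/null | wc -l)
echo "neighbour cache entries: $entries_before"

# The flush itself needs root and is destructive, so it is left commented out:
# ip neigh flush all
```

A per-minute flush means the first connection to a pod after each flush pays the ARP-resolution cost, which matches the long stalls seen from `curl`.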

