Description of problem:
Our app, retail-locator-production, connects to PostgreSQL at [0]. This has been working (correctly resolving to an IP address, currently 172.16.70.146) for months. However, beginning May 25th, DNS for that hostname began failing at a rate of about 2,000 failures per day. Other hostnames, like [1], have also begun failing, but at nowhere near the same rate as [0]. The curl command itself will always fail, since that's not an HTTP endpoint, but it makes it easy to see whether the DNS hostname lookup fails or succeeds; that's the only important part. The hostname should reliably and quickly resolve to an IP address in 172.16.0.0/12. If it fails to resolve, or resolution takes more than a few milliseconds, something is broken. Outside of OpenShift, DNS resolution of those hostnames works reliably and quickly.

Version-Release number of selected component (if applicable):
3.4.1.18

How reproducible:
Customer indicates very reproducible.

Steps to Reproduce:
1. oc rsh <POD>
2. curl -v http://[0]/
or
2. curl -v http://[1]/

Actual results:
Failure that takes a long time.

Expected results:
Very rapid failure or success.

Additional info:
I removed the hostnames from the description to keep it generic; I will add them as [0] and [1] in a private comment.
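Since curl conflates the DNS lookup with the (always-failing) HTTP request, a small script can isolate just the resolution step and time it. This is a minimal sketch, not part of the original report; the target defaults to localhost here only so it runs anywhere, and in practice you would pass the redacted hostname ([0] or [1]) as the argument from inside the pod:

```shell
#!/bin/sh
# Time a name lookup through the same NSS path (nsswitch.conf/resolv.conf)
# that curl uses. TARGET is a placeholder for the redacted RDS hostname.
TARGET="${1:-localhost}"

start_ms=$(($(date +%s%N) / 1000000))
ip=$(getent hosts "$TARGET" | awk '{print $1; exit}')
end_ms=$(($(date +%s%N) / 1000000))

if [ -n "$ip" ]; then
    echo "OK: $TARGET -> $ip ($((end_ms - start_ms)) ms)"
else
    echo "FAIL: $TARGET did not resolve ($((end_ms - start_ms)) ms)"
fi
```

Per the report's own criterion, anything other than an OK line with a private-range address in a few milliseconds indicates the problem.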
Is it failing from the node, or from pods? Can I get /etc/resolv.conf and /etc/nsswitch.conf, and the output from:

dig retail-locator-prod.cx4w7ilrr6c0.us-east-1.rds.amazonaws.com
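A sketch of collecting all of the requested diagnostics in one pass; run it both directly on the node and inside the pod (via oc rsh <POD>) to localize where resolution breaks. The dig fallback and timeout flags are my additions, since minimal pod images often lack dig:

```shell
#!/bin/sh
# Dump the resolver configuration that NSS consults.
for f in /etc/resolv.conf /etc/nsswitch.conf; do
    echo "== $f =="
    cat "$f"
done

# Query the name directly; fall back to getent if dig is not installed.
# +time/+tries keep the script from stalling for long on a dead resolver.
if command -v dig >/dev/null 2>&1; then
    dig +time=2 +tries=1 retail-locator-prod.cx4w7ilrr6c0.us-east-1.rds.amazonaws.com
else
    getent hosts retail-locator-prod.cx4w7ilrr6c0.us-east-1.rds.amazonaws.com || true
fi
```

Comparing the node's output with the pod's shows whether the failure is in the cluster DNS path or on the node itself.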
Ok. This was caused by the cron job we put in to clear the ARP cache every minute. That was causing simple curl commands from a node to a remote pod to take > 40s after the cache was flushed. That's weird and still merits investigation, which will happen on https://bugzilla.redhat.com/show_bug.cgi?id=1451854, where we are looking at the other ARP issues. However, given that this was due to the cron job, and that it has now been removed, I'm closing this.
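The cron job itself is not shown in this bug, so the following is a hypothetical reconstruction, assuming it used iproute2's `ip neigh flush`; the second part just inspects the current neighbor (ARP) table, which is how you would watch entries being rebuilt after a flush:

```shell
#!/bin/sh
# Hypothetical reconstruction of the problematic root crontab entry
# (not shown in the bug) -- flush the ARP/neighbor cache every minute:
#   * * * * * ip neigh flush all
#
# After each flush, every peer must be re-resolved via ARP, which is the
# likely source of the >40s stalls. Inspect the neighbor table to watch
# entries cycle through INCOMPLETE/REACHABLE/STALE after a flush:
if command -v ip >/dev/null 2>&1; then
    ip -s neigh show
else
    # Fallback readable on any Linux kernel.
    cat /proc/net/arp
fi
```

Running the inspection command before and after a manual flush (as root) would reproduce the stall behavior referenced in bug 1451854.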