Description of problem:
The service IP for the registry intermittently fails; we see this in our cluster registry health check. It is happening in our 'preview' production cluster (aka dev-preview). When we hit the individual pod IP addresses behind the service, they always seem to work. We had a conference call with Dan Williams a couple of weeks ago and gave him a set of logs that he said he would analyze and get back to us; this bug tracks that effort.

Here is the output from when the service IP works and then fails:

[root@preview-master-afbb8 ~]# curl --head https://172.30.47.227:5000
HTTP/1.1 200 OK
Cache-Control: no-cache
Date: Thu, 18 Aug 2016 14:41:52 GMT
Content-Type: text/plain; charset=utf-8

[root@preview-master-afbb8 ~]# curl --head https://172.30.47.227:5000
curl: (7) Failed connect to 172.30.47.227:5000; No route to host
[root@preview-master-afbb8 ~]#

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-master-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-clients-3.2.1.15-1.git.8.c402626.el7.x86_64
tuned-profiles-atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-sdn-ovs-3.2.1.15-1.git.8.c402626.el7.x86_64
atomic-openshift-node-3.2.1.15-1.git.8.c402626.el7.x86_64

How reproducible:
Very sporadic, but our health check shows it happening on a regular basis.

Steps to Reproduce:
1. Unknown; it's sporadic, but we see it when running the curl command above.

Actual results:
We are sporadically unable to connect to the registry using the service IP, but can connect using the individual pod IPs. This might be load related, as preview prod is one of our biggest / busiest clusters.

Expected results:
We should always be able to connect to the registry using the service IP.

Additional info:
Even though this bug specifically mentions the registry, this might be a general issue with the kube proxy.
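For reference, a rough sketch of the kind of check we run (not our exact health check): it curls the service IP and each registry pod IP in a loop and logs any attempt that fails, so a service-IP failure can be distinguished from a pod failure. The pod IPs below are placeholders; substitute the values from "oc get pods -o wide -n default".

SERVICE_IP=172.30.47.227
POD_IPS="10.1.2.3 10.1.4.5"        # placeholder pod IPs, not real values

while true; do
    for ip in $SERVICE_IP $POD_IPS; do
        # --max-time keeps a hung connect from stalling the whole loop
        if ! curl --head --silent --max-time 5 https://$ip:5000 >/dev/null; then
            echo "$(date -u +%FT%TZ) FAIL $ip"
        fi
    done
    sleep 10
done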
On 29 Aug we changed the openshift-node iptablesSyncPeriod from 5s (the shipped default) to 300s; since that change we have not been able to reliably reproduce this issue.
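For anyone else wanting to try the same workaround, a sketch of the change on each node, assuming the stock 3.2 enterprise paths and service name (/etc/origin/node/node-config.yaml, atomic-openshift-node); adjust if your layout differs:

# grep iptablesSyncPeriod /etc/origin/node/node-config.yaml
iptablesSyncPeriod: "300s"
# systemctl restart atomic-openshift-node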
Sounds like a combination of:
https://bugzilla.redhat.com/show_bug.cgi?id=1367199
https://bugzilla.redhat.com/show_bug.cgi?id=1362661
It would be worthwhile to see how iptables-restore is spending that time, determine whether there are hotspots in the code, and see whether anything in the iptables-restore process could be made more efficient. Flame graphs (http://www.brendangregg.com/flamegraphs.html) give a diagram of the stack backtraces from "perf record" data, showing which functions (and their children) the processor spends its time in. To get user-space function names in the analysis, install the debuginfo for iptables-restore before running the experiments:

# debuginfo-install iptables
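As a rough sketch of the capture, assuming perf and the stackcollapse-perf.pl / flamegraph.pl scripts from the FlameGraph repository (https://github.com/brendangregg/FlameGraph) are available on the node; the sampling window and output names are only illustrative. Sampling system-wide avoids having to race the short-lived iptables-restore process:

# perf record -a -g -- sleep 60
# perf script > out.perf
# ./stackcollapse-perf.pl out.perf > out.folded
# ./flamegraph.pl out.folded > iptables-restore.svg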
I'm not sure what else OpenShift networking can do here right now, given that we have a fix to decrease the contention (installer defaults in 1367199) and issues in the kernel too (1362661). Should I dupe this issue to one of those, re-assign to iptables, or close?
IMHO this is a dupe, so let's close this one.
*** This bug has been marked as a duplicate of bug 1362661 ***